This blog was written by Prassanna Ganesh Ravishankar, Senior Machine Learning Software Engineer at Papercup.
“In the beginning was the code” (nice TED talk here).
Way back in the 1980s, the client-server programming paradigm came into being. This broke with the long-standing model of a monolithic application or tool shipped to users with all its logic packaged together: applications now had to be decomposed into “backend” and “frontend” logic.
Fast-forward to the late 1990s and early 2000s: Web 1.0 rose, and standards started emerging. The earliest websites were static, but we slowly started seeing dynamic web content with rich JavaScript. Around this time, we see the earliest appearances of “The Stack” (such as the LAMP stack) – a suite of languages and tools serving different pieces of a website (database, backend, and frontend). Even though we had a full stack here, developers at this time were simply called Web Developers.
Slowly, we see the arrival of two-way communication and rich, collaborative web content – social networks, comments on blogs, and tagging of information. These became the hallmarks of Web 2.0, which ushered in a new generation of standards: HTML5, CSS3, and ever-growing JavaScript specifications. Now we’re in the late 2000s.
This surfaced a new kind of application called “Web Apps,” which slowly started replacing desktop applications. With this came different flavours of the stack – Ruby on Rails, Laravel, Python Django, LAMP, XAMPP, and more! This meant more fragmentation in the ecosystem and several “communities” of developers.
This was when the Full Stack Developer emerged. Instead of hiring separate backend, frontend, and database developers, organizations started focussing on hiring full-stack developers with experience across the particular tech stack the organization had built up. In the early days of Web 2.0, most web applications were simply hosted on rented Linux boxes at a hosting provider (like GoDaddy).
In parallel, two things happened: JavaScript everywhere, and the cloud. We had the materialization of the “JavaScript” stack – JavaScript (NodeJS) on the backend, JavaScript on the frontend. Simultaneously, AWS just blew up (The deceptively simple origins of AWS).
The new full-stack developer beyond 2015 started looking like this.
Let’s shift laterally to talk about Machine Learning. Contrary to popular belief, Machine Learning is old – a very old field, older than the web itself. The term was coined in 1959 in a paper titled Some Studies in Machine Learning Using the Game of Checkers.
The community made systematic strides forward – with the invention of the first neural network soon after, around 1960, and in physical hardware at that. Since this was also around the birth of computing, most of the community was focused on algorithms, such as A* and breadth-first search. The focus of this artificial-intelligence phase was not on data but on efficient solutions to known problems. Meanwhile, the “learn from data” paradigm was inching forward – with the creation of backpropagation, borrowed from control theory (Gradient Theory of Optimal Flight Paths). This resulted in the first practical neural network – LeNet (Backpropagation Applied to Handwritten Zip Code Recognition). The neural network community continued to move, mostly laterally instead of forward, and slowly at that, because soon enough they hit compute limits. ConvNets also suffered from AT&T breaking up (Yann LeCun’s rant on ConvNets being snatched away from him).
While neural networks slowed down, “machine learning” as a term started gaining popularity. The focus went back to fundamental statistics: hyperplanes (SVMs), Bayesian approaches, distributions, and graphical models. What could not be achieved in an end-to-end system such as a neural network was now achieved by handcrafting features. Feature detection and extraction became commonplace terms, with algorithms such as HOG, LBP, SIFT, and many more being published. A lot of emphasis was placed on “understanding data” and improving the amount of information that could be extracted from raw data. However, this approach was handcrafted for specific purposes – the extraction of information was not guided by the distribution of the data. Proxy methods such as bag of words materialized, wherein many of these handcrafted features were extracted and a clustering-based mechanism acted as an adaptation to a given dataset. This phase also created a distinct separation between the feature extraction and machine learning phases, and a distinct pipeline emerged. One drawback of this generation was that the machine learning algorithms did not scale with data: as the quantity of data being collected increased, these pipelines plateaued, unable to “absorb” large-scale data.
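The bag-of-words pipeline from this era can be sketched in a few lines. This is a toy illustration, not a real HOG/SIFT implementation: random vectors stand in for image descriptors, and a bare-bones k-means builds the “visual vocabulary”.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=20):
    """Plain k-means: builds the 'visual vocabulary' of cluster centres."""
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each descriptor to its nearest centre, then recompute centres.
        labels = np.argmin(((X[:, None] - centres[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centres[j] = X[labels == j].mean(axis=0)
    return centres

def bow_histogram(descriptors, centres):
    """Quantize descriptors against the vocabulary and histogram the counts."""
    labels = np.argmin(((descriptors[:, None] - centres[None]) ** 2).sum(-1), axis=1)
    hist = np.bincount(labels, minlength=len(centres)).astype(float)
    return hist / hist.sum()  # normalized, fixed-length feature vector

# Stand-in for handcrafted descriptors (e.g. 128-D SIFT) pooled from many images.
all_descriptors = rng.normal(size=(500, 8))
vocab = kmeans(all_descriptors, k=16)

# One image's descriptors -> one fixed-length vector a classifier (e.g. an SVM) can consume.
image_descriptors = rng.normal(size=(40, 8))
feature = bow_histogram(image_descriptors, vocab)
print(feature.shape)  # (16,)
```

Note the distinct phases: feature extraction, vocabulary building, and only then the classifier – exactly the separation the text describes, with the clustering step being the only part that adapts to the dataset.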
In the background, Python was created in the early 90s and went on to gain popularity in the late 2000s.
It all started in 2012. Three essential ingredients were in place. One – a large-scale dataset, ImageNet, created by Fei-Fei Li and co. in 2009. Two – the ConvNet patent had lapsed in 2007, so future iterations of convolutional networks could be developed freely. Three – gaming had picked up, with demanding titles like Crysis and Skyrim driving powerful GPUs. What happened as a result was the organic combination of these raw materials. AlexNet came into being, essentially summarized as: “Let’s take a convolutional network and make it deeper because now we have more compute, and let’s train it on a supermassive dataset because now we have ImageNet.” It achieved amazing results on ImageNet and ushered in a new generation – which took a while to pick up because of the cost of computing. What this new deep learning generation achieved was the ability to scale a model’s performance with large data.
In parallel, in the rest of the computing world, several things were happening.
This new phase, where we are right now, is all about leveraging the cloud for incredible applications and machine learning models. This means that machine learning models have to be inherently distributable. It triggered the creation of frameworks such as Horovod and Spark, and PyTorch natively started supporting distributed training. These frameworks also meant that the fixed cost for new companies dropped dramatically as more of the cloud was adopted. Models that might have required five weeks of training on one big GPU now require just one day on ~50 GPUs. These constructs dramatically reduced end-to-end training time, which translated into more iterations of models in production. Paradigms such as data-centric AI arose, putting the focus back on data and the movement of data. Likewise, cloud-native machine learning platforms were born – such as SageMaker and Databricks.
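The core trick behind these distributed-training frameworks – data parallelism with gradient averaging – is easy to sketch without any framework. In this toy version (a linear model in plain NumPy), `np.array_split` stands in for sharding a batch across real workers and `np.mean` stands in for the all-reduce; it illustrates the idea, not Horovod’s or PyTorch’s actual API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression problem: recover true_w from noisy observations y = X @ true_w + noise.
true_w = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(64, 3))
y = X @ true_w + 0.1 * rng.normal(size=64)

def local_gradient(w, X_shard, y_shard):
    """Mean-squared-error gradient computed on one worker's shard of the batch."""
    err = X_shard @ w - y_shard
    return 2.0 * X_shard.T @ err / len(y_shard)

n_workers = 4
w = np.zeros(3)
for step in range(100):
    shards = zip(np.array_split(X, n_workers), np.array_split(y, n_workers))
    # Each "worker" computes a gradient on its own shard of the batch...
    grads = [local_gradient(w, Xs, ys) for Xs, ys in shards]
    # ...and an all-reduce averages them, so every replica applies the same update.
    w -= 0.1 * np.mean(grads, axis=0)

print(np.round(w, 2))
```

Because every replica applies the identical averaged update, the model copies stay in sync – which is why training N-times faster on N workers is (communication aside) equivalent to one big-batch worker.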
A notable development was also happening in deep learning research: machine learning research became more active in companies than in academia.
Finally, we come to the purpose of this blog post: describing the full-stack machine learning person. As machine learning evolved in tandem with large-scale web infrastructure, it started leveraging the cloud and beyond. Likewise, since most machine learning today happens in industry, a lot of practical machine learning is driven by product and, therefore, value generation. This also means that the end goal is not a research paper. Hence, the code that backs a machine learning model has to be optimized, maintainable, scalable, and deployed as (or into) a product – creating the need for a skill set that is “full stack” in ML: well-versed across the stack.
Tangentially, a lot of machine learning now happens on IoT devices (such as Alexa and Nest), which means there is a need for a distinct optimization (or quantization) phase to make ML inference cheap and fast. As a product matures and processes become more fixed, data pipelines may be built to automate as many processes as possible.
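As an illustration of what that optimization phase does, here is a minimal sketch of post-training affine (scale + zero-point) int8 quantization – a toy, per-tensor version of what toolchains like TensorFlow Lite or ONNX Runtime perform, without the calibration and per-channel ranges real tools use.

```python
import numpy as np

def quantize_int8(x):
    """Affine quantization: map a float tensor onto the int8 range [-128, 127]."""
    scale = (x.max() - x.min()) / 255.0          # float step per int8 step
    zero_point = np.round(-x.min() / scale) - 128  # int8 value representing 0.0-ish
    q = np.clip(np.round(x / scale + zero_point), -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map int8 values back to approximate floats for comparison."""
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.default_rng(0).normal(size=1000).astype(np.float32)
q, scale, zp = quantize_int8(weights)
recovered = dequantize(q, scale, zp)
max_err = np.abs(weights - recovered).max()
print(q.nbytes, weights.nbytes)  # 1000 vs 4000 bytes: 4x smaller
```

The 4x size reduction (and the cheaper integer arithmetic that comes with it) is what makes inference viable on small devices, at the cost of a bounded rounding error of at most one quantization step per weight.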
Today’s full-stack ML engineer works across the stack and can take a project from ideation all the way to deployment. In today’s complex cloud infrastructure world, this means getting down and dirty with the cloud and distributed computing. Let me walk you through the layers (the libraries/frameworks linked are representative examples).
Replace any library/framework/language below with your favourite alternative.