In the world of computer science, there are so many tools and technologies out there that it can get overwhelming rather quickly. Before you've finished learning how to use one tool, one of the big tech companies has open-sourced yet another cool project. The problem is: how do you keep up? One of the things that has helped me is to focus more on the "why" rather than the "how". It's impossible to know how to use every single tool out there, so it's reasonable to learn them when you need to, say on the job, or when you're working on a side project. But spending time understanding the "why", the motivation behind a project, goes a long way. It enables you to quickly understand the problems it solves, why other similar projects were built, how to compare and contrast them, and, most importantly, how to pick the right tool for the job at hand. (Yes, I used to be a Python guy and build everything in Python, but I've now realised that it's not about the language; it's about understanding the problem at hand and finding the right tool that solves it and fits well into your existing architecture 🥲.)
So, let's take a 10,000-ft view of the journey all the way from supercomputing to Spark, and see why some of these technologies were built, what their limitations were, and how they sparked the need for the next generation of tools.
Back in the day, the only approach to scaling was vertical. If you had a computationally intensive task, you would beef up the machine. In other words, supercomputing. Of course, that works well, to an extent. But it has its limitations:
- To take maximum advantage of the high-performance hardware, you have to write highly efficient code, which requires expertise and sometimes sacrifices readability.
- Data sizes started becoming huge, which meant that even a high-performance single machine couldn't finish the work within a reasonable amount of time.
Enter, cluster computing. A lot of cheap, unreliable machines that can be used together to achieve reliable computing. Well, that seems reasonable! But of course, it's not paradise-land yet. With solutions to problems come new, and better, problems. Yes, you could now leverage a cluster, but that meant writing fault-tolerant code and debugging parallel programs, which requires a lot of low-level programming expertise. As a software engineer, maybe all you want to do is count the words that appear in a large set of documents, but now you have to worry about parallelizing it and making your code fault-tolerant.
Hello, map-reduce/Hadoop. The motivation here was to create a higher level of abstraction that hides all these complexities and lets the programmer concentrate on the problem at hand. If you could model your problem as a map and a reduce procedure, you could hand it to the map-reduce infrastructure, which orchestrates the processing on distributed servers and manages all the communication between nodes in a reliable, fault-tolerant manner (there's a toy sketch of this programming model after the list below). In a distributed setting, network latency can become the bottleneck, so map-reduce addressed this with a distributed file system, the Hadoop Distributed File System (HDFS): data is first partitioned across the nodes in a cluster, and computation is moved to the data rather than the other way around. In other words, each node operates only on the data stored locally, which avoids moving data around the cluster and thereby reduces latency. While its proponents thought it was a "paradigm shift", it had its critics, most notably Michael Stonebraker (a specialist in database systems and a Turing Award winner). Map-reduce was attempting to solve the problem of processing big data at scale, so some of the criticisms were:
- Why not use one of the then-existing high-performance parallel database systems instead of map-reduce? The database community had spent years optimizing and scaling data processing operations, and map-reduce, in fact, paled in comparison on execution times.
- The lessons learned by the database community, such as "schemas are good" and "high-level access languages like SQL are good", were not incorporated by map-reduce.
- It did not support indexes or transactions.
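To make the programming model concrete, here's a deliberately toy, single-process sketch of word counting in the map-reduce style. It's plain Python, not real Hadoop: the map and reduce functions are the parts you would write, and the sort/group step stands in for the shuffle that the framework would perform across the cluster.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(document):
    # Map: emit a (word, 1) pair for every word in the document.
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce: combine all the values that share a key.
    return (word, sum(counts))

documents = ["the quick brown fox", "the lazy dog jumps", "the fox"]

# "Map" phase: in a real cluster, each document is mapped on whichever node stores it.
pairs = [pair for doc in documents for pair in map_fn(doc)]

# "Shuffle" phase: group pairs by key; this is where the framework moves data between nodes.
pairs.sort(key=itemgetter(0))

# "Reduce" phase: one reduce call per distinct word.
counts = [reduce_fn(word, [c for _, c in group])
          for word, group in groupby(pairs, key=itemgetter(0))]

print(counts)  # e.g. [('brown', 1), ('dog', 1), ('fox', 2), ...]
```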
While these were all valid points, the map-reduce community sure had counter-arguments. First, map-reduce was never meant to be a DBMS; it was an algorithmic technique. Further, it was built to perform data processing on cheap, possibly unreliable computers in a reliable manner (as opposed to the parallel RDBMS systems, which for the most part assumed reliable hardware). Setting these criticisms aside for a moment, map-reduce also had other limitations. Firstly, it writes to disk at every step, which isn't very efficient. Further, it had a limited set of operators: map and reduce. Map-reduce was also built around an acyclic data flow model, which was not a good fit for the kinds of applications that were starting to be built, like machine learning algorithms and interactive data analysis. That is, you could not reuse a working set of data across multiple parallel operations. If you had to run multiple analytical queries on the same data using map-reduce, you would have to load it from disk every time, which, again, isn't very efficient.
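To see why the acyclic, disk-based flow hurts iterative work, here's a contrived little simulation, again in plain Python (nothing here is real Hadoop, and the file names are made up): every "job" reads its input from disk and writes its output back to disk, so a loop of chained jobs pays the round-trip on every pass even though each step only needs the output the previous step just had in memory.

```python
import json

def run_job(input_path, output_path, transform):
    # Each "job" reads its whole input from disk...
    with open(input_path) as f:
        data = json.load(f)
    result = transform(data)
    # ...and writes its whole output back to disk for the next job to pick up.
    with open(output_path, "w") as f:
        json.dump(result, f)

# Seed the working set on "disk" (a stand-in for HDFS here).
with open("step_0.json", "w") as f:
    json.dump({"the": 3, "fox": 2, "dog": 1}, f)

# An "iterative" workload: three chained jobs over essentially the same working set.
# Every iteration reloads from disk the data the previous iteration already produced.
for step in range(3):
    run_job(f"step_{step}.json", f"step_{step + 1}.json",
            lambda counts: {word: count * 2 for word, count in counts.items()})
```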
Introducing, Spark. Spark is a cluster computing framework that solved these problems: it avoids writing intermediate results to disk and keeps them in memory instead. It does this using something called a Resilient Distributed Dataset (RDD), an immutable collection of data that is partitioned into chunks and distributed across the nodes of a Spark cluster. All Spark operations work on RDDs. There are a couple of key features of an RDD. One, RDDs can be cached in the memory of the nodes, which makes them suitable for iterative tasks (like the data analysis use case we discussed above). Two, Spark keeps a record of the lineage of transformations that produced an RDD, which allows it to reconstruct the RDD if it is lost, thereby making the otherwise unreliable in-memory operations fault-tolerant. As a result, Spark is faster than map-reduce and is today a popular choice for large-scale data analytics!
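Here's roughly what that looks like in PySpark, assuming a local Spark installation and a placeholder input file ("docs.txt"). The transformations lazily build up an RDD and its lineage, cache() pins the result in memory, and the two actions at the end reuse that cached working set instead of re-reading from disk.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-word-count")

# "docs.txt" is a placeholder path; each transformation below returns a new,
# immutable RDD and records its lineage rather than executing immediately.
counts = (sc.textFile("docs.txt")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

# Pin the working set in the executors' memory.
counts.cache()

# Two actions on the same RDD: the first triggers the computation,
# the second is served from the cache instead of recomputing from the file.
print(counts.count())
print(counts.takeOrdered(10, key=lambda kv: -kv[1]))

sc.stop()
```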
So, where are we today with big data infrastructure? It's interesting to note that we started with map-reduce trading off traditional database wisdom for simplicity and scale, yet we have gradually circled back to incorporate many of those database techniques and address Michael Stonebraker's criticisms. For example, Spark now favours DataFrames, which have schemas, over raw RDDs. Spark's query engine has a planning and optimization layer, and file formats like Parquet have adopted the columnar compression techniques used in analytical databases. I guess that was the only way to move forward: iteratively, rather than trying to solve all the problems at once.
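As a small illustration of those borrowed database ideas, here's a hedged PySpark sketch (the data, column names, and output path are made up): the DataFrame carries a schema, the query goes through Spark's planning and optimization layer before it runs, and the data round-trips through the columnar Parquet format.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("df-sketch").getOrCreate()

# Unlike a raw RDD, a DataFrame carries a schema (given here as a DDL string).
df = spark.createDataFrame(
    [("alice", 34), ("bob", 29), ("carol", 41)],
    schema="name STRING, age INT",
)

# Parquet lays the data out column by column with compression,
# borrowing from analytical database storage formats.
df.write.mode("overwrite").parquet("/tmp/people.parquet")
people = spark.read.parquet("/tmp/people.parquet")

adults = people.filter(F.col("age") > 30).select("name")

# explain() prints the plan the optimizer settles on before execution.
adults.explain()
adults.show()

spark.stop()
```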
It's fascinating to look back on the path we've taken. We've come a long way from processing kilobytes on a single machine to petabytes on thousands of machines. And I'm excited to see how we advance further!
Further Reading:
- MapReduce vs Michael Stonebraker
- Most of my learnings are from my master's course, LSDE at VU Amsterdam