tech ramblings

Cloud isn't just someone else's computer

OK, I mean, it IS indeed just someone else's computer. For the most part, it's owned by Amazon, Google or Microsoft, and you're just running your code on their machines. But the change is much more nuanced than renting a computer instead of using your own. Software has undergone significant changes to adapt to the cloud paradigm, in order to leverage the elasticity and commodity hardware available. I want to illustrate this adaptation by taking the example of Snowflake - the elastic data warehouse.

Snowflake

Back in 2013-14, cloud computing had started gaining traction, but traditional data warehouses were struggling to catch up. This was because they simply weren't built for the cloud; they were designed for fixed resources using a shared-nothing architecture, and weren't keeping up with industry requirements such as support for semi-structured data and data freshness. The alternative to traditional DWH systems was the Hadoop ecosystem, with Spark for analytics. However, it lacked both the efficiency and the simplicity that one would ideally want from a cloud-native data warehouse.

Therefore, the folks at Snowflake decided to build an enterprise-ready data warehouse, native to the cloud. It would be relational, highly available, and support semi-structured data. But most importantly, it would decouple compute and storage. OK, but why?

Decoupling Compute and Storage

Traditional data warehouses were all set up with a shared-nothing architecture. Essentially, the nodes in the distributed system do not share any storage component such as memory or disk. The data is partitioned across the nodes, so that each node only operates on the data it holds locally. The advantage of such a setup is that you minimize data movement across the network, which makes the overall system scalable.
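To make the partitioning idea concrete, here is a minimal sketch of a shared-nothing aggregate. It is only an illustration: the `owner`, `partition` and `local_sum` names, the CRC32-modulo placement rule and the toy rows are all my own assumptions, not how any particular warehouse does it.

```python
import zlib
from collections import defaultdict


def owner(key: str, num_nodes: int) -> int:
    # Hypothetical placement rule: a stable hash of the key, modulo the node count.
    return zlib.crc32(key.encode()) % num_nodes


def partition(rows, num_nodes):
    # Each row lives on exactly one node; nodes share no memory or disk.
    nodes = defaultdict(list)
    for key, amount in rows:
        nodes[owner(key, num_nodes)].append((key, amount))
    return nodes


def local_sum(local_rows):
    # A node only ever scans the rows it holds locally.
    return sum(amount for _, amount in local_rows)


rows = [(f"order-{i}", i) for i in range(1000)]
nodes = partition(rows, num_nodes=4)

# Only the small partial results cross the "network", not the rows themselves.
partials = [local_sum(nodes[n]) for n in range(4)]
total = sum(partials)
```

The thing to notice is that the coordinator only combines four numbers; the thousand rows never leave the node that owns them.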

Of course, as Thomas Sowell says, "There are no solutions. There are only trade-offs"; the shared-nothing architecture is not without its disadvantages. Compute and storage are coupled. As a result, fault tolerance becomes more difficult, because a failed node can make some data unavailable. Maintenance becomes complicated, because adding or removing nodes requires reorganizing and repartitioning the data, which is not possible without incurring a performance cost. Furthermore, resource consumption is rather inefficient in such a setup. Different workloads have different ideal system configurations. For example, bulk ingestion needs high IO bandwidth and little compute, whereas complex queries may need a lot of compute and little IO bandwidth. And in a shared-nothing setup, since each node is responsible for both compute and storage (the coupling), every node has to handle every type of workload, resulting in lower average resource utilisation.
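The repartitioning cost is easy to see with a small experiment. The sketch below assumes a naive hash-modulo placement rule (my assumption, chosen because it is the simplest one; real systems may use smarter schemes) and counts how many rows change owner when a four-node cluster grows to five. The `owner` and `fraction_moved` names are hypothetical.

```python
import zlib


def owner(key: str, num_nodes: int) -> int:
    # Hypothetical placement rule: stable hash of the key, modulo the node count.
    return zlib.crc32(key.encode()) % num_nodes


def fraction_moved(keys, old_nodes: int, new_nodes: int) -> float:
    # A row must be shipped over the network whenever its owner changes.
    moved = sum(1 for k in keys if owner(k, old_nodes) != owner(k, new_nodes))
    return moved / len(keys)


keys = [f"row-{i}" for i in range(10_000)]
moved = fraction_moved(keys, old_nodes=4, new_nodes=5)
print(f"{moved:.0%} of rows change owner when going from 4 to 5 nodes")
```

With an evenly spread hash, a row keeps its owner only when its hash gives the same remainder modulo 4 and modulo 5, which happens for about one row in five. So adding a single node under this rule means moving most of the data.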

Snowflake architecture

Therefore, Snowflake decided to ditch the shared-nothing model in favour of a multi-cluster shared-data architecture. A storage service such as S3 is used, and each node essentially has access to all the data, thereby decoupling storage from compute. On-demand compute in the cloud means you can spin up nodes with a different configuration for each type of workload, increasing efficiency. Further, fault tolerance becomes easier, as the nodes are stateless. You can bring up a new node if one fails, incurring only a small temporary performance cost.
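Here is a toy sketch of why stateless nodes make failures cheap. `SharedStore` is an in-memory stand-in for an object store such as S3, and `Worker` and `run_query` are hypothetical names; none of this is Snowflake's actual implementation.

```python
class SharedStore:
    """In-memory stand-in for a shared object store (an assumption for this sketch)."""

    def __init__(self, blobs):
        self._blobs = dict(blobs)

    def read(self, name):
        return self._blobs[name]

    def names(self):
        return sorted(self._blobs)


class Worker:
    """A compute node that owns no data; it reads whatever file it is handed."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def sum_file(self, file_name):
        return sum(self.store.read(file_name))


def run_query(workers, store):
    # Files are handed out round-robin; any worker can process any file.
    total = 0
    for i, file_name in enumerate(store.names()):
        total += workers[i % len(workers)].sum_file(file_name)
    return total


store = SharedStore({"part-0": [1, 2, 3], "part-1": [4, 5], "part-2": [6]})
workers = [Worker("w1", store), Worker("w2", store)]
before = run_query(workers, store)

# Simulate a node failure: drop w1 and bring up a fresh replacement.
workers.pop(0)
workers.append(Worker("w3", store))
after = run_query(workers, store)
```

Because the data lives in the store and not on the workers, swapping `w1` for `w3` needs no repartitioning and the query result is unchanged.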

There is a downside: this setup causes a dramatic rise in data movement across the network. But it is offset by caching data on the nodes' local disks, and by the efficiency that comes from being able to scale resources elastically as and when required (i.e. you can scale the compute nodes up and down automatically, depending on the load).
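A rough sketch of how a local cache cuts down on network reads is below. The LRU eviction policy, the `RemoteStore` and `CachingWorker` classes and the access pattern are all assumptions made for illustration; the post only tells us that local disks are used as caches.

```python
from collections import OrderedDict


class RemoteStore:
    """Stand-in for remote storage; counts how often data crosses the network."""

    def __init__(self, blobs):
        self._blobs = dict(blobs)
        self.fetches = 0

    def read(self, name):
        self.fetches += 1
        return self._blobs[name]


class CachingWorker:
    """Compute node with a small local cache (LRU eviction, assumed here)."""

    def __init__(self, store, capacity):
        self.store = store
        self.capacity = capacity
        self.cache = OrderedDict()

    def read(self, name):
        if name in self.cache:
            # Cache hit: mark as most recently used, no network traffic.
            self.cache.move_to_end(name)
            return self.cache[name]
        # Cache miss: fetch remotely, then evict the least recently used entry if full.
        data = self.store.read(name)
        self.cache[name] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)
        return data


store = RemoteStore({"part-0": [1, 2], "part-1": [3], "part-2": [4]})
worker = CachingWorker(store, capacity=2)
accesses = ["part-0", "part-1", "part-0", "part-0", "part-2", "part-1"]
for name in accesses:
    worker.read(name)
```

In this run, six reads turn into four remote fetches, since the two repeated reads of `part-0` are served locally. Real workloads tend to re-read hot data far more often than this toy access pattern does.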

All in all, it's really interesting to see how Snowflake not only adapted the data warehouse to the cloud paradigm, but redesigned it to make full use of it. As a result, for you and me as end users, today we can go from zero to Snowflake in less than a couple of hours.