The first time I came across Snowflake was a few years ago, when I was learning how to use Attunity Replicate and set up a data load into this cloud data warehouse. This time I have decided to explore what Snowflake is really about.
I watched some introduction videos, then created a trial account to see it in action. I thought that many years of experience in data warehousing would help me understand how it works in no time, but I was wrong: Snowflake is a “re-invented data warehouse”. Although the Snowflake documentation uses familiar terminology, the terms may mean something else here. These slides from Kent Graziano, a well-known technical evangelist in the data warehousing world, can help you start understanding the architecture and the general direction of thinking. Traditional data warehousing definitely has problems, which current technologies can address and resolve in many creative ways.
Traditionally, an organisation has only one data warehouse, called “enterprise” to describe the scope of the data it stores. Of course, being the single place to answer all data questions, the EDW is also a bottleneck: it has to cope with many simultaneous connections, it takes ages to design a model that has an enterprise-wide meaning, and it takes time to load all the types of source data into that structure. Snowflake’s solution to this is virtual warehouses. They are easily scalable and flexible, and there can be many of them, all running on the same data. In essence, this solution is actually a data lake plus a cloud compute service, but simplified and fully managed, so even people with no technical skills can easily use it. One important detail: a Snowflake warehouse is only a compute container, nothing else. It doesn’t store any data or business rules.
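As a minimal sketch (the warehouse name is hypothetical), creating such a compute container takes one statement; note that nothing in it references any data or storage:

```sql
-- Create a compute-only warehouse; no data or schema is attached to it
CREATE WAREHOUSE IF NOT EXISTS reporting_wh
  WAREHOUSE_SIZE      = 'MEDIUM'
  AUTO_SUSPEND        = 300    -- suspend after 5 minutes of inactivity
  AUTO_RESUME         = TRUE   -- wake up automatically on the next query
  INITIALLY_SUSPENDED = TRUE;  -- don't burn credits until first use
```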
At first, it is confusing how to put data into Snowflake. Traditionally, an organisation has a few transactional databases for its core business operations, plus some other data sources, like files and unstructured data. Snowflake is designed to work with files. So there is still a need for an ETL tool to load data from multiple sources into the Snowflake database, or at least some tool to move data into S3 buckets from where Snowflake can pick it up.
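A hedged sketch of that file-based loading path (the bucket, stage and table names are hypothetical, and the credentials are placeholders):

```sql
-- Point an external stage at the S3 bucket where the ETL tool drops files
CREATE STAGE IF NOT EXISTS landing_stage
  URL = 's3://my-landing-bucket/customers/'
  CREDENTIALS = (AWS_KEY_ID = '<key>' AWS_SECRET_KEY = '<secret>');

-- Bulk-load the staged files into a table
COPY INTO raw.customers
  FROM @landing_stage
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);
```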
Is it really Data Warehouse as a Service?
I would like to challenge Snowflake’s claim to be a DWaaS – Data Warehouse as a Service.
One of the benefits of a data warehouse is maintaining the history of data: we want to see whether values for the same instance of the data have changed over time. For example, a person’s address changes quite often; if an organisation wants to report on the history of a property’s ownership, it should store the owners attributed with the dates when the ownership changed. I have found this blog about building a slowly changing dimension type 2 (SCD2), which is “extremely easy” according to them, although it requires a two-part blog with pages of code. Of course, third-party tools like Attunity Replicate capture all history by default. The time-travel functionality enables CDC-style delta loads, but it is not a replacement for SCD2 from a business perspective, as the retention period is limited.
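To illustrate the difference, here is a minimal SCD2-style sketch (all table and column names are hypothetical): history has to be maintained explicitly, by expiring the old row and inserting a new version, whereas time travel only lets you query the recent past:

```sql
-- Expire the current dimension row when the tracked attribute changes
UPDATE dim_customer d
   SET valid_to = CURRENT_TIMESTAMP(), is_current = FALSE
  FROM stg_customer s
 WHERE d.customer_id = s.customer_id
   AND d.is_current
   AND d.address <> s.address;

-- Insert the new version as the current row
-- (a real implementation would filter to changed rows only)
INSERT INTO dim_customer (customer_id, address, valid_from, valid_to, is_current)
SELECT customer_id, address, CURRENT_TIMESTAMP(), NULL, TRUE
  FROM stg_customer;

-- Time travel, by contrast, only reaches back within the retention window
SELECT * FROM dim_customer AT(OFFSET => -86400);  -- as of 24 hours ago
```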
Another benefit of a data warehouse is data integration from multiple sources, enabling a central view across the enterprise. For example, each of the organisation’s source systems has something like a customers table; a centralised customer view should gather all people in one place, each only once. That’s where ETL traditionally becomes complicated. This problem didn’t go anywhere with Snowflake: someone still needs to write a person-matching algorithm and make it available to everyone, so it doesn’t get written again in another warehouse with different results. Integration logic can be very complex and require a lot of computing resources. Although Snowflake scales easily up and, quite importantly, down, I see no reason to redo this work every time someone queries the data; it would be more efficient to compute it once, i.e. store it physically.
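Persisting such a result is straightforward; a hedged sketch with hypothetical names, using a naive most-recent-record-per-email rule in place of real person matching:

```sql
-- Compute the integrated view once and store it physically,
-- instead of re-running the matching logic on every query
CREATE OR REPLACE TABLE integrated.customer AS
SELECT customer_id, full_name, email, address
  FROM (
    SELECT c.*,
           ROW_NUMBER() OVER (PARTITION BY email
                              ORDER BY updated_at DESC) AS rn
      FROM raw.customer_all_sources c
  )
 WHERE rn = 1;  -- keep one record per matched person
```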
Data warehouses improve data quality by providing consistent codes and descriptions and by flagging or fixing bad data. That is, subject matter experts provide rules which, when applied to the data, improve its quality, and the result is made available to all users in the centralised data storage. Data needs to be loaded into the data warehouse, cleansed, integrated and transformed into the enterprise representation with ETL. Snowflake resources claim that it “eliminates the need for lengthy and labour intensive ETL processes by making data easily accessible via data sharing”. They are actually talking about outgoing ETL: extracting data from the data warehouse into a location where an external consumer can pick it up. That is a different kind of ETL and, trust me, it is just a technical problem, far less labour-intensive than the tasks described above.
The misconception about data warehouses comes from the cloud providers themselves, like AWS and Azure. They only talk about the moving parts of the DW architecture, e.g. here they are collect, store and analyse, with many possible cloud products for implementing each part. There’s nothing in this definition about transforming, cleansing and integrating the data, yet these are also important data warehousing concepts; they take time and the collective effort of data specialists and business users to make data usable across the enterprise. In other words, a data warehouse implementation is not only about technology, and cannot be achieved by throwing different cloud products into one bowl with no “boring” and “lengthy” ETL. Studies show that data scientists, who are expensive specialists, spend 80% of their time on data formatting, cleaning, deduplication and transformation – all the tasks which were skipped because apparently “ETL is no longer needed!” Just imagine if data scientists could get all the data they need from the data warehouse, unchanged but prepared for analysis; that would be a dream.
We have found out what Snowflake isn’t; now let’s see what it actually offers.
Snowflake is a cloud platform that provides fully managed storage and multiple scalable, on-demand compute instances. Snowflake automatically suspends a warehouse that has been idle for some time and resumes it in seconds when a query is executed against it. As compute is quite expensive, auto-suspend allows significant savings. Also, warehouse size can be changed without an outage.
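Both behaviours are controlled with plain SQL; a sketch with a hypothetical warehouse name:

```sql
-- Resize without an outage; queries already running finish on the old size
ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'XLARGE';

-- Tighten the idle timeout (in seconds) to save credits
ALTER WAREHOUSE reporting_wh SET AUTO_SUSPEND = 300;
```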
Snowflake encrypts all the data by default. Data encryption is one of the best described topics in the documentation.
Another important question is how the data is backed up. Instead of classic backups, Snowflake has Time Travel and Fail-safe. The former allows you to access and restore a table or database which has been changed or deleted by mistake, i.e. a data snapshot. By default, Snowflake allows 1 day of time travel; the Enterprise edition enables up to 90 days of retention. Fail-safe is a similar concept, but only Snowflake support can restore that data; its retention period is 7 days after the time-travel period finishes. All this only costs you storage, which is very cheap. I couldn’t find geo-replication in the documentation, so I’m assuming it is not an option.
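Recovering from a mistake within the time-travel window is a one-liner (table names are hypothetical):

```sql
-- Bring back a table dropped by mistake
UNDROP TABLE sales;

-- Recreate a table as it was an hour ago, e.g. before a bad UPDATE
CREATE TABLE sales_before_update CLONE sales AT(OFFSET => -3600);
```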
Snowflake can be queried by many of the modern Business Intelligence tools, like any other database engine; the query language is good old SQL. Also, data can be securely shared through read-only access to specific tables and views; you’ll appreciate this feature if, like me, you have ever had to exchange files via FTP or Azure storage.
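Sharing is configured in a few statements; a hedged sketch with hypothetical database, view and account names:

```sql
-- Create a share and expose one read-only view through it
CREATE SHARE customer_share;
GRANT USAGE  ON DATABASE analytics        TO SHARE customer_share;
GRANT USAGE  ON SCHEMA   analytics.public TO SHARE customer_share;
GRANT SELECT ON VIEW     analytics.public.customer_summary TO SHARE customer_share;

-- Make the share visible to a consumer account
ALTER SHARE customer_share ADD ACCOUNTS = partner_account;
```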
One of the important considerations when choosing a platform is pricing. The Snowflake price depends on many factors. Let’s calculate the cost for a typical New Zealand-sized enterprise data warehouse.
- Cloud platform is AWS Sydney region
- We choose the Enterprise edition for multi-cluster and extra encryption functionality
- 2TB of data pre-paid = $50
- 1 XL-size warehouse for the data load which is running 3 hours a day (let’s say there’s a need for complex ETL on the way in), 16 credits/hour = $5,832
- 1 L-size warehouse for user access, running for 8 hours every business day, 8 credits/hour = $5,184
- 1 S-size warehouse for testing and development, running as needed, let’s say 6 hours every business day, 2 credits/hour = $972
The scenario described above costs $12,038 per month, plus the ETL tool licence if you have one. Data transfer is free within the region. Usually, a reporting tool loads data into memory, so the data-access warehouse is used only for new data loads, e.g. for 1–2 hours a day. I counted it as one warehouse, but it probably makes sense to create a few of them for the best performance under concurrent use. As mentioned above, Snowflake suspends a warehouse that hasn’t been used for longer than 10 minutes (or even 5, if you set it up that way) and resumes it in seconds once a query is issued. So I believe it will not actually run for the whole 8 hours of a business day at an enterprise of the described size.
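The arithmetic behind the total can be reproduced in a few lines; the per-credit price of $4.05 (Enterprise on AWS Sydney at the time of writing) and the 30 calendar / 20 business days per month are my assumptions, backed out of the figures above:

```python
PRICE_PER_CREDIT = 4.05  # assumed USD per credit, Enterprise, AWS Sydney
STORAGE_MONTHLY = 50     # 2 TB pre-paid, USD per month

def warehouse_cost(credits_per_hour: int, hours_per_day: int,
                   days_per_month: int) -> float:
    """Monthly cost of one warehouse, assuming it runs the full window."""
    return credits_per_hour * hours_per_day * days_per_month * PRICE_PER_CREDIT

etl_wh  = warehouse_cost(16, 3, 30)  # XL, runs daily       -> 5832.0
user_wh = warehouse_cost(8, 8, 20)   # L, business days     -> 5184.0
dev_wh  = warehouse_cost(2, 6, 20)   # S, business days     -> 972.0

total = STORAGE_MONTHLY + etl_wh + user_wh + dev_wh
print(f"${total:,.0f} per month")  # $12,038 per month
```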
Wrapping up: unless you have a good old team of data modellers, ETL developers, testers and business analysts who can build a data warehouse on the Snowflake platform, the best you can achieve is a governed data lake (as this article calls it), which isn’t a bad thing at all. According to Kent Graziano, it is possible to use traditional data warehouse modelling techniques on the Snowflake platform. To implement them, the team would also need one of the mature ETL tools, like BimlFlex, Talend, Matillion, etc. However, Snowflake doesn’t force any warehousing techniques on you and happily stores and queries any data, which makes it a powerful tool for many data-related purposes.