What is a Data Lake?
In its most basic form, it’s a generic data repository – it can store large quantities of data from many different sources in their original formats. Some examples of the types of data that can be stored are:
- Machine-generated data (Internet of Things, Log files, Sensor readings)
- Human-generated data (Tweets, blogs, emails)
- Traditional operational data (Sales, Inventory, Ticketing)
- Images, Audio and Video.
How do we design one?
For this section, I am making extravagant use of the information from Melissa Coates’ website sqlchick.com, which has a fantastic collection of articles about data warehousing and data lakes. Case in point: here is a great diagram outlining some suggested zones that can be established in a data lake and the consumers that may make use of them.
A sound structure is important; otherwise, we get a data swamp instead of a data lake. As an example of the structure that can be imposed on the suggested zones above, batch-loaded data from a Customer Relationship Management system could follow:
Raw Data > Organizational Unit > Subject Area > Original Data Source > Object > Date Loaded > File(s).
This would look like the following in a file/folder structure:
Raw Data/Sales/Salesforce/CustomerContacts/2017/12/20171205 CustCct.txt
Since the Raw Data zone should only be accessible to a few people, most data retrieval requests would be served from a Curated Data zone, which could be organized in the following manner:
Curated Data > Purpose > Type > Snapshot Date (if applicable) > File(s)
For our given example, this could look like the following in a file/folder structure:
Curated Data/Summarized/2017_01_01/SalesTrend.txt
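The two zone conventions above lend themselves to small path-building helpers. Here’s a minimal sketch in Python; the function names and the way the zone levels map onto parameters are my own simplification of the hierarchies described above:

```python
from datetime import date

def raw_zone_path(org_unit, source, obj, loaded, filename):
    # Raw Data > Organizational Unit > Original Data Source > Object
    # > Date Loaded (year/month folders) > File
    return "/".join([
        "Raw Data", org_unit, source, obj,
        f"{loaded:%Y}", f"{loaded:%m}",
        f"{loaded:%Y%m%d} {filename}",
    ])

def curated_zone_path(purpose, snapshot, filename):
    # Curated Data > Purpose > Snapshot Date (if applicable) > File
    return "/".join(["Curated Data", purpose, f"{snapshot:%Y_%m_%d}", filename])

print(raw_zone_path("Sales", "Salesforce", "CustomerContacts",
                    date(2017, 12, 5), "CustCct.txt"))
# Raw Data/Sales/Salesforce/CustomerContacts/2017/12/20171205 CustCct.txt
print(curated_zone_path("Summarized", date(2017, 1, 1), "SalesTrend.txt"))
# Curated Data/Summarized/2017_01_01/SalesTrend.txt
```

Keeping the convention in one place like this means every loading process writes to the same folder shapes, which is most of the battle in avoiding a swamp.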
Obviously, this is just an example, and I have given no ideas or examples of how you get from the Raw Data zone to the Curated Data zone. An upcoming blog post will cover the Azure Data Lake Analytics suite and how we can use it with the Azure Data Lake Store to move and transform data, but for now, if you want to pull things in and out (in serious bulk), jump down to the last section of this post.
How much will it cost?
As mentioned in a previous post, there is a handy calculator available to find out how much an Azure Data Lake Store may cost. To find the Data Lake Store, look at the products in the Analytics group on the calculator. Click on Data Lake Store and you will get the default setup showing you the prices and combinations in your local currency.
So what options should we use? Well, let’s stick with the default Region and keep the Pricing Type as Pay-as-you-go, since that’s what our subscription currently uses. Storage-wise, let’s go with 10 GB since we’re supposed to be ready to receive lots of stuff. Let’s take a stab and say that we would use 100,000 write transactions and 50,000 read transactions per month. So what are we looking at? ~$1.26 per month – not bad.
In this case, the parameters were just made up for this example. If you’re interested in the available pricing tiers, they are outlined in detail on the Azure website. You won’t see a rate change for storage until you exceed 100 TB, and you have to exceed 5,000 TB (or 1,000 TB on the commitment plan) before you need to have a little chat with Microsoft. Read/write transaction rates are the same across all storage sizes.
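The arithmetic behind an estimate like the one above is just storage plus per-10,000-transaction charges. Here’s a back-of-the-envelope sketch; the default rates are placeholders I made up for illustration, not current Azure pricing – the calculator is the authoritative source:

```python
def estimate_monthly_cost(stored_gb, writes, reads,
                          gb_rate=0.04,
                          write_rate_per_10k=0.05,
                          read_rate_per_10k=0.004):
    """Rough pay-as-you-go estimate: storage charge plus transaction charges.
    The default rates are illustrative placeholders only."""
    storage = stored_gb * gb_rate
    transactions = (writes / 10_000) * write_rate_per_10k \
                 + (reads / 10_000) * read_rate_per_10k
    return round(storage + transactions, 2)

print(estimate_monthly_cost(10, 100_000, 50_000))  # 0.92 with the placeholder rates
```

Plug in the rates shown for your region and currency and you can sanity-check the calculator’s output, or budget a few what-if scenarios at once.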
Watch me make a splash…
Enough about what could be, let’s get on with making our lake.
- Log in to your Azure portal and click on the +New button.
- Click on the [Data + Analytics] group and select [Data Lake Store].
- Now we need to fill in the required attributes:
Name – this will be the name of the service when we want to refer to it.
Subscription – for this situation I’m using my pay-as-you-go subscription.
Resource group – and again I’m going to use the existing one that is associated with my subscription. You can create a new one if you wish.
Location – let’s go with the Default of East US 2 (unless you have a reason to use another one).
Pricing package – let’s stick with Pay-as-You-Go for now (unless you already have a monthly commitment size in mind).
Encryption Settings – Leave these as enabled – if you know what you’re doing feel free to choose the settings appropriate for your situation.
- Check the Pin to dashboard box and click on Create.
- Validation and deployment starts, which will take a few minutes to complete. At the end of it, you should have your Data Lake Store on your dashboard, and it may have automatically opened the overview to show you its wares.
As you can see, there are quite a few settings you can play with; the most important ones deal with access to the data lake (Access control (IAM), Firewall, Locks). Under the Quick Start group, there is a link (Managing user and ACLs) that gives you much more information (and guidance) about how to secure your data within a Data Lake Store.
So we’ve got a data lake store – how do we go about putting in some structure? Well, let’s start by manually creating the two examples we discussed earlier.
- Bring up the Data Explorer for our Data Lake Store. This can be done from the portal dashboard by right-clicking on the Data Lake Store tile and choosing Data explorer. Another way is from the detailed Data Lake Store window showing the Overview tab – click on the Data explorer option.
- Once we are in the Data Explorer, we can create the two sample hierarchies discussed above by treating it like the regular old Windows File Explorer.
- When you have the desired structure you simply use the Upload option to place data files into the relevant folder in the store.
This is not a very effective approach when you want to put a large-scale structure in place, and/or have date-labelled folder hierarchies created automatically.
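Generating those date-labelled hierarchies is an easy job for a script. Here’s a minimal sketch that lists one folder per day of a month; the commented section shows how I understand the azure-datalake-store Python package would then create them (the store name and credentials are hypothetical):

```python
from calendar import monthrange
from datetime import date

def month_partitions(root, year, month):
    """List one date-labelled folder per day of the given month, e.g.
    Raw Data/Sales/Salesforce/CustomerContacts/2017/12/20171205."""
    days_in_month = monthrange(year, month)[1]
    return [f"{root}/{year:04d}/{month:02d}/{year:04d}{month:02d}{day:02d}"
            for day in range(1, days_in_month + 1)]

folders = month_partitions("Raw Data/Sales/Salesforce/CustomerContacts", 2017, 12)
print(len(folders))   # 31
print(folders[4])     # Raw Data/Sales/Salesforce/CustomerContacts/2017/12/20171205

# With the azure-datalake-store package, you could then create each folder:
#   from azure.datalake.store import core, lib
#   token = lib.auth(tenant_id=..., client_id=..., client_secret=...)
#   adl = core.AzureDLFileSystem(token, store_name="mydatalakestore")
#   for folder in folders:
#       adl.mkdir(folder)
```

Run something like this on a schedule (or as part of your load process) and the hierarchy stays consistent without anyone clicking through the Data Explorer.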
Stocking the Lake…
In most cases, you most definitely don’t want to load data manually into an Azure Data Lake Store. You are probably after an approach that will work with your chosen group of technologies and code wizards. At this point we really have a buffet of choices – looking at the documentation for Azure Data Lake Store, we can do the following operational activities with the listed languages/interfaces (which I don’t guarantee is exhaustive):
- Account Management Operations – Azure PowerShell, .NET SDK, REST API, Python.
- Filesystem Operations – Azure PowerShell, Java SDK, .NET SDK, REST API, Python.
- Load and move data – Azure PowerShell, Azure Data Factory, AdlCopy (Storage Blob to Data Lake Store), DistCp (HDInsight storage cluster), Sqoop (Azure SQL Database), Azure Import/Export service (for large offline files), SSIS (using the Azure Feature Pack).
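For bulk loading from your own code, the Python route boils down to “authenticate, connect, put”. Here’s a minimal sketch of that pattern; the filesystem object is injected so the routine can be exercised without a live store, and names like mydatalakestore are hypothetical:

```python
def load_files(fs, pairs):
    """Upload (local_path, lake_path) pairs through any object exposing a
    put(source, destination) method, such as
    azure.datalake.store.core.AzureDLFileSystem."""
    for local_path, lake_path in pairs:
        fs.put(local_path, lake_path)
    return len(pairs)

class FakeFileSystem:
    """Stand-in used here so the sketch runs without Azure credentials."""
    def __init__(self):
        self.uploaded = []
    def put(self, source, destination):
        self.uploaded.append((source, destination))

fs = FakeFileSystem()
count = load_files(fs, [
    ("CustCct.txt",
     "Raw Data/Sales/Salesforce/CustomerContacts/2017/12/20171205 CustCct.txt"),
])
print(count)  # 1

# Against a real store you would connect first, e.g.:
#   from azure.datalake.store import core, lib
#   token = lib.auth(tenant_id=..., client_id=..., client_secret=...)
#   fs = core.AzureDLFileSystem(token, store_name="mydatalakestore")
```

Swapping the fake for a real connection is the only change needed, which also makes this easy to unit test before pointing it at production.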
So lots of different ways to accomplish what you’re after. In my next article on Azure Data Lake, I’ll show how we can go fishing in the lake.
Until de next de time, bork, bork, bork!