Whilst waiting for an invite to play with the preview version of AWS Lake Formation, I thought I would get started on consuming data from a data lake by playing with the Redshift Spectrum product from Amazon Web Services.
This is a part of AWS Redshift (the data warehouse product) that allows you to query across the contents of an Amazon S3 bucket without having to load it into intermediary or dedicated storage.
Ok, so why would I want to do that?
Well, if you have a lot of data stored in your buckets and you want to investigate it before loading it into Redshift proper, you can run queries, joins and aggregates on it alongside existing loaded Redshift data. And because it's AWS, everything is scalable and you only pay while the queries are running.
You make a compelling point – what’s the catch?
Well, you need a Redshift cluster to use Redshift Spectrum, so you'll be paying for that; the other end of the piece of string is your S3 storage, but that will likely already be a sunk cost. The Redshift cluster and your S3 storage also need to be in the same AWS region. The online documentation (Getting Started) mentions that a SQL client is required too, but as you'll see in the following example there is now a handy online query tool as part of the Redshift console that makes life easier.
Got an example?
Most certainly – here's one I prepared earlier.
- An S3 bucket with two folders – one containing a list of CrossFit athletes and the other containing results from the 2015 CrossFit Open. The files in these folders are plain CSV format.
- A Redshift cluster in the same region as the S3 bucket – nothing special about this one; I just created it with the regular defaults.
- I followed the instructions in the Getting Started guide on creating and assigning an IAM role so that the Redshift cluster can access the S3 bucket. This is a little overwhelming at first if you haven't done it before, so go slowly and methodically, and don't forget to copy the role's ARN (Amazon Resource Name) to a clipboard/notepad, since we'll use it shortly.
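If you prefer the command line, the IAM steps above can be sketched with the AWS CLI. This is only an outline of what the console walkthrough does – the role name, trust policy file, account ID and cluster identifier below are placeholders of my own, and the exact managed policies you attach should follow the Getting Started guide:

```shell
# Create a role that Redshift can assume (trust.json should name
# redshift.amazonaws.com as the trusted principal)
aws iam create-role --role-name mySpectrumRole \
    --assume-role-policy-document file://trust.json

# Read-only access to S3 is enough for Spectrum to query the bucket
aws iam attach-role-policy --role-name mySpectrumRole \
    --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess

# Attach the role to the cluster; the role ARN is the value
# you'll paste into the create external schema statement later
aws redshift modify-cluster-iam-roles --cluster-identifier my-cluster \
    --add-iam-roles arn:aws:iam::123456789012:role/mySpectrumRole
```

These commands need valid AWS credentials and an existing cluster, so treat them as a reference rather than something to paste verbatim.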
First something cool
I mentioned above about an online query tool that’s now part of the Redshift console, and to spare you a thousand words – here’s a picture.
As you can see it's nice and simple, and it does the job for the following tasks:
Create an external schema to access the S3 bucket
This creates an external database/schema that you can use alongside other such entities in your Redshift catalog. In the following script I'm creating an external database called spectrumdb with a schema called spectrum (a stretch, I know) and associating it with our previously created IAM role:
create external schema spectrum
from data catalog
database 'spectrumdb'
iam_role '<<The ARN identifier that you previously put aside somewhere>>'
create external database if not exists;
Create external tables for each S3 folder
For this simple example I'm only working with a single CSV file in each folder, but the way you define an external table lets you point it at many files (of the same structure, hopefully) that the Spectrum engine will treat as a single table.
create external table spectrum.athletes_csv(
    athlete_id integer,
    name varchar(100),
    region varchar(100),
    team varchar(100),
    affiliate varchar(100),
    gender varchar(20),
    age smallint,
    height smallint,
    weight smallint,
    fran smallint,
    helen smallint,
    grace smallint,
    filthy50 smallint,
    fgonebad smallint,
    run400 smallint,
    run5k smallint,
    candj smallint,
    snatch smallint,
    deadlift smallint,
    backsq smallint,
    pullups smallint,
    train varchar(200),
    background varchar(200),
    experience varchar(200),
    schedule varchar(200),
    howlong varchar(100),
    retrieved_datetime timestamp)
row format delimited
fields terminated by ','
stored as textfile
location 's3://<<S3 bucket>>/crossfit/athletes'
table properties ('numRows'='423006');
create external table spectrum.leaderboard_csv(
    year smallint,
    division smallint,
    stage smallint,
    athlete_id integer,
    rank integer,
    score smallint,
    retrieved_datetime timestamp,
    scaled boolean)
row format delimited
fields terminated by ','
stored as textfile
location 's3://<<S3 bucket>>/crossfit/results'
table properties ('numRows'='1572222');
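With both external tables in place, the payoff is being able to query and join them like ordinary tables, straight off S3. A minimal sketch using the columns defined above (the aggregate itself is just illustrative):

```sql
-- Best 2015 Open rank achieved per region, joining the two S3-backed tables
select a.region,
       count(distinct l.athlete_id) as athletes,
       min(l.rank) as best_rank
from spectrum.leaderboard_csv l
join spectrum.athletes_csv a
  on a.athlete_id = l.athlete_id
where l.year = 2015
group by a.region
order by best_rank;
```

You could equally join these against tables loaded into Redshift proper – that's the whole point of Spectrum.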
Bravo, but what about…?
Handling header rows
When I first looked at the contents of the tables above I noticed that the header rows were being returned as part of the data – oops.
Correcting this is actually quite simple – drop the table and recreate it, this time adding an additional table property that skips the first line of each file.
table properties ('numRows'='423006', 'skip.header.line.count'='1')
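Putting that together, here's what the drop-and-recreate looks like for the leaderboard table (the athletes table gets the same treatment):

```sql
drop table spectrum.leaderboard_csv;

create external table spectrum.leaderboard_csv(
    year smallint,
    division smallint,
    stage smallint,
    athlete_id integer,
    rank integer,
    score smallint,
    retrieved_datetime timestamp,
    scaled boolean)
row format delimited
fields terminated by ','
stored as textfile
location 's3://<<S3 bucket>>/crossfit/results'
-- skip.header.line.count tells Spectrum to ignore the first line of each file
table properties ('numRows'='1572222', 'skip.header.line.count'='1');
```

Note that external tables can't be altered in place for this, which is why the drop/recreate dance is needed – it's only metadata, though, so nothing in S3 is touched.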
Is it just csv files?
Most definitely not – this page outlines what you can use, with some helpful tips for getting the best out of everything. Here's a quick summary of the supported formats, covering both row-oriented and columnar data:
- Delimited text (CSV/TSV)
- Apache Parquet
- Optimized row columnar (ORC)
- Apache Avro
- RCFile and SequenceFile
- JSON and Amazon Ion
- Regex-parsed text (RegexSerDe and Grok)
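As a taste of the columnar side, here's how the leaderboard table might be defined over Parquet files instead – assuming the data had been converted to Parquet and landed in a hypothetical results_parquet folder. Note there's no row format clause, since Parquet carries its own structure:

```sql
create external table spectrum.leaderboard_parquet(
    year smallint,
    division smallint,
    stage smallint,
    athlete_id integer,
    rank integer,
    score smallint,
    retrieved_datetime timestamp,
    scaled boolean)
stored as parquet
location 's3://<<S3 bucket>>/crossfit/results_parquet/';
```

Columnar formats are generally the better choice for Spectrum, since it only reads the columns a query touches and you pay by data scanned.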
Very good, so where can I go for further information?
Well, as mentioned already in this post, there is the Getting Started guide. This article gives some great background on the why and how of Redshift Spectrum. Here's a handy reference with the best practices, and finally, if you want to build a proof of concept for something you're interested in (or your pointy-haired boss has a new shiny), then you'll want to refer to this and that.
Until de next time
Bork, bork, bork… Brent.