In this blog I am going to step you through a straightforward way to look at unfamiliar spreadsheet based data. For the purpose of this blog an unfamiliar source refers to any data set where external stakeholders have entered data into a spreadsheet manually or by an extract process alien to you. The expectation is that you are unfamiliar with the contents of the spreadsheet, don’t know the shape of the data, and are wary of the data quality.
For the following steps I am using data unfamiliar to me sourced from the Statistics NZ web page – specifically the Quarterly Regional Building Consents data released in October 2015 http://www.stats.govt.nz/tools_and_services/releases_csv_files.aspx – I am not wary of the data quality.
In most situations any data received on a spreadsheet will be reviewed in situ, by eye or using basic spreadsheet functionality.
I am going to take this opportunity see if Trifacta can help make this process easier. Trifacta is the brain child of a collaboration between UC Berkeley and Stanford and is described as an open source data wrangling tool. The purpose of the initial collaboration being to create quick and simple analytical tool – increasing productivity and enabling faster and better business decisions.
Two things to note: first, you can only feed 100MB into Trifacta, in .csv or .json format, so it is a great “once over” tool to get an idea on the structure of data ahead of any planned and structure analytic development as well as identifying the shape of the data to get a “feel” for what the data shows. Secondly, Trifacta works when connected to the internet, this is because Trifacta monitors the metadata information from your data set and actions undertaken by the user. This is cleverly used by Trifacta to propose actions (called transforms in Trifacta) as you delve into your data. You’ll see what I mean shortly.
Before we get started, if you want to follow my steps you’ll need to download Trifacta. Please do this from: https://www.trifacta.com/products/why-trifacta/
Once installed your Trifacta screen should resemble this one:
I want to create a new dataset from the Statistics NZ web page saved in .csv format.
Give the data set and name and description – then Create and Transform.
The resulting data set in Trifacta will pop up.
The columns appear as named in the source data, the data types appear to the left of the column name – where ‘ABC’ means string, ‘#.#’ means decimal and ‘#’ means integer. If you pass your cursor over any element on your screen in Trifacta a pop-up will appear to give definitions and values.
I have selected the highest bar from the graph under [Data_value]. As this bar darkens, so do the corresponding variables in the other columns highlighting the values that occur for this [Data_value]. You will see that this lowest [Data_value] occurs evenly along the period data.
The values in [Data_value] are strongly skewed and as you scan along the columns you can see there a three types of [Units].
Scrolling down the data we will find that the [Units] change from Number, to Dollars and then to Percent. We will proceed looking at the Number units, so need to filter out all others.
The statement to keep only ‘Number’ from the [Units] column is entered in the Transform Editor.
Not only does Trifacta prompt you to assist with statement writing, it also offers suggestions (Suggested Transforms) based on actions undertaken by other analysts with similar metadata.
When you use Trifacta it sends the metadata from your data set back to Trifacta which amasses information and identifies possible transform suggestions. This can be quite interesting if you have been an analyst for a while, just to see what transforms are suggested. If you have not been an analyst for very long consider it computer mentoring with the added bonus that Trifacta shows expected results from each suggestion as a sanity check.
The shape of the data in [Data_value] shows a more reasonable distribution now that we are not looking across varied units.
To have a more detailed look at the contents of [Data_value] select the ‘Column Details’ from the pull down menu. Trifacta shows you basic statistics from the selected column as well as a frequency chart and distribution of values. As before if you select a particular column in the graph, the corresponding values from other columns will be shown in the graphs on the left hand side of the screen.
Most of the other columns in this data set contain only one variable. This does not make them useful for analysis so we ‘drop’ them. You can use the transform suggestions to do this, or you can manually type the transform in the transform editor.
The resulting statements are appended in the order completed on the right hand side of the screen. Each row can be modified if you change your mind.
Continue looking at the column details until you are happy that you have a better “feel” for the data.
If you want to export your results click ‘Generate Results’ and wait while the Results page is generated.
Click summary to display the results – of interest to me for this exercise is the middle section relating to [Data_value].
This blog has offered you a very quick and brief step through looking at an unfamiliar data set using Trifacta. There is a lot more functionality within Trifacta that you can explore at your leisure. Have fun.
It’s all about the data – Mel