Preparing Excel Data for Analytics

by | Feb 9, 2017

Grabbing data off an Excel spreadsheet can be a breeze or it can be a complete pain in the butt.  Whether you’re a student or a well-weathered business intelligence professional you just have to grit your teeth, get a fresh mug of tea/coffee/hot chocolate and get on with it.
Frankly, if the data you needed to get at was in a nice table you wouldn’t be looking for advice on what to do. So let’s assume your data has arrived readable to all but a computer.  If you have access to SAS then read on.
Deciding to automate grabbing data from the Statistics New Zealand Income Survey was easy for me.
Each survey has 12 spreadsheets for the years 2011 – 2015 (60 spreadsheets).  Some of these spreadsheets holding data back to 2007.  That’s a lot to manually manipulate.
I also want to easily refresh the data every year for as long as Statistics run the survey.  And I’m picky!  My way isn’t the only way, but it does give me the data how I want it.
It’s important to save each spreadsheet using General formatting to a .csv file.
If you forget to do this the numbers will burst across multiple columns because the source table has comma thousand separators.
You’ll probably make this mistake a few times.  Or a couple more … but you’ll remember.
Using a data step I’ve input table 2 over columns var1 – var15 (gold 1).
I’ve explicitly added columns with qualitative data specific for the table (purple 2).
Scanning the source qualitative column (var1) I created one more qualitative column corresponding to the quantitative data (pink 3).
I then delete the unnecessary rows and columns, if there’s no meaningful qualitative or quantitative data then I’m not interested (teal 4a and 4b).
Voila, one table complete!
Too easy right?  Yip, definitely!  It doesn’t end there.
Statistics New Zealand’s Income Survey data is a good metaphor for our life in business intelligence.  Just when you’ve mastered one set of categories they go and create a few more.

The survey is aggregated using 9 categories.  Reality insists that these categories change within the current survey’s 5-year existence.
To deal with this I’ve added a reference ID to standardise the third (qualitative) category.
Highest Qualification is the most changeable category – specifically when the Ministry of Education changed from 5th/6th Form Certificate and University Entrance into NCEA levels 1-3.
Start and end dates, in the reference table, show the time period each ID reference definition is valid for.
These dates relate to the quarter when each response has been added to the income survey and don’t identify the actual qualification date changes.  Not by several years.
I’m using the following hyperlink to share my final dataset with you.
This data will be used in my next blogs on business intelligence maturity.
There are a couple more things I need to mention:

  • There are occasions where the quantitative data has been refreshed and therefore changes from year to year.  I’ve kept the most recent figures believing them to be the most accurate (e.g., the figures given below, in Table 2, for June 2014 have been updated in the June 2015 data).

Have fun.
Mel blogs about analytics, analytical tools and managing better business intelligence. 
This work is based on/includes Statistics New Zealand’s data which are licensed by Statistics New Zealand for re-use under the Creative Commons Attribution 4.0 International licence.
Want to read more? Try “Data:  The Facts” or more from Mel.
We’ve got the best SAS Administrators in New Zealand ready to help you with your SAS environment.

  1. Shane Gibson

    I just use Trifacta far more visual, intuitive and way faster.

Leave a Reply to Melanie Butler Cancel reply

Your email address will not be published. Required fields are marked *