To Be a Data Scientist

by | May 3, 2018

I’m normally cynical about data analysts who have started calling themselves Data Scientists.

My view stems from roles where I managed data warehouse teams and business intelligence data analysts. In extreme cases, we’d do a lot of legwork for the said “Data Scientist” before they delivered anything of “value”.
Last year, following a six-month break from work due to a concussion, I had a fundamental need to ensure my brain was still working. If you’ve ever been concussed (and been lucky enough to recover) you will know the feeling of absolute despair when you are completely unable to do the most mundane tasks. I couldn’t remember when tradespeople said they would arrive, and I even struggled to complete basic Sudokus. I needed independent evaluation to stop feeling like a fraud.
I seized this opportunity to train as a Data Scientist.
PLEASE NOTE: My education focused on statistics and operations research. I have also been very fortunate in the programmers and information systems people I have worked with and learned from over the years. I should already have the skillset required to call myself a Data Scientist. Being a pragmatist, I wanted a piece of paper (or in this case a weblink) to verify these skills had been tested and confirmed.
After trawling the internet, I decided on the Johns Hopkins Bloomberg School of Public Health Data Science Specialization.
Johns Hopkins Bloomberg School of Public Health was the first institution of its type in the world. It is the largest school of public health in the world and has research practitioners working in fields as wide-ranging as epigenetics, mental health, and tobacco control. I assumed, correctly, that our assignments would be interesting.
I have now completed the specialization (I’m making myself put that “z” in for continuity), I have my certificate, and my brain works as well as (possibly better than) it used to.
Some thoughts on the Data Science Specialization to help you decide whether it’s for you:

  1. It is a lot of work. I was working full-time (over 40 hours a week), so completing the coursework took up a substantial amount of my weekends.
  2. There are nine routine modules with a project-based tenth module.
  3. You enrol in each module separately, so if you need a break for whatever reason you can, but once enrolled you have one month to complete each module.
  4. Marking is completed by your peers, which I thought was fantastic. You complete your module your way and then you get to see how other people completed theirs. There was a lot of diversity in thinking and execution, attention to detail, and work ethic. It was wonderful to experience, and I did pick up a few tricks from others, which was an unexpected gift (and I hope not an unethical confession).
  5. The first modules have a lot of students. From about module 5 onward, the number of students drops off. This could be because you can pick and choose modules to a certain extent (there are prerequisites for some) and people may be spacing their education over years. This drop-off affects the online discussion and the number of people available for marking. NOTE: make sure you finish your assignment on time so that it gets marked.
  6. I learned a lot. Yes, I knew some things already, but the amount I learned made this more valuable to me than I was expecting. I would highly recommend this course to anyone looking to cement their right to be called a Data Scientist, and to anyone who isn’t interested in being a Data Scientist but wants to be introduced to R, or Reproducible Research, or Regression etc.

Looking back, my favourite module was Reproducible Research. This module outlines a “process” for research that can also be applied to all forms of data analytics and business intelligence. It cements the fundamental steps and checks that should be completed before any data analytics is released (internally or externally). It is a stand-alone module, so there are no prerequisites to complete. The Reproducible Research module incorporates discussion of failure. It happens. If it’s in good faith, we should learn and get on with it; if it’s not in good faith, then there should be action. This is fundamental.
I am a firm believer in peer review, open and honest conversation, collaboration and continuous improvement. What you think is true this year may be shown to be incorrect next year, and the data analyst should not be blamed for learning on the job. In this sense everything data analysts do is research.
A timely reminder came from Roger Peng, one of the lecturers for the Specialization, in a Twitter post released this morning discussing “What can we learn from data analysis failures?” It also includes links to a very important case study of failure, the “Duke Saga”. Check it out.
My second favourite thing, not a module, definitely a thing, is SWIRL (Statistics with Interactive R Learning). As you learn R through the course, the tutorials have you use SWIRL to cement whatever topic you are learning. I didn’t have this way back when I first learnt R. It is a guided R trainer within R. Toward the end of the course you are introduced to writing SWIRL modules yourself. Very cool indeed. My objective is to try to figure out how to use SWIRL to assess R competency. If one of you has already done that, please share and let me know.
Have fun.
Data – Mel.
Mel blogs about analytics, analytical tools and managing better business intelligence. 
