Do you care about the path if the destination is the same? Photo: Wikimedia, cc-by-sa-2.0.
Earlier this month I had a good old-fashioned rant about Machine Learning (ML). I wanted to simply compare recently-released Amazon Machine Learning with already-mature Azure ML. Then I decided I should probably define what ML is, and then things got out of hand. To summarise my many points in just three:
- We can use data to learn about the world around us, but we need to be a little careful. Statistics, which is the science of data, is how we know how careful to be, and demonstrate that we were.
- None of the above is very sexy, so computer scientists coined “machine learning” to describe the use of probabilistic (ie ‘might be wrong’) algorithms to make predictions in software.
- Whether you call it statistics or machine learning, it’s really the same thing. “Data science” is the tough-to-achieve combination of both with deep subject matter expertise.
The key difference between the two is the goal they place foremost: statisticians care more about uncovering the underlying data-generating process, whereas machine learners care about accurate predictions to unseen data. You can’t do one without the other, but you must emphasise one over the other.
So which camp do the major cloud ML offerings fall into? Hint: they fall differently!
Amazon Machine Learning is prediction built for developers, not data scientists, and that’s great
Clearly targeted at developers who just want prediction, without all that data science drama.
Right off the bat, AWS makes it very clear that their target audience is developers who are otherwise bamboozled by how to make predictions when uncertainty is involved. The language is all about adding “smarts” to your application, without adding complexity or cost. That Amazon has gone down this route shows just how intimidating data science/statistics/machine learning can be to even very technically capable people. As well as specific technical jargon, data science throws in plenty of hand-waving/folklore/voodoo to the mix (just read any interview with a top Kaggler!) A kinder way to say this is there’s plenty of “art” in data science.
The resources AWS provides to support Amazon ML are written with this in mind, and I think some of their descriptions of complex concepts are very elegant. Take, for example, some of their problem statements for ML:
Examples of binary classification problems:
Will the customer buy this product or not buy this product?
Examples of multiclass classification problems:
Is this product a book, movie, or clothing?
Examples of regression problems:
For this product, how many units will sell?
Those are all really nice, clear business problems. Data scientists often can’t count clarity among our strengths. They also provide a neat mapping to the three classes of algorithm AML supports: binary classification (which of two possibilities is this new thing?), multiclass classification (which of many possibilities is this new thing?), and regression (which numerical value is implied by these new things?). Statistics-inclined data scientists may look at that and poke holes, but it does cover most of what most developers will need.
How does Amazon Machine Learning work?
So how does AML look under the covers? I logged in to one of our AWS environments, switched to the US East (N. Virginia) region where ML is currently available, and loaded up the service. I opted for the “Standard setup” to see what it looks like to the brand-new user.
The interface is very simple, which is good for the target audience.
Along the top is the AML data processing flow, from input data, through schema definition (tell AWS what the data mean), selecting a target variable to predict and a row ID to separate the observations, and review. Efficient and free of unnecessary clutter (again, some data scientists will say “missing key features”). I opted to use the sample banking.csv data provided by Amazon in S3, and also downloaded it so I can use the same thing in Azure ML later on.
After loading your datasource, AML takes a shot at interpreting the data type of each column (options are binary, categorical, numeric, text) and provides some samples so you can correct errors. After that you choose a target, ie the thing you want the machine learning algorithm to predict.
Note AML chooses the algorithm class (in this case Binary classification) based on the target.
In this case, as I’m using AML’s inbuilt data source, it suggested I choose “y”. It is a 0/1 binary indicator of whether the customers in the banking.csv file responded to a previous marketing campaign. These existing records of customer behaviour allow us to infer a link between customer characteristics and likelihood of subscribing to a new product. In other tools we might need to specify an algorithm/model to use, but AML handles this automatically: binary target means binary classification algorithm(s) will be used in the background. After confirming this selection, I opted out of choosing a row ID as the dataset doesn’t contain an obvious good case.
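To make that "target decides the algorithm class" idea concrete, here is a minimal sketch in Python. It is not AML's actual logic, which the service does not expose; the function names and the type-guessing rules are my own assumptions, and the "text" data type is left out for simplicity.

```python
# Hypothetical mapping from the target column's data type to the
# algorithm class, following the three classes described in this post.
MODEL_TYPE_BY_TARGET_TYPE = {
    "binary": "binary classification",
    "categorical": "multiclass classification",
    "numeric": "regression",
}


def guess_column_type(values):
    """Guess a column's data type from its raw string values (assumed rules)."""
    distinct = set(values)
    if distinct <= {"0", "1"}:
        return "binary"
    try:
        for value in distinct:
            float(value)  # raises ValueError if the value is not a number
    except ValueError:
        return "categorical"
    return "numeric"


def choose_model_type(target_values):
    """Pick the algorithm class implied by the target column."""
    return MODEL_TYPE_BY_TARGET_TYPE[guess_column_type(target_values)]


print(choose_model_type(["0", "1", "1", "0"]))            # like "y" in banking.csv
print(choose_model_type(["book", "movie", "clothing"]))
print(choose_model_type(["12", "3.5", "40"]))
```

The point is only that the developer never names an algorithm: the shape of the target is enough to pick the class of model.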
After a quick review, it’s on to setting the parameters of the ML model. I went with the defaults, which use 70% of the specified datasource to train the model, and 30% to evaluate its performance (a simple hold-out form of cross-validation). The point of this is to avoid “over-fitting” the model, in which it can predict the target variable too well on these known data, to the detriment of predictions on unseen data (the goal of the whole exercise). You can customise this split, which also opens up the option to process your data before it goes into the model using JSON (more on these “data recipes” here), as well as other pre-processing options.
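If you want to picture what that 70/30 split does, here is a rough sketch in plain Python. The function name, the seed and the decision to shuffle first are all my assumptions for illustration; I have not confirmed how AML itself orders the rows before splitting.

```python
import random


def split_rows(rows, train_fraction=0.7, seed=42):
    """Split rows into a training set and a held-out evaluation set.

    Shuffling before the cut is an assumption made for this sketch.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(10))  # stand-in for ten rows of a datasource
train, evaluation = split_rows(rows)
print(len(train), "rows to train on,", len(evaluation), "rows held back")
```

The model only ever learns from the first set; the second set plays the role of "unseen" data when its performance is scored.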
It’s alive! The model summary screen shows my model deployed and running.
After a final review, you finish preparation and AML generates the model, giving it a unique ID that will be familiar in format to users of other AWS products. While the status is “Pending”, AML is off training your model on the 70% of the data reserved for that purpose, and evaluating its performance on the remainder of the data (30% evaluation set), pretending that it doesn’t know the true value of y (the target) for the latter cases. This cross-validation construct tells us how the model performs, by comparing its specific predictions about these individual observations from the evaluation set with their known, true target values. With default options this takes a while (~2 minutes for training, ~2 minutes for evaluation) because I didn’t allocate much memory and the number of columns in the dataset (“features” in machine learning speak or “variables” in statistics lingo) is quite large.
Now I’ve built a model, what do I do with it?
When both model training and the very important first evaluation have finished, you’ll see this very AWS-looking screen. Again, a good thing, just very different!
When “prediction for developers” really means it (lucky there’s a blue button for me).
This is where the target audience for AML becomes crystal clear, if it hasn’t already. It’s also where more statistics-trained people like me might get a bit paralysed by fear. So many long text strings! So many colours and tool-tips, and only a few friendly acronyms like AUC. Where are all my post-estimation statistics? This goes in general for the presentation of AML, as soon as you get through those cozy model creation wizards: every model is just another AWS resource like all the rest. To data scientists that will be terrifying, but it absolutely will work for developers who want to treat prediction like any other software module.
Statistics-inclined data scientists, formerly called “analytical modellers”, like to treat models as pets. We even say “my model”! We carefully curate data to feed models, ensuring a balanced diet. Then we lovingly train them through their early stages of growth, teaching them how to run on more and more data, faster and faster. Only when they are “perfect” do we release them to the big ugly world. Like it or not, the world where we have time to do all that is shutting down. We now have more data, more complex business questions, not to mention computer scientists chasing us down with these fancy machine learning algorithms. Complaining that ML just reimplements what we’ve done for years won’t save us; we need to evolve.
Amazon ML’s developer-centric delivery of prediction drags model-making into the world of cloud computing: treat your servers (models) like cattle, not pets. In this world, a model is nothing more than a set of estimates of how some characteristics of a thing (customer, product) relate to something we care about (customer response to a marketing campaign, number of product sales). If we send this object some new information, it should just spit out a prediction!
As it turns out, my model (which is really AML’s model!) of how customers will respond to a marketing campaign did pretty well. Probably the coolest feature for data scientists is that “adjust score threshold” button.
It’s another example of AML nailing something very complex statistically, and presenting it in a sensible format. There is no perfect predictive model, because there is always a tradeoff to make when you predict the future. Basically, it comes down to which type of mistake worries you more: stating something will happen when it won’t (false positives), vs failing to state something that will happen (false negatives). The slider representation on that screen represents this beautifully: you must trade lower performance on one type of mistake for improved performance on the other.
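Here is a small sketch of that trade-off in Python. The scores and true labels are toy values I made up purely for illustration (they are not from banking.csv), and the function name is hypothetical.

```python
def confusion_counts(scores, labels, threshold):
    """Count false positives and false negatives at a given score threshold."""
    false_positives = sum(
        1 for score, label in zip(scores, labels)
        if score >= threshold and label == 0
    )
    false_negatives = sum(
        1 for score, label in zip(scores, labels)
        if score < threshold and label == 1
    )
    return false_positives, false_negatives


# Toy data, invented for this example only.
scores = [0.05, 0.20, 0.35, 0.45, 0.55, 0.60, 0.80, 0.90]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

for threshold in (0.3, 0.5, 0.7):
    fp, fn = confusion_counts(scores, labels, threshold)
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
```

On this toy data, raising the threshold from 0.3 to 0.7 takes false positives from 2 down to 0 while false negatives climb from 0 to 2, which is exactly the trade the slider asks you to make.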
Once you’re comfortable with the threshold (a decision driven by your use-case and data), you’re ready to use the model to predict new things! This can be done back on the model summary screen three images above, either with Batch mode (upload a new datasource the model hasn’t seen before), or in real-time (send rows of data to the model as a cattle-like AWS object and get back predictions). AML provides data for this purpose, under s3://aml-sample-data/banking-batch.csv which it validates on load. You provide an S3 destination for your predictions, review, and set it going. Here’s an edited sample of what I got back:
bestAnswer,score
0,3.145665E-2
1,8.014306E-1
0,4.785213E-1
0,1.044454E-2
1,5.309388E-1
1,8.867225E-1
(4119 records scored in 4 seconds)
For each row in the new data unseen by the model, bestAnswer indicates the prediction of the ML algorithm(s) used, with the score from the model. It would also include a rowID variable if I had specified one at the beginning. That is probably a good idea for productionising the model, as using the predictions downstream would then be dead-easy for a developer. Again, no fancy visualisation or other data science gravy here: just what a developer would need.
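As a sketch of how little work that downstream step could be, here is some Python that reads the sample output above using only the standard library. The function names are hypothetical, and the 0.5 cut-off is my assumption: it happens to reproduce the bestAnswer column in this sample, but I have not confirmed it is what the model used.

```python
import csv
import io

# The edited sample of batch output shown above.
SAMPLE_OUTPUT = """bestAnswer,score
0,3.145665E-2
1,8.014306E-1
0,4.785213E-1
0,1.044454E-2
1,5.309388E-1
1,8.867225E-1
"""


def read_predictions(text):
    """Parse batch output into (bestAnswer, score) pairs."""
    reader = csv.DictReader(io.StringIO(text))
    return [(int(row["bestAnswer"]), float(row["score"])) for row in reader]


def relabel(predictions, threshold):
    """Re-derive 0/1 answers from the raw scores at a threshold of your choosing."""
    return [1 if score >= threshold else 0 for _, score in predictions]


predictions = read_predictions(SAMPLE_OUTPUT)
print(relabel(predictions, 0.5))  # matches bestAnswer for this sample
print(relabel(predictions, 0.4))  # a lower threshold flips the third row to 1
```

Keeping the raw score alongside bestAnswer means a developer can apply a different threshold later without re-running the batch job.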
Conclusion: AML opens a new predictive frontier
I started this feature dive hoping to compare Amazon ML with Azure ML, but I ran out of time yet again! Never mind, I’m building up a series here. A hint for next time: Azure ML takes a very different approach to delivering prediction, which makes data scientists quite comfortable but is perhaps less suited to the “no-frills prediction” I think many developers are after. It’s an interesting contrast, and definitely a good thing for the market. As in anything, a good prediction is a good prediction, so the more paths we can open up the better.
Until next time, keep asking better questions!