After weeks or even months of data discovery, exploration, cleansing, selecting potential variables and applying variable transformations, you are now in a position to create a model.
Your model will predict or explain:
- Numeric targets, such as the energy consumed each month, how many call centre reps are required at any particular time, or the total expected revenue each month for the upcoming year.
- Decision targets, such as fraud detection (is an insurance claim legitimate?), credit risk (will this customer default on their payments?), or customer retention (will this customer stay?).
- Categorical targets, such as the flavour of ice-cream a customer will choose, or the colour of car that will sell the most.
It may seem that, with all the software out there to do the grunt work, it's simply a matter of pointing and clicking and voilà, out comes a model. However, it is still important to know when to use which technique, how to interpret the results and explain them to others, and what the limitations and assumptions are.
The most common methodologies include:
- Decision Trees are useful for clustering/segmentation as well as predicting numeric, categorical or decision targets. While they are easy to interpret and handle missing values well, they do not work well on small samples (a rough sketch appears after this list).
- Linear Regression is often used to predict continuous target variables. It requires the input variables to be linearly related to the target and the residuals (the difference between actual and predicted values) to be normally distributed. The equation produced is easy to interpret, but some sort of intervention is required to handle missing values, and linear regression may lead to overfitting (see the residual check sketched after this list).
- Logistic Regression is similar in many ways to linear regression. Instead of predicting a continuous target variable, it is used to determine the probability of an event happening. Like linear regression, it requires a method to deal with missing values and can lead to overfitting, and large sample sizes are needed (also sketched after this list).
- Neural Networks are a very powerful technique, originally inspired by the way nerve cells interconnect. They can be very resource intensive and difficult to interpret.
- The k-means algorithm is often used for clustering, i.e. dividing your data so that members of a group share similar characteristics. A number of cluster centres are selected and each data point is assigned to the centre it is closest to. The centres are then updated and the data reassigned, and the process continues until the cluster centres remain static. Because the algorithm is based on the distance of data points to the cluster centres, the input data needs to be on the same scale of measurement (see the sketch after this list).
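As a rough illustration of the decision-tree point, here is what fitting and inspecting a small tree might look like in Python with scikit-learn. The file and column names (claims.csv, claim_amount, customer_age, is_fraud) are made up for the example.

```python
# A minimal decision-tree sketch with scikit-learn; the file and column
# names below are invented for illustration.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.read_csv("claims.csv").dropna()      # hypothetical data; dropping missing rows to keep it simple
X = df[["claim_amount", "customer_age"]]     # hypothetical input variables
y = df["is_fraud"]                           # hypothetical 0/1 decision target

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)

# The fitted tree prints as human-readable if/else rules, which is why
# decision trees are considered easy to interpret and explain.
print(export_text(tree, feature_names=["claim_amount", "customer_age"]))
```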
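In the same spirit, a minimal linear-regression sketch with a quick check that the residuals look roughly normal might go like this (again, the data file and columns are assumptions):

```python
# A minimal linear-regression sketch with a residual-normality check;
# the data file and columns are assumed for illustration.
import pandas as pd
from scipy import stats
from sklearn.linear_model import LinearRegression

df = pd.read_csv("energy_usage.csv").dropna()   # a simple intervention for missing values
X = df[["temperature", "occupancy"]]            # hypothetical inputs
y = df["monthly_kwh"]                           # hypothetical numeric target

model = LinearRegression().fit(X, y)
print("intercept:", model.intercept_, "coefficients:", model.coef_)

# The residuals (actual minus predicted) should look roughly normal;
# a quick check is a normality test such as Shapiro-Wilk.
statistic, p_value = stats.shapiro(y - model.predict(X))
print("Shapiro-Wilk p-value:", p_value)
```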
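For logistic regression, the main difference is that you get back the probability of the event rather than a numeric prediction. A sketch, with made-up churn data:

```python
# A minimal logistic-regression sketch: predicting the probability of an
# event rather than a numeric value (churn.csv and its columns are made up).
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("churn.csv").dropna()          # missing values still need handling
X = df[["tenure_months", "monthly_spend"]]      # hypothetical inputs
y = df["churned"]                               # hypothetical 0/1 decision target

clf = LogisticRegression(max_iter=1000).fit(X, y)

# predict_proba returns one column per class; column 1 is the probability
# that the event (here, the customer leaving) happens.
print(clf.predict_proba(X)[:5, 1])
```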
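And for k-means, the important practical step is putting the inputs on the same scale before clustering. A sketch, again with invented column names:

```python
# A minimal k-means sketch; standardising first matters because the
# algorithm is distance-based (customers.csv and its columns are made up).
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("customers.csv").dropna()
X = df[["annual_income", "visits_per_month"]]   # hypothetical inputs on very different scales

X_scaled = StandardScaler().fit_transform(X)    # put the inputs on the same scale

# fit() repeats the assign-points/update-centres loop until the centres stop
# moving; n_init re-runs it from several random starts and keeps the best result.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_scaled)
df["cluster"] = km.labels_
print(df["cluster"].value_counts())
```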
A Google search on analytic techniques shows that there are many more options than those given here. Most of the more complex or sophisticated techniques have their roots in the methods described above, so it makes good sense to ensure you thoroughly understand them.
No matter what level you are at, the following are ‘must dos’ for any model creation:
- Split your dataset into three random datasets (see the split example after this list):
  - A training data set. This is what you'll do most of the figuring out on.
  - A validation data set. This is what you'll use to determine which model performs best.
  - A test data set to run your final model over, to ensure it still holds up on unseen data.
- Make sure you try a number of different methodologies appropriate to the target. This enables you to determine whether the model is any good: even if you have no intention of using a neural network (as it's difficult to explain), it is still useful to include one as a comparator model (see the comparison sketch after this list).
- Feel free to use clustering/segmentation as part of the inputs into a model as well as an output to help visualise results. Members of a cluster or segment used as part of the input data do not have to remain together in the cluster or segment used to describe the results (illustrated after this list).
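Here is one way the three-way split might be done, assuming a roughly 60/20/20 breakdown (the file name and proportions are just illustrative):

```python
# One way to split a dataset into training, validation and test sets
# (roughly 60/20/20 here; the file name and proportions are illustrative).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("modelling_data.csv")          # hypothetical dataset

train, temp = train_test_split(df, test_size=0.4, random_state=42)          # 60% for training
validation, test = train_test_split(temp, test_size=0.5, random_state=42)   # 20% / 20%

print(len(train), len(validation), len(test))
```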
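And a sketch of comparing a few candidate models on the validation set, with a neural network included purely as a comparator. The churn data and columns are the same invented ones as above:

```python
# Comparing candidate models on the validation set, with a neural network
# included purely as a comparator (data and column names are invented).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

df = pd.read_csv("churn.csv").dropna()
features = ["tenure_months", "monthly_spend"]   # hypothetical inputs
target = "churned"                              # hypothetical 0/1 decision target

train, temp = train_test_split(df, test_size=0.4, random_state=42)
validation, _test = train_test_split(temp, test_size=0.5, random_state=42)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "neural network": MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000, random_state=0),
}

# Fit each candidate on the training data and score it on the validation data.
for name, model in candidates.items():
    model.fit(train[features], train[target])
    auc = roc_auc_score(validation[target], model.predict_proba(validation[features])[:, 1])
    print(f"{name}: validation AUC = {auc:.3f}")
```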
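Finally, a sketch of feeding cluster membership in as an extra input variable. The segments built here as inputs are independent of any segments you might later use to describe the results (file and column names are, as before, made up):

```python
# Using cluster membership as an extra input variable; the clusters built
# here as inputs are separate from any segments used to describe results.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("customers.csv").dropna()
behaviour = ["annual_income", "visits_per_month"]   # hypothetical behavioural inputs

# Segment the customers first, then feed the segment label in alongside the raw inputs.
df["segment"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(df[behaviour])
)

X = pd.get_dummies(df[behaviour + ["segment"]], columns=["segment"])
y = df["churned"]                                   # hypothetical decision target
model = LogisticRegression(max_iter=1000).fit(X, y)
print(X.columns.tolist())
```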
Now all you need to know is how to figure out whether a model is any good or not …