This course will provide a conceptual overview of, and practical hands-on experience with, a wide range of key tools, techniques and processes. At the heart of the data mining toolkit is the suite of predictive modelling methods.

Accordingly, the course will develop your literacy in the strengths, characteristics and correct application of a range of predictive modelling techniques, from relatively simple linear models through to complex and powerful methods: Decision Trees, Random Forests, Support Vector Machines, Tree Boosting Machines and Neural Networks will all be covered along the way. It will also teach you the correct framing of predictive modelling problems, how to suitably prepare data, evaluate model accuracy and stability, interpret results and interrogate models. The two key styles of predictive modelling – operational for targeting and explanatory for insights – will be described and distinguished.


As well as predictive modelling, the course will cover a range of other key data mining tools, including:

  • Data exploration and visualisation: univariate summaries, correlation matrices, heat maps, hierarchical clustering.

  • Cluster analysis – used for customer segmentation and anomaly detection.

  • Other “unsupervised” outlier detection tools.

This course will primarily be taught using Rattle, a graphical interface for predictive modelling and data science in R. You will be exposed to “Big Data” techniques as applied to machine learning and deployed on Cloud Computing platforms.

Additional topics

The following additional topics may be covered depending on the pace and interests of the class:

  • Link and network analysis visualisation – which provides a simple and compelling way to communicate and analyse relationships, and is commonly applied in forensics, human resources and law enforcement.

  • Association analysis – used in retail market basket analysis and the assessment of risk groupings.

  • Frequent item set analysis.
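Both association analysis and frequent item set analysis start from the same primitive: counting how often items, and combinations of items, occur across transactions (their "support"). In R this is typically done with the arules package, which Rattle can drive; the following is only a minimal plain-Python sketch on a hypothetical basket dataset, not the course's tooling:

```python
from itertools import combinations
from collections import Counter

# Hypothetical "market basket" transactions, for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"beer", "milk"},
    {"bread", "butter", "milk"},
]

def frequent_itemsets(transactions, min_support=0.4, max_size=2):
    """Count the support of every itemset up to max_size items,
    keeping only those at or above the minimum support threshold."""
    n = len(transactions)
    counts = Counter()
    for basket in transactions:
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(basket), size):
                counts[itemset] += 1
    return {itemset: count / n
            for itemset, count in counts.items()
            if count / n >= min_support}

for itemset, support in sorted(frequent_itemsets(transactions).items()):
    print(itemset, round(support, 2))
```

For example, {bread, milk} appears in three of the five baskets, so its support of 0.6 clears the 0.4 threshold, while {beer, milk} (support 0.2) is discarded. Real implementations such as Apriori prune the search instead of enumerating every combination.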

Day 1

  • An overview of key terms: what do data science, machine learning, AI and deep learning actually mean?

  • An intuitive and original introduction to what a machine learning model is, and what it does.

  • Practical exercise: Exploratory data analysis – summaries, visualisation, bar charts, pair plots and correlation plots.

  • Key terms: What is data? What is a model? What is a record? A field? A training set? A target variable? A missing value?

  • Introduction to predictive modelling: What is a decision tree model, how is one built, how does it make predictions and what else can be done with it?

  • Practical exercise: Building a decision tree model for classification.

  • Decision trees for regression (estimation of amounts), and practical exercise.

  • Linear regression models, and practical exercise.

  • Generalised linear models (logistic regression) for classification, and practical exercise.

  • Most important part of the course: How are predictive models evaluated? What is the KPI of predictive modelling?

  • What is the one thing that all practitioners, managers and stakeholders of machine learning must know? And what makes the definition, measurement and improvement of this KPI tricky?

  • An intuitive, visual explanation of the problem of overfitting and the importance of out-of-sample testing.

  • Creating training/validation splits.

  • Using out-of-sample testing to evaluate models and select a final model.

  • The importance of a three-way training/validation/test split.

  • Accuracy measures for classification modelling.

  • Practical exercises: build multiple classification models, assess them on out-of-sample data and select the best final model out of a range of models including random forests, gradient boosting and support vector machines. Repeat as a model optimisation task to build the most accurate possible decision tree.
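The core workflow in the bullets above (split the data, fit on the training part, judge accuracy only on held-out records) can be sketched in a few lines of plain Python. The course itself works in Rattle/R; the toy one-field dataset and the one-split "decision stump" (the simplest possible decision tree) below are purely hypothetical illustrations:

```python
import random

random.seed(0)

# Hypothetical toy data: one numeric field and a binary target.
# Records with x > 5 are mostly class 1, with ~10% label noise.
data = [(x, int(x > 5) if random.random() > 0.1 else int(x <= 5))
        for x in [random.uniform(0, 10) for _ in range(200)]]

# Training/validation split: fit on one part, evaluate on the held-out part.
random.shuffle(data)
train, valid = data[:140], data[140:]

def fit_stump(records):
    """A one-split 'decision tree': choose the threshold on x that
    gives the best accuracy on the training records."""
    best_t, best_acc = None, 0.0
    for t, _ in records:  # candidate thresholds taken from the data itself
        acc = sum((x > t) == bool(y) for x, y in records) / len(records)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def accuracy(records, threshold):
    """Fraction of records whose class is predicted correctly."""
    return sum((x > threshold) == bool(y) for x, y in records) / len(records)

t = fit_stump(train)
print("threshold:", round(t, 2))
print("training accuracy:  ", round(accuracy(train, t), 2))
print("validation accuracy:", round(accuracy(valid, t), 2))
```

The gap between training and validation accuracy is exactly the overfitting signal discussed above: the threshold is tuned to the training records, so only the validation figure is an honest estimate of performance on new data.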

Day 2

  • Model deployment. Practical exercise: use a developed model to make predictions on new data.

  • Model stability and degradation: the importance of rebuilding models and out-of-time testing.

  • Advanced classification topics: selecting a classification threshold using ROC curve charts.

  • Calculation of the area under the ROC curve as a classification performance measure.
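The area under the ROC curve can be computed directly from predicted scores via the rank-sum (Mann-Whitney) identity: it equals the probability that a randomly chosen positive case is scored above a randomly chosen negative case. A minimal plain-Python sketch on hypothetical labels and scores (the course covers this in Rattle/R):

```python
def roc_auc(labels, scores):
    """AUC via the Mann-Whitney identity: the fraction of
    (positive, negative) pairs ranked correctly, counting ties as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]  # one positive is out-ranked by a negative
print(roc_auc(labels, scores))  # 8 of 9 pairs ranked correctly -> 0.888...
```

A perfect ranking gives an AUC of 1.0 and random scoring gives 0.5; unlike plain accuracy, the measure does not depend on any particular classification threshold.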