How We Work

The CRISP-DM Methodology

To maximise the chance of a successful outcome for our clients’ predictive analytics projects, we work collaboratively with them in a well-defined and structured manner according to the six stages of the Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology. CRISP-DM was initially developed by members of NCR, SPSS and DaimlerChrysler to provide a standard approach to large-scale data mining projects and, according to kdnuggets.com, it remains the leading approach to data analytics. On behalf of our client, we project-manage the whole lifecycle, including production of the relevant documentation.

We give a brief overview of the approach below, noting that the degree of agility between phases and the standardisation within each stage are determined by the needs of the organisation and the requirements of the project. For a detailed treatment of the CRISP-DM methodology, please see Chapman et al. (2000).

Business Understanding

This initial phase focuses on understanding the project objectives and requirements from a business perspective, then translating these into a predictive analytics problem definition and a preliminary plan designed to achieve the objectives. This step is critical to ensure we understand WHY we are embarking on a predictive analytics project. To aid it, after meetings with the relevant stakeholders we apply methods such as balanced scorecards, root cause analysis, force field analysis and requirements modelling. By the end of this stage, a project plan has been produced and the team and working practices established.

Data Understanding

An initial collection of raw data is undertaken, resulting in a series of reports detailing information such as: data sources, the nature of the data (eg integer, float, structured/unstructured), correlations and data quality issues. Some initial analysis and data visualisation may take place to discover useful insights and relevant data subsets.
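As a minimal, illustrative sketch of this kind of profiling (the records and field names below are hypothetical, not client data), the following pure-Python snippet reports the inferred type and missing-value count for each field, and the correlation between two numeric attributes:

```python
import math

# Hypothetical raw records from an initial collection; None marks missing values.
records = [
    {"age": 34, "income": 52000.0, "segment": "A"},
    {"age": 41, "income": 61000.0, "segment": "B"},
    {"age": None, "income": 48000.0, "segment": "A"},
    {"age": 29, "income": None, "segment": "C"},
    {"age": 50, "income": 75000.0, "segment": "B"},
]

def profile(records):
    """Summarise each field: inferred type and count of missing values."""
    report = {}
    for f in records[0].keys():
        values = [r[f] for r in records]
        present = [v for v in values if v is not None]
        report[f] = {
            "type": type(present[0]).__name__,
            "missing": sum(v is None for v in values),
        }
    return report

def pearson(xs, ys):
    """Pearson correlation over pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    sx = math.sqrt(sum((x - mx) ** 2 for x, _ in pairs))
    sy = math.sqrt(sum((y - my) ** 2 for _, y in pairs))
    return cov / (sx * sy)

print(profile(records))
print(pearson([r["age"] for r in records], [r["income"] for r in records]))
```

In practice this kind of report would be produced with dedicated tooling, but the sketch shows the substance of what a data understanding report captures: field types, missingness and pairwise correlations.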

Data Preparation

The final datasets that will be used by the machine learning models are defined in this phase. Data preparation tasks are performed multiple times to maximise the likelihood of the models detecting important features and thus learning to classify, predict or structure information. The types of activity that occur at this stage include: definition of data subsets (eg insample and outsample), pattern class balancing, data cleansing (eg dealing with erroneous or missing values), attribute scaling, outlier and auto-correlation removal, dimensionality reduction, and converting continuous variables to discrete ones.
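A few of these steps can be sketched in plain Python (the function names are ours for illustration, not a library API): splitting rows into insample and outsample sets, min-max scaling an attribute, and mean-imputing missing values.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle and split rows into an insample (training) and outsample (test) set."""
    rows = rows[:]  # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

def min_max_scale(values):
    """Rescale a numeric attribute to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def impute_mean(values):
    """Replace missing (None) entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

insample, outsample = train_test_split(list(range(10)))
print(len(insample), len(outsample))          # 8 insample rows, 2 outsample rows
print(min_max_scale([0, 5, 10]))              # [0.0, 0.5, 1.0]
```

The fixed seed makes the split reproducible, which matters when the preparation phase is revisited several times.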

Modelling

A variety of suitable machine learning techniques are selected to perform the relevant task(s), eg pattern classification, numerical prediction, association or clustering. Depending on the task, this could take the form of a linear regression model for simple numerical prediction; recurrent neural networks for non-linear time-series prediction; logistic regression, random forests or convolutional networks for pattern classification; or k-means for clustering data into groups based on a measure of similarity. A variety of model-fitting methodologies will be used at this stage, such as cross-validation, bagging, boosting and combinations thereof. Models will be evaluated on one or more validation sets using an appropriate metric such as the root mean squared error, area under the curve, or percentage improvement over a benchmark. Some techniques require data to be in a certain form or to follow a particular distribution, so revisiting the data preparation stage may be required.
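To make the cross-validation idea concrete, here is a self-contained toy sketch that fits a single-feature least-squares model and scores it with k-fold cross-validation using root mean squared error (in practice a library such as scikit-learn would be used; this merely shows the mechanics):

```python
import math

def fit_linear(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

def rmse(predicted, actual):
    """Root mean squared error between predictions and actual values."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

def k_fold_cv(xs, ys, k=5):
    """Average validation RMSE over k folds (simple striped fold assignment)."""
    n = len(xs)
    scores = []
    for i in range(k):
        val_idx = set(range(i, n, k))
        tr_x = [x for j, x in enumerate(xs) if j not in val_idx]
        tr_y = [y for j, y in enumerate(ys) if j not in val_idx]
        va_x = [x for j, x in enumerate(xs) if j in val_idx]
        va_y = [y for j, y in enumerate(ys) if j in val_idx]
        slope, intercept = fit_linear(tr_x, tr_y)
        scores.append(rmse([slope * x + intercept for x in va_x], va_y))
    return sum(scores) / len(scores)
```

Each fold holds out a different slice of the data for validation, so the averaged score estimates how the model will perform on data it has not seen.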

Evaluation

The models that make it to this stage are of high quality, as indicated by their performance on the validation data. During evaluation, the models are examined in more detail, typically using an outsample of representative data previously unseen by the models. The behaviour of each model is reviewed with key stakeholders against the business objectives to determine whether all important issues have been addressed (eg does the model allow too many false negatives, which for a medical diagnostics system would be potentially intolerable?). Decision documentation is produced outlining the reasons for accepting or rejecting further use of the model.

Deployment

The value and insights generated by the model must be organised and presented in a manner appropriate for the client. The complexity of the deployment phase is determined by the nature of the project. For example, it can be as simple as a report or revised policies that enable staff to undertake their tasks more efficiently or make better-informed decisions. However, if the application is, say, recognising suspicious telecommunications activity or fraudulent customer behaviour, then appropriate software applications may have to be developed. Typically a deployment plan will be established, detailing how the model will be deployed and the impact on existing staff roles and organisational processes.