The CRISP-DM Methodology
Modeling
A variety of suitable machine learning techniques are selected to perform the relevant task(s), e.g. pattern classification, numerical prediction, association or clustering. Depending on the task, this could take the form of a linear regression model for simple numerical prediction; recurrent neural networks for non-linear time-series prediction; logistic regression, random forests or convolutional networks for pattern classification; or k-means clustering for grouping data based on a measure of similarity. A variety of model fitting methodologies will be used at this stage, such as cross-validation, bagging and boosting, and combinations thereof. Models will be evaluated on one or more validation sets using an appropriate metric, such as the root mean squared error, area under the curve (AUC), or percentage improvement over a benchmark. Some techniques require data to be in a certain form or to follow a particular distribution, so revisiting the data preparation stage may be required.
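As a rough illustration of the fit-and-evaluate loop described above, the following minimal sketch runs k-fold cross-validation on a simple linear regression model and reports the root mean squared error and the percentage improvement over a benchmark. Everything in it is an assumption made for illustration: the synthetic data, the function names (fit_line, rmse, k_fold_rmse), the choice of five folds, and the use of a predict-the-mean benchmark. It uses only the Python standard library; in practice a library such as scikit-learn would normally handle these steps.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def k_fold_rmse(xs, ys, k=5, seed=0):
    """Return the validation RMSE of the line model on each of k folds."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint validation sets
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        preds = [a + b * xs[i] for i in fold]
        scores.append(rmse([ys[i] for i in fold], preds))
    return scores


# Synthetic data (illustrative only): y = 2x + 1 plus Gaussian noise.
rng = random.Random(42)
xs = [i / 10 for i in range(100)]
ys = [2 * x + 1 + rng.gauss(0, 0.5) for x in xs]

scores = k_fold_rmse(xs, ys, k=5)
mean_score = sum(scores) / len(scores)

# Hypothetical benchmark: always predict the mean of the target.
baseline = rmse(ys, [sum(ys) / len(ys)] * len(ys))
improvement = 100 * (baseline - mean_score) / baseline

print(f"Cross-validated RMSE: {mean_score:.3f}")
print(f"Benchmark RMSE: {baseline:.3f}")
print(f"Improvement over benchmark: {improvement:.1f}%")
```

The same loop structure applies to the other model families mentioned above; only the fitting function and the metric change (for example, AUC for a classifier).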