The CRISP-DM Methodology
Data Preparation
The final datasets that will be used by the machine learning models are defined in this phase. Data preparation tasks are performed multiple times to maximise the likelihood that the models detect important features and thus learn to classify, predict or structure information. Activities at this stage include defining data subsets (e.g. in-sample and out-of-sample), balancing pattern classes, data cleansing (e.g. dealing with erroneous or missing values), attribute scaling, outlier and auto-correlation removal, dimensionality reduction, and converting continuous variables to discrete ones.
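A few of these activities can be illustrated with a minimal Python sketch. It is not part of CRISP-DM itself; the function names (split_in_out, fit_prep, apply_prep), the 75% split, mean imputation, min-max scaling and the three equal-width bins are all assumptions chosen for illustration. The one design point it tries to show is that imputation and scaling parameters are estimated on the in-sample data only and then reused on the out-of-sample data.

```python
from statistics import mean

def split_in_out(rows, in_fraction=0.75):
    """Ordered split: the first part is in-sample, the rest out-of-sample."""
    cut = int(len(rows) * in_fraction)
    return rows[:cut], rows[cut:]

def fit_prep(values):
    """Learn imputation and scaling parameters from in-sample values only."""
    present = [v for v in values if v is not None]
    return {"fill": mean(present), "lo": min(present), "hi": max(present)}

def apply_prep(values, params, n_bins=3):
    """Impute missing values, min-max scale, then discretise into equal-width bins."""
    span = (params["hi"] - params["lo"]) or 1.0  # avoid division by zero
    out = []
    for v in values:
        v = params["fill"] if v is None else v              # data cleansing
        scaled = (v - params["lo"]) / span                  # attribute scaling
        scaled = min(max(scaled, 0.0), 1.0)                 # clip to the in-sample range
        bin_index = min(int(scaled * n_bins), n_bins - 1)   # continuous -> discrete
        out.append((scaled, bin_index))
    return out

# Hypothetical attribute with two missing values (None).
raw = [2.0, 4.0, None, 6.0, 10.0, 8.0, None, 12.0]

in_sample, out_sample = split_in_out(raw)
params = fit_prep(in_sample)
prepared_in = apply_prep(in_sample, params)
prepared_out = apply_prep(out_sample, params)
```

Because the steps are repeated several times in practice, keeping the fitted parameters separate from the transformation makes it straightforward to rerun the preparation with different choices (for example another fill value or bin count) without letting out-of-sample information leak into the in-sample statistics.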