A Beginner’s Overview of CRISP-DM

CRISP-DM stands for CRoss-Industry Standard Process for Data Mining. The model divides a data science or data mining project into individual, clearly defined steps.

Photo by Jason Goodman on Unsplash

Each step focuses on a clearly defined task area. The individual steps are also called phases. CRISP-DM contains six process phases that can be run through several times. The six process phases are:

  • Business Understanding
  • Data Understanding
  • Data Preparation
  • Modeling
  • Evaluation
  • Deployment

CRISP-DM Process Model (see [1])

The process model illustrates the whole workflow of a data science project. The first step is to understand the specific use case and the data, with a focus on the concrete requirements and goals. Then you analyze, process and clean the data. Next, you use this data to develop a machine learning (ML) model. Finally, all stakeholders decide whether the solution can be operationalized.

CRISP-DM in detail

Business Understanding

At the beginning of a data science project, the focus is on understanding the use case. In addition, all stakeholders define the common goal and the acceptance criteria for the evaluation. Most people underestimate business understanding. Please don’t do that! All stakeholders must work towards the common goal. The result of this step is a concrete project plan.

Data Understanding

In this phase, you examine the quality of the available data. You can use methods from exploratory data analysis (EDA) to understand the data. The data understanding phase aims to check the feasibility of the project.
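
A first data quality check can be as simple as counting missing values and summarizing each numeric column. The following is a minimal sketch that uses only the Python standard library. The records and the column names (age, income) are made up for illustration, and in practice you would more likely use a data analysis library for this.

```python
from statistics import mean, median

# Hypothetical sample records; None marks a missing value.
records = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 48000},
    {"age": 29, "income": None},
    {"age": 41, "income": 61000},
]


def profile(rows, column):
    """Return basic quality statistics for one numeric column."""
    values = [row[column] for row in rows]
    present = [v for v in values if v is not None]
    return {
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": mean(present),
        "median": median(present),
    }


for column in ("age", "income"):
    print(column, profile(records, column))
```

If a column turns out to be mostly empty or implausible, that is an early signal that the project may not be feasible with the available data.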

Data Preparation

Data preparation creates the data set that is used in the subsequent phases. This phase consists of three tasks: merging the data, data cleaning and feature engineering. You have to make sure that you merge the data from the different data sources in a meaningful way, because this step is the basis for data cleaning and feature engineering. Data cleaning deals with missing data points, outliers and incorrect data points. In practice, it helps to consult the specialist department to find wrong data points. Then you can carry out feature engineering. Here, you construct additional features based on the insights from the business understanding phase. Some procedures automatically extract features from existing data. In Python, you can use the package feature-engine for feature engineering.
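
The three tasks (merging, cleaning and feature engineering) can be sketched in a few lines. This is a simplified illustration in plain Python, not a recommended implementation. The two data sources, the column names, the age limit of 120 and the derived feature revenue_per_order are all assumptions made for this example.

```python
# Hypothetical data from two sources, joined on "customer_id".
customers = [
    {"customer_id": 1, "age": 34},
    {"customer_id": 2, "age": None},   # missing value
    {"customer_id": 3, "age": 29},
    {"customer_id": 4, "age": 460},    # implausible value (likely a typo)
]
orders = [
    {"customer_id": 1, "n_orders": 4, "revenue": 200.0},
    {"customer_id": 2, "n_orders": 1, "revenue": 35.0},
    {"customer_id": 3, "n_orders": 5, "revenue": 150.0},
    {"customer_id": 4, "n_orders": 2, "revenue": 80.0},
]


def merge(left, right, key):
    """Inner join of two lists of dicts on a shared key."""
    lookup = {row[key]: row for row in right}
    return [{**row, **lookup[row[key]]} for row in left if row[key] in lookup]


def clean(rows, max_age=120):
    """Drop rows with a missing or implausible age (the limit is an assumption)."""
    return [r for r in rows if r["age"] is not None and 0 < r["age"] <= max_age]


def add_features(rows):
    """Feature engineering: derive the average revenue per order."""
    return [{**r, "revenue_per_order": r["revenue"] / r["n_orders"]} for r in rows]


dataset = add_features(clean(merge(customers, orders, "customer_id")))
for row in dataset:
    print(row)
```

Dropping rows is only one way to handle missing or wrong values. Whether you drop, correct or impute them is a decision to make together with the specialist department.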

Modeling

This step is the core of the CRISP-DM process model. In this phase, you use machine learning (ML) methods to develop an ML model based on the prepared historical data. Usually, you divide the data into training and test data. You train the ML model with the training data and evaluate it with the test data, which the model has not seen during training. The goal is to achieve the highest possible prediction quality on the test data.
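
The split into training and test data can be illustrated with a deliberately tiny example. The sketch below uses only the standard library and a very simple nearest-centroid classifier as a stand-in for a real ML model. The synthetic data, the 25% test share and the seed are assumptions chosen for illustration.

```python
import random

# Synthetic, clearly separable data: (feature value, class label).
data = [(float(x), 0) for x in range(1, 11)] + [(float(x), 1) for x in range(20, 30)]


def train_test_split(rows, test_share=0.25, seed=42):
    """Shuffle the rows reproducibly and split them into train and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_share)
    return shuffled[n_test:], shuffled[:n_test]


def fit(train_rows):
    """'Train' the model: compute the mean feature value of each class."""
    centroids = {}
    for label in {y for _, y in train_rows}:
        values = [x for x, y in train_rows if y == label]
        centroids[label] = sum(values) / len(values)
    return centroids


def predict(centroids, x):
    """Predict the class whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def accuracy(centroids, rows):
    """Share of rows for which the prediction matches the true label."""
    hits = sum(predict(centroids, x) == y for x, y in rows)
    return hits / len(rows)


train, test = train_test_split(data)
model = fit(train)
print(f"train size: {len(train)}, test size: {len(test)}")
print(f"test accuracy: {accuracy(model, test):.2f}")
```

The important point is that the model is fitted on the training data only and that the accuracy is measured on data it has never seen.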

Evaluation

In the evaluation phase, the data science team presents the results from the previous two steps to the stakeholders. In this context, the stakeholders analyze the overall process from data understanding to modeling. As a result, they have to decide whether the developed solution is ready to be operationalized or not. The operationalization takes place in the deployment phase.
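
One way to support this decision is to compare the measured results with the acceptance criteria from the business understanding phase. The following sketch shows the idea. The metric names, thresholds and measured values are hypothetical, and a real go/no-go decision will usually involve more than a numeric check.

```python
# Hypothetical acceptance criteria agreed on during business understanding.
acceptance_criteria = {"accuracy": 0.90, "recall": 0.80}

# Hypothetical metrics measured on the test data in the modeling phase.
measured_metrics = {"accuracy": 0.93, "recall": 0.76}


def check_criteria(criteria, metrics):
    """Return, per metric, whether the measured value reaches the minimum."""
    return {
        name: metrics.get(name, 0.0) >= minimum
        for name, minimum in criteria.items()
    }


results = check_criteria(acceptance_criteria, measured_metrics)
for name, passed in results.items():
    print(f"{name}: {'met' if passed else 'not met'}")
print("ready for deployment:", all(results.values()))
```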

Deployment

In the deployment phase, you integrate the developed ML solution into the existing infrastructure. It is important that you continuously monitor the ML models.
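
Monitoring can start with a simple check of whether the data in production still looks like the training data. The sketch below flags a shift in the mean of a single feature. The values and the threshold of three standard deviations are assumptions for illustration, and this heuristic is only one of many possible monitoring checks.

```python
from statistics import mean, stdev

# Hypothetical feature values seen during training and in production.
training_values = [48.0, 52.0, 50.0, 49.0, 51.0, 50.0]
live_values = [58.0, 61.0, 59.0, 60.0]


def drift_detected(reference, current, threshold=3.0):
    """Flag drift if the current mean is more than `threshold` reference
    standard deviations away from the reference mean (a simple heuristic)."""
    shift = abs(mean(current) - mean(reference))
    return shift > threshold * stdev(reference)


if drift_detected(training_values, live_values):
    print("Warning: input data has shifted; consider retraining the model.")
```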

Limitations of CRISP-DM

There are some gaps in the CRISP-DM model. The following points are not included in the CRISP-DM standard.

Data acquisition

The CRISP-DM standard assumes that the data is already available. In practice, however, the data is often not available or only excerpts are available.

No project management

The CRISP-DM model does not include project management. You need to use other methods for project management. Scrum is an established method in software development. You can combine the Scrum framework with CRISP-DM.

Conclusion

CRISP-DM is a structured approach that covers all phases of a data science project. Most companies use a data science process that is similar to the CRISP-DM standard. Such a process is necessary to carry out a data science project in a structured and efficient way.



Thanks so much for reading. Have a great day!

References

  • [1] Wirth, R. and Hipp, J., 2000, April. CRISP-DM: Towards a standard process model for data mining. In Proceedings of the 4th international conference on the practical applications of knowledge discovery and data mining (Vol. 1, pp. 29–39).
