How to Fully Automate Data Cleaning with Python in 5 Steps

Data cleaning can be quite tedious and boring. But it doesn't have to be. Here's how you can automate most of the data cleaning steps with Python.



Image by Author | Segmind SSD-1B Model

 

Data cleaning is often seen as a manual, time-consuming process that data scientists and analysts must trudge through before getting to the "real work" of analysis.

However, with Python libraries like pandas, we can automate many common cleaning tasks to create a reliable, reproducible pipeline. The key is to identify common data quality issues and implement systematic solutions that can handle various datasets consistently.

Let’s explore the steps to build such an automated data cleaning pipeline in this article.

▶️ Link to Google Colab notebook

First, import pandas:

import pandas as pd

 

Step 1: Run Basic Data Quality Checks

 
Before we begin any cleaning, we need to understand the quality of the data we're working with. So the first step involves assessing the current state of your data.

We need to identify:

  • Missing values in each column
  • Duplicate rows
  • Basic data characteristics

Let's start with some essential quality checks:

def check_data_quality(df):
    # Store initial data quality metrics
    quality_report = {
        'missing_values': df.isnull().sum().to_dict(),
        'duplicates': df.duplicated().sum(),
        'total_rows': len(df),
        'memory_usage': df.memory_usage().sum() / 1024**2  # in MB
    }
    return quality_report

 

This gives us a baseline understanding of our data's quality and helps identify the specific cleaning tasks we'll need to perform.
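
For instance, running the check on a small, made-up DataFrame (the column names and values here are purely illustrative) gives a quick snapshot of what needs fixing:

# A tiny, hypothetical DataFrame just to illustrate the report
sample = pd.DataFrame({
    'price': [10.5, None, 20.0, 20.0],
    'city': ['NYC', 'Boston', None, 'Boston']
})

print(check_data_quality(sample))
# Prints missing values per column, the duplicate count,
# the total row count, and memory usage in MB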

 

Step 2: Standardize Data Types

 
One of the most common issues in raw data is inconsistent data types. For example, dates might be stored as strings, or numeric values might include currency symbols and thousands separators.

So as the next step, we ensure all fields have the right/expected data types. This includes:

  • Converting string dates to datetime objects
  • Identifying and converting numeric strings to actual numbers
  • Ensuring categorical variables are properly encoded

Here's a function that attempts these conversions column by column:

def standardize_datatypes(df):
    for column in df.columns:
        # Only attempt conversions on string (object) columns
        if df[column].dtype == 'object':
            # Try converting string dates to datetime
            try:
                df[column] = pd.to_datetime(df[column])
                print(f"Converted {column} to datetime")
            except (ValueError, TypeError):
                # Try converting to numeric if datetime fails,
                # stripping currency symbols and thousands separators first
                try:
                    df[column] = pd.to_numeric(
                        df[column].str.replace('$', '', regex=False).str.replace(',', '', regex=False)
                    )
                    print(f"Converted {column} to numeric")
                except (ValueError, TypeError):
                    # Leave the column unchanged if neither conversion works
                    pass
    return df

 

This step prevents type-related errors in subsequent analysis.
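
As a quick sanity check, here's how the function might behave on a small, hypothetical DataFrame with string dates and currency-formatted prices:

# Hypothetical raw data with string dates and currency-formatted prices
sample = pd.DataFrame({
    'order_date': ['2024-01-05', '2024-02-10'],
    'price': ['$1,200', '$950']
})

sample = standardize_datatypes(sample)
print(sample.dtypes)
# order_date should now have a datetime dtype and price a numeric dtype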

 

Step 3: Handle Missing Values

 
Missing values can significantly impact our analysis. Rather than dropping data records with missing values, we can use imputation strategies:

  • Using median imputation for numeric columns
  • Applying mode imputation for categorical data
  • Maintaining the statistical properties of the dataset while filling gaps

Here’s how we can impute missing values in both numeric and categorical columns:

from sklearn.impute import SimpleImputer

def handle_missing_values(df):
    # Handle numeric columns
    numeric_columns = df.select_dtypes(include=['int64', 'float64']).columns
    if len(numeric_columns) > 0:
        num_imputer = SimpleImputer(strategy='median')
        df[numeric_columns] = num_imputer.fit_transform(df[numeric_columns])
    
    # Handle categorical columns
    categorical_columns = df.select_dtypes(include=['object']).columns
    if len(categorical_columns) > 0:
        cat_imputer = SimpleImputer(strategy='most_frequent')
        df[categorical_columns] = cat_imputer.fit_transform(df[categorical_columns])
    
    return df

 

Here we use Scikit-learn’s SimpleImputer class, but you can also use the imputation functions from pandas.
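
If you'd rather avoid the scikit-learn dependency, a rough pandas-only equivalent using fillna() might look like this. This is a minimal sketch of the same median/mode strategy (the function name is my own, and it assumes each categorical column has at least one non-missing value):

def handle_missing_values_pandas(df):
    # Fill numeric columns with each column's median
    numeric_columns = df.select_dtypes(include=['int64', 'float64']).columns
    df[numeric_columns] = df[numeric_columns].fillna(df[numeric_columns].median())

    # Fill categorical columns with each column's most frequent value (mode)
    categorical_columns = df.select_dtypes(include=['object']).columns
    for column in categorical_columns:
        if df[column].isnull().any():
            df[column] = df[column].fillna(df[column].mode()[0])

    return df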

 

Step 4: Detect and Handle Outliers

 
Outliers can skew our analysis, so we need to handle them carefully.

⚠️ You need domain knowledge to decide what actually counts as an outlier in your data.

Here's an approach using the Interquartile Range (IQR) method:

  • Calculate Interquartile Range (IQR) for numeric columns
  • Identify values beyond 1.5 * IQR from quartiles
  • Apply capping to extreme values rather than removing them

This preserves data while managing extreme values.

def remove_outliers(df):
    numeric_columns = df.select_dtypes(include=['int64', 'float64']).columns
    outliers_removed = {}
    
    for column in numeric_columns:
        Q1 = df[column].quantile(0.25)
        Q3 = df[column].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        
        # Count outliers before capping
        outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)].shape[0]
        
        # Cap the values instead of removing them
        df[column] = df[column].clip(lower=lower_bound, upper=upper_bound)
        
        if outliers > 0:
            outliers_removed[column] = outliers
            
    return df, outliers_removed
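
A quick run on a single numeric column with one extreme value (made up for illustration) shows the capping in action:

# One obvious extreme value in an otherwise tight range
sample = pd.DataFrame({'amount': [10, 12, 11, 13, 12, 500]})

sample, capped = remove_outliers(sample)
print(capped)                   # {'amount': 1}: one value exceeded the IQR bounds
print(sample['amount'].max())   # now equal to Q3 + 1.5 * IQR instead of 500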

 

You can also explore other common techniques for handling outliers and pick what suits your data.

 

Step 5: Validate the Results

 
After cleaning, we need to verify that our pipeline worked as expected:

  • Confirm no remaining missing values
  • Check for any remaining duplicates
  • Validate data integrity and consistency
  • Generate a comprehensive cleaning report

def validate_cleaning(df, original_shape, cleaning_report):
    validation_results = {
        'rows_remaining': len(df),
        'missing_values_remaining': df.isnull().sum().sum(),
        'duplicates_remaining': df.duplicated().sum(),
        'data_loss_percentage': (1 - len(df)/original_shape[0]) * 100
    }
    
    # Add validation results to the cleaning report
    cleaning_report['validation'] = validation_results
    return cleaning_report

 

Finally, let's put it all together in a complete pipeline:


def automated_cleaning_pipeline(df):
    # Store original shape for reporting
    original_shape = df.shape
    
    # Initialize cleaning report
    cleaning_report = {}
    
    # Execute each step and collect metrics
    cleaning_report['initial_quality'] = check_data_quality(df)
    
    df = standardize_datatypes(df)
    df = handle_missing_values(df)
    df, outliers = remove_outliers(df)
    cleaning_report['outliers_removed'] = outliers
    
    # Validate and finalize report
    cleaning_report = validate_cleaning(df, original_shape, cleaning_report)
    
    return df, cleaning_report

 

You can now test your pipeline on simple sample data.
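
For example, you could run it on a small, messy DataFrame like the hypothetical one below (column names and values are made up for illustration):

# A deliberately messy, hypothetical dataset
raw = pd.DataFrame({
    'order_date': ['2024-01-05', '2024-02-10', '2024-02-28', '2024-03-01'],
    'price': ['$1,200', '$950', '$30,000', None],
    'category': ['A', None, 'B', 'A']
})

cleaned, report = automated_cleaning_pipeline(raw)

print(cleaned.dtypes)               # order_date -> datetime, price -> numeric
print(report['outliers_removed'])   # columns where extreme values were capped
print(report['validation'])         # remaining missing values, duplicates, data loss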

 

Wrapping Up

 
As you can see, automating data cleaning not only saves time but also ensures consistency and reproducibility in your data preparation process. The pipeline I've shared handles common data quality issues while providing detailed reporting on the changes made.

You might need to adjust the cleaning strategies based on your domain knowledge and specific requirements.

Remember that while automation is helpful, you should always review the cleaning report and validate the results for your specific use case. Happy data cleaning!
 
 

Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she's working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.