How to Fully Automate Data Cleaning with Python in 5 Steps

Data cleaning can be quite tedious and boring. But it doesn't have to be. Here's how you can automate most of the data cleaning steps with Python.



Image by Author | Segmind SSD-1B Model

 

Data cleaning is often seen as a manual, time-consuming process that data scientists and analysts must trudge through before getting to the "real work" of analysis.

However, with Python libraries like pandas, we can automate many common cleaning tasks to create a reliable, reproducible pipeline. The key is to identify common data quality issues and implement systematic solutions that can handle various datasets consistently.

Let’s explore the steps to build such an automated data cleaning pipeline in this article.

▶️ Link to Google Colab notebook

First, import pandas:

import pandas as pd

 

Step 1: Run Basic Data Quality Checks

 
Before we begin any cleaning, we need to understand the quality of the data we're working with. So the first step involves assessing the current state of your data.

We need to identify:

  • Missing values in each column
  • Duplicate rows
  • Basic data characteristics

Let's start with some essential quality checks:

def check_data_quality(df):
    # Store initial data quality metrics
    quality_report = {
        'missing_values': df.isnull().sum().to_dict(),
        'duplicates': df.duplicated().sum(),
        'total_rows': len(df),
        'memory_usage': df.memory_usage().sum() / 1024**2  # in MB
    }
    return quality_report

 

This gives us a baseline understanding of our data's quality and helps identify the specific cleaning tasks we'll need to perform.
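
For instance, running the check on a small, made-up DataFrame (the column names and values here are purely illustrative) gives a quick snapshot of what needs fixing:

# A tiny, hypothetical DataFrame just to illustrate the report
sample = pd.DataFrame({
    'price': [10.5, None, 20.0, 20.0],
    'city': ['NYC', 'Boston', None, 'Boston']
})

print(check_data_quality(sample))
# Prints missing values per column, the duplicate count,
# the total row count, and memory usage in MB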

 

Step 2: Standardize Data Types

 
One of the most common issues in raw data is inconsistent data types. For example, dates might be stored as strings, or numeric values might include currency symbols and thousands separators.

So as the next step, we ensure all fields have the right/expected data types. This includes:

  • Converting string dates to datetime objects
  • Identifying and converting numeric strings to actual numbers
  • Ensuring categorical variables are properly encoded

Here's a function that attempts these conversions column by column:

def standardize_datatypes(df):
    for column in df.columns:
        # Only attempt conversions on string (object) columns
        if df[column].dtype == 'object':
            # Try converting string dates to datetime
            try:
                df[column] = pd.to_datetime(df[column])
                print(f"Converted {column} to datetime")
            except (ValueError, TypeError):
                # Try converting to numeric if datetime fails,
                # stripping currency symbols and thousands separators first
                try:
                    df[column] = pd.to_numeric(
                        df[column].str.replace('$', '', regex=False).str.replace(',', '', regex=False)
                    )
                    print(f"Converted {column} to numeric")
                except (ValueError, TypeError):
                    # Leave the column unchanged if neither conversion works
                    pass
    return df

 

This step prevents type-related errors in subsequent analysis.
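
As a quick sanity check, here's how the function might behave on a small, hypothetical DataFrame with string dates and currency-formatted prices:

# Hypothetical raw data with string dates and currency-formatted prices
sample = pd.DataFrame({
    'order_date': ['2024-01-05', '2024-02-10'],
    'price': ['$1,200', '$950']
})

sample = standardize_datatypes(sample)
print(sample.dtypes)
# order_date should now have a datetime dtype and price a numeric dtype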

 

Step 3: Handle Missing Values

 
Missing values can significantly impact our analysis. Rather than dropping data records with missing values, we can use imputation strategies:

  • Using median imputation for numeric columns
  • Applying mode imputation for categorical data
  • Maintaining the statistical properties of the dataset while filling gaps

Here’s how we can impute missing values in both numeric and categorical columns:

from sklearn.impute import SimpleImputer

def handle_missing_values(df):
    # Handle numeric columns
    numeric_columns = df.select_dtypes(include=['int64', 'float64']).columns
    if len(numeric_columns) > 0:
        num_imputer = SimpleImputer(strategy='median')
        df[numeric_columns] = num_imputer.fit_transform(df[numeric_columns])
    
    # Handle categorical columns
    categorical_columns = df.select_dtypes(include=['object']).columns
    if len(categorical_columns) > 0:
        cat_imputer = SimpleImputer(strategy='most_frequent')
        df[categorical_columns] = cat_imputer.fit_transform(df[categorical_columns])
    
    return df

 

Here we use Scikit-learn’s SimpleImputer class, but you can also use the imputation functions from pandas.
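
If you'd rather avoid the scikit-learn dependency, a rough pandas-only equivalent using fillna() might look like this. This is a minimal sketch of the same median/mode strategy (the function name is my own, and it assumes each categorical column has at least one non-missing value):

def handle_missing_values_pandas(df):
    # Fill numeric columns with each column's median
    numeric_columns = df.select_dtypes(include=['int64', 'float64']).columns
    df[numeric_columns] = df[numeric_columns].fillna(df[numeric_columns].median())

    # Fill categorical columns with each column's most frequent value (mode)
    categorical_columns = df.select_dtypes(include=['object']).columns
    for column in categorical_columns:
        if df[column].isnull().any():
            df[column] = df[column].fillna(df[column].mode()[0])

    return df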

 

Step 4: Detect and Handle Outliers

 
Outliers can skew our analysis, so we need to handle them carefully.

⚠️ You need domain knowledge to decide what actually counts as an outlier in your data.

Here's an approach using the Interquartile Range (IQR) method:

  • Calculate Interquartile Range (IQR) for numeric columns
  • Identify values beyond 1.5 * IQR from quartiles
  • Apply capping to extreme values rather than removing them

This preserves data while managing extreme values.

def remove_outliers(df):
    numeric_columns = df.select_dtypes(include=['int64', 'float64']).columns
    outliers_removed = {}
    
    for column in numeric_columns:
        Q1 = df[column].quantile(0.25)
        Q3 = df[column].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        
        # Count outliers before capping
        outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)].shape[0]
        
        # Cap the values instead of removing them
        df[column] = df[column].clip(lower=lower_bound, upper=upper_bound)
        
        if outliers > 0:
            outliers_removed[column] = outliers
            
    return df, outliers_removed
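
A quick run on a single numeric column with one extreme value (made up for illustration) shows the capping in action:

# One obvious extreme value in an otherwise tight range
sample = pd.DataFrame({'amount': [10, 12, 11, 13, 12, 500]})

sample, capped = remove_outliers(sample)
print(capped)                   # {'amount': 1}: one value exceeded the IQR bounds
print(sample['amount'].max())   # now equal to Q3 + 1.5 * IQR instead of 500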

 

You can also explore other common techniques for handling outliers and pick what suits your data.

 

Step 5: Validate the Results

 
After cleaning, we need to verify that our pipeline worked as expected:

  • Confirm no remaining missing values
  • Check for any remaining duplicates
  • Validate data integrity and consistency
  • Generate a comprehensive cleaning report

def validate_cleaning(df, original_shape, cleaning_report):
    validation_results = {
        'rows_remaining': len(df),
        'missing_values_remaining': df.isnull().sum().sum(),
        'duplicates_remaining': df.duplicated().sum(),
        'data_loss_percentage': (1 - len(df)/original_shape[0]) * 100
    }
    
    # Add validation results to the cleaning report
    cleaning_report['validation'] = validation_results
    return cleaning_report

 

Finally, let's put it all together in a complete pipeline:


def automated_cleaning_pipeline(df):
    # Store original shape for reporting
    original_shape = df.shape
    
    # Initialize cleaning report
    cleaning_report = {}
    
    # Execute each step and collect metrics
    cleaning_report['initial_quality'] = check_data_quality(df)
    
    df = standardize_datatypes(df)
    df = handle_missing_values(df)
    df, outliers = remove_outliers(df)
    cleaning_report['outliers_removed'] = outliers
    
    # Validate and finalize report
    cleaning_report = validate_cleaning(df, original_shape, cleaning_report)
    
    return df, cleaning_report

 

You can now test your pipeline on simple sample data.
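
For example, you could run it on a small, messy DataFrame like the hypothetical one below (column names and values are made up for illustration):

# A deliberately messy, hypothetical dataset
raw = pd.DataFrame({
    'order_date': ['2024-01-05', '2024-02-10', '2024-02-28', '2024-03-01'],
    'price': ['$1,200', '$950', '$30,000', None],
    'category': ['A', None, 'B', 'A']
})

cleaned, report = automated_cleaning_pipeline(raw)

print(cleaned.dtypes)               # order_date -> datetime, price -> numeric
print(report['outliers_removed'])   # columns where extreme values were capped
print(report['validation'])         # remaining missing values, duplicates, data loss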

 

Wrapping Up

 
As you can see, automating data cleaning not only saves time but also ensures consistency and reproducibility in your data preparation process. The pipeline I've shared handles common data quality issues while providing detailed reporting on the changes made.

You might need to adjust the cleaning strategies based on your domain knowledge and specific requirements.

Remember that while automation is helpful, you should always review the cleaning report and validate the results for your specific use case. Happy data cleaning!
 
 

Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she's working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.