🧹 10 Essential Data Cleansing Techniques Every Data Professional Should Know
As a Microsoft Fabric Data Engineer, I know that clean data is the foundation of reliable analytics. Bad data leads to poor decisions, wasted time, and flawed AI models.
Here are 10 must-know data cleansing techniques, with simple examples you can apply in Fabric Lakehouse, Spark, and Power BI. As a data engineer, clean data is your most critical deliverable: it powers accurate analytics, reliable AI models, and trustworthy business decisions.
Without proper data cleansing, even the most advanced pipelines and warehouses become "garbage in, garbage out" systems that erode organizational trust.
1. Standardization
Ensuring data follows a consistent format.
Before standardization, a single column may hold the same date in several formats: "2023-05-15", "05/15/2023", "15-May-2023", "20230515". The same problem appears in categorical values such as country names.
Example (Power Query in Fabric): standardize country names
= Table.ReplaceValue(
    Source,
    each [Country],
    each if Text.Contains([Country], "USA") then "US" else [Country],
    Replacer.ReplaceValue,
    {"Country"}
)
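To bring the mixed date formats above into a single format, here is a minimal pandas sketch (the DateRaw and OrderDate column names are hypothetical):
import pandas as pd
df = pd.DataFrame({'DateRaw': ['2023-05-15', '05/15/2023', '15-May-2023', '20230515']})
# Parse each known format explicitly, then coalesce the results into one column
formats = ['%Y-%m-%d', '%m/%d/%Y', '%d-%b-%Y', '%Y%m%d']
parsed = [pd.to_datetime(df['DateRaw'], format=f, errors='coerce') for f in formats]
df['OrderDate'] = parsed[0]
for p in parsed[1:]:
    df['OrderDate'] = df['OrderDate'].fillna(p)
Listing the accepted formats explicitly avoids silent misparsing of ambiguous dates like 03/04/2023.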
2. Normalization
Transforming data to a common scale without distorting differences in values; most often applied to numerical features used in machine learning models.
- Scaling numerical data to a standard range (e.g., 0 to 1, -1 to 1, or z-scores).
## Why Normalization Matters
✅ Improves Machine Learning Performance (algorithms like SVM, KNN, and neural networks work better with normalized data).
✅ Enables Fair Comparisons (when features are on different scales, e.g., income vs. age).
✅ Reduces Bias (prevents features with large ranges from dominating models).
✅ Speeds Up Training (optimization algorithms converge faster).
## Common Normalization Techniques
### 1. Min-Max Scaling
- Rescales values to a fixed range, typically 0 to 1, using (x - min) / (max - min).
- Use case: Image pixel values (0-255 → 0-1).
### 2. Z-Score Standardization
- Transforms data to have mean = 0 and standard deviation = 1.
- Use case: When data follows a Gaussian distribution.
### 3. Decimal Scaling
- Divides values by a power of 10 so the largest absolute value is below 1 (e.g., 1500 → 0.15 when the maximum is 2000).
### 4. Log Transformation
- Compresses large ranges (e.g., for skewed data like income).
### 5. Unit Vector Scaling (L2 Normalization)
- Scales data to have a magnitude of 1 (useful for text/data with varying lengths).
## Example in Python (Pandas & Scikit-Learn)
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler
# Sample data
data = {'Age': [25, 47, 32, 61], 'Income': [40000, 80000, 55000, 120000]}
df = pd.DataFrame(data)
# Min-Max Scaling (0 to 1)
scaler = MinMaxScaler()
df_minmax = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
# Z-Score Standardization (mean=0, std=1)
standardizer = StandardScaler()
df_zscore = pd.DataFrame(standardizer.fit_transform(df), columns=df.columns)
print("Min-Max Normalized:\n", df_minmax)
print("\nZ-Score Standardized:\n", df_zscore)
Example (Fabric Notebook - Scikit-Learn, scaling columns in place):
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[['Age', 'Income']] = scaler.fit_transform(df[['Age', 'Income']])
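For the remaining techniques, a brief sketch of a log transform and L2 (unit vector) normalization, reusing the Age/Income DataFrame from the example above:
import numpy as np
from sklearn.preprocessing import normalize
# Log transformation: compress the skewed Income range (log1p also handles zeros safely)
df['IncomeLog'] = np.log1p(df['Income'])
# L2 normalization: scale each row to unit length (magnitude of 1)
df_l2 = pd.DataFrame(normalize(df[['Age', 'Income']], norm='l2'), columns=['Age', 'Income'])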
3. Deduplication (De-duplication)
Removing duplicate records from a dataset.
-- Keep only the first occurrence
WITH Deduped AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY OrderDate) AS RowNum
FROM Sales
)
SELECT * FROM Deduped WHERE RowNum = 1;
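The same logic in a Fabric notebook with PySpark; a sketch assuming a Spark DataFrame df with CustomerID and OrderDate columns:
from pyspark.sql import Window
from pyspark.sql.functions import row_number, col
# Keep the earliest record per CustomerID, then drop the helper column
w = Window.partitionBy('CustomerID').orderBy(col('OrderDate'))
df_deduped = (df.withColumn('RowNum', row_number().over(w))
                .filter(col('RowNum') == 1)
                .drop('RowNum'))
# For rows that are exact duplicates across all columns, df.dropDuplicates() alone is enough.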
4. Cross-Field Validation
Problem: Logical inconsistencies (e.g., "Age=10" but "EmploymentStatus=Employed").
Example (Fabric SQL Rule):
-- Flag invalid records
SELECT *, CASE WHEN Age < 18 AND IsEmployed = 1 THEN 'Invalid: Underage Employment'
ELSE 'Valid' END AS ValidationStatus
FROM Users;
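The same rule in a Fabric notebook with PySpark (a sketch, assuming df has the Age and IsEmployed columns used in the SQL above):
from pyspark.sql.functions import when, col
# Flag logically inconsistent records instead of silently dropping them
df_flagged = df.withColumn(
    'ValidationStatus',
    when((col('Age') < 18) & (col('IsEmployed') == 1), 'Invalid: Underage Employment')
    .otherwise('Valid')
)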
5. Validation / Text Cleaning (Noise Removal)
Checking that data meets defined rules or constraints, and stripping noise such as punctuation or stray characters from text fields.
Example (Fabric Notebook - Regex):
df['CleanText'] = df['RawText'].str.replace(r'[^\w\s]', '', regex=True)  # Remove punctuation
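For large tables, a Spark equivalent (assuming the same RawText column) uses regexp_replace:
from pyspark.sql.functions import regexp_replace, trim, lower
# Strip punctuation, collapse repeated whitespace, and lowercase for consistency
df_clean = (df.withColumn('CleanText', regexp_replace('RawText', r'[^\w\s]', ''))
              .withColumn('CleanText', trim(regexp_replace('CleanText', r'\s+', ' ')))
              .withColumn('CleanText', lower('CleanText')))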
6. Imputation / Filling Missing Values
Replacing missing data using a strategy (mean, median, default value, etc.).
Problem: Nulls, empty strings, or placeholders like "N/A" break calculations.
Solution: fill gaps with the mean, the median, or a sensible default value.
Example (PySpark in Fabric):
from pyspark.sql.functions import mean, col
# Fill missing 'Salary' with the mean
mean_salary = df.select(mean(col('Salary'))).collect()[0][0]
df_clean = df.na.fill({'Salary': mean_salary})
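A variant that fills with the median (more robust than the mean when Salary is skewed) plus a default for a categorical column; the Department column here is hypothetical:
# approxQuantile returns a list; 0.5 is the median, 0.01 is the relative error
median_salary = df.approxQuantile('Salary', [0.5], 0.01)[0]
df_clean = df.na.fill({'Salary': median_salary, 'Department': 'Unknown'})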
7. Parsing & Splitting Columns
Breaking data into parts or converting it into a more usable structure.
Example (Fabric Warehouse SQL):
-- Split 'FullName' into 'FirstName' and 'LastName'
SELECT
SUBSTRING(FullName, 1, CHARINDEX(' ', FullName) - 1) AS FirstName,
SUBSTRING(FullName, CHARINDEX(' ', FullName) + 1, LEN(FullName)) AS LastName
FROM Customers;
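The same split in a Fabric notebook with PySpark; a sketch assuming FullName contains a single space between first and last name:
from pyspark.sql.functions import split, col
# limit=2 keeps multi-word last names together in the second element
parts = split(col('FullName'), ' ', 2)
df_split = (df.withColumn('FirstName', parts.getItem(0))
              .withColumn('LastName', parts.getItem(1)))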
8. Filtering / Outlier Removal
Identifying and removing data that falls outside of expected ranges.
Example: removing negative ages, or salaries over $10M, from a retail dataset.
Example (Spark SQL):
-- Remove outliers using the IQR rule
WITH Stats AS (
    SELECT
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY Revenue) AS Q1,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY Revenue) AS Q3
    FROM Sales
)
SELECT s.*
FROM Sales AS s
CROSS JOIN Stats
WHERE s.Revenue BETWEEN Stats.Q1 - 1.5 * (Stats.Q3 - Stats.Q1)
                    AND Stats.Q3 + 1.5 * (Stats.Q3 - Stats.Q1);
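The same IQR rule in PySpark, using approxQuantile on the same hypothetical Sales data:
# Compute Q1/Q3, then keep rows within 1.5 * IQR of the quartiles
q1, q3 = df.approxQuantile('Revenue', [0.25, 0.75], 0.01)
iqr = q3 - q1
df_no_outliers = df.filter(
    (df.Revenue >= q1 - 1.5 * iqr) & (df.Revenue <= q3 + 1.5 * iqr)
)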
9. Data Transformation / Fixing Data Types
Converting data into a new structure or format.
Example (Power Query - Type Conversion):
= Table.TransformColumnTypes(
Source, {{"OrderID", Int64.Type}, {"OrderDate", type date}} )
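The equivalent type fix in a PySpark notebook, casting the same two columns (a sketch):
from pyspark.sql.functions import col
df_typed = (df.withColumn('OrderID', col('OrderID').cast('bigint'))
              .withColumn('OrderDate', col('OrderDate').cast('date')))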
10. Handling Inconsistent Categorical Data / Data Mapping
Aligning data from one format or schema to another.
Example (Fabric Spark):
from pyspark.sql.functions import when
df = df.withColumn(
    "Gender",
    when(df.Gender.isin(["M", "Male"]), "Male")
    .when(df.Gender.isin(["F", "Female"]), "Female")
    .otherwise("Other")
)
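The same mapping in pandas, using a lookup dictionary, works well for smaller tables (the values are the same hypothetical ones as above):
# Map known variants to canonical labels; anything unmapped falls back to 'Other'
gender_map = {'M': 'Male', 'Male': 'Male', 'F': 'Female', 'Female': 'Female'}
df['Gender'] = df['Gender'].map(gender_map).fillna('Other')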