🧹 10 Essential Data Cleansing Techniques Every Data Professional Should Know
As a Microsoft Fabric Data Engineer, I know that clean data is the foundation of reliable analytics. Bad data leads to poor decisions, wasted time, and flawed AI models.
Here are 10 must-know data cleansing techniques, with simple examples you can apply in Fabric Lakehouse, Spark, and Power BI. As a data engineer, clean data is your most critical deliverable: it powers accurate analytics, reliable AI models, and trustworthy business decisions.
Without proper data cleansing, even the most advanced pipelines and warehouses become "garbage in, garbage out" systems that erode organizational trust.
1. Standardization
Ensuring data follows a consistent format.
Before standardization, a single column may hold the same date in several formats: "2023-05-15", "05/15/2023", "15-May-2023", "20230515". The same problem appears in categorical values such as country names.
Example (Power Query in Fabric): standardize country names
= Table.ReplaceValue(
    Source,
    each [Country],
    each if Text.Contains([Country], "USA") then "US" else [Country],
    Replacer.ReplaceValue,
    {"Country"}
)
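To bring the mixed date formats above into a single format, here is a minimal pandas sketch (the DateRaw and OrderDate column names are hypothetical):
import pandas as pd
df = pd.DataFrame({'DateRaw': ['2023-05-15', '05/15/2023', '15-May-2023', '20230515']})
# Parse each known format explicitly, then coalesce the results into one column
formats = ['%Y-%m-%d', '%m/%d/%Y', '%d-%b-%Y', '%Y%m%d']
parsed = [pd.to_datetime(df['DateRaw'], format=f, errors='coerce') for f in formats]
df['OrderDate'] = parsed[0]
for p in parsed[1:]:
    df['OrderDate'] = df['OrderDate'].fillna(p)
Listing the accepted formats explicitly avoids silent misparsing of ambiguous dates like 03/04/2023.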
2. Normalization
Transforming data to a common scale without distorting differences in values; most often applied to numerical features used in machine learning models.
- Scaling numerical data to a standard range (e.g., 0 to 1, -1 to 1, or z-scores).
## Why Normalization Matters
✅ Improves Machine Learning Performance (algorithms like SVM, KNN, and neural networks work better with normalized data).
✅ Enables Fair Comparisons (when features are on different scales, e.g., income vs. age).
✅ Reduces Bias (prevents features with large ranges from dominating models).
✅ Speeds Up Training (optimization algorithms converge faster).
## Common Normalization Techniques
### 1. Min-Max Scaling
- Rescales values to a fixed range, typically 0 to 1, using (x - min) / (max - min).
- Use case: Image pixel values (0-255 → 0-1).
### 2. Z-Score Standardization
- Transforms data to have mean = 0 and standard deviation = 1.
- Use case: When data follows a Gaussian distribution.
### 3. Decimal Scaling
- Divides values by a power of 10 so the largest absolute value is below 1 (e.g., 1500 → 0.15 when the maximum is 2000).
### 4. Log Transformation
- Compresses large ranges (e.g., for skewed data like income).
### 5. Unit Vector Scaling (L2 Normalization)
- Scales data to have a magnitude of 1 (useful for text/data with varying lengths).
## Example in Python (Pandas & Scikit-Learn)
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler
# Sample data
data = {'Age': [25, 47, 32, 61], 'Income': [40000, 80000, 55000, 120000]}
df = pd.DataFrame(data)
# Min-Max Scaling (0 to 1)
scaler = MinMaxScaler()
df_minmax = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
# Z-Score Standardization (mean=0, std=1)
standardizer = StandardScaler()
df_zscore = pd.DataFrame(standardizer.fit_transform(df), columns=df.columns)
print("Min-Max Normalized:\n", df_minmax)
print("\nZ-Score Standardized:\n", df_zscore)
Example (Fabric Notebook - Scikit-Learn, scaling columns in place):
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[['Age', 'Income']] = scaler.fit_transform(df[['Age', 'Income']])
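For the remaining techniques, a brief sketch of a log transform and L2 (unit vector) normalization, reusing the Age/Income DataFrame from the example above:
import numpy as np
from sklearn.preprocessing import normalize
# Log transformation: compress the skewed Income range (log1p also handles zeros safely)
df['IncomeLog'] = np.log1p(df['Income'])
# L2 normalization: scale each row to unit length (magnitude of 1)
df_l2 = pd.DataFrame(normalize(df[['Age', 'Income']], norm='l2'), columns=['Age', 'Income'])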
3. Deduplication (De-duplication)
Removing duplicate records from a dataset.
-- Keep only the first occurrence
WITH Deduped AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY OrderDate) AS RowNum
FROM Sales
)
SELECT * FROM Deduped WHERE RowNum = 1;
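The same logic in a Fabric notebook with PySpark; a sketch assuming a Spark DataFrame df with CustomerID and OrderDate columns:
from pyspark.sql import Window
from pyspark.sql.functions import row_number, col
# Keep the earliest record per CustomerID, then drop the helper column
w = Window.partitionBy('CustomerID').orderBy(col('OrderDate'))
df_deduped = (df.withColumn('RowNum', row_number().over(w))
                .filter(col('RowNum') == 1)
                .drop('RowNum'))
# For rows that are exact duplicates across all columns, df.dropDuplicates() alone is enough.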
4. Cross-Field Validation
Problem: Logical inconsistencies (e.g., "Age=10" but "EmploymentStatus=Employed").
Example (Fabric SQL Rule):
-- Flag invalid records
SELECT *, CASE WHEN Age < 18 AND IsEmployed = 1 THEN 'Invalid: Underage Employment'
ELSE 'Valid' END AS ValidationStatus
FROM Users;
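The same rule in a Fabric notebook with PySpark (a sketch, assuming df has the Age and IsEmployed columns used in the SQL above):
from pyspark.sql.functions import when, col
# Flag logically inconsistent records instead of silently dropping them
df_flagged = df.withColumn(
    'ValidationStatus',
    when((col('Age') < 18) & (col('IsEmployed') == 1), 'Invalid: Underage Employment')
    .otherwise('Valid')
)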
5. Validation / Text Cleaning (Noise Removal)
Checking that data meets defined rules or constraints, and stripping noise such as punctuation or stray characters from text fields.
Example (Fabric Notebook - Regex):
df['CleanText'] = df['RawText'].str.replace(r'[^\w\s]', '', regex=True)  # Remove punctuation
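For large tables, a Spark equivalent (assuming the same RawText column) uses regexp_replace:
from pyspark.sql.functions import regexp_replace, trim, lower
# Strip punctuation, collapse repeated whitespace, and lowercase for consistency
df_clean = (df.withColumn('CleanText', regexp_replace('RawText', r'[^\w\s]', ''))
              .withColumn('CleanText', trim(regexp_replace('CleanText', r'\s+', ' ')))
              .withColumn('CleanText', lower('CleanText')))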
6. Imputation / Filling Missing Values
Replacing missing data using a strategy (mean, median, default value, etc.).
Problem: Nulls, empty strings, or placeholders like "N/A" break calculations.
Solution: fill gaps with the mean, the median, or a sensible default value.
Example (PySpark in Fabric):
from pyspark.sql.functions import mean, col
# Fill missing 'Salary' with the mean
mean_salary = df.select(mean(col('Salary'))).collect()[0][0]
df_clean = df.na.fill({'Salary': mean_salary})
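A variant that fills with the median (more robust than the mean when Salary is skewed) plus a default for a categorical column; the Department column here is hypothetical:
# approxQuantile returns a list; 0.5 is the median, 0.01 is the relative error
median_salary = df.approxQuantile('Salary', [0.5], 0.01)[0]
df_clean = df.na.fill({'Salary': median_salary, 'Department': 'Unknown'})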
7. Parsing & Splitting Columns
Breaking data into parts or converting it into a more usable structure.
Example (Fabric Warehouse SQL):
-- Split 'FullName' into 'FirstName' and 'LastName'
SELECT
SUBSTRING(FullName, 1, CHARINDEX(' ', FullName) - 1) AS FirstName,
SUBSTRING(FullName, CHARINDEX(' ', FullName) + 1, LEN(FullName)) AS LastName
FROM Customers;
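The same split in a Fabric notebook with PySpark; a sketch assuming FullName contains a single space between first and last name:
from pyspark.sql.functions import split, col
# limit=2 keeps multi-word last names together in the second element
parts = split(col('FullName'), ' ', 2)
df_split = (df.withColumn('FirstName', parts.getItem(0))
              .withColumn('LastName', parts.getItem(1)))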
8. Filtering / Outlier Removal
Identifying and removing data that falls outside of expected ranges.
Example: removing negative ages, or salaries over $10M, from a retail dataset.
Example (Spark SQL):
-- Remove outliers using the IQR rule
WITH Stats AS (
    SELECT
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY Revenue) AS Q1,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY Revenue) AS Q3
    FROM Sales
)
SELECT s.*
FROM Sales AS s
CROSS JOIN Stats
WHERE s.Revenue BETWEEN Stats.Q1 - 1.5 * (Stats.Q3 - Stats.Q1)
                    AND Stats.Q3 + 1.5 * (Stats.Q3 - Stats.Q1);
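The same IQR rule in PySpark, using approxQuantile on the same hypothetical Sales data:
# Compute Q1/Q3, then keep rows within 1.5 * IQR of the quartiles
q1, q3 = df.approxQuantile('Revenue', [0.25, 0.75], 0.01)
iqr = q3 - q1
df_no_outliers = df.filter(
    (df.Revenue >= q1 - 1.5 * iqr) & (df.Revenue <= q3 + 1.5 * iqr)
)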
9. Data Transformation / Fixing Data Types
Converting data into a new structure or format.
Example (Power Query - Type Conversion):
= Table.TransformColumnTypes(
Source, {{"OrderID", Int64.Type}, {"OrderDate", type date}} )
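The equivalent type fix in a PySpark notebook, casting the same two columns (a sketch):
from pyspark.sql.functions import col
df_typed = (df.withColumn('OrderID', col('OrderID').cast('bigint'))
              .withColumn('OrderDate', col('OrderDate').cast('date')))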
10. Handling Inconsistent Categorical Data / Data Mapping
Aligning data from one format or schema to another.
Example (Fabric Spark):
from pyspark.sql.functions import when
df = df.withColumn(
    "Gender",
    when(df.Gender.isin(["M", "Male"]), "Male")
    .when(df.Gender.isin(["F", "Female"]), "Female")
    .otherwise("Other")
)
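The same mapping in pandas, using a lookup dictionary, works well for smaller tables (the values are the same hypothetical ones as above):
# Map known variants to canonical labels; anything unmapped falls back to 'Other'
gender_map = {'M': 'Male', 'Male': 'Male', 'F': 'Female', 'Female': 'Female'}
df['Gender'] = df['Gender'].map(gender_map).fillna('Other')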