Duplicate files are a common nuisance for computer users, accumulating over time and consuming valuable storage space. Whether it's redundant photos, documents, or media, these duplicates can slow down your system and make file organization a nightmare. While manual deletion is an option for a few files, programmatically finding and deleting duplicates offers a much more efficient solution, especially for large datasets.
This article will guide you through various programmatic methods to identify and remove duplicate files on both Windows and Mac, concluding with a look at the Aryson Duplicate File Finder & Remover for a more user-friendly approach.
Understanding Duplicate Files
Before diving into the programmatic solutions, it's essential to understand how systems identify duplicate files. The most reliable method is to compare file content rather than just file names or sizes: files with different names can have identical content, and files with the same name might be entirely different versions.
This comparison is typically achieved by calculating a checksum or hash for each file. A hash function generates a fixed-size string (the hash value) from a given input (the file content). If two files have the same hash value, they are almost certainly identical, since collisions are vanishingly rare in practice. Common hashing algorithms include MD5, SHA-1, and SHA-256.
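To make this concrete, here is a minimal Python sketch of hash-based duplicate detection (Python is used purely for illustration, and the scan path is a placeholder):

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def file_hash(path, chunk_size=65536):
    """Compute a file's SHA-256 hash, reading in chunks so large files fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(root):
    """Group every file under root by content hash; groups of two or more are duplicates."""
    groups = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            groups[file_hash(path)].append(path)
    return {h: paths for h, paths in groups.items() if len(paths) > 1}

for h, paths in find_duplicates("/path/to/scan").items():  # placeholder path
    print(h)
    for p in paths:
        print(f"  {p}")
```

A common optimization, used by the PowerShell command below, is to group files by size first: only files of equal size can possibly be duplicates, so most files never need to be hashed at all.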
Programmatic Methods to Find & Delete Duplicates
On Windows
Windows offers several built-in tools and scripting capabilities for managing files, including identifying duplicates.
1. Using PowerShell
PowerShell is a robust command-line shell and scripting language that can be used to find and delete duplicate files based on their content hash.
- Finding Duplicates: To list duplicate files based on their hash, open PowerShell as an administrator and use the following command:
Get-ChildItem -Path "C:\Your\Target\Folder" -File -Recurse | Group-Object -Property Length | Where-Object {$_.Count -gt 1} | Select-Object -ExpandProperty Group | Get-FileHash | Group-Object -Property Hash | Where-Object {$_.Count -gt 1} | ForEach-Object {$_.Group | Select-Object Path, Hash} | Out-File -FilePath "C:\Path\To\Duplicates.txt"
- Replace "C:\Your\Target\Folder" with the directory you want to scan.
- Replace "C:\Path\To\Duplicates.txt" with the desired path for the output file.
This command first groups files by size, then by their hash, and finally outputs the paths and hashes of all identified duplicates to a text file. Review this file carefully before proceeding with deletion.
- Deleting Duplicates (with caution): To automatically delete duplicate files (keeping one instance), you can modify the command. Exercise extreme caution with this command, as it will permanently delete files. It's highly recommended to back up your data or at least review the output of the previous command thoroughly before executing this.
Get-ChildItem -Path "C:\Your\Target\Folder" -File -Recurse | Group-Object -Property Length | Where-Object {$_.Count -gt 1} | Select-Object -ExpandProperty Group | Get-FileHash | Group-Object -Property Hash | Where-Object {$_.Count -gt 1} | ForEach-Object {$_.Group | Select-Object -Skip 1} | Remove-Item -Force -WhatIf
The -WhatIf parameter is crucial here. It simulates the deletion without actually removing any files, allowing you to see what would be deleted. Once you are confident, remove -WhatIf to perform the actual deletion.
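With -WhatIf in place, PowerShell prints one "What if" line for each file it would have removed, along these lines (the path shown is only illustrative):

```
What if: Performing the operation "Remove File" on target "C:\Your\Target\Folder\photo - Copy.jpg".
```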
2. Using Command Prompt (CMD) - Limited Capability
While CMD is less robust for this task than PowerShell, you can use it to find files with similar names or extensions, often indicative of simple duplicates created by copying. However, it doesn't compare file content.
- Finding files by pattern:
dir /s /b *.jpg > duplicates.txt
This command will list all .jpg files recursively within the current directory and its subdirectories, saving the output to duplicates.txt. You would then manually review this list.
- Deleting files by pattern (with caution):
del /s /f "* - Copy.jpg"
This command would delete files ending with " - Copy.jpg". This approach is crude and only catches duplicates created by Windows' default "copy" naming convention.
On Mac
Similar to Windows, macOS provides powerful command-line tools in Terminal for advanced file management.
1. Using Terminal with find and md5 / shasum
The find command can locate files, and md5 or shasum can generate content hashes.
- Finding Duplicates: Open Terminal (Applications > Utilities > Terminal) and navigate to the directory you want to scan using cd.
cd ~/Documents # Example: change to your Documents folder
find . -type f -exec md5 -r {} \; | sort | tee duplicates.txt | awk '{print $1}' | uniq -d | grep -hf - duplicates.txt > identified_duplicates.txt
This command chain does the following:
- find . -type f -exec md5 -r {} \;: Finds all regular files in the current directory and its subdirectories and computes their MD5 hashes. The -r flag prints each hash before its file path, which keeps the later steps simple.
- sort: Sorts the output so that identical hashes land on adjacent lines.
- tee duplicates.txt: Saves the sorted output to duplicates.txt for review.
- awk '{print $1}' | uniq -d: Extracts just the hash from each line, then prints only the hashes that appear more than once.
- grep -hf - duplicates.txt: Filters duplicates.txt down to the lines whose hash is in that list of repeated hashes.
- > identified_duplicates.txt: Saves the final list of duplicate files to identified_duplicates.txt.
Note: For stronger hashing, you can replace md5 -r with shasum -a 256, which prints its output in the same hash-first format.
- Deleting Duplicates (with caution): Once you have identified_duplicates.txt, you can use a script to delete all but one instance of each duplicate. Again, exercise extreme caution.
# IMPORTANT: Review identified_duplicates.txt carefully before running this!
# Each line of identified_duplicates.txt is "<hash> <path>", so the hash must
# be split off before anything is deleted.
# It's safer to manually select and delete after reviewing.
# Alternatively, you can use `fdupes` for a more interactive deletion (see next section).
# If you absolutely want to automate deletion from the list (USE WITH EXTREME CARE),
# the following keeps the first file in each hash group and deletes the rest:
# awk 'seen[$1]++' identified_duplicates.txt | while read -r hash file; do
#   rm "$file"
# done
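If even a reviewed rm feels too risky, a gentler variation is to move the extra copies into a holding folder and only empty it once you're sure nothing is missing. Here is a minimal Python sketch of that idea; it assumes the hash-first "hash path" line format produced above, uses an arbitrary quarantine location, and does not handle two files sharing the same base name:

```python
import shutil
from pathlib import Path

quarantine = Path.home() / "duplicate_quarantine"  # example location, adjust to taste
quarantine.mkdir(exist_ok=True)

seen = set()
with open("identified_duplicates.txt") as f:
    for line in f:
        if not line.strip():
            continue
        # Each line is "<hash> <path>"; split the hash off the path.
        file_hash, path = line.rstrip("\n").split(maxsplit=1)
        if file_hash in seen:
            # Not the first copy of this hash: quarantine it instead of deleting it.
            shutil.move(path, str(quarantine / Path(path).name))
        else:
            seen.add(file_hash)
```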
- A safer approach is to use a tool like fdupes.
2. Using fdupes (Recommended for Mac)
fdupes is a command-line utility specifically designed to find and optionally delete duplicate files. It's often preferred for its ease of use and interactive deletion options.
- Installation (using Homebrew): If you don't have Homebrew installed, you can get it from https://brew.sh.
brew install fdupes
- Finding Duplicates: To scan a directory recursively:
fdupes -r /path/to/directory
fdupes will list all duplicate files, grouped into sets separated by blank lines.
- Deleting Duplicates: To delete duplicates, fdupes offers interactive prompts:
fdupes -d -r /path/to/directory
It will prompt you for each set of duplicates, asking which files to keep and which to delete.
To automatically keep the first file found and delete the rest without prompting:
fdupes -dN -r /path/to/directory
Use -dN with extreme caution, as it provides no interactive confirmation.
The Aryson Duplicate File Finder & Remover
While programmatic methods offer robust control, they require a certain level of technical comfort. For users seeking a more intuitive and user-friendly solution, third-party software like the Aryson Duplicate File Finder & Remover can be an excellent choice.
The Aryson Duplicate File Finder & Remover is designed to simplify the process of identifying and removing redundant files. Key features typically include:
- User-Friendly Interface: Provides an easy-to-navigate graphical interface, making it accessible for users without technical expertise.
- Comprehensive Scanning: Scans various file types (documents, photos, videos, audio, etc.) across selected drives and folders.
- Content-Based Duplication: Employs advanced algorithms (likely using hash comparisons) to accurately identify duplicates based on their content, not just names or sizes.
- Preview and Selection: Allows users to preview duplicate files before deletion, providing control over which copies to keep and which to remove. It often includes side-by-side comparisons of images or media.
- Customizable Scan Criteria: Offers options to refine searches, such as excluding specific folders or file types, or setting minimum/maximum file sizes.
- Safe Deletion Options: Provides various deletion options, such as moving to Recycle Bin/Trash, moving to a specified folder, or permanent deletion, often with safeguards to prevent accidental removal of original files.
- Reporting: Generates reports summarizing the scan results, including the number of duplicates found and the storage space reclaimed.
Why choose a dedicated tool like Aryson over programmatic methods?
- Ease of Use: No complex commands or scripting are required.
- Visual Interface: Provides a clear visual representation of duplicates, making selection and management straightforward.
- Safety Features: Often includes built-in safeguards and recovery options (like moving to trash) that command-line tools might lack.
- Advanced Features: May offer extras such as finding similar (not just identical) images or organizing files, which are complex to implement programmatically.
In conclusion, while programmatic methods using PowerShell on Windows or fdupes on Mac offer robust and flexible ways to manage duplicate files, dedicated software like the Aryson Duplicate File Finder & Remover provides a more accessible and feature-rich experience for the average user, simplifying the process of reclaiming valuable disk space. Choose the method that best suits your technical comfort level and specific needs.