edited Jan 1, 2021 at 4:11

I had the same problem with my music collection. Most tools and scripts were either noisy (listing every filename) or checksummed the file contents, which is far too slow.

Special characters, spaces, and symbols made this challenging. The strategy is to take the MD5 sum of the sorted file names together with the parent directory name; the script can then sort the hashes to find duplicates. The child file names must be sorted, because find does not guarantee the same file order in two different directories.
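To illustrate why the sorting matters, here is a minimal sketch of the per-directory hashing step on its own. The helper name `dir_signature` and the temporary demo directories are made up for this example. It hashes the names newline-separated, so its hash values will not match those printed by the script in this answer.

```shell
#!/bin/bash
# dir_signature is a hypothetical helper name, not part of the original script.
# It prints the MD5 of a directory's sorted relative file names plus its base name.
dir_signature() {
    local dir=$1
    (
        cd "$dir" || exit 1
        # LC_ALL=C gives a byte-wise sort that does not depend on the locale.
        find . -type f | LC_ALL=C sort
        basename "$PWD"
    ) | md5sum | cut -d ' ' -f 1
}

# Demo: two albums with the same file names, created in a different order.
tmp=$(mktemp -d)
mkdir -p "$tmp/a/Motherlode" "$tmp/b/Motherlode"
touch "$tmp/a/Motherlode/01 intro.mp3" "$tmp/a/Motherlode/02 outro.mp3"
touch "$tmp/b/Motherlode/02 outro.mp3" "$tmp/b/Motherlode/01 intro.mp3"

sig_a=$(dir_signature "$tmp/a/Motherlode")
sig_b=$(dir_signature "$tmp/b/Motherlode")
echo "$sig_a"
echo "$sig_b"
rm -rf "$tmp"
```

Both lines should print the same hash, since only the names (not the contents or creation order) feed into it.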

Bash Script (Debian 10):

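#!/bin/bash

# usage: ./find_duplicates tunes_dir
# output: c547c3bcf85b9c578a1a52dd20665343 - /mnt/tunes/soul brothers/Motherlode
# MD5 is generated from all children filenames + album folder name
# sort the list by MD5, then list the duplicated hashes (32 hex characters) representing albums
# Album/CD1/... Album/CD2/... will show (3) results if Album is duplicated
# CD1/2 example is indistinguishable from Discography/Album/Song.mp3

if [ $# -eq 0 ]; then
    echo "Please supply tunes directory as first arg"
    exit 1
fi

# Using absolute path of tunes_dir param (quoted so paths with spaces survive)
find "$(readlink -f "$1")" -type d | while IFS= read -r line
do
    # skip directories we cannot enter, otherwise find would scan the wrong place
    cd "$line" || continue
    children=$(find ./ -type f | sort)
    base=$(basename "$line")
    # kept unquoted so the hashed text stays the same: word splitting joins the
    # names with single spaces (caveat: names containing * ? or [ may glob-expand)
    sum=$(echo $children $base | md5sum)
    # $sum is "<hash>  -"; quote the path so repeated spaces in it are preserved
    echo $sum "$line"
done | sort | uniq -D -w 32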

Directory structure:

user@pc:~/test# find . -type d
./super soul brothers
./super soul brothers/Stritch's Brew
./super soul brothers/Fireball!
./super soul brothers/Motherlode
./car_tunes
./car_tunes/Fireball!

Example output:

user@pc:~# ./find_duplicates  test/
07b0f79429663685f4005486af20247a - /root/test/car_tunes/Fireball!
07b0f79429663685f4005486af20247a - /root/test/super soul brothers/Fireball!
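The final filtering step can be tried in isolation. This is a small sketch, assuming GNU coreutils (`uniq -D` prints every line of a duplicated group, and `-w 32` limits the comparison to the first 32 characters, i.e. the hash). The hashes are stand-ins computed from arbitrary strings and the `/music/...` paths are hypothetical.

```shell
#!/bin/bash
# Stand-in hashes computed from arbitrary strings, for illustration only.
h1=$(printf 'album-one' | md5sum | cut -d ' ' -f 1)
h2=$(printf 'album-two' | md5sum | cut -d ' ' -f 1)

# Three "hash - path" lines; only h1 occurs more than once.
dupes=$(printf '%s\n' \
    "$h2 - /music/b/Other" \
    "$h1 - /music/z/Album" \
    "$h1 - /music/a/Album" |
    LC_ALL=C sort | uniq -D -w 32)
echo "$dupes"
```

Only the two lines sharing the first hash are printed; the path after the hash is ignored when deciding what counts as a duplicate.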

answered Dec 15, 2020 at 3:28

Kevin
