I had the same problem with my music collection. Most tools and scripts I found were either noisy (listing every filename) or checksummed the file contents, which is far too slow.
Special characters, spaces, and symbols made this challenging. The strategy is to take the MD5 sum of each directory's sorted file names together with the directory's own name; the script then sorts the hashes to find duplicates. The child file names must be sorted because find does not guarantee the same file order in two different directories.
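To show the idea in isolation, here is a minimal sketch of hashing a single directory. The helper name dir_signature and the file names are made up for illustration, and the hash it produces is not byte-identical to the one from the full script, since the input is joined with newlines here.

```bash
#!/bin/bash
# dir_signature is a hypothetical helper, not part of the original script.
# Signature = MD5 of the sorted relative file paths plus the directory's own name.
dir_signature() {
    local dir=$1
    # Subshell so the caller's working directory is left untouched.
    (
        cd "$dir" || exit 1
        # LC_ALL=C gives a byte-wise, locale-independent sort order.
        { find . -type f | LC_ALL=C sort; basename "$PWD"; } | md5sum | cut -d' ' -f1
    )
}

# Demo: two albums with the same folder name and file names, created in a different order.
tmp=$(mktemp -d)
mkdir -p "$tmp/a/Motherlode" "$tmp/b/Motherlode"
touch "$tmp/a/Motherlode/01 intro.mp3" "$tmp/a/Motherlode/02 outro.mp3"
touch "$tmp/b/Motherlode/02 outro.mp3" "$tmp/b/Motherlode/01 intro.mp3"

sig_a=$(dir_signature "$tmp/a/Motherlode")
sig_b=$(dir_signature "$tmp/b/Motherlode")
echo "$sig_a"
echo "$sig_b"   # same hash as the line above

rm -rf "$tmp"
```

Only names are hashed, never file contents, which is why this stays fast on a large collection. It also means two albums with identical names but different audio will still be reported as duplicates.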
Bash Script (Debian 10):
#!/bin/bash
# usage: ./find_duplicates tunes_dir
# output: c547c3bcf85b9c578a1a52dd20665343 - /mnt/tunes/soul brothers/Motherlode
# MD5 is generated from all children filenames + album folder name
# sort list by MD5, then list lines with duplicate hashes (32 hex characters) representing albums
# Album/CD1/... Album/CD2/... will show (3) results if Album is duplicated
# CD1/2 example is indistinguishable from Discography/Album/Song.mp3
if [ $# -eq 0 ]; then
    echo "Please supply tunes directory as first arg"
    exit 1
fi
# Using absolute path of tunes_dir param
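find "$(readlink -f "$1")" -type d | while IFS= read -r line
do
    cd "$line" || continue
    # sort so the hash does not depend on the order find returns files in
    children=$(find ./ -type f | sort)
    base=$(basename "$line")
    # quote the variables so spaces and glob characters in names are hashed as-is
    sum=$(printf '%s\n' "$children" "$base" | md5sum)
    # md5sum prints "<hash>  -" for stdin; keep only the hash
    echo "${sum%% *} - $line"
done | sort | uniq -D -w 32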
Directory structure:
user@pc:~/test# find . -type d
./super soul brothers
./super soul brothers/Stritch's Brew
./super soul brothers/Fireball!
./super soul brothers/Motherlode
./car_tunes
./car_tunes/Fireball!
Example output:
user@pc:~# ./find_duplicates test/
07b0f79429663685f4005486af20247a - /root/test/car_tunes/Fireball!
07b0f79429663685f4005486af20247a - /root/test/super soul brothers/Fireball!
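To see what the final sort and uniq stage does on its own, here is a small sketch that feeds it three hand-written lines. The lines are copied from the usage comment and the example output above; the function name find_dupes is made up for illustration. It assumes GNU uniq, as shipped with Debian.

```bash
#!/bin/bash
# find_dupes is a hypothetical wrapper around the last stage of the pipeline.
find_dupes() {
    # -w 32 compares only the first 32 characters (the MD5 hex digest);
    # -D prints every line that belongs to a group of repeated hashes.
    sort | uniq -D -w 32
}

printf '%s\n' \
    '07b0f79429663685f4005486af20247a - /root/test/super soul brothers/Fireball!' \
    'c547c3bcf85b9c578a1a52dd20665343 - /mnt/tunes/soul brothers/Motherlode' \
    '07b0f79429663685f4005486af20247a - /root/test/car_tunes/Fireball!' |
    find_dupes
```

This prints only the two Fireball! lines, since the Motherlode hash occurs once. With GNU uniq, replacing -D with --all-repeated=separate should put a blank line between groups, which is easier to read when there are many duplicates.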