I have several CSV files, ranging from 1MB to 6GB, generated by an inotify script that logs a list of events in this format:
timestamp;fullpath;event;size
Here is a sample:
1521540649.02;/home/workdir/ScienceXMLIn/config.cfg;IN_OPEN;2324
1521540649.02;/home/workdir/ScienceXMLIn/config.cfg;IN_ACCESS;2324
1521540649.02;/home/workdir/ScienceXMLIn/config.cfg;IN_CLOSE_NOWRITE;2324
1521540649.02;/home/workdir/quad_list_14.json;IN_OPEN;2160
1521540649.03;/home/workdir/quad_list_14.json;IN_ACCESS;2160
1521540649.03;/home/workdir/quad_list_14.json;IN_CLOSE_NOWRITE;2160
1521540649.03;/home/workdir/ScienceXMLIn/masterbias_list.asc;IN_OPEN;70
1521540649.03;/home/workdir/ScienceXMLIn/masterbias_list.asc.1;IN_OPEN;80
1521540649.03;/home/workdir/ScienceXMLIn/masterbias_list.asc.2;IN_OPEN;70
1521540649.03;/home/workdir/otherfolder/quad_list_14.json;IN_OPEN;2160
1521540649.03;/home/workdir/otherfolder/quad_list_14.json;IN_CLOSE_NOWRITE;2160
My goal is to identify files with the same name that appear in different folders.
In this example, the file quad_list_14.json appears in both /home/workdir/otherfolder and /home/workdir.
My desired output is simple: just the list of files that appear in more than one folder. In this case it would look like this:
quad_list_14.json
To do this, I've written this small piece of code:
#cut keeps only the path column and sort -u reduces it to unique file paths
PATHLIST=$(cut -d';' -f 2 "$1" | sort -u)
FILENAMELIST=""
#this loop builds a list of basenames from the list of file paths
for path in ${PATHLIST}
do
FILENAMELIST="$(basename "${path}")
${FILENAMELIST}"
done
#once the list is built, sort it by basename and find the duplicates with uniq -d
echo "${FILENAMELIST}" | sort | uniq -d
Don't use this code at home, it's terrible: it forks basename once per path and rebuilds the string on every iteration. I should have replaced this script with a one-liner like this:
#get all file paths, sort them and keep only unique entries, then
#strip the directory part to get each file's basename,
#and finally sort and print the duplicate entries
cut -d';' -f 2 "$1" | sort -u | grep -o '[^/]*$' | sort | uniq -d
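Both versions are invoked the same way; I time them with the shell's time builtin (the script and CSV names here are just placeholders):
time ./find_dup_names.sh inotify_events.csv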
My problem remains, though: there are a lot of files, and while the shortest takes 0.5 seconds, the longest takes 45 seconds on an SSD (and my production disk won't be as fast) to find duplicate filenames in different folders.
I need to improve this code to make it more efficient. My only limitation is that I can't fully load the files into RAM.
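One direction I've been considering is a single awk pass that remembers one directory per basename, so memory grows with the number of distinct filenames rather than with the file size. This is only a sketch and I haven't tested it at the 6GB scale:
#single pass: remember the first directory seen for each basename,
#then flag the name when it later shows up in a different directory
awk -F';' '
{
    n = split($2, parts, "/")
    name = parts[n]
    dir = substr($2, 1, length($2) - length(name))
    if (!(name in firstdir)) firstdir[name] = dir
    else if (firstdir[name] != dir) dup[name] = 1
}
END { for (name in dup) print name }
' "$1"
This avoids the repeated sorts entirely, but I don't know whether the arrays stay small enough on my real data, so I'd welcome a better approach.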