Script optimisation to find duplicate filenames in huge CSV

I have several CSV files, from 1 MB to 6 GB, generated by an inotify script, each containing a list of events formatted as:
timestamp;fullpath;event;size

The files look like this:

timestamp;fullpath;event;size
1521540649.02;/home/workdir/ScienceXMLIn/config.cfg;IN_OPEN;2324
1521540649.02;/home/workdir/ScienceXMLIn/config.cfg;IN_ACCESS;2324
1521540649.02;/home/workdir/ScienceXMLIn/config.cfg;IN_CLOSE_NOWRITE;2324
1521540649.02;/home/workdir/quad_list_14.json;IN_OPEN;2160
1521540649.03;/home/workdir/quad_list_14.json;IN_ACCESS;2160
1521540649.03;/home/workdir/quad_list_14.json;IN_CLOSE_NOWRITE;2160
1521540649.03;/home/workdir/ScienceXMLIn/masterbias_list.asc;IN_OPEN;70
1521540649.03;/home/workdir/ScienceXMLIn/masterbias_list.asc.1;IN_OPEN;80
1521540649.03;/home/workdir/ScienceXMLIn/masterbias_list.asc.2;IN_OPEN;70
1521540649.03;/home/workdir/otherfolder/quad_list_14.json;IN_OPEN;2160
1521540649.03;/home/workdir/otherfolder/quad_list_14.json;IN_CLOSE_NOWRITE;2160

My goal is to identify files with the same name that appear in different folders.
In this example, the file quad_list_14.json appears in both /home/workdir/otherfolder and /home/workdir/.

My desired output is simple: just the list of files that appear in more than one folder. In this case it would look like this:

quad_list_14.json

To do this, I've written this small piece of code:

#this line cuts the file down to the unique file paths
PATHLIST=$(cut -d';' -f 2 ${1} | sort -u)
FILENAMELIST=""

#this loop builds a list of basenames from the list of file paths
for path in ${PATHLIST}
do
    FILENAMELIST="$(basename "${path}")
${FILENAMELIST}"
done

#once the list is built, I simply sort it and find the duplicates with uniq -d
echo "${FILENAMELIST}" | sort | uniq -d

Don't use this code at home, it's terrible. I should have replaced this script with a one-liner like this:
#this gets all the file paths, sorts them and keeps only the unique entries, then
#removes the path to get the basename of the file
#and finally sorts and outputs the duplicate entries.
cut -d';' -f 2 ${1} | sort -u |  grep -o '[^/]*$' | sort | uniq -d

My problem remains, though: I have a lot of these files, and while the shortest takes 0.5 seconds, the longest takes 45 seconds on an SSD (and my production disk won't be as fast) to find duplicate filenames in different folders.

I need to improve this code to make it more efficient. I'm no expert in scripting and I'm open to a lot of solutions.
My only limitation is that I can't fully load the files into RAM.
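
For reference, I imagine the same logic could be collapsed into a single awk pass, something like the sketch below (not benchmarked; it assumes ';' never appears inside a path, and it only keeps one array entry per unique path in memory rather than the whole file):

#same idea as the pipeline above, in one pass (sketch, assumes ';' never appears inside a path)
awk -F';' 'NR > 1 && !seen[$2]++ {     #skip the header line, keep each full path only once
    n = split($2, parts, "/")          #basename = last "/"-separated component
    if (++dirs[parts[n]] == 2)         #a second distinct path with this basename
        print parts[n]                 #  means a second folder: print the name once
}' "${1}"

It follows the same steps as the one-liner (dedupe full paths, then count basenames), just without the intermediate sort passes.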
