I have several CSV files, ranging from 1MB to 6GB, generated by an inotify script that logs a list of events in this format:
timestamp;fullpath;event;size
Here is a sample:
1521540649.02;/home/workdir/ScienceXMLIn/config.cfg;IN_OPEN;2324
1521540649.02;/home/workdir/ScienceXMLIn/config.cfg;IN_ACCESS;2324
1521540649.02;/home/workdir/ScienceXMLIn/config.cfg;IN_CLOSE_NOWRITE;2324
1521540649.02;/home/workdir/quad_list_14.json;IN_OPEN;2160
1521540649.03;/home/workdir/quad_list_14.json;IN_ACCESS;2160
1521540649.03;/home/workdir/quad_list_14.json;IN_CLOSE_NOWRITE;2160
1521540649.03;/home/workdir/ScienceXMLIn/masterbias_list.asc;IN_OPEN;70
1521540649.03;/home/workdir/ScienceXMLIn/masterbias_list.asc.1;IN_OPEN;80
1521540649.03;/home/workdir/ScienceXMLIn/masterbias_list.asc.2;IN_OPEN;70
1521540649.03;/home/workdir/otherfolder/quad_list_14.json;IN_OPEN;2160
1521540649.03;/home/workdir/otherfolder/quad_list_14.json;IN_CLOSE_NOWRITE;2160
My goal is to identify files with the same name that appear in different folders.
In this example, the file quad_list_14.json appears in both /home/workdir/otherfolder and /home/workdir.
My desired output is simple: just the list of files that appear in more than one folder. In this case it would look like this:
quad_list_14.json
To do this, I've written this small piece of code:
#cut keeps only the path column and sort -u reduces it to unique file paths
PATHLIST=$(cut -d';' -f 2 "$1" | sort -u)
FILENAMELIST=""
#this loop builds a list of basenames from the list of file paths
for path in ${PATHLIST}
do
FILENAMELIST="$(basename "${path}")
${FILENAMELIST}"
done
#once the list is built, sort it by basename and find the duplicates with uniq -d
echo "${FILENAMELIST}" | sort | uniq -d
Don't use this code at home, it's terrible: it forks basename once per path and rebuilds the string on every iteration. I should have replaced this script with a one-liner like this:
#get all file paths, sort them and keep only unique entries, then
#strip the directory part to get each file's basename,
#and finally sort and print the duplicate entries
cut -d';' -f 2 "$1" | sort -u | grep -o '[^/]*$' | sort | uniq -d
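Both versions are invoked the same way; I time them with the shell's time builtin (the script and CSV names here are just placeholders):
time ./find_dup_names.sh inotify_events.csv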
My problem remains, though: there are a lot of files, and while the shortest takes 0.5 seconds, the longest takes 45 seconds on an SSD (and my production disk won't be as fast) to find duplicate filenames in different folders.
I need to improve this code to make it more efficient. My only limitation is that I can't fully load the files into RAM.
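One direction I've been considering is a single awk pass that remembers one directory per basename, so memory grows with the number of distinct filenames rather than with the file size. This is only a sketch and I haven't tested it at the 6GB scale:
#single pass: remember the first directory seen for each basename,
#then flag the name when it later shows up in a different directory
awk -F';' '
{
    n = split($2, parts, "/")
    name = parts[n]
    dir = substr($2, 1, length($2) - length(name))
    if (!(name in firstdir)) firstdir[name] = dir
    else if (firstdir[name] != dir) dup[name] = 1
}
END { for (name in dup) print name }
' "$1"
This avoids the repeated sorts entirely, but I don't know whether the arrays stay small enough on my real data, so I'd welcome a better approach.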