
UPDATE 2: and now with a simpler locale

$ export LC_ALL=C
$ time sort -u test > /dev/null
119.18user 3.64system 0:38.24elapsed 321%CPU (0avgtext+0avgdata 2013472maxresident)k

$ time awk '!x[$0]++' test > /dev/null
67.23user 2.50system 1:10.16elapsed 99%CPU (0avgtext+0avgdata 10480maxresident)k
7187520inputs+0outputs (0major+1912minor)pagefaults 0swaps

$ time uniq test > /dev/null                                                                                                                                               
22.05user 2.02system 0:24.24elapsed 99%CPU (0avgtext+0avgdata 1488maxresident)k
2959648inputs+0outputs (1major+72minor)pagefaults 0swaps

This time uniq does win the race. As Stéphane Chazelas hints at in the comments, setting your locale to C makes sort and uniq a whole lot faster!
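A quick aside on why the locale matters so much: in the C locale, sort compares raw bytes, while under en_US.UTF-8 every comparison goes through Unicode collation tables. The byte comparison is not only much cheaper, it can also produce a different order. A small sketch (the input lines are made up):

```shell
# In the C locale, sort compares raw bytes, so every uppercase ASCII
# letter (B = 0x42) sorts before any lowercase one (a = 0x61):
printf 'apple\nBanana\n' | LC_ALL=C sort
# prints: Banana, then apple
# Under en_US.UTF-8 collation the order would typically be apple, Banana,
# but that depends on the collation tables installed on your system.
```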


UPDATE: since this intrigued me quite a bit, I did some more tests. Let's get the cut part out of the way and make sure the file is nicely sorted:

cat all_files | cut -d '/' -f 1,2,3,4 | sort -T . > test

This takes 8.4 minutes; test is now 7.9 GB big.

Let's run these tools on the file instead of in a pipe. This allows them to do some more optimization (sort, for example, will multi-thread), and to read from a faster SSD.

You might not notice that sort is also taking a lot of memory, since it does clever tricks with temp files in /tmp, which might be tmpfs and thus live in your RAM. (Try sorting a file bigger than /tmp and you will run into space issues; that's why I need the -T . flag in the command above.)
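If you want to check this on your own machine, something like the following should work (the directory and file names are just placeholders):

```shell
# See whether /tmp is a RAM-backed tmpfs (look at the "Type" column):
df -T /tmp

# If it is (or if it's too small), point sort's temporary files at a
# directory on a big disk instead, as -T . does in the command above:
mkdir -p ./sort-tmp
printf '3\n1\n2\n' | sort -T ./sort-tmp
# prints 1, 2, 3; any spill files get written under ./sort-tmp
```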

$ time sort -u test > /dev/null
339.24user 3.54system 1:28.87elapsed 385%CPU (0avgtext+0avgdata 2365856maxresident)k
9555544inputs+0outputs (0major+591298minor)pagefaults 0swaps

$ time awk '!x[$0]++' test > /dev/null                                                                                                                             
51.15user 1.55system 0:52.94elapsed 99%CPU (0avgtext+0avgdata 10976maxresident)k
0inputs+0outputs (0major+1923minor)pagefaults 0swaps

$ time uniq test > /dev/null                                                                                                                                  
421.89user 2.76system 7:06.63elapsed 99%CPU (0avgtext+0avgdata 1980maxresident)k
52712inputs+0outputs (0major+79minor)pagefaults 0swaps

So it seems your awk solution is the fastest of these three, and actually uses the least memory.
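It's worth spelling out why the awk one-liner works at all: `!x[$0]++` keeps a hash table keyed on the whole line and prints a line only the first time it is seen. It never needs the input sorted, which also means it preserves the original line order. A tiny demo with made-up input:

```shell
# x[$0] is 0 (false) the first time a line is seen, so !x[$0]++ is true
# and awk's default action prints the line; the ++ then bumps the counter,
# so later duplicates evaluate to false and are skipped:
printf 'b\na\nb\na\n' | awk '!x[$0]++'
# prints: b, then a (first occurrences only, input order preserved)
```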


I just wanted to point out that GNU uniq seems terribly slow, even on a sorted list.

I just tried getting a list of directory prefixes from a list of sorted filenames:

$ pv all_files | cut -d '/' -f 1,2,3,4 | uniq > all_prefixes

36.7GiB 0:07:41 [81.4MiB/s]

$ pv all_files | cut -d '/' -f 1,2,3,4 | sort -u > all_prefixes2

36.7GiB 0:03:14 [ 193MiB/s] 

$ pv all_files  | cut -d '/' -f 1,2,3,4 | awk '!x[$0]++' > all_prefixes3                                        
36.7GiB 0:02:18 [ 270MiB/s] 
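For context, the cut stage in these pipelines only keeps the first four /-separated components of each path; the sample paths below are invented:

```shell
# Fields are counted from 1, and a leading '/' makes field 1 empty,
# so -f 1,2,3,4 on an absolute path keeps a 3-level directory prefix:
printf '/srv/data/2020/04/a.log\n/srv/data/2020/05/b.log\n' |
    cut -d '/' -f 1,2,3,4
# prints /srv/data/2020 twice; deduplicating then collapses such
# repeats to a single prefix line
```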

sort -u seems twice as fast as uniq, and this is with sort reading from stdin and writing to stdout, so I don't see it doing any parallelization yet. I have no idea why uniq should be so much slower than sort, since it doesn't have to sort the list...

The output of this command is very small (there are a lot of duplicates): only 264 KB, and sort terminates instantly after pv is done.

The same speeds remain if you swap the order in which you run the commands, so my flow is limited by CPU time here, not by disk access and caches (I only have 8 GB of RAM and my swap is not used).

I'm running this on a Fedora 31 machine with GNU coreutils sort and uniq, and GNU awk; the locale is set to en_US.UTF-8.
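If you want to reproduce this, the tool versions and the collation locale can be checked like so (on Fedora, awk is gawk):

```shell
sort --version | head -n 1   # GNU coreutils version
awk --version  | head -n 1   # gawk version
locale | grep LC_COLLATE     # the collation locale sort and uniq will use
```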
