Timeline for awk comparison using arrays
Current License: CC BY-SA 3.0
19 events
| when | what | action | by | license | comment |
|---|---|---|---|---|---|
| Jul 29, 2015 at 13:43 | history | mod moved comments to chat | | | |
| Jul 29, 2015 at 13:43 | comment | added | terdon♦ | | @mikeserv Costas, let's take this to chat. You'll have to tweak both solutions to get the right results. |
| Jul 29, 2015 at 13:41 | comment | added | mikeserv | | @terdon - your awk doesn't seem to get the right results. Instead of writing to `/dev/null`, write to `\| wc -l`. |
| Jul 29, 2015 at 13:37 | comment | added | terdon♦ | | @mikeserv did it? Cool! Note, however, that Costas's approach here uses `-s39` to skip comparing the first 39 characters and `-w4` to compare only the 4 characters 40-43. I'm guessing that would make it faster than my awk approach on your random data; as I told Costas above, it did for mine. I think it's safe to conclude that, as Costas has been saying all along, sort/uniq will be faster for large, non-homogeneous datasets, while awk will be faster on small datasets, or on large but relatively homogeneous ones. On my dd file, I got 21.6 sec for awk and 11.6 for Costas's solution. |
| Jul 29, 2015 at 13:32 | comment | added | mikeserv | | @terdon - your thing did it in 6 secs. |
| Jul 29, 2015 at 13:27 | comment | added | mikeserv | | @terdon - that's a good point, too. Costas - don't do `sort \| uniq`; instead try `sort ... \| sort -nmut, -k18,18`. That way the second sort should only print unique results for the 18th field and won't need to consider anything else; the merge flag (`-m`) makes the second pass almost a no-op. I just created a 400 MB file of large, completely random integer lines with something like `dd bs=1M </dev/urandom \| od -tu -w44 \| sed 's/[0-9]\{1,5\}/&,/g;s/ *//g'`, and the whole `sort \| sort` job came out in 8 secs, reducing 3 million lines to 100k. I'm about to try terdon's thing. |
| Jul 29, 2015 at 13:26 | comment | added | terdon♦ | | @Costas first I want to clarify that I'm just trying to learn here. I'm not saying you're wrong; as far as I can tell, you know far more about this sort of thing than I do. And, as it turns out, I'm quite right and you do know more than I do! I tried again with a file where every `$18` was different, and awk took 5.7 seconds while sort/uniq took 2.8 to produce the same file! |
| Jul 29, 2015 at 13:09 | comment | added | Costas | | @terdon I wrote «can be quicker» because it strongly depends on the data. If you'd like, I'll add «in some cases». As noted by @mikeserv above, awk reads the data twice: first when putting it into the two arrays, then when reading the arrays back in the `END` loop. So if the resulting array is small, awk will be quicker. |
| Jul 29, 2015 at 13:01 | comment | added | terdon♦ | | @mikeserv no, not at all! I only compare a single field to the array, not the entire line. And that's an indexed array, a hash, so I don't need to iterate over it; it only ever compares the current field to the single corresponding value stored in the array. The size of the array is irrelevant. In any case, as I said, it's almost 10 seconds faster. I also tried with a 265M file of 22 random, comma-separated integers per line (following the OP's example format) and, again, sort/uniq was at 26 sec while awk was at 15. Still using a tmpfs and printing to `/dev/null` for both. |
| Jul 29, 2015 at 12:55 | comment | added | terdon♦ | | @Costas ah, yes indeed, very good point. Your approach should never run out of memory. I just don't think it will ever be faster, either, even if we allow the limitations imposed by ignoring the first 39 characters. |
| Jul 29, 2015 at 12:51 | comment | added | mikeserv | | @terdon - Costas makes an excellent point - your awk has to read all of the data twice, too, you know - possibly many more times, actually, but it squeezes it as it does. You have to compare each line against your array; as the size of that array increases, its comparison time for each line will increase exponentially - but uniq only ever has to compare the current line against the last. |
| Jul 29, 2015 at 12:46 | comment | added | Costas | | @terdon You made an unrepresentative test case with «content of the OP's example repeated a few thousand times»: that leaves you with only 3 members in the array. For a really huge amount of different data, awk can even hit an out-of-memory error. |
| Jul 29, 2015 at 12:39 | comment | added | terdon♦ | | @mikeserv OK, I mounted a tmpfs on `/tmp` and tested again. The awk took 19 seconds this time, versus 28 for sort/uniq. Remember that even in the best case this can't be 100% parallel, since uniq needs to wait for data from sort. Actually, the entire awk command (19 sec) was faster than the sort alone (24 seconds), and this despite sort maxing out my 4 cores. You keep underestimating awk :) It is actually very, very efficiently tuned for this kind of thing. |
| Jul 29, 2015 at 12:08 | comment | added | mikeserv | | @terdon - It doesn't read it twice over; it reads it twice concurrently. On a multicore system a single awk will wind up processor-bound, maxing out a single core, but on the same system `sort \| uniq` can work the same data at the same time, each on its own core. It's why you'll see stuff like 187% CPU usage or whatever after a pipeline finishes; it's what makes pipelines awesome. Also, sort and uniq are both precision tools - they each do a single thing, and do it well - while awk's job is to interpret whatever script it's handed, and so it isn't likely to be as efficiently tuned as a result. |
| Jul 29, 2015 at 12:04 | comment | added | terdon♦ | | @mikeserv doesn't seem to be, no. At least `mount \| grep tmpfs` does not show `/tmp`. You think that's what makes it slower? Even theoretically, though, why would reading all the data twice ever be faster than reading it once? |
| Jul 29, 2015 at 12:01 | comment | added | mikeserv | | @terdon - is your `/tmp` a tmpfs? |
| Jul 29, 2015 at 11:40 | comment | added | terdon♦ | | I just tested using a 287M file (the content of the OP's example repeated a few thousand times) and my awk approach took 23 seconds while the sort/uniq took 30. Here, both sort and uniq need to go over the entire file. Why would you expect it to be faster? Also, limiting the number of characters compared is clever, but it assumes that the number of characters per line is fixed, right? It will fail if one of the fields isn't empty, or has a value with more than one digit. |
| Jul 29, 2015 at 11:20 | history | edited | Costas | CC BY-SA 3.0 | added 4 characters in body |
| Jul 29, 2015 at 11:13 | history | answered | Costas | CC BY-SA 3.0 | |
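
The awk array approach the comments debate (terdon's single hash lookup per line at 13:01, and the "read the array in the `END` loop" shape Costas describes at 13:09) is only described, never quoted. A minimal sketch of both shapes, assuming comma-separated input, field 18 as the deduplication key, and a hypothetical file name `data.csv`:

```sh
# Variant 1: single pass, one hash lookup per line; keep the first line
# seen for each distinct value of $18.
awk -F, '!seen[$18]++' data.csv > deduped.csv

# Variant 2: store one line per key, then print the array in the END block
# (a simplified one-array version of the array-plus-END-loop shape Costas
# mentions); this keeps the last line per key and does not preserve order.
awk -F, '{line[$18] = $0} END {for (k in line) print line[k]}' data.csv > deduped2.csv
```

Either way, each input line costs one hash lookup, so the array's size does not change the per-line cost; the trade-off is memory, since a huge number of distinct keys can exhaust it, which is the failure mode Costas points out at 12:46.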
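
The `-s39`/`-w4` numbers mentioned at 13:37 correspond to uniq's skip-chars and check-chars options. Costas's actual command is not reproduced in this timeline, so the following is only a hedged reconstruction of the idea, assuming the key occupies a fixed window at characters 40-43 of every line and that the lines contain no blanks:

```sh
# Sort on characters 40-43 of each line so equal keys become adjacent
# (with no blanks in the data, the whole line is "field 1" for sort),
# then let uniq skip the first 39 characters (-s39) and compare only the
# next 4 (-w4). This relies on a fixed-width layout, which is exactly the
# limitation terdon raises at 11:40.
sort -k1.40,1.43 data.csv | uniq -s39 -w4 > deduped.csv
```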
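
The pipeline mikeserv sketches at 13:27 is abbreviated in the comment (`sort ... | sort -nmut, -k18,18`), so the version below spells it out under the assumption that both stages key numerically on comma-separated field 18; his exact first-stage flags are not shown, and the `count=` on the dd line is an addition so the data generator terminates:

```sh
# Roughly the test data mikeserv describes: unsigned decimal integers from
# od, a comma appended after each run of digits, spaces stripped.
dd bs=1M count=40 </dev/urandom |
    od -tu -w44 |
    sed 's/[0-9]\{1,5\}/&,/g;s/ *//g' > data.csv

# The first sort orders the file numerically on field 18; the second runs
# in merge mode (-m), so it is nearly a pass-through, and -u keeps only one
# line per distinct value of the sort key (field 18), not per whole line.
sort -t, -n -k18,18 data.csv | sort -t, -n -m -u -k18,18 > deduped.csv
```

A single `sort -t, -n -u -k18,18 data.csv` should give the same result in one pass; the two-stage form mirrors the sort-then-deduplicate pipeline the thread is comparing against.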
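
The timing setup itself is only alluded to in the comments (tmpfs on `/tmp`, output discarded to `/dev/null`, wall-clock comparisons). A rough sketch of that harness, where the mount size, file name, and the use of bash's `time` keyword are all assumptions:

```sh
# Check whether /tmp is already a tmpfs (the question raised at 12:01/12:04).
mount | grep tmpfs

# If not, mount one for the test run (requires root; this shadows the
# existing contents of /tmp until it is unmounted).
sudo mount -t tmpfs -o size=2G tmpfs /tmp

# Time both approaches against the same file, discarding output as the
# commenters did, so only the read/compute cost is measured.
time awk -F, '!seen[$18]++' /tmp/data.csv > /dev/null
time sort /tmp/data.csv | uniq > /dev/null
```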