Timeline for awk comparison using arrays
Current License: CC BY-SA 3.0
19 events
| when | what | action | by | license | comment |
|---|---|---|---|---|---|
| Jul 29, 2015 at 13:43 | history | mod moved comments to chat | | | |
| Jul 29, 2015 at 13:43 | comment | added | terdon♦ | | @mikeserv Costas, let's take this to chat. You'll have to tweak both solutions to get the right results. |
| Jul 29, 2015 at 13:41 | comment | added | mikeserv | | @terdon - your awk doesn't seem to get the right results. Instead of writing to `/dev/null`, write to `\| wc -l`. |
| Jul 29, 2015 at 13:37 | comment | added | terdon♦ | | @mikeserv did it? Cool! Note, however, that Costas's approach here uses `-s39` to skip comparing the first 39 characters and `-w4` to compare only the 4 characters 40-43. I'm guessing that would make it faster than my awk approach on your random data; as I told Costas above, it did for mine. I think it's safe to conclude that, as Costas has been saying all along, sort/uniq will be faster for large, non-homogeneous datasets, while awk will be faster on small datasets, or on large but relatively homogeneous ones. On my dd file, I got 21.6 sec for awk and 11.6 for Costas's solution. |
| Jul 29, 2015 at 13:32 | comment | added | mikeserv | | @terdon - your thing did it in 6 secs. |
| Jul 29, 2015 at 13:27 | comment | added | mikeserv | | @terdon - that's a good point, too. Costas - don't do `sort \| uniq`; instead try `sort ... \| sort -nmut, -k18,18`. That way the second sort should only print unique results for the 18th field and won't need to consider anything else; the merge flag (`-m`) makes the second pass almost a no-op. I just created a 400 MB file of large, completely random integer lines with something like `dd bs=1M </dev/urandom \| od -tu -w44 \| sed 's/[0-9]\{1,5\}/&,/g;s/ *//g'`, and the whole `sort \| sort` job came out in 8 secs, reducing 3 million lines to 100k. I'm about to try terdon's thing. |
| Jul 29, 2015 at 13:26 | comment | added | terdon♦ | | @Costas first I want to clarify that I'm just trying to learn here. I'm not saying you're wrong; as far as I can tell, you know far more about this sort of thing than I do. And, as it turns out, I'm quite right and you do know more than I do! I tried again with a file where every `$18` was different, and awk took 5.7 seconds while sort/uniq took 2.8 to produce the same file! |
| Jul 29, 2015 at 13:09 | comment | added | Costas | | @terdon I wrote «can be quicker» because it strongly depends on the data. If you'd like, I'll add «in some cases». As noted by @mikeserv above, awk reads the data twice: first when putting it into the two arrays, then when reading the arrays back in the `END` loop. So if the resulting array is small, awk will be quicker. |
| Jul 29, 2015 at 13:01 | comment | added | terdon♦ | | @mikeserv no, not at all! I only compare a single field to the array, not the entire line. And that's an indexed array, a hash, so I don't need to iterate over it; it only ever compares the current field to the single corresponding value stored in the array. The size of the array is irrelevant. In any case, as I said, it's almost 10 seconds faster. I also tried with a 265M file of 22 random, comma-separated integers per line (following the OP's example format) and, again, sort/uniq was at 26 sec while awk was at 15. Still using a tmpfs and printing to `/dev/null` for both. |
| Jul 29, 2015 at 12:55 | comment | added | terdon♦ | | @Costas ah, yes indeed, very good point. Your approach should never run out of memory. I just don't think it will ever be faster, either, even if we allow the limitations imposed by ignoring the first 39 characters. |
| Jul 29, 2015 at 12:51 | comment | added | mikeserv | | @terdon - Costas makes an excellent point - your awk has to read all of the data twice, too, you know - possibly many more times, actually, but it squeezes it as it does. You have to compare each line against your array; as the size of that array increases, its comparison time for each line will increase exponentially - but uniq only ever has to compare the current line against the last. |
| Jul 29, 2015 at 12:46 | comment | added | Costas | | @terdon You made an unrepresentative test case with «content of the OP's example repeated a few thousand times»: that leaves you with only 3 members in the array. For a really huge amount of different data, awk can even hit an out-of-memory error. |
| Jul 29, 2015 at 12:39 | comment | added | terdon♦ | | @mikeserv OK, I mounted a tmpfs on `/tmp` and tested again. The awk took 19 seconds this time, versus 28 for sort/uniq. Remember that even in the best case this can't be 100% parallel, since uniq needs to wait for data from sort. Actually, the entire awk command (19 sec) was faster than the sort alone (24 seconds), and this despite sort maxing out my 4 cores. You keep underestimating awk :) It is actually very, very efficiently tuned for this kind of thing. |
| Jul 29, 2015 at 12:08 | comment | added | mikeserv | | @terdon - It doesn't read it twice over; it reads it twice concurrently. On a multicore system a single awk will wind up processor-bound, maxing out a single core, but on the same system `sort \| uniq` can work the same data at the same time, each on its own core. It's why you'll see stuff like 187% CPU usage or whatever after a pipeline finishes; it's what makes pipelines awesome. Also, sort and uniq are both precision tools - they each do a single thing, and do it well - while awk's job is to interpret whatever script it's handed, and so it isn't likely to be as efficiently tuned as a result. |
| Jul 29, 2015 at 12:04 | comment | added | terdon♦ | | @mikeserv doesn't seem to be, no. At least `mount \| grep tmpfs` does not show `/tmp`. You think that's what makes it slower? Even theoretically, though, why would reading all the data twice ever be faster than reading it once? |
| Jul 29, 2015 at 12:01 | comment | added | mikeserv | | @terdon - is your `/tmp` a tmpfs? |
| Jul 29, 2015 at 11:40 | comment | added | terdon♦ | | I just tested using a 287M file (the content of the OP's example repeated a few thousand times) and my awk approach took 23 seconds while the sort/uniq took 30. Here, both sort and uniq need to go over the entire file. Why would you expect it to be faster? Also, limiting the number of characters compared is clever, but it assumes that the number of characters per line is fixed, right? It will fail if one of the fields isn't empty, or has a value with more than one digit. |
| Jul 29, 2015 at 11:20 | history | edited | Costas | CC BY-SA 3.0 | added 4 characters in body |
| Jul 29, 2015 at 11:13 | history | answered | Costas | CC BY-SA 3.0 | |
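
The awk array approach the comments debate (terdon's single hash lookup per line at 13:01, and the "read the array in the `END` loop" shape Costas describes at 13:09) is only described, never quoted. A minimal sketch of both shapes, assuming comma-separated input, field 18 as the deduplication key, and a hypothetical file name `data.csv`:

```sh
# Variant 1: single pass, one hash lookup per line; keep the first line
# seen for each distinct value of $18.
awk -F, '!seen[$18]++' data.csv > deduped.csv

# Variant 2: store one line per key, then print the array in the END block
# (a simplified one-array version of the array-plus-END-loop shape Costas
# mentions); this keeps the last line per key and does not preserve order.
awk -F, '{line[$18] = $0} END {for (k in line) print line[k]}' data.csv > deduped2.csv
```

Either way, each input line costs one hash lookup, so the array's size does not change the per-line cost; the trade-off is memory, since a huge number of distinct keys can exhaust it, which is the failure mode Costas points out at 12:46.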
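
The `-s39`/`-w4` numbers mentioned at 13:37 correspond to uniq's skip-chars and check-chars options. Costas's actual command is not reproduced in this timeline, so the following is only a hedged reconstruction of the idea, assuming the key occupies a fixed window at characters 40-43 of every line and that the lines contain no blanks:

```sh
# Sort on characters 40-43 of each line so equal keys become adjacent
# (with no blanks in the data, the whole line is "field 1" for sort),
# then let uniq skip the first 39 characters (-s39) and compare only the
# next 4 (-w4). This relies on a fixed-width layout, which is exactly the
# limitation terdon raises at 11:40.
sort -k1.40,1.43 data.csv | uniq -s39 -w4 > deduped.csv
```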
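
The pipeline mikeserv sketches at 13:27 is abbreviated in the comment (`sort ... | sort -nmut, -k18,18`), so the version below spells it out under the assumption that both stages key numerically on comma-separated field 18; his exact first-stage flags are not shown, and the `count=` on the dd line is an addition so the data generator terminates:

```sh
# Roughly the test data mikeserv describes: unsigned decimal integers from
# od, a comma appended after each run of digits, spaces stripped.
dd bs=1M count=40 </dev/urandom |
    od -tu -w44 |
    sed 's/[0-9]\{1,5\}/&,/g;s/ *//g' > data.csv

# The first sort orders the file numerically on field 18; the second runs
# in merge mode (-m), so it is nearly a pass-through, and -u keeps only one
# line per distinct value of the sort key (field 18), not per whole line.
sort -t, -n -k18,18 data.csv | sort -t, -n -m -u -k18,18 > deduped.csv
```

A single `sort -t, -n -u -k18,18 data.csv` should give the same result in one pass; the two-stage form mirrors the sort-then-deduplicate pipeline the thread is comparing against.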
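
The timing setup itself is only alluded to in the comments (tmpfs on `/tmp`, output discarded to `/dev/null`, wall-clock comparisons). A rough sketch of that harness, where the mount size, file name, and the use of bash's `time` keyword are all assumptions:

```sh
# Check whether /tmp is already a tmpfs (the question raised at 12:01/12:04).
mount | grep tmpfs

# If not, mount one for the test run (requires root; this shadows the
# existing contents of /tmp until it is unmounted).
sudo mount -t tmpfs -o size=2G tmpfs /tmp

# Time both approaches against the same file, discarding output as the
# commenters did, so only the read/compute cost is measured.
time awk -F, '!seen[$18]++' /tmp/data.csv > /dev/null
time sort /tmp/data.csv | uniq > /dev/null
```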