I am looking for a shell command that transforms a file in this format:
hello 32
hello 67
hi 2
ho 1212
ho 1390
ho 3000
into this format (deduplicate by keeping only the last row of each "group"):
hello 67
hi 2
ho 3000
At the moment I am using this Python/pandas snippet:
import pandas as pd

# read the two tab-separated columns
df = pd.read_csv(self.input().path, sep='\t', names=('id', 'val'))
# keep only the last row of each 'id' group
# how to replace this logic with shell commands?
surface = df.drop_duplicates(subset='id', keep='last')
with self.output().open('w') as output:
    surface.to_csv(output, sep='\t', columns=('id', 'val'))
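One shell equivalent of this logic is the tac | sort pipeline benchmarked in the update below (just a sketch, assuming GNU coreutils; input.tsv and output.tsv are placeholder names):

# reverse the file so the last occurrence of each id comes first,
# then sort on the first field only; with -u, GNU sort keeps only the
# first line of each run of equal keys, i.e. the last line per id in
# the original file (the output ends up sorted by id, not in file order)
tac input.tsv | sort -k1,1 -u > output.tsv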
Update: Thanks for the great answers. Here are some benchmarks:
The input file is 246M and contains 8,583,313 lines. Order does not matter. The first column has a fixed width of 9 characters.
Example of the input file:
000000027 20131017023259.0 00
000000027 20131017023259.0 11
000000035 20130827104320.0 01
000000035 20130827104320.0 04
000000043 20120127083412.0 01
...
command                       time          space complexity
tac .. | sort -k1,1 -u        27.43682s   O(log(n))
Python/Pandas                 11.76063s   O(n)
awk '{c[$1]=$0;} END{for(...  11.72060s   O(n)
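The awk command in the table is truncated; the idiom behind it is presumably something along these lines (a sketch, not necessarily the exact command that was benchmarked; input.tsv and output.tsv are placeholders):

# c[$1] is overwritten on every line, so after the whole file has been
# read the array holds the last line seen for each value of the first
# field; the END block prints one line per key, in arbitrary order
awk '{c[$1]=$0} END{for (k in c) print c[k]}' input.tsv > output.tsv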
Since the first column has a fixed length, uniq -w can also be used:
tac {input} | uniq -w 9       3.25484s    O(1)
How does tac {input} | uniq -w 9 work? uniq -w N will only consider the first N chars of each line when comparing. For my local file I had a 9 char ID in the first column, so uniq -w 9. uniq -w 5 should work. I think your and Mikel's answers are better, since they do not make an assumption about the number of chars in the first column. However, if the input follows such a constraint, then uniq -w is the fastest.
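To spell out why that pipeline works (a sketch; input.tsv and output.tsv are placeholders, and it assumes lines sharing an ID are adjacent in the file, as in the sample above, since uniq only collapses adjacent matching lines):

# -w 9 restricts uniq's comparison to the first 9 characters (the fixed-width ID);
# tac reverses the file first, so the line uniq keeps for each ID
# is the last one from the original file
tac input.tsv | uniq -w 9 > output.tsv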