I am looking for a shell command that transforms a file in this format:
hello 32
hello 67
hi 2
ho 1212
ho 1390
ho 3000
into this format (deduplicate by keeping only the last row of each "group"):
hello 67
hi 2
ho 3000
At the moment I am using this Python/pandas snippet:
import pandas as pd

# read the two tab-separated columns
df = pd.read_csv(self.input().path, sep='\t', names=('id', 'val'))
# keep only the last row of each 'id' group
# how to replace this logic with shell commands?
surface = df.drop_duplicates(subset='id', keep='last')
with self.output().open('w') as output:
    surface.to_csv(output, sep='\t', columns=('id', 'val'))
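One shell equivalent of this logic is the tac | sort pipeline benchmarked in the update below (just a sketch, assuming GNU coreutils; input.tsv and output.tsv are placeholder names):

# reverse the file so the last occurrence of each id comes first,
# then sort on the first field only; with -u, GNU sort keeps only the
# first line of each run of equal keys, i.e. the last line per id in
# the original file (the output ends up sorted by id, not in file order)
tac input.tsv | sort -k1,1 -u > output.tsv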
Update: Thanks for the great answers. Here are some benchmarks:
The input file is 246M and contains 8,583,313 lines. Order does not matter. The first column has a fixed width of 9 characters.
Example of the input file:
000000027 20131017023259.0 00
000000027 20131017023259.0 11
000000035 20130827104320.0 01
000000035 20130827104320.0 04
000000043 20120127083412.0 01
...
command                       time          space complexity
tac .. | sort -k1,1 -u        27.43682s   O(log(n))
Python/Pandas                 11.76063s   O(n)
awk '{c[$1]=$0;} END{for(...  11.72060s   O(n)
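The awk command in the table is truncated; the idiom behind it is presumably something along these lines (a sketch, not necessarily the exact command that was benchmarked; input.tsv and output.tsv are placeholders):

# c[$1] is overwritten on every line, so after the whole file has been
# read the array holds the last line seen for each value of the first
# field; the END block prints one line per key, in arbitrary order
awk '{c[$1]=$0} END{for (k in c) print c[k]}' input.tsv > output.tsv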
Since the first column has a fixed length, uniq -w can also be used:
tac {input} | uniq -w 9       3.25484s    O(1)
How does tac {input} | uniq -w 9 work? uniq -w N will only consider the first N chars of each line when comparing. For my local file I had a 9 char ID in the first column, so uniq -w 9. uniq -w 5 should work. I think your and Mikel's answers are better, since they do not make an assumption about the number of chars in the first column. However, if the input follows such a constraint, then uniq -w is the fastest.
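To spell out why that pipeline works (a sketch; input.tsv and output.tsv are placeholders, and it assumes lines sharing an ID are adjacent in the file, as in the sample above, since uniq only collapses adjacent matching lines):

# -w 9 restricts uniq's comparison to the first 9 characters (the fixed-width ID);
# tac reverses the file first, so the line uniq keeps for each ID
# is the last one from the original file
tac input.tsv | uniq -w 9 > output.tsv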