Remove first line of duplicated lines on first column

Question

I have a large csv file with a structure similar to this:

334050049049426,2018-11-06T20:21:56.591Z,xxx,gdl-qns28-1540279057144
334050049049426,2018-11-06T21:32:47.431Z,xxx,gdl-qns19-1540278993723
334090015032064,2018-11-06T22:22:31.247Z,xxx,gdl-qns15-1540279009813
334090015032064,2018-11-07T01:44:11.442Z,xxx,gdl-qns25-1540279437614
334090015032064,2018-11-07T03:57:18.911Z,xxx,gdl-qns28-1540279710160
334050069888299,2018-11-07T03:32:12.899Z,xxx,gdl-qns29-1540279367769
334050069888299,2018-11-07T03:58:15.475Z,xxx,mgc-qns20-1540281468455

I need to remove the first line of each group of lines that share the same value in the first column. In the example above, lines 1, 3 and 6 need to be removed.

Better use a tool designed to process csv, not the shell directly.
– user232326, Commented Nov 2, 2018 at 5:19
Are duplicated lines always consecutive? Are there always at least two lines with the same first field? (in that case awk -F, 'seen[$1]++' would be enough).
– Stéphane Chazelas, Commented Nov 2, 2018 at 7:57
Just to clarify: you want to keep line 4 even though it's a duplicate?
– glenn jackman, Commented Nov 2, 2018 at 14:31
I want to keep all duplicates, except the first one found, so for example if the first column has 4 duplicated values, take out the first row of that group and leave the remaining 3. By the way, all duplicates are always consecutive and there will be at least two duplicated lines with the same first field.
– Ian, Commented Nov 2, 2018 at 15:47

αғsнιη · Accepted Answer · 2018-11-02 07:52:39Z

Try the awk command below if no first-column value is unique, i.e. every value occurs on at least two lines:

awk -F, 'pre==$1 { print; next }{ pre=$1 }' infile
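As a quick check, the one-liner above can be run against the sample rows from the question. This is only a sketch: the temporary file created with mktemp and the variable names infile and result are assumptions made for the demonstration, not part of the original answer.

```shell
# Write the seven sample rows from the question to a temporary file.
infile=$(mktemp)
cat > "$infile" <<'EOF'
334050049049426,2018-11-06T20:21:56.591Z,xxx,gdl-qns28-1540279057144
334050049049426,2018-11-06T21:32:47.431Z,xxx,gdl-qns19-1540278993723
334090015032064,2018-11-06T22:22:31.247Z,xxx,gdl-qns15-1540279009813
334090015032064,2018-11-07T01:44:11.442Z,xxx,gdl-qns25-1540279437614
334090015032064,2018-11-07T03:57:18.911Z,xxx,gdl-qns28-1540279710160
334050069888299,2018-11-07T03:32:12.899Z,xxx,gdl-qns29-1540279367769
334050069888299,2018-11-07T03:58:15.475Z,xxx,mgc-qns20-1540281468455
EOF

# Print a line only when its first field equals the previous line's first field,
# so the first line of every group is dropped.
result=$(awk -F, 'pre==$1 { print; next }{ pre=$1 }' "$infile")
printf '%s\n' "$result"
rm -f "$infile"
```

Rows 2, 4, 5 and 7 of the sample are printed, which are the rows the question asks to keep.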

Or use the following in the general case, where some first-column values may occur only once:

awk -F, 'pre==$1 { print; is_uniq=0; next }
                 # the current and previous lines share the same first column:
                 # print the line and clear is_uniq, since a duplicate was found

         is_uniq { print temp }
                 # the previous line (saved in temp) turned out to be unique
                 # on its first column, so print it

                 { pre=$1; temp=$0; is_uniq=1 }
                 # save the first column in pre and the whole line in temp, and
                 # set is_uniq=1 (assume the line is unique until a duplicate follows)

END{ if(is_uniq) print temp }' infile
    # if the last line of the input is unique, print it as well

The same script without comments:

awk -F, 'pre==$1 { print; is_uniq=0; next }
         is_uniq { print temp }
                 { pre=$1; temp=$0; is_uniq=1 }
END{ if(is_uniq) print temp }' infile
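To see how the general script treats keys that occur only once, it can be run on a small file that mixes unique and repeated keys. The file contents, the mktemp temporary file and the variable names below are assumptions chosen for illustration only.

```shell
# Sample input: keys 1 and 5 are unique, keys 2, 3 and 4 are repeated.
infile=$(mktemp)
printf '%s\n' 1,a 2,b 2,c 3,d 3,e 3,f 4,g 4,h 5,i > "$infile"

# Same rules as the script above: print repeats immediately, and print a
# buffered line once the next line (or end of input) shows it was unique.
result=$(awk -F, 'pre==$1 { print; is_uniq=0; next }
         is_uniq { print temp }
                 { pre=$1; temp=$0; is_uniq=1 }
END{ if(is_uniq) print temp }' "$infile")
printf '%s\n' "$result"
rm -f "$infile"
```

The unique lines 1,a and 5,i are kept, while 2,b, 3,d and 4,g (the first line of each repeated group) are dropped.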

Note: this assumes the input file infile is sorted on its first field. If it is not, you will need to pass a sorted copy of the file to awk with

awk ... <(sort -t, -k1,1 infile)
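A minimal sketch of the sorting step, using a pipe rather than process substitution so that it also runs in shells without <(...). The -s (stable) option is an addition here, not part of the original answer; it is available in GNU, BSD and BusyBox sort but is not required by POSIX, and it keeps lines with the same first field in their input order so that the line dropped from each group is the one that came first in the file. The sample data and variable names are assumptions.

```shell
# Unsorted sample input: the two lines for key 2 and key 3 are not adjacent.
infile=$(mktemp)
printf '%s\n' 2,b 1,a 3,d 2,c 3,e > "$infile"

# Sort on the first field only (stable), then apply the general awk script.
result=$(LC_ALL=C sort -s -t, -k1,1 "$infile" |
  awk -F, 'pre==$1 { print; is_uniq=0; next }
           is_uniq { print temp }
                   { pre=$1; temp=$0; is_uniq=1 }
  END{ if(is_uniq) print temp }')
printf '%s\n' "$result"
rm -f "$infile"
```

The output follows the sorted order (1,a then 2,c then 3,e), not the order of the original file.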

There is no need to do any check on the END. If the line is repeated, print it. If it is uniq, there are no more lines to make it repeated, so print anyway.
– user232326, Commented Nov 2, 2018 at 8:53
A sort will change the original ordering of the file (that may be needed or not depending on use).
– user232326, Commented Nov 2, 2018 at 8:54

score 0 · Accepted Answer · 2018-11-02 08:51:47Z

Assuming the csv has a well-behaved format (no commas or newlines inside quoted fields, no escaped double quotes (""), etc.) you can use this:

awk -F ',' 'NR==FNR{seen1[$1]++;next}
            seen1[$1]==1||seen2[$1]++' infile infile

The only way to know whether a first field is repeated anywhere in the file is to count how many times it occurs. That is done with seen1 during the first scan of the file. Then, during the second scan, a line is printed if its first field has a count of 1 (no repeats) or if that field has already been seen in this second scan (tracked with seen2).
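The two-pass idea can be tried on input whose duplicates are not consecutive. In this sketch the matching lines are printed unchanged; the sample data, the mktemp temporary file and the variable names are assumptions for the demonstration.

```shell
# Unsorted sample: key 2 occurs three times, key 3 twice, key 1 once.
infile=$(mktemp)
printf '%s\n' 2,b 1,a 3,d 2,c 3,e 2,x > "$infile"

# First pass (NR==FNR): count occurrences of each first field in seen1.
# Second pass: print lines whose key is unique, or whose key has already
# appeared earlier in this pass (seen2), which skips the first occurrence.
result=$(awk -F, 'NR==FNR { seen1[$1]++; next }
                  seen1[$1]==1 || seen2[$1]++' "$infile" "$infile")
printf '%s\n' "$result"
rm -f "$infile"
```

Because the file is read twice instead of being sorted, the surviving lines keep their original relative order (1,a, 2,c, 3,e, 2,x).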

If the file is sorted by the first field, use @devWeek's solution.

glenn jackman · Accepted Answer · 2018-11-02 14:43:52Z

$ cat file
1,a
2,b
2,c
3,d
3,e
3,f
4,g
4,h
5,i

We want to remove the "2,b", "3,d" and "4,g" lines:

perl -F, -anE '
    push $lines{$F[0]}->@*, $_ 
  } END { 
    for $key (sort keys %lines) {
        shift $lines{$key}->@* if (scalar($lines{$key}->@*) > 1); # remove the first
        print join "", $lines{$key}->@*;
    }
' file

1,a
2,c
3,e
3,f
4,h
5,i
