Linux uniq: how to uniq a list while ignoring a differing remark field

Question

Original data (abc.csv):

8|AAAAA_001|0|
8|AAAAA_002|0|
8|AAAAA_003|0|
8|AAAAA_004|0|
8|AAAAA_005|0|AAAAA_005
8|AAAAA_006|0|
9|BBBBB_001|0|
9|BBBBB_002|0|
9|BBBBB_003|0|BBBBB_003
9|BBBBB_004|0|
9|BBBBB_005|0|
9|BBBBB_901|0|
10|CCCCC_001|0|
10|CCCCC_002|0|
10|CCCCC_003|0|
10|CCCCC_004|0|

Expected result:

8|AAAAA|0|AAAAA
9|BBBBB|0|BBBBB
10|CCCCC|0

Any idea? Thanks

Here is what I have tried, but it still shows duplicate lines when the last field contains data:

cat abc.csv | awk 'BEGIN{FS="|";OFS="|"}
                   {print $1,substr($2,1,5),$3,substr($4,1,5)}' |
  sort -t "|" -k 2 | uniq > abc_final.csv

One for awk, I think. Please clarify the requirement. [a] Are the numerics part of the data? [b] Are the fields to be unique always the same ones (you don't want some variable-length match where the longest wins)? [c] Do you want the longest line from a match? Your examples could also just be the last line of a set. [d] Will the file be sorted as shown? uniq needs sorted data, whereas awk can find unique keys that are widely separated in the file.
– Paul_Pedant, Commented Nov 10, 2020 at 10:11
[a] $2,$3 are CHAR_NUMBER (e.g. AAAAA_001). [b] I always need the longer one, because the unwanted data should be NULL. [d] The file is already sorted as shown; the longest line is always the second one when a duplicate $1 exists.
– wilssssssslam, Commented Nov 10, 2020 at 10:17
We need two samples: one with the original data and one with the expected output. Only two, no more than two. Also, you need to unambiguously state in the question which fields are to be compared.
– Quasímodo, Commented Nov 10, 2020 at 10:27
What is the expected result given the data that you present at the end of your question? I'm noting that neither of the lines with Hello on them has duplicated 1st and 2nd fields. In fact, only the first and second lines are duplicated.
– Kusalananda ♦, Commented Nov 10, 2020 at 13:15
Thanks all, I have edited my question to show the original data and the expected output.
– wilssssssslam, Commented Nov 11, 2020 at 1:34

Stéphane Chazelas · Accepted Answer · 2020-11-11 06:37:35Z

Assuming GNU sort, you could do something like:

< abc.csv awk -F '|' -v OFS='|' '
  {print $1, substr($2, 1, 5), $3, substr($4, 1, 5)}' |
  sort -t '|' -k 2,2 -k4,4r | sort -t '|' -muk2,2

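That is, use sort -mu instead of uniq: with -u, sort compares only the key selected by -k, so you can deduplicate on a portion of the line. The -m (merge) option treats the input as already sorted, and the preceding sort puts the line with the non-empty 4th field first in each group, so that is the one GNU sort keeps.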
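
To see what each stage contributes, here is a minimal sketch that runs the same pipeline on a cut-down sample and keeps the intermediate results. The variable names (sample, stage1, stage2, result) are only for illustration, and it relies on GNU sort keeping the first line of each run of equal keys.

```shell
#!/bin/sh
# Cut-down sample in the same layout as abc.csv (field 4 is the optional remark).
sample='8|AAAAA_001|0|
8|AAAAA_005|0|AAAAA_005
9|BBBBB_003|0|BBBBB_003
9|BBBBB_004|0|
10|CCCCC_001|0|'

# Stage 1: truncate fields 2 and 4 to their first 5 characters.
stage1=$(printf '%s\n' "$sample" |
  awk -F '|' -v OFS='|' '{print $1, substr($2, 1, 5), $3, substr($4, 1, 5)}')

# Stage 2: order by field 2, then by field 4 in reverse, so that within
# each group the line carrying a remark sorts before the empty-remark line.
stage2=$(printf '%s\n' "$stage1" | sort -t '|' -k 2,2 -k 4,4r)

# Stage 3: -m only merges (the input is already sorted); -u with -k 2,2
# keeps one line per distinct field 2 (assumption: GNU sort keeps the first).
result=$(printf '%s\n' "$stage2" | sort -t '|' -m -u -k 2,2)

printf '%s\n' "$result"
```

With this sample the final output is 8|AAAAA|0|AAAAA, 9|BBBBB|0|BBBBB and 10|CCCCC|0|, one per line. Note that the last line keeps its trailing |, unlike the expected result in the question, and that the output is ordered by field 2 rather than by the original line order.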

