Original data (abc.csv):

8|AAAAA_001|0|
8|AAAAA_002|0|
8|AAAAA_003|0|
8|AAAAA_004|0|
8|AAAAA_005|0|AAAAA_005
8|AAAAA_006|0|
9|BBBBB_001|0|
9|BBBBB_002|0|
9|BBBBB_003|0|BBBBB_003
9|BBBBB_004|0|
9|BBBBB_005|0|
9|BBBBB_901|0|
10|CCCCC_001|0|
10|CCCCC_002|0|
10|CCCCC_003|0|
10|CCCCC_004|0|

Expected result:

8|AAAAA|0|AAAAA
9|BBBBB|0|BBBBB
10|CCCCC|0

Any ideas? Thanks.

This is what I have tried so far, but it still produces duplicate rows when the last field contains data:

cat abc.csv | awk 'BEGIN{FS="|";OFS="|"}
                   {print $1,substr($2,1,5),$3,substr($4,1,5)}' |
  sort -t "|" -k 2 | uniq > abc_final.csv
  • One for awk, I think. Please clarify the requirement. [a] Are the numerics part of the data? [b] Are the fields to be unique always the same ones (you don't want some variable-length match where the longest wins)? [c] Do you want the longest line from a match? Your examples could also just be the last line of a set. [d] Will the file be sorted as shown? uniq needs sorted data, whereas awk can find unique keys that are widely separated in the file. Commented Nov 10, 2020 at 10:11
  • [a] $2 and $3 are CHAR_NUMBER (e.g. AAAAA_001). [b] I always need the longer one, because the unwanted data should be NULL. [d] The file was previously sorted as shown; when $1 is duplicated, the longest result is always on the second line. Commented Nov 10, 2020 at 10:17
  • We need two samples: one with the original data and one with the expected output. Only two, no more than two. Also, you need to state unambiguously in the question which fields are to be compared. Commented Nov 10, 2020 at 10:27
  • What is the expected result given the data that you present at the end of your question? I note that neither of the lines with Hello on them has duplicated 1st and 2nd fields. In fact, only the first and second lines are duplicated. Commented Nov 10, 2020 at 13:15
  • Thanks all, I have edited my question to show the original data and the expected output. Commented Nov 11, 2020 at 1:34

1 Answer

Assuming GNU sort, you could do something like:

< abc.csv awk -F '|' -v OFS='|' '
  {print $1, substr($2, 1, 5), $3, substr($4, 1, 5)}' |
  sort -t '|' -k 2,2 -k4,4r | sort -t '|' -muk2,2

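That is, use sort -mu instead of uniq: sort -u can deduplicate on a key chosen with -k (here the second field) rather than on the whole line. The first sort orders each group by the fourth field in reverse, so the line with a non-empty fourth field comes first and, with GNU sort, is the one that -mu keeps.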
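
As one of the comments on the question points out, awk can also find matching keys that are not adjacent in the file, so the input would not need to be sorted at all. The following is only a sketch of that idea, not part of the pipeline above: the function name collapse_keys and the array names are made up for illustration, and it assumes the key is always the first five characters of the second field and that any non-empty fourth field should win over an empty one.

```shell
# Hypothetical helper (name chosen for illustration): reads pipe-delimited
# lines on stdin and prints one line per 5-character key, in first-seen order.
collapse_keys() {
  awk -F '|' -v OFS='|' '
    {
      key = substr($2, 1, 5)            # e.g. AAAAA_005 -> AAAAA
      val = substr($4, 1, 5)            # stays empty when $4 is empty
      if (!(key in first)) {            # first time this key is seen
        order[++n] = key                # remember the order of appearance
        first[key] = $1
        third[key] = $3
      }
      if (val != "") fourth[key] = val  # a non-empty $4 wins over an empty one
    }
    END {
      for (i = 1; i <= n; i++) {
        k = order[i]
        print first[k], k, third[k], fourth[k]
      }
    }'
}

# Small inline sample so the example does not depend on abc.csv existing.
printf '%s\n' '8|AAAAA_001|0|' '8|AAAAA_005|0|AAAAA_005' '10|CCCCC_001|0|' |
  collapse_keys
# prints:
# 8|AAAAA|0|AAAAA
# 10|CCCCC|0|
```

On the real file this would be run as collapse_keys < abc.csv > abc_final.csv. Unlike the sort-based pipeline, the output keeps the order in which keys first appear rather than sorting by the second field.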