
I have a file that looks like this:

1254543534523233434
3453453454323233434
2342342343223233535
0909909092324243535

Is there a way / command in bash to remove duplicates from the file above, based on a specific substring, without changing their order in the output?

i.e., with the substring ${line:11:8} as the key:

1254543534523233434
2342342343223233535
0909909092324243535
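
For example, bash's substring expansion is zero-based, so on the first input line ${line:11:8} extracts the 8 characters used as the key:

line=1254543534523233434
echo "${line:11:8}"    # prints 23233434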

I know that:

sort -u : sorts the lines, then removes duplicate lines
sort -kx,x -u : the same, keyed on a field
cat filein | uniq : requires the input to be sorted already, because uniq only removes adjacent duplicate lines
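
For example, on the sample file above (assuming it is saved as file), sort -u keeps all four lines, because the whole lines are distinct, but it destroys the original order:

sort -u file

0909909092324243535
1254543534523233434
2342342343223233535
3453453454323233434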

I'm trying to figure out whether there's a native Linux solution, without having to resort to Perl for it. Thank you in advance.

3 Comments
  • This is not an exact duplicate. It has the additional constraint of comparing lines based only on a substring, but printing the complete line. However, the answer should be easily extensible to awk '!seen[substr($0, 11, 8)]++' file.txt. Commented Aug 22, 2016 at 9:56
  • This isn't a duplicate; the referenced answers avoid sorting, but they don't preserve order. Commented Nov 18, 2023 at 18:07
  • Should not be closed; not a duplicate in any way; order must be preserved. Commented Nov 18, 2023 at 18:13

1 Answer


You can use awk without any need for sorting:

awk '!uniq[substr($0, 12, 8)]++' file

1254543534523233434
2342342343223233535
0909909092324243535
  • Since awk indexes strings from 1, you need substr($0, 12, 8) to get the desired 8-character substring starting at the 12th position (the same characters as ${line:11:8} in bash).
  • uniq is an associative array keyed by the substring extracted with substr.
  • !uniq[...]++ tests the count before incrementing it, so it is true (and the line is printed) only the first time a given substring is seen.
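
If you want a shell-only alternative, a minimal sketch using a bash 4+ associative array (assuming the input is in a file named file) would be the following; the awk one-liner above is simpler and usually faster:

declare -A seen
while IFS= read -r line; do
    key=${line:11:8}                  # same 8-character substring as in the question
    if [[ -z ${seen[$key]} ]]; then   # print only the first line carrying this key
        seen[$key]=1
        printf '%s\n' "$line"
    fi
done < file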

1 Comment

This worked perfectly, thank you.
