What's the best way to sort?

Question

I do text manipulation a lot, and one of the operations I use frequently is sorting - often removing any duplicates as well.

So I normally use the commands sort or sort -u, either from the command line or in scripts, macros, etc., unless I'm working in LibreOffice Writer or Calc, of course (which unfortunately don't have an option to remove duplicates while sorting, as far as I know ;-)

Now I have a plain text file containing a large collection of symbols, emoticons, shapes, lines, non-standard ASCII letters and numbers, etc. with many duplicates.

It was easy to convert them into a text file with one character per line.

However, sorting and removing duplicates apparently is not as simple as one would think:

Using the command sort -u file.txt > file-sorted.txt unfortunately reduces the 2078 lines down to just 359, removing about a thousand unique characters, I guess. I can see that a great many are filtered out in error.

So my conclusion is that the sort -u command is only good for standard alphanumeric characters.

Any ideas and suggestions?

PS: Here is a sample text of 40 characters from the file I'm trying to process:

ღ ❂ ◕ ⊕ Θ o O ♋ ☯ ⊙ ◎ ๑ ☜ ☞ ♨ ☎ ☏ ۩ ۞ ♬ ✖ ɔ ½ ' ‿ ' * ᴗ * ◕ ‿ ◕ ❊ ᴗ ❊ . ᴗ . ᵒ ᴗ

There are only a few duplicates here. The plain sort command processes the text with a few issues but without any loss, whereas sort -u and uniq both produce exactly the same output, cutting it down to 11 characters with many wiped out.

Can you provide file samples? What do you want to achieve? Of course, sorting while also removing duplicates is not easy. The sort command uses the mergesort algorithm, which is a rather fast one.
– mrc02_kr, Commented Jul 11, 2017 at 14:41
Possibly related: Where has my uniq or sort -u line gone, with some unicode characters
– steeldriver, Commented Jul 11, 2017 at 14:49
Do you get the same output as sort if you pipe the file through sort | uniq?
– Alexander, Commented Jul 11, 2017 at 15:53
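
A likely cause of lines vanishing like this is locale collation: in some UTF-8 locales many symbols carry no collation weight, so sort -u and uniq can treat different characters as equal and keep only one of them. The following is a minimal sketch, assuming a sort binary is on the PATH, of forcing bytewise comparison with LC_ALL=C from Python. The function name sort_unique_bytewise is made up for illustration, and whether this is the cause for a given file is an assumption worth checking.

```python
import os
import subprocess


def sort_unique_bytewise(text: str) -> str:
    """Run `sort -u` with LC_ALL=C so lines are compared byte by byte."""
    # Copy the current environment and override only the locale.
    # Assumption: the lost lines come from locale-aware collation.
    env = {**os.environ, "LC_ALL": "C"}
    result = subprocess.run(
        ["sort", "-u"],
        input=text,
        capture_output=True,
        encoding="utf-8",
        env=env,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    sample = "☎\n☏\n☎\n♬\n"
    print(sort_unique_bytewise(sample), end="")
```

The shell equivalent would be LC_ALL=C sort -u file.txt > file-sorted.txt; the resulting order follows UTF-8 byte values rather than any language's alphabet.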

muru · Accepted Answer · 2017-07-12 12:41:38Z


Try using something with proper Unicode support, such as Python:

$ python3 -c 'import sys; print("\n".join(sorted(set(c for l in sys.stdin.readlines() for c in l.split()))))' < bar
'
*
.
O
o
½
ɔ
Θ
۞
۩
๑
ღ
ᴗ
ᵒ
‿
⊕
⊙
◎
◕
☎
☏
☜
☞
☯
♋
♨
♬
✖
❂
❊
$ python3 -c 'import sys; print(len(set(c for l in sys.stdin.readlines() for c in l.split())))' < bar
30
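
For readability, the one-liner above can be unpacked into a small function. This is only a sketch of the same idea (split on whitespace, deduplicate with a set, then sort); the name unique_sorted_tokens is hypothetical and not part of the original answer.

```python
from typing import Iterable, List


def unique_sorted_tokens(lines: Iterable[str]) -> List[str]:
    """Return the distinct whitespace-separated tokens, sorted by code point."""
    tokens = set()
    for line in lines:
        # str.split() with no argument splits on any run of whitespace,
        # so trailing newlines and repeated spaces are ignored.
        tokens.update(line.split())
    return sorted(tokens)


if __name__ == "__main__":
    sample = ["ღ ❂ ◕ ⊕ Θ o O", "◕ ‿ ◕ ❊ ᴗ ❊"]
    result = unique_sorted_tokens(sample)
    print("\n".join(result))
    print(len(result))
```

To use it as a filter, pass sys.stdin as the lines argument, since a file object iterates over its lines.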

edited Jul 12, 2017 at 12:41

answered Jul 12, 2017 at 11:00

muru


Great! But (as an average user ;-) how can I apply this script to a file, please? Do I replace "bar" at the end with the file name? Actually, I intend to make a Nautilus/Nemo script out of this, if I can.

– Sadi, Commented Jul 12, 2017 at 12:08

@Sadi Yes. The python command is reading from stdin, so you can pick your favourite way to send files to the stdin of a command. Or you could loop over arguments in Python itself.

– muru, Commented Jul 12, 2017 at 12:10
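
Looping over arguments in Python itself might look like the sketch below. It assumes the file manager (or the shell) passes the selected file paths as command-line arguments and that the files are UTF-8 encoded. The "-sorted" output naming mirrors file-sorted.txt from the question, and both function names are hypothetical.

```python
import sys
from pathlib import Path
from typing import List


def sorted_unique_tokens(path: Path) -> List[str]:
    """Read a UTF-8 text file and return its distinct tokens, sorted."""
    text = path.read_text(encoding="utf-8")
    return sorted(set(text.split()))


def write_sorted_copy(path: Path) -> Path:
    """Write the result next to the input, e.g. file.txt -> file-sorted.txt."""
    out_path = path.with_name(f"{path.stem}-sorted{path.suffix}")
    tokens = sorted_unique_tokens(path)
    out_path.write_text("\n".join(tokens) + "\n", encoding="utf-8")
    return out_path


if __name__ == "__main__":
    # Process every file named on the command line; no arguments means no work.
    for name in sys.argv[1:]:
        print(write_sorted_copy(Path(name)))
```

Saved as an executable script, it could be run as python3 script.py file.txt, leaving the original file untouched.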
I should have tried it first ;-) It works! Thanks a lot.

– Sadi, Commented Jul 12, 2017 at 12:12

@Sadi Oh, yeah, the output is unsorted. Note that I just used the set data structure to get unique elements, I didn't bother with sorting. I updated the first command to do some sorting, but I'm not sure if the output is what you want.

– muru, Commented Jul 12, 2017 at 12:42
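
On the question of ordering: Python's sorted() compares strings by Unicode code point, which is why O comes before o and ½ before Θ in the answer's output. A small sketch for inspecting that order follows; the helper name with_code_points is illustrative only.

```python
from typing import Iterable, List, Tuple


def with_code_points(chars: Iterable[str]) -> List[Tuple[str, str]]:
    """Sort single characters and pair each with its U+XXXX code point."""
    return [(c, f"U+{ord(c):04X}") for c in sorted(chars)]


if __name__ == "__main__":
    for char, code_point in with_code_points(["o", "Θ", "½", "O"]):
        print(char, code_point)
```

If a language-aware order is wanted instead, locale.strxfrm can serve as the sort key, though the result then depends on the active locale.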

@Sadi If splitting the lines is not needed, then it's considerably simpler: python3 -c 'import sys; print("".join(sorted(set(sys.stdin))))'

– muru, Commented Jul 12, 2017 at 13:15
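
One caveat when sorting raw lines: each line read from sys.stdin keeps its trailing newline, so a final line without one would count as distinct from an otherwise identical line and would be glued to its neighbour on output. Here is a sketch that strips line endings first, under the assumption that blank lines should be dropped (the function name is hypothetical):

```python
from typing import Iterable, List


def unique_sorted_lines(lines: Iterable[str]) -> List[str]:
    """Deduplicate whole lines, ignoring line endings and blank lines."""
    stripped = (line.rstrip("\r\n") for line in lines)
    return sorted({line for line in stripped if line})


if __name__ == "__main__":
    # The last element has no newline, as in a file lacking a final newline.
    print("\n".join(unique_sorted_lines(["b\n", "a\n", "b"])))
```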
