I do text manipulation a lot, and one of the operations I use frequently is sorting - often removing any duplicates as well.

So I normally use the commands sort or sort -u, either from the command line or in scripts, macros, etc., unless I'm working in LibreOffice Writer or Calc, of course (which unfortunately have no option to remove duplicates while sorting, as far as I know ;-)

Now I have a plain text file containing a large collection of symbols, emoticons, shapes, lines, non-standard ASCII letters and numbers, etc. with many duplicates.

It was easy to convert them into a text file with one character per line.

However, sorting and removing duplicates apparently is not as simple as one would think:

Running sort -u file.txt > file-sorted.txt unfortunately reduces the 2078 lines to just 359, removing about a thousand unique characters, I guess. I can see that a great many are filtered out in error.

So my conclusion is that the sort -u command is only good for standard alphanumeric characters.

Any ideas and suggestions?

PS: Here is a sample text of 40 characters from the file I'm trying to process:

ღ ❂ ◕ ⊕ Θ o O ♋ ☯ ⊙ ◎ ๑ ☜ ☞ ♨ ☎ ☏ ۩ ۞ ♬ ✖ ɔ ½ ' ‿ ' * ᴗ * ◕ ‿ ◕ ❊ ᴗ ❊ . ᴗ . ᵒ ᴗ 

There are only a few duplicates here. The plain sort command processes the text with a few issues but without any loss, whereas sort -u and uniq both produce exactly the same output, cutting it down to 11 characters with many wiped out.

  • Can you provide file samples? What do you want to achieve? Of course, sorting while also removing duplicates is not easy. The sort command uses a merge sort algorithm, which is a rather fast one. Commented Jul 11, 2017 at 14:41
  • Possibly related: Where has my uniq or sort -u line gone, with some unicode characters Commented Jul 11, 2017 at 14:49
  • Do you get the same output as sort if you pipe the file through sort | uniq? Commented Jul 11, 2017 at 15:53
  • @mrc02_kr Thanks; please see my PS note above. Commented Jul 12, 2017 at 10:39
  • @Alexander Thanks; please see my PS note above. Commented Jul 12, 2017 at 10:40

1 Answer

Try using something with proper Unicode support, such as Python:

$ python3 -c 'import sys; print("\n".join(sorted(set(c for l in sys.stdin.readlines() for c in l.split()))))' < bar
'
*
.
O
o
½
ɔ
Θ
۞
۩
๑
ღ
ᴗ
ᵒ
‿
⊕
⊙
◎
◕
☎
☏
☜
☞
☯
♋
♨
♬
✖
❂
❊
$ python3 -c 'import sys; print(len(set(c for l in sys.stdin.readlines() for c in l.split())))' < bar
30
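
For readability, the one-liner above can be unrolled into a small function. This is only a minimal sketch: the name unique_sorted_tokens is made up here, and the sample string is a fragment of the text from the question. Note that Python's sorted() orders strings by Unicode code point rather than by the locale's collation rules, so the order may differ from what sort prints.

```python
def unique_sorted_tokens(lines):
    """Return the distinct whitespace-separated tokens, sorted by code point."""
    tokens = set()
    for line in lines:
        # str.split() with no argument splits on any run of whitespace
        # and drops empty strings, so blank lines contribute nothing.
        tokens.update(line.split())
    return sorted(tokens)


# Fragment of the sample from the question: 6 tokens, 4 of them distinct.
sample = ["◕ ‿ ◕ ❊ ᴗ ❊"]
result = unique_sorted_tokens(sample)
print(len(result))
print("\n".join(result))
```

To process a real file, the same function can be given an open file object (or sys.stdin) instead of the sample list, since iterating over either yields lines.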
  • Great! But (as an average user ;-) how can I apply this script to a file, please? Replace "bar" at the end with the file name? Actually, I intend to make a nautilus/nemo script out of this, if I can. Commented Jul 12, 2017 at 12:08
  • @Sadi Yes. The python command is reading from stdin, so you can pick your favourite way to send files to the stdin of a command. Or you could loop over arguments in Python itself. Commented Jul 12, 2017 at 12:10
  • I should have tried it first ;-) It works! Thanks a lot. Commented Jul 12, 2017 at 12:12
  • @Sadi Oh, yeah, the output is unsorted. Note that I just used the set data structure to get unique elements, I didn't bother with sorting. I updated the first command to do some sorting, but I'm not sure if the output is what you want. Commented Jul 12, 2017 at 12:42
  • @Sadi If splitting the lines is not needed, then it's considerably simpler: python3 -c 'import sys; print("".join(sorted(set(sys.stdin))))' Commented Jul 12, 2017 at 13:15
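
The suggestion in the comments to loop over arguments in Python itself might look roughly like the sketch below. It assumes the files are UTF-8 and already have one item per line; the function names are hypothetical, not part of the answer. Unlike the set(sys.stdin) one-liner, it strips the trailing newline before comparing, so a last line without a newline is not counted as a separate item.

```python
import sys


def unique_sorted_lines(lines):
    """Return the distinct lines, without trailing newlines, sorted by code point."""
    return sorted({line.rstrip("\n") for line in lines})


def main(paths):
    # Each file named on the command line is processed on its own
    # and its unique lines are printed to standard output.
    for path in paths:
        with open(path, encoding="utf-8") as handle:  # UTF-8 is an assumption
            for line in unique_sorted_lines(handle):
                print(line)


if __name__ == "__main__":
    main(sys.argv[1:])
```

Saved under a name of your choosing, it would be run as python3 followed by the script name and one or more file names, with the output redirected to a new file if needed.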
