I do text manipulation a lot, and one of the operations I use frequently is sorting - often removing any duplicates as well.
So I normally use the commands sort or sort -u, either from the command line or in scripts, macros, etc. - unless I'm working in LibreOffice Writer or Calc, of course (which unfortunately don't have an option to remove duplicates while sorting, or at least I haven't found one ;-)
Now I have a plain text file containing a large collection of symbols, emoticons, shapes, lines, non-standard ASCII letters and numbers, etc. with many duplicates.
It was easy to convert them into a text file with one character per line.
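(For what it's worth, something like this does the trick with GNU grep in a UTF-8 locale - symbols.txt is just a placeholder name for the original collection:)

    grep -o . symbols.txt > file.txt    # -o prints each match, i.e. each character, on its own line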
However, sorting and removing duplicates apparently is not as simple as one would think:
Using the command sort -u file.txt > file-sorted.txt unfortunately reduces the 2078 lines down to just 359, removing about a thousand unique characters, I guess - I can see that many of them are filtered out in error.
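(Just to show where those numbers come from, I check the counts with wc -l on the input and the output:)

    wc -l file.txt file-sorted.txt
    #  2078 file.txt
    #   359 file-sorted.txt
    #  2437 total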
So my conclusion is that the sort -u command is only good for standard alphanumeric characters.
Any ideas and suggestions?
PS: Here is a sample text of 40 characters from the file I'm trying to process:
ღ ❂ ◕ ⊕ Θ o O ♋ ☯ ⊙ ◎ ๑ ☜ ☞ ♨ ☎ ☏ ۩ ۞ ♬ ✖ ɔ ½ ' ‿ ' * ᴗ * ◕ ‿ ◕ ❊ ᴗ ❊ . ᴗ . ᵒ ᴗ
There are only a few duplicates here, and while plain sort processes the text with a few ordering quirks but without any loss, both sort -u and uniq produce exactly the same output, cutting it down to just 11 lines with many characters wiped out.
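In case it helps, this is roughly how I compared the three variants on the sample above (sample.txt is just those 40 characters, one per line - the file name is only for illustration):

    sort sample.txt | wc -l             # 40 - reordered, but nothing lost
    sort -u sample.txt | wc -l          # 11 - most characters gone
    sort sample.txt | uniq | wc -l      # 11 - exactly the same result as sort -u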