GNU sort command does not sort words of different lengths with common prefixes correctly when using field delimiter

Question

The GNU sort command is not sorting words of different lengths with common prefixes correctly for me, but only when using a field delimiter to sort on one of multiple fields.

Here is the correct, expected sort behavior without using field delimiters:

$ cat /tmp/test0
b
c
ant
a
bcd
bc
cn

$ sort /tmp/test0
a
ant
b
bc
bcd
c
cn

Note that, for all words with a common string prefix, the shorter word sorts before the longer word. E.g. a is before ant, b is before bc is before bcd, etc. This is the accepted, standard way that English strings are sorted, e.g. in a dictionary.

However, this sorting behavior changes when you are attempting to sort tabular data (such as a CSV file), and sorting on one of the columns. Here's what that looks like:

$ cat /tmp/test1
b,foo
c,bar
ant,baz
a,foo
bcd,ty
bc,pe
cn,cn

$ sort /tmp/test1 -t, -k1
a,foo
ant,baz
bcd,ty
bc,pe
b,foo
c,bar
cn,cn

Note that the words with a common prefix of a and c are still being handled correctly, but strings with a common prefix of b are not; bcd sorts before bc sorts before b, all of which is incorrect! This behavior is stable; you always get the same result. I'm experiencing this exact same issue on a much larger CSV file and the sorting errors there are deterministically random, if that makes sense.

I've tried various flags to sort and none work to correct this behavior. -d and -s don't work. This is on GNU coreutils 9.4 sort for what it's worth.

So, is this just a bug with the sort command? Am I somehow using it incorrectly? Is there anything better I can do that will dictionary sort the CSV by words in the first column?

sort sorts according to the locale's collation order, and in most locales that is similar to what you'll find in a dictionary, where whitespace and punctuation are ignored in a first pass (they have IGNORE as their primary weight). Try the C locale if you expect an order based on byte value (LC_ALL=C sort ...), or use sort -t, -k1,1 -k2,2 if you want to sort on the first comma-separated field as the first key and the second one as the second key. -k1 (sort on the portion of the line starting at the first field, so the whole line) is the default, so it is pointless.
– Stéphane Chazelas, Commented May 24, 2024 at 16:38
To sort CSVs, have a look at the mlr (Miller) utility. sort can only sort the simplest CSVs (no header, no quoting, no newlines in cells).
– Stéphane Chazelas, Commented May 24, 2024 at 16:40
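
A minimal sketch of the two-key form suggested in the comment above. The file name /tmp/sort_demo2.csv and its rows are made up for illustration, and LC_ALL=C is set only so the output does not depend on which locales are installed:

```shell
# Hypothetical sample data: two rows share the first field "b".
printf '%s\n' 'b,foo' 'bc,pe' 'b,bar' 'a,foo' 'bcd,ty' > /tmp/sort_demo2.csv

# -k1,1 limits the first key to field 1 only;
# -k2,2 breaks ties between equal first fields using field 2 only.
LC_ALL=C sort -t, -k1,1 -k2,2 /tmp/sort_demo2.csv
```

With this input the rows should come out as a,foo, b,bar, b,foo, bc,pe, bcd,ty: the two b rows tie on the first key and are ordered by the second.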

Pablo A · Accepted Answer · 2024-05-24 17:53:12Z

16

It's caused by the way your current locale defines collation (sorting) rules, combined with the fact that -kN uses everything from field N to the end of the line when comparing lines, not just field N. (Some locales will sort bc,pe before b,foo because they ignore the commas.)

Use -k1,1 to only use that specific field, or specify the "C" locale and you should get the expected results:

$ LC_ALL=en_US.utf8 sort -t, -k1 test.txt
a,foo
ant,baz
bcd,ty
bc,pe
b,foo
c,bar
cn,cn

$ LC_ALL=en_US.utf8 sort -t, -k1,1 test.txt
a,foo
ant,baz
b,foo
bc,pe
bcd,ty
c,bar
cn,cn

$ LC_ALL=C sort -t, -k1 test.txt
a,foo
ant,baz
b,foo
bc,pe
bcd,ty
c,bar
cn,cn

edited May 24, 2024 at 17:53

Pablo A

answered May 24, 2024 at 16:47

Shawn

Thank you for that, I can confirm it works with LC_ALL=C. However, the part I still don't understand (and what confused me) is why the sorting order over the same set of strings is different depending on whether one is a column in tabular data. If the bare non-tabular file had sorted as bcd, bc, b as well this would have been a lot simpler to figure out. It's the inconsistency that's strange, and I can see potentially tripping up a lot of people.

– Ben McIlwain, Commented May 24, 2024 at 16:58

It's because , is sorted after c with LC_ALL=en_US.utf8 and before with LC_ALL=C.

– ctx, Commented May 24, 2024 at 17:09

@BenMcIlwain I keep forgetting that -k N uses field N to the end of the line. Try -k 1,1 instead and you also get the expected sort order.

– Shawn, Commented May 24, 2024 at 17:13

Oh my god, that's the real answer then. -k 1,1. Yeeesh! LC_ALL=C only works by accident because of the selected delimiter's byte value, but it wouldn't work with other delimiters; -k 1,1 would work with all.

– Ben McIlwain, Commented May 24, 2024 at 17:25
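
To illustrate the point about delimiters, here is a small sketch using | instead of , (the file name and rows are hypothetical). In the C locale | has byte value 0x7C, which is greater than any lowercase ASCII letter, so the whole-line key no longer gives the expected order, while the field-limited key does:

```shell
# Hypothetical sample data with '|' as the field separator.
printf '%s\n' 'b|foo' 'bcd|ty' 'bc|pe' > /tmp/sort_demo_pipe.txt

# Key runs from field 1 to end of line, so '|' takes part in the comparison.
# In the C locale '|' compares greater than 'c' and 'd'.
LC_ALL=C sort -t'|' -k1 /tmp/sort_demo_pipe.txt

# Key is field 1 only, so the delimiter never takes part in the comparison.
LC_ALL=C sort -t'|' -k1,1 /tmp/sort_demo_pipe.txt
```

The first command should print bcd|ty, bc|pe, b|foo, and the second b|foo, bc|pe, bcd|ty.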

Ben McIlwain · Accepted Answer · 2024-05-24 17:42:09Z

14

The answer turns out to be that, despite some example usages online to the contrary, the -k flag takes a start position and an optional stop position, so to sort on the first field alone it needs to be written as -k 1,1. With -k 1 there is no stop field, so the key runs from field 1 through the end of the line. The anomalous sort behavior is therefore caused by how the locale's collation rules treat the , delimiter inside that longer key (per the comments above, punctuation is ignored in a first pass in most locales), not by anything specific to the first field.

Thanks to Stéphane Chazelas's comments above.
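
One way to convince yourself that -k 1 covers the whole line is to compare it against plain sort on the same file. This is only a sketch: the /tmp/sort_demo.* file names are made up, and the data is the sample from the question.

```shell
# Recreate the question's sample data under a hypothetical file name.
printf '%s\n' 'b,foo' 'c,bar' 'ant,baz' 'a,foo' 'bcd,ty' 'bc,pe' 'cn,cn' > /tmp/sort_demo.csv

# With no stop position, the key for -k1 is the entire line, so both
# commands should produce the same order in whatever locale is active.
sort /tmp/sort_demo.csv > /tmp/sort_demo.plain
sort -t, -k1 /tmp/sort_demo.csv > /tmp/sort_demo.k1

# cmp exits 0 (and prints nothing) when the two files are identical.
cmp /tmp/sort_demo.plain /tmp/sort_demo.k1 && echo "same order"
```

If the two outputs match, the difference seen in the question comes from the extra ,field text being part of what is compared, not from -t or -k changing the comparison rules.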

edited May 24, 2024 at 17:42

answered May 24, 2024 at 16:48

Ben McIlwain

ctx · Accepted Answer · 2024-05-24 17:06:48Z

According to Stéphane Chazelas's comment, you are sorting the whole line and not only the first field. With LC_ALL=C, a , sorts before c. The --debug output shows which part of each line is used as the key:

% LC_ALL=en_US.utf8 sort -t, -k1,1  test --debug 
sort: text ordering performed using ‘en_US.utf8’ sorting rules
a,foo
_
_____
ant,baz
___
_______
b,foo
_
_____
bc,pe
__
_____
bcd,ty
___
______
c,bar
_
_____
cn,cn
__
_____
% LC_ALL=en_US.utf8 sort -t, -k1  test --debug 
sort: text ordering performed using ‘en_US.utf8’ sorting rules
a,foo
_____
_____
ant,baz
_______
_______
bcd,ty
______
______
bc,pe
_____
_____
b,foo
_____
_____
c,bar
_____
_____
cn,cn
_____
_____
