The GNU sort command is not sorting words of different lengths with common prefixes correctly for me, but only when using a field delimiter to sort on one of multiple fields.
Here is the correct, expected sort behavior without using field delimiters:
$ cat /tmp/test0
b
c
ant
a
bcd
bc
cn
$ sort /tmp/test0
a
ant
b
bc
bcd
c
cn
Note that, for all words with a common string prefix, the shorter word sorts before the longer word. E.g. a is before ant, b is before bc is before bcd, etc. This is the accepted, standard way that English strings are sorted, e.g. in a dictionary.
However, this sorting behavior changes when you are attempting to sort tabular data (such as a CSV file), and sorting on one of the columns. Here's what that looks like:
$ cat /tmp/test1
b,foo
c,bar
ant,baz
a,foo
bcd,ty
bc,pe
cn,cn
$ sort /tmp/test1 -t, -k1
a,foo
ant,baz
bcd,ty
bc,pe
b,foo
c,bar
cn,cn
Note that the words with a common prefix of a and c are still being handled correctly, but strings with a common prefix of b are not: bcd sorts before bc, which sorts before b, all of which is incorrect! This behavior is stable; you always get the same result. I'm experiencing this exact same issue on a much larger CSV file, where the misplaced rows look arbitrary but are the same on every run.
I've tried various flags to sort and none work to correct this behavior. -d and -s don't work. This is on GNU coreutils 9.4 sort for what it's worth.
So, is this just a bug with the sort command? Am I somehow using it incorrectly? Is there anything better I can do that will dictionary sort the CSV by words in the first column?
sort sorts as per the locale's collation order, and in most locales that is similar to what you'll find in a dictionary, where whitespace and punctuation are ignored in a first pass (they have IGNORE as their primary weight). With -k1 the key runs from the first field to the end of the line, so the comma is part of what gets compared; with the comma ignored, bcd,ty compares like bcdty, bc,pe like bcpe and b,foo like bfoo, which is the order you are seeing.
Try in the C locale if you expect an order based on byte value (LC_ALL=C sort ...), or use sort -t, -k1,1 -k2,2 if you want to sort based on the first comma-separated field as the first key and the second one as the second key. -k1 (sort on the portion of the line starting with the first field, so the whole line) is the default, so it is pointless.
To sort CSVs, have a look at the mlr (miller) utility. sort can only sort the most simple CSVs (without header, without quoting, with no newline in cells).
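As a minimal sketch of the suggestion above, the following restricts the key to the first field and forces byte-value comparison. The helper name sort_by_first_field and the scratch file are assumptions made for illustration; only the sort options and the sample rows come from the thread. It assumes a simple CSV with no header, no quoting and no embedded newlines.

```shell
# Hypothetical helper: sort a simple comma-separated file on its first field only.
#   -t,       use the comma as the field separator
#   -k1,1     the key starts and ends at field 1, so the comma and later
#             fields are not part of the comparison
#   LC_ALL=C  compare raw byte values instead of using locale collation rules
sort_by_first_field() {
    LC_ALL=C sort -t, -k1,1 "$1"
}

# Recreate the sample data from the question in a scratch file.
sample=$(mktemp)
printf '%s\n' 'b,foo' 'c,bar' 'ant,baz' 'a,foo' 'bcd,ty' 'bc,pe' 'cn,cn' > "$sample"

sort_by_first_field "$sample"
rm -f "$sample"
```

On the sample data this prints the rows in the order a, ant, b, bc, bcd, c, cn. Note that byte order also places all uppercase ASCII letters before lowercase ones, which may or may not be what you want on a larger file.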