
Per the man pages we have the following description of the --numeric-sort option of the sort command.

-n, --numeric-sort
              compare according to string numerical value

I assume, by string numeric value, we mean comparing each string character consecutively by its ASCII value?

The info pages read

‘-n’
‘--numeric-sort’
‘--sort=numeric’
     Sort numerically.  The number begins each line and consists of
     optional blanks, an optional ‘-’ sign, and zero or more digits
     possibly separated by thousands separators, optionally followed by
     a decimal-point character and zero or more digits.  An empty number
     is treated as ‘0’.  The ‘LC_NUMERIC’ locale specifies the
     decimal-point character and thousands separator.  By default a
     blank is a space or a tab, but the ‘LC_CTYPE’ locale can change
     this.

     Comparison is exact; there is no rounding error.

     Neither a leading ‘+’ nor exponential notation is recognized.  To
     compare such strings numerically, use the ‘--general-numeric-sort’
     (‘-g’) option.

After reading both docs, I still do not see an explicit explanation of which collation order is used for the -n option.

How does the --numeric-sort option differ from the default? My naive guess would be that numbers take precedence over letters, but I am not reading this in the documentation.

And which documentation states this explicitly, i.e. where could I have found this info by just looking up the documentation?

  • Try sorting the output of printf %s\\n 111 10 2 22... first time use sort then sort -n (the latter meaning sort by arithmetic value) Commented Jul 30, 2017 at 20:02
  • @don_crissti But, my main source of confusion is when mixed character strings are used. How would you explain the difference in that case? Commented Jul 30, 2017 at 20:10
  • And how does the documentation make that distinction? It should be described in the documentation. Commented Jul 30, 2017 at 20:13
  • It's all in the info page... "An empty number is treated as ‘0’" and "Finally, as a last resort when all keys compare equal, ‘sort’ compares entire lines as if no ordering options other than ‘--reverse’ (‘-r’) were specified." so printf %s\\n b2 a3 | sort -n is the same as printf %s\\n 0b2 0a3 | sort Commented Jul 30, 2017 at 20:16
  • @don_crissti What is an empty number? Commented Jul 30, 2017 at 20:22

2 Answers

5

When you have multi-digit numbers, sort -n considers the entire number; by default the file

3
2
1
20
30

sorts like this:

1
2
20
3
30

which is probably not what you wanted. With -n, you get:

1
2
3
20
30

The numeric sort also deals with negative numbers, decimal points, and thousands separators (as determined by your locale). Trailing "non-number" text is ignored in the numeric comparison. If the line starts with something non-numeric, that line is counted as 0.

More exactly, the logic is like this: the (primary) sort key is an initial numeric string. (That is, "the number begins each line".) This string is defined to consist of optional blanks, an optional minus sign, zero or more digits, and possibly . and , (or whatever your locale uses). Trailing letters don't factor in; they are not part of "the number". If the line does not start with a number, that is treated as an invisible ("empty") number equal to 0. (Or, "a number with zero digits".)
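As a small sketch of those rules (assuming GNU sort and the C locale, so that . is the decimal point; the sample values are made up for illustration):

```shell
# Hypothetical sample lines: an integer, a negative number, a line with
# no leading number (compared as 0), and a decimal.
numeric_sorted=$(printf '%s\n' 5 -3 abc 2.5 | LC_ALL=C sort -n)
printf '%s\n' "$numeric_sorted"
# -3
# abc
# 2.5
# 5
```

The abc line lands between -3 and 2.5 because its "empty number" is compared as 0.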

So, having sorted on "the number" (compare using -k to give a sort key), any lines whose numbers compare equal are then sorted with the default sort. (That is, 1a before 1b, and 1a20 before 1a3.) This last-resort comparison uses the whole line, not the line minus the sort key, which gives some odd behavior in this case: 0cookies sorts before biscuits, because for the secondary sort there is no "invisible 0" added.
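The last-resort comparison can be seen, and switched off, with something like the following. This is only a sketch, assuming GNU sort, where -s (--stable) disables the last-resort comparison; the two sample lines are the ones from the paragraph above.

```shell
# Both lines have numeric key 0 (one literally, one via the empty number),
# so the last-resort comparison of the whole line decides the order.
with_last_resort=$(printf '%s\n' biscuits 0cookies | LC_ALL=C sort -n)
printf '%s\n' "$with_last_resort"
# 0cookies
# biscuits

# With -s, lines whose keys compare equal keep their input order.
stable=$(printf '%s\n' biscuits 0cookies | LC_ALL=C sort -n -s)
printf '%s\n' "$stable"
# biscuits
# 0cookies
```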

In general, use -n when you actually want to sort lines (or fields) which consist of numbers. If you have a bunch of things that aren't numbers, or are numbers mixed in with other strings, you'll still get a consistent result, but it might not be what you want.
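For the "fields" case, a minimal sketch might look like this. It assumes whitespace-separated records in a made-up "name count" format and uses -k2,2n to apply the numeric comparison to the second field only.

```shell
# Hypothetical records: a name followed by a count.
records=$(printf '%s\n' 'apple 10' 'pear 2' 'fig 33')

# Second field compared as a number.
by_count=$(printf '%s\n' "$records" | LC_ALL=C sort -k2,2n)
printf '%s\n' "$by_count"
# pear 2
# apple 10
# fig 33

# Second field compared as text: "10" sorts before "2".
by_text=$(printf '%s\n' "$records" | LC_ALL=C sort -k2,2)
printf '%s\n' "$by_text"
# apple 10
# pear 2
# fig 33
```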

If you do have a mix of letters and numbers (and lines which contain both), you might prefer -V, which does a version sort according to special rules which divide the string into logical components — but be careful because this will put 1.10 higher than 1.9.
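To illustrate that caveat, here is a quick comparison, assuming GNU sort (which provides -V) and the C locale:

```shell
# Version sort compares the components after the dot as integers: 9 < 10.
version_sorted=$(printf '%s\n' 1.10 1.9 | LC_ALL=C sort -V)
printf '%s\n' "$version_sorted"
# 1.9
# 1.10

# Numeric sort reads 1.10 as the decimal 1.1, which is less than 1.9.
numeric_sorted=$(printf '%s\n' 1.10 1.9 | LC_ALL=C sort -n)
printf '%s\n' "$numeric_sorted"
# 1.10
# 1.9
```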

  • I get confused when mixed character strings are introduced. Consider for example the case of a3 b2 c1. I tend to think this should be sorted as c1 b2 a3. Because the first non-numeric characters are considered "empty numbers" and hence evaluate to 0. Commented Jul 30, 2017 at 20:57
  • @MusséRedi I understand your logic there, but that's not how it works. If the first character is a letter, the whole line is counted as a zero. Then, all of the different zero-equivalent lines are sorted, apparently using the normal alphanumeric rule. Commented Jul 30, 2017 at 21:04
  • @don_crissti He's thinking a3 b2 c1 → 03 02 01 → 01 02 03... or back to c1 b2 a3. Commented Jul 30, 2017 at 21:08
  • The concept of empty number is not described in the manual, so I took a guess. In the manual, it only reads: an empty number is treated as '0'. I thought that a non-numeric character constituted an empty number. Commented Jul 30, 2017 at 21:29
  • @don_crissti I missed that because it talks about "key fields". It might be more clear if the -n option were documented using that same idiom. ("Use an initial numeric string as the sort key.") Edited now. Thanks. Commented Jul 30, 2017 at 22:10
4

By default, sort compares lines character by character using a locale-specified sort order. Generally that's pretty close to ASCII order, but there may be some regional variations. From the man page:

***  WARNING  ***  The  locale  specified  by the environment affects sort order.
Set LC_ALL=C to get the traditional sort order that uses native byte values.

Native byte value usually means by ASCII value, so digits come before uppercase letters which come before lowercase letters. But ordering is still character by character, so 10 comes before 2 because 1 comes before 2.
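A quick way to see the byte order, as a sketch with made-up single-character lines and LC_ALL=C set explicitly so the result does not depend on your environment:

```shell
# In the C locale, lines are compared by byte value:
# digits (0x30-0x39) < uppercase (0x41-0x5A) < lowercase (0x61-0x7A).
byte_sorted=$(printf '%s\n' b A a B 1 | LC_ALL=C sort)
printf '%s\n' "$byte_sorted"
# 1
# A
# B
# a
# b
```

In other locales the letters may interleave differently (for example, a and A next to each other), which is the variation the warning refers to.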

When the -n or --numeric-sort option is specified, the run of digits at the start of each line (or sort key) is treated as a number (not as individual characters), and lines are sorted numerically from smallest number to largest.

The documentation is not entirely explicit on the details, so here are the rules of the -n flag derived experimentally:

  1. Lines that begin numerically are sorted by numeric value (smaller numbers come first)
  2. Trailing characters on numeric lines do not affect the numeric portion, but the trailing characters are sorted alphanumerically if the numeric portion is the same.
  3. Lines that begin non-numerically are sorted as if they were zero, and then by rule 2.

Observe:

$ printf %s\\n 2z 111 10 20b 20a aa2 aa10 | sort -n
aa10
aa2
2z
10
20a
20b
111

By Rule 3, lines aa10 and aa2 are treated as zeros, and sorted by the remaining characters (including the digits, which are considered characters).

By Rule 2, lines 2z, 20a and 20b are treated as numbers and the trailing character only comes into effect when the numbers are the same.

And by Rule 1, all lines that begin with a number are sorted by numeric value.

Without the -n flag, sorting is done character by character where digit characters come before letter characters. Observe:

$ printf %s\\n 2z 111 10 20b 20a aa2 aa10 | sort
10
111
20a
20b
2z
aa10
aa2
  • So the difference between a default sort and --numeric-sort is what? Commented Jul 30, 2017 at 21:12
  • With the default sort Rule 1 becomes: sorted by digit numeric value. Rule 2 stays the same. Rule 3: sorted by alphanumeric value. Commented Jul 30, 2017 at 21:17
  • Oh yes. Have added section on default behaviour to try to illustrate the difference. Note that the default behaviour is not by "numeric" value, it's by byte value - the fact that the byte value of 1 is less than the byte value of 2 makes it look like it's numeric order. Hope that's made clear. Commented Jul 31, 2017 at 23:18
