Sort groups of rows where field values are the same in certain columns

Question

I have a text file like this:

1 bob A
1 jim B
1 Kate A
1 Nancy C
1 bill A
1 Jason A
2 James B
2 fill B
2 cake C
2 lucky C
2 Lucy A
2 lily B

How can I order the data by column 3 within each 1 & 2 group? The output should be:

1 bob A
1 Kate A
1 bill A
1 Jason A
1 jim B
1 Nancy C
2 Lucy A
2 James B
2 fill B
2 lily B
2 cake C
2 lucky C

Note that Kate appears before bill in the output because they are in that order in the input.

Column 1 takes many values, from 1, 2, up to 2000, so I was thinking of using awk to print rows by comparing values from row to row, rather than testing whether column 1 equals one particular value.

(1) Are you saying (as your example seems to suggest) that all the lines (i.e., rows or records) with $1 = 1 are together (consecutive) in the input file, and then all the lines with $1 = 2 are together, and then all the lines with $1 = 3 are together, … and then all the lines with $1 = 42 are together, … and finally all the lines with $1 = 2000 are together? And you want the output to have the same property? (2) Please confirm what I wrote about the $2 values.
– G-Man Says 'Reinstate Monica', Commented May 15, 2020 at 8:10

guest · Accepted Answer · 2020-05-18 04:56:31Z

6

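Sort the file numerically by the first column, then lexicographically by the third column. The -s option makes the sort stable, so lines that compare equal on both keys keep their input order: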

sort -s -k1n,1 -k3,3 file

Note: -s is an extension to the POSIX specification.
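
For readers who want to see the same idea outside of sort, here is a minimal Python sketch of a stable two-key sort. The function name stable_group_sort and the inline sample list are illustrative assumptions, not part of the answer; Python's sorted() is documented as stable, which stands in for -s here.

```python
# Sample rows from the question.
lines = [
    "1 bob A", "1 jim B", "1 Kate A", "1 Nancy C", "1 bill A", "1 Jason A",
    "2 James B", "2 fill B", "2 cake C", "2 lucky C", "2 Lucy A", "2 lily B",
]

def stable_group_sort(rows):
    """Order rows by numeric column 1, then by column 3; ties keep input order."""
    def key(row):
        fields = row.split()
        return (int(fields[0]), fields[2])
    # sorted() is stable, so rows with equal keys stay in their original order.
    return sorted(rows, key=key)

result = stable_group_sort(lines)
for row in result:
    print(row)
```

Because the key contains only columns 1 and 3, the names in column 2 never influence the order; that is why Kate stays ahead of bill.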

edited May 18, 2020 at 4:56

answered May 15, 2020 at 3:34


Note that the question says that $1 ranges up to 2000, so the file might have 12000 lines. sort will read this entire file and then sort it. This will take at least n log(n) time; i.e., 12000×log(12000). This probably consumes more resources than sorting 2000 groups of six lines.

– G-Man Says 'Reinstate Monica', Commented May 15, 2020 at 8:12
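
The arithmetic in that comment can be sketched as follows. This is only a rough estimate under stated assumptions: it counts n·log2(n) comparisons, ignores constant factors, I/O, and process startup, and assumes six lines per group as in the sample. The name comparison_estimate is made up for the illustration.

```python
import math

def comparison_estimate(n):
    """Rough n*log2(n) comparison count for sorting n items (constants ignored)."""
    return n * math.log2(n) if n > 1 else 0.0

groups = 2000          # number of distinct column-1 values, from the question
lines_per_group = 6    # assumed group size, as in the sample data

# One sort over the whole file versus one small sort per group.
whole_file = comparison_estimate(groups * lines_per_group)
per_group = groups * comparison_estimate(lines_per_group)

print(f"one sort of {groups * lines_per_group} lines: ~{whole_file:,.0f} comparisons")
print(f"{groups} sorts of {lines_per_group} lines: ~{per_group:,.0f} comparisons")
```

The per-group figure comes out smaller, but both numbers are tiny for a modern machine, so the difference may not matter in practice.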

@G-ManSays'ReinstateMonica' 12000 lines is still not very many though.

– Kusalananda ♦, Commented May 15, 2020 at 8:44

You should mention that this requires GNU sort for -s.

– Ed Morton, Commented May 16, 2020 at 18:39


Ed Morton · Accepted Answer · 2020-05-16 18:43:05Z

2

If you have GNU sort (for -s), see @guest's solution. Otherwise, with any cat, sort, and cut, number the lines first and use the line number as the final sort key to preserve input order:

$ cat -n file | sort -k2,2n -k4,4 -k1,1n | cut -f2-
1 bob A
1 Kate A
1 bill A
1 Jason A
1 jim B
1 Nancy C
2 Lucy A
2 James B
2 fill B
2 lily B
2 cake C
2 lucky C
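
The pipeline above is a decorate, sort, undecorate pattern. A minimal Python sketch of the same steps follows; the function name sort_with_line_numbers and the inline sample are assumptions for illustration only.

```python
# Sample rows from the question.
lines = [
    "1 bob A", "1 jim B", "1 Kate A", "1 Nancy C", "1 bill A", "1 Jason A",
    "2 James B", "2 fill B", "2 cake C", "2 lucky C", "2 Lucy A", "2 lily B",
]

def sort_with_line_numbers(rows):
    # Decorate: pair each row with its 1-based line number, like `cat -n`.
    numbered = list(enumerate(rows, start=1))

    def key(item):
        line_num, row = item
        fields = row.split()
        # Column 1 numerically, then column 3, then the original line number.
        return (int(fields[0]), fields[2], line_num)

    # The line number breaks every tie, so a stable sort is not required.
    numbered.sort(key=key)
    # Undecorate: drop the line numbers again, like `cut -f2-`.
    return [row for _, row in numbered]

result = sort_with_line_numbers(lines)
for row in result:
    print(row)
```

Making the line number an explicit last key is what lets this approach work with a sort that offers no stability guarantee.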

answered May 16, 2020 at 18:43


user413047 · Accepted Answer · 2020-05-18 03:47:30Z

Collect each line in an array. When the first word of a line differs from the first word of the previous line, print the array sorted by the third word, then start over. This may be somewhat over the top when a simple sort can do the job. The code below assumes the input has exactly the format shown in the question.

gawk:

BEGIN {ors=ORS; ORS=""; PROCINFO["sorted_in"]="@ind_str_asc"}

$1!=r {
    output()
    delete a
    r=$1
}
{
    a[$3]=a[$3] $0 ors
}

END {
    output()
}

function output() {
    for (i in a)
        print a[i]
}

python:

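import fileinput, operator

# r: column-1 value of the current group; a: rows (split into fields) of that group
r=''; a=[]

def out():
    # sorted() is stable, so rows with the same third field keep their input order
    for p in sorted(a,key=operator.itemgetter(2)):
        print(' '.join(p))

for line in fileinput.input():
    x = line.rstrip().split()
    if r!=x[0]:
        # first field changed: print the previous group, then start a new one
        r=x[0]
        if a:
            out()
            del a[:]
    a.append(x)
# print the final group
out()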

perl:

perl -lae 'sub out {foreach(sort keys %a) {print $a{$_}}} BEGIN {$ors=$\;$\=""}
    if ($F[0] ne $r) {$r=$F[0]; out; %a=()}
    $a{$F[2]}=$a{$F[2]}.$_.$ors; END{out}'

G-Man Says 'Reinstate Monica' · Accepted Answer · 2020-05-15 08:08:17Z

Here’s the awk solution you wanted. (Well, gawk [GNU awk], to be specific; this won’t work in POSIX awk.)

awk '
        function dump() {
                PROCINFO["sorted_in"] = "@ind_str_asc"
                for (arg3 in group) {
                        PROCINFO["sorted_in"] = "@ind_num_asc"
                        for (line_num in group[arg3]) {
                                print group[arg3][line_num]
                        }
                        PROCINFO["sorted_in"] = "@ind_str_asc"
                }
        }
        {
                if ($1 != saved_arg1) {
                        dump()
                        delete group
                        saved_arg1 = $1
                }
                group[$3][NR] = $0
        }
        END {
                dump()
        }
    '

The main work begins in the middle. For each line, if its $1 value is different from the most recent one we’ve seen, that means that we’re entering a new group. Dump the data from the previous group (i.e., write it to the output), delete the saved data for the previous group, and then remember the new $1 value.

Then, in either case, add the current line to the group array. This is a two-dimensional array, indexed by $3 value and NR (line number). So, for example, for the first six lines of your sample input, we get

group["A"][1] = "1 bob A"
group["B"][2] = "1 jim B"
group["A"][3] = "1 Kate A"
group["C"][4] = "1 Nancy C"
group["A"][5] = "1 bill A"
group["A"][6] = "1 Jason A"

When we see $1 = 2 on line 7, we call the dump function (defined at the top of the program). for (arg3 in group) sets arg3 to A, B and C, in that order. Then, for arg3 = A, the loop for (line_num in group[arg3]) (i.e., for (line_num in group["A"])) sets line_num to 1, 3, 5 and 6, in that order. And so we print out

1 bob A
1 Kate A
1 bill A
1 Jason A

And so on for the other $3 values. And so on for the other $1 values.
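
For comparison, the same two-level structure can be sketched in Python with a dictionary of dictionaries. This is an illustrative translation, not the author's code: the names dump and sort_groups mirror the awk program, and sorted() over the keys stands in for the PROCINFO["sorted_in"] settings.

```python
from collections import defaultdict

def dump(group):
    """Return one group's lines ordered by column 3, then by input line number."""
    out = []
    for arg3 in sorted(group):                # string order, like "@ind_str_asc"
        for line_num in sorted(group[arg3]):  # numeric order, like "@ind_num_asc"
            out.append(group[arg3][line_num])
    return out

def sort_groups(lines):
    result = []
    group = defaultdict(dict)   # group[column 3][line number] = whole line
    saved_arg1 = None
    for nr, line in enumerate(lines, start=1):
        fields = line.split()
        if fields[0] != saved_arg1:
            # New column-1 value: flush the previous group and start afresh.
            result.extend(dump(group))
            group = defaultdict(dict)
            saved_arg1 = fields[0]
        group[fields[2]][nr] = line
    result.extend(dump(group))  # flush the last group
    return result

sample = ["1 bob A", "1 jim B", "1 Kate A", "1 Nancy C", "1 bill A", "1 Jason A",
          "2 James B", "2 fill B", "2 cake C", "2 lucky C", "2 Lucy A", "2 lily B"]
result = sort_groups(sample)
for row in result:
    print(row)
```

Like the awk version, this holds only one group in memory at a time, and it relies on each column-1 value appearing in one consecutive run of lines.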
