Revisions to For each line in a file, print fields from specific column to NF if those values are less than value in another field

added perl version

edited Oct 5 at 9:34

Update (2025-10-05): an upvote inspired me to write a perl version.

It's easier/simpler with perl:

$ perl -lane 'print join "\t", @F[0..3],
                      grep { $_ < $F[3] } @F[4..$#F]' input.txt 
NC_000001.11_NM_001005484.2 69270   234 69037   65565
NC_000001.11_NM_001005484.2 69511   475 69037   65565
NC_000001.11_NM_001005484.2 69761   725 69037   65565
NC_000001.11_NM_001385640.1 942155  20  942136  924432  925922  930155  931039  935772  939040  939272  941144

This prints the first 4 fields of each line, and any remaining fields which are numerically less than the fourth field, joined by tab characters (change that to a space if you need to).

It uses the perl command-line options -l (enable automatic handling of line-endings), -a (autosplit each input line into array @F), -n (operate like sed -n, i.e. read and process each line but don't print anything by default), and -e (next arg is the script to execute). See man perlrun for details on these and other options.

It also uses perl's built-in grep() function, which is named for its similarity to the grep command-line tool. It can be used for regex or string matches, but what it actually does is return a list of all elements of a list for which the expression evaluates to true. See perldoc -f grep for details.

added note about the awk line re-split hack

edited Sep 9, 2021 at 8:38

cas

This iterates over the fields from the end of the line to the beginning (i.e. in reverse order) and blanks the field if the field number (i) is greater than 4 AND the value of that field is greater than or equal to the value of field 4 ($4).

$ awk '{
    for (i=NF; i>=1; i--) {
      if ((i > 4) && ($i >= $4)) {
        $i=""
      }
    };
    print
    }' input.txt
NC_000001.11_NM_001005484.2 69270 234 69037 65565 
NC_000001.11_NM_001005484.2 69511 475 69037 65565 
NC_000001.11_NM_001005484.2 69761 725 69037 65565 
NC_000001.11_NM_001385640.1 942155 20 942136 924432 925922 930155 931039 935772 939040 939272 941144

BTW, it's not clear whether your input is space- or tab-separated. If you want tab-separated output (rather than a single space between each field), add -v OFS='\t' to the awk command, immediately before the single quote that starts the script, e.g.:

awk -v OFS='\t' '...awk script here...' input.txt
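One thing worth checking when you set OFS: awk only rejoins the fields with OFS when the record is rebuilt, which happens when a field is assigned to. A line on which the script above blanks nothing is printed with its original separators. The following sketch shows the difference on a single hypothetical input line (the sample data is made up):

```shell
# Hypothetical one-line input with single spaces between fields.
line='a 5 6 7 1'

# No field is assigned, so $0 is printed unchanged: the spaces remain.
printf '%s\n' "$line" | awk -v OFS='\t' '{ print }'

# Assigning any field (even to itself) rebuilds $0 using OFS: tabs appear.
printf '%s\n' "$line" | awk -v OFS='\t' '{ $1 = $1; print }'
```

So if every output line must be tab-separated, putting $1=$1 before the print statement is one way to force that.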

Also, awk will leave extra field separators in the output line, one wherever a field used to be before it was blanked. If you want to get rid of those, add the following line immediately before the print statement:

    $0=$0; $1=$1;

This effectively removes the empty fields: $0=$0 forces awk to re-evaluate the modified line and split it into fields again (splitting on FS, the field separator, which defaults to any amount of white-space), and $1=$1 then rebuilds the line with a single OFS between the remaining fields. It's a bit of a hack, needed because awk doesn't have any way to actually delete a field from a line, so you have to force the re-split after the line has been modified.
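Putting the pieces together, a minimal sketch of the whole script with the re-split line in place might look like this. The function name filter_fields and the sample input line are hypothetical, chosen only for illustration, and the loop stops at field 5 instead of testing i > 4 inside it:

```shell
# filter_fields is a hypothetical wrapper name, used here only for the demo.
filter_fields() {
  awk '{
    for (i = NF; i > 4; i--) {
      if ($i >= $4) {
        $i = ""            # blank the field; awk cannot remove it outright
      }
    }
    $0 = $0; $1 = $1       # re-split on FS, then rebuild $0 with single OFS
    print
  }' "$@"
}

# Made-up sample line: field 4 (50) is the threshold for fields 5 onwards.
printf 'id 99 7 50 10 60 20 50 49\n' | filter_fields
```

For that sample line the output is "id 99 7 50 10 20 49": the 60 and the second 50 are dropped, and no doubled separators or trailing space remain.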

added 2 characters in body

edited Sep 9, 2021 at 8:18

cas

This iterates over the fields from the end of the line to the beginning (i.e. in reverse order) and blanks the field if the field number (i) is greater than 4 AND the value of that field is greater than or equal to the value of field 4 ($4).

$ awk '{
    for (i=NF; i>=1; i--) {
      if ((i > 4) && ($i >= $4)) {
        $i=""
      }
    };
    print
    }' input.txt
NC_000001.11_NM_001005484.2 69270 234 69037 65565 
NC_000001.11_NM_001005484.2 69511 475 69037 65565 
NC_000001.11_NM_001005484.2 69761 725 69037 65565 
NC_000001.11_NM_001385640.1 942155 20 942136 924432 925922 930155 931039 935772 939040 939272 941144

BTW, it's not clear whether your input is space- or tab-separated. If you want tab-separated output (rather than a single space between each field), add -v OFS='\t' to the awk command, immediately before the single quote that starts the script, e.g.:

awk -v OFS='\t' '...awk script here...' input.txt

answered Sep 9, 2021 at 8:13

cas
