added perl version
cas

Update: 2025-10-05. An upvote inspired me to write a perl version.

It's easier/simpler with perl:

$ perl -lane 'print join "\t", @F[0..3],
                      grep { $_ < $F[3] } @F[4..$#F]' input.txt 
NC_000001.11_NM_001005484.2 69270   234 69037   65565
NC_000001.11_NM_001005484.2 69511   475 69037   65565
NC_000001.11_NM_001005484.2 69761   725 69037   65565
NC_000001.11_NM_001385640.1 942155  20  942136  924432  925922  930155  931039  935772  939040  939272  941144

This prints the first 4 fields of each line, and any remaining fields which are numerically less than the fourth field, joined by tab characters (change that to a space if you need to).

It uses the perl command-line options -l (enable automatic handling of line-endings), -a (autosplit each input line into array @F), -n (operate like sed -n, i.e. read and process each line but don't print anything by default), and -e (next arg is the script to execute). See man perlrun for details on these and other options.

It also uses perl's built-in grep() function, which is named for its similarity to the grep command-line tool. It can be used for regex or string matches, but what it actually does is return a list of all elements of a list for which the expression evaluates to true. See perldoc -f grep for details.


added note about the awk line re-split hack

This iterates over the fields from the end of the line to the beginning (i.e. in reverse order) and empties the field if the field number (i) is greater than 4 AND the value of that field is greater than or equal to the value of field 4 ($4).

$ awk '{
    for (i=NF; i>=1; i--) {
      if ((i > 4) && ($i >= $4)) {
        $i=""
      }
    };
    print
    }' input.txt
NC_000001.11_NM_001005484.2 69270 234 69037 65565 
NC_000001.11_NM_001005484.2 69511 475 69037 65565 
NC_000001.11_NM_001005484.2 69761 725 69037 65565 
NC_000001.11_NM_001385640.1 942155 20 942136 924432 925922 930155 931039 935772 939040 939272 941144 

BTW, it's not clear whether your input is space or tab separated. If you want tab-separated output (rather than a single space between each field), then add -v OFS='\t' to the awk command immediately before the single-quote starting the script, e.g.

awk -v OFS='\t' '...awk script here...' input.txt

Also, awk will leave a lot of extra field separators in the output line, one wherever a field used to be before it was emptied. If you want to get rid of those, add the following line immediately before the print statement:

    $0=$0; $1=$1;

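This will effectively remove any empty fields: $0=$0 forces awk to re-evaluate the modified line and split it into fields again (splitting on FS, the field-separator, which defaults to any amount of white-space, so runs of separators collapse), and $1=$1 then makes awk rebuild the line with OFS between the remaining fields. It's a bit of a hack because awk doesn't have any way to actually delete a field from a line, so you have to force it to do that after the line has been modified. It also relies on the default FS: with a single-character FS such as a tab, the empty fields would survive the re-split.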
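Putting the pieces above together, the whole command might look something like the sketch below. The `filter_fields` wrapper and the sample line are made up for illustration; the awk body is the script from this answer with the OFS option and the re-split line added, and the loop stops at field 5 because fields 1 to 4 are always kept.

```shell
# Hypothetical wrapper around the awk script; reads the named files,
# or standard input if no file is given.
filter_fields() {
  awk -v OFS='\t' '{
      for (i = NF; i > 4; i--) {
        if ($i >= $4) {
          $i = ""          # empty any later field that is >= field 4
        }
      }
      $0 = $0; $1 = $1     # re-split to drop the empty fields, rebuild with OFS
      print
    }' "$@"
}

# Made-up sample line: field 4 is 50, so 60 and 70 should be dropped.
printf '%s\n' 'id1 100 5 50 10 60 20 70' | filter_fields
```

With that sample line the output is the six fields id1, 100, 5, 50, 10 and 20, separated by single tabs and with no trailing separator.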
