How to grep all lines from one file in specific column in multiple other files?

Question

I have one file: combined.txt like this:

GO_GLUTAMINE_FAMILY_AMINO_ACID_METABOLIC_PROCESS
REACTOME_APC_CDC20_MEDIATED_DEGRADATION_OF_NEK2A
LEE_METASTASIS_AND_RNA_PROCESSING_UP
RB_DN.V1_UP
REACTOME_ABORTIVE_ELONGATION_OF_HIV1_TRANSCRIPT_IN_THE_ABSENCE_OF_TAT
...

In my current directory I have multiple .xls files named after the lines in combined.txt, for example: GO_GLUTAMINE_FAMILY_AMINO_ACID_METABOLIC_PROCESS.xls

In those .xls files I want to retrieve every value in the 5th whitespace-separated field (the one I call GENE_TITLE) from rows that have "Yes" in the last field (the one I call "METRIC SCORE").

those files look like:

 NAME    PROBE   GENE SYMBOL     GENE_TITLE      RANK IN GENE LIST       RANK METRIC SCORE       RUNNING ES      CORE ENRICHMENT
row_0   MKI67   null    null    51      3.389514923095703       0.06758767      Yes
row_1   CDCA8   null    null    96      2.8250465393066406      0.123790346     Yes
row_2   NUSAP1  null    null    118     2.7029471397399902      0.17939204      Yes
row_3   H2AFX   null    null    191     2.3259851932525635      0.22256653      Yes
row_4   DLGAP5  null    null    193     2.324765920639038       0.2718671       Yes
row_5   SMC2    null    null    229     2.2023487091064453      0.31562105      No
row_6   CKS1B   null    null    279     2.0804455280303955      0.3555722       No
row_7   UBE2C   null    null    403     1.816525936126709       0.38350475      No

In the output file, every line should look like this:

 GO_GLUTAMINE_FAMILY_AMINO_ACID_METABOLIC_PROCESS 51 96 118 191 193
<name of the particular line in combined.txt>  <list of all entries in GENE_TITLE which have METRIC SCORE=Yes>

What I tried so far is:

grep -iw -f combined.txt *.xls > out1

I also tried this, but here I am neither using the information from combined.txt nor restricting the values to those labeled "Yes"; it just extracts the 5th column from all files:

awk '{ a[FNR] = (a[FNR] ? a[FNR] FS : "") $5 } END { for(i=1;i<=FNR;i++) print a[i] }' $(ls -1v *.xls) > out2

this is maybe a little bit closer but still not there:

awk 'BEGIN {ORS=" "} BEGINFILE{print FILENAME} {print $5 " " $8} ENDFILE{ printf("\n")}'  *.xls > out3

I am getting something like:

GENE_TITLE GENE 1 Yes 4 Yes 11 Yes 23 Yes 49 Yes 76 Yes 85 Yes 118 No 161 No....
GENE_TITLE GENE 0 Yes 16 No 28 Yes 51 Yes 63 No 96 Yes 182 Yes 191 Yes
...

So my desired output would have, instead of "GENE_TITLE GENE", the name of the file those values came from (without the .xls suffix), followed by the values (0 Yes 16 No 28 Yes 51 Yes 63 No 96...), not including the ones that have "No".

UPDATE

I did get the file I needed, but I wrote the ugliest code possible (see below). If someone has something a little more elegant, please do share.

This is how I got it:

awk '{print FILENAME " "$5 " "$8}' *.xls  | awk '!/^ranked/' | awk '!/^gsea/'|  awk '!/^gene/' | awk '$3!="No"  {print $1 " " $2}' | awk '$2!="GENE_TITLE"  {print}' |awk -v ncr=4 '{$1=substr($1,0,length($1)-ncr)}1' | awk -F' ' -v OFS=' ' '{x=$1;$1="";a[x]=a[x]$0}END{for(x in a)print x,a[x]}'>out3

grep -iw -f combined.txt out3 > ENTR_combined_SET.txt

To be honest, this seems complicated enough that you might want to switch to python (or a similar language). It will make your code more readable and easier to maintain.
– Panki, Commented Apr 16, 2019 at 7:37

Kusalananda · Accepted Answer · 2019-04-22 21:03:01Z

xargs -I {} awk '$8 == "Yes" { title = title OFS $5 } END { print substr(FILENAME,1,length(FILENAME)-4), title }' {}.xls <combined.txt

This uses xargs to execute an awk program for each name listed in your combined.txt file.

The awk program is given, as its input file, each name read from combined.txt with .xls appended to it.

The awk program collects the data from the 5th column for each row whose 8th column is Yes. This string is then printed together with the filename with its last four characters (the file name suffix) chopped off.
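
As a hedged illustration of the same idea, the sketch below runs a slight variant of the command on made-up data in a scratch directory. The set names SET_A and SET_B and all row values are hypothetical. The variant assumes the files are tab-separated (hence `-F '\t'`, which is not in the original command) and uses a `sep` variable so that the collected list does not start with an extra separator.

```shell
#!/bin/sh
# Scratch directory with made-up sample data (SET_A and SET_B are hypothetical names).
tmp=$(mktemp -d) || exit 1
cd "$tmp" || exit 1

printf '%s\n' SET_A SET_B > combined.txt

# Assumption: the real files are tab-separated with this header row.
header='NAME\tPROBE\tGENE SYMBOL\tGENE_TITLE\tRANK IN GENE LIST\tRANK METRIC SCORE\tRUNNING ES\tCORE ENRICHMENT\n'
{
  printf "$header"
  printf 'row_0\tMKI67\tnull\tnull\t51\t3.38\t0.067\tYes\n'
  printf 'row_1\tCDCA8\tnull\tnull\t96\t2.82\t0.123\tYes\n'
  printf 'row_2\tSMC2\tnull\tnull\t229\t2.20\t0.315\tNo\n'
} > SET_A.xls
{
  printf "$header"
  printf 'row_0\tUBE2C\tnull\tnull\t7\t1.81\t0.383\tYes\n'
} > SET_B.xls

# One awk run per name in combined.txt; sep stays empty until the first match,
# so the list of values has no leading separator.
result=$(
  xargs -I {} awk -F '\t' '
    $8 == "Yes" { title = title sep $5; sep = " " }
    END { print substr(FILENAME, 1, length(FILENAME) - 4), title }
  ' {}.xls < combined.txt
)
printf '%s\n' "$result"
```

With tab splitting, a value that contains spaces stays in a single field, which whitespace splitting would not guarantee.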

Hi, how would I change this command so that it prints the file name as it is and the 2nd column, which is called "PROBE", instead of the "GENE_TITLE" that I have now?
– anikaM, Commented Apr 23, 2019 at 18:19

@anikaM You would change $5 to $2 and use just FILENAME instead of substr(...).
– Kusalananda ♦, Commented Apr 23, 2019 at 18:36

Is it like this: xargs -I {} awk '$8 == "Yes" { title = title OFS $2 } END { print FILENAME, title }' {}.xls < combined.txt
– anikaM, Commented Apr 23, 2019 at 19:01

@anikaM I believe so, yes.
– Kusalananda ♦, Commented Apr 23, 2019 at 19:06
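
Switching between columns by editing field numbers is easy to get wrong, so here is a possible sketch that selects columns by header name instead. It is not from the answer: it assumes tab-separated files whose first line is the header, and the variable names `want`, `flag`, `col` and `yes` as well as the sample file SET_A.xls are made up for the example.

```shell
#!/bin/sh
# Scratch directory with one made-up sample file (SET_A is a hypothetical name).
tmp=$(mktemp -d) || exit 1
cd "$tmp" || exit 1

{
  printf 'NAME\tPROBE\tGENE SYMBOL\tGENE_TITLE\tRANK IN GENE LIST\tRANK METRIC SCORE\tRUNNING ES\tCORE ENRICHMENT\n'
  printf 'row_0\tMKI67\tnull\tnull\t51\t3.38\t0.067\tYes\n'
  printf 'row_1\tCDCA8\tnull\tnull\t96\t2.82\t0.123\tYes\n'
  printf 'row_2\tSMC2\tnull\tnull\t229\t2.20\t0.315\tNo\n'
} > SET_A.xls

result=$(
  awk -F '\t' -v want='PROBE' -v flag='CORE ENRICHMENT' '
    FNR == 1 {                       # header row: locate both columns by name
      for (i = 1; i <= NF; i++) {
        if ($i == want) col = i
        if ($i == flag) yes = i
      }
      next
    }
    col && yes && $yes == "Yes" { out = out " " $col }
    END { print FILENAME out }       # file name as is, then the collected values
  ' SET_A.xls
)
printf '%s\n' "$result"
```

Passing a different header name through `-v want=...` selects another column without touching the awk program.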

Freddy · Accepted Answer · 2019-04-16 08:28:38Z

Bash script:

#!/bin/bash

# read combined.txt line by line
while read -r line; do
        # skip missing file ${line}.xls
        [ ! -f "$line".xls ] && continue

        # echo line and one space character (without newline)
        echo -n "$line " >> out

        # get 5th column if line ends with "Yes" and optional whitespace at end of line
        # replace newline '\n' with space ' '
        sed -nE 's/^\S+\s+\S+\s+\S+\s+\S+\s+(\S+).*\sYes\s*$/\1/p' "$line".xls | tr '\n' ' ' >> out

        # add newline
        echo >> out
done < combined.txt

in one line:

while read -r line; do [ ! -f "$line".xls ] && continue; echo -n "$line " >> out; sed -nE 's/^\S+\s+\S+\s+\S+\s+\S+\s+(\S+).*\sYes\s*$/\1/p' "$line".xls | tr '\n' ' ' >> out; echo >> out; done < combined.txt

Note that each line in out will have one additional space character at the end of the line.
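
If the trailing space matters, one possible variant (a sketch, not part of the script above) collects the values with awk, joins them with `paste`, and redirects the whole loop to `out` once. It keeps the whitespace-based field splitting of the script above. The names SET_A, SET_B and SET_MISSING and the row values are made-up sample data.

```shell
#!/bin/sh
# Scratch directory with made-up sample data; SET_MISSING has no matching file.
tmp=$(mktemp -d) || exit 1
cd "$tmp" || exit 1

printf '%s\n' SET_A SET_MISSING SET_B > combined.txt
{
  printf 'NAME\tPROBE\tGENE SYMBOL\tGENE_TITLE\tRANK IN GENE LIST\tRANK METRIC SCORE\tRUNNING ES\tCORE ENRICHMENT\n'
  printf 'row_0\tMKI67\tnull\tnull\t51\t3.38\t0.067\tYes\n'
  printf 'row_1\tCDCA8\tnull\tnull\t96\t2.82\t0.123\tYes\n'
  printf 'row_2\tSMC2\tnull\tnull\t229\t2.20\t0.315\tNo\n'
} > SET_A.xls
{
  printf 'NAME\tPROBE\tGENE SYMBOL\tGENE_TITLE\tRANK IN GENE LIST\tRANK METRIC SCORE\tRUNNING ES\tCORE ENRICHMENT\n'
  printf 'row_0\tUBE2C\tnull\tnull\t7\t1.81\t0.383\tYes\n'
} > SET_B.xls

while IFS= read -r name; do
  [ -f "$name.xls" ] || continue     # skip names without a matching file
  # 5th field of every row whose last field is Yes, joined with single spaces
  values=$(awk '$NF == "Yes" { print $5 }' "$name.xls" | paste -sd ' ' -)
  # the separating space is only printed when there is at least one value
  printf '%s%s\n' "$name" "${values:+ $values}"
done < combined.txt > out

cat out
```

Because the redirection sits on `done`, `out` is truncated and written once per run instead of being appended to by every command.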
