I have one file: combined.txt like this:
GO_GLUTAMINE_FAMILY_AMINO_ACID_METABOLIC_PROCESS
REACTOME_APC_CDC20_MEDIATED_DEGRADATION_OF_NEK2A
LEE_METASTASIS_AND_RNA_PROCESSING_UP
RB_DN.V1_UP
REACTOME_ABORTIVE_ELONGATION_OF_HIV1_TRANSCRIPT_IN_THE_ABSENCE_OF_TAT
...
and in my current directory I have multiple .xls files which are named like lines in combined.txt, for example: GO_GLUTAMINE_FAMILY_AMINO_ACID_METABOLIC_PROCESS.xls
In those .xls files I want to retrieve everything in column named: GENE_TITLE for which I have "Yes" in column named: "METRIC SCORE"
those files look like:
NAME PROBE GENE SYMBOL GENE_TITLE RANK IN GENE LIST RANK METRIC SCORE RUNNING ES CORE ENRICHMENT
row_0 MKI67 null null 51 3.389514923095703 0.06758767 Yes
row_1 CDCA8 null null 96 2.8250465393066406 0.123790346 Yes
row_2 NUSAP1 null null 118 2.7029471397399902 0.17939204 Yes
row_3 H2AFX null null 191 2.3259851932525635 0.22256653 Yes
row_4 DLGAP5 null null 193 2.324765920639038 0.2718671 Yes
row_5 SMC2 null null 229 2.2023487091064453 0.31562105 No
row_6 CKS1B null null 279 2.0804455280303955 0.3555722 No
row_7 UBE2C null null 403 1.816525936126709 0.38350475 No
And in the output file I would have just in every line:
GO_GLUTAMINE_FAMILY_AMINO_ACID_METABOLIC_PROCESS 51 96 118 191 193
<name of the particular line in combined.txt> <list of all entries in GENE_TITLE which have METRIC SCORE=Yes>
What I tried so far is:
grep -iw -f combined.txt *.xls > out1
I also tried this but here I am not using information from combined.txt neither getting values labeled with "Yes" just extracting 5th column from all files
awk '{ a[FNR] = (a[FNR] ? a[FNR] FS : "") $5 } END { for(i=1;i<=FNR;i++) print a[i] }' $(ls -1v *.xls) > out2
this is maybe a little bit closer but still not there:
awk 'BEGIN {ORS=" "} BEGINFILE{print FILENAME} {print $5 " " $8} ENDFILE{ printf("\n")}' *.xls > out3
I am getting something like:
GENE_TITLE GENE 1 Yes 4 Yes 11 Yes 23 Yes 49 Yes 76 Yes 85 Yes 118 No 161 No....
GENE_TITLE GENE 0 Yes 16 No 28 Yes 51 Yes 63 No 96 Yes 182 Yes 191 Yes
...
so my desired output would have instead of "GENE_TITLE GENE" the name of the file from where it did grab those values (without .xls suffix) : 0 Yes 16 No 28 Yes 51 Yes 63 No 96...not including the one which have "No"
UPDATE
I did get the file I needed but I wrote the ugliest code possible (see bellow). If someone has something a little bit more elegant please do share.
This is how I got it:
awk '{print FILENAME " "$5 " "$8}' *.xls | awk '!/^ranked/' | awk '!/^gsea/'| awk '!/^gene/' | awk '$3!="No" {print $1 " " $2}' | awk '$2!="GENE_TITLE" {print}' |awk -v ncr=4 '{$1=substr($1,0,length($1)-ncr)}1' | awk -F' ' -v OFS=' ' '{x=$1;$1="";a[x]=a[x]$0}END{for(x in a)print x,a[x]}'>out3
grep -iw -f combined.txt out3 > ENTR_combined_SET.txt
python(or a similar language). It will make your code more readable and easier to maintain.