Extract values from string

Question

I want to extract certain values from a string in Python.

snp_1_881627    AA=G;ALLELE=A;DAF_GLOBAL=0.473901;GENE_TRCOUNT_AFFECTED=1;GENE_TRCOUNT_TOTAL=1;SEVERE_GENE=ENSG00000188976;SEVERE_IMPACT=SYNONYMOUS_CODON;TR_AFFECTED=FULL;ANNOTATION_CLASS=REG_FEATURE,SYNONYMOUS_CODON,ACTIVE_CHROM,NC_TRANSCRIPT_VARIANT,NC_TRANSCRIPT_VARIANT;A_A_CHANGE=.,L,.,.,.;A_A_LENGTH=.,750,.,.,.;A_A_POS=.,615,.,.,.;CELL=GM12878,.,GM12878,.,.;CHROM_STATE=.,.,11,.,.;EXON_NUMBER=.,16/19,.,.,.;GENE_ID=.,ENSG00000188976,.,ENSG00000188976,ENSG00000188976;GENE_NAME=.,NOC2L,.,NOC2L,NOC2L;HGVS=.,c.1843N>T,.,n.3290N>T,n.699N>T;REG_ANNOTATION=H3K36me3,.,.,.,.;TR_BIOTYPE=.,PROTEIN_CODING,.,PROCESSED_TRANSCRIPT,PROCESSED_TRANSCRIPT;TR_ID=.,ENST00000327044,.,ENST00000477976,ENST00000483767;TR_LENGTH=.,2790,.,4201,1611;TR_POS=.,1893,.,3290,699;TR_STRAND=.,-1,.,-1,-1

Output:

              GENE_ID         GENE_NAME   EXON_NUMBER  SEVERE_IMPACT
snp_1_881627  ENSG00000188976 NOC2L       16/19        SYNONYMOUS_CODON

If the string has a value for each of those variables (GENE_ID, GENE_NAME, EXON_NUMBER), output it; otherwise output "NA" (the variable doesn't exist or its value doesn't exist). In some cases, these variables don't exist in the string.

Which string method should I use to accomplish this? Should I split my string before extracting any values? I have 10k rows, and I need to extract the values for each snp_*.

string=string.split(';')

P.S. I am a newbie in python

Have you actually tried to use split? Where's the code, and what was the result?
– jonrsharpe, Commented May 13, 2014 at 22:04
Once I split, the values to be extracted can be in inconsistent positions, so I cannot access them by index. I was thinking of searching for a pattern (e.g. GENE_ID) in the complete string.
– Rgeek, Commented May 13, 2014 at 22:08
Don't use indices; actually search for the terms you want in the list using startswith. I suggest you make a dictionary, e.g. {'ID': 'snp_1_881627', 'SEVERE_IMPACT': 'SYNONYMOUS_CODON', ...}
– jonrsharpe, Commented May 13, 2014 at 22:13

Rob Watts · Accepted Answer · 2014-05-13 22:27:46Z

There are two general strategies for this - split and regex.

To use split, first split off the row label (snp_1_881627):

rowname, data = row.split()

Then, you can split data into the individual entries using the ; separator:

data = data.split(';')

Since you need to get the value of certain keys, we can turn it into a dictionary:

dataDictionary = {}
for entry in data:
    entry = entry.split('=')
    dataDictionary[entry[0]] = entry[1] if len(entry) > 1 else None

Then you can simply check if the keys are in dataDictionary, and if so grab their values.

Using split is nice in that it will index everything in the data string, making it easy to grab whichever ones you need.
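
The split steps above can be put together roughly as follows. This is only a sketch: the helper names (parse_row, first_real_value, extract) are made up for illustration, "NA" is the placeholder the question asks for, and taking the first comma-separated item that is not "." is an assumption based on the expected output in the question, not something the data format guarantees. The sample row is a shortened version of the one in the question.

```python
WANTED = ['GENE_ID', 'GENE_NAME', 'EXON_NUMBER', 'SEVERE_IMPACT']


def parse_row(row):
    """Split one input row into (row label, {key: value}) using str.split."""
    # Split once on whitespace: the label, then the ';'-separated data.
    rowname, data = row.split(None, 1)
    fields = {}
    for entry in data.strip().split(';'):
        key, sep, value = entry.partition('=')
        # Entries without '=' are stored with None as their value.
        fields[key] = value if sep else None
    return rowname, fields


def first_real_value(raw, missing='NA'):
    """Return the first comma-separated item that is not the '.' placeholder.

    Assumption: for multi-valued fields the first non-'.' item is the one wanted.
    """
    if not raw:
        return missing
    for item in raw.split(','):
        if item and item != '.':
            return item
    return missing


def extract(row, wanted=WANTED):
    """Return [row label, value1, value2, ...] with 'NA' for anything missing."""
    rowname, fields = parse_row(row)
    return [rowname] + [first_real_value(fields.get(key)) for key in wanted]


# Shortened sample row; GENE_NAME is deliberately left out.
row = ('snp_1_881627    AA=G;SEVERE_IMPACT=SYNONYMOUS_CODON;'
       'EXON_NUMBER=.,16/19,.,.,.;'
       'GENE_ID=.,ENSG00000188976,.,ENSG00000188976,ENSG00000188976')
print('\t'.join(extract(row)))
```

Because fields.get(key) returns None for a key that is absent, a missing variable and an empty value both end up as "NA" without a separate membership check.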

If the ones you need will not change, then regex might be a better option:

>>> import re
>>> re.search('(?<=GENE_ID=)[^;]*', 'onevalue;GENE_ID=SOMETHING;othervalue').group()
'SOMETHING'

Here I'm using a "lookbehind" to match one of the keywords, then grabbing the value from the match using group(). Putting your keywords into a list, you could find all the values like this:

import re
...
keywords = ['GENE_ID', 'GENE_NAME', 'EXON_NUMBER', 'SEVERE_IMPACT']
desiredValues = {}
for keyword in keywords:
    match = re.search('(?<={}=)[^;]*'.format(keyword), string_to_search)
    desiredValues[keyword] = match.group() if match else DEFAULT_VALUE
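
A self-contained variation of the loop above, as a sketch only. It assumes DEFAULT_VALUE should be "NA" (from the question), and it swaps the lookbehind for a capture group anchored on the start of the data or a preceding ";", so that a keyword such as GENE_ID cannot match inside a longer key name. The function name extract_with_regex and the row labels snp_a and snp_b are hypothetical.

```python
import re

KEYWORDS = ['GENE_ID', 'GENE_NAME', 'EXON_NUMBER', 'SEVERE_IMPACT']
DEFAULT_VALUE = 'NA'  # assumed placeholder, taken from the question

# One precompiled pattern per keyword. The key must start the data or follow
# a ';', and the value is everything up to the next ';'.
PATTERNS = {
    keyword: re.compile(r'(?:^|;)' + re.escape(keyword) + r'=([^;]*)')
    for keyword in KEYWORDS
}


def extract_with_regex(row):
    """Return (row label, {keyword: value}) with DEFAULT_VALUE for gaps."""
    rowname, data = row.split(None, 1)
    values = {}
    for keyword, pattern in PATTERNS.items():
        match = pattern.search(data)
        # A missing key and an empty value are both reported as DEFAULT_VALUE.
        values[keyword] = match.group(1) if match and match.group(1) else DEFAULT_VALUE
    return rowname, values


# Hypothetical rows for illustration; real input would be read line by line.
rows = [
    'snp_a    AA=G;GENE_ID=ENSG1;GENE_NAME=NOC2L;EXON_NUMBER=16/19;SEVERE_IMPACT=SYNONYMOUS_CODON',
    'snp_b    AA=T;SEVERE_GENE_ID=IGNORED;GENE_NAME=',
]
for row in rows:
    rowname, values = extract_with_regex(row)
    print('\t'.join([rowname] + [values[k] for k in KEYWORDS]))
```

Compiling the patterns once avoids rebuilding them for each of the 10k rows. Values such as ".,NOC2L,.,NOC2L,NOC2L" are returned as-is here; picking one item out of the comma-separated list would be a separate step.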

biw · Accepted Answer · 2014-05-13 22:41:11Z

I think this is going to be the solution you are looking for.

#input
user_in = 'snp_1_881627    AA=G;ALLELE=A;DAF_GLOBAL=0.473901;GENE_TRCOUNT_AFFECTED=1;GENE_TRCOUNT_TOTAL=1;SEVERE_GENE=ENSG00000188976;SEVERE_IMPACT=SYNONYMOUS_CODON;TR_AFFECTED=FULL;ANNOTATION_CLASS=REG_FEATURE,SYNONYMOUS_CODON,ACTIVE_CHROM,NC_TRANSCRIPT_VARIANT,NC_TRANSCRIPT_VARIANT;A_A_CHANGE=.,L,.,.,.;A_A_LENGTH=.,750,.,.,.;A_A_POS=.,615,.,.,.;CELL=GM12878,.,GM12878,.,.;CHROM_STATE=.,.,11,.,.;EXON_NUMBER=.,16/19,.,.,.;GENE_ID=.,ENSG00000188976,.,ENSG00000188976,ENSG00000188976;GENE_NAME=.,NOC2L,.,NOC2L,NOC2L;HGVS=.,c.1843N>T,.,n.3290N>T,n.699N>T;REG_ANNOTATION=H3K36me3,.,.,.,.;TR_BIOTYPE=.,PROTEIN_CODING,.,PROCESSED_TRANSCRIPT,PROCESSED_TRANSCRIPT;TR_ID=.,ENST00000327044,.,ENST00000477976,ENST00000483767;TR_LENGTH=.,2790,.,4201,1611;TR_POS=.,1893,.,3290,699;TR_STRAND=.,-1,.,-1,-1'

#set some empty vars
user_in = user_in.split(';')
final_output = ""
GENE_ID_FOUND = False
GENE_NAME_FOUND = False
EXON_NUMBER_FOUND = False
GENE_ID_OUTPUT = ''
GENE_NAME_OUTPUT = ''
EXON_NUMBER_OUTPUT = ''
SEVERE_IMPACT_OUTPUT = ''


for x in range(0, len(user_in)):
  if x == 0:
    first_line_count = 0
    first_line_print = ''
    while(user_in[0][first_line_count] != " "):
      first_line_print += user_in[0][first_line_count]
      first_line_count += 1
    final_output += first_line_print + "\t"
  else:

    # SEVERE_GENE holds a single gene ID; the GENE_ID field itself is a comma-separated list
    if user_in[x][0:11] == "SEVERE_GENE":
      GENE_ID_OUTPUT += user_in[x][12:] + "\t"
      GENE_ID_FOUND = True

    if user_in[x][0:9] == "GENE_NAME":
      GENE_NAME_OUTPUT += user_in[x][10:] + "\t"
      GENE_NAME_FOUND = True

    if user_in[x][0:11] == "EXON_NUMBER":
      EXON_NUMBER_OUTPUT += user_in[x][12:] + "\t"
      EXON_NUMBER_FOUND = True

    if user_in[x][0:13] == "SEVERE_IMPACT":
      SEVERE_IMPACT_OUTPUT += user_in[x][14:] + "\t"

# append each value, or "NA" plus a tab so the columns stay aligned
if GENE_ID_FOUND:
  final_output += GENE_ID_OUTPUT
else:
  final_output += "NA\t"

if GENE_NAME_FOUND:
  final_output += GENE_NAME_OUTPUT
else:
  final_output += "NA\t"

if EXON_NUMBER_FOUND:
  final_output += EXON_NUMBER_OUTPUT
else:
  final_output += "NA\t"

# fall back to "NA" when SEVERE_IMPACT was not found
final_output += SEVERE_IMPACT_OUTPUT or "NA"


print(final_output)
