I want to extract certain values from a string in Python.

snp_1_881627    AA=G;ALLELE=A;DAF_GLOBAL=0.473901;GENE_TRCOUNT_AFFECTED=1;GENE_TRCOUNT_TOTAL=1;SEVERE_GENE=ENSG00000188976;SEVERE_IMPACT=SYNONYMOUS_CODON;TR_AFFECTED=FULL;ANNOTATION_CLASS=REG_FEATURE,SYNONYMOUS_CODON,ACTIVE_CHROM,NC_TRANSCRIPT_VARIANT,NC_TRANSCRIPT_VARIANT;A_A_CHANGE=.,L,.,.,.;A_A_LENGTH=.,750,.,.,.;A_A_POS=.,615,.,.,.;CELL=GM12878,.,GM12878,.,.;CHROM_STATE=.,.,11,.,.;EXON_NUMBER=.,16/19,.,.,.;GENE_ID=.,ENSG00000188976,.,ENSG00000188976,ENSG00000188976;GENE_NAME=.,NOC2L,.,NOC2L,NOC2L;HGVS=.,c.1843N>T,.,n.3290N>T,n.699N>T;REG_ANNOTATION=H3K36me3,.,.,.,.;TR_BIOTYPE=.,PROTEIN_CODING,.,PROCESSED_TRANSCRIPT,PROCESSED_TRANSCRIPT;TR_ID=.,ENST00000327044,.,ENST00000477976,ENST00000483767;TR_LENGTH=.,2790,.,4201,1611;TR_POS=.,1893,.,3290,699;TR_STRAND=.,-1,.,-1,-1

Desired output:

              GENE_ID         GENE_NAME   EXON_NUMBER  SEVERE_IMPACT
snp_1_881627  ENSG00000188976 NOC2L       16/19        SYNONYMOUS_CODON

If the string contains a value for each of those variables (GENE_ID, GENE_NAME, EXON_NUMBER), output it; otherwise output "NA" (the variable is missing or has no value). In some rows these variables do not appear in the string at all.

Which string method should I use to accomplish this? Should I split the string before extracting any values? I have 10k rows, and I need to extract the values for each snp_* entry.

string = string.split(';')

P.S. I am a newbie in Python.

  • Have you actually tried to use split? Where's the code, and what was the result? Commented May 13, 2014 at 22:04
  • Once I split, the values to be extracted can be at inconsistent positions, so I cannot access them by index. I was thinking of searching for a pattern (e.g. GENE_ID) in the whole string. Commented May 13, 2014 at 22:08
  • Don't use indices, actually search for the terms you want in the list using startswith. I suggest you make a dictionary e.g. {'ID': 'snp_1_881627', 'SEVERE_IMPACT': 'SYNONYMOUS_CODON', ...} Commented May 13, 2014 at 22:13

2 Answers

There are two general strategies for this: split and regex.

To use split, first split off the row label (snp_1_881627):

rowname, data = row.split()

Then, you can split data into the individual entries using the ; separator:

data = data.split(';')

Since you need to get the value of certain keys, we can turn it into a dictionary:

dataDictionary = {}
for entry in data:
    # split on the first '=' only, so a value that contains '=' stays intact
    entry = entry.split('=', 1)
    dataDictionary[entry[0]] = entry[1] if len(entry) > 1 else None

Then you can simply check if the keys are in dataDictionary, and if so grab their values.

Using split is nice in that it will index everything in the data string, making it easy to grab whichever ones you need.
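
As a minimal sketch of the dictionary approach above, the steps can be combined into one function. The names `parse_row`, `first_real_value`, and `FIELDS` are illustrative and not part of the original answer. Picking the first entry that is not `.` from a comma-separated value is an assumption, made so that a value such as `.,ENSG00000188976,.` yields `ENSG00000188976` as in the question's desired output.

```python
FIELDS = ["GENE_ID", "GENE_NAME", "EXON_NUMBER", "SEVERE_IMPACT"]


def first_real_value(value):
    """Return the first comma-separated item that is not the '.' placeholder."""
    for item in value.split(","):
        if item and item != ".":
            return item
    return None


def parse_row(row, fields=FIELDS, missing="NA"):
    # Split the row label from the data on the first run of whitespace.
    parts = row.split(None, 1)
    rowname = parts[0]
    data = parts[1] if len(parts) > 1 else ""
    # Build a key -> value dictionary; partition() tolerates entries without '='.
    entries = {}
    for entry in data.strip().split(";"):
        key, _, value = entry.partition("=")
        entries[key] = value
    # Look up each wanted field, falling back to the placeholder when the
    # key is absent or carries no usable value.
    values = [first_real_value(entries.get(f, "")) or missing for f in fields]
    return rowname, values


# Shortened sample row (GENE_NAME deliberately left out).
row = ("snp_1_881627    AA=G;EXON_NUMBER=.,16/19,.,.,.;"
       "GENE_ID=.,ENSG00000188976,.;SEVERE_IMPACT=SYNONYMOUS_CODON")
rowname, values = parse_row(row)
print("\t".join([rowname] + values))
```

For the 10k rows mentioned in the question, the same function could be called once per line of the input file; the function assumes each line is non-empty and starts with the row label.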

If the set of keys you need will not change, then a regex might be a better option:

>>> import re
>>> re.search('(?<=GENE_ID=)[^;]*', 'onevalue;GENE_ID=SOMETHING;othervalue').group()
'SOMETHING'

Here I'm using a "lookbehind" to match one of the keywords, then grabbing the value from the match using group(). Putting your keywords into a list, you could find all the values like this:

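import re
...
keywords = ['GENE_ID', 'GENE_NAME', 'EXON_NUMBER', 'SEVERE_IMPACT']
DEFAULT_VALUE = 'NA'  # placeholder for a missing key, as the question asks
desiredValues = {}
for keyword in keywords:
    # string_to_search holds one input row; re.escape guards keys that
    # contain regex metacharacters
    match = re.search('(?<={}=)[^;]*'.format(re.escape(keyword)), string_to_search)
    desiredValues[keyword] = match.group() if match else DEFAULT_VALUE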
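
A possible variation on the regex approach, sketched here for several rows at once: the patterns are compiled once, and each key is required to follow the start of the string, whitespace, or `;`, so that a longer key ending in the same text is not matched by accident. The names `PATTERNS` and `extract`, the sample rows, and the key `OTHER_GENE_ID` are hypothetical. This sketch returns the raw value, so a comma-separated list such as `.,NOC2L,.` is not reduced to a single entry.

```python
import re

KEYWORDS = ["GENE_ID", "GENE_NAME", "EXON_NUMBER", "SEVERE_IMPACT"]

# One compiled pattern per keyword. The key must follow the start of the
# string, whitespace, or ';' so a longer key with the same ending is skipped.
PATTERNS = {
    kw: re.compile(r"(?:^|[;\s])" + re.escape(kw) + r"=([^;]*)")
    for kw in KEYWORDS
}


def extract(row, default="NA"):
    found = {}
    for kw, pattern in PATTERNS.items():
        match = pattern.search(row)
        # An empty value (e.g. 'GENE_ID=;') is treated like a missing key.
        found[kw] = match.group(1) if match and match.group(1) else default
    return found


# Hypothetical rows: the second one has no wanted keys at all.
rows = [
    "snp_a    AA=G;GENE_ID=ENSG00000188976;SEVERE_IMPACT=SYNONYMOUS_CODON",
    "snp_b    AA=T;OTHER_GENE_ID=ignored",
]
for row in rows:
    label = row.split()[0]
    values = extract(row)
    print("\t".join([label] + [values[kw] for kw in KEYWORDS]))
```

When the rows come from a file, stripping the trailing newline first (for example with `rstrip()`) keeps it out of the last captured value.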

I think this is going to be the solution you are looking for.

#input
user_in = 'snp_1_881627    AA=G;ALLELE=A;DAF_GLOBAL=0.473901;GENE_TRCOUNT_AFFECTED=1;GENE_TRCOUNT_TOTAL=1;SEVERE_GENE=ENSG00000188976;SEVERE_IMPACT=SYNONYMOUS_CODON;TR_AFFECTED=FULL;ANNOTATION_CLASS=REG_FEATURE,SYNONYMOUS_CODON,ACTIVE_CHROM,NC_TRANSCRIPT_VARIANT,NC_TRANSCRIPT_VARIANT;A_A_CHANGE=.,L,.,.,.;A_A_LENGTH=.,750,.,.,.;A_A_POS=.,615,.,.,.;CELL=GM12878,.,GM12878,.,.;CHROM_STATE=.,.,11,.,.;EXON_NUMBER=.,16/19,.,.,.;GENE_ID=.,ENSG00000188976,.,ENSG00000188976,ENSG00000188976;GENE_NAME=.,NOC2L,.,NOC2L,NOC2L;HGVS=.,c.1843N>T,.,n.3290N>T,n.699N>T;REG_ANNOTATION=H3K36me3,.,.,.,.;TR_BIOTYPE=.,PROTEIN_CODING,.,PROCESSED_TRANSCRIPT,PROCESSED_TRANSCRIPT;TR_ID=.,ENST00000327044,.,ENST00000477976,ENST00000483767;TR_LENGTH=.,2790,.,4201,1611;TR_POS=.,1893,.,3290,699;TR_STRAND=.,-1,.,-1,-1'

# split into 'KEY=value' fields, then initialise the flags and output strings
user_in = user_in.split(';')
final_output = ""
GENE_ID_FOUND = False
GENE_NAME_FOUND = False
EXON_NUMBER_FOUND = False
GENE_ID_OUTPUT = ''
GENE_NAME_OUTPUT = ''
EXON_NUMBER_OUTPUT = ''
SEVERE_IMPACT_OUTPUT = ''


for x in range(0, len(user_in)):
  if x == 0:
    first_line_count = 0
    first_line_print = ''
    while(user_in[0][first_line_count] != " "):
      first_line_print += user_in[0][first_line_count]
      first_line_count += 1
    final_output += first_line_print + "\t"
  else:

    if user_in[x][0:11] == "SEVERE_GENE":
      GENE_ID_OUTPUT += user_in[x][12:] + "\t"
      GENE_ID_FOUND = True

    if user_in[x][0:9] == "GENE_NAME":
      GENE_NAME_OUTPUT += user_in[x][10:] + "\t"
      GENE_NAME_FOUND = True

    if user_in[x][0:11] == "EXON_NUMBER":
      EXON_NUMBER_OUTPUT += user_in[x][12:] + "\t"
      EXON_NUMBER_FOUND = True

    if user_in[x][0:13] == "SEVERE_IMPACT":
      SEVERE_IMPACT_OUTPUT += user_in[x][14:] + "\t"

if GENE_ID_FOUND:
  final_output += GENE_ID_OUTPUT
else:
  final_output += "NA\t"  # keep the tab so the columns stay separated

if GENE_NAME_FOUND:
  final_output += GENE_NAME_OUTPUT
else:
  final_output += "NA\t"

if EXON_NUMBER_FOUND:
  final_output += EXON_NUMBER_OUTPUT
else:
  final_output += "NA\t"

# SEVERE_IMPACT has no flag; an empty string means it was not found
final_output += SEVERE_IMPACT_OUTPUT or "NA"


print(final_output)
