
Using Python with pandas (imported as pd), I am trying to write an output file that contains a subset of columns selected by their headers.

Here is how I read the input file:

gene_input = pd.read_table(args.gene, sep="\t", index_col=0)

The structure of gene_input:

       Sample1  Sample2  Sample3  Sample4  Sample5  Sample6  Sample7  Sample8
Gene1        2       23      213      213       13      132      213     4312
Gene2        3       12    21312      123      123       23     4321      432
Gene3        5      213    21312       15      516     3421     4312     4132
Gene4        2      123      123        7      610       23     3214     4312
Gene5        1      213      213        1      152       23     1423     3421

Using a different loop, I generated two dictionaries. The first one has the keys Sample1 and Sample7, and the second has the keys Sample4 and Sample8.

I would like the samples from each dictionary to be consecutive in the output, i.e. all of dictionary 1 first, then all of dictionary 2. The output that I am looking for is:

       Sample1  Sample7  Sample4  Sample8
Gene1        2      213      213     4312
Gene2        3     4321      123      432
Gene3        5     4312       15     4132
Gene4        2     3214        7     4312
Gene5        1     1423        1     3421

I have tried the following, but neither attempt worked:

key_num=list(dictionary1.keys())
num = genes_input[gene_input.columns.isin(key_num)]

This was meant to extract the first set of columns so that I could then combine it with the second set, but it failed. It kept raising an AttributeError, even after I updated pandas. I also tried the following:

reader = csv.reader( open(gene_input, 'rU'), delimiter='\t')
header_row = reader.next() # Gets the header

for key, value in numerator.items():
    output.write(key + "\t")
    if key in header_row:
        for row in reader:
            idx=header_row.index(key)
            output.write(idx +"\t")

I also tried some other commands and loops. Depending on the method, sometimes only the first key ends up in the output, and other times I get an error (I am not listing every attempt here for the sake of brevity).

Anyway, if anyone has any input on how I can generate the output file of interest, I'd be grateful.

Again, here is what I want as a final output:

       Sample1  Sample7  Sample4  Sample8
Gene1        2      213      213     4312
Gene2        3     4321      123      432
Gene3        5     4312       15     4132
Gene4        2     3214        7     4312
Gene5        1     1423        1     3421

1 Answer

For a specific set of columns in a specific order, use:
df = gene_input[['Sample1', 'Sample7', 'Sample4', 'Sample8']]

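If you need to build that list (['Sample1', ...]) automatically, take the keys of the two dictionaries and concatenate them, dictionary 1 first:
column_names = list(dictionary1.keys()) + list(dictionary2.keys())

The list() calls are needed on Python 3, where dictionary1.keys() + dictionary2.keys() raises a TypeError. Do not sort the combined list if each dictionary's samples must stay together: sorting these names gives Sample1, Sample4, Sample7, Sample8, which interleaves the two groups. Within each dictionary the keys come out in insertion order on Python 3.7 and later. Then select the columns with df = gene_input[column_names]. For output, you should be able to use: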
df.to_csv(<output file name>, sep='\t')

EDIT: added part about output
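
Putting the steps above together, here is a minimal sketch rather than a definitive implementation. It assumes the input looks like the table in the question (inlined here as a string so the snippet runs on its own) and that only the keys of the two dictionaries matter; the dictionary values and the variable names raw, subset and masked are placeholders made up for illustration. It also shows a corrected version of the boolean-mask attempt from the question: a mask built from gene_input.columns has to be applied to the column axis with .loc[:, mask], and it keeps the file's original column order rather than the dictionary order.

```python
import io

import pandas as pd

# Stand-in for the tab-separated input file from the question.
raw = (
    "\tSample1\tSample2\tSample3\tSample4\tSample5\tSample6\tSample7\tSample8\n"
    "Gene1\t2\t23\t213\t213\t13\t132\t213\t4312\n"
    "Gene2\t3\t12\t21312\t123\t123\t23\t4321\t432\n"
    "Gene3\t5\t213\t21312\t15\t516\t3421\t4312\t4132\n"
    "Gene4\t2\t123\t123\t7\t610\t23\t3214\t4312\n"
    "Gene5\t1\t213\t213\t1\t152\t23\t1423\t3421\n"
)
gene_input = pd.read_csv(io.StringIO(raw), sep="\t", index_col=0)

# Hypothetical dictionaries: only the keys are used, the values are placeholders.
dictionary1 = {"Sample1": None, "Sample7": None}
dictionary2 = {"Sample4": None, "Sample8": None}

# Dictionary 1 keys first, then dictionary 2 keys, without sorting.
column_names = list(dictionary1) + list(dictionary2)

# Drop any key that is not a column, so the selection cannot raise a KeyError.
column_names = [name for name in column_names if name in gene_input.columns]

# Selecting with a list of names returns the columns in the order of the list.
subset = gene_input[column_names]

# Boolean-mask variant: the mask must be applied to the columns, not the rows.
# It keeps the original file order (Sample1, Sample4, Sample7, Sample8).
mask = gene_input.columns.isin(column_names)
masked = gene_input.loc[:, mask]

# With no path, to_csv returns the text; pass a file name to write to disk.
output_text = subset.to_csv(sep="\t")
print(output_text)
```

Selecting by a list of names is the simpler option when the column order matters, because the mask variant cannot reorder columns.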

1 Comment

Thank you. df.to_csv fully answers my request. One more quick question: how can I use multiple index columns in gene_input = pd.read_table(args.gene, sep="\t", index_col=0)? I know it must go through index_col, but I couldn't choose more than one column as an index. That could help me combine other columns too. Thanks.
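
On the follow-up about multiple index columns: index_col also accepts a list of column positions (or names), which produces a MultiIndex. A small sketch under assumptions: the file layout below, including the "gene" and "transcript" columns, is invented for illustration and is not the asker's actual data.

```python
import io

import pandas as pd

# Hypothetical tab-separated input whose first two columns identify each row.
raw = (
    "gene\ttranscript\tSample1\tSample2\n"
    "Gene1\tT1\t2\t23\n"
    "Gene1\tT2\t3\t12\n"
    "Gene2\tT1\t5\t213\n"
)

# A list for index_col makes both columns part of the row index (a MultiIndex).
multi = pd.read_csv(io.StringIO(raw), sep="\t", index_col=[0, 1])

print(multi.index.names)
print(multi.loc[("Gene1", "T2"), "Sample2"])
```

Rows are then addressed with a tuple, one entry per index level, as in the last line.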
