Select Specific Columns only from a dataframe in Python

Question

Using Python and pandas (imported as pd), I am trying to write an output file that contains a subset of columns selected by specific headers.

Here is how I read the input file:

gene_input = pd.read_table(args.gene, sep="\t", index_col=0)

The structure of gene_input:

       Sample1  Sample2  Sample3  Sample4  Sample5  Sample6  Sample7  Sample8
Gene1        2       23      213      213       13      132      213     4312
Gene2        3       12    21312      123      123       23     4321      432
Gene3        5      213    21312       15      516     3421     4312     4132
Gene4        2      123      123        7      610       23     3214     4312
Gene5        1      213      213        1      152       23     1423     3421

Using a different loop, I generated two dictionaries. The first has the keys Sample1 and Sample7, and the second has the keys Sample4 and Sample8.

I would like to have the following output (note that I want the samples from each dictionary to be consecutive, i.e. all of Dictionary 1 first, then all of Dictionary 2):

       Sample1  Sample7  Sample4  Sample8
Gene1        2      213      213     4312
Gene2        3     4321      123      432
Gene3        5     4312       15     4132
Gene4        2     3214        7     4312
Gene5        1     1423        1     3421

I have tried the following but none worked:

key_num=list(dictionary1.keys())
num = genes_input[gene_input.columns.isin(key_num)]

The plan was to extract the first set of columns and then somehow combine it with the second set, but that failed. It kept giving me attribute errors, even after I updated pandas. I also tried the following:

reader = csv.reader( open(gene_input, 'rU'), delimiter='\t')
header_row = reader.next() # Gets the header

for key, value in numerator.items():
    output.write(key + "\t")
    if key in header_row:
        for row in reader:
            idx=header_row.index(key)
            output.write(idx +"\t")

as well as some other commands, loops, and lines. Sometimes only the first key ends up in the output; other times I get an error, depending on which method I tried (I am not listing them all here for the sake of brevity).

Anyway, if anyone has any input on how I can generate the output file of interest, I'd be grateful.

Again, here is what I want as a final output:

       Sample1  Sample7  Sample4  Sample8
Gene1        2      213      213     4312
Gene2        3     4321      123      432
Gene3        5     4312       15     4132
Gene4        2     3214        7     4312
Gene5        1     1423        1     3421

Dthal · Accepted Answer · 2015-12-27 02:52:17Z

For a specific set of columns in a specific order, use:
df = gene_input[['Sample1', 'Sample7', 'Sample4', 'Sample8']]

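If you need to build that list (['Sample1',...]) automatically, and the names are as given, you should be able to build the two lists of keys, combine them and then sort. In Python 3, dict.keys() returns a view that does not support +, so convert each one to a list first:
column_names = sorted(list(dictionary1.keys()) + list(dictionary2.keys()))

With the names that you have, sorted() gives alphabetical order (Sample1, Sample4, Sample7, Sample8). To keep all of the dictionary1 columns ahead of the dictionary2 columns, as the question asks, drop the sorted() call. For output, you should be able to use: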
df.to_csv(<output file name>, sep='\t')
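
A minimal end-to-end sketch of the steps above, not a definitive implementation. It assumes dictionary1 and dictionary2 are plain dicts keyed by column name (their values are placeholders here), rebuilds the first two rows of the question's table inline instead of reading args.gene, and writes to an in-memory buffer rather than a real output file:

```python
import io

import pandas as pd

# First two rows of the table from the question, built inline for the example.
header = ["Sample%d" % i for i in range(1, 9)]
rows = {
    "Gene1": [2, 23, 213, 213, 13, 132, 213, 4312],
    "Gene2": [3, 12, 21312, 123, 123, 23, 4321, 432],
}
gene_input = pd.DataFrame.from_dict(rows, orient="index", columns=header)

# Placeholder dictionaries: only the keys matter, the values are hypothetical.
dictionary1 = {"Sample1": None, "Sample7": None}
dictionary2 = {"Sample4": None, "Sample8": None}

# dictionary1 keys first, then dictionary2 keys.
# Plain dicts keep insertion order in Python 3.7+.
column_names = list(dictionary1) + list(dictionary2)

# Indexing with a list of names selects those columns in that order.
subset = gene_input[column_names]

# Tab-separated output; the StringIO buffer stands in for an output file path.
buffer = io.StringIO()
subset.to_csv(buffer, sep="\t")
output_text = buffer.getvalue()
print(output_text)
```

Passing a file name instead of the buffer to to_csv writes the same text to disk.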

EDIT: added part about output

edited Dec 27, 2015 at 2:52

answered Dec 27, 2015 at 2:42

Dthal


1 Comment

BioProgram Over a year ago

Thank you. df.to_csv fully answers my request. One more quick question: if I want multiple index columns in gene_input = pd.read_table(args.gene, sep="\t", index_col=0), how can I do that? I know it must go in index_col, but I couldn't choose more than one column as an index. That could help me combine other columns too. Thanks.
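
On the follow-up about multiple index columns: index_col also accepts a list of column positions (or names), which gives a MultiIndex. A small sketch, where the table and the column names GeneID and Chrom are invented for illustration and are not from the original data:

```python
import io

import pandas as pd

# Hypothetical two-key table; "GeneID" and "Chrom" are made-up column names.
text = (
    "GeneID\tChrom\tSample1\tSample2\n"
    "Gene1\tchr1\t2\t23\n"
    "Gene2\tchr2\t3\t12\n"
)

# A list passed to index_col makes both columns part of a MultiIndex.
multi = pd.read_csv(io.StringIO(text), sep="\t", index_col=[0, 1])
print(multi.index.names)
```

The same index_col=[0, 1] argument should work with pd.read_table as used in the question.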
