I need to create a function in pandas that takes a single dataframe as input and returns multiple dataframes as output, split according to a specific condition (see the examples below). I am having a hard time figuring out how to do this and would appreciate some expert advice on the code.

Example 1:

Input = dataframe with 100 columns

Outputs = dataframe1 with the first 10% of columns (columns 1 to 10), dataframe2 with the second 10% of columns (columns 11 to 20), and so on up to the last 10% of columns (columns 91 to 100).

Example 2:

Input = dataframe with 109 columns

Outputs = dataframe1 with the first 10% of columns, rounded (columns 1 to 11), dataframe2 with the second 10% of columns (columns 12 to 22), and so on up to the last group of columns (ending at column 109).

This is the logic I am trying to develop:

  1. find 10% of the total number of columns in the original dataframe and call it 'n'
  2. pick the first 'n' columns from the original dataframe.
  3. add them to a new dataframe
  4. drop them from the original dataframe
  5. check whether the number of columns left in the original dataframe is greater than 'n'
  6. if YES -> repeat step 2 to step 5.
  7. if NO -> add all the remaining columns to the last created dataframe.
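One possible reading of the steps above, sketched as half-open (start, stop) positions that could later be passed to iloc. The function name split_column_bounds is hypothetical, and two choices here are assumptions rather than part of the question: leftover columns form their own shorter final group (mirroring the else branch of the attempt below), and the group size is forced to be at least 1.

```python
def split_column_bounds(total_columns, percentage_split):
    """Return half-open (start, stop) column positions for each group."""
    # Step 1: group size 'n'. max(1, ...) is an added guard (assumption)
    # so that a tiny percentage never produces zero-width groups.
    n = max(1, round(total_columns * percentage_split / 100))
    bounds = []
    start = 0
    # Steps 2-6: keep taking the next 'n' columns while any remain.
    while start < total_columns:
        # Step 7 (assumed reading): the leftover columns become a
        # shorter final group instead of being merged into the previous one.
        stop = min(start + n, total_columns)
        bounds.append((start, stop))
        start = stop
    return bounds


bounds_100 = split_column_bounds(100, 10)
bounds_101 = split_column_bounds(101, 10)
bounds_109 = split_column_bounds(109, 10)
```

Because the pairs are half-open and zero-based, a group can be selected with something like df.iloc[:, start:stop] without any off-by-one adjustment.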

I tried the following code, but it is wrong. It is meant to compute the respective column numbers for each group from the percentage split; I then plan to use those numbers to split the dataframe with the iloc function.

def split_column_numbers(total_columns, percentage_split):
    list1 = []
    number = round((total_columns * (percentage_split/100)))
    list1.append([0,number])
    for i in range(number):
        last_num = list1[-1][-1]
        if (last_num < total_columns):
            if((total_columns-last_num) > number):
                list1.append([last_num+1, last_num+number])
            else:
                list1.append([last_num+1, total_columns])
    return list1
split_column_numbers(101, 10)

Could anyone tell me whether this logic is correct and how to achieve this?

  • The above sample works for split_column_numbers(100, 10) and split_column_numbers(109, 10), but it does not work for split_column_numbers(101, 10). Commented Nov 1, 2019 at 10:04

1 Answer

If you pass your frame directly to the function, it should make it easier for you to work out which columns to grab later on. We can use math.ceil to round up, and itertools.zip_longest to split into our subgroups.

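from itertools import zip_longest
from math import ceil

# Sentinel used for padding, so falsy column labels such as 0 or '' are kept
_PAD = object()


def split_columns(frame, percentage_split):
    cols = frame.columns
    # Group size: the requested percentage of the column count, rounded up
    grp_size = ceil(len(cols) * percentage_split/100)
    # The same iterator repeated grp_size times yields consecutive chunks
    chunks = zip_longest(*(iter(cols),) * grp_size, fillvalue=_PAD)
    return [[c for c in grp if c is not _PAD] for grp in chunks]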
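The grouping idiom in that function can be hard to read at first, so here is a minimal standalone sketch on a plain list. The names items, padded and pad are illustrative only. It shows that passing the same iterator several times to zip_longest makes each output tuple consume consecutive items, and that a dedicated sentinel lets padding be told apart from legitimate falsy values such as 0.

```python
from itertools import zip_longest

items = [0, 1, 2, 3, 4, 5, 6]

# Three references to the *same* iterator: every tuple pulls the next
# three items, and the last tuple is padded (with None by default).
it = iter(items)
padded = list(zip_longest(it, it, it))

# A unique sentinel as fill value, so that real values like 0 survive
# the filtering step that strips the padding.
pad = object()
it = iter(items)
chunks = [
    [x for x in grp if x is not pad]
    for grp in zip_longest(it, it, it, fillvalue=pad)
]
```

Here padded ends with a tuple containing None placeholders, while chunks holds the same groups with the padding removed and the leading 0 intact.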

For example, if we set up a dummy frame as follows:

from string import ascii_lowercase

import pandas as pd

tmp = pd.DataFrame(columns=list(ascii_lowercase))

Then if we do split_columns(tmp, 10) we get:

[['a', 'b', 'c'],
 ['d', 'e', 'f'],
 ['g', 'h', 'i'],
 ['j', 'k', 'l'],
 ['m', 'n', 'o'],
 ['p', 'q', 'r'],
 ['s', 't', 'u'],
 ['v', 'w', 'x'],
 ['y', 'z']]

And if we do split_columns(tmp, 30) we get:

[['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],
 ['i', 'j', 'k', 'l', 'm', 'n', 'o', 'p'],
 ['q', 'r', 's', 't', 'u', 'v', 'w', 'x'],
 ['y', 'z']]

If we then want to use these column choices to create new frames, we can do this with a dictionary comprehension and enumerate:

frames = {i: tmp[cols] for i, cols in enumerate(split_columns(tmp, 30))}

This gives us a dictionary where the keys are integers (the first group of columns corresponds to 0, the second to 1, etc.) and the values are dataframes containing the selected columns.
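If column positions are preferred over column names, the same split can be sketched with iloc. This is a minimal sketch rather than a tuned implementation: the function name split_frame_by_position is hypothetical, it keeps the math.ceil rounding used above, and the max(1, ...) guard against a zero group size is an added assumption.

```python
from math import ceil

import pandas as pd


def split_frame_by_position(frame, percentage_split):
    """Split a frame into column groups by position, keyed 0, 1, 2, ..."""
    n_cols = frame.shape[1]
    # Same rounding as above; max(1, ...) is an added guard (assumption)
    grp_size = max(1, ceil(n_cols * percentage_split / 100))
    # iloc slices are half-open, and a slice running past the last
    # column is simply truncated, so the final group may be shorter.
    return {
        i: frame.iloc[:, start:start + grp_size]
        for i, start in enumerate(range(0, n_cols, grp_size))
    }


tmp = pd.DataFrame(columns=list("abcdefghijklmnopqrstuvwxyz"))
frames = split_frame_by_position(tmp, 30)
```

With the 26-column dummy frame this produces four frames, three with eight columns and a last one holding 'y' and 'z', matching the name-based split.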

3 Comments

Thank you so much for your quick response. I will implement this in my code and let you know :)
Since I am gonna be dealing with big data (datasets with more than 1000 columns), it would be better for me to get the column numbers and use them in the iloc function to get the dataframes. That way would be much quicker than using the actual feature names. Do you have any idea how to deal with this situation? :)
I've edited in line with your first comment. I don't think selecting columns is really going to be that much of a bottleneck in terms of your speeds?