7
\$\begingroup\$

I wrote a function that computes the cross tabulation between two variables in a polars dataframe.

It supports absolute values, row percentages, column percentages and total percentages.

import polars as pl
import polars.selectors as cs
from enum import Enum, auto

class PercOptions(Enum):
    NO = auto()
    TOTAL = auto()
    COLUMNS = auto()
    ROWS = auto()

def crosstab(df:pl.DataFrame, col_a:str, col_b:str, perc:PercOptions=PercOptions.NO) -> pl.DataFrame:
    crosstab = df.pivot(
        on=col_b, index=col_a, values=col_b, aggregate_function="len", sort_columns=True
    ).sort(col_a)

    if perc == PercOptions.NO:
        return crosstab

    elif perc == PercOptions.TOTAL:

        def perc_cols(df):
            tot = df.select(~cs.by_index(0)).to_numpy().sum()
            for col in df.columns[1:]:
                yield ((pl.col(col) / tot) * 100)

        crosstab_perc = pl.concat(
            [crosstab.select(cs.by_index(0)), crosstab.select(perc_cols(crosstab))],
            how="horizontal",
        )

        return crosstab_perc
    
    elif perc == PercOptions.COLUMNS:

        def perc_cols(df):
            for col in df.columns[1:]:
                tot = df.select(col).to_numpy().sum()
                yield ((pl.col(col) / tot) * 100)

        crosstab_perc = pl.concat(
            [crosstab.select(cs.by_index(0)), crosstab.select(perc_cols(crosstab))],
            how="horizontal",
        )

        return crosstab_perc
    
    elif perc == PercOptions.ROWS:
        def perc_cols(df):
            hs = crosstab.select(~cs.by_index(0)).sum_horizontal()
            for col in df.columns[1:]:
                yield ((pl.col(col) / hs) * 100)
                
        crosstab_perc = crosstab.select(cs.by_index(0), *perc_cols(crosstab))
        return crosstab_perc
    
    else:
        raise ValueError(f"Unknown value for perc {perc}")

Here is some example:

from palmerpenguins import load_penguins
penguins = load_penguins()
penguins = pl.from_pandas(penguins)

crosstab(penguins, "species", "island", perc=PercOptions.NO)
crosstab(penguins, "species", "island", perc=PercOptions.COLUMNS)
crosstab(penguins, "species", "island", perc=PercOptions.ROWS)
crosstab(penguins, "species", "island", perc=PercOptions.TOTAL)

Do you see any bugs?

Is there a more efficient way to do this?

\$\endgroup\$
0

3 Answers 3

3
\$\begingroup\$

Just some notes about your Polars usage:

Selectors

There is also a dedicated cs.first() selector which may read nicer.

You can "invert" the selection in various ways:

cs.exclude(cs.first())
cs.all() - cs.first()
~cs.first()

They end up as the same selector:

>>> print(~cs.first())
# [cs.all() - cs.first(require=true)]

Concat

.concat() is generally used when you have multiple frames.

df_pivot = (
    penguins
    .pivot(on="island", index="species", values="species", aggregate_function="len", sort_columns=True)
    .sort("species")
)
# shape: (3, 4)
# ┌───────────┬────────┬───────┬───────────┐
# │ species   ┆ Biscoe ┆ Dream ┆ Torgersen │
# │ ---       ┆ ---    ┆ ---   ┆ ---       │
# │ str       ┆ u32    ┆ u32   ┆ u32       │
# ╞═══════════╪════════╪═══════╪═══════════╡
# │ Adelie    ┆ 44     ┆ 56    ┆ 52        │
# │ Chinstrap ┆ 0      ┆ 68    ┆ 0         │
# │ Gentoo    ┆ 124    ┆ 0     ┆ 0         │
# └───────────┴────────┴───────┴───────────┘

Instead of using .select() and horizontal .concat() to "rebuild" the frame:

values = ~cs.first()

pl.concat(
    [df_pivot.select(cs.first()), df_pivot.select(values / values.sum() * 100)], 
    how="horizontal"
)

You can use .with_columns()

values = ~cs.first()

df_pivot.with_columns(values / values.sum() * 100) # PercOptions.COLUMNS
# shape: (3, 4)
# ┌───────────┬───────────┬──────────┬───────────┐
# │ species   ┆ Biscoe    ┆ Dream    ┆ Torgersen │
# │ ---       ┆ ---       ┆ ---      ┆ ---       │
# │ str       ┆ f64       ┆ f64      ┆ f64       │
# ╞═══════════╪═══════════╪══════════╪═══════════╡
# │ Adelie    ┆ 26.190476 ┆ 45.16129 ┆ 100.0     │
# │ Chinstrap ┆ 0.0       ┆ 54.83871 ┆ 0.0       │
# │ Gentoo    ┆ 73.809524 ┆ 0.0      ┆ 0.0       │
# └───────────┴───────────┴──────────┴───────────┘

The ROWS and TOTAL calculations for completeness.

df_pivot.with_columns(values / pl.sum_horizontal(values) * 100) # PercOptions.ROWS
# shape: (3, 4)
# ┌───────────┬───────────┬───────────┬───────────┐
# │ species   ┆ Biscoe    ┆ Dream     ┆ Torgersen │
# │ ---       ┆ ---       ┆ ---       ┆ ---       │
# │ str       ┆ f64       ┆ f64       ┆ f64       │
# ╞═══════════╪═══════════╪═══════════╪═══════════╡
# │ Adelie    ┆ 28.947368 ┆ 36.842105 ┆ 34.210526 │
# │ Chinstrap ┆ 0.0       ┆ 100.0     ┆ 0.0       │
# │ Gentoo    ┆ 100.0     ┆ 0.0       ┆ 0.0       │
# └───────────┴───────────┴───────────┴───────────┘
df_pivot.with_columns(values / pl.sum_horizontal(values).sum() * 100) # PercOptions.TOTAL
# shape: (3, 4)
# ┌───────────┬───────────┬───────────┬───────────┐
# │ species   ┆ Biscoe    ┆ Dream     ┆ Torgersen │
# │ ---       ┆ ---       ┆ ---       ┆ ---       │
# │ str       ┆ f64       ┆ f64       ┆ f64       │
# ╞═══════════╪═══════════╪═══════════╪═══════════╡
# │ Adelie    ┆ 12.790698 ┆ 16.27907  ┆ 15.116279 │
# │ Chinstrap ┆ 0.0       ┆ 19.767442 ┆ 0.0       │
# │ Gentoo    ┆ 36.046512 ┆ 0.0       ┆ 0.0       │
# └───────────┴───────────┴───────────┴───────────┘

Expressions

As Expressions don't compute anything by themselves:

>>> print(values.sum() * 100)
# [([cs.all() - cs.first(require=true)].sum()) * (dyn int: 100)]

You can build an expression with the conditional logic and have a single "frame operation" returned at the end.

def crosstab(df, ...):
    if perc not in PercOptions:
        raise ...

    df_pivot = ...

    if perc == PercOptions.NO:
        return df_pivot

    values = ~cs.first()

    options = {
        PercOptions.COLUMNS: values.sum(),
        PercOptions.ROWS: pl.sum_horizontal(values),
        PercOptions.TOTAL: pl.sum_horizontal(values).sum()
    }

    return df_pivot.with_columns(values / options[perc] * 100)

Pipe

.pipe() is an alternative way of passing a DataFrame to a custom function.

penguins.pipe(crosstab, "species", "island", perc=PercOptions.TOTAL)
# shape: (3, 4)
# ┌───────────┬───────────┬───────────┬───────────┐
# │ species   ┆ Biscoe    ┆ Dream     ┆ Torgersen │
# │ ---       ┆ ---       ┆ ---       ┆ ---       │
# │ str       ┆ f64       ┆ f64       ┆ f64       │
# ╞═══════════╪═══════════╪═══════════╪═══════════╡
# │ Adelie    ┆ 12.790698 ┆ 16.27907  ┆ 15.116279 │
# │ Chinstrap ┆ 0.0       ┆ 19.767442 ┆ 0.0       │
# │ Gentoo    ┆ 36.046512 ┆ 0.0       ┆ 0.0       │
# └───────────┴───────────┴───────────┴───────────┘
\$\endgroup\$
3
  • \$\begingroup\$ Thank you very much! I don't see the usage of concat in your example though? \$\endgroup\$ Commented Oct 1 at 19:18
  • \$\begingroup\$ I was referring to your crosstab_perc = pl.concat( calls which break apart the same frame and do a horizontal concat. This can be done directly using with_columns as all the columns you are processing come from the same frame. I've added in an example which should hopefully make that clearer. \$\endgroup\$ Commented Oct 1 at 21:24
  • \$\begingroup\$ thank you very much. I have a working implementaion with your method which is much cleaner \$\endgroup\$ Commented Oct 2 at 11:27
8
\$\begingroup\$

I don't have any specifics about the business logic, but I do know how to write clean code, especially with my experience using Pandas.

Format

I immediately found the code hard to skim, especially with the long lines. You might want to apply a formatter like autopep8 or Black.

Here's an example with some chunks near the start

autopep8 makes some modest changes - just some extra whitespace:

# ...
from enum import Enum, auto


class PercOptions(Enum):
    NO = auto()
    TOTAL = auto()
    COLUMNS = auto()
    ROWS = auto()


def crosstab(df: pl.DataFrame, col_a: str, col_b: str, perc: PercOptions = PercOptions.NO) -> pl.DataFrame:
# ...

Black by default only goes one step further:

def crosstab(
    df: pl.DataFrame, col_a: str, col_b: str, perc: PercOptions = PercOptions.NO
) -> pl.DataFrame:

But adding trailing commas is where Black shines:

def crosstab(
    df: pl.DataFrame,
    col_a: str,
    col_b: str,
    perc: PercOptions = PercOptions.NO,
) -> pl.DataFrame:
    crosstab = df.pivot(
        on=col_b,
        index=col_a,
        values=col_b,
        aggregate_function="len",
        sort_columns=True,
    ).sort(col_a)

That's much easier to read at a glance.

Note that you could remove the trailing commas at this point and autopep8 wouldn't complain.

More style points

  • Unnecessary parentheses: ((pl.col(col) / tot) * 100)

    pl.col(col) / tot * 100
    
  • Strictly speaking, you can replace the elifs with ifs since you return on each of them, but this is a matter of taste.

  • Convention in Python is to avoid blank lines inside functions. As PEP 8 says: "Use blank lines in functions, sparingly, to indicate logical sections."

Shadowing

The name crosstab in def crosstab is a shadow. That's confusing and can lead to confusing bugs when editing the code later, so rename one of them.

crosstab() is the function name used by Pandas (which I'm more familiar with than Polars), so I would keep that. I'm not sure of a better name for the inner variable, so I would just add an underscore until I think of something better (crosstab_) or abbreviate it, like xtab or even ct.

Refactor

If you're not doing closures, there's no reason to do nested functions (at least off the top of my head). Bringing them to the top level lets you reduce nesting, which is a good thing!* If you don't want to expose them as public functions, you can prepend an underscore. You'll also need to rename them so that they're unique, e.g. _perc_cols_total, _perc_cols_columns, and _perc_cols_rows, to match the enum members they correspond to.

While refactoring, I checked whether each variable is free or bound and noticed that the free variable crosstab is used in the PercOptions.ROWS nested function, which seems to be a mistake: it should use its parameter df instead.

def _perc_cols_total(df):
    tot = df.select(~cs.by_index(0)).to_numpy().sum()
    for col in df.columns[1:]:
        yield pl.col(col) / tot * 100


def _perc_cols_columns(df):
    for col in df.columns[1:]:
        tot = df.select(col).to_numpy().sum()
        yield pl.col(col) / tot * 100


def _perc_cols_rows(df):
    hs = df.select(~cs.by_index(0)).sum_horizontal()
    for col in df.columns[1:]:
        yield pl.col(col) / hs * 100

Error message

Call repr() on the invalid value to make sure it's unambiguous, and add a colon to introduce it.

f"Unknown value for perc: {perc!r}"

It also wouldn't hurt to add a hint for the valid values, like list(PercOptions).

f"Unknown value for perc: {perc!r}. Choose from {list(PercOptions)}"

Fail fast

If perc is invalid, there's no sense computing the crosstab just to throw it away and raise an error. Check at the top of the function instead of at the bottom. This is called a "guard clause" or some variation like "guard statement".

if perc not in PercOptions:
    raise ValueError(f"Invalid value for perc: {perc!r}")

Note: I believe this is the right way to check enum membership. Correct me if I'm wrong. (maybe isinstance()?)

That said, I'll keep the "unknown value" error in case of programmer error (new items in enum not accounted for, accidentally deleting an elif, etc).

Use a linter

Pylint caught some of these problems itself:

Further refactors

You can factor out return crosstab_perc from the elifs. I think it would be cleanest to separate the code there, into "no further processing required" (at the if, i.e. return early) and "further processing required" (elifs). You could also convert the elifs into a match-case statement if you prefer.

Lastly, you have cs.by_index(0) all over, which I believe is idiomatic Polars, but just in case it's not, or if it's expensive to construct, you could factor it out to a module-level constant - call it something like _COL0.

* The Zen of Python: "Flat is better than nested."


All together

import polars as pl
import polars.selectors as cs
from enum import Enum, auto


class PercOptions(Enum):
    NO = auto()
    TOTAL = auto()
    COLUMNS = auto()
    ROWS = auto()


def _perc_cols_total(df):
    tot = df.select(~cs.by_index(0)).to_numpy().sum()
    for col in df.columns[1:]:
        yield pl.col(col) / tot * 100


def _perc_cols_columns(df):
    for col in df.columns[1:]:
        tot = df.select(col).to_numpy().sum()
        yield pl.col(col) / tot * 100


def _perc_cols_rows(df):
    hs = df.select(~cs.by_index(0)).sum_horizontal()
    for col in df.columns[1:]:
        yield pl.col(col) / hs * 100


def crosstab(
    df: pl.DataFrame,
    col_a: str,
    col_b: str,
    perc: PercOptions = PercOptions.NO,
) -> pl.DataFrame:
    if perc not in PercOptions:
        raise ValueError(
            f"Invalid value for perc: {perc!r}. Choose from {list(PercOptions)}")

    ct = df.pivot(
        on=col_b,
        index=col_a,
        values=col_b,
        aggregate_function="len",
        sort_columns=True,
    ).sort(col_a)
    if perc == PercOptions.NO:  # No further processing
        return ct

    # Further processing
    match perc:
        case PercOptions.TOTAL:
            crosstab_perc = pl.concat(
                [ct.select(cs.by_index(0)), ct.select(_perc_cols_total(ct))],
                how="horizontal",
            )
        case PercOptions.COLUMNS:
            crosstab_perc = pl.concat(
                [ct.select(cs.by_index(0)), ct.select(_perc_cols_columns(ct))],
                how="horizontal",
            )
        case PercOptions.ROWS:
            crosstab_perc = ct.select(cs.by_index(0), *_perc_cols_rows(ct))
        case _:
            raise ValueError(
                f"Unknown value for perc: {perc!r}. Possible programmer error.")
    return crosstab_perc
\$\endgroup\$
2
  • 1
    \$\begingroup\$ Great review! What are your thoughts on replacing the (exhaustive) if-elifchain by a match statement on perc, assigning crosstab_perc in each and only return once afterwards? \$\endgroup\$ Commented Sep 24 at 9:38
  • \$\begingroup\$ @ojdo That's a fine idea. Reducing exit points is generally good. I wouldn't want to reduce all the way to one though, because assigning crosstab_perc = ct would be confusing. It could just return early instead though. I'll edit that in! \$\endgroup\$ Commented Sep 24 at 14:48
5
\$\begingroup\$

Here are some general coding style suggestions.

Documentation

The PEP 8 style guide recommends adding docstrings for classes and functions. For example, you can use the description from the text of the question:

def crosstab(df:pl.DataFrame, col_a:str, col_b:str, perc:PercOptions=PercOptions.NO) -> pl.DataFrame:
    """
    Compute the cross tabulation between two variables in a polars dataframe.
    It supports absolute values, row percentages, column percentages and total percentages.
    """

Naming

It can be confusing to use the same name for your function and a variable in the function (crosstab):

def crosstab(df:pl.DataFrame, col_a:str, col_b:str, perc:PercOptions=PercOptions.NO) -> pl.DataFrame:
    crosstab = df.pivot(

I recommend renaming the function as something like create_crosstab:

def create_crosstab(df:pl.DataFrame, col_a:str, col_b:str, perc:PercOptions=PercOptions.NO) -> pl.DataFrame:
    crosstab = df.pivot(

This better represents what the function is doing.

There are many functions and variables with perc in the name. I think the code would be easier to read and understand using percent instead:

def percent_cols(df):

crosstab_percent = pl.concat(

Divide

There are several places where you perform a numeric division operation, such as:

pl.col(col) / tot

If it is possible for the denominator to be 0, it would be a good practice to check before the operation or to handle the error with an exception.

\$\endgroup\$
5
  • \$\begingroup\$ Thank you very much. I mean, can I also ask to further test the code? I obtained the result I expected for some values, but I would more confident in the code if more people tested it \$\endgroup\$ Commented Sep 23 at 18:42
  • \$\begingroup\$ @robertspierre: You're welcome. I agree that testing is a great approach, but I don't have any specific advice on new scenarios. Others are likely to come along and post answers as well. It's still early. \$\endgroup\$ Commented Sep 23 at 20:47
  • 1
    \$\begingroup\$ "I recommend renaming the function as something like create_crosstab" - FWIW, Pandas has a function crosstab(), so there's precedent at least. \$\endgroup\$ Commented Sep 24 at 1:50
  • \$\begingroup\$ @wjandrea In my experience, platforms are often horrible at naming things. A simple naming scheme is to name functions/methods (which do things) as verbs and variables/properties (which are things) as nouns. \$\endgroup\$ Commented Sep 24 at 4:50
  • \$\begingroup\$ @mdfst13 Yeah, but I think practicality beats purity in this case. These kind of functions are often used interactively, so shortness helps. \$\endgroup\$ Commented Sep 24 at 14:37

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.