I don't have any specifics about the business logic, but I do know how to write clean code, especially with my experience using Pandas.
Format
I immediately found the code hard to skim, especially with the long lines. You might want to apply a formatter like autopep8 or Black.
Here's an example with some chunks near the start
autopep8 makes some modest changes - just some extra whitespace:
# ...
from enum import Enum, auto
class PercOptions(Enum):
NO = auto()
TOTAL = auto()
COLUMNS = auto()
ROWS = auto()
def crosstab(df: pl.DataFrame, col_a: str, col_b: str, perc: PercOptions = PercOptions.NO) -> pl.DataFrame:
# ...
Black by default only goes one step further:
def crosstab(
df: pl.DataFrame, col_a: str, col_b: str, perc: PercOptions = PercOptions.NO
) -> pl.DataFrame:
But adding trailing commas is where Black shines:
def crosstab(
df: pl.DataFrame,
col_a: str,
col_b: str,
perc: PercOptions = PercOptions.NO,
) -> pl.DataFrame:
crosstab = df.pivot(
on=col_b,
index=col_a,
values=col_b,
aggregate_function="len",
sort_columns=True,
).sort(col_a)
That's much easier to read at a glance.
Note that you could remove the trailing commas at this point and autopep8 wouldn't complain.
More style points
Unnecessary parentheses: ((pl.col(col) / tot) * 100)
pl.col(col) / tot * 100
Strictly speaking, you can replace the elifs with ifs since you return on each of them, but this is a matter of taste.
Convention in Python is to avoid blank lines inside functions. As PEP 8 says: "Use blank lines in functions, sparingly, to indicate logical sections."
Shadowing
The name crosstab in def crosstab is a shadow. That's confusing and can lead to confusing bugs when editing the code later, so rename one of them.
crosstab() is the function name used by Pandas (which I'm more familiar with than Polars), so I would keep that. I'm not sure of a better name for the inner variable, so I would just add an underscore until I think of something better (crosstab_) or abbreviate it, like xtab or even ct.
Refactor
If you're not doing closures, there's no reason to do nested functions (at least off the top of my head). Bringing them to the top level lets you reduce nesting, which is a good thing!* If you don't want to expose them as public functions, you can prepend an underscore. You'll also need to rename them so that they're unique, e.g. _perc_cols_total, _perc_cols_columns, and _perc_cols_rows, to match the enum members they correspond to.
While refactoring, I checked whether each variable is free or bound and
noticed that the free variable crosstab is used in the PercOptions.ROWS nested function, which seems to be a mistake: it should use its parameter df instead.
def _perc_cols_total(df):
tot = df.select(~cs.by_index(0)).to_numpy().sum()
for col in df.columns[1:]:
yield pl.col(col) / tot * 100
def _perc_cols_columns(df):
for col in df.columns[1:]:
tot = df.select(col).to_numpy().sum()
yield pl.col(col) / tot * 100
def _perc_cols_rows(df):
hs = df.select(~cs.by_index(0)).sum_horizontal()
for col in df.columns[1:]:
yield pl.col(col) / hs * 100
Error message
Call repr() on the invalid value to make sure it's unambiguous, and add a colon to introduce it.
f"Unknown value for perc: {perc!r}"
It also wouldn't hurt to add a hint for the valid values, like list(PercOptions).
f"Unknown value for perc: {perc!r}. Choose from {list(PercOptions)}"
Fail fast
If perc is invalid, there's no sense computing the crosstab just to throw it away and raise an error. Check at the top of the function instead of at the bottom. This is called a "guard clause" or some variation like "guard statement".
if perc not in PercOptions:
raise ValueError(f"Invalid value for perc: {perc!r}")
Note: I believe this is the right way to check enum membership. Correct me if I'm wrong. (maybe isinstance()?)
That said, I'll keep the "unknown value" error in case of programmer error (new items in enum not accounted for, accidentally deleting an elif, etc).
Use a linter
Pylint caught some of these problems itself:
Further refactors
You can factor out return crosstab_perc from the elifs. I think it would be cleanest to separate the code there, into "no further processing required" (at the if, i.e. return early) and "further processing required" (elifs). You could also convert the elifs into a match-case statement if you prefer.
Lastly, you have cs.by_index(0) all over, which I believe is idiomatic Polars, but just in case it's not, or if it's expensive to construct, you could factor it out to a module-level constant - call it something like _COL0.
* The Zen of Python: "Flat is better than nested."
All together
import polars as pl
import polars.selectors as cs
from enum import Enum, auto
class PercOptions(Enum):
NO = auto()
TOTAL = auto()
COLUMNS = auto()
ROWS = auto()
def _perc_cols_total(df):
tot = df.select(~cs.by_index(0)).to_numpy().sum()
for col in df.columns[1:]:
yield pl.col(col) / tot * 100
def _perc_cols_columns(df):
for col in df.columns[1:]:
tot = df.select(col).to_numpy().sum()
yield pl.col(col) / tot * 100
def _perc_cols_rows(df):
hs = df.select(~cs.by_index(0)).sum_horizontal()
for col in df.columns[1:]:
yield pl.col(col) / hs * 100
def crosstab(
df: pl.DataFrame,
col_a: str,
col_b: str,
perc: PercOptions = PercOptions.NO,
) -> pl.DataFrame:
if perc not in PercOptions:
raise ValueError(
f"Invalid value for perc: {perc!r}. Choose from {list(PercOptions)}")
ct = df.pivot(
on=col_b,
index=col_a,
values=col_b,
aggregate_function="len",
sort_columns=True,
).sort(col_a)
if perc == PercOptions.NO: # No further processing
return ct
# Further processing
match perc:
case PercOptions.TOTAL:
crosstab_perc = pl.concat(
[ct.select(cs.by_index(0)), ct.select(_perc_cols_total(ct))],
how="horizontal",
)
case PercOptions.COLUMNS:
crosstab_perc = pl.concat(
[ct.select(cs.by_index(0)), ct.select(_perc_cols_columns(ct))],
how="horizontal",
)
case PercOptions.ROWS:
crosstab_perc = ct.select(cs.by_index(0), *_perc_cols_rows(ct))
case _:
raise ValueError(
f"Unknown value for perc: {perc!r}. Possible programmer error.")
return crosstab_perc