I am unsure about the wisdom of encoding NaN in integer columns as one less than the minimum value in your data. Also, the mechanism used to decide whether to convert to integers could have issues if all the numbers were very small. Aside from those two points, this review is about code clarity, not function.
###Stacked Ifs Often Should Be Consolidated
Roughly half the code you have presented here is a large structure of stacked ifs that test for the range of values. Each if, and the action it triggers, is very similar to the others. In cases like this it is often much clearer to code only the differences between the ifs, and then have common code work with those codified differences. So, for example, this:
```
# Make Integer/unsigned Integer datatypes
if IsInt:
    if mn >= 0:
        if mx < 255:
            df[col] = df[col].astype(np.uint8)
        elif mx < 65535:
            df[col] = df[col].astype(np.uint16)
        ....
    else:
        if mn > np.iinfo(np.int8).min and mx < np.iinfo(np.int8).max:
            df[col] = df[col].astype(np.int8)
        elif mn > np.iinfo(np.int16).min and mx < np.iinfo(np.int16).max:
            ....
# Make float datatypes
else:
    if mn > np.finfo(np.float16).min and mx < np.finfo(np.float16).max:
        df[col] = df[col].astype(np.float16)
    elif mn > np.finfo(np.float32).min and mx < np.finfo(np.float32).max:
        ....
```
Can be changed to:
```
if IsInt:
    info = np.iinfo
    # Make Integer/unsigned Integer datatypes
    if mn >= 0:
        types = (np.uint8, np.uint16, np.uint32, np.uint64)
    else:
        types = (np.int8, np.int16, np.int32, np.int64)
else:
    info = np.finfo
    types = (np.float16, np.float32, np.float64)

for t in types:
    if info(t).min <= mn and mx <= info(t).max:
        df[col] = df[col].astype(t)
        break
```
This code makes the differences between signed, unsigned, and float types explicit. The action code then works against that description of the differences, so the action taken is also more explicit. Additionally, the resulting code is much smaller.
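To see how the pieces fit together, here is a minimal, runnable sketch of the consolidated approach applied to a whole frame. The `downcast_columns` name and the sample data are mine, purely for illustration; the rest mirrors the structure of your code:

```
import numpy as np
import pandas as pd

def downcast_columns(df):
    """Shrink each numeric column to the smallest dtype whose range fits its data."""
    for col in df.columns:
        if df[col].dtype == object:
            continue
        mn, mx = df[col].min(), df[col].max()

        # Same spirit as your IsInt test: do the values survive a round trip to int64?
        as_int = df[col].fillna(0).astype(np.int64)
        is_int = abs((df[col] - as_int).sum()) < 0.01

        if is_int:
            info = np.iinfo
            types = ((np.uint8, np.uint16, np.uint32, np.uint64)
                     if mn >= 0 else
                     (np.int8, np.int16, np.int32, np.int64))
        else:
            info = np.finfo
            types = (np.float16, np.float32, np.float64)

        # Pick the first (smallest) type whose range covers the column.
        for t in types:
            if info(t).min <= mn and mx <= info(t).max:
                df[col] = df[col].astype(t)
                break
    return df

df = pd.DataFrame({"a": [1.0, 2.0, 250.0], "b": [-1.5, 0.25, 3.75]})
print(downcast_columns(df).dtypes)   # a -> uint8, b -> float16
```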
###Chained Comparisons
I would change:
```
# test if column can be converted to an integer
asint = df[col].fillna(0).astype(np.int64)
result = (df[col] - asint)
result = result.sum()
if result > -0.01 and result < 0.01:
    IsInt = True

# Make Integer/unsigned Integer datatypes
if IsInt:
```
To:
```
# test if column can be converted to an integer
asint = df[col].fillna(0).astype(np.int64)
errors_from_convert_to_int = (df[col] - asint).sum()
if -0.01 < errors_from_convert_to_int < 0.01:
```
Changes include:
- name the result in a manner that describes what was calculated
- use chained comparisons (illustrated below)
- remove the unneeded intermediate variable `IsInt`
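If you have not used chained comparisons before: they behave exactly like the two-clause `and`, but read like the mathematical interval they describe. A quick illustration (not from your code):

```
x = 0.005
print(-0.01 < x < 0.01)           # True; evaluated as (-0.01 < x) and (x < 0.01)
print(-0.01 < x and x < 0.01)     # True; the spelled-out equivalent

x = 0.5
print(-0.01 < x < 0.01)           # False; x is outside the interval
```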
###Be Careful with Ranges
In the code that encodes a NaN as min - 1, there is a comment about integers, but the code is applied to all columns.
In addition, the range needs to be expanded to include the now lower minimum value. How about:
```
if -0.01 < errors_from_convert_to_int < 0.01:
    # Integer does not support NA, therefore NA needs to be filled
    if not np.isfinite(df[col]).all():
        na_list.append(col)
        mn -= 1
        df[col].fillna(mn, inplace=True)
```
###Bonus Comment:
This suggestion fits more in the personal-style bucket. The main loop works on each column in turn, but it does not want to work on object columns. That is coded as:
```
for col in df.columns:
    if df[col].dtype != object:
```
But to know that the code only ever works with non-object columns, the reader has to scroll down looking for an else. The fact that the loop only works on non-object columns can be made explicit (that is, expressed in one place) with something like:
```
for col in (c for c in df.columns if df[c].dtype != object):
```
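For completeness, a tiny sketch of how that reads in context (the sample frame is just for illustration):

```
import pandas as pd

df = pd.DataFrame({"name": ["a", "b"], "count": [1, 2], "score": [0.5, 1.5]})

# Only the numeric (non-object) columns are ever seen inside the loop body.
for col in (c for c in df.columns if df[c].dtype != object):
    print(col, df[col].dtype)   # count int64, score float64
```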