I am unsure about the wisdom of encoding NaN in integer columns as one less than the minimum value in your data. Also, the mechanism used to decide whether to convert to integers could have issues if all the numbers were very small. Aside from those two points, this review is about code clarity, not function.
###Stacked Ifs Often Should Be Consolidated
Roughly half the code you have presented here is a large structure of stacked ifs that test for the range of values. Each if, and the action it triggers, is very similar to the others. In cases like this it is often much clearer to code only the differences between the ifs, and then have common code work with those codified differences. So, for example, this:
```
# Make Integer/unsigned Integer datatypes
if IsInt:
    if mn >= 0:
        if mx < 255:
            df[col] = df[col].astype(np.uint8)
        elif mx < 65535:
            df[col] = df[col].astype(np.uint16)
        ....
    else:
        if mn > np.iinfo(np.int8).min and mx < np.iinfo(np.int8).max:
            df[col] = df[col].astype(np.int8)
        elif mn > np.iinfo(np.int16).min and mx < np.iinfo(np.int16).max:
            ....
# Make float datatypes
else:
    if mn > np.finfo(np.float16).min and mx < np.finfo(np.float16).max:
        df[col] = df[col].astype(np.float16)
    elif mn > np.finfo(np.float32).min and mx < np.finfo(np.float32).max:
        ....
```
Can be changed to:
```
if IsInt:
    info = np.iinfo
    # Make Integer/unsigned Integer datatypes
    if mn >= 0:
        types = (np.uint8, np.uint16, np.uint32, np.uint64)
    else:
        types = (np.int8, np.int16, np.int32, np.int64)
else:
    info = np.finfo
    types = (np.float16, np.float32, np.float64)

for t in types:
    if info(t).min <= mn and mx <= info(t).max:
        df[col] = df[col].astype(t)
        break
```
This code makes the differences between signed, unsigned, and float types explicit. The action code then works against that description of the differences, so the action taken is also more explicit. Additionally, the resulting code is much smaller.
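To see how the pieces fit together, here is a minimal, runnable sketch of the consolidated approach applied to a whole frame. The `downcast_columns` name and the sample data are mine, purely for illustration; the rest mirrors the structure of your code:

```
import numpy as np
import pandas as pd

def downcast_columns(df):
    """Shrink each numeric column to the smallest dtype whose range fits its data."""
    for col in df.columns:
        if df[col].dtype == object:
            continue
        mn, mx = df[col].min(), df[col].max()

        # Same spirit as your IsInt test: do the values survive a round trip to int64?
        as_int = df[col].fillna(0).astype(np.int64)
        is_int = abs((df[col] - as_int).sum()) < 0.01

        if is_int:
            info = np.iinfo
            types = ((np.uint8, np.uint16, np.uint32, np.uint64)
                     if mn >= 0 else
                     (np.int8, np.int16, np.int32, np.int64))
        else:
            info = np.finfo
            types = (np.float16, np.float32, np.float64)

        # Pick the first (smallest) type whose range covers the column.
        for t in types:
            if info(t).min <= mn and mx <= info(t).max:
                df[col] = df[col].astype(t)
                break
    return df

df = pd.DataFrame({"a": [1.0, 2.0, 250.0], "b": [-1.5, 0.25, 3.75]})
print(downcast_columns(df).dtypes)   # a -> uint8, b -> float16
```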
###Chained Comparisons
I would change:
```
# test if column can be converted to an integer
asint = df[col].fillna(0).astype(np.int64)
result = (df[col] - asint)
result = result.sum()
if result > -0.01 and result < 0.01:
    IsInt = True

# Make Integer/unsigned Integer datatypes
if IsInt:
```
To:
```
# test if column can be converted to an integer
asint = df[col].fillna(0).astype(np.int64)
errors_from_convert_to_int = (df[col] - asint).sum()
if -0.01 < errors_from_convert_to_int < 0.01:
```
Changes include:
- name the result in a manner that describes what was calculated
- use chained comparisons (illustrated below)
- remove the unneeded intermediate variable `IsInt`
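If you have not used chained comparisons before: they behave exactly like the two-clause `and`, but read like the mathematical interval they describe. A quick illustration (not from your code):

```
x = 0.005
print(-0.01 < x < 0.01)           # True; evaluated as (-0.01 < x) and (x < 0.01)
print(-0.01 < x and x < 0.01)     # True; the spelled-out equivalent

x = 0.5
print(-0.01 < x < 0.01)           # False; x is outside the interval
```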
###Be Careful with Ranges
In the code that encodes a NaN as min - 1, there is a comment about integers, but the code is applied to all columns.
In addition, the range needs to be expanded to include the now lower minimum value. How about:
```
if -0.01 < errors_from_convert_to_int < 0.01:
    # Integer does not support NA, therefore NA needs to be filled
    if not np.isfinite(df[col]).all():
        na_list.append(col)
        mn -= 1
        df[col].fillna(mn, inplace=True)
```
###Bonus Comment:
This suggestion fits more in the personal-style bucket. The main loop works on each column in turn, but it does not want to work on object columns. That is coded as:
```
for col in df.columns:
    if df[col].dtype != object:
```
But to know that the code only ever works with non-object columns, the reader has to scroll down looking for an else. The fact that the loop only works on non-object columns can be made explicit (that is, expressed in one place) with something like:
```
for col in (c for c in df.columns if df[c].dtype != object):
```
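For completeness, a tiny sketch of how that reads in context (the sample frame is just for illustration):

```
import pandas as pd

df = pd.DataFrame({"name": ["a", "b"], "count": [1, 2], "score": [0.5, 1.5]})

# Only the numeric (non-object) columns are ever seen inside the loop body.
for col in (c for c in df.columns if df[c].dtype != object):
    print(col, df[col].dtype)   # count int64, score float64
```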