Description
Code Sample, a copy-pastable example if possible
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: pd.__version__
Out[3]: u'0.19.0'
In [4]: mi = pd.MultiIndex.from_tuples([[1, 1, 3], [1, 1, np.nan]], names=list('ABC'))
In [5]: df = pd.DataFrame([[1, 2], [3, 4]], mi)
In [6]: df.sort_index(na_position="first")
Out[6]:
         0  1
A B C
1 1 NaN  3  4
    3    1  2
In [7]: df.sort_index(na_position="last")
Out[7]:
         0  1
A B C
1 1 NaN  3  4
    3    1  2

Problem description
The na_position argument isn't used in DataFrame.sort_index() or Series.sort_index() due to the way we sort the MultiIndex. Whenever we create a MultiIndex, we store the labels as relative values. For instance, if we have the following MultiIndex:
MultiIndex.from_tuples([[1, 1, 3], [1, 1, np.nan]], names=list('ABC'))
the values get stored as
MultiIndex(levels=[[1], [1], [3]],
           labels=[[0, 0], [0, 0], [0, -1]],
           names=[u'A', u'B', u'C'])
with a NaN placeholder of -1.
These label values are what get passed to the sorting algorithm for both DataFrames and Series. Since the sorting only happens on the labels, it has no notion of the NaN.
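To make the point concrete, here is a minimal pure-Python sketch of the same idea. It is not the pandas implementation; `factorize` is a hypothetical helper written only for illustration, and it assumes each level is stored as sorted unique values plus integer labels, with -1 standing in for NaN.

```python
import math


def is_nan(value):
    """True only for float NaN; other values are treated as present."""
    return isinstance(value, float) and math.isnan(value)


def factorize(values):
    """Hypothetical helper: split one index level into (levels, labels).

    Missing values get the label -1, mirroring the placeholder described above.
    """
    levels = sorted({v for v in values if not is_nan(v)})
    position = {v: i for i, v in enumerate(levels)}
    labels = [-1 if is_nan(v) else position[v] for v in values]
    return levels, labels


# Level 'C' of the example index holds 3 and NaN.
levels, labels = factorize([3, float("nan")])

# Sorting on the labels alone: -1 is just an integer smaller than 0,
# so the NaN row always sorts first, whatever na_position says.
order = sorted(range(len(labels)), key=labels.__getitem__)
```

Here `labels` comes out as `[0, -1]` and `order` as `[1, 0]`, which matches the NaN-first result seen for both values of na_position.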
This has been discussed in #14015 and #14672.
My original naive solution was to change these lines from:
indexer = _lexsort_indexer(labels.labels, orders=ascending,
                           na_position=na_position)
to
index_values_list = np.dstack(labels.get_values())[0].tolist()
indexer = _lexsort_indexer(index_values_list, orders=ascending,
                           na_position=na_position)
This didn't break any tests, but it isn't necessarily the best approach.
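For comparison, a different approach would be to keep sorting on the integer labels but remap the -1 placeholder according to na_position, instead of materializing the index values. The sketch below is only an illustration of that alternative, not the change proposed above and not pandas code; `lexsort_labels` is a hypothetical name, and it assumes the levels are already sorted so that label order matches value order.

```python
def lexsort_labels(labels_per_level, level_sizes, na_position="last"):
    """Hypothetical sketch: row order for a lexicographic sort on labels.

    labels_per_level: one list of integer labels per index level (-1 = NaN).
    level_sizes: number of distinct non-missing values in each level.
    """
    if na_position not in ("first", "last"):
        raise ValueError("na_position must be 'first' or 'last'")

    def key(row):
        parts = []
        for labels, size in zip(labels_per_level, level_sizes):
            label = labels[row]
            if label == -1 and na_position == "last":
                # Push the missing value past every real label in this level.
                label = size
            parts.append(label)
        return tuple(parts)

    n_rows = len(labels_per_level[0])
    return sorted(range(n_rows), key=key)


# Labels from the MultiIndex in this report: row 1 has NaN in level 'C'.
labels = [[0, 0], [0, 0], [0, -1]]
sizes = [1, 1, 1]

nan_first = lexsort_labels(labels, sizes, na_position="first")
nan_last = lexsort_labels(labels, sizes, na_position="last")
```

With these inputs `nan_first` is `[1, 0]` and `nan_last` is `[0, 1]`, so the second call gives the row order shown under Expected Output.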
Expected Output
In [7]: df.sort_index(na_position="last")
Out[7]:
         0  1
A B C
1 1 3    1  2
    NaN  3  4

Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 3.16.0-77-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None
pandas: 0.19.0
nose: 1.3.4
pip: 9.0.0
setuptools: 27.2.0
Cython: 0.21
numpy: 1.11.2
scipy: 0.16.1
statsmodels: 0.6.1
xarray: None
IPython: 4.0.0
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.5.0
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.5.7
lxml: 3.4.0
bs4: 4.3.2
html5lib: None
httplib2: 0.9.2
apiclient: 1.5.5
sqlalchemy: 0.9.7
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.7.3
boto: 2.32.1
pandas_datareader: None