I have two questions. First, filling in the data at the end of my code triggers the error shown below. Second, since I am not too familiar with pandas, this code is probably far from idiomatic. If you have any improvements, feel free to help make it more compact and efficient.

The code is supposed to create a crosswalk from x to y. The database may contain the same x<->y relationship several times, but each x should map to a single y. For every x, I check that the database is actually consistent: if there is more than one relation, they must all map to the same y.

Beginning of the crosswalk.csv:

x,y
832,"6231"
0,"00000000"
0,"00000000"
0,"00000000"
0,"00000000"
0,"00000000"
0,"00000000"
840,"6214"
842,"6111"

The code

data = pd.read_csv('data/crosswalk_short.csv')
df = pd.DataFrame(data)

xs = df.x.unique()
result = pd.DataFrame(index=xs)
result.fillna(NaN)

for x in xs:
    ys = df[df.x == x].y
    range = arange(0, len(ys.index))
    ys = ys.reindex(range)

    if (range[-1] > 0 and not isnan(ys[1]) ):
        print 'error!'

    result._ix[x] = ys[0]

The error:

  File "<ipython-input-129-4cf0c04508c4>", line 1, in <module>
    result._ix[x] = ys[0]
TypeError: 'NoneType' object does not support item assignment

1 Answer

Part 1

Anything with a single underscore as the first character of its name is generally "private", which in the pandas code base really means "subject to change". So you shouldn't be using _ix for anything. Use loc, iloc, or [] syntax to perform assignment and to select subsets of your data (ix is also public, but it has been deprecated since pandas 0.20 and was removed in 1.0). This error happens because _ix is not instantiated until you call ix (its value is None until then), but this implementation detail is completely irrelevant to you as a user of pandas. Use the public APIs and you usually won't get these kinds of errors.
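
As a rough illustration of the public route, here is a minimal sketch of the assignment step written with loc. The small frame and the "y" column on result are assumptions made for the example, not part of the original code.

```python
import pandas as pd

# Toy stand-in for the crosswalk file (assumed values).
df = pd.DataFrame({"x": [832, 0, 0, 840], "y": [6231, 0, 0, 6214]})

# One row per distinct x, with an empty "y" column to fill in.
result = pd.DataFrame(index=df["x"].unique(), columns=["y"])

for x in result.index:
    ys = df.loc[df["x"] == x, "y"]
    # .loc is the public, label-based accessor for assigning a single cell.
    result.loc[x, "y"] = ys.iloc[0]
```

Here ys.iloc[0] picks the first matching y by position, so no reindexing of ys is needed.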

Also, this line

result.fillna(NaN)

is a no-op because by default fillna returns a copy. If you want to update result in place, do

result.fillna(NaN, inplace=True)

This API convention is fairly consistent throughout pandas. That is, for methods where it makes sense to do so, the function signatures have something like

object.method(..., inplace=False)

by default.
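
To make the copy-versus-in-place behaviour concrete, here is a small sketch. The frame and the fill value 0 are assumptions chosen so that the effect is visible.

```python
import numpy as np
import pandas as pd

result = pd.DataFrame({"y": [1.0, np.nan, 3.0]})

# Without inplace=True the filled frame is returned and `result` is untouched.
filled = result.fillna(0)
n_missing_before = int(result["y"].isna().sum())
n_missing_filled = int(filled["y"].isna().sum())

# Rebinding the name is a common alternative to inplace=True.
result = result.fillna(0)
```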

Part 2

As for your second question, it looks like you want to check whether all duplicate xs have the same y value. One way to do that is:

df.groupby('x').filter(lambda x: x.count() > 1).groupby('x').y.nunique() == 1

This says:

  1. groupby the 'x' column
  2. give me subsets where there's more than a single label in the groups (repeated values in 'x')
  3. groupby our new de-single-fied 'x' column
  4. tell me whether there's exactly one unique 'y' for each value in 'x'

If 4. is False for any of the groups, that means you have x values repeated, where the y values are different.
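
The four steps can also be written out with intermediate names, which may be easier to follow. This is only a sketch: the data is made up, and the filter uses len(g) > 1 as a scalar variant of the count test rather than the exact expression above.

```python
import pandas as pd

# Made-up data: x=1 and x=840 are repeated with different y values.
df = pd.DataFrame({
    "x": [832, 1, 1, 0, 0, 840, 840],
    "y": [6231, 0, 1, 0, 0, 6214, 6111],
})

# Steps 1-2: keep only rows whose x value occurs more than once.
repeated = df.groupby("x").filter(lambda g: len(g) > 1)

# Steps 3-4: group again and test for exactly one distinct y per x.
consistent = repeated.groupby("x")["y"].nunique() == 1

# x values whose repeated rows disagree on y.
conflicting = consistent.index[~consistent].tolist()
```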

Here's an example of this in action (I've modified your original dataset a little bit):

In [94]: df = pd.read_csv(StringIO('''x,y
832,"6231"
1,"00000000"
1,"00000001"
0,"00000000"
0,"00000000"
0,"00000000"
0,"00000000"
840,"6214"
840,"6111"'''))

In [95]: df.groupby('x').filter(lambda x: x.count() > 1).groupby('x').y.nunique() == 1
Out[95]:
x
0       True
1      False
840    False
dtype: bool
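
If the goal is the crosswalk itself, one possible alternative (not the groupby approach shown here) is to drop exact duplicate rows first and then look for x values that still repeat. The sketch below assumes data shaped like the question's sample; the names pairs, conflicts and crosswalk are illustrative.

```python
import pandas as pd

# Assumed data in the shape of the question's crosswalk.csv.
df = pd.DataFrame({
    "x": [832, 0, 0, 0, 840, 842],
    "y": ["6231", "00000000", "00000000", "00000000", "6214", "6111"],
})

# Collapse exact repeats of the same (x, y) pair.
pairs = df.drop_duplicates()

# Any x that still appears more than once maps to different y values.
conflicts = pairs[pairs["x"].duplicated(keep=False)]

# With no conflicts, the deduplicated pairs are the crosswalk.
crosswalk = pairs.set_index("x")["y"]
```

An empty conflicts frame means every x maps to a single y, so crosswalk can be used directly as a lookup.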