How can one read/write pandas DataFrames (Numpy arrays) of strings in Cython?
It works just fine when I work with integers or floats:
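# Cython file numpy_.pyx
cimport numpy as np
from cython cimport boundscheck, wraparound

@boundscheck(False)
@wraparound(False)
cpdef fill(np.int64_t[:, ::1] arr):
    # Overwrite the top-left 2x2 block in place
    arr[0, 0] = 10
    arr[0, 1] = 11
    arr[1, 0] = 13
    arr[1, 1] = 14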
# Python code
import numpy as np
from numpy_ import fill
a = np.array([[0,1,2],[3,4,5]], dtype=np.int64)
print(a)
fill(a)
print(a)
gives
>>> a = np.array([[0,1,2],[3,4,5]], dtype=np.int64)
>>> print(a)
[[0 1 2]
 [3 4 5]]
>>> fill(a)
>>> print(a)
[[10 11  2]
 [13 14  5]]
Also, the following code
# Python code
import numpy as np, pandas as pd
from numpy_ import fill
a = np.array([[0,1,2],[3,4,5]], dtype=np.int64)
df = pd.DataFrame(a)
print(df)
fill(df.values)
print(df)
gives
>>> a = np.array([[0,1,2],[3,4,5]], dtype=np.int64)
>>> df = pd.DataFrame(a)
>>> print(df)
   0  1  2
0  0  1  2
1  3  4  5
>>> fill(df.values)
>>> print(df)
    0   1  2
0  10  11  2
1  13  14  5
However, I am having a hard time figuring out how to do the same thing when the input is an array of strings. For example, how can I read or modify a NumPy array or a pandas DataFrame such as:
a2 = np.array([['000','111','222'],['333','444','555']], dtype='U3')
df2 = pd.DataFrame(a2)
and, let us say, the goal is to change through Cython
'000' -> 'AAA'; '111' -> 'BBB'; '222' -> 'CCC'; '333' -> 'DDD'
I did read the relevant NumPy documentation page and the relevant Cython documentation page, but still cannot figure out what to do.
Thank you very much for your help!
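One possible direction, sketched here in plain NumPy rather than Cython: a fixed-width 'U3' array stores each character as a 4-byte code point, so it can be reinterpreted as an unsigned 32-bit integer array with one extra axis for the characters. The names `codes` and `replacements` are made up for illustration, and the sketch assumes a C-contiguous array in native byte order. A Cython function could presumably take such a view as `np.uint32_t[:, :, ::1]`, but that signature is an assumption and is not tested here.

```python
import numpy as np

a2 = np.array([['000', '111', '222'], ['333', '444', '555']], dtype='U3')

# Reinterpret the same memory as code points: shape (rows, cols, 3).
# This is a view, so writing into `codes` changes `a2`.
codes = a2.view(np.uint32).reshape(a2.shape + (3,))

# Hypothetical mapping taken from the goal stated above.
replacements = {'000': 'AAA', '111': 'BBB', '222': 'CCC', '333': 'DDD'}

for i in range(a2.shape[0]):
    for j in range(a2.shape[1]):
        new = replacements.get(a2[i, j])
        if new is not None:
            # Write the new string one code point at a time.
            codes[i, j, :] = [ord(ch) for ch in new]

print(a2)
```

The integer view is the part a typed memoryview can work with; the dictionary lookup is only there to keep the sketch short.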
pandas does not use the numpy string dtypes. It makes those series object dtype. Look at df2.dtypes.
cpdef fill_str(np.object_t[:,::1] arr)?
Why does type(df2.at[0,0]) then give <class 'str'> (i.e. not 'object')?
str is an object. A dataframe designed to hold object can hold any subclass of object including str dtype. It also doesn't help with Unicode. I don't really have much advice beyond what's in this comment...
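To illustrate the point about object dtype, here is a minimal NumPy-only sketch. It converts the 'U3' array to an object array, where each element is an ordinary Python str, and replaces elements in place. The names `obj` and `replacements` are illustrative. In Cython, an object array like this could presumably be accepted as an `object[:, ::1]` memoryview, which is an assumption rather than something verified here.

```python
import numpy as np

a2 = np.array([['000', '111', '222'], ['333', '444', '555']], dtype='U3')

# astype(object) makes a copy whose elements are Python str objects.
obj = a2.astype(object)
print(obj.dtype)
print(isinstance(obj[0, 0], str), isinstance(obj[0, 0], object))

# Hypothetical mapping taken from the question.
replacements = {'000': 'AAA', '111': 'BBB', '222': 'CCC', '333': 'DDD'}

for i in range(obj.shape[0]):
    for j in range(obj.shape[1]):
        # Keep the old value when there is no replacement for it.
        obj[i, j] = replacements.get(obj[i, j], obj[i, j])

print(obj)
```

Because `astype(object)` copies, the original `a2` is left unchanged. Every element access goes through Python objects, so this route is unlikely to be faster than plain Python.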