In Python 2.7:
In [375]: arr=np.array([u"array",u"of",u"unicode"],dtype=np.unicode)
In [376]: arr
Out[376]:
array([u'array', u'of', u'unicode'],
dtype='<U7')
In [377]: arr.dtype
Out[377]: dtype('<U7')
In [378]: type(arr[0])
Out[378]: numpy.unicode_
In [379]: type(arr[0].item())
Out[379]: unicode
In general, x[0] returns an element of x wrapped in a NumPy scalar type rather than a plain Python object. In this case the scalar type is np.unicode_, which is a subclass of unicode.
In [384]: isinstance(arr[0],np.unicode_)
Out[384]: True
In [385]: isinstance(arr[0],unicode)
Out[385]: True
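For anyone reading this on Python 3, here is a rough equivalent of the session above. It is only a sketch and assumes a recent NumPy, where np.unicode is gone and np.str_ plays the role of np.unicode_ (and str the role of unicode):

```python
import numpy as np

# Python 3 sketch: plain str literals are already unicode.
arr = np.array(["array", "of", "unicode"])
elem = arr[0]

print(arr.dtype)                  # fixed-width unicode dtype, 7 characters wide
print(type(elem))                 # NumPy scalar type, not plain str
print(isinstance(elem, np.str_))  # element is a NumPy string scalar
print(isinstance(elem, str))      # ...which also subclasses Python's str
print(type(elem.item()))          # .item() unwraps to a plain Python str
```

The point is the same as in the 2.7 session: indexing hands back a NumPy scalar, and .item() is what gives you the native Python object.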
I think you'd encounter the same sort of issues between np.int32 and int, but I haven't worked enough with Cython to be sure.
Where have you seen Cython code that specifies a string (unicode or byte) dtype?
http://docs.cython.org/src/tutorial/numpy.html has expressions like:
# We now need to fix a datatype for our arrays. I've used the variable
# DTYPE for this, which is assigned to the usual NumPy runtime
# type info object.
DTYPE = np.int
# "ctypedef" assigns a corresponding compile-time type to DTYPE_t. For
# every type in the numpy module there's a corresponding compile-time
# type with a _t-suffix.
ctypedef np.int_t DTYPE_t
....
def naive_convolve(np.ndarray[DTYPE_t, ndim=2] f):
The purpose of the [] part is to improve indexing efficiency.
What we need to do then is to type the contents of the ndarray objects. We do this with a special “buffer” syntax which must be told the datatype (first argument) and number of dimensions (“ndim” keyword-only argument, if not provided then one-dimensional is assumed).
I don't think np.unicode will help, because it doesn't specify a character length. The full string dtype has to include the number of characters, e.g. <U7 in my example.
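To illustrate the length issue, here is a small sketch (again assuming Python 3 and a recent NumPy, with np.str_ standing in for np.unicode). The generic unicode dtype has no width of its own; NumPy only fills one in when it sees the data:

```python
import numpy as np

generic = np.dtype(np.str_)  # "unicode" with no character count
sized = np.dtype("U7")       # unicode, exactly 7 characters per element

print(generic, generic.itemsize)  # zero-width placeholder
print(sized, sized.itemsize)      # 7 chars * 4 bytes/char

# When an array is built, the width is inferred from the longest string.
inferred = np.array(["array", "of", "unicode"], dtype=np.str_).dtype
print(inferred)
```

So a compile-time type that only says "unicode" cannot describe the memory layout; the 7 is part of the dtype.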
We need to find working examples that pass string arrays, either in the Cython documentation or in other SO Cython questions.
For some operations, you could treat the unicode array as an array of int32.
In [397]: arr.nbytes
Out[397]: 84
That is 3 strings x 7 chars/string x 4 bytes/char = 84 bytes.
In [398]: arr.view(np.int32).reshape(-1,7)
Out[398]:
array([[ 97, 114, 114, 97, 121, 0, 0],
[111, 102, 0, 0, 0, 0, 0],
[117, 110, 105, 99, 111, 100, 101]])
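As a sketch of what "treating it as int32" could buy you, the snippet below computes string lengths from the code-point view and then views the integers back as strings. The names width, codes, lengths and restored are just mine, and it assumes Python 3 with a contiguous array and no embedded NUL characters in the strings:

```python
import numpy as np

arr = np.array(["array", "of", "unicode"])

# Each character is stored as a 4-byte code point, padded with zeros.
width = arr.dtype.itemsize // 4
codes = arr.view(np.int32).reshape(-1, width)

# Length of each string = number of non-zero code points in its row.
lengths = (codes != 0).sum(axis=1)
print(lengths)

# The integer block can be viewed back as the original string dtype.
restored = codes.view(arr.dtype).ravel()
print(restored)
```

A 2-D int32 array like codes is something the typed-buffer syntax can handle, which is why this route might be workable in Cython.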
Cython gives you the greatest speed improvement when you can bypass Python functions and methods. That would include bypassing much of the Python string and unicode functionality.
unicode objects in your Cython code, the easiest way would be to give the NumPy array an object dtype. If you want to keep a fixed-length Unicode array, maybe somehow you could use PyUnicode_FromUnicode where necessary?
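For reference, the object-dtype route looks roughly like this on the NumPy side (a sketch in Python 3, where the elements come out as plain str rather than unicode; obj_arr is just an illustrative name):

```python
import numpy as np

arr = np.array(["array", "of", "unicode"])

# Convert the fixed-width array into an array of Python object references.
obj_arr = arr.astype(object)

print(obj_arr.dtype)     # object
print(type(obj_arr[0]))  # plain Python string, not a NumPy scalar
```

The trade-off is that each element is now a pointer to a Python object, so you lose the compact fixed-width layout shown earlier.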