
I'm new to cython, and I've been having a recurring problem involving encoding unicode inside a numpy array.

Here's an example of the problem:

import numpy as np
cimport numpy as np

cpdef pass_array(np.ndarray[ndim=1,dtype=np.unicode] a):
    pass

cpdef access_unicode_item(np.ndarray a):
    cdef unicode item = a[0]

Example errors:

In [3]: unicode_array = np.array([u"array",u"of",u"unicode"],dtype=np.unicode)

In [4]: pass_array(unicode_array)
ValueError: Does not understand character buffer dtype format string ('w')

In [5]: access_unicode_item(unicode_array)
TypeError: Expected unicode, got numpy.unicode_

The problem seems to be that the values are not real unicode, but instead numpy.unicode_. Is there a way to encode the values in the array as proper unicode (so that I can type individual items for use in cython code)?

  • If you want to use Python unicode objects in your Cython code, the easiest way would be to give the Numpy array an object dtype. If you want to keep a fixed-length Unicode array, maybe somehow you could use PyUnicode_FromUnicode where necessary? Commented Mar 1, 2016 at 14:12
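The object-dtype workaround the comment describes can be sketched in a few lines: with dtype=object the array holds references to ordinary Python string objects, so indexing returns the native string type directly (shown here in Python 3, where that type is str; in Python 2 it would be unicode).

```python
import numpy as np

# Minimal sketch of the object-dtype workaround: the array stores
# references to ordinary Python string objects, so indexing returns
# a plain str rather than a NumPy string scalar.
obj_arr = np.array([u"array", u"of", u"unicode"], dtype=object)
print(type(obj_arr[0]))  # → <class 'str'>
```

The trade-off is that an object array loses the fixed-width buffer layout, so Cython's typed-buffer syntax no longer applies, but each element is now a first-class Python string.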

1 Answer


In Py2.7

In [375]: arr=np.array([u"array",u"of",u"unicode"],dtype=np.unicode)

In [376]: arr
Out[376]: 
array([u'array', u'of', u'unicode'], 
      dtype='<U7')

In [377]: arr.dtype
Out[377]: dtype('<U7')

In [378]: type(arr[0])
Out[378]: numpy.unicode_

In [379]: type(arr[0].item())
Out[379]: unicode

In general, arr[0] returns an element of arr as a numpy scalar subclass. In this case np.unicode_ is a subclass of unicode.

In [384]: isinstance(arr[0],np.unicode_)
Out[384]: True

In [385]: isinstance(arr[0],unicode)
Out[385]: True
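The session above is Python 2.7. In Python 3 the analogous pair is str and np.str_ (np.unicode_ was an alias for np.str_ and has since been removed in NumPy 2.0), and the same subclass relationship holds:

```python
import numpy as np

# Python 3 analogue of the session above: NumPy's string scalar type
# subclasses the native str type, and .item() unwraps it completely.
arr = np.array([u"array", u"of", u"unicode"])
elem = arr[0]
assert isinstance(elem, np.str_)   # NumPy string scalar...
assert isinstance(elem, str)       # ...which is also a plain str
assert type(elem.item()) is str    # .item() yields the native str
```

So in either Python version, calling .item() on the element is the direct way to get a native string out of the array.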

I think you'd encounter the same sort of issues between np.int32 and int. But I haven't worked enough with cython to be sure.


Where have you seen cython code that specifies a string (unicode or byte) dtype?

http://docs.cython.org/src/tutorial/numpy.html has expressions like

# We now need to fix a datatype for our arrays. I've used the variable
# DTYPE for this, which is assigned to the usual NumPy runtime
# type info object.
DTYPE = np.int
# "ctypedef" assigns a corresponding compile-time type to DTYPE_t. For
# every type in the numpy module there's a corresponding compile-time
# type with a _t-suffix.
ctypedef np.int_t DTYPE_t
....
def naive_convolve(np.ndarray[DTYPE_t, ndim=2] f):

The purpose of the [] part is to improve indexing efficiency.

What we need to do then is to type the contents of the ndarray objects. We do this with a special “buffer” syntax which must be told the datatype (first argument) and number of dimensions (“ndim” keyword-only argument, if not provided then one-dimensional is assumed).

I don't think np.unicode will help because it doesn't specify character length. The full string dtype has to include the number of characters, e.g. <U7 in my example.
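This point can be checked directly: the string dtype carries the (maximum) character count, and its itemsize reflects NumPy's 4-bytes-per-character UCS-4 storage.

```python
import numpy as np

# The '<U7' dtype records the maximum number of characters; itemsize
# is characters * 4 bytes, since NumPy stores unicode as UCS-4.
dt = np.dtype('<U7')
print(dt.itemsize)   # → 28  (7 chars * 4 bytes)

# The length is inferred from the longest string at creation time:
arr = np.array([u"array", u"of", u"unicode"])
print(arr.dtype)     # → <U7
```

This is why a length-free spelling like np.unicode can't serve as a buffer dtype: two unicode arrays with different maximum lengths are genuinely different fixed-width layouts.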

We need to find working examples which pass string arrays - either in the cython documentation or other SO cython questions.


For some operations, you could treat the unicode array as an array of int32.

In [397]: arr.nbytes
Out[397]: 84

3 strings × 7 chars/string × 4 bytes/char = 84 bytes

In [398]: arr.view(np.int32).reshape(-1,7)
Out[398]: 
array([[ 97, 114, 114,  97, 121,   0,   0],
       [111, 102,   0,   0,   0,   0,   0],
       [117, 110, 105,  99, 111, 100, 101]])
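An executable version of that view trick, as a hedged sketch: because each UCS-4 character occupies 4 bytes, a '<U7' array reinterprets cleanly as rows of 7 int32 code points, zero-padded past the end of each string.

```python
import numpy as np

# Reinterpret the unicode buffer as int32 code points. Each '<U7'
# element is 28 bytes, so the view yields 7 int32 values per string,
# with unused positions padded with 0.
arr = np.array([u"array", u"of", u"unicode"])
codes = arr.view(np.int32).reshape(-1, 7)
print(codes.shape)        # → (3, 7)
assert codes[0, 0] == ord('a')   # first code point of 'array'
assert codes[1, 2] == 0          # zero padding after 'of'
```

Operations on the int32 view (comparisons, searches for a code point, etc.) then run at numeric-array speed, which is the kind of bypass the next paragraph describes.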

Cython gives you the greatest speed improvement when you can bypass Python functions and methods. That would include bypassing much of the Python string and unicode functionality.
