The Model:

class ItemType(models.Model):
  name = models.CharField(max_length=100)
  def __unicode__(self):
    logger.debug("1. Item Type %s created" % self.name)
    return self.name 

The code:

  (...)
    name = re.search(r"Type:(.*?)",text)
    itemtype = ItemType.objects.create(name = name.group(1), defaults={'name':name.group(1)})
    logger.debug("2. Item Type %s created" % name.group(1))
    logger.debug("4. Item Type %s created" % itemtype.name)
    logger.debug("3. Item Type %s created" % itemtype)

And the result is unexpected (to me of course):

The first logger.debug prints "Item Type ąęńłśóć created" as expected, but the second raises an error:

DjangoUnicodeDecodeError: 'ascii' codec can't decode byte  in position : 
ordinal not in range(128). 
You passed in <ItemType: [Bad Unicode data]> (<class 'aaa.models.ItemType'>)

Why is there an error, and how can I fix it?

(text is an HTML response with UTF-8 encoding.)

UPDATE

I added debug logging to the model, and the output is:

2014-10-06 09:38:53,342 DEBUG views 2. Item Type ąęćńółśż created
2014-10-06 09:38:53,342 DEBUG views 4. Item Type ąęćńółśż created
2014-10-06 09:38:53,344 DEBUG models 1. Item Type ąęćńółśż created
2014-10-06 09:38:53,358 DEBUG models 1. Item Type ąęćńółśż created

So why can't debug statement 3 print it?

UPDATE 2 The problem is here:

  itemtype = ItemType.objects.create(name = name.group(1), defaults={'name':name.group(1)})

If I change it to

  itemtype = ItemType.objects.create(name = name.group(1), defaults={'name':u'ĄĆĘŃŁÓŚ'})

everything works.

So how do I convert the regex result to unicode? unicode(name.group(1)) doesn't work.

  • Which database are you using? Oracle? Also could you try changing to logger.debug("1. Item Type %s created", self.name). In loggers avoid using '%'. Commented Oct 6, 2014 at 8:28
  • Changed to logger.debug(itemtype); same error. Commented Oct 6, 2014 at 8:54
  • This error occurs when you pass a Unicode string containing non-English characters (Unicode characters beyond 128) to something that expects an ASCII bytestring. The default encoding for a Python bytestring is ASCII, "which handles exactly 128 (English) characters". This is why trying to convert Unicode characters beyond 128 produces the error. see saltycrane.com/blog/2008/11/… Commented Oct 6, 2014 at 9:05
  • Is your Postgres configured to accept unicode? Commented Oct 6, 2014 at 9:29
  • Yes, Postgres is configured properly. But I can't agree that create expects ASCII: if I pass u"żółć" instead of the regex result, the model debug prints the expected value. I think I have to encode the result, but how? encode and unicode don't work. Commented Oct 6, 2014 at 10:32

1 Answer

After two days of fighting with my own shadow I found a solution. It isn't a workaround for this one case but a wholesale change of thinking, and I have to refactor the whole codebase.

  1. My assumption is that EVERY STRING is UNICODE. If one isn't, fix it.

  2. Do not use "%s" or "something"; ALWAYS use u"%s" and u"cośtam".

  3. In every model that has a models.CharField() or another text-oriented field, I override the save() method.

For example:

class ItemType(models.Model):
  name = models.CharField(max_length=100)

  def save(self, *args, **kwargs):
    if isinstance(self.name, str):
      self.name=self.name.decode("utf-8")
    super(ItemType, self).save(*args, **kwargs)

Explanation: if the name somehow ends up holding a str rather than unicode, CONVERT it to unicode.
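The same check could live in one small helper rather than being repeated in every save(). The following is only a sketch under assumptions: ensure_text, TEXT_TYPE and BYTES_TYPE are hypothetical names (not part of Django), and the input is assumed to be UTF-8. It is written so that it runs on Python 2 and on Python 3, where the str/unicode pair became bytes/str.

```python
# Hypothetical helper, not part of Django: normalise a value to text.
TEXT_TYPE = type(u"")    # unicode on Python 2, str on Python 3
BYTES_TYPE = type(b"")   # str on Python 2, bytes on Python 3


def ensure_text(value, encoding="utf-8"):
    """Return value as text, decoding it first if it is a byte string."""
    if isinstance(value, BYTES_TYPE):
        return value.decode(encoding)
    return value


# The UTF-8 bytes of the Polish word used in this answer.
raw = b"\xc5\xbc\xc3\xb3\xc5\x82\xc4\x87"
from_bytes = ensure_text(raw)
from_text = ensure_text(u"\u017c\xf3\u0142\u0107")  # already text: returned unchanged
```

Calling something like ensure_text(name.group(1)) before handing the value to the model would keep byte strings out of the field in the first place.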

How I found this:

I was wondering what type the text in a models.CharField is, and found that if you fill it with unicode, it is unicode; if you fill it with str, it is str. So if you fill it by hand with unicode in one place, and a regex fills it with str in another, the result is unexpected.
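A regex returns the same kind of string it was given, which is presumably how the str got into the field. A minimal sketch, assuming the response body arrives as undecoded UTF-8 bytes; html_bytes and the pattern are made up for illustration:

```python
import re

# Hypothetical UTF-8 response body (an assumption, not the original page).
html_bytes = b"<p>Type:\xc5\xbc\xc3\xb3\xc5\x82\xc4\x87</p>"

# Matching on the raw bytes yields a byte string.
byte_match = re.search(b"Type:(.*?)<", html_bytes)
byte_value = byte_match.group(1)

# Decoding first and then matching yields text.
html_text = html_bytes.decode("utf-8")
text_match = re.search(u"Type:(.*?)<", html_text)
text_value = text_match.group(1)
```

Decoding the response once, before any parsing, means everything extracted from it is already unicode.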

The biggest trap with unicode and str is that diacritics appear to work with both:

>>> text_str = "żółć"
>>> text_uni = u"żółć"
>>> print text_str
żółć
>>> print text_uni
żółć

So you can't see the difference.

But if you inspect the values at the prompt instead of printing them:

>>> text_str
'\xc5\xbc\xc3\xb3\xc5\x82\xc4\x87'
>>> text_uni
u'\u017c\xf3\u0142\u0107'

The difference is glaring.

If there were a setting to change the behaviour of print (and similar functions) to this:

>>> print text_str
'\xc5\xbc\xc3\xb3\xc5\x82\xc4\x87'
>>> print text_uni
żółć

everything would be much easier to debug: if you can see the diacritics it's OK, and if not, it's bad.
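One way to get a similar effect while debugging is to log the type and repr() of a value instead of the value itself. This is a sketch, and describe is a hypothetical helper name:

```python
def describe(value):
    """Return the type name and repr() of value, for use in debug logs."""
    return "%s: %r" % (type(value).__name__, value)


raw = b"\xc5\xbc\xc3\xb3\xc5\x82\xc4\x87"   # UTF-8 bytes
text = raw.decode("utf-8")                  # the same word as text

raw_line = describe(raw)    # type name is "str" on Python 2, "bytes" on Python 3
text_line = describe(text)  # type name is "unicode" on Python 2, "str" on Python 3
```

A call such as logger.debug(describe(itemtype.name)) then shows right away whether a field holds bytes or text.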

Using decode('utf-8') led me to the solution:

>>> text_str
'\xc5\xbc\xc3\xb3\xc5\x82\xc4\x87'
>>> text_str.decode('utf-8')
u'\u017c\xf3\u0142\u0107'
>>> text_uni
u'\u017c\xf3\u0142\u0107'

VOILA!
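This also suggests why unicode(name.group(1)) from the question failed: on Python 2, unicode() called without an encoding decodes with the default codec, which is normally ASCII. The sketch below makes that codec explicit so that it behaves the same on Python 2 and Python 3:

```python
raw = b"\xc5\xbc\xc3\xb3\xc5\x82\xc4\x87"   # UTF-8 bytes, 8 of them

# ASCII only covers byte values 0-127, so this decode fails.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Naming the real encoding succeeds and gives 4 characters.
decoded = raw.decode("utf-8")
```

On Python 2, unicode(name.group(1), "utf-8") is another way to spell the explicit decode.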
