Compare unicode string with byte string

Question

Version: Python 2.7

I'm reading values from a Unicode CSV file and looping through to find a particular product code - a string. The variable p is from the CSV file.

sku = '1450'             # sku can contain spaces.
print p, '|', sku
print p == '1450'
print binascii.hexlify(p), '|', binascii.hexlify(sku)
print binascii.hexlify(p) == binascii.hexlify(sku)
print 'repr(p): ', repr(p)

which results in

1450 | 1450
False
003100340035003000 | 31343530
False
repr(p): '\x001\x004\x005\x000\x00'

Q1. What is a future-proof way (for version 3, etc.) to successfully compare?
Q2. The Unicode is little-endian. Why have I got 00 at both ends of the Unicode hex?

Note: attempts at converting to Unicode - u'1450' - don't seem to have any effect on the output.

Thanks.

Why do you mention Unicode when all you have shown us is ASCII digits? And when you say a Unicode file, you should also say WHICH Unicode encoding.
– Walter Tross, Commented Dec 19, 2020 at 18:39
@WalterTross Python character strings are Unicode, even if all the characters are ASCII.
– Mark Ransom, Commented Dec 19, 2020 at 18:41
repr shows you the actual content of the string, as opposed to the user-friendly version you get from str (str is called when you print). In this case, repr shows us that there are null bytes (\x00) between each digit, and this is a strong indication of a UTF-16 encoding, as Walter Tross has observed (in a now deleted comment).
– snakecharmerb, Commented Dec 19, 2020 at 18:48
Considering your concerns about a future-proof approach, why are you programming in Python 2 at all?
– Ulrich Eckhardt, Commented Dec 19, 2020 at 19:36
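
On Q2, here is a minimal sketch (Python 3) of one way the stray 00 bytes can arise. It assumes the file was handled as raw bytes and split on the single byte "," before decoding, which the question does not confirm. The row A,1450,B is made up for illustration. In UTF-16LE a comma is the byte pair 2C 00, so splitting on 2C leaves the comma's high byte at the front of the next field, while the high byte of the last digit stays at the end.

```python
import binascii

# Hypothetical row standing in for one line of the CSV file.
raw = "A,1450,B".encode("utf-16-le")

# Splitting the undecoded bytes on b"," cuts each 2-byte comma in half.
p = raw.split(b",")[1]
print(repr(p))              # b'\x001\x004\x005\x000\x00'
print(binascii.hexlify(p))  # b'003100340035003000'

# Decoding first and splitting the text afterwards gives a clean field.
field = raw.decode("utf-16-le").split(",")[1]
print(field == "1450")      # True
```

Decoding before splitting is what makes the comparison with '1450' succeed.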

ti7 · Accepted Answer · 2020-12-19 20:06:41Z

This is probably much easier in Python 3, which treats text (str) and bytes as distinct types and lets you specify the encoding when opening a file.

Try opening your file with the encoding specified and pass the file-like object to the csv library. See the Examples section of the csv documentation.

import csv

# newline='' lets the csv module handle line endings itself
with open('some.csv', newline='', encoding='UTF-16LE') as fh:
    reader = csv.reader(fh)
    for row in reader:  # reader is iterable
        print(row)  # work with row
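
To tie this back to Q1, here is a rough, self-contained sketch (Python 3) of the comparison once the file is decoded on open. The file path, the sample rows, and the assumption that the product code sits in the first column are all invented for illustration.

```python
import csv
import os
import tempfile

# Hypothetical sample file standing in for the real export.
path = os.path.join(tempfile.mkdtemp(), "some.csv")
with open(path, "w", newline="", encoding="utf-16-le") as fh:
    csv.writer(fh).writerows([["1449", "Widget"], ["1450", "Gadget"]])

sku = "1450"
matches = []
with open(path, newline="", encoding="utf-16-le") as fh:
    for row in csv.reader(fh):
        # row[0] is already a decoded str, so == compares text with text
        if row and row[0] == sku:
            matches.append(row)

print(matches)  # [['1450', 'Gadget']]
```

Because both sides of == are str, no hexlify or manual byte handling is needed.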

Update after some comments: the data is actually being read from an FTP server.
Switching the transfer from a string read to FTP binary mode and reading the result through an io.TextIOWrapper() may work out.

Out now with even more context managers!:

import io
import csv
from ftplib import FTP

with FTP("ftp.example.org") as ftp:
    ftp.login()  # assumption: anonymous login; pass user/passwd if the server needs them
    with io.BytesIO() as binary_buffer:
        # read all of products.csv into a binary buffer
        ftp.retrbinary("RETR products.csv", binary_buffer.write)
        binary_buffer.seek(0)  # rewind file pointer
        # create a text wrapper to associate an encoding with the file-like for reading
        with io.TextIOWrapper(binary_buffer, encoding="UTF-16LE", newline="") as csv_string:
            for row in csv.reader(csv_string):
                print(row)  # work with row
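
The decoding half of this can be tried without a server. In the sketch below (Python 3) the payload bytes are made up and simply written into the buffer by hand, standing in for what retrbinary would deliver.

```python
import csv
import io

# Hypothetical payload standing in for the bytes retrbinary would deliver.
payload = "sku,name\r\n1450,Gadget\r\n".encode("utf-16-le")

binary_buffer = io.BytesIO()
binary_buffer.write(payload)
binary_buffer.seek(0)  # rewind before reading

# The wrapper decodes the bytes lazily as csv.reader pulls lines from it.
with io.TextIOWrapper(binary_buffer, encoding="utf-16-le", newline="") as csv_string:
    rows = list(csv.reader(csv_string))

print(rows)  # [['sku', 'name'], ['1450', 'Gadget']]
```

Note that leaving the TextIOWrapper block also closes the underlying BytesIO.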

edited Dec 19, 2020 at 20:06

answered Dec 19, 2020 at 18:42

11 Comments

ti7 Over a year ago

@WalterTross I think so too +LE; updated!

Mark Ransom Over a year ago

How do you know it's little-endian and not big-endian? With every other byte being zero there's no way to know, especially when the leading/trailing byte of the previous/next string is also included.

ti7 Over a year ago

@MarkRansom stated in the Question!

Transistor Over a year ago

@MarkRansom: I'm reading a CSV file by FTP from an industrial HMI. The HMI user manual states that the data is little-endian.

Transistor Over a year ago

Thank you. That gave me enough of a lead to figure it out. (There were some more details that I didn't want to trouble you with.) I got an AttributeError on the with FTP line, as FTP in 2.7 doesn't have an __exit__ method. See here for the answer that helped me resolve this.
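
One common workaround for an object that has close() but no context manager support is contextlib.closing, which also exists in Python 2.7. The comment above does not say which fix was actually used, so treat this as one possible approach only. FakeFTP below is a hypothetical stand-in for ftplib.FTP so that the sketch runs offline.

```python
from contextlib import closing


class FakeFTP(object):
    """Hypothetical stand-in for ftplib.FTP; serves one UTF-16LE row."""

    def __init__(self):
        self.closed = False

    def retrbinary(self, cmd, callback):
        callback(u"1450,Gadget\r\n".encode("utf-16-le"))

    def close(self):
        self.closed = True


chunks = []
# closing() calls ftp.close() on exit, so no __exit__ method is required.
with closing(FakeFTP()) as ftp:
    ftp.retrbinary("RETR products.csv", chunks.append)

text = b"".join(chunks).decode("utf-16-le")
print(text.split(u",")[0] == u"1450")  # True
```

With the real library, closing(FTP("ftp.example.org")) would take the place of closing(FakeFTP()).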
