146

I have a string that will be sent over a network, and I need to know how many bytes it takes up.

sys.getsizeof(string_name) returns extra bytes. For example, sys.getsizeof("a") returns 22, while one character should take only 1 byte. Is there another way to find the size?

5
  • 1
    What version of Python are you using? Commented Jun 6, 2015 at 19:26
  • 11
    That's because the string "a" is an object in python that contains extra information. Commented Jun 6, 2015 at 19:27
  • 1
    @Some Developer is there a way to get bytes for the string only, without extra information of the complete object? Commented Jun 6, 2015 at 19:29
  • @squiguy My python version is 2.7.9 Commented Jun 6, 2015 at 19:30
  • 2
    Does this answer your question? How can I determine the byte length of a utf-8 encoded string in Python? Commented Nov 1, 2020 at 18:42

3 Answers

248

If you want the number of bytes in a string, this function should do it for you pretty solidly.

def utf8len(s):
    return len(s.encode('utf-8'))

You got a larger number because sys.getsizeof measures the whole string object, not just its characters. Strings are actual objects in Python, so each one carries extra bookkeeping information.

Note that the 'encode' method I call on the 's' object (which is a string) is not part of that overhead: methods live on the str type and are shared by every string. The extra bytes come from the object header (reference count, type pointer, length, cached hash and so on), which is why even a one-character string reports a size well above 1 :).
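To make the difference concrete, here is a minimal sketch that prints both numbers side by side. It assumes CPython 3; the helper name object_overhead is made up for illustration, and the exact sys.getsizeof values vary between Python versions, so none are shown.

```python
import sys

def utf8len(s):
    """Number of bytes s occupies once encoded as UTF-8."""
    return len(s.encode("utf-8"))

def object_overhead(s):
    """Bytes that sys.getsizeof reports beyond the UTF-8 payload.

    Hypothetical helper for illustration; the result is CPython-specific.
    """
    return sys.getsizeof(s) - utf8len(s)

for text in ("", "a", "hello"):
    # Columns: string, bytes on the wire, in-memory object size, overhead
    print(repr(text), utf8len(text), sys.getsizeof(text), object_overhead(text))
```

For ASCII-only strings the overhead column should stay constant, which shows that the extra bytes belong to the object and not to the characters.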


2 Comments

No worries. Sometimes simple answers make their way into seemingly weird problems haha.
The reason for encoding is that, in Python 3, some single-character strings require multiple bytes to be represented. For instance, len('你'.encode('utf-8')) is 3.
32

You can use len(s.encode()), but there's a caveat.

The size in bytes of a string depends on the encoding you choose (by default "utf-8").

For some multi-byte encodings (e.g., UTF-16), str.encode adds a byte-order mark (BOM) at the start: a short sequence of special bytes that tells the reader which byte order (endianness) was used. So the length you get is actually len(BOM) + len(encoded_word).

If you don't want to count the BOM bytes, use either the little-endian version of the encoding (adding the suffix "-le") or the big-endian version (adding the suffix "-be").

>>> len('ciao'.encode('utf-16'))
10
>>> len('ciao'.encode('utf-16-le'))
8
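The same comparison can be run across several encodings at once. This is a small sketch assuming Python 3; encoded_sizes is a hypothetical helper name, while the encoding names are standard codec names.

```python
def encoded_sizes(text, encodings=("utf-8", "utf-8-sig", "utf-16",
                                   "utf-16-le", "utf-16-be", "utf-32")):
    """Map each encoding name to the byte length of text in that encoding."""
    return {name: len(text.encode(name)) for name in encodings}

sizes = encoded_sizes("ciao")
for name, size in sizes.items():
    print(name, size)

# "utf-16" carries a 2-byte BOM that the explicit "-le"/"-be" variants omit
print(sizes["utf-16"] - sizes["utf-16-le"])  # 2
```

The variants with an explicit byte order never add a BOM, so they are the ones to use when only the payload size matters.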


2

The question is old, but no answer so far has correctly explained the size of a str in memory, so let me explain.

The most interesting thing about str in Python is that it has an adaptive representation that depends on the characters present: it can be Latin-1 (1 byte per char), UCS-2 (2 bytes per char) or UCS-4 (4 bytes per char).

For ASCII strings you will find that each char adds 1 byte to the size of the empty string (this is Python 3.9; in Python 3.14 the empty string is more compact, around 41 bytes):

>>> import sys
>>> sys.getsizeof("")
49
>>> sys.getsizeof("a")
50
>>> sys.getsizeof("ab")
51

where 49 is the size of the initial C structure inside.

But for 2-byte characters even the initial C structure has a different size (74 bytes in Python 3.9; in Python 3.14 it is around 59 bytes):

>>> sys.getsizeof("你好世界!"[:1]) # Hello, World! in Chinese
76
>>> sys.getsizeof("你好世界!"[:2])
78
>>> sys.getsizeof("你好世界!"[:3])
80
>>> sys.getsizeof("你好世界!"[:4])
82
>>> sys.getsizeof("你好世界!"[:5])
84

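Even the ASCII exclamation point ! adds 2 bytes here, because the maximum character width in the string is 2.

The same happens with a fixed increase of 4 bytes per character for strings that contain more complicated symbols, that is, characters outside the Basic Multilingual Plane such as emoji.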
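The per-character widths described above can be checked without hard-coding any version-specific sizes. The sketch below assumes CPython 3.3 or later; bytes_per_char is a hypothetical helper that compares two freshly built strings which differ by exactly one character.

```python
import sys

def bytes_per_char(ch):
    """Storage CPython uses per character in a string made only of ch.

    Compares a 3-character and a 2-character string so that the fixed
    header size cancels out. CPython-specific, illustration only.
    """
    return sys.getsizeof(ch * 3) - sys.getsizeof(ch * 2)

samples = {
    "ASCII": "a",               # fits the 1-byte representation
    "BMP": "\u4f60",            # the character 你, needs UCS-2
    "non-BMP": "\U0001F600",    # an emoji, needs UCS-4
}
for label, ch in samples.items():
    print(label, bytes_per_char(ch))
```

Measuring a difference instead of an absolute size keeps the check valid across Python versions, since only the header size has changed between releases.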

The current adaptive representation was introduced in Python 3.3 (PEP 393) and in general has not changed since (3.14 is coming this year, with few changes to str), apart from some optimizations to the initial C structure size and to cached strings (so-called interned strings).

There is a long-term plan to move str to a UTF-8 representation: https://github.com/faster-cpython/ideas/issues/684 but I guess it will not happen before 3.16.

If you encode to UTF-8 today, that only creates a new bytes object, which is not connected to the original string in memory.

P.S. If you send the string over a network, encoding to bytes is necessary as a kind of serialization. In that case .encode("utf-8") is a good choice, since UTF-8 doesn't depend on byte order. But you have to call .decode("utf-8") on the other side.
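As an illustration of that round trip, here is a minimal sketch that prefixes the UTF-8 payload with its byte length so the receiver knows how much to read. The length-prefix framing and the names pack_message and unpack_message are assumptions made for this example, not something the question requires.

```python
import struct

def pack_message(text):
    """Encode text as UTF-8 and prepend its byte length (4 bytes, network order)."""
    payload = text.encode("utf-8")
    return struct.pack("!I", len(payload)) + payload

def unpack_message(frame):
    """Read the 4-byte length prefix, then decode that many payload bytes."""
    (size,) = struct.unpack("!I", frame[:4])
    return frame[4:4 + size].decode("utf-8")

frame = pack_message("ciao")
print(len(frame))             # 4-byte prefix + 4-byte payload = 8
print(unpack_message(frame))  # ciao
```

Note that the prefix stores the number of encoded bytes, not len(text), because the two differ as soon as the string contains non-ASCII characters.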

