Python: Get size of string in bytes

Question

I have a string that will be sent over a network, and I need to check how many bytes it takes up.

sys.getsizeof(string_name) returns extra bytes. For example, sys.getsizeof("a") returns 22, while I would expect a single character to take only 1 byte. Is there some other method to find this?

That's because the string "a" is an object in Python that contains extra information.
– Kris, Commented Jun 6, 2015 at 19:27
@Some Developer is there a way to get the bytes for the string only, without the extra information of the complete object?
– Iffat Fatima, Commented Jun 6, 2015 at 19:29
Does this answer your question? How can I determine the byte length of a utf-8 encoded string in Python?
– maxschlepzig, Commented Nov 1, 2020 at 18:42

Peter Mortensen · Accepted Answer · 2025-05-05 22:29:08Z

248

If you want the number of bytes in a string, this function should do it for you pretty solidly.

def utf8len(s):
    return len(s.encode('utf-8'))

The reason you got weird numbers is that sys.getsizeof measures the whole string object, not just its characters. Strings are actual objects in Python, so each one carries extra bookkeeping alongside the text.

In CPython that bookkeeping includes things like the reference count, a pointer to the str type, the length, and the cached hash. Methods such as 'encode', which my solution calls on the 's' object (which is a string), are not part of that count: they are stored once on the str type and shared by every string. Hence the higher than normal byte count :).
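To make the difference concrete, here is a small sketch that prints the character count, the UTF-8 byte count, and the sys.getsizeof result side by side. The sample strings are arbitrary, and the sys.getsizeof values depend on the Python version and build, so no specific numbers are claimed for that column.

```python
import sys


def utf8len(s: str) -> int:
    """Return the number of bytes s occupies once encoded as UTF-8."""
    return len(s.encode("utf-8"))


# Arbitrary sample strings; "\u00e9" is the single code point for e-acute.
samples = ["a", "ciao", "\u00e9", "你", "你好"]

for text in samples:
    # len() counts code points, utf8len() counts encoded bytes,
    # sys.getsizeof() counts the whole Python object, header included.
    print(repr(text), len(text), utf8len(text), sys.getsizeof(text))
```

The second and third columns only agree for pure ASCII text, which is why len(s) alone is not a safe byte count.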

edited May 5 at 22:29

Peter Mortensen


answered Jun 6, 2015 at 19:28

Kris


2 Comments

Kris Over a year ago

No worries. Sometimes simple answers make their way into seemingly weird problems haha.

Brad Solomon Over a year ago

The reason for encoding is that, in Python 3, some single-character strings require multiple bytes to be represented. For instance, len('你'.encode('utf-8')) returns 3.

Peter Mortensen · Accepted Answer · 2025-05-05 22:31:58Z

You can use len(s.encode()), but there's a caveat.

The size in bytes of a string depends on the encoding you choose (by default "utf-8").

For some multi-byte encodings (e.g., UTF-16), str.encode adds a byte-order mark (BOM) at the start: a short sequence of special bytes that tells the reader which byte order (endianness) was used. So the length you get is actually len(BOM) + len(encoded_word).

If you don't want to count the BOM bytes, you can use either the little-endian version of the encoding (adding the suffix "-le") or the big-endian version (adding the suffix "-be").

>>> len('ciao'.encode('utf-16'))
10
>>> len('ciao'.encode('utf-16-le'))
8
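As a rough illustration of the same effect across encodings, the sketch below compares encoded lengths with and without a BOM. It uses only the standard codecs module; the choice of encodings to compare is mine, not part of the answer above.

```python
import codecs

word = "ciao"

# Encodings without an explicit byte order prepend a BOM.
with_bom = len(word.encode("utf-16"))        # BOM + 4 characters * 2 bytes
without_bom = len(word.encode("utf-16-le"))  # 4 characters * 2 bytes
bom_size = len(codecs.BOM_UTF16)             # the BOM in native byte order

print(with_bom, without_bom, bom_size)

# UTF-32 behaves the same way with a longer BOM, and "utf-8-sig" is the
# UTF-8 variant that writes a signature at the start.
for name in ("utf-32", "utf-32-be", "utf-8", "utf-8-sig"):
    print(name, len(word.encode(name)))
```

Plain "utf-8" never adds a BOM, so len(s.encode()) with the default encoding counts only the text itself.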

Vasily Ryabov · Accepted Answer · 2025-06-14 18:18:39Z

The question is old, but there was no correct answer about the size of a str in memory, so let me explain.

The most interesting thing about str in Python is that it has an adaptive representation that depends on the characters present: it can be Latin-1 (1 byte per char), UCS-2 (2 bytes per char) or UCS-4 (4 bytes per char).

For ASCII strings you may find that each char adds 1 byte to the size of the empty string (this is Python 3.9; in Python 3.14 it is more compact, about 41 bytes):

>>> import sys
>>> sys.getsizeof("")
49
>>> sys.getsizeof("a")
50
>>> sys.getsizeof("ab")
51

where 49 is the size of the initial C structure inside.

But for 2-byte characters even the initial C structure has a different size (74 bytes in Python 3.9; about 59 bytes in Python 3.14):

>>> sys.getsizeof("你好世界！"[:1]) # Hello, World! in Chinese
76
>>> sys.getsizeof("你好世界！"[:2])
78
>>> sys.getsizeof("你好世界！"[:3])
80
>>> sys.getsizeof("你好世界！"[:4])
82
>>> sys.getsizeof("你好世界！"[:5])
84

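Every character adds 2 bytes here (the ！ in this example is the full-width form), and even an ASCII exclamation point ! would add 2 bytes in such a string, because the max char width for the whole string is 2.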

The same fixed increase, this time by 4 bytes per char, happens for strings containing characters outside the Basic Multilingual Plane, such as emoji.
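The per-character widths can be checked without relying on any particular header size. The sketch below is CPython-specific and assumes the adaptive representation described in this answer; bytes_per_char is a hypothetical helper name, not a standard function.

```python
import sys


def bytes_per_char(ch: str) -> int:
    """Estimate the storage cost of one more ch in a string made of ch (CPython only)."""
    # Comparing two lengths cancels out the fixed object header.
    return sys.getsizeof(ch * 11) - sys.getsizeof(ch * 10)


print(bytes_per_char("a"))           # ASCII
print(bytes_per_char("你"))          # beyond Latin-1, inside the BMP
print(bytes_per_char("\U0001F600"))  # outside the BMP (an emoji)

# An ASCII character appended to a 2-byte-wide string costs 2 bytes too,
# because the widest character sets the width for the whole string.
mixed_cost = sys.getsizeof("你好" + "!") - sys.getsizeof("你好")
print(mixed_cost)
```

Measuring differences rather than absolute sizes keeps the result stable across Python versions, since only the header size has changed between releases.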

The current adaptive representation was introduced in Python 3.3 (PEP 393), and in general it has not changed since (3.14 is coming this year, with not many changes to str), except for some optimizations to the initial C structure size and cached strings (so-called interned strings).

There is a long-term plan to move str to a UTF-8 representation: https://github.com/faster-cpython/ideas/issues/684 but I guess it will not happen earlier than 3.16.

If you encode to UTF-8 today, that only creates a new bytes object, which is not connected with the original string in memory.

P.S. If you send the string over a network, encoding it to bytes is necessary as a kind of serialization. In that case .encode("utf-8") is a good choice, since UTF-8 doesn't depend on byte order. But you have to call .decode("utf-8") on the other side.
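A minimal sketch of that round trip follows. The length-prefix framing and the names pack_message and unpack_message are my own assumptions for illustration; they are not part of the question or of any particular protocol. The point is that the size to announce is the length of the encoded bytes, not len() of the string.

```python
import struct


def pack_message(text: str) -> bytes:
    """Encode text as UTF-8 and prefix it with its byte length (4 bytes, big-endian)."""
    payload = text.encode("utf-8")
    # The prefix carries the number of encoded bytes, not the number of characters.
    return struct.pack("!I", len(payload)) + payload


def unpack_message(data: bytes) -> str:
    """Read the length prefix, then decode exactly that many bytes."""
    (size,) = struct.unpack("!I", data[:4])
    return data[4:4 + size].decode("utf-8")


message = "ciao, 你好"
wire = pack_message(message)

# Character count versus payload size in bytes.
print(len(message), len(wire) - 4)
print(unpack_message(wire) == message)
```

With a real socket the receiver would first read the 4-byte prefix and then keep reading until it has that many payload bytes; that part is omitted here.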
