
Reading the manual page (man 2 read) on my Debian system, I see the following note:

NOTES
[...] On Linux, read() (and similar system calls) will transfer at most 0x7ffff000 (2,147,479,552) bytes, returning the number of bytes actually transferred. (This is true on both 32-bit and 64-bit systems.)

I am guessing something is being done with the remaining 0xfff (4095) bytes, and that this number was chosen for a specific implementation reason.

So my question is: Why is it 0x7ffff000 and not simply (naively) 0x7fffffff?

  • Transfers for data that is not block-aligned (let alone word-aligned) will have an overhead during transfers to deal with part-blocks. Also, media that are capable of DMA (Direct Memory Access, which bypasses CPU usage) are more likely to be 32-bit limited internally. Commented Apr 29 at 8:42
  • @Paul_Pedant but read() already needs to be able to read an arbitrary amount of bytes, at an arbitrary file position, so even if 0x7fffffff = 2147483647 is a prime, it shouldn't matter. Commented Apr 29 at 9:30
  • 5
    FWIW, it's down to a #define MAX_RW_COUNT (INT_MAX & PAGE_MASK) in the code (in include/linux/fs.h). git log/blame might allow you to get back to the rationale Commented Apr 29 at 12:30
  • 2
    See github.com/torvalds/linux/commit/…. "We want to protect lower layers from (the sadly all too common) overflow conditions, but prefer to do so by chopping the requests up, rather than just refusing them outright" Commented Apr 29 at 12:36
  • @ilkkachu Agree that read() can deal with any file-pos and length, but a block device is just that: any data that is not precisely block-aligned would have to be cached and trimmed somewhere. The first line of the man page says "attempts to read up to count bytes", so the existing limit is both conformant, and avoids those overheads. Commented Apr 29 at 20:02

3 Answers


(Thanks to Stéphane Chazelas in comments for looking up the commits. There is also some information in the Stack Overflow question "Why can't linux write more than 2147479552 bytes?")

The Linux kernel commit that introduced this limit is e28cc71572da3 from 2006. Even before that, many filesystem drivers limited reads to INT_MAX, but returned an error instead of a partially completed I/O request.

We want to protect lower layers from (the sadly all too common) overflow conditions, but prefer to do so by chopping the requests up, rather than just refusing them outright.

The limit itself is defined as #define MAX_RW_COUNT (INT_MAX & PAGE_CACHE_MASK), which depends on the system page size. It is 0x7ffff000 for 4 kB virtual memory page size.

I think the reason is to protect against this very common idiom overflowing:

int ret = read(....);

That is present in a lot of old code, which, when compiled for 64-bit platforms, might end up passing a larger count to read. For example, if you just check the file size and malloc() a buffer for it, the read would complete fine, but the return value would overflow. New code should use ssize_t for the return value, which avoids the overflow, but that type took years to become commonly used after POSIX standardized it.

But why not 0x7fffffff instead? The answer is in the mailing list post "Limit sendfile() to 2^31-PAGE_CACHE_SIZE bytes without error":

I set the limit to 2^31-PAGE_CACHE_SIZE so that a transfer that starts at the beginning of the file will continue to be page-aligned.

It is expected that when the application receives a return value indicating a partially completed read/write call, it will continue with further system calls until the whole file is transferred. But if the first request were limited to INT_MAX, every subsequent request would be misaligned and would perform worse when accessing the IO caches.


I can only guess, but in operating-system interfaces, stability is important. read returns an ssize_t, which might be 64 bits wide (allowing sizes up to 2^63 - 1, because, well, the first s is for signed). But it hasn't always been that way.

Functionally, for most of the first 25 years of C and POSIX history, that was just the same as int. In fact, it was int until relatively recently: in System V Release 4 (Programmer's Reference Manual, page 307) and in BSD 4.3 (UNIX Programmer's Reference Manual).

So, programmers were right to assume this call could return at most INT_MAX bytes. So, on your massive 64-bit machine, allocate a 2 GB buffer, int howmany = read(fd, buffer, (unsigned int) INT_MAX); done. Due to the API, read can never read more than INT_MAX, the largest signed 32-bit integer (minus one page; after all, the process that calls read needs at least one page of executable memory).

Now imagine what happens if your compiler switches int to a 64-bit type (this is not the standard for GCC on x86_64 or aarch64, but it can and has happened on UNIX systems). Suddenly, you're reading not up to INT_MAX bytes, but up to INT64_MAX bytes, into an INT_MAX-sized buffer.

Yeah, that's a buffer overflow due to the implementation of the syscall actually "growing" with its API. It makes kittens sad.

So, instead, Linux decided that, meh, let's not do that; whoever needs to read more than 2 GB can call read multiple times.

This has one very important "downside" from the user perspective: POSIX read (and readv, which, as I tested, will not read more than read at once on Linux; bummer) is atomic, meaning that even if other threads or processes work on the same file descriptor, you can be sure the offset will not be shifted in the middle of reading.

Maybe that's also an OS design decision, however: if you're doing a file operation larger than 2 GB, you had better devise an application-level mutually exclusive file-access scheme rather than expect the kernel to keep some handle consistent for such a large, long-duration operation.

  • 4
    which might be a reason the limit is <= 2 GB (assuming 64-bit expansion was a consideration when that number was set), but why is it 0x7ffff000 (2 GB - 4096), and not 0x7ffffffff (2 GB - 1, the largest signed 32-bit int)? Commented Apr 29 at 11:08
  • 2
    @TomYan, I don't see any of that mentioned in this answer, hence the question. Commented Apr 29 at 14:20
  • 4
    +1 for "it makes kittens sad". This should be a major criteria in accepting API changes. Commented Apr 29 at 18:41
  • 2
    (unsigned int) -1 is UINT_MAX, not INT_MAX, so that code was always broken. Regardless of whether the length arg to read is signed or unsigned, the return value is signed, and read is obligated to return a number between -1 and SSIZE_T_MAX. Of course, that doesn't stop people doing weird stuff in the belief that they're making the world safer... Commented Apr 29 at 23:33
  • 1
    @ilkkachu oof, forgot to answer the actual question. Sorry. the -1 page probably just comes from someone going "hm, no, we don't want to overwrite a full address space completely, because then, where does the code that called read live?" Commented Apr 30 at 13:57

I'm guessing there's some computation which needs to round the number of bytes up to a whole number of pages, and 0x7ffff000 bytes is exactly 0x7ffff 4KiB pages.

0x7fffffff would be rounded up to 0x80000000, which when stored in a 32-bit signed int would wrap around to -0x80000000 (-2147483648).

Linux uses int internally, which explains this result when int or ssize_t (or both) are 32-bit, and the limit applies regardless of whether it's a 32-bit or 64-bit system.

  • Unfortunately I don't think anything forces a page to be 4KiB, even if that's a very common implementation. Commented Apr 29 at 18:50
  • @MarkRansom Nothing forces it, just like nothing forces a filesystem block to be 4KiB even though that’s a very common implementation. But there’s a lot of stuff that blindly assumes a 4 KiB page or block size. Commented Apr 29 at 20:56
  • 6
    4KiB is the default page size for the MMU that's embedded in a modern x86 CPU, and therefore it's the most common page size. On other architectures you may get different numbers, and indeed the x86 MMU also supports 2MiB and 1GiB page sizes. Because fractional pages waste physical memory, mmap generally gives you page-aligned memory, which in turn means that software sets the DMA start address to a page boundary. Commented Apr 29 at 23:25
