
BPF meets io_uring


By Jonathan Corbet
March 4, 2021
Over the last couple of years, a lot of development effort has gone into two kernel subsystems: BPF and io_uring. The BPF virtual machine allows programs from user space to be safely run within the context of the kernel, while io_uring addresses the longstanding problem of running system calls asynchronously. As the two subsystems expand, it was inevitable that the two would eventually meet; the first encounter happened in mid-February with this patch set from Pavel Begunkov adding the ability to run BPF programs from within io_uring.

The patch set itself is relatively straightforward, adding less than 300 lines of new code. It creates a new BPF program type (BPF_PROG_TYPE_IOURING) for programs that are meant to be run in the io_uring context. Any such programs must first be created with the bpf() system call, then registered with the ring in which they are intended to run using the new IORING_ATTACH_BPF command. Once that has been done, the IORING_OP_BPF operation will cause a program to be run within the ring. The final step in the patch series adds a helper function that BPF programs can use to submit new operations into the ring.

As a proof of concept, the patch series does a good job of showing how BPF programs might be run from an io_uring. This work does not, though, really enable any new capabilities in its current form, which may be part of why there have been no responses to it on the list. There is little value to running a BPF program asynchronously to submit another operation; one could simply submit that operation directly instead. As is acknowledged in the patch set, more infrastructure will be needed before this capability will become useful to users.

The obvious place where BPF can add value is making decisions based on the outcome of previous operations in the ring. Currently, these decisions must be made in user space, which involves potential delays as the relevant process is scheduled and run. Instead, when an operation completes, a BPF program might be able to decide what to do next without ever leaving the kernel. "What to do next" could include submitting more I/O operations, moving on to the next in a series of files to process, or aborting a series of commands if something unexpected happens.
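The decision flow described here — inspect a completion, then either submit another operation, move on to the next file, or abort — can be sketched as a plain-Python simulation. Everything in this sketch (the `FILES` table, `run()`, `make_policy()`) is a hypothetical stand-in invented for illustration; none of it is the real io_uring or BPF interface:

```python
from collections import deque

# Stand-in "filesystem"; a missing key plays the role of a failed read.
FILES = {"a.txt": b"hello", "b.txt": b"", "c.txt": b"world"}

def run(ring, on_complete):
    """Drain the ring; on_complete() plays the BPF program's role,
    deciding what (if anything) to submit next after each completion."""
    results = []
    while ring:
        op = ring.popleft()
        data = FILES.get(op)                 # the "completion result"
        results.append((op, data))
        for nxt in on_complete(op, data):    # decision made without
            ring.append(nxt)                 # returning to "user space"
    return results

def make_policy(pending):
    pending = deque(pending)
    def on_complete(op, data):
        if data is None:                     # unexpected failure:
            return []                        # abort the rest of the chain
        return [pending.popleft()] if pending else []
    return on_complete

log = run(deque(["a.txt"]), make_policy(["b.txt", "missing.txt", "c.txt"]))
# The empty read from b.txt is fine; the chain aborts at missing.txt,
# so c.txt is never submitted.
```

The point of the sketch is only the control flow: the policy sees each result in "kernel" context and the scheduler-round-trip to user space never happens.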

Making that kind of decision requires the ability to run BPF programs in response to other events in the ring. The sequencing mechanisms built into io_uring now would suffice to run a program once a specific operation completes, but that program will not have access to much useful information about how the operation completed. Fixing that will, as Begunkov noted, require a way to pass the results of an operation into a BPF program when it runs. An alternative would be to tie programs directly to submitted operations (rather than making them separate operations, as is done in the patch set) that would simply run at completion time.

With that piece in place, and with the increasing number of system calls supported within io_uring, it will become possible to create complex, I/O-related programs that can run in kernel space for extended periods. Running BPF programs may look like an enhancement to io_uring, but it can also be seen as giving BPF the ability to perform I/O and run a wide range of system calls. It looks like a combination that people might do some surprising things with.

That said, this is not a feature that is likely to be widely used. On its own, io_uring brings a level of complexity that is only justified for workloads that will see a significant performance improvement from asynchronous processing. Adding BPF into the mix will increase the level of complexity significantly, and long sequences of operations and BPF programs could prove challenging to debug. Finally, loading io_uring programs requires either the CAP_BPF or the CAP_SYS_ADMIN capability, which means "root" in most configurations. As long as the current hostility toward unprivileged BPF programs remains, that is unlikely to change; as a result, relatively few programs are likely to use this feature.

Still, the combination of these two subsystems provides an interesting look at where Linux may go in the future. Linux will (probably) never be a unikernel system, but the line between user space and the kernel does appear to be getting increasingly blurry.

Index entries for this article
Kernel: BPF/io_uring
Kernel: io_uring


BPF meets io_uring

Posted Mar 4, 2021 18:43 UTC (Thu) by tux3 (subscriber, #101245) [Link] (19 responses)

As foretold by prophecy...

The kernel/userland barrier just keeps being worked around in increasingly involved ways. Whether it's moving the TCP/IP stack to userspace, DIO, decoupling syscalls from context switches with the ring, or moving policy/control to kernelspace with BPF.

I wonder if this is all converging somewhere coherent, or if we'll just have to pick combinations of specialized speedhax depending on the application domain.

BPF meets io_uring

Posted Mar 5, 2021 11:57 UTC (Fri) by miquels (guest, #59247) [Link] (17 responses)

There's a tongue-in-cheek talk by Gary Bernhardt where he pretends to be in 2035, looking back on how no one uses JavaScript directly anymore, but how it is _the_ VM on which everything runs, even the kernel. He does make a few good points, about how the VM costs 25% of performance, but all the hardware protection on the CPU costs 30% of performance, so you might as well run everything in ring 0 and handle protection in the VM. This was before WebAssembly; now that we have that, the talk might well turn out to have been prophetic. It's here: https://www.destroyallsoftware.com/talks/the-birth-and-de... . And prophetic or not, it is funny :)

BPF meets io_uring

Posted Mar 5, 2021 16:37 UTC (Fri) by thoeme (subscriber, #2871) [Link]

Thanks for the link, it is definitely funny :-)

BPF meets io_uring

Posted Mar 5, 2021 17:57 UTC (Fri) by Sesse (subscriber, #53779) [Link] (15 responses)

I never understood the base assertion; he just says that the cost of separation is 30%? Where on earth does that come from? And why would it be cheaper to do it in software than in hardware?

BPF meets io_uring

Posted Mar 5, 2021 20:15 UTC (Fri) by excors (subscriber, #95769) [Link] (14 responses)

In principle it should be cheaper to implement some protection in software, because a trusted piece of software can guarantee that an untrusted piece of software will meet certain conditions (e.g. that it will never access an array out of bounds), either by construction (like a JS JIT compiler will never emit machine code that attempts out-of-bounds accesses) or by static analysis (like with a verifier for proof-carrying code). Once you have those guarantees, you can safely run the untrusted code with no further protection.

It's not practical for hardware to deduce those static guarantees because it's at a very different level of abstraction. All it can do by itself is dynamic bounds checks on every single memory access, which is fundamentally less efficient. Some of those checks can run in parallel with normal execution, so it costs transistors but not cycles, and transistors are cheap. But some checks can't.

The Bernhardt talk references "Deconstructing Process Isolation" (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1...) which says "We found that hardware-based isolation incurs non-trivial performance costs (up to 25–33%)". That's specifically about their Singularity microkernel OS, which involves a lot of message-passing IPC. Hardware isolation means virtual memory, separate page tables for each process (or for a small group of processes), TLB flushes on context switches, system calls done with sysenter, etc. Software isolation means memory safety of untrusted code is guaranteed by their compiler, so it all runs in ring 0 alongside the kernel core, using physical memory directly instead of virtual memory, and system calls are just function calls.

Also, hardware isolation apparently means their IPC system copies messages from one user process's heap into the kernel heap, deallocates from the first heap, and then copies from the kernel heap into the other user process's heap. Or for large messages it remaps physical pages from one heap to the other. (That sounds suboptimal. Surely it would be way better to have, say, some kind of ring buffer in shared memory between the processes?)

In that system, one benchmark (which reads tiny files and involves a lot of IPC between application, filesystem and disk driver) gave the quoted 25-33% reduction in performance when using hardware isolation.

I don't think it's reasonable to generalise from that specific benchmark in that specific probably-suboptimal prototype kernel into saying "the hardware protection on the CPU costs 30% of performance" - it's not like Linux would get 30% faster if you ran it without memory protection. On the other hand, that's because Linux has already been designed in a way that largely avoids those costs (e.g. by not being a microkernel), and has accepted compromises on security and reliability in exchange for performance. With more software-based protection, maybe you can get security and reliability *and* performance.

BPF meets io_uring

Posted Mar 5, 2021 22:00 UTC (Fri) by zlynx (guest, #2285) [Link] (1 responses)

Microsoft has been badly burned by asynchronous shared memory in the past. I believe that their security guidelines just don't permit it anymore. As I remember it there were new Windows NT, 2000 and XP TOCTTOU bugs almost every month.

That was almost safe before multiprocessors became so popular. With one CPU it was nearly impossible to get in between the NT kernel and its IPC.

It's just too easy to change data between the kernel checking it and using it. So it has to be copied to the kernel or otherwise isolated before using it.

BPF meets io_uring

Posted Mar 5, 2021 22:19 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

Normal Windows syscalls are self-contained, length-prefixed packets, so they are copied once into kernel space.

The Windows kernel can read data from userspace, but that is only used for syscalls that deal with bulk data transfer. This data is typically unstructured.

BPF meets io_uring

Posted Mar 5, 2021 22:54 UTC (Fri) by Sesse (subscriber, #53779) [Link] (3 responses)

Yes, so if you design your system maximally split up, and run a benchmark that runs maximum communication between those parts, you get 30% overhead? And if you design your system less split up (by implementing it in a VM, and trusting that VM blindly), it gets faster…

BPF meets io_uring

Posted Mar 6, 2021 0:03 UTC (Sat) by excors (subscriber, #95769) [Link] (2 responses)

Basically that, although they weren't "trusting that VM blindly" - they were also building a language and compiler based on C# with contracts that are statically checked using a theorem prover, and then aiming to compile to typed assembly language (which has "a proof of the type safety of compiled x86 code along with the code itself. A properly structured proof of type safety can be verified by a small, simple checker whose correctness is far easier to ensure than a compiler's"), so they were well aware of the importance of minimising the trusted computing base and using formal methods to help ensure it deserved that trust.

(The Singularity project (which was from Microsoft Research) ended some time around 2010, but it sounds like it was reasonably successful - it evolved into the Midori project (which sounds more production-oriented, whereas Singularity was more pure research), and then it sounds like Midori was technically successful but got shut down for messy reasons, but it had proved a bunch of useful ideas (both technical and cultural) that got adopted more widely within Microsoft. There's a lot of detail about those lessons at http://joeduffyblog.com/2015/11/03/blogging-about-midori/)

BPF meets io_uring

Posted Mar 6, 2021 8:01 UTC (Sat) by roc (subscriber, #30627) [Link] (1 responses)

It's unfortunate that Microsoft, with all their resources, didn't invent Rust years earlier.

BPF meets io_uring

Posted Mar 6, 2021 11:04 UTC (Sat) by excors (subscriber, #95769) [Link]

Those posts from Joe Duffy suggest that's largely what (a part of) Microsoft was trying to invent, and they just failed. E.g. specifically on concurrency safety:

> Out of the bunch, Rust has impressed me the most. They have delivered on much of what we set out to deliver with Midori, but actually shipped it (whereas we did not). My hat goes off to that team, seriously, because I know first hand what hard, hard, hard work this level of type system hacking is. [...] I trust their system more than ours, because they have shipped and had thousands of eyes and real-world experience applied to it.

(and that was 5 years ago).

If someone nowadays wanted to take the 'software isolated processes' concept into production, where safety is guaranteed by the language and verified by a small TCB and doesn't need to be enforced by expensive hardware, Rust sounds like a good place to start. (And would be nicer than just extending Linux until people start writing entire applications in BPF to avoid context switches.)

BPF meets io_uring

Posted Mar 7, 2021 20:43 UTC (Sun) by khim (subscriber, #9252) [Link] (6 responses)

> Software isolation means memory safety of untrusted code is guaranteed by their compiler

It doesn't work, unfortunately. Sun tried to patch the JVM for years … but eventually gave up (well, Oracle had given up by that point) and turned it into ActiveX.

A typical compiler (even a JIT compiler… maybe especially a JIT compiler) is just too complex a beast to trust without verification.

If you use a separate, much simpler verifier… then it may work. But not if you trust the compiler.

BPF meets io_uring

Posted Mar 7, 2021 23:16 UTC (Sun) by excors (subscriber, #95769) [Link] (5 responses)

Do you consider the BPF verifier+JIT to be trustworthy enough for its current use in the kernel? That seems like a successful(ish) example of software isolation, just applied to a tiny part of the OS instead of to nearly the whole thing (as Singularity was aiming for).

It looks like there has been an attempt to apply formal methods to BPF, which claims the JIT compiler can be feasibly verified: https://www.usenix.org/conference/osdi20/presentation/nelson

I don't see why you couldn't extend that to a more general-purpose language (maybe like Rust minus unsafe), with an untrusted compiler that does all the fancy analysis and optimisation and emits some kind of bytecode plus proof-of-correctness annotations, and then pass it to a trusted verifier+JIT which has itself been formally verified (and also is simple enough that people can review it manually). I'm not aware that anyone has done that yet, so it's probably a research project for some PhDs and then a lot more work to make it production-ready, but it doesn't sound impossible or obviously impractical. And then you just need to write a whole OS in that new language.

BPF meets io_uring

Posted Mar 8, 2021 10:04 UTC (Mon) by Wol (subscriber, #4433) [Link] (4 responses)

> I don't see why you couldn't extend that to a more general-purpose language

You cannot verify a Turing-complete language. Not possible.

Which is why BPF currently forbids loops. I wish they'd take a leaf out of the Fortran spec and say loop-control variables are read-only, at which point you can verify that the loop will terminate.

(The Fortran spec doesn't say control variables ARE read-only, but it explicitly permits storing them in a register unaffected by the rest of the code, so depending on your compiler, modifying the variable can have strange effects ...)
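Incidentally, Python's for-loop already behaves the way this comment wishes Fortran mandated: the range iterator owns the trip count, so a minimal sketch shows that rebinding the control variable in the body cannot affect termination:

```python
# The loop "control variable" i is driven by the iterator, so rebinding
# it in the body has no effect on how many times the loop runs -- the
# property that makes termination trivially provable.
steps = 0
for i in range(5):
    i = 10**9      # rebinding i does not touch the range iterator
    steps += 1
assert steps == 5  # still exactly five iterations
```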

Cheers,
Wol

BPF meets io_uring

Posted Mar 8, 2021 12:07 UTC (Mon) by farnz (subscriber, #17727) [Link] (1 responses)

Why do you say it's not possible to verify a Turing-complete language? Formal methods people do so all the time - verifying C code, for example.

A verifier has three possible outputs - "not valid", "valid", and "indeterminate validity"; it's then up to the user of the verifier to decide what to do with the output. For example, it could declare that "not valid" is barred, "valid" is allowed, and "indeterminate validity" requires a special-case assertion of correctness from a privileged user to permit it to run.
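A toy sketch of this three-outcome scheme, over a three-instruction bytecode invented purely for illustration (the real BPF verifier is vastly more involved):

```python
def verify(prog):
    """Return 'valid', 'not valid', or 'indeterminate'.

    prog is a list of instructions in a made-up bytecode:
      ("load", idx)  -- read slot idx of a 16-slot scratch area
      ("jmp", off)   -- relative jump
      ("exit",)      -- stop
    """
    for pc, ins in enumerate(prog):
        if ins[0] == "load":
            if not 0 <= ins[1] < 16:
                return "not valid"        # provably out of bounds
        elif ins[0] == "jmp":
            if not 0 <= pc + ins[1] < len(prog):
                return "not valid"        # jump leaves the program
            if ins[1] <= 0:
                return "indeterminate"    # backward jump: this naive
                                          # checker can't prove it halts
        elif ins[0] != "exit":
            return "not valid"            # unknown instruction
    return "valid"

print(verify([("load", 3), ("exit",)]))               # valid
print(verify([("load", 99), ("exit",)]))              # not valid
print(verify([("load", 0), ("jmp", -1), ("exit",)]))  # indeterminate
```

The policy layer then decides what each outcome means: run it, reject it, or demand a privileged override.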

BPF meets io_uring

Posted Mar 8, 2021 12:58 UTC (Mon) by Wol (subscriber, #4433) [Link]

I was thinking of verifying in the sense of a definite answer, not returning "indeterminate".

That's the problem with BPF and loops at the moment - a verifier cannot return "valid", so they get rejected.

Let's rephrase what I said - a verifier cannot be guaranteed to return a definitive answer, if the language is Turing complete. And that's the problem with BPF - if the problem space cannot be verified as valid, then the kernel won't run it. Which is why at present the kernel doesn't permit backward jumps in BPF. (I think they've found a way round that restriction a bit.)

Cheers,
Wol

BPF meets io_uring

Posted Mar 8, 2021 13:16 UTC (Mon) by excors (subscriber, #95769) [Link] (1 responses)

> You cannot verify a Turing-complete language. Not possible.

You can, as long as the output of the verifier is "this is safe" or "I don't know if this is safe, so I won't let you run it", for whatever definition of "safe" you want. In the trivial case a verifier can emit "don't know" for every program. The trick is in picking a useful definition of "safe", then making the provably-safe subset large enough to be useful in practice, and obvious enough to a programmer that they can stay within the provably-safe bounds and not get repeatedly frustrated by trying to write code that they know is safe but that gets rejected by the system.

The current state of the art indicates that's achievable in practice, even for high-performance general-purpose languages, though I think it's still an active research topic.

(If you're writing a whole OS in such a language, there will probably be a few places where you need to do things that are not provably safe, but in that case you can move the unprovably safe code into your trusted computing base and hopefully it's small enough that you can manually verify it with other techniques. Like if you were using Rust, 'unsafe' blocks are for unprovably safe code, and I guess you'd want to move all such blocks into external modules in the TCB, then enforce that applications cannot use 'unsafe' themselves and have to call those trusted modules instead.)

If your definition of "safe" allows non-halting programs, and just forbids e.g. memory safety issues, then the verifiable subset of your language can still be Turing complete: you can easily write a simulator for tape-based Turing machines in e.g. JavaScript, and the program will obviously be memory-safe (because JS always is), and the simulator can still compute anything that any Turing machine can (ignoring quibbles about the need for infinite memory). You don't have to sacrifice completeness unless you want to guarantee termination. (BPF does want to guarantee termination, but the more general concept of software isolation (in the Singularity sense) doesn't require that.)
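That point can be made concrete in a few lines: a tape-based Turing machine simulator in which every access goes through a bounds-safe dict, so the host can never be corrupted, yet the simulator can compute anything a Turing machine can (`run_tm` and the example rule table here are invented for this sketch):

```python
def run_tm(rules, tape, state="start", pos=0, max_steps=10_000):
    """Simulate a Turing machine. rules maps (state, symbol) to
    (new_state, symbol_to_write, head_movement)."""
    tape = dict(enumerate(tape))        # sparse, grow-on-demand tape:
    for _ in range(max_steps):          # all access is memory-safe
        if state == "halt":
            return state, tape
        sym = tape.get(pos, 0)          # blank cells read as 0
        state, write, move = rules[(state, sym)]
        tape[pos] = write
        pos += move
    return state, tape                  # step budget exhausted

# Example machine: zero out 1s until the first 0 is found, then halt.
rules = {
    ("start", 1): ("start", 0, +1),
    ("start", 0): ("halt",  0,  0),
}
final, tape = run_tm(rules, [1, 1, 1, 0])
# final == "halt"; every written cell is now 0
```

Note the `max_steps` budget: memory safety is guaranteed unconditionally, while termination is only enforced by fiat, which is exactly the distinction drawn above.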

> Which is why BPF currently forbids loops

BPF allows constant-bounded loops now: https://lwn.net/Articles/794934/

BPF meets io_uring

Posted Mar 27, 2021 9:35 UTC (Sat) by ksandstr (guest, #60862) [Link]

>> You cannot verify a Turing-complete language. Not possible.

>You can, as long as the output of the verifier is "this is safe" or "I don't know if this is safe, so I won't let you run it", for whatever definition of "safe" you want.

The subset of Turing-complete programs that can be verified does not sum up to a Turing-complete system. It follows that a function from an arbitrary program to a "verifiable, and known safe"/"not verifiable, or not known safe" result is at most identifying a Turing-incomplete subset by whether the program falls into it or not. This is what Wol's comment, an oft-repeated truth about Turing-complete languages (processors, VMs, whatever), means.

As for having programmers work in a Turing-incomplete language, good luck with that. People have been burning time and money on various hobby horse projects in that space since the early eighties, of which absolutely nothing has percolated into software developers' daily practice. It is literally the Philosopher's Stone of software, a windmill that many superficially smart people figure they could really take a proper tilt at if only they had a bigger horse.

Meanwhile things such as Ada, which yields many practical advantages while not compromising on Turing-completeness, languish in the dustbin of old and busted curiosities.

BPF meets io_uring

Posted Dec 29, 2023 16:04 UTC (Fri) by pgoetz (subscriber, #4931) [Link]

> It's not practical for hardware to deduce those static guarantees

I mean, this is why we have an operating system?

BPF meets io_uring

Posted Mar 5, 2021 15:16 UTC (Fri) by federico3 (guest, #101963) [Link]

> I wonder if this is all converging somewhere coherent, or if we'll just have to pick combinations of specialized speedhax depending on the application domain.

This is inevitable if you want to satisfy a lot of different needs.

BPF meets io_uring

Posted Mar 4, 2021 20:03 UTC (Thu) by josh (subscriber, #17465) [Link] (1 responses)

I wonder if the system administrator could authorize specific BPF programs, and unprivileged userspace could then use only those programs and no others? That would allow common libraries to come with sets of BPF programs they require, and distributions could then allow those programs automatically based on being installed in a system location.

BPF meets io_uring

Posted Mar 19, 2021 13:38 UTC (Fri) by HelloWorld (guest, #56129) [Link]

It should be pretty simple to implement. You just need a daemon with root permissions that loads the BPF programs with bpf(2) and then hands the resulting file descriptors out to other processes using SCM_RIGHTS.
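A minimal sketch of the fd-handoff half of such a daemon, using Python's `socket.send_fds()`/`socket.recv_fds()` wrappers around SCM_RIGHTS. Since loading a real BPF program requires privileges, an ordinary pipe descriptor stands in for the fd that bpf(2) would return:

```python
import os
import socket

# Stand-in for the fd a privileged daemon would get back from bpf(2).
r, w = os.pipe()
os.write(w, b"bpf-prog")

# In the real design these would be the two ends of a Unix socket
# connecting the daemon to an unprivileged client.
daemon_end, client_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# "Daemon" side: send one byte of payload plus the fd as ancillary data
# (SCM_RIGHTS under the hood; requires Python 3.9+).
socket.send_fds(daemon_end, [b"x"], [r])

# "Client" side: receive the message and a fresh descriptor that refers
# to the same underlying object as the daemon's fd.
msg, fds, _flags, _addr = socket.recv_fds(client_end, 1, 1)
payload = os.read(fds[0], 8)   # reads b"bpf-prog" through the passed fd
```

The client never needs CAP_BPF itself; it only ever sees descriptors the daemon chose to hand out, which is the whole point of the scheme.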

BPF meets io_uring

Posted Mar 4, 2021 21:03 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (6 responses)

So.... Is this enough to do a TLS handshake entirely in kernel-space?

BPF meets io_uring

Posted Mar 4, 2021 21:36 UTC (Thu) by krkhan (subscriber, #136994) [Link] (2 responses)

BPF meets io_uring

Posted Mar 4, 2021 21:40 UTC (Thu) by krkhan (subscriber, #136994) [Link] (1 responses)

Ah.. okay, I see it's just the record layer. But what prevents the full handshake from being done in the kernel without the BPF-io_uring patches?

BPF meets io_uring

Posted Mar 4, 2021 21:57 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

Well, you do need a fair amount of code to validate the ASN.1 certificates and negotiate the parameters. ASN.1 parsing is already present in Linux, so theoretically the amount of code to handle the full TLS handshake is not THAT large.

BPF meets io_uring

Posted Mar 5, 2021 10:32 UTC (Fri) by Sesse (subscriber, #53779) [Link] (2 responses)

Why would you want to do that? You surely cannot gain much performance, since the computational cost of the TLS handshake greatly outweighs the transition costs between kernel- and userspace.

BPF meets io_uring

Posted Mar 5, 2021 16:25 UTC (Fri) by gerdesj (subscriber, #5446) [Link] (1 responses)

You'll need it when you are creating a modern one of these: https://www.linuxjournal.com/article/4132

BPF meets io_uring

Posted Mar 7, 2021 22:35 UTC (Sun) by Sesse (subscriber, #53779) [Link]

I don't think that answers the question? Let me rephrase it; why would you want a kernel-side HTTP server with TLS support? (khttpd never really took off back in the day…)

IOSQE_IO_LINK, IORING_OP_BPF, IORING_OP_READ, and streams of length-prefixed packets

Posted Mar 20, 2021 4:11 UTC (Sat) by namibj (guest, #145316) [Link]

I think this on its own may be useful already, without access to completion status.

A cluster compute framework (Timely Dataflow, for the curious) currently runs one send/receive thread for each connected peer.
It uses TCP connections and uses a custom message format.
The fixed-layout header contains routing information to target a specific worker thread and thread-local coroutine inbox, along with the message body's length.
For zero-copy deserialization, the message body needs to end up contiguous in the worker thread's address space (ideally NUMA-local to that thread).
Round-trip overhead largely prevents reassembly-free processing, so it currently has to memcpy messages that cross read(2) target buffer boundaries.

With BPF, it could read the header (so userspace can use it for logging etc.) and, through IOSQE_IO_LINK, immediately run a small BPF program that looks at the now-in-userspace header to extract the target thread and length. After passing the target thread id from the header (combined with `clz(header.msg_len)`) to `bpf_map_lookup_elem()` to retrieve a BPF_MAP_TYPE_STACK reference, it claims a slab by `pop()`ing a pointer to the slab and its `buf_index`.
Ignoring some bookkeeping needed to cope with incomplete reads and out-of-slabs situations, it can just finish by submitting an IORING_OP_READ_FIXED to DMA-deliver the message body into a NUMA-aware slab allocation, using IOSQE_IO_LINK to chain an IORING_OP_READ + IORING_OP_BPF pair that will deal with the next header in the stream.

This of course gets extra powerful if that header-parsing eBPF program can run on a NIC. If there's something concrete I could contribute to the kernel to let eBPF inspect such a header (inside a TCP stream, potentially spanning TCP packets) directly on a NIC (without first copying to a buffer in system memory, like the above sketch does), I'd like that as my first kernel contribution. I'd need to know where to start reading/poking and how much work to expect.



Copyright © 2021, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds