BPF meets io_uring
Over the last couple of years, a lot of development effort has gone into two kernel subsystems: BPF and io_uring. The BPF virtual machine allows programs from user space to be safely run within the context of the kernel, while io_uring addresses the longstanding problem of running system calls asynchronously. As the two subsystems expand, it was inevitable that they would eventually meet; the first encounter happened in mid-February with this patch set from Pavel Begunkov adding the ability to run BPF programs from within io_uring.
The patch set itself is relatively straightforward, adding less than 300 lines of new code. It creates a new BPF program type (BPF_PROG_TYPE_IOURING) for programs that are meant to be run in the io_uring context. Any such programs must first be created with the bpf() system call, then registered with the ring in which they are intended to run using the new IORING_ATTACH_BPF command. Once that has been done, the IORING_OP_BPF operation will cause a program to be run within the ring. The final step in the patch series adds a helper function that BPF programs can use to submit new operations into the ring.
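In rough C, that flow might look like the sketch below. The IORING_ATTACH_BPF and IORING_OP_BPF constants exist only in the RFC patch set, not in any released kernel headers, so the placeholder values, the registration arguments, and the SQE field used to select a program here are all assumptions rather than settled ABI.

```c
/*
 * Hedged sketch of the register-then-run flow, using raw system
 * calls. The IORING_ATTACH_BPF and IORING_OP_BPF values below are
 * placeholders: the real numbers are defined only in the RFC patch.
 */
#include <linux/io_uring.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef IORING_ATTACH_BPF
#define IORING_ATTACH_BPF 99	/* placeholder; defined by the patch set */
#endif
#ifndef IORING_OP_BPF
#define IORING_OP_BPF 99	/* placeholder; defined by the patch set */
#endif

/* bpf_prog_fd was obtained with bpf(BPF_PROG_LOAD, ...) using the
 * new BPF_PROG_TYPE_IOURING program type. */
static int attach_and_queue(int ring_fd, int bpf_prog_fd,
			    struct io_uring_sqe *sqe)
{
	/* Register the loaded program with this specific ring. */
	if (syscall(__NR_io_uring_register, ring_fd, IORING_ATTACH_BPF,
		    &bpf_prog_fd, 1) < 0)
		return -1;

	/* Queue an operation that runs the registered program; the
	 * field used to pick the program by index is an assumption. */
	sqe->opcode = IORING_OP_BPF;
	sqe->off = 0;	/* index of the registered program */
	return 0;
}
```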
As a proof of concept, the patch series does a good job of showing how BPF programs might be run from an io_uring. This work does not, though, really enable any new capabilities in its current form, which may be part of why there have been no responses to it on the list. There is little value to running a BPF program asynchronously to submit another operation; one could simply submit that operation directly instead. As is acknowledged in the patch set, more infrastructure will be needed before this capability will become useful to users.
The obvious place where BPF can add value is making decisions based on the outcome of previous operations in the ring. Currently, these decisions must be made in user space, which involves potential delays as the relevant process is scheduled and run. Instead, when an operation completes, a BPF program might be able to decide what to do next without ever leaving the kernel. "What to do next" could include submitting more I/O operations, moving on to the next in a series of files to process, or aborting a series of commands if something unexpected happens.
Making that kind of decision requires the ability to run BPF programs in response to other events in the ring. The sequencing mechanisms built into io_uring now would suffice to run a program once a specific operation completes, but that program will not have access to much useful information about how the operation completed. Fixing that will, as Begunkov noted, require a way to pass the results of an operation into a BPF program when it runs. An alternative would be to tie programs directly to submitted operations (rather than making them separate operations, as is done in the patch set) that would simply run at completion time.
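As a purely hypothetical sketch of what such a completion-driven program could look like: the context structure, its fields, the section name, and the return-value convention below are all invented for illustration; the patch set defines none of them.

```c
/*
 * Hypothetical completion-driven program. The io_uring_bpf_ctx
 * layout and the "abort or continue the chain" return convention
 * are invented here; the RFC does not (yet) define either.
 */
#include <linux/types.h>
#include <bpf/bpf_helpers.h>

struct io_uring_bpf_ctx {	/* invented layout */
	__s32 res;		/* result of the completed operation */
	__u64 user_data;	/* identifies which operation finished */
};

SEC("iouring")
int on_complete(struct io_uring_bpf_ctx *ctx)
{
	/*
	 * Decide, without ever returning to user space, whether the
	 * rest of a linked chain should proceed: abort on error,
	 * continue otherwise. A real program could instead use the
	 * patch set's submission helper to queue new operations.
	 */
	return ctx->res < 0 ? 0 : 1;
}

char LICENSE[] SEC("license") = "GPL";
```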
With that piece in place, and with the increasing number of system calls supported within io_uring, it will become possible to create complex, I/O-related programs that can run in kernel space for extended periods. Running BPF programs may look like an enhancement to io_uring, but it can also be seen as giving BPF the ability to perform I/O and run a wide range of system calls. It looks like a combination that people might do some surprising things with.
That said, this is not a feature that is likely to be widely used. On its own, io_uring brings a level of complexity that is only justified for workloads that will see a significant performance improvement from asynchronous processing. Adding BPF into the mix increases that complexity significantly, and long sequences of operations and BPF programs could prove challenging to debug. Finally, loading io_uring BPF programs requires either the CAP_BPF or the CAP_SYS_ADMIN capability, which means "root" in most configurations. As long as the current hostility toward unprivileged BPF programs remains, that is unlikely to change; as a result, relatively few programs are likely to use this feature.
Still, the combination of these two subsystems provides an interesting look at where Linux may go in the future. Linux will (probably) never be a unikernel system, but the line between user space and the kernel does appear to be getting increasingly blurry.
| Index entries for this article | |
|---|---|
| Kernel | BPF/io_uring |
| Kernel | io_uring |
Posted Mar 4, 2021 18:43 UTC (Thu)
by tux3 (subscriber, #101245)
[Link] (19 responses)
The kernel/userland barrier just keeps being worked around in increasingly involved ways. Whether it's moving the TCP/IP stack to userspace, DIO, decoupling syscalls from context switches with the ring, or moving policy/control to kernelspace with BPF.
I wonder if this is all converging somewhere coherent, or if we'll just have to pick combinations of specialized speedhax depending on the application domain.
Posted Mar 5, 2021 11:57 UTC (Fri)
by miquels (guest, #59247)
[Link] (17 responses)
Posted Mar 5, 2021 16:37 UTC (Fri)
by thoeme (subscriber, #2871)
[Link]
Posted Mar 5, 2021 17:57 UTC (Fri)
by Sesse (subscriber, #53779)
[Link] (15 responses)
Posted Mar 5, 2021 20:15 UTC (Fri)
by excors (subscriber, #95769)
[Link] (14 responses)
It's not practical for hardware to deduce those static guarantees because it's at a very different level of abstraction. All it can do by itself is dynamic bounds checks on every single memory access, which is fundamentally less efficient. Some of those checks can run in parallel with normal execution, so it costs transistors but not cycles, and transistors are cheap. But some checks can't.
The Bernhardt talk references "Deconstructing Process Isolation" (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1...) which says "We found that hardware-based isolation incurs non-trivial performance costs (up to 25–33%)". That's specifically about their Singularity microkernel OS, which involves a lot of message-passing IPC. Hardware isolation means virtual memory, separate page tables for each process (or for a small group of processes), TLB flushes on context switches, system calls done with sysenter, etc. Software isolation means memory safety of untrusted code is guaranteed by their compiler, so it all runs in ring 0 alongside the kernel core, using physical memory directly instead of virtual memory, and system calls are just function calls.
Also, hardware isolation apparently means their IPC system copies messages from one user process's heap into the kernel heap, deallocates from the first heap, and then copies from the kernel heap into the other user process's heap. Or for large messages it remaps physical pages from one heap to the other. (That sounds suboptimal. Surely it would be way better to have, say, some kind of ring buffer in shared memory between the processes?)
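For illustration, the alternative the comment gestures at might look like the minimal single-producer/single-consumer ring below, built over a mapping both processes share. The sizes and layout are arbitrary; this is not a description of Singularity's actual IPC design.

```c
/*
 * Illustrative SPSC ring buffer over a shared mapping; slot count
 * and size are arbitrary. Just a sketch of the suggestion above.
 */
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define RING_SLOTS 256U		/* power of two */
#define SLOT_SIZE  64U

struct ring {
	_Atomic uint32_t head;	/* advanced by the consumer */
	_Atomic uint32_t tail;	/* advanced by the producer */
	uint8_t slots[RING_SLOTS][SLOT_SIZE];
};

/* Producer side: no syscall, no copy through a kernel heap; the
 * message goes straight into memory the consumer can already see. */
static int ring_send(struct ring *r, const void *msg, uint32_t len)
{
	uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
	uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);

	if (len > SLOT_SIZE || tail - head == RING_SLOTS)
		return -1;	/* message too big, or ring full */
	memcpy(r->slots[tail % RING_SLOTS], msg, len);
	/* Publish the slot only after its contents are written. */
	atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
	return 0;
}
```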
In that system, one benchmark (which reads tiny files and involves a lot of IPC between application, filesystem and disk driver) gave the quoted 25-33% reduction in performance when using hardware isolation.
I don't think it's reasonable to generalise from that specific benchmark in that specific probably-suboptimal prototype kernel into saying "the hardware protection on the CPU costs 30% of performance" - it's not like Linux would get 30% faster if you ran it without memory protection. On the other hand, that's because Linux has already been designed in a way that largely avoids those costs (e.g. by not being a microkernel), and has accepted compromises on security and reliability in exchange for performance. With more software-based protection, maybe you can get security and reliability *and* performance.
Posted Mar 5, 2021 22:00 UTC (Fri)
by zlynx (guest, #2285)
[Link] (1 responses)
That was almost safe before multiprocessors became so popular. With one CPU it was nearly impossible to get in between the NT kernel and its IPC.
It's just too easy to change data between the kernel checking it and using it. So it has to be copied to the kernel or otherwise isolated before using it.
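A kernel-style sketch of that copy-then-validate rule, with a made-up request structure and bound for illustration:

```c
/*
 * Sketch of avoiding the double-fetch race described above. The
 * struct req and MAX_REQ_LEN are invented for illustration.
 */
#include <linux/uaccess.h>
#include <linux/errno.h>

#define MAX_REQ_LEN 4096

struct req {
	u32 len;
	/* ... */
};

static long handle_request(struct req __user *ureq)
{
	struct req kreq;

	/* Snapshot the user data first; after this point the caller
	 * can no longer change what we are about to check. */
	if (copy_from_user(&kreq, ureq, sizeof(kreq)))
		return -EFAULT;
	if (kreq.len > MAX_REQ_LEN)
		return -EINVAL;
	/* ... only ever use kreq.len from here on; re-reading
	 * ureq->len would fetch racy user memory a second time. */
	return 0;
}
```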
Posted Mar 5, 2021 22:19 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Windows kernel can read data from userspace, but it's only used for syscalls that deal with bulk data transfer. This data is typically unstructured.
Posted Mar 5, 2021 22:54 UTC (Fri)
by Sesse (subscriber, #53779)
[Link] (3 responses)
Posted Mar 6, 2021 0:03 UTC (Sat)
by excors (subscriber, #95769)
[Link] (2 responses)
(The Singularity project (which was from Microsoft Research) ended some time around 2010, but it seems to have been reasonably successful: it evolved into the Midori project, which was more production-oriented where Singularity was pure research. It sounds like Midori was technically successful but got shut down for messy reasons, after proving a bunch of useful ideas (both technical and cultural) that were adopted more widely within Microsoft. There's a lot of detail about those lessons at http://joeduffyblog.com/2015/11/03/blogging-about-midori/)
Posted Mar 6, 2021 8:01 UTC (Sat)
by roc (subscriber, #30627)
[Link] (1 responses)
Posted Mar 6, 2021 11:04 UTC (Sat)
by excors (subscriber, #95769)
[Link]
> Out of the bunch, Rust has impressed me the most. They have delivered on much of what we set out to deliver with Midori, but actually shipped it (whereas we did not). My hat goes off to that team, seriously, because I know first hand what hard, hard, hard work this level of type system hacking is. [...] I trust their system more than ours, because they have shipped and had thousands of eyes and real-world experience applied to it.
(and that was 5 years ago).
If someone nowadays wanted to take the 'software isolated processes' concept into production, where safety is guaranteed by the language and verified by a small TCB and doesn't need to be enforced by expensive hardware, Rust sounds like a good place to start. (And would be nicer than just extending Linux until people start writing entire applications in BPF to avoid context switches.)
Posted Mar 7, 2021 20:43 UTC (Sun)
by khim (subscriber, #9252)
[Link] (6 responses)
> Software isolation means memory safety of untrusted code is guaranteed by their compiler
It doesn't work, unfortunately. Sun tried to patch the JVM for years… but eventually gave up (well, Oracle had given up by that point) and turned it into ActiveX. A typical compiler (even a JIT compiler… maybe especially a JIT compiler) is just too complex a beast to trust without verification. If you use a separate, much simpler verifier… then it may work. But not if you trust the compiler.
Posted Mar 7, 2021 23:16 UTC (Sun)
by excors (subscriber, #95769)
[Link] (5 responses)
It looks like there has been an attempt to apply formal methods to BPF, which claims the JIT compiler can be feasibly verified: https://www.usenix.org/conference/osdi20/presentation/nelson
I don't see why you couldn't extend that to a more general-purpose language (maybe like Rust minus unsafe), with an untrusted compiler that does all the fancy analysis and optimisation and emits some kind of bytecode plus proof-of-correctness annotations, and then pass it to a trusted verifier+JIT which has itself been formally verified (and also is simple enough that people can review it manually). I'm not aware that anyone has done that yet, so it's probably a research project for some PhDs and then a lot more work to make it production-ready, but it doesn't sound impossible or obviously impractical. And then you just need to write a whole OS in that new language.
Posted Mar 8, 2021 10:04 UTC (Mon)
by Wol (subscriber, #4433)
[Link] (4 responses)
You cannot verify a Turing-complete language. Not possible.
Which is why BPF currently forbids loops. I wish they'd take a leaf out of the Fortran spec and say loop-control variables are read-only, at which point you can verify that the loop will terminate.
(The Fortran spec doesn't say control variables ARE read-only, but it explicitly permits storing them in a register unaffected by the rest of the code, so depending on your compiler, modifying the variable can have strange effects ...)
Cheers,
Wol
Posted Mar 8, 2021 12:07 UTC (Mon)
by farnz (subscriber, #17727)
[Link] (1 responses)
Why do you say it's not possible to verify a Turing-complete language? Formal methods people do so all the time - verifying C code, for example.
A verifier has three possible outputs - "not valid", "valid", and "indeterminate validity"; it's then up to the user of the verifier to decide what to do with the output. For example, it could declare that "not valid" is barred, "valid" is allowed, and "indeterminate validity" requires a special-case assertion of correctness from a privileged user to permit it to run.
Posted Mar 8, 2021 12:58 UTC (Mon)
by Wol (subscriber, #4433)
[Link]
That's the problem with BPF and loops at the moment: a verifier cannot return "valid" for them, so they get rejected.
Let's rephrase what I said: a verifier cannot be guaranteed to return a definitive answer if the language is Turing complete. And that's the problem with BPF: if a program cannot be verified as valid, then the kernel won't run it. Which is why at present the kernel doesn't permit backward jumps in BPF. (I think they've found a way round that restriction a bit.)
Cheers,
Wol
Posted Mar 8, 2021 13:16 UTC (Mon)
by excors (subscriber, #95769)
[Link] (1 responses)
You can, as long as the output of the verifier is "this is safe" or "I don't know if this is safe, so I won't let you run it", for whatever definition of "safe" you want. In the trivial case a verifier can emit "don't know" for every program. The trick is in picking a useful definition of "safe", then making the provably-safe subset large enough to be useful in practice, and obvious enough to a programmer that they can stay within the provably-safe bounds and not get repeatedly frustrated by trying to write code that they know is safe but that gets rejected by the system.
The current state of the art indicates that's achievable in practice, even for high-performance general-purpose languages, though I think it's still an active research topic.
(If you're writing a whole OS in such a language, there will probably be a few places where you need to do things that are not provably safe, but in that case you can move the unprovably safe code into your trusted computing base and hopefully it's small enough that you can manually verify it with other techniques. Like if you were using Rust, 'unsafe' blocks are for unprovably safe code, and I guess you'd want to move all such blocks into external modules in the TCB, then enforce that applications cannot use 'unsafe' themselves and have to call those trusted modules instead.)
If your definition of "safe" allows non-halting programs, and just forbids e.g. memory safety issues, then the verifiable subset of your language can still be Turing complete: you can easily write a simulator for tape-based Turing machines in e.g. JavaScript, and the program will obviously be memory-safe (because JS always is), and the simulator can still compute anything that any Turing machine can (ignoring quibbles about the need for infinite memory). You don't have to sacrifice completeness unless you want to guarantee termination. (BPF does want to guarantee termination, but the more general concept of software isolation (in the Singularity sense) doesn't require that.)
> Which is why BPF currently forbids loops
BPF allows constant-bounded loops now: https://lwn.net/Articles/794934/
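For instance, a loop like the one in this minimal socket-filter sketch passes the modern verifier because the trip count is a compile-time constant, so every iteration can be walked and termination is provable:

```c
/*
 * Minimal example of a constant-bounded loop that the verifier
 * accepts (since the 5.3 work linked above).
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("socket")
int sum_first_bytes(struct __sk_buff *skb)
{
	__u32 sum = 0;

	/* Constant trip count: the verifier can enumerate all 16
	 * iterations and prove the loop terminates. */
	for (int i = 0; i < 16; i++) {
		__u8 b;

		if (bpf_skb_load_bytes(skb, i, &b, 1) < 0)
			break;
		sum += b;
	}
	/* Socket-filter convention: bytes of the packet to keep. */
	return sum ? skb->len : 0;
}

char LICENSE[] SEC("license") = "GPL";
```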
Posted Mar 27, 2021 9:35 UTC (Sat)
by ksandstr (guest, #60862)
[Link]
>You can, as long as the output of the verifier is "this is safe" or "I don't know if this is safe, so I won't let you run it", for whatever definition of "safe" you want.
The subset of Turing-complete programs that can be verified does not sum up to a Turing-complete system. It follows that a function from an arbitrary program to a "verifiable, and known safe"/"not verifiable, or not known safe" result is at most identifying a Turing-incomplete subset by whether the program falls into it or not. This is what Wol's comment, an oft-repeated truth about Turing-complete languages (processors, VMs, whatever), means.
As for having programmers work in a Turing-incomplete language, good luck with that. People have been burning time and money on various hobby horse projects in that space since the early eighties, of which absolutely nothing has percolated into software developers' daily practice. It is literally the Philosopher's Stone of software, a windmill that many superficially smart people figure they could really take a proper tilt at if only they had a bigger horse.
Meanwhile things such as Ada, which yields many practical advantages while not compromising on Turing-completeness, languish in the dustbin of old and busted curiosities.
Posted Dec 29, 2023 16:04 UTC (Fri)
by pgoetz (subscriber, #4931)
[Link]
I mean, this is why we have an operating system?
Posted Mar 5, 2021 15:16 UTC (Fri)
by federico3 (guest, #101963)
[Link]
This is inevitable if you want to satisfy a lot of different needs.
Posted Mar 4, 2021 20:03 UTC (Thu)
by josh (subscriber, #17465)
[Link] (1 responses)
Posted Mar 19, 2021 13:38 UTC (Fri)
by HelloWorld (guest, #56129)
[Link]
Posted Mar 4, 2021 21:03 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (6 responses)
Posted Mar 4, 2021 21:36 UTC (Thu)
by krkhan (subscriber, #136994)
[Link] (2 responses)
Posted Mar 4, 2021 21:40 UTC (Thu)
by krkhan (subscriber, #136994)
[Link] (1 responses)
Posted Mar 4, 2021 21:57 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Posted Mar 5, 2021 10:32 UTC (Fri)
by Sesse (subscriber, #53779)
[Link] (2 responses)
Posted Mar 5, 2021 16:25 UTC (Fri)
by gerdesj (subscriber, #5446)
[Link] (1 responses)
Posted Mar 7, 2021 22:35 UTC (Sun)
by Sesse (subscriber, #53779)
[Link]
IOSQE_IO_LINK, IORING_OP_BPF, IORING_OP_READ, and streams of length-prefixed packets
Posted Mar 20, 2021 4:11 UTC (Sat)
by namibj (guest, #145316)
[Link]
A cluster compute framework (Timely Dataflow, for the curious) currently runs one send/receive thread for each connected peer. It uses TCP connections and a custom message format. The fixed-layout header contains routing information to target a specific worker thread and thread-local coroutine inbox, along with the message body's length. For zero-copy deserialization, the message body needs to end up contiguous in the worker thread's address space (ideally NUMA-local to that thread). Round-trip overhead largely prevents reassembly-free processing, so it currently has to memcpy messages that cross read(2) target buffer boundaries.
With BPF, it could read the header (so userspace can use it for logging etc.) and, through IOSQE_IO_LINK, immediately run a small BPF program that looks at the now-in-userspace header to extract the target thread and length. After passing the target thread id from the header (combined with `clz(header.msg_len)`) to `bpf_map_lookup_elem()` to retrieve a BPF_MAP_TYPE_STACK reference, it claims a slab by `pop()`ing a pointer to the slab and its `buf_index`. Ignoring some bookkeeping needed to cope with incomplete reads and out-of-slabs situations, it can then finish by submitting an IORING_OP_READ_FIXED to DMA-deliver the message body into a NUMA-aware slab allocation, using IOSQE_IO_LINK to chain an IORING_OP_READ + IORING_OP_BPF pair that will deal with the next header in the stream.
This of course gets extra powerful if that header-parsing eBPF program can run on a NIC. If there's something concrete I could contribute to the kernel to let eBPF inspect that header (inside a TCP stream, potentially spanning TCP packets) directly on a NIC (without first copying to a buffer in system memory, like the above sketch does), I'd like that as my first kernel contribution. I'd need to know where to start reading/poking and how much work to expect.
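In liburing terms, the chain described above might be sketched as follows. The two reads, IOSQE_IO_LINK, and io_uring_prep_read_fixed() are real liburing interfaces; the BPF step is stood in for by a no-op, since IORING_OP_BPF exists only in the RFC, and in the real flow that step would discover the body length and slab that are passed in here for simplicity.

```c
/*
 * Sketch of the linked header/parse/body chain from the comment
 * above. The BPF step is a no-op stand-in: IORING_OP_BPF never
 * gained a liburing helper.
 */
#include <liburing.h>

static void queue_next_message(struct io_uring *ring, int sock,
			       void *hdr, unsigned hdr_len,
			       void *slab, unsigned body_len,
			       int buf_index)
{
	struct io_uring_sqe *sqe;

	/* 1: read the next fixed-layout header from the stream. */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_read(sqe, sock, hdr, hdr_len, 0);
	sqe->flags |= IOSQE_IO_LINK;

	/* 2: the header-parsing BPF step would go here, picking the
	 * target slab and filling in the body read below. */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_nop(sqe);
	sqe->flags |= IOSQE_IO_LINK;

	/* 3: DMA the body into the pre-registered, NUMA-local slab. */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_read_fixed(sqe, sock, slab, body_len, 0, buf_index);

	io_uring_submit(ring);
}
```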
Copyright © 2021, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds
