Commits
master
Commits on Jul 27, 2023
-
nginx config: Remove `worker_processes` and `events.worker_connections`.
This is closer to the default nginx settings with respect to concurrency. Turn off access log, as it is heavy on I/O and would not be used in a production setup (whether with or without gVisor). Update nginx to version `1.25.1`. PiperOrigin-RevId: 551669286
-
site: note that flags passed to run should be replicated for restore
PiperOrigin-RevId: 551588446
-
Merge pull request #8990 from sitano:ivan_ptrace_eperm_guide
PiperOrigin-RevId: 551559276
Commits on Jul 24, 2023
-
gVisor `fio` benchmarks: Use `libaio` where it makes sense.
Enforce that `IODepth` must be `1` when using the `sync` IO engine, since it has no effect with that engine. Also add unit names to `tools.Fio` struct fields for easier readability. PiperOrigin-RevId: 550710273
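The `IODepth` rule from this commit can be illustrated with a small validation step. This is only a sketch: `FioConfig`, its fields, and `Validate` are hypothetical names chosen for illustration and are not the actual `tools.Fio` struct or its methods.

```go
package main

import (
	"errors"
	"fmt"
)

// FioConfig is a hypothetical stand-in for the benchmark's fio settings.
type FioConfig struct {
	IOEngine string // e.g. "sync" or "libaio"
	IODepth  int    // number of I/O units to keep in flight
}

var errSyncDepth = errors.New("IODepth must be 1 with the sync IO engine")

// Validate rejects queue depths that the sync engine would silently ignore.
func (c FioConfig) Validate() error {
	if c.IOEngine == "sync" && c.IODepth != 1 {
		return errSyncDepth
	}
	return nil
}

func main() {
	fmt.Println(FioConfig{IOEngine: "sync", IODepth: 4}.Validate())
	fmt.Println(FioConfig{IOEngine: "libaio", IODepth: 4}.Validate())
}
```

Failing early in this way keeps a benchmark from reporting a queue depth that the engine never applied.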
-
This change introduces the nsfs file system. Each new namespace allocates a new nsfs inode. These inodes are needed for the following reasons:
* Each namespace has to have a unique ID.
* /proc/[pid]/ns/ contains one entry for each namespace. Bind mounting one of the files in this directory to somewhere else in the filesystem keeps the corresponding namespace alive even if all processes currently in the namespace terminate.
* setns() allows the calling process to join an existing namespace specified by a file descriptor.
PiperOrigin-RevId: 550694515
-
Better memory reporting for multi-container
Right now, the entire sandbox memory is reported per-container, confusing users and tools that aggregate per-container memory to compute sandbox/pod memory. So instead, split memory usage among all containers in the system, except for the root container, which is ignored by K8s. This way pod memory usage is shown correctly in graphs. Updates #172 PiperOrigin-RevId: 550670618
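A minimal sketch of the splitting idea follows. It assumes an even split among non-root containers, which the commit message does not state explicitly. `splitMemory` and the container IDs are hypothetical names, not gVisor's actual code.

```go
package main

import "fmt"

// splitMemory divides the sandbox total among all non-root containers.
// An even split is an assumption made for this sketch.
func splitMemory(total uint64, containers []string, rootID string) map[string]uint64 {
	out := make(map[string]uint64)
	var n uint64
	for _, id := range containers {
		if id != rootID {
			n++
		}
	}
	if n == 0 {
		return out // Nothing to report besides the root container.
	}
	for _, id := range containers {
		if id != rootID {
			out[id] = total / n
		}
	}
	return out
}

func main() {
	usage := splitMemory(300, []string{"root", "app", "sidecar"}, "root")
	fmt.Println(usage["app"], usage["sidecar"])
}
```

With this shape, summing the per-container values gives roughly the sandbox total instead of a multiple of it.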
-
Fix fio "regex"s in buildkite file.
PiperOrigin-RevId: 550601194
Commits on Jul 21, 2023
-
Add methods for generating PCI sysfs paths and registering accel devices.
The TPU userspace driver needs access to specific PCI device information located in Linux sysfs. We mirror the sysfs paths the driver reads on the host in the Sentry sysfs. This way we can ensure we only expose the host device information that's strictly necessary for TPU to run. PiperOrigin-RevId: 550005271
-
Remove last remaining !go1.22 build tag
The last remaining !go1.22 build tag protects the definition of pkg/sync.maptype, which is a copy of runtime.maptype. We need to ensure these definitions match so we can safely access the hasher field. At its core, this CL achieves this check by ensuring that unsafe.Offsetof(maptype{}.Hasher) matches the offset in the runtime version of the type. Several things happen along the way to achieve this:
* As of May 2023, runtime.maptype is actually a type alias for internal/abi.MapType. checkoffset was failing to record the offsets because it skipped type aliases for no good reason. Simply removing the type alias check is sufficient to make type aliases work. (This part of the CL is technically unnecessary because this CL ultimately references internal/abi.MapType directly in anticipation of removal of the type alias. But there is no reason not to allow type aliases.)
* The checkconst / checkoffset regexp unintentionally does not allow / in package paths, even though the rest of the package supports /. Fix this.
* checkconst was comparing the literal AST expression string against the runtime value (i.e., "unsafe.Offsetof(maptype{}.Hasher)" vs "72"), which fails comparison. Switch to getting the resolved constant value from the type checker.
* nogo/check.importer only loads package facts on direct import (stored in importer.cache). If a package is not directly imported, ImportPackageFact will not find the facts. Typically packages need to ensure they directly depend on packages they want facts from (e.g., pkg/sync has a dummy import of runtime in runtime.go). This doesn't work for internal/abi because we cannot directly import an internal package. Work around this as a hack by unconditionally "importing" internal/abi when analyzing any package.
With regard to the last point, note that the nogo/defs.bzl nogo integration only provides facts from the direct dependencies and the entire stdlib (since the stdlib is analyzed as one bundle). So this trick only works for a stdlib package. A bazel package indirect dependency would be missing facts altogether. PiperOrigin-RevId: 549999084
-
Add nvproxy support for V100 Nvidia GPUs.
Tested on 1 V100 GPU:
```
$ docker run --runtime=runsc --rm --gpus all nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```
PiperOrigin-RevId: 549837326
-
PiperOrigin-RevId: 549820338
Commits on Jul 20, 2023
-
Merge pull request #9007 from andrew-d:andrew/tcp-forwarder-on-ignored
PiperOrigin-RevId: 549727271
-
Add seccomp filters for TPU proxying and stub out accel fd methods.
PiperOrigin-RevId: 549718797
-
Add config flags and sandbox chroot configuration for TPU proxying.
PiperOrigin-RevId: 549662855
-
Plumb memory cgroup id in memmap.IncRef.
Update the memmap IncRef method to pass memory cgroup id and store it in the FrameRefSet which will be used for memory accounting. During DecRef, the memCgID from the FrameRefSet will be retrieved and passed to MemoryLocked.Dec to remove the memory from the cgroup. PiperOrigin-RevId: 549656411
-
Add `O_DIRECT` version of `fio` benchmarks to track direct I/O performance.
PiperOrigin-RevId: 549470615
Commits on Jul 19, 2023
-
Run `nvidia-container-cli configure` in the Gofer mount namespace.
This change adds a new synchronization FD to the Gofer startup sequence, passed as `sync-nvproxy-fd`. The `runsc create` process uses this to wait for the Gofer to start, then to run `nvidia-container-cli configure [...] --pid=$GOFER_PID`. This causes the mounts that `nvidia-container-cli configure` does to be performed in the mount namespace of the Gofer, rather than the `runsc create` process. This avoids polluting the main mount namespace with NVIDIA-specific mountpoints, and ties the lifetime of these mounts to the lifetime of the Gofer process, which means they are cleaned up automatically when the sandbox exits.
Due to the added complexity in Gofer startup, this CL also introduces a `goferSyncFDs` struct that encodes some of the logic around these FDs, and better documents how they interact with the Gofer and the container startup sequence.
One suggestion by Ayush was to move `nvidia-container-cli` to be done after `createGoferProcess` returns. Unfortunately this isn't possible without having to return the nvproxy FD in the `createGoferProcess` return signature, since FDs can only be donated before the Gofer process has started. This would make the signature uglier. So instead, this CL takes the approach of a single `nvproxyConfigureGofer` function called during Gofer initialization. It creates and donates the FD to the Gofer command, and returns a callback function called after the Gofer process is started, where it finally runs `nvidia-container-cli configure` and then notifies the Gofer through this same FD it created. This encapsulates all the logic within `nvproxyConfigureGofer` and is the cleanest I could think of.
Tested manually using a fresh Debian machine, with:
```shell
$ sudo mkdir -p /tmp/bundle-cuda/rootfs
$ docker export $(docker create nvidia/cuda:11.6.2-base-ubuntu20.04) \
    | sudo tar -xf - -C /tmp/bundle-cuda/rootfs
$ sudo runc spec --bundle=/tmp/bundle-cuda
$ $EDITOR /tmp/bundle-cuda/config.json  # Add NVIDIA_VISIBLE_DEVICES=0 and NVIDIA_DRIVER_CAPABILITIES=all to env
$ sudo ./runsc -nvproxy -nvproxy-docker create --bundle=/tmp/bundle-cuda mycuda
$ sudo ./runsc -nvproxy -nvproxy-docker start mycuda
$ sudo ./runsc -nvproxy -nvproxy-docker exec mycuda nvidia-smi -L
(Works)
$ sudo ./runsc delete --force mycuda
# And verified at each step that `grep bundle-cuda /proc/mounts` was empty.
```
And also verified that regular use through Docker also works. Fixes #9142. PiperOrigin-RevId: 549427158
-
Don't run benchmarks on ptrace on buildkite.
PiperOrigin-RevId: 549384309
-
Add accel and gasket ABI definitions.
PiperOrigin-RevId: 549376196
-
Allow walking on FIFO and UDS in lisafs.
The flags --host-uds and --host-fifo only control whether the application can open/connect or create/bind these special files. Stat-ing a host FIFO or UDS should not be blocked. PiperOrigin-RevId: 549199286
Commits on Jul 18, 2023
-
Increment/decrement memory accounted per cgroup.
- Adds a new field in the usageInfo to store the memory cgroup id.
- Creates a map of cgroup ids and memory stats to track the memory per cgroup in MemoryLocked struct.
- Introduces new methods to increment, decrement, move, copy and get the total memory usage per cgroup.
PiperOrigin-RevId: 549148091
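The per-cgroup bookkeeping described above can be sketched as a map keyed by cgroup ID. The names here (`cgroupUsage`, `Inc`, `Dec`, `Move`, `Total`) are hypothetical and simplified. The real change tracks richer memory stats inside the `MemoryLocked` struct, and clamping at zero is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"sync"
)

// cgroupUsage is a hypothetical tracker of bytes accounted per memory cgroup ID.
type cgroupUsage struct {
	mu    sync.Mutex
	bytes map[uint32]uint64
}

func newCgroupUsage() *cgroupUsage {
	return &cgroupUsage{bytes: make(map[uint32]uint64)}
}

// Inc charges n bytes to the cgroup.
func (u *cgroupUsage) Inc(id uint32, n uint64) {
	u.mu.Lock()
	defer u.mu.Unlock()
	u.bytes[id] += n
}

// Dec uncharges n bytes, clamping at zero and dropping empty entries.
func (u *cgroupUsage) Dec(id uint32, n uint64) {
	u.mu.Lock()
	defer u.mu.Unlock()
	u.decLocked(id, n)
}

func (u *cgroupUsage) decLocked(id uint32, n uint64) {
	if n >= u.bytes[id] {
		delete(u.bytes, id)
		return
	}
	u.bytes[id] -= n
}

// Move transfers up to n bytes from one cgroup to another under a single lock.
func (u *cgroupUsage) Move(from, to uint32, n uint64) {
	u.mu.Lock()
	defer u.mu.Unlock()
	if avail := u.bytes[from]; n > avail {
		n = avail // Never move more than the source holds.
	}
	u.decLocked(from, n)
	u.bytes[to] += n
}

// Total returns the bytes currently charged to the cgroup.
func (u *cgroupUsage) Total(id uint32) uint64 {
	u.mu.Lock()
	defer u.mu.Unlock()
	return u.bytes[id]
}

func main() {
	u := newCgroupUsage()
	u.Inc(1, 4096)
	u.Inc(2, 8192)
	u.Move(2, 1, 4096)
	fmt.Println(u.Total(1), u.Total(2))
}
```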
-
Pass NV2080_CTRL_CMD_MC_SERVICE_INTERRUPTS through nvproxy.
Fixes #9176 PiperOrigin-RevId: 549125072
-
Remove panic in ConsumeCoverageData() when no coverage is observed.
A call to ConsumeCoverageData() can observe zero incremental coverage immediately after a concurrent call to ConsumeCoverageData() unlocks coverageMu if sync.Mutex.Lock/Unlock are excluded from coverage instrumentation. PiperOrigin-RevId: 549119637
-
Update host redirect handling for gvisor.dev
We now make sure that the requested domain is a valid domain based on the custom domain and project ID settings before redirecting. PiperOrigin-RevId: 548859553
Commits on Jul 17, 2023
-
kernfs: Don't try to cache anonymous inodes.
They have no parent, so are not reachable again. PiperOrigin-RevId: 548765107
-
gvisor.dev homepage: Minor fixes.
This CL does the following:
- Add `<strong>emphasis</strong>` on the important keywords for each panel
- Mention "LLM-generated code" as code that can be sandboxed in gVisor
- Change "GPU support" header to "GPU & CUDA support"
- Add PNG transparency to images where it was missing
- Remove link to raw image file on architecture diagram
- Adjust panel icon size
- Adjust `<h2>` margins in panels
- Small CSS cleanups
PiperOrigin-RevId: 548764864
-
Do not hold metadataMu on gofer O_DIRECT read path.
dentry.writeback() takes dataMu when it needs to, so holding metadataMu on this path seems to be unnecessary. PiperOrigin-RevId: 548763586
-
pkg/tcpip/transport/tcp: add statistics for dropped connections
When the TCP forwarder ignores a connection due to having too many in-flight connections, it's not easy to log a message or update a metric for later debugging. Add a metric that will be incremented in this case so that the user of the Forwarder can observe this. Signed-off-by: Andrew Dunham <andrew@du.nham.ca>
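The pattern behind this metric can be sketched as a counter that is incremented whenever a connection is ignored because the in-flight limit is reached. The `forwarder` type and its fields below are hypothetical and do not reflect the actual `Forwarder` implementation or its metric name.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// forwarder is a hypothetical model of a connection forwarder with an
// in-flight limit and a counter for connections it had to ignore.
type forwarder struct {
	maxInFlight int64
	inFlight    atomic.Int64
	dropped     atomic.Uint64
}

// accept reserves a slot for a new connection. When the limit is exceeded,
// it releases the slot, bumps the dropped counter, and reports false.
func (f *forwarder) accept() bool {
	if f.inFlight.Add(1) > f.maxInFlight {
		f.inFlight.Add(-1)
		f.dropped.Add(1)
		return false
	}
	return true
}

// done releases the slot held by a finished connection.
func (f *forwarder) done() { f.inFlight.Add(-1) }

func main() {
	f := &forwarder{maxInFlight: 2}
	fmt.Println(f.accept(), f.accept(), f.accept())
	fmt.Println("dropped:", f.dropped.Load())
}
```

Exposing the counter lets the caller notice silent drops without logging on the hot path.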
Commits on Jul 15, 2023
-
The test checked the RTO value (500ms) for the first retransmit by rounding the value to seconds, which resulted in 1s. In some cases, when the calculated RTO was slightly less than 500ms (~499ms), the rounded value was 0s and the test failed. Fix this by checking the absolute difference when the calculated RTO is less than the expected RTO. Before: http://sponge2/6a8d125a-ff90-4090-8565-76b9f8a91573 After: http://sponge2/386f9716-bdc1-4079-848d-4ebc23b70167 PiperOrigin-RevId: 548273122
Commits on Jul 14, 2023
-
Impose default tmpfs size limits correctly.
Syzkaller came up with workloads that fallocate(2) 1 TB in /tmp. The host mlock(2) or madvise(2) syscalls on memfd(2) files end up hanging for multiple minutes in such situations, causing the watchdog to mark the calling goroutine as stuck. memfd(2) files have no size limits. Linux fails such fallocate(2) attempts in /tmp with ENOSPC. In Linux tmpfs (shmem), when the size= mount option is not specified, the default size limit for the mount is set to 50% of physical RAM size. In gVisor, it is set to MaxInt64, which is why Linux fails with ENOSPC and gVisor doesn't. In runsc, the physical RAM size is already exposed to the containerized application via `/proc/meminfo`, which uses `usage.*TotalMemoryBytes`. These fields are configured using the `MemTotal:` field from host `/proc/meminfo`. So use that information to set the default size limit correctly. Reported-by: syzbot+4aa3d6d42b063a11c850@syzkaller.appspotmail.com PiperOrigin-RevId: 548252095
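The 50% default can be sketched as follows. The example parses a made-up meminfo sample rather than reading the host file, and `memTotalBytes` and `defaultTmpfsLimit` are hypothetical helpers. The actual change reuses the total memory value that runsc already tracks.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// memTotalBytes extracts the MemTotal value (reported in kB) from
// /proc/meminfo-style text and converts it to bytes.
func memTotalBytes(meminfo string) (uint64, error) {
	sc := bufio.NewScanner(strings.NewReader(meminfo))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[0] == "MemTotal:" {
			kb, err := strconv.ParseUint(fields[1], 10, 64)
			if err != nil {
				return 0, err
			}
			return kb * 1024, nil
		}
	}
	return 0, fmt.Errorf("MemTotal not found")
}

// defaultTmpfsLimit mirrors the Linux default of 50% of physical RAM
// when no size= mount option is given.
func defaultTmpfsLimit(memTotal uint64) uint64 {
	return memTotal / 2
}

func main() {
	// Sample input invented for this sketch.
	sample := "MemTotal:       16384 kB\nMemFree:         1024 kB\n"
	total, err := memTotalBytes(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(defaultTmpfsLimit(total))
}
```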
-
Implement PR_{S,G}ET_CHILD_SUBREAPER.
Closes #2323 PiperOrigin-RevId: 548205854
-
Enforce --host-fifo flag in directfs.
The flag is only enforced in the lisafs gofer as of now. This change plumbs a gofer client flag that disallows opening FIFOs from the host filesystem. PiperOrigin-RevId: 548193558
-
netstack: remove finished TODO
Fixes #6015. PiperOrigin-RevId: 548191467
-
Use write(2) host syscall to perform writes on disk-backed MemoryFiles.
Prepopulating pages for disk-backed MemoryFiles has proved to be futile. The mf.MapInternal()+safemem.CopySeq() approach used right now incurs a lot of page faults without page population. Page-by-page faults incur a lot of context switching. On the other hand, the write syscall makes one context switch to the kernel and faults in all the pages that are touched during the write. Note that safemem.CopySeq() avoids a syscall and hence can sometimes be faster when the underlying page is populated. But with disk writebacks, it is hard to predict or account for what is populated. Writebacks can happen asynchronously based on system load. Benchmark results show that FIO write performance improves a lot on rootfs:
```
goos: linux
goarch: amd64
cpu: Intel(R) Xeon(R) CPU @ 2.80GHz
                                                         │ benchout.runsc-before │        benchout.runsc-after         │
                                                         │        sec/op         │    sec/op      vs base              │
BuildABSL/page_cache.clean/filesystem.bindfs-4                       90.20 ± 2%      89.44 ± 1%        ~ (p=0.382 n=8)
BuildGRPC/page_cache.clean/filesystem.bindfs-4                       626.0 ± 1%      626.8 ± 0%        ~ (p=0.505 n=8)
RubySpecTest/page_cache.clean/filesystem.bindfs-4                    52.11 ± 1%      52.37 ± 1%        ~ (p=0.105 n=8)
Fio/operation.write/blockSize.4K/filesystem.rootfs-4                2.509m ± 0%     2.509m ± 0%        ~ (p=0.878 n=8)
Fio/operation.write/blockSize.64K/filesystem.rootfs-4               2.009m ± 0%     1.507m ± 0%  -24.98% (p=0.000 n=8)
Fio/operation.write/blockSize.1024K/filesystem.rootfs-4             2.008m ± 0%     1.508m ± 0%  -24.90% (p=0.000 n=8)

                                                         │   benchout.runsc-before    │           benchout.runsc-after           │
                                                         │ bandwidth.bytes_per_second │ bandwidth.bytes_per_second  vs base      │
Fio/operation.write/blockSize.4K/filesystem.rootfs-4                     649.1M ± 2%      705.2M ± 2%   +8.64% (p=0.000 n=8)
Fio/operation.write/blockSize.64K/filesystem.rootfs-4                    991.1M ± 1%     1499.1M ± 3%  +51.25% (p=0.000 n=8)
Fio/operation.write/blockSize.1024K/filesystem.rootfs-4                  1.198G ± 2%      1.945G ± 2%  +62.34% (p=0.000 n=8)

                                                         │ benchout.runsc-before │        benchout.runsc-after         │
                                                         │ io_ops.ops_per_second │ io_ops.ops_per_second  vs base      │
Fio/operation.write/blockSize.4K/filesystem.rootfs-4                158.5k ± 2%     172.2k ± 2%   +8.64% (p=0.000 n=8)
Fio/operation.write/blockSize.64K/filesystem.rootfs-4               15.12k ± 1%     22.87k ± 3%  +51.25% (p=0.000 n=8)
Fio/operation.write/blockSize.1024K/filesystem.rootfs-4             1.143k ± 2%     1.855k ± 2%  +62.34% (p=0.000 n=8)

                                                         │ benchout.runsc-before │        benchout.runsc-after         │
                                                         │       load.sec        │   load.sec     vs base              │
RubySpecTest/page_cache.clean/filesystem.bindfs-4                    7.555 ± 1%      7.585 ± 1%        ~ (p=0.457 n=8)
```
PiperOrigin-RevId: 548021076

