master
Commits on Sep 23, 2022
-
[ci/release] Enforce DeleteOnTermination is True for all EBS volumes (#…
-
[ci] Only target m5.2xlarge instances for rllib tests (#28736)
The rest of the tests can run on the default instances (m5.xlarge). Signed-off-by: Kai Fricke <kai@anyscale.com>
-
[cost-reduction] reduce both the time and machine cost for test_many_tasks
In this way we can reduce the test cost.
-
[RLlib] Better error message for partial space mismatch (Dict|Tuple) between env-returned values and given action/obs space. (#27785)
-
[ci] Re-write bazelrc on PR builds (#28728)
`install-bazel.sh` is only called in the base image build, which is a branch build; however, this sets wrong caching behavior in PR images that inherit from this image. Instead, we should re-write the bazelrc every time. Signed-off-by: Kai Fricke <kai@anyscale.com>
-
[Datasets] Add metadata override and inference in `Dataset.to_dask()` (#28625)
Adds an option to override the metadata when converting a Ray Dataset to a Dask DataFrame. If no override is provided, Datasets will infer the Dask DataFrame metadata using the Dataset schema (this should be cheaper than Dask's metadata inference, which involves launching a task).
-
Handle starting worker throttling inside worker pool (#28551)
Currently, the worker pool throttles how many workers can be started simultaneously (i.e. `maximum_startup_concurrency_`). Right now, if a PopWorker call cannot be fulfilled due to throttling, it fails and the caller (the local task manager) handles the retry. The issue is that when PopWorker fails, the local task manager releases the resources claimed by the task. As a result, even though the node already has enough tasks to use up all its resources, it still shows available resources and attracts more tasks than it can handle. Instead of letting the local task manager handle the throttling, it should be handled internally in the worker pool, since throttling is transient and not a real error; it is effectively the same as a longer worker startup time. Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
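The queueing idea can be sketched in plain Python (illustrative only; names like `WorkerPool` and `pop_worker` are stand-ins, not Ray's actual C++ internals):

```python
from collections import deque

class WorkerPool:
    """Sketch: queue PopWorker requests that exceed the startup
    concurrency limit instead of failing them back to the caller."""

    def __init__(self, maximum_startup_concurrency):
        self.max_concurrency = maximum_startup_concurrency
        self.starting = 0        # workers currently starting
        self.pending = deque()   # throttled requests, served FIFO

    def pop_worker(self, callback):
        if self.starting < self.max_concurrency:
            self.starting += 1
            self._start_worker(callback)
        else:
            # Previously this path returned a failure and the local task
            # manager released the task's resources; now we just wait.
            self.pending.append(callback)

    def _start_worker(self, callback):
        # Placeholder for the real (asynchronous) worker startup.
        callback("worker")

    def on_worker_started(self):
        self.starting -= 1
        if self.pending:
            self.starting += 1
            self._start_worker(self.pending.popleft())
```

From the caller's perspective a throttled request now just looks like a slow worker start, which matches the commit's framing of throttling as a transient condition rather than an error.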
Commits on Sep 22, 2022
-
[docs] Add basic parallel execution guide for Tune and clean up order of guides (#28677)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
-
Remove RAY_RAYLET_NODE_ID (#28715)
Since this feature has been reverted, this is no longer needed. #28678 Signed-off-by: rickyyx <rickyx@anyscale.com>
-
Add API latency and call counts metrics to dashboard APIs (#28279)
Adds basic latency and call count metrics for dashboard API endpoints. This will allow us to more easily debug issues where the dashboard APIs are unresponsive or slow. Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
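The pattern can be sketched with a small decorator (a hedged illustration; the actual change exports these as dashboard metrics rather than keeping them in process-local dicts):

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-process metric store for illustration.
call_counts = defaultdict(int)
latencies = defaultdict(list)

def track_metrics(endpoint):
    """Record call count and wall-clock latency per endpoint."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Count and time every call, including failed ones.
                call_counts[endpoint] += 1
                latencies[endpoint].append(time.perf_counter() - start)
        return wrapper
    return decorator

@track_metrics("/api/jobs")
def list_jobs():
    # Stand-in for a real dashboard API handler.
    return []
```

Recording in a `finally` block ensures slow *failing* endpoints are also visible, which is exactly the debugging case the commit targets.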
-
[AIR] Maintain dtype info in LightGBMPredictor (#28673)
We always convert to numpy and then back to dataframe in `LightGBMPredictor`, and try to infer dtypes in between. This is imprecise and allows for an edge case where a Categorical column composed of integers is classified as an int column, and it also decreases performance. This PR keeps dtype information if possible by not converting to numpy unnecessarily. The inference logic is still present for the tensor column case - I am not familiar enough with it to fix it here (if it needs fixing in the first place). Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
-
[tune] Test background syncer serialization (#28699)
Syncer (and thus Trial and Trial Runner) serialization fails because of threading objects that are not correctly unset on getstate. Signed-off-by: Kai Fricke <kai@anyscale.com>
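The failure mode and fix can be sketched in a few lines (an illustrative class, not Tune's actual syncer):

```python
import pickle
import threading

class BackgroundSyncer:
    """Sketch: strip unpicklable threading objects in __getstate__ so
    anything holding the syncer (e.g. a Trial) stays serializable."""

    def __init__(self):
        self._stop_event = threading.Event()
        # Background thread; thread handles cannot be pickled.
        self._thread = threading.Thread(
            target=self._stop_event.wait, daemon=True)
        self._thread.start()

    def __getstate__(self):
        state = self.__dict__.copy()
        # Unset threading objects; they would otherwise make pickling
        # fail, and can be recreated lazily after unpickling.
        state["_thread"] = None
        state["_stop_event"] = None
        return state
```

Without the `__getstate__` override, `pickle.dumps` on an instance raises a `TypeError` because thread handles and the lock inside `Event` cannot be pickled.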
-
[ci] Requirements contain a duplicate of 'starlette' (#28698)
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
-
[Tune] [PBT] Maintain consistent Trial/TrialRunner state when pausing and resuming trial (#28511)
When running synchronous PBT while checkpointing every time a perturbation happens, the experiment can reach a state where trial A is RUNNING but hanging forever without ever performing another train step, and trial B is PAUSED, waiting for A to reach the specified perturbation_interval. Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
-
[ci] Fix mac pipeline (use python 2 in CI scripts) (#28695)
`determine_tests_to_run.py` uses Python 2 on Mac, so we need to keep compatibility there. Signed-off-by: Kai Fricke <kai@anyscale.com>
-
[ci] Move to new hierarchical docker structure + pipeline (#28641)
This PR moves our Buildkite pipeline to a new hierarchical structure. When merging this PR, the old behavior will still work, i.e. the old pipeline is still in place. After merging this PR, we can build the base images for the master branch and then switch the CI pipelines to use the new build structure. Once this switch has been done, the following files will be removed:
- `./buildkite/pipeline.yml` - this has been split into pipeline.test.yml and pipeline.build.yml
- `./buildkite/Dockerfile` - this has been moved (and split) to `./ci/docker/`
- `./buildkite/Dockerfile.gpu` - this has been moved (and split) to `./ci/docker/`

The new structure is as follows:
- `./ci/docker` contains hierarchical Docker files that will be built by the pipeline.
- `Dockerfile.base_test` contains common dependencies
- `Dockerfile.base_build` inherits from it and adds build-specific dependencies, e.g. llvm, nvm, java
- `Dockerfile.base_ml` inherits from `base_test` and adds ML dependencies, e.g. torch, tensorflow
- `Dockerfile.base_gpu` depends on a CUDA image and otherwise has the same contents as `base_test` and `base_ml` combined

In each build, we do the following:
- `Dockerfile.build` is built on top of `Dockerfile.base_build`. Dependencies are re-installed, which is mostly a no-op (except if they changed from when the base image was built).
- `Dockerfile.test` is built on top of `Dockerfile.base_test`, and the extracted Ray installation from `Dockerfile.build` is injected.
- The same is true respectively for `ml` and `gpu`.

The pipelines have been split, and a new attribute `NO_WHEELS_REQUIRED` is added, identifying tests that can be early-started. Early start means that the last available branch image is used and the current code revision is checked out upon it. See https://github.com/ray-project/buildkite-ci-pipelines/ for the pipeline logic.
Additionally, this PR identified two CI regressions that haven't been caught previously, namely the minimal install tests that didn't properly install the respective Python versions, and some runtime environment tests that don't work with later Ray versions. These should be addressed separately and I'll create issues for them once this PR is merged. Signed-off-by: Kai Fricke <kai@anyscale.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com> Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com>
-
[Doc] Revamp ray core design patterns doc [8/n]: pass large arg by value (#28660)
Add a new anti-pattern of passing large arg by value. Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
-
[Datasets] Add initial aggregate benchmark (#28486)
This PR adds an initial aggregate benchmark (for the h2oai benchmark - https://github.com/h2oai/db-benchmark). To follow the convention in the h2oai benchmark, the benchmark is run on a single node (https://h2oai.github.io/db-benchmark/#environment-configuration). There is no fundamental blocker to running the benchmark on multiple nodes (it is just a matter of changing our `yaml` file). The benchmark has three input file settings - 0.5GB, 5GB and 50GB. Here we start with the 0.5GB input file; a followup PR will add benchmarks for 5GB and 50GB (just a matter of generating the input files, no benchmark code change needed). NOTE: The benchmark queries are not optimized yet; this is the most straightforward version of the code, which we can use as a baseline to find and fix performance gaps. A typical benchmark workflow would be: 1. Create an `xxx_benchmark.py` file for the specific APIs to benchmark (e.g. `split_benchmark.py` for split-related APIs). 2. Use the `Benchmark` class to run the benchmark. 3. Check in the benchmark code after testing locally and on a workspace. 4. Monitor nightly test results. 5. Create a Preset/Databricks dashboard and alert on benchmark results.
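The harness role played by the `Benchmark` class can be sketched as follows (an illustrative stand-in, not the actual Datasets class; the real one also reports results to nightly-test infrastructure):

```python
import time

class Benchmark:
    """Illustrative benchmark harness: run named cases and record
    wall-clock timings so queries can be compared against a baseline."""

    def __init__(self, name):
        self.name = name
        self.results = {}  # case name -> elapsed seconds

    def run(self, case_name, fn, *args, **kwargs):
        start = time.perf_counter()
        output = fn(*args, **kwargs)
        self.results[case_name] = time.perf_counter() - start
        return output
```

A `xxx_benchmark.py` file would then just define the query functions and call `Benchmark.run` on each of them.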
-
[KubeRay][Operator] Improve migration notes (#28672)
Improves legacy Ray operator -> KubeRay migration notes by - Fixing a formatting issue - Adding a note not to specify metadata.name for pod templates
Commits on Sep 21, 2022
-
[core] Support generators to allow tasks to return a dynamic number of objects (#28291)
This adds support for tasks that need to return a dynamic number of objects. When a remote generator function is invoked and num_returns for the task is 1, the worker will dynamically allocate ray.put IDs for these objects and store an ObjectRefGenerator as its return value. This allows the worker to choose how many objects to return and to keep heap memory low, since it does not need to keep all objects in memory simultaneously. Unlike normal ray.put(), we assign the task caller as the owner of the objects. This improves fault tolerance, as the owner can recover dynamically generated objects through the normal lineage reconstruction codepath. The main complication has to do with notifying the task caller that it owns these objects. We do this in two places, which is necessary because the protocols are asynchronous, so either message can arrive first: (1) when the task reply is received, and (2) when the primary raylet subscribes to the eviction notice from the owner. To register the dynamic return, the owner adds the ObjectRef to the ref counter and marks that it is contained in the generator object. Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu> Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com> Co-authored-by: Eric Liang <ekhliang@gmail.com>
-
[doc] Add documentation around trial/experiment checkpoint. (#28303)
As this has been a continual source of confusion for both OSS and product users. Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
-
[Train] Automatically set NCCL_SOCKET_IFNAME to use ethernet (#28633)
Automatically set NCCL_SOCKET_IFNAME to prefer ethernet. Also adds a FAQ section in the docs on how to diagnose this issue. Closes #26663. Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
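The mechanism amounts to defaulting an environment variable while respecting a user override; a minimal sketch (the `"ens,eth"` prefix list is an assumption for illustration, not necessarily the default Ray Train ships):

```python
import os

# Assumed interface-prefix list for illustration only.
DEFAULT_NCCL_SOCKET_IFNAME = "ens,eth"

def set_nccl_socket_ifname(env=os.environ):
    """Prefer ethernet interfaces for NCCL unless the user already
    chose an interface explicitly."""
    env.setdefault("NCCL_SOCKET_IFNAME", DEFAULT_NCCL_SOCKET_IFNAME)
```

`setdefault` is the important detail: an explicit user setting (e.g. an InfiniBand interface) is never overwritten.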
-
[Train] Immediately fail if application errors on any worker (#28314)
Resolves https://discuss.ray.io/t/torch-trainer-gets-stuck/7447. Previously with Ray Train, if one worker raised an Exception in the application code, the Exception was not raised on the driver until all workers had finished executing the training function. This could lead to hangs where other workers wait on a collective or torch.distributed.barrier() call, for example. With this PR, if any worker fails in the application code, the entire training job is immediately terminated and the exception is raised on the driver. Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
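The fail-fast behavior can be sketched with the standard library (threads standing in for training workers; Ray Train's actual implementation uses its own worker group, so this is illustrative only):

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_EXCEPTION

def run_workers(train_fn, num_workers):
    """Sketch: surface the first worker exception immediately instead
    of waiting for every worker (one of which may be stuck on a
    collective barrier) to finish."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(train_fn, rank) for rank in range(num_workers)]
        # Return as soon as any worker raises, not when all complete.
        done, not_done = wait(futures, return_when=FIRST_EXCEPTION)
        for fut in done:
            err = fut.exception()
            if err is not None:
                for f in not_done:
                    f.cancel()  # stop not-yet-started workers
                raise err       # surface the failure on the driver
        return [fut.result() for fut in done]
```

The `return_when=FIRST_EXCEPTION` flag is what replaces the old "wait for everyone, then report" behavior described above.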
-
[Serve] Replace `list_named_actors` in Serve tests with `list_actors` from new State API (#28543)

