Add SLURM scripts for OLMo SFT with resume support #1368
Conversation
This PR adds SLURM-compatible scripts for running OLMo SFT training on clusters without Beaker/cloud infrastructure.

Changes:

1. Resume checkpoint support for convert_sft_data_for_olmocore.py:
   - Add --resume flag to continue from the last checkpoint after an interruption
   - Add --checkpoint_interval to control checkpoint frequency (default 100k)
   - Checkpoints saved atomically to _checkpoint.json
   - Automatic cleanup on successful completion
   - Fixed shuffle seed (42) for reproducible resume
2. SLURM data preparation scripts:
   - prepare_dolci_think_data.sh: Tokenize Dolci-Think-SFT (~22B tokens)
   - prepare_dolci_instruct_data.sh: Tokenize Dolci-Instruct-SFT (~1.8B)
   - Both include resume support for time-limited queues
3. SLURM training scripts (requires an OLMo-core clone):
   - train_dolci_think.sh: Train Think SFT (lr=5e-5, ~24h on 8x H100)
   - train_dolci_instruct.sh: Train Instruct SFT (lr=8e-5, ~4h on 8x H100)
   - Hyperparameters from OLMo-3 paper Table 47 (arXiv 2512.13961)
   - Uses the OLMOCORE_PATH env var to locate the OLMo-core installation
4. Tests for checkpoint functionality (15 tests)

Usage:

```bash
# Data preparation
sbatch scripts/slurm/sft/prepare_dolci_think_data.sh

# Training (after cloning OLMo-core)
OLMOCORE_PATH=/path/to/OLMo-core \
DATASET_PATH=./data/dolci_think_sft_tokenized \
BASE_CKPT=/path/to/OLMo-3-7B \
sbatch scripts/slurm/sft/train_dolci_think.sh
```
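For readers skimming the resume logic, here is a minimal sketch of the atomic-save pattern the description refers to (write to a temp file, then rename); the helper names and checkpoint fields are illustrative assumptions, not the PR's actual code:

```python
import json
import os

def save_checkpoint(checkpoint_path: str, state: dict) -> None:
    """Atomically persist progress so an interrupted job can resume.

    Illustrative sketch only; the state fields are assumptions.
    """
    tmp_path = checkpoint_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    # os.replace is atomic on POSIX filesystems, so a job killed
    # mid-write never leaves a truncated _checkpoint.json behind.
    os.replace(tmp_path, checkpoint_path)

def load_checkpoint(checkpoint_path: str) -> dict | None:
    """Return the saved state, or None to start from scratch."""
    if not os.path.exists(checkpoint_path):
        return None
    with open(checkpoint_path) as f:
        return json.load(f)
```

A caller would save every --checkpoint_interval samples and delete the file on successful completion, matching the "automatic cleanup" behavior described above.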
gemini-code-assist left a comment:

Summary of Changes: Hello @ferreirafabio, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request makes OLMo SFT workflows usable on high-performance computing clusters by introducing SLURM support. It addresses a key challenge of long-duration data processing, preemptive job scheduling, by implementing a checkpoint and resume system for data tokenization, so that extensive data preparation tasks can complete reliably despite interruptions, and it provides standardized scripts for both data preparation and model training on SLURM.
Code Review
This pull request introduces SLURM scripts and checkpoint/resume functionality for SFT data conversion, which is a valuable addition for handling long-running jobs on clusters. The implementation is robust, featuring atomic checkpoint saves and a good set of unit tests. My review focuses on improving performance, maintainability, and test correctness. I've identified an opportunity to optimize the data processing loop when resuming from a checkpoint to avoid unnecessary iteration. I've also suggested refactoring the state management to reduce code duplication. Additionally, I've noted a few minor issues in the new test file, including an incorrect path in the docstring and incomplete test assertions. The new SLURM scripts and documentation are well-structured and clear. Overall, this is a great contribution that enhances the usability of the project on HPC systems.
Address code review feedback:
- Fix pytest path in test docstring (scripts/data -> open_instruct)
- Remove unnecessary sys.path manipulation from the test file
- Add missing assertions for per_dataset_tokens, per_dataset_trainable_tokens, and per_dataset_filtered in the roundtrip test
- Refactor state management to use a centralized dictionary
- Use dataset.select() for efficient resume (avoids iterating over skipped samples); see the sketch below
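A minimal sketch of that select()-based resume, assuming a Hugging Face datasets.Dataset and a checkpoint state dict whose samples_processed field is an illustrative name:

```python
from datasets import Dataset

def resume_slice(dataset: Dataset, state: dict) -> Dataset:
    """Skip already-processed rows without iterating over them."""
    start = state.get("samples_processed", 0)  # illustrative field name
    # select() builds an index mapping over the remaining rows instead
    # of looping through the first `start` examples one by one.
    return dataset.select(range(start, len(dataset)))
```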
Address reviewer feedback: make shuffle_seed configurable:
- Add --shuffle_seed argument with default=42 to ConvertSFTDataArguments
- Use args.shuffle_seed instead of the hardcoded seed=42 in the shuffle call
- Fix code formatting in test_checkpoint.py
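A sketch of what that looks like; the dataclass shape is an assumption built around the ConvertSFTDataArguments name from the PR:

```python
from dataclasses import dataclass

from datasets import Dataset

@dataclass
class ConvertSFTDataArguments:
    # Illustrative subset of the real arguments class.
    shuffle_seed: int = 42  # a stable default keeps resume reproducible

def shuffle_for_conversion(dataset: Dataset, args: ConvertSFTDataArguments) -> Dataset:
    # A seeded shuffle reproduces the same order on a --resume rerun,
    # so the sample index stored in the checkpoint stays valid.
    return dataset.shuffle(seed=args.shuffle_seed)
```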
@hamishivi thanks! Sure, quality checks should now pass, and I made the seed configurable via --shuffle_seed (default: 42).
hamishivi left a comment:
Looks good, could you add a changelog item?
I'm working on fixing/bypassing the GPU/unit tests; this is a problem in our CI, not your fault.
Done! Feel free to let me know if I can be of any help.
I am pushing a PR to fix this; it should be in later today. Thanks for your patience!
Ok, I think we're good to go. Just waiting on the GPU tests to pass, and then I'll merge it.
@hamishivi @finbarrtimbers not sure if anything is blocking on my end. I think we are good to merge and can close the PR? Thanks
Yes! Trying now. |
Squash-merged with the following commits (full descriptions above):

* Add SLURM scripts for OLMo SFT with resume support
* Address code review feedback
* Address reviewer feedback: make shuffle_seed configurable
* Add changelog entry for SLURM SFT scripts PR

Co-authored-by: Hamish Ivison <[email protected]>
Co-authored-by: Finbarr Timbers <[email protected]>
Hey,
This PR adds SLURM-compatible scripts for running OLMo SFT training on clusters without Beaker/cloud infrastructure, as discussed with @hamishivi in #1325.
Moreover, tokenizing large SFT datasets can take >12 hours (even with over 300 CPU processes), and many SLURM clusters enforce preemptive scheduling and limited job time (e.g., 6 hours). Without resume support, any interruption means starting over from scratch. This PR therefore also adds checkpoint/resume support to convert_sft_data_for_olmocore.py, so users can simply resubmit the same script and continue where they left off, as in the sketch below.
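A minimal sketch of that resubmission workflow; only --resume and --checkpoint_interval come from this PR, while the SBATCH settings, script path, and output flag are illustrative assumptions:

```bash
#!/bin/bash
#SBATCH --time=06:00:00        # typical preemptible queue limit
#SBATCH --cpus-per-task=32

# Re-running this exact script after an interruption picks up from
# _checkpoint.json; the output flag below is an illustrative assumption.
python scripts/data/convert_sft_data_for_olmocore.py \
    --resume \
    --checkpoint_interval 100000 \
    --output_dir ./data/dolci_think_sft_tokenized
```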
Changes:
- Resume checkpoint support for convert_sft_data_for_olmocore.py
- SLURM data preparation scripts:
  - prepare_dolci_think_data.sh: Tokenize Dolci-Think-SFT
  - prepare_dolci_instruct_data.sh: Tokenize Dolci-Instruct-SFT
- SLURM training scripts (requires an OLMo-core clone):
  - train_dolci_think.sh: Train Think SFT
  - train_dolci_instruct.sh: Train Instruct SFT
  - Use the OLMOCORE_PATH env var to locate the OLMo-core installation (see the sketch after this list)
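For illustration, this is roughly how a training script can consume OLMOCORE_PATH; the entry-point path and flag names are placeholders, not the scripts' actual launch command:

```bash
# Fail fast if the user forgot to point at an OLMo-core clone.
if [ -z "${OLMOCORE_PATH:-}" ]; then
    echo "Error: set OLMOCORE_PATH to your OLMo-core checkout" >&2
    exit 1
fi

# Make the clone importable, then launch training. The entry point
# and flags below are placeholders for illustration only.
export PYTHONPATH="${OLMOCORE_PATH}:${PYTHONPATH:-}"
srun python "${OLMOCORE_PATH}/train.py" \
    --dataset "${DATASET_PATH}" \
    --checkpoint "${BASE_CKPT}"
```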
Usage:

```bash
# Data preparation
sbatch scripts/slurm/sft/prepare_dolci_think_data.sh

# Training (after cloning OLMo-core)
OLMOCORE_PATH=/path/to/OLMo-core \
DATASET_PATH=./data/dolci_think_sft_tokenized \
BASE_CKPT=/path/to/OLMo-3-7B \
sbatch scripts/slurm/sft/train_dolci_think.sh
```
(Disclaimer: I created this PR with the help of Opus 4.5, e.g. for writing the scripts/slurm/sft/README).
I hope this can be helpful to users and maintainers of open-instruct. Please let me know if there are questions or requests for changes.