Add SLURM scripts for OLMo SFT with resume support #1368
Conversation
This PR adds SLURM-compatible scripts for running OLMo SFT training on clusters without Beaker/cloud infrastructure.

Changes:

1. Resume checkpoint support for convert_sft_data_for_olmocore.py:
   - Add --resume flag to continue from the last checkpoint after an interruption
   - Add --checkpoint_interval to control checkpoint frequency (default 100k)
   - Checkpoints saved atomically to _checkpoint.json
   - Automatic cleanup on successful completion
   - Fixed shuffle seed (42) for reproducible resume
2. SLURM data preparation scripts:
   - prepare_dolci_think_data.sh: Tokenize Dolci-Think-SFT (~22B tokens)
   - prepare_dolci_instruct_data.sh: Tokenize Dolci-Instruct-SFT (~1.8B)
   - Both include resume support for time-limited queues
3. SLURM training scripts (requires an OLMo-core clone):
   - train_dolci_think.sh: Train Think SFT (lr=5e-5, ~24h on 8x H100)
   - train_dolci_instruct.sh: Train Instruct SFT (lr=8e-5, ~4h on 8x H100)
   - Hyperparameters from OLMo-3 paper Table 47 (arXiv 2512.13961)
   - Uses the OLMOCORE_PATH env var to locate the OLMo-core installation
4. Tests for checkpoint functionality (15 tests)

Usage:

```bash
# Data preparation
sbatch scripts/slurm/sft/prepare_dolci_think_data.sh

# Training (after cloning OLMo-core)
OLMOCORE_PATH=/path/to/OLMo-core \
DATASET_PATH=./data/dolci_think_sft_tokenized \
BASE_CKPT=/path/to/OLMo-3-7B \
sbatch scripts/slurm/sft/train_dolci_think.sh
```
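For readers skimming the resume logic, here is a minimal sketch of the atomic-save pattern the description refers to (write to a temp file, then rename); the helper names and checkpoint fields are illustrative assumptions, not the PR's actual code:

```python
import json
import os

def save_checkpoint(checkpoint_path: str, state: dict) -> None:
    """Atomically persist progress so an interrupted job can resume.

    Illustrative sketch only; the state fields are assumptions.
    """
    tmp_path = checkpoint_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    # os.replace is atomic on POSIX filesystems, so a job killed
    # mid-write never leaves a truncated _checkpoint.json behind.
    os.replace(tmp_path, checkpoint_path)

def load_checkpoint(checkpoint_path: str) -> dict | None:
    """Return the saved state, or None to start from scratch."""
    if not os.path.exists(checkpoint_path):
        return None
    with open(checkpoint_path) as f:
        return json.load(f)
```

A caller would save every --checkpoint_interval samples and delete the file on successful completion, matching the "automatic cleanup" behavior described above.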
gemini-code-assist left a comment:

Summary of Changes: Hello @ferreirafabio, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request makes OLMo SFT workflows usable on high-performance computing clusters by introducing SLURM support. It addresses a key challenge of long-duration data processing, preemptive job scheduling, by implementing a checkpoint and resume system for data tokenization, so that extensive data preparation tasks can complete reliably despite interruptions, and it provides standardized scripts for both data preparation and model training on SLURM.
Code Review
This pull request introduces SLURM scripts and checkpoint/resume functionality for SFT data conversion, which is a valuable addition for handling long-running jobs on clusters. The implementation is robust, featuring atomic checkpoint saves and a good set of unit tests. My review focuses on improving performance, maintainability, and test correctness. I've identified an opportunity to optimize the data processing loop when resuming from a checkpoint to avoid unnecessary iteration. I've also suggested refactoring the state management to reduce code duplication. Additionally, I've noted a few minor issues in the new test file, including an incorrect path in the docstring and incomplete test assertions. The new SLURM scripts and documentation are well-structured and clear. Overall, this is a great contribution that enhances the usability of the project on HPC systems.
Address code review feedback:
- Fix pytest path in test docstring (scripts/data -> open_instruct)
- Remove unnecessary sys.path manipulation from the test file
- Add missing assertions for per_dataset_tokens, per_dataset_trainable_tokens, and per_dataset_filtered in the roundtrip test
- Refactor state management to use a centralized dictionary
- Use dataset.select() for efficient resume (avoids iterating over skipped samples); see the sketch below
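A minimal sketch of that select()-based resume, assuming a Hugging Face datasets.Dataset and a checkpoint state dict whose samples_processed field is an illustrative name:

```python
from datasets import Dataset

def resume_slice(dataset: Dataset, state: dict) -> Dataset:
    """Skip already-processed rows without iterating over them."""
    start = state.get("samples_processed", 0)  # illustrative field name
    # select() builds an index mapping over the remaining rows instead
    # of looping through the first `start` examples one by one.
    return dataset.select(range(start, len(dataset)))
```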
Address reviewer feedback: make shuffle_seed configurable:
- Add --shuffle_seed argument with default=42 to ConvertSFTDataArguments
- Use args.shuffle_seed instead of the hardcoded seed=42 in the shuffle call
- Fix code formatting in test_checkpoint.py
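A sketch of what that looks like; the dataclass shape is an assumption built around the ConvertSFTDataArguments name from the PR:

```python
from dataclasses import dataclass

from datasets import Dataset

@dataclass
class ConvertSFTDataArguments:
    # Illustrative subset of the real arguments class.
    shuffle_seed: int = 42  # a stable default keeps resume reproducible

def shuffle_for_conversion(dataset: Dataset, args: ConvertSFTDataArguments) -> Dataset:
    # A seeded shuffle reproduces the same order on a --resume rerun,
    # so the sample index stored in the checkpoint stays valid.
    return dataset.shuffle(seed=args.shuffle_seed)
```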
@hamishivi thanks! Sure, quality checks should now pass, and I made the seed configurable via --shuffle_seed (default: 42).
hamishivi left a comment:
Looks good, could you add a changelog item?
I'm working on fixing/bypassing the GPU/unit tests; this is a problem in our CI, not your fault.
Done! Feel free to let me know if I can be of any help.
I am pushing a PR to fix this; it should be in later today. Thanks for your patience!
Ok, I think we're good to go. Just waiting on the GPU tests to pass, and then I'll merge it.
@hamishivi @finbarrtimbers not sure if anything is blocking on my end. I think we are good to merge and can close the PR? Thanks
Yes! Trying now. |
Squash-merged with the following commits (full descriptions above):

* Add SLURM scripts for OLMo SFT with resume support
* Address code review feedback
* Address reviewer feedback: make shuffle_seed configurable
* Add changelog entry for SLURM SFT scripts PR

Co-authored-by: Hamish Ivison <[email protected]>
Co-authored-by: Finbarr Timbers <[email protected]>
Hey,
This PR adds SLURM-compatible scripts for running OLMo SFT training on clusters without Beaker/cloud infrastructure, as discussed with @hamishivi in #1325.
Moreover, tokenizing large SFT datasets can take >12 hours (even with over 300 CPU processes), and many SLURM clusters enforce preemptive scheduling and limited job time (e.g., 6 hours). Without resume support, any interruption means starting over from scratch. This PR therefore also adds checkpoint/resume support to convert_sft_data_for_olmocore.py, so users can simply resubmit the same script and continue where they left off, as in the sketch below.
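A minimal sketch of that resubmission workflow; only --resume and --checkpoint_interval come from this PR, while the SBATCH settings, script path, and output flag are illustrative assumptions:

```bash
#!/bin/bash
#SBATCH --time=06:00:00        # typical preemptible queue limit
#SBATCH --cpus-per-task=32

# Re-running this exact script after an interruption picks up from
# _checkpoint.json; the output flag below is an illustrative assumption.
python scripts/data/convert_sft_data_for_olmocore.py \
    --resume \
    --checkpoint_interval 100000 \
    --output_dir ./data/dolci_think_sft_tokenized
```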
Changes:
- Resume checkpoint support for convert_sft_data_for_olmocore.py
- SLURM data preparation scripts:
  - prepare_dolci_think_data.sh: Tokenize Dolci-Think-SFT
  - prepare_dolci_instruct_data.sh: Tokenize Dolci-Instruct-SFT
- SLURM training scripts (requires an OLMo-core clone):
  - train_dolci_think.sh: Train Think SFT
  - train_dolci_instruct.sh: Train Instruct SFT
  - Use the OLMOCORE_PATH env var to locate the OLMo-core installation (see the sketch after this list)
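For illustration, this is roughly how a training script can consume OLMOCORE_PATH; the entry-point path and flag names are placeholders, not the scripts' actual launch command:

```bash
# Fail fast if the user forgot to point at an OLMo-core clone.
if [ -z "${OLMOCORE_PATH:-}" ]; then
    echo "Error: set OLMOCORE_PATH to your OLMo-core checkout" >&2
    exit 1
fi

# Make the clone importable, then launch training. The entry point
# and flags below are placeholders for illustration only.
export PYTHONPATH="${OLMOCORE_PATH}:${PYTHONPATH:-}"
srun python "${OLMOCORE_PATH}/train.py" \
    --dataset "${DATASET_PATH}" \
    --checkpoint "${BASE_CKPT}"
```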
Usage:

```bash
# Data preparation
sbatch scripts/slurm/sft/prepare_dolci_think_data.sh

# Training (after cloning OLMo-core)
OLMOCORE_PATH=/path/to/OLMo-core \
DATASET_PATH=./data/dolci_think_sft_tokenized \
BASE_CKPT=/path/to/OLMo-3-7B \
sbatch scripts/slurm/sft/train_dolci_think.sh
```
(Disclaimer: I created this PR with the help of Opus 4.5, e.g. for writing the scripts/slurm/sft/README).
I hope this can be helpful to users and maintainers of open-instruct. Please let me know if there are questions or requests for changes.