
Add SLURM scripts for OLMo SFT with resume support #1368

Merged

finbarrtimbers merged 11 commits into allenai:main from ferreirafabio:add-slurm-sft-scripts-dolci-resume on Jan 24, 2026

Conversation

@ferreirafabio (Contributor) commented Jan 15, 2026

Hey,

This PR adds SLURM-compatible scripts for running OLMo SFT training on clusters without Beaker/cloud infrastructure, as discussed with @hamishivi in #1325.

Moreover, tokenizing large SFT datasets can take more than 12 hours, even with over 300 CPU processes, and many SLURM clusters enforce preemptive scheduling and limited job times (e.g., 6 hours). Without resume support, any interruption means starting over from scratch. This PR therefore also adds checkpoint/resume support to convert_sft_data_for_olmocore.py, so users can simply resubmit the same script and continue where they left off.

Quick overview: files added

  scripts/slurm/sft/
  ├── README.md                        # Documentation
  ├── prepare_dolci_think_data.sh      # Data prep for Think (~22B tokens)
  ├── prepare_dolci_instruct_data.sh   # Data prep for Instruct (~1.8B tokens)
  ├── train_dolci_think.sh             # Training script (lr=5e-5)
  └── train_dolci_instruct.sh          # Training script (lr=8e-5)
  open_instruct/
  └── test_checkpoint.py               # 15 unit tests for checkpoint functions

  scripts/data/
  └── convert_sft_data_for_olmocore.py # Modified: added resume support

Changes:

  1. Resume checkpoint support for convert_sft_data_for_olmocore.py (a minimal sketch of the save/resume pattern follows this list):

    • Add --resume flag to continue from last checkpoint after interruption
    • Add --checkpoint_interval to control checkpoint frequency (default 100k)
    • Checkpoints saved atomically to _checkpoint.json
    • Automatic cleanup on successful completion
    • Fixed shuffle seed (42) for reproducible resume
  2. SLURM data preparation scripts:

    • prepare_dolci_think_data.sh: Tokenize Dolci-Think-SFT
    • prepare_dolci_instruct_data.sh: Tokenize Dolci-Instruct-SFT
    • Both include resume support for time-limited queues
  3. SLURM training scripts (require an OLMo-core clone):

    • train_dolci_think.sh: Train Think SFT (lr=5e-5, ~24h on 8x H100)
    • train_dolci_instruct.sh: Train Instruct SFT (lr=8e-5, ~4h on 8x H100)
    • Hyperparameters from the OLMo-3 paper, Table 47 (arXiv 2512.13961)
    • Uses OLMOCORE_PATH env var to locate OLMo-core installation
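
For readers unfamiliar with the pattern, here is a minimal sketch of the checkpoint/resume flow from item 1. The _checkpoint.json file name and the 100k default interval come from this PR; the function names and the "samples_processed" key are illustrative, not the script's actual API:

  # Illustrative sketch only; helper names and the state key are hypothetical.
  import json
  import os
  from pathlib import Path

  def save_checkpoint(output_dir: Path, state: dict) -> None:
      """Write to a temp file, then rename: os.replace is atomic on POSIX,
      so a preempted job never leaves a truncated _checkpoint.json behind."""
      tmp = output_dir / "_checkpoint.json.tmp"
      with open(tmp, "w") as f:
          json.dump(state, f)
          f.flush()
          os.fsync(f.fileno())  # ensure bytes reach disk before the swap
      os.replace(tmp, output_dir / "_checkpoint.json")

  def load_checkpoint(output_dir: Path) -> dict | None:
      """Return saved state, or None when there is nothing to resume from."""
      path = output_dir / "_checkpoint.json"
      return json.loads(path.read_text()) if path.exists() else None

  def process(dataset, output_dir: Path, checkpoint_interval: int = 100_000) -> None:
      state = load_checkpoint(output_dir) or {"samples_processed": 0}
      for i in range(state["samples_processed"], len(dataset)):
          ...  # tokenize and write sample i
          if (i + 1) % checkpoint_interval == 0:
              save_checkpoint(output_dir, {"samples_processed": i + 1})
      # automatic cleanup on successful completion
      (output_dir / "_checkpoint.json").unlink(missing_ok=True)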

Usage:

Data preparation

sbatch scripts/slurm/sft/prepare_dolci_think_data.sh

Training (after cloning OLMo-core)

OLMOCORE_PATH=/path/to/OLMo-core \
DATASET_PATH=./data/dolci_think_sft_tokenized \
BASE_CKPT=/path/to/OLMo-3-7B \
sbatch scripts/slurm/sft/train_dolci_think.sh

(Disclaimer: I created this PR with the help of Opus 4.5, e.g., for writing scripts/slurm/sft/README.md.)

I hope this is helpful to users and maintainers of open-instruct. Please let me know if you have any questions or requests for changes.

GPU_TESTS=01KFGX8Q53EJTWGNKRFAN6M0N5

@gemini-code-assist (Contributor) commented
Summary of Changes

Hello @ferreirafabio, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the usability of OLMo SFT workflows on high-performance computing clusters by introducing comprehensive SLURM support. It addresses critical challenges associated with long-duration data processing, such as preemptive job scheduling, through the implementation of a robust checkpoint and resume system for data tokenization. This ensures that extensive data preparation tasks can be reliably completed, even with interruptions, and provides standardized scripts for both data preparation and model training on SLURM.

Highlights

  • SLURM Integration: Introduced SLURM-compatible scripts for OLMo Supervised Fine-Tuning (SFT) data preparation and model training, enabling execution on cluster environments without Beaker/cloud infrastructure.
  • Resume Support for Data Preparation: Added robust checkpointing and resume functionality to convert_sft_data_for_olmocore.py, allowing long-running data tokenization jobs to recover from interruptions and continue processing.
  • New Data Preparation Scripts: Provided dedicated SLURM scripts (prepare_dolci_think_data.sh, prepare_dolci_instruct_data.sh) for tokenizing Dolci-Think-SFT and Dolci-Instruct-SFT datasets, both leveraging the new resume feature.
  • New Training Scripts: Included SLURM scripts (train_dolci_think.sh, train_dolci_instruct.sh) for training OLMo-3 7B models on the prepared Dolci SFT datasets, configured with hyperparameters from the OLMo-3 paper.
  • Checkpoint Functionality Tests: Added a new unit test file (open_instruct/test_checkpoint.py) to thoroughly validate the atomic save, load, and remove operations of the checkpoint mechanism.
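
As an illustration of the kind of roundtrip check such tests perform (a hypothetical sketch, not the actual contents of open_instruct/test_checkpoint.py):

  # Hypothetical pytest sketch; the PR's real test names and helpers may differ.
  import json
  import os

  def save_checkpoint(output_dir, state):
      tmp = output_dir / "_checkpoint.json.tmp"
      with open(tmp, "w") as f:
          json.dump(state, f)
      os.replace(tmp, output_dir / "_checkpoint.json")  # atomic swap

  def test_checkpoint_roundtrip(tmp_path):
      state = {"samples_processed": 100_000, "shuffle_seed": 42}
      save_checkpoint(tmp_path, state)
      # the state survives a save -> load roundtrip unchanged
      assert json.loads((tmp_path / "_checkpoint.json").read_text()) == state
      # the atomic rename leaves no temp file behind
      assert not (tmp_path / "_checkpoint.json.tmp").exists()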


@gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces SLURM scripts and checkpoint/resume functionality for SFT data conversion, which is a valuable addition for handling long-running jobs on clusters. The implementation is robust, featuring atomic checkpoint saves and a good set of unit tests. My review focuses on improving performance, maintainability, and test correctness. I've identified an opportunity to optimize the data processing loop when resuming from a checkpoint to avoid unnecessary iteration. I've also suggested refactoring the state management to reduce code duplication. Additionally, I've noted a few minor issues in the new test file, including an incorrect path in the docstring and incomplete test assertions. The new SLURM scripts and documentation are well-structured and clear. Overall, this is a great contribution that enhances the usability of the project on HPC systems.

Comment thread open_instruct/test_checkpoint.py Outdated
Comment thread open_instruct/test_checkpoint.py Outdated
Comment thread open_instruct/test_checkpoint.py
Comment thread scripts/data/convert_sft_data_for_olmocore.py Outdated
Comment thread scripts/data/convert_sft_data_for_olmocore.py Outdated
Address code review feedback:

- Fix pytest path in test docstring (scripts/data -> open_instruct)
- Remove unnecessary sys.path manipulation from test file
- Add missing assertions for per_dataset_tokens, per_dataset_trainable_tokens,
  and per_dataset_filtered in the roundtrip test
- Refactor state management to use a centralized dictionary
- Use dataset.select() for efficient resume, avoiding iteration over skipped samples (see the sketch below)
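
To illustrate the select()-based resume (a sketch that assumes the checkpoint records a count of already-processed samples; the key name is hypothetical):

  # Resuming with datasets.Dataset.select(); illustrative only.
  from datasets import Dataset

  ds = Dataset.from_dict({"text": [f"sample {i}" for i in range(10)]})
  start = 4  # e.g. checkpoint["samples_processed"], a hypothetical key

  # Rather than iterating every row and skipping the first `start` of them,
  # slice off the already-processed prefix up front:
  remaining = ds.select(range(start, len(ds)))
  for row in remaining:
      print(row["text"])  # stand-in for tokenize-and-write
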
@hamishivi (Collaborator) left a comment

Looks good, I just have one minor comment. Could you also make sure the quality check tests pass?

Comment thread scripts/data/convert_sft_data_for_olmocore.py Outdated
hamishivi and others added 2 commits January 17, 2026 13:13

Address reviewer feedback: make shuffle_seed configurable (see the sketch below):

- Add --shuffle_seed argument with default=42 to ConvertSFTDataArguments
- Use args.shuffle_seed instead of the hardcoded seed=42 in the shuffle call
- Fix code formatting in test_checkpoint.py
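
Why a fixed, configurable seed matters for resume: datasets' shuffle is deterministic for a given seed, so a resumed process reproduces the original permutation exactly. A minimal illustration:

  # Deterministic shuffle: two runs with the same seed agree, so a resumed
  # job sees the same sample order as the interrupted one. Illustrative only.
  from datasets import Dataset

  ds = Dataset.from_dict({"idx": list(range(1000))})
  run1 = ds.shuffle(seed=42)
  run2 = ds.shuffle(seed=42)  # e.g. a fresh process resuming later
  assert run1["idx"] == run2["idx"]
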
@ferreirafabio (Contributor, Author) commented

@hamishivi thanks! Sure, quality checks should now pass and I made the seed configurable via --shuffle_seed (default: 42).

@hamishivi (Collaborator) left a comment

Looks good, could you add a changelog item?
I'm working on fixing/bypassing the GPU/unit tests; this is a problem in our CI, not your fault.

@ferreirafabio (Contributor, Author) commented

> Looks good, could you add a changelog item? I'm working on fixing/bypassing the GPU/unit tests; this is a problem in our CI, not your fault.

Done! Feel free to let me know if I can be of any help.

@finbarrtimbers (Collaborator) commented

I am pushing a PR to fix this; should be in later today. Thanks for your patience!

@finbarrtimbers (Collaborator) commented

Ok, I think we're good to go. Just waiting on the GPU tests to pass and then I'll merge it.

@ferreirafabio (Contributor, Author) commented Jan 24, 2026

@hamishivi @finbarrtimbers not sure if anything is blocking on my end. I think we are good to merge and can close the PR? Thanks

@finbarrtimbers (Collaborator) commented

Yes! Trying now.

@finbarrtimbers added this pull request to the merge queue Jan 24, 2026
Merged via the queue into allenai:main with commit 97253fe Jan 24, 2026
7 checks passed
sang1583535 pushed a commit to sang1583535/open-instruct that referenced this pull request Feb 3, 2026
lukashelff pushed a commit to lukashelff/open-instruct-slurm that referenced this pull request Feb 19, 2026

Labels: none yet

3 participants