THIS PROJECT IS ARCHIVED

Intel will not provide or guarantee development of or support for this project, including but not limited to, maintenance, bug fixes, new releases or updates.
Patches to this project are no longer accepted by Intel.
If you have an ongoing need to use this project, are interested in independently developing it, or would like to maintain patches for the community, please create your own fork of the project.

Statistical Calibrated Activation Pruning (SCAP)

This repo contains the reference code for "Post-Training Statistical Calibration for Higher Activation Sparsity".

If you find our work useful in your research, please consider citing our paper:

@InProceedings{chua2024scap,
  title     = {Post-Training Statistical Calibration for Higher Activation Sparsity},
  author    = {Chua, Vui Seng and Pan, Yujie and Jain, Nilesh},
  booktitle = {Proceedings of The 4th NeurIPS Efficient Natural Language and Speech Processing Workshop},
  year      = {2024},
  volume    = {262},
  series    = {Proceedings of Machine Learning Research}
}

Abstract

We present Statistical Calibrated Activation Pruning (SCAP), a post-training activation pruning framework that (1) generalizes sparsification by input activations of Fully-Connected layers for generic and flexible application across Transformers, and (2) features a simple Mode-Centering technique to pre-calibrate activation distributions for maximizing post-training sparsity. Our results demonstrate robust Pareto efficiency compared to prior methods, translating to a 1.5× additional LLM decoding speedup against CATS at iso model quality. SCAP effectiveness is empirically verified across a wide range of models, including recent Transformer Decoders, MoE, Mamba2, Encoding Transformer, and pre-quantized models, highlighting its practicality and scalability.

Setup

Please follow the steps below.

# recommended python version: 3.10.13
python -m venv ./scap_env
source ./scap_env/bin/activate

# install torch
pip install torch==2.3.1 --index-url https://download.pytorch.org/whl/cu121

# install dependencies
pip install transformers==4.44.0 datasets==2.21.0 accelerate tqdm rich seaborn matplotlib wheel \
    git+https://github.com/EleutherAI/lm-evaluation-harness.git@906ef948dc8dbb4c84e1bb0f2861b1aba30ab533

# install gemv kernel
pip install triton "git+https://github.com/ScalingIntelligence/CATS.git@0bda7708b835f20c59f4dd59d3d32b0c5f2f6376#egg=flash_gemv&subdirectory=flash_gemv"
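
After installation, it can help to confirm that the pinned versions are the ones actually active in the virtual environment. The snippet below is a minimal sketch using only the standard library; the PINNED table and the find_mismatches helper are illustrative names, not part of this repo.

```python
from importlib import metadata

# Pins copied from the install commands above (illustrative subset).
PINNED = {"torch": "2.3.1", "transformers": "4.44.0", "datasets": "2.21.0"}


def find_mismatches(pinned, get_version=metadata.version):
    """Return {package: (expected, found)} for every pin that is not satisfied."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed at all
        # Ignore local build tags such as "+cu121" when comparing versions.
        if found is None or found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found}")
```

An empty result means every listed pin matches; any printed line names a package to reinstall.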

Reproducing the results

1. Run calibration

Compute the calibrated SCAP thresholds for each model and sparsity config.

bash scripts/01.calibration.bash

You can skip this calibration step, since the following model configs are already included in the repo.

Model ID	Config in the bash script	Up/gate sparsity	Down sparsity
meta-llama/Llama-2-7b-hf	up,zero,0.35,gate,zero,0.35,down,zero,0.55	35% without mode centering	55% without mode centering
mistralai/Mistral-7B-v0.1	up,zero,0.3,gate,zero,0.3,down,zero,0.7	30% without mode centering	70% without mode centering
mosaicml/mpt-7b	down,kde,0.5	/	50% with kde peak as mode
tiiuae/falcon-7b	down,median,0.5	/	50% with median as mode
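
Judging from the table, each config string appears to be a flat, comma-separated list of (layer, mode method, target sparsity) triples. The sketch below parses a string under that assumption; LayerSpec and parse_config are hypothetical names for illustration and are not the parser used by the scripts.

```python
from typing import NamedTuple


class LayerSpec(NamedTuple):
    layer: str       # e.g. "up", "gate" or "down"
    mode: str        # e.g. "zero", "median" or "kde"
    sparsity: float  # target fraction of pruned input activations


def parse_config(config: str) -> list[LayerSpec]:
    """Split 'layer,mode,sparsity,...' into one LayerSpec per triple."""
    fields = config.split(",")
    if len(fields) % 3 != 0:
        raise ValueError(f"expected triples, got {len(fields)} fields")
    specs = []
    for i in range(0, len(fields), 3):
        layer, mode, sparsity = fields[i:i + 3]
        value = float(sparsity)
        if not 0.0 <= value < 1.0:
            raise ValueError(f"sparsity out of range: {sparsity}")
        specs.append(LayerSpec(layer, mode, value))
    return specs
```

For example, parse_config("down,kde,0.5") yields a single spec for the down projection with the kde mode and a 50% target.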

The resulting calibrated_thresholds.json file in the results/scap/ folder lists the mode and threshold for each FFN layer specified in the config.
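
To make the relationship between mode, threshold, and target sparsity concrete, here is a minimal pure-Python sketch. It assumes one scalar mode per layer, picks the threshold as an empirical quantile of the distance to that mode, and restores the shift through the layer bias using the identity W x + b = W (x - c) + (b + c * sum(W)). The function names are illustrative, only the "zero" and "median" modes are covered, and calibrate.py may differ in its details.

```python
import statistics


def calibrate(samples, sparsity, mode="zero"):
    """Return (center, threshold) such that roughly `sparsity` of the
    samples satisfy |x - center| < threshold."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    if mode == "zero":
        center = 0.0
    elif mode == "median":
        center = statistics.median(samples)
    else:
        raise ValueError(f"unsupported mode in this sketch: {mode}")
    deviations = sorted(abs(x - center) for x in samples)
    # Empirical quantile: entries strictly below this value get pruned.
    threshold = deviations[int(sparsity * len(deviations))]
    return center, threshold


def prune(x, center, threshold):
    """Center the input, then zero every entry within `threshold` of the center."""
    return [0.0 if abs(v - center) < threshold else v - center for v in x]


def fold_center_into_bias(weight_row, bias, center):
    """Bias that compensates for feeding (x - center) instead of x."""
    return bias + center * sum(weight_row)
```

When the activation distribution peaks away from zero, centering on the mode reaches the same sparsity with a smaller threshold than pruning around zero, so the values being discarded are closer to the peak.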

2. Evaluation on zero-shot tasks

Evaluate on the zero-shot tasks listed in the paper: winogrande, piqa, sciq, hellaswag, boolq, arc_easy, and arc_challenge. Results are written to the results/scap/ folder.

bash scripts/02.evaluate_zero_shot_tasks.bash

The resulting evaluation_results.json file contains (1) the evaluation metrics for each task and (2) the average actual input sparsity of each layer.
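
The actual sparsity can differ from the calibration target because the thresholds are fixed on calibration data and then applied to different inputs. One way such a per-layer average could be tracked is sketched below; input_sparsity and RunningSparsity are hypothetical helpers, and the code says nothing about the layout of evaluation_results.json.

```python
def input_sparsity(activations):
    """Fraction of exactly-zero entries in one activation vector."""
    return sum(1 for v in activations if v == 0.0) / len(activations)


class RunningSparsity:
    """Running average of the per-call input sparsity of a single layer."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, activations):
        # Called once per forward pass with the already-pruned layer input.
        self.total += input_sparsity(activations)
        self.count += 1

    @property
    def average(self):
        return self.total / self.count if self.count else 0.0
```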

3. Inference with sparse kernel

This step runs actual inference with SCAP-optimized models using the sparse GEMV kernel.

bash scripts/03.inference_demo.bash
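
The speedup from input sparsity comes from skipping the weight columns that would be multiplied by zero. The following pure-Python sketch shows that idea only; it is not the Triton kernel from the CATS flash_gemv package, and the function names are illustrative.

```python
def dense_gemv(weight, x):
    """Reference y = W x over a row-major list-of-lists weight matrix."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def sparse_gemv(weight, x):
    """Compute y = W x, touching only columns whose input entry is non-zero.

    Returns the output vector and the number of columns actually read.
    """
    active = [j for j, v in enumerate(x) if v != 0.0]
    y = [0.0] * len(weight)
    for j in active:
        xj = x[j]
        for i, row in enumerate(weight):
            y[i] += row[j] * xj
    return y, len(active)
```

With 50% input sparsity, about half of the weight columns are never read, which matters because single-token decoding is typically bound by weight memory traffic.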

Acknowledgement

This work is built atop CATS, which we believe in turn extends DejaVu. Credits go to the original authors of these projects.
