Intel will not provide or guarantee development of or support for this project, including but not limited to, maintenance, bug fixes, new releases or updates.
Patches to this project are no longer accepted by Intel.
If you have an ongoing need to use this project, are interested in independently developing it, or would like to maintain patches for the community, please create your own fork of the project.
This repo contains the reference code for "Post-Training Statistical Calibration for Higher Activation Sparsity".
If you find our work useful in your research, please consider citing our paper:

    @InProceedings{chua2024scap,
        title = {Post-Training Statistical Calibration for Higher Activation Sparsity},
        author = {Chua, Vui Seng and Pan, Yujie and Jain, Nilesh},
        booktitle = {Proceedings of The 4th NeurIPS Efficient Natural Language and Speech Processing Workshop},
        year = {2024},
        volume = {262},
        series = {Proceedings of Machine Learning Research}
    }

We present Statistical Calibrated Activation Pruning (SCAP), a post-training activation pruning framework that (1) generalizes sparsification by input activations of Fully-Connected layers for generic and flexible application across Transformers, and (2) features a simple Mode-Centering technique to pre-calibrate activation distributions for maximizing post-training sparsity. Our results demonstrate robust Pareto efficiency compared to prior methods, translating to a 1.5× additional LLM decoding speedup against CATS at iso model quality. SCAP effectiveness is empirically verified across a wide range of models, including recent Transformer Decoders, MoE, Mamba2, Encoding Transformer, and pre-quantized models, highlighting its practicality and scalability.
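
To make the two ideas above concrete, here is a minimal pure-Python sketch of one way to read them. It is not the repo's implementation. The function names (`calibrate`, `prune`, `sparse_gemv`, `fc_with_mode_centering`) are hypothetical, the threshold is simply an empirical quantile of the centered calibration samples, and only the `zero` and `median` mode estimates are covered (the KDE peak is left out).

```python
import statistics


def calibrate(samples, target_sparsity, mode_method="median"):
    """Estimate (mode, threshold) for one layer input from calibration samples.

    Assumption: the threshold is the empirical quantile of |x - mode| that
    prunes roughly `target_sparsity` of the calibration samples.
    """
    if mode_method == "zero":
        mode = 0.0  # no mode centering
    elif mode_method == "median":
        mode = statistics.median(samples)
    else:
        raise ValueError(f"unsupported mode method in this sketch: {mode_method}")
    deviations = sorted(abs(x - mode) for x in samples)
    k = min(int(target_sparsity * len(deviations)), len(deviations) - 1)
    return mode, deviations[k]


def prune(x, mode, threshold):
    """Center the input on the mode and zero out deviations below the threshold."""
    return [0.0 if abs(v - mode) < threshold else v - mode for v in x]


def sparse_gemv(weight, sparse_x):
    """Compute W @ sparse_x while skipping columns whose input is exactly zero."""
    y = [0.0] * len(weight)
    for j, xj in enumerate(sparse_x):
        if xj == 0.0:
            continue  # a pruned activation costs no multiply-adds
        for i, row in enumerate(weight):
            y[i] += row[j] * xj
    return y


def fc_with_mode_centering(weight, x, mode, threshold):
    """Approximate W @ x as W @ prune(x - mode) + W @ (mode * ones).

    The second term does not depend on x, so it could be precomputed once
    and folded into the layer bias.
    """
    offset = [mode * sum(row) for row in weight]
    y = sparse_gemv(weight, prune(x, mode, threshold))
    return [a + b for a, b in zip(y, offset)]


calibration_samples = [0.9, 1.0, 1.1, 1.0, 3.0, -1.0, 1.05, 0.95]
mode, threshold = calibrate(calibration_samples, target_sparsity=0.5)
pruned = prune(calibration_samples, mode, threshold)
achieved_sparsity = sum(1 for v in pruned if v == 0.0) / len(pruned)
```

In this toy data the activations cluster around 1.0 rather than 0, so thresholding around zero would prune almost nothing, while centering on the median first lets half of the entries be skipped by the sparse matrix-vector product.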
To set up the environment, follow the steps below.

    # recommended python version: 3.10.13
    python -m venv ./scap_env
    source ./scap_env/bin/activate
    # install torch
    pip install torch==2.3.1 --index-url https://download.pytorch.org/whl/cu121
    # install dependencies
    pip install transformers==4.44.0 datasets==2.21.0 accelerate tqdm rich seaborn matplotlib wheel \
        git+https://github.com/EleutherAI/lm-evaluation-harness.git@906ef948dc8dbb4c84e1bb0f2861b1aba30ab533
    # install gemv kernel
    pip install triton "git+https://github.com/ScalingIntelligence/CATS.git@0bda7708b835f20c59f4dd59d3d32b0c5f2f6376#egg=flash_gemv&subdirectory=flash_gemv"

Get the calibrated thresholds of SCAP for each model and sparsity config.

    bash scripts/01.calibration.bash

You can skip this calibration step for the following models, as their configs are already uploaded to the repo.

| Model ID | Config in the bash script | Up/gate sparsity | Down sparsity |
|---|---|---|---|
| meta-llama/Llama-2-7b-hf | up,zero,0.35,gate,zero,0.35,down,zero,0.55 | 35% without mode centering | 55% without mode centering |
| mistralai/Mistral-7B-v0.1 | up,zero,0.3,gate,zero,0.3,down,zero,0.7 | 30% without mode centering | 70% without mode centering |
| mosaicml/mpt-7b | down,kde,0.5 | / | 50% with kde peak as mode |
| tiiuae/falcon-7b | down,median,0.5 | / | 50% with median as mode |

The resulting calibrated_thresholds.json file in the results/scap/ folder lists the mode and threshold for each FFN layer specified in the config.
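
The config strings in the table above read as flat, comma-separated triples of layer name, mode estimate (`zero`, `median`, or `kde`), and target sparsity. The sketch below parses them under that assumption. `LayerConfig` and `parse_scap_config` are hypothetical names used for illustration, and the repo's own scripts may parse the string differently.

```python
from typing import NamedTuple


class LayerConfig(NamedTuple):
    layer: str              # e.g. "up", "gate", "down"
    mode_method: str        # e.g. "zero", "median", "kde"
    target_sparsity: float  # fraction in [0, 1]


def parse_scap_config(config: str) -> list[LayerConfig]:
    """Split a flat 'layer,mode,sparsity,...' string into one entry per layer."""
    fields = [field.strip() for field in config.split(",")]
    if len(fields) % 3 != 0:
        raise ValueError(f"expected triples of layer,mode,sparsity, got: {config!r}")
    return [
        LayerConfig(fields[i], fields[i + 1], float(fields[i + 2]))
        for i in range(0, len(fields), 3)
    ]


llama_config = parse_scap_config("up,zero,0.35,gate,zero,0.35,down,zero,0.55")
falcon_config = parse_scap_config("down,median,0.5")
```

Read this way, the Llama-2 row requests 35% sparsity on the up and gate projection inputs and 55% on the down projection input, all without mode centering, while the Falcon row prunes only the down projection input, using the median as the mode.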
Evaluate the zero-shot tasks listed in the paper, i.e., winogrande, piqa, sciq, hellaswag, boolq, arc_easy, arc_challenge.
Results are written to the results/scap/ folder.

    bash scripts/02.evaluate_zero_shot_tasks.bash

The resulting evaluation_results.json file contains: (1) the evaluation metrics for each task; (2) the average actual input sparsity of each layer.
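
As a rough illustration of what an averaged actual input sparsity per layer could mean, the sketch below takes the fraction of exactly-zero entries in each observed input vector and averages it per layer. The record format, layer names, and function names are assumptions made for this example. The repo may aggregate differently (for instance, weighting by token count), and the schema of evaluation_results.json is not reproduced here.

```python
from collections import defaultdict


def zero_fraction(values):
    """Fraction of entries that are exactly zero after pruning."""
    return sum(1 for v in values if v == 0.0) / len(values)


def average_sparsity_per_layer(records):
    """Average the zero fraction over all observed inputs of each layer.

    `records` is an iterable of (layer_name, input_vector) pairs.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for layer_name, vector in records:
        totals[layer_name] += zero_fraction(vector)
        counts[layer_name] += 1
    return {name: totals[name] / counts[name] for name in totals}


# Hypothetical layer names and already-pruned inputs.
observed = [
    ("layer0.down", [0.0, 0.0, 1.0, 2.0]),
    ("layer0.down", [0.0, 1.0, 1.0, 1.0]),
    ("layer1.down", [0.0, 0.0, 0.0, 3.0]),
]
sparsity = average_sparsity_per_layer(observed)
```

The measured value can differ from the target sparsity in the config, because thresholds are fixed from calibration data while the evaluation inputs follow their own distribution.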
We demonstrate actual inference of SCAP-optimized models with the sparse GEMV kernel.

    bash scripts/03.inference_demo.bash

This work is built atop CATS, which we believe in turn builds on DejaVu. Credits go to the original authors of these projects.