LLM Compressor with Alauda AI
This document describes how to use the LLM Compressor integration with the Alauda AI platform to perform model compression workflows. The Alauda AI integration of LLM Compressor provides two example workflows:
- A workbench image and the data-free compressor notebook that demonstrate how to compress a model without a calibration dataset.
- A workbench image and the calibration compressor notebook that demonstrate how to compress a model using a calibration dataset.
Supported Model Compression Workflows
On the Alauda AI platform, you can use the Workbench feature to run LLM Compressor on models stored in your model repository. The following workflow outlines the typical steps for compressing a model.
Create a Workbench
Follow the instructions in Create Workbench to create a new Workbench instance. Note that model compression is currently supported only within JupyterLab.
Create a Model Repository and Upload Models
Refer to Upload Models Using Notebook for detailed steps on creating a model repository and uploading your model files. The example notebooks in this guide use the TinyLlama-1.1B-Chat-v1.0 model.
- Model to compress. You can modify this line if you want to use your own model.
- This recipe will quantize all Linear layers except those in the lm_head, which is often sensitive to quantization. The W4A16 scheme compresses weights to 4-bit integers while retaining 16-bit activations (see the sketch after this list).
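For orientation, a recipe like the one described above can be expressed with the llm-compressor API roughly as follows. This is a minimal sketch, assuming the QuantizationModifier class and the W4A16 preset scheme; the actual example notebooks may differ (the calibration notebook typically uses a calibration-based modifier such as GPTQModifier instead).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier

# Model to compress; replace with your own model path if needed.
MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Quantize all Linear layers to 4-bit weights with 16-bit activations,
# skipping lm_head, which is often sensitive to quantization.
recipe = QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])
```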
(Optional) Prepare and Upload a Dataset
If you plan to use the data-free compressor notebook, you can skip this step.
To use the calibration compressor notebook, you must prepare and upload a calibration dataset. Prepare your dataset using the same process described in Upload Models Using Notebook. The example calibration notebook uses the ultrachat_200k dataset.
- Create the calibration dataset using the Hugging Face datasets API. You can modify this line if you want to use your own dataset.
- Select the number of samples. 512 samples is a good starting point; increasing the number of samples can improve accuracy.
- Load the dataset.
- Shuffle and keep only the number of samples we need.
- Preprocess and tokenize into the format the model expects (see the sketch after this list).
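A sketch of these preparation steps, modeled on the upstream llm-compressor ultrachat example; the dataset name, sample count, and sequence length are assumptions and may differ from the notebook, and tokenizer is the model tokenizer loaded alongside the model.

```python
from datasets import load_dataset

NUM_CALIBRATION_SAMPLES = 512  # a good starting point; more samples can improve accuracy
MAX_SEQUENCE_LENGTH = 2048

# Load the dataset; swap in your own dataset here if needed.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

# Shuffle and keep only the number of samples we need.
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

# Preprocess: render the chat messages as text using the model's chat template.
def preprocess(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

ds = ds.map(preprocess)

# Tokenize into the format the model expects.
def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )

ds = ds.map(tokenize, remove_columns=ds.column_names)
```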
(Optional) Upload Dataset into S3 Storage
If you wish to upload datasets to S3, you can first install the boto3 library and then run the upload code in JupyterLab (see the sketch after the list below).
- You can modify this line if you want to use your own dataset.
- Configure multipart upload with 100 MB chunks and a maximum of 10 concurrent threads.
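A sketch of such an upload with boto3; the endpoint URL, credentials, bucket, and object names below are placeholders, not values from the notebook.

```python
import boto3
from boto3.s3.transfer import TransferConfig

# Placeholder connection details; replace with your S3/MinIO endpoint and credentials.
s3 = boto3.client(
    "s3",
    endpoint_url="http://minio.example.com:9000",
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)

# Multipart upload with 100 MB chunks and at most 10 concurrent threads.
config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,
    multipart_chunksize=100 * 1024 * 1024,
    max_concurrency=10,
)

# Replace the local file and object key with your own dataset.
s3.upload_file(
    Filename="ultrachat_200k.tar.gz",
    Bucket="datasets",
    Key="ultrachat_200k/ultrachat_200k.tar.gz",
    Config=config,
)
```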
(Optional) Use Dataset in S3 Storage
If you wish to use datasets stored in S3, you can first install the s3fs library and then modify the dataset loading section in the example, following the sketch after the list below.
- Set environment variables as a backup; some underlying components will read credentials from them.
- Define storage configuration; you must explicitly specify the endpoint_url to connect to MinIO.
- If the dataset is split, this is equivalent to split="train_sft" in the example.
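A sketch of the modified loading code; the bucket, file layout, and credentials are placeholders, and the exact data_files pattern depends on how your dataset was uploaded.

```python
import os
from datasets import load_dataset

# Backup environment variables; some underlying components read credentials from them.
os.environ["AWS_ACCESS_KEY_ID"] = "YOUR_ACCESS_KEY"
os.environ["AWS_SECRET_ACCESS_KEY"] = "YOUR_SECRET_KEY"

# Storage configuration; endpoint_url must be set explicitly to reach MinIO.
storage_options = {
    "key": os.environ["AWS_ACCESS_KEY_ID"],
    "secret": os.environ["AWS_SECRET_ACCESS_KEY"],
    "client_kwargs": {"endpoint_url": "http://minio.example.com:9000"},
}

# Load only the files belonging to the train_sft split; this is equivalent to
# split="train_sft" in the example.
ds = load_dataset(
    "parquet",
    data_files="s3://datasets/ultrachat_200k/data/train_sft-*.parquet",
    split="train",
    storage_options=storage_options,
)
```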
Clone Models and Datasets in JupyterLab
In the JupyterLab terminal, use git clone to download the model repository (and dataset, if applicable) to your workspace. The data-free compressor notebook does not require a dataset.
Create and Run Compression Notebooks
Download the appropriate example notebook for your use case: the calibration compressor notebook if you are using a dataset, or the data-free compressor notebook otherwise. Create a new notebook (for example, compressor.ipynb) in JupyterLab and paste the contents of the example notebook into it. Run the cells to perform model compression.
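For context, the compression cell in the example notebooks ultimately boils down to a oneshot call over the model and recipe. This is an illustrative sketch; argument names follow recent llm-compressor releases, where oneshot is importable from the top-level package, and the variables come from the earlier sketches.

```python
from llmcompressor import oneshot

# Calibration workflow: pass the prepared dataset. The data-free notebook
# would omit dataset, max_seq_length, and num_calibration_samples.
oneshot(
    model=model,
    recipe=recipe,
    dataset=ds,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)
```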
Upload the Compressed Model to the Repository
Once compression is complete, upload the compressed model back to the model repository using the steps outlined in Upload Models Using Notebook.
- Save the model and tokenizer. You can modify this line if you want to change the name of the output directory (see the sketch below).
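A minimal sketch of that save step, assuming the save_compressed flag that llm-compressor adds to save_pretrained; the output directory name is only an example.

```python
# Change SAVE_DIR to rename the output directory.
SAVE_DIR = "TinyLlama-1.1B-Chat-v1.0-W4A16"

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```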
Deploy and Use the Compressed Model for Inference
Quantized and sparse models that you create with LLM Compressor are saved using the compressed-tensors library (an extension of Safetensors).
The compression format matches the model's quantization or sparsity type. These formats are natively supported in vLLM, so the compressed model can be served with optimized inference kernels through the Alauda AI Inference Server.
Follow the instructions in Create Inference Service to complete this step.
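Optionally, you can sanity-check the compressed model with vLLM directly in the workbench before creating the inference service. This snippet is illustrative only, assumes a GPU is available, and points at the example output directory used above.

```python
from vllm import LLM, SamplingParams

# Load the compressed checkpoint produced by the save step.
llm = LLM(model="TinyLlama-1.1B-Chat-v1.0-W4A16")

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```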