LLM Compressor with Alauda AI

This document describes how to use the LLM Compressor integration with the Alauda AI platform to perform model compression workflows. The integration provides two example workflows: a data-free compression workflow (data-free-compressor.ipynb) and a calibration-based compression workflow (calibration-compressor.ipynb).


Supported Model Compression Workflows

On the Alauda AI platform, you can use the Workbench feature to run LLM Compressor on models stored in your model repository. The following workflow outlines the typical steps for compressing a model.

Create a Workbench

Follow the instructions in Create Workbench to create a new Workbench instance. Note that model compression is currently supported only within JupyterLab.

Create a Model Repository and Upload Models

Refer to Upload Models Using Notebook for detailed steps on creating a model repository and uploading your model files. The example notebooks in this guide use the TinyLlama-1.1B-Chat-v1.0 model.

data-free-compressor.ipynb
from llmcompressor.modifiers.quantization import QuantizationModifier

model_id = "./TinyLlama-1.1B-Chat-v1.0"
recipe = QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])
  1. Model to compress. You can modify this line if you want to use your own model.
  2. This recipe will quantize all Linear layers except those in the lm_head, which is often sensitive to quantization. The W4A16 scheme compresses weights to 4-bit integers while retaining 16-bit activations.
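
The recipe above is typically applied with llmcompressor's oneshot entry point. The following is a minimal, self-contained sketch of that flow; the AutoModelForCausalLM/AutoTokenizer loading shown here is an assumption about how the example notebook prepares the model, and in older llmcompressor releases oneshot is imported from llmcompressor.transformers instead.

from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model_id = "./TinyLlama-1.1B-Chat-v1.0"

# Load the model and tokenizer from the cloned repository.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Apply the data-free W4A16 recipe; no calibration dataset is required.
recipe = QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)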

(Optional) Prepare and Upload a Dataset

NOTE

If you plan to use the data-free compressor notebook, you can skip this step.

To use the calibration compressor notebook, you must prepare and upload a calibration dataset. Prepare your dataset using the same process described in Upload Models Using Notebook. The example calibration notebook uses the ultrachat_200k dataset.

calibration-compressor.ipynb
from datasets import load_dataset

dataset_id = "./ultrachat_200k"
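# Note: use_gpu and tokenizer are assumed to be defined in earlier cells of the notebook.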

num_calibration_samples = 512 if use_gpu else 4
max_sequence_length = 2048 if use_gpu else 16

ds = load_dataset(dataset_id, split="train_sft")
ds = ds.shuffle(seed=42).select(range(num_calibration_samples))

def preprocess(example): 
    text = tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False,
    )
    return tokenizer(
        text,
        padding=False,
        max_length=max_sequence_length,
        truncation=True,
        add_special_tokens=False,
    )

ds = ds.map(preprocess, remove_columns=ds.column_names)
  1. Create the calibration dataset using the Hugging Face datasets API. You can modify this line if you want to use your own dataset.
  2. Select the number of calibration samples. 512 samples is a good starting point; increasing the number of samples can improve accuracy.
  3. Load the dataset.
  4. Shuffle the dataset and select only the number of samples needed.
  5. Preprocess and tokenize the samples into the format the model expects.
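
Once the calibration samples are prepared, they are passed to the compression run. Below is a minimal sketch assuming a GPTQ-style W4A16 recipe applied through llmcompressor's oneshot entry point; the exact modifier and arguments used in the calibration compressor notebook may differ, and model is assumed to have been loaded in an earlier cell.

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# GPTQ uses the calibration samples to minimize layer-wise quantization error.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model=model,                                     # model loaded in an earlier cell
    dataset=ds,                                      # tokenized calibration samples from above
    recipe=recipe,
    max_seq_length=max_sequence_length,
    num_calibration_samples=num_calibration_samples,
)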

(Optional) Upload Dataset into S3 Storage

If you wish to upload datasets to S3, first install the boto3 library and then run the following code in JupyterLab.

~/.venv/bin/python -m pip install boto3 -i https://pypi.tuna.tsinghua.edu.cn/simple
import os
from boto3.s3.transfer import TransferConfig
import boto3

local_folder = "./ultrachat_200k"
bucket_name = "datasets"

config = TransferConfig(
    multipart_threshold=100*1024*1024,
    max_concurrency=10,
    multipart_chunksize=100*1024*1024,
    use_threads=True
)

# Create the S3 client. For MinIO or other S3-compatible storage, pass
# endpoint_url and credentials to boto3.client as required by your environment.
s3 = boto3.client("s3")

for root, dirs, files in os.walk(local_folder):
    for filename in files:
        local_path = os.path.join(root, filename)
        relative_path = os.path.relpath(local_path, local_folder)
        s3_key = f"ultrachat_200k/{relative_path.replace(os.sep, '/')}"
        s3.upload_file(local_path, bucket_name, s3_key, Config=config)
        print(f"Uploaded {local_path} -> {s3_key}")
  1. You can modify this line if you want to use your own dataset.
  2. Configure multipart upload with 100 MB chunks and a maximum of 10 concurrent threads.
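
As an optional check (not part of the example notebooks), you can list the uploaded objects with the same boto3 client to confirm the transfer completed:

# List a few of the uploaded objects under the dataset prefix.
response = s3.list_objects_v2(Bucket=bucket_name, Prefix="ultrachat_200k/", MaxKeys=10)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])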

(Optional) Use Dataset in S3 Storage

If you wish to use datasets stored in S3, first install the s3fs library and then modify the dataset loading section of the example as shown in the code below.

~/.venv/bin/python -m pip install s3fs -i https://pypi.tuna.tsinghua.edu.cn/simple
calibration-compressor.ipynb
import os
from datasets import load_dataset

os.environ["AWS_ACCESS_KEY_ID"] = "@7Apples@"
os.environ["AWS_SECRET_ACCESS_KEY"] = "07Apples@"

storage_options = {
  "key": "07Apples@",
  "secret": "O7Apples@",
  "client_kwargs": {
    "endpoint_url": "http://minio.minio-system.svc.cluster.local:80"
  }
}

ds = load_dataset(
      'parquet',
      data_files='s3://datasets/ultrachat_200k/data/train_sft-*.parquet', 
      storage_options=storage_options, 
      split="train"
)
  1. Set the credentials as environment variables as a fallback; some underlying components read them from the environment.
  2. Define the storage configuration; you must explicitly specify endpoint_url to connect to MinIO.
  3. Because only the train_sft parquet files are loaded here, split="train" is equivalent to split="train_sft" in the earlier example.

Clone Models and Datasets in JupyterLab

In the JupyterLab terminal, use git clone to download the model repository (and dataset, if applicable) to your workspace. The data-free compressor notebook does not require a dataset.

Create and Run Compression Notebooks

Download the appropriate example notebook for your use case: the calibration compressor notebook if you are using a dataset, or the data-free compressor notebook otherwise. Create a new notebook (for example, compressor.ipynb) in JupyterLab and paste the contents of the example notebook into it. Run the cells to perform model compression.

Upload the Compressed Model to the Repository

Once compression is complete, upload the compressed model back to the model repository using the steps outlined in Upload Models Using Notebook.

model_dir = "./" + model_id.split("/")[-1] + "-W4A16"
model.save_pretrained(model_dir)
tokenizer.save_pretrained(model_dir);
  1. Save the model and tokenizer. You can modify this line if you want to change the name of the output directory.

Deploy and Use the Compressed Model for Inference

Quantized and sparse models created with LLM Compressor are saved using the compressed-tensors library (an extension of Safetensors). The compression format matches the model's quantization or sparsity type. These formats are natively supported in vLLM, enabling fast inference through optimized kernels when the model is deployed with the Alauda AI Inference Server. Follow the instructions in Create Inference Service to complete this step.
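
Before creating the inference service, you can optionally sanity-check the compressed model locally with vLLM's Python API. This is a minimal sketch; the model path assumes the -W4A16 output directory produced in the previous step, and it is not a substitute for deploying through the Alauda AI Inference Server.

from vllm import LLM, SamplingParams

# vLLM detects the compressed-tensors format from the model's configuration.
llm = LLM(model="./TinyLlama-1.1B-Chat-v1.0-W4A16")
sampling_params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["What is model quantization?"], sampling_params)
print(outputs[0].outputs[0].text)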