## LLM Compressor Workbench -- Getting Started

This notebook demonstrates how common [LLM Compressor](https://github.com/vllm-project/llm-compressor) flows can be run on Alauda AI.

We will show how a user can compress a large language model without any calibration data.

### Data-Free Model Compression

In [None]:
from llmcompressor.modifiers.quantization import QuantizationModifier

# model to compress
model_id = "./TinyLlama-1.1B-Chat-v1.0"

# This recipe will quantize all Linear layers except those in the `lm_head`,
#  which is often sensitive to quantization. The W4A16 scheme compresses
#  weights to 4-bit integers while retaining 16-bit activations.
recipe = QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])
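
To make the "4-bit integer weights" idea in the recipe comment concrete, here is a minimal pure-Python sketch of symmetric round-to-nearest quantization for one small group of weights. It is an illustration only, not LLM Compressor's implementation: the helper names `quantize_int4` and `dequantize` and the sample weights are hypothetical, and details of the real scheme such as how weights are grouped and packed are not modeled.

```python
# Illustrative sketch only: symmetric round-to-nearest 4-bit quantization
# of one group of weights. Helper names and sample values are hypothetical.

def quantize_int4(weights):
    """Map floats to integers in [-8, 7] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero group to avoid dividing by zero.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    # Round to the nearest integer step, then clamp to the signed 4-bit range.
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [v * scale for v in q]


weights = [0.7, -0.32, 0.12, 0.0]
q, scale = quantize_int4(weights)
restored = dequantize(q, scale)

# Per-weight rounding error; for unclamped values it is at most half a step.
errors = [abs(w - r) for w, r in zip(weights, restored)]

print("quantized:", q)
print("restored: ", restored)
print("max error:", max(errors))
```

The stored integers need only 4 bits each, while the scale stays in higher precision. Activations are untouched by this step, which is why no calibration data is needed.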

In [None]:
# Load the model and tokenizer using the Hugging Face Transformers API
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

In [None]:
# Run compression using `oneshot`
from llmcompressor import oneshot

model = oneshot(model=model, recipe=recipe, tokenizer=tokenizer)

In [None]:
# Save model and tokenizer
model_dir = "./" + model_id.split("/")[-1] + "-W4A16"
model.save_pretrained(model_dir)
tokenizer.save_pretrained(model_dir);
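
As a rough sanity check on the saved checkpoint size, the storage for the quantized weights can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope estimate under stated assumptions, not a measurement: the parameter count is a hypothetical round number, and the group size of 128 with one 16-bit scale per group is an assumption about the quantization layout. Layers left unquantized (such as `lm_head`) and file metadata are ignored.

```python
# Back-of-the-envelope storage estimate. All numeric inputs are assumptions.

def estimate_weight_bytes(num_params, weight_bits, group_size=None, scale_bits=16):
    """Approximate bytes for num_params weights, plus one scale per group."""
    total_bits = num_params * weight_bits
    if group_size is not None:
        num_groups = -(-num_params // group_size)  # ceiling division
        total_bits += num_groups * scale_bits
    return total_bits / 8


num_params = 1_000_000_000  # hypothetical count of quantized Linear weights

fp16_bytes = estimate_weight_bytes(num_params, weight_bits=16)
w4_bytes = estimate_weight_bytes(num_params, weight_bits=4, group_size=128)
ratio = fp16_bytes / w4_bytes

print(f"16-bit weights: {fp16_bytes / 1e9:.2f} GB")
print(f"4-bit weights:  {w4_bytes / 1e9:.2f} GB")
print(f"reduction:      {ratio:.2f}x")
```

The reduction comes out slightly below 4x because each group also stores a scale. Comparing this estimate against the actual size of `model_dir` on disk is a quick way to confirm that the compressed weights were saved.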