transformers#
What it is#
transformers is Hugging Face's flagship Python library for loading and running pre-trained neural networks: language models, vision models, speech models, and multimodal models. It provides a unified AutoModel / AutoTokenizer / pipeline API on top of PyTorch, TensorFlow, and JAX/Flax, with ONNX Runtime reachable through export and the companion optimum package.
The library is tightly coupled to the Hugging Face Hub — from_pretrained("model-id") downloads weights, tokenizer files, and config from a hub repo. The Hub now hosts well over a million model checkpoints.
Install#
pip install transformers
Output: installs the library but no ML backend — you still need PyTorch, TF, or JAX separately
pip install "transformers[torch]"
Output: installs transformers plus a compatible PyTorch wheel
pip install "transformers[torch]" accelerate
Output: the standard “modern LLM” combo — adds device_map="auto" and multi-GPU support
uv add transformers torch accelerate
Output: dependencies resolved + added to pyproject.toml
poetry add transformers torch
Output: updated lockfile + virtualenv install
Versioning & Python support#
- Current stable is the `4.x` series (and has been for the entire LLM era). Major bumps are rare; minor releases (`4.45`, `4.46`, …) ship roughly monthly and frequently add new model architectures.
- Python `3.9+` on current releases; `3.10+` recommended.
- Loose semver: minor releases may add new APIs and deprecate old ones, but rarely break existing model loading. Patch releases are pure bug fixes.
- The library lags slightly behind the latest models on the Hub for architecture support (a brand-new model often needs `trust_remote_code=True` until its class is upstreamed).
- Pinning matters: `transformers==4.X` matched with `torch>=Y` is the typical compat matrix; the Hugging Face release notes call out the floors.
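Because pinning only helps if you know what is actually installed, a small check like the one below can run in CI or at service start-up. It is a minimal sketch using only the standard library; the helper name `installed_versions` and the default package list are illustrative choices, not part of transformers.

```python
from importlib import metadata


def installed_versions(packages=("transformers", "torch", "accelerate")):
    """Return {package: version or None} for the given distribution names."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None  # not installed in this environment
    return found


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Comparing the returned versions against the pins in your lockfile is left to the caller, since the acceptable ranges differ per project.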
Package metadata#
- Maintainer: Hugging Face (the `huggingface` GitHub org)
- Project home: github.com/huggingface/transformers
- Docs: huggingface.co/docs/transformers
- PyPI: pypi.org/project/transformers
- License: Apache-2.0
- Governance: commercial company + huge open-source contributor base
- First released: 2018 (originally `pytorch-pretrained-bert`)
- Downloads: tens of millions per month
Optional dependencies & extras#
transformers defines many [extra] groups. The most relevant:
| Extra | Pulls in |
|---|---|
| `transformers[torch]` | PyTorch wheel matched to the library's tested floor |
| `transformers[tf]` | TensorFlow 2.x |
| `transformers[flax]` | JAX + Flax |
| `transformers[sentencepiece]` | The sentencepiece tokenizer (required for LLaMA, T5, Mistral, etc.) |
| `transformers[tokenizers]` | The Rust-backed tokenizers library (pulled in by default in most paths) |
| `transformers[onnxruntime]` | ONNX export + inference with onnxruntime |
| `transformers[serving]` | Adds FastAPI for transformers serve |
| `transformers[vision]` | Pillow + image-processing deps for vision models |
| `transformers[audio]` | librosa, soundfile for speech models |
| `transformers[all]` | Everything above (very large install) |
Companion packages from the same org:
- `accelerate`: multi-GPU / mixed-precision / `device_map="auto"`
- `peft`: parameter-efficient fine-tuning (LoRA, QLoRA, adapters)
- `bitsandbytes`: 8-bit and 4-bit quantisation
- `datasets`: Hugging Face dataset loading and streaming
- `safetensors`: fast, memory-safe checkpoint format (now default)
- `huggingface-hub`: Hub client, used by `from_pretrained`
Alternatives#
| Package | Trade-off |
|---|---|
| `vllm` | Production-grade LLM inference server with far higher throughput than raw `model.generate()`. Use when serving at scale. |
| `text-generation-inference` (TGI) | Hugging Face's own production serving stack. |
| `onnxruntime` | Run exported ONNX models with no Python ML framework. Smaller deploy footprint. |
| `tensorflow-hub` | TF-native pre-trained model hub. Mostly superseded by Hugging Face Hub today. |
| `mlx` (Apple Silicon) | Native Apple Silicon inference; some HF models mirrored. Use for Mac-local LLMs. |
| `llama-cpp-python` | GGUF-quantised CPU/GPU inference. Use when you need llama.cpp's quant formats. |
Common gotchas#
- Model card vs Hub repo vs Inference API are different things. The model "card" is the README; the Hub repo holds weights; the Inference API is a separate hosted service. `from_pretrained` only touches the Hub repo.
- `trust_remote_code=True` is code execution. Many cutting-edge models ship custom modeling code that loads via this flag, and it runs arbitrary Python from the Hub. Only enable it for repos you trust, ideally pinned to a revision SHA.
- `device_map="auto"` needs `accelerate`. Passing it without `accelerate` installed raises an error; omitting it loads the model on CPU, and you wonder why inference is glacial.
- FlashAttention is opt-in. Pass `attn_implementation="flash_attention_2"` to `from_pretrained` and install `flash-attn` separately; it's not in any extra.
- Tokenizer mismatch with sentencepiece-based models. Loading LLaMA / Mistral / T5 without `transformers[sentencepiece]` raises a cryptic `ImportError`. Install the extra.
- Big models OOM silently on Windows. `device_map="auto"` will happily spill to disk via `accelerate` on Linux, but pagefile semantics on Windows hit hard. Use Linux / WSL for >7B models.
- `pipeline()` is slow per-call. It re-runs framework overhead each invocation. Build the model + tokenizer manually and batch inputs for throughput.
- Hub auth required for gated models. LLaMA, Gemma, and others need `huggingface-cli login` and an approved license click on the website before `from_pretrained` works.
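Two of these gotchas can be caught before any weights are loaded. The sketch below builds `from_pretrained` keyword arguments defensively: it requests `device_map="auto"` only when `accelerate` is importable, and FlashAttention only when the `flash_attn` module is importable. The helper name `safe_load_kwargs` is hypothetical; the keyword names are the ones discussed above.

```python
from importlib.util import find_spec


def safe_load_kwargs(want_flash_attention=True):
    """Build from_pretrained kwargs that match what is actually installed."""
    kwargs = {}
    if find_spec("accelerate") is not None:
        kwargs["device_map"] = "auto"  # only valid when accelerate is installed
    if want_flash_attention and find_spec("flash_attn") is not None:
        kwargs["attn_implementation"] = "flash_attention_2"
    else:
        kwargs["attn_implementation"] = "sdpa"  # PyTorch built-in fallback
    return kwargs


if __name__ == "__main__":
    print(safe_load_kwargs())
```

A caller would then write something like `AutoModelForCausalLM.from_pretrained(model_id, **safe_load_kwargs())`. This only checks importability; it does not verify that the GPU supports FlashAttention.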
Ecosystem integrations#
transformers is the trunk of a wider Hugging Face ecosystem. The companion packages overlap less than their names suggest — each owns a slice of the lifecycle.
| Package | What it owns |
|---|---|
| `accelerate` | Multi-GPU placement, mixed-precision, `device_map="auto"`, distributed launch (`accelerate launch …`). |
| `peft` | Parameter-efficient fine-tuning: LoRA, QLoRA, prefix tuning, IA3, adapters. |
| `bitsandbytes` | 8-bit and 4-bit quantisation kernels for CUDA. Powers `BitsAndBytesConfig`. |
| `optimum` | Hardware-specific backends: ONNX Runtime, OpenVINO, TensorRT, Intel Habana, AWS Neuron. |
| `safetensors` | Memory-mapped, pickle-free checkpoint format. Default for new uploads. |
| `datasets` | Streaming / mapped tabular dataset library. The standard pre-Trainer data layer. |
| `tokenizers` | Rust-backed fast tokenizers; `AutoTokenizer` instantiates these by default. |
| `huggingface-hub` | Hub client. `from_pretrained` calls it transitively; CLI tools (`huggingface-cli`) come from here. |
| `evaluate` | Standard metrics (accuracy, f1, rouge, bleu, bertscore). Compatible with `Trainer.compute_metrics`. |
| `trl` | Reinforcement learning from human feedback: `SFTTrainer`, `DPOTrainer`, `PPOTrainer`. |
| `transformers.js` | Browser-side inference (different package; same model files). |
Inference-server siblings (not strict integrations but the same model files):
- `vllm`: high-throughput LLM serving with continuous batching and paged attention.
- `text-generation-inference` (TGI): Hugging Face's own production server.
- `llama-cpp-python`: GGUF-quantised inference on CPU + small GPUs.
Most production setups mix these — fine-tune with transformers/peft, serve with vLLM, observe with langsmith or OpenTelemetry.
Real-world recipes#
Recipe: chat-template inference server#
A FastAPI service that wraps a chat model with proper chat templating and token streaming.
from threading import Thread

import torch
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="sdpa",  # safe default; flash_attention_2 if installed
)
tok.pad_token = tok.eos_token  # reuse EOS as the padding token

app = FastAPI()

@app.post("/chat")
def chat(payload: dict):
    # Render the message list with the model's own chat template.
    prompt = tok.apply_chat_template(payload["messages"], tokenize=False, add_generation_prompt=True)
    # The rendered template already carries its special tokens, so don't add them twice.
    inputs = tok(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
    streamer = TextIteratorStreamer(tok, skip_prompt=True, skip_special_tokens=True)
    # generate() blocks, so run it in a background thread and stream from the iterator.
    Thread(target=model.generate, kwargs=dict(
        **inputs, streamer=streamer, max_new_tokens=512, do_sample=True, temperature=0.7,
    )).start()
    return StreamingResponse((t for t in streamer), media_type="text/plain")
Output: clients receive tokens incrementally; one worker per GPU is the right shape.
Recipe: LoRA fine-tune + merge for serving#
After training a small adapter with PEFT, merge it back into the base model and save a standalone checkpoint for serving:
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "meta-llama/Llama-3.1-8B-Instruct"

# Load the base model in bf16, then attach the trained LoRA adapter.
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto")
adapter = PeftModel.from_pretrained(base, "./lora-out/final")

# Fold the adapter weights into the base weights and drop the PEFT wrappers.
merged = adapter.merge_and_unload()
merged.save_pretrained("./llama3-merged", safe_serialization=True)

# Save the tokenizer into the same directory so loaders find everything in one place.
AutoTokenizer.from_pretrained(BASE).save_pretrained("./llama3-merged")
Output: ./llama3-merged is a complete fine-tuned checkpoint ready for vLLM / TGI / any HF-compatible loader.
Recipe: streaming embedding pipeline over a parquet corpus#
import pyarrow as pa
import pyarrow.parquet as pq
import torch
import torch.nn.functional as F
from datasets import load_dataset
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("BAAI/bge-small-en-v1.5")
model = AutoModel.from_pretrained("BAAI/bge-small-en-v1.5").to("cuda").eval()

@torch.no_grad()
def embed(batch):
    enc = tok(batch["text"], padding=True, truncation=True, return_tensors="pt", max_length=512).to("cuda")
    h = model(**enc).last_hidden_state[:, 0]  # CLS pooling for BGE
    h = F.normalize(h, p=2, dim=1)
    return {"embedding": h.cpu().tolist()}

ds = load_dataset("parquet", data_files="docs-*.parquet", streaming=True)["train"]
out = ds.map(embed, batched=True, batch_size=64)
# Streamed datasets are lazy iterables, so write results incrementally with pyarrow.
writer = None
for rows in out.iter(batch_size=1024):
    table = pa.Table.from_pydict(rows)
    if writer is None:
        writer = pq.ParquetWriter("docs-embedded.parquet", table.schema)
    writer.write_table(table)
if writer is not None:
    writer.close()
Output: memory-bounded embedding pipeline that scales to corpora larger than RAM.
Recipe: zero-shot classification ladder#
Triage incoming support tickets cheaply with a zero-shot model, escalate to an LLM only if confidence is low.
from transformers import pipeline

zsc = pipeline("zero-shot-classification", model="MoritzLaurer/deberta-v3-base-zeroshot-v2.0", device=0)
CATEGORIES = ["billing", "bug report", "feature request", "account access", "other"]

def route(ticket: str):
    res = zsc(ticket, candidate_labels=CATEGORIES, multi_label=False)
    # Labels come back sorted by score, so index 0 is the top prediction.
    label, score = res["labels"][0], res["scores"][0]
    return label if score >= 0.75 else "escalate-to-llm"
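Output: roughly 10× cheaper than running every ticket through an LLM, depending on the escalation rate; tune the 0.75 threshold on labelled tickets, since zero-shot scores are not guaranteed to be calibrated.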
Recipe: offline-first deployment#
Pre-bake the model into the container; lock down Hub access at runtime:
FROM python:3.12-slim
RUN pip install --no-cache-dir transformers torch accelerate
RUN python -c "from transformers import AutoModel, AutoTokenizer; \
AutoModel.from_pretrained('intfloat/e5-small-v2'); \
AutoTokenizer.from_pretrained('intfloat/e5-small-v2')"
ENV HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1
COPY . /app
CMD ["python", "/app/serve.py"]
Output: model weights baked into the image; runtime has no Hub dependency.
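A start-up guard in the serving script can confirm that the offline switches from the Dockerfile are actually in effect before any model is loaded. This is a sketch: `assert_offline` is a hypothetical helper, and it only accepts the literal value `1` used above, although the libraries may accept other truthy spellings.

```python
import os


def assert_offline(env=os.environ):
    """Raise if either Hub offline switch is missing or not set to '1'."""
    missing = [
        name
        for name in ("HF_HUB_OFFLINE", "TRANSFORMERS_OFFLINE")
        if env.get(name, "0") != "1"
    ]
    if missing:
        raise RuntimeError(f"offline mode not enforced, check: {', '.join(missing)}")


if __name__ == "__main__":
    try:
        assert_offline()
        print("offline mode enforced")
    except RuntimeError as exc:
        print(exc)
```

Calling the guard at the top of `serve.py` turns a silent Hub download in production into an immediate, visible failure.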
Performance tuning#
The default model.generate(...) call leaves a lot of throughput on the table. The biggest wins, in rough order of impact:
- Batch on the server, not the client. For inference servers, batch concurrent requests inside one `generate` call when possible. Even mismatched-length prompts benefit from padded batching.
- Attention implementation. `attn_implementation="sdpa"` (PyTorch's scaled-dot-product attention) is a safe default on modern PyTorch. `flash_attention_2` is faster and lower-memory but needs `pip install flash-attn` and supported GPUs (Ampere+).
- KV cache is on by default for `generate`. Don't disable it; check that `model.config.use_cache` is `True`.
- `torch.compile` for stable shapes. `model = torch.compile(model, mode="reduce-overhead")` improves throughput materially on Ampere+ GPUs once the graph stabilises. Compile time is heavy, so it is only worth it for long-running servers.
- Quantisation. 4-bit NF4 (`BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")`) cuts weight memory ~4× relative to 16-bit weights, usually with a small quality loss that you should measure on your own evaluation set. Pre-quantised GPTQ / AWQ checkpoints load faster at inference time.
- Mixed precision. `torch_dtype=torch.bfloat16` on Ampere+; `float16` on older cards. Saves 2× memory vs `float32`.
- Gradient checkpointing for training. Trade recompute for memory: `TrainingArguments(gradient_checkpointing=True)` lets larger batches/sequences fit.
- Use `optim="adamw_bnb_8bit"` for training. 8-bit optimiser states take roughly a quarter of the memory of 32-bit AdamW states.
- DataLoader workers. For training, `num_workers=4` (or higher) + `pin_memory=True` removes CPU bottlenecks.
- `TORCH_LOGS=recompiles` while developing. It catches silent recompilation pitfalls under `torch.compile`.
- Multi-GPU. `device_map="auto"` splits across GPUs naively (layer by layer, so only one GPU computes at a time). Use `accelerate launch --multi_gpu` + Trainer for proper data-parallel training; use FSDP or DeepSpeed ZeRO for sharded training when the model does not fit on one GPU.
- CPU inference. `optimum-intel` + OpenVINO accelerates BERT-class encoders by 3-5× on Xeon CPUs. ONNX Runtime is a portable alternative.
For production inference at scale, hand off to vLLM or TGI. `model.generate()` was never optimised for serving throughput; those servers add continuous batching, paged attention, and tensor parallelism that you'd otherwise reimplement.
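The memory effect of the precision and quantisation choices in this section is simple arithmetic on the parameter count. The sketch below estimates weight memory only; it ignores activations, the KV cache, and quantisation metadata, and `weight_memory_gib` is an illustrative helper rather than a library function.

```python
# Bytes needed to store one parameter in each format.
BYTES_PER_PARAM = {"float32": 4.0, "bfloat16": 2.0, "float16": 2.0, "int8": 1.0, "nf4": 0.5}


def weight_memory_gib(n_params, dtype):
    """Approximate weight memory in GiB for n_params parameters stored as dtype."""
    return n_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    for dtype in ("float32", "bfloat16", "nf4"):
        print(f"8B params in {dtype}: {weight_memory_gib(8e9, dtype):.1f} GiB")
```

For an 8B-parameter model this works out to about 29.8 GiB in `float32`, 14.9 GiB in `bfloat16`, and 3.7 GiB in 4-bit NF4, which is why the 16-bit and 4-bit options decide whether a model fits on a single card.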
Version migration guide#
The 4.x line has lasted the entire modern LLM era. Within 4.x, churn comes from monthly minor releases that frequently add models and occasionally tighten APIs.
| Version (roughly) | What tends to break |
|---|---|
| `<4.30` | Pre-safetensors default. Many tutorials still call `.bin` loading explicitly. |
| `~4.35` | `attn_implementation` arg standardised. Older code passes attention via separate flags. |
| `~4.40` | `device_map="auto"` paths assume `accelerate` is installed (it isn't always). |
| `~4.45` | Chat template defaults tightened; some older instruct models require `apply_chat_template` explicitly. |
| Latest | Steady drumbeat of new model architectures; `trust_remote_code=True` is the bridge until the class is upstreamed. |
Migration discipline:
- Pin both library and torch. `transformers==4.X` + `torch==Y.Z` are the supported pair listed in release notes.
- Re-test fine-tunes after upgrades. New defaults (loss-fns, gradient clipping, LR scheduler) occasionally affect downstream metrics.
- `from_pretrained(...)` arg deprecations are noisy but generally non-breaking for a release or two. Address before they go silent.
- Tokenizer special-token handling has tightened over time. Pin the tokenizer JSON when reproducibility matters.
When in doubt, the Hugging Face Transformers release notes on GitHub list the behavioural changes in each release; search there rather than guessing.
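One way to address deprecations before they disappear is to make the test suite fail on them. The sketch below uses only the standard `warnings` module and assumes the deprecations you care about are emitted as `FutureWarning` or `DeprecationWarning`; messages that go through a logger instead will not be caught. `strict_deprecations` is a hypothetical helper name.

```python
import warnings
from contextlib import contextmanager


@contextmanager
def strict_deprecations():
    """Turn FutureWarning and DeprecationWarning into exceptions inside the block."""
    with warnings.catch_warnings():
        warnings.simplefilter("error", FutureWarning)
        warnings.simplefilter("error", DeprecationWarning)
        yield


if __name__ == "__main__":
    try:
        with strict_deprecations():
            warnings.warn("`old_arg` is deprecated", FutureWarning)
    except FutureWarning as exc:
        print(f"caught: {exc}")
```

Wrapping model loading and a short generation call in `with strict_deprecations():` inside an upgrade test surfaces deprecated arguments as soon as the new version is installed.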
Security considerations#
transformers runs arbitrary model weights, occasionally arbitrary model code, and pulls files from the public Hub. The threat surface deserves attention.
- `trust_remote_code=True` is code execution. Many cutting-edge models ship custom modeling code that loads via this flag. Audit the repo and pin a revision SHA (`revision="abc123"`) before enabling. Treat unknown repos like unknown PyPI packages.
- Pickle in `.bin` checkpoints. Pre-safetensors `.bin` files are Python pickles, so loading one can run arbitrary code. Prefer `safetensors`; passing `use_safetensors=True` to `from_pretrained` makes the load fail rather than fall back to a pickle file.
- Hub supply chain. Anyone can publish to the Hub. Pin `revision=<commit-sha>` for production use, not branch names.
- Tokenizer files. `tokenizer.json` is data, not code, but malicious vocab can produce token IDs that confuse downstream code. Defence-in-depth: validate vocab size against `model.config.vocab_size`.
- Gated model licenses. LLaMA, Gemma, some Mistral models, and others have license obligations beyond Apache-2.0. Track which models touch your build pipeline.
- PII in prompts and outputs. Local inference avoids sending data to a third party, but logs, traces, and dataset caches will still capture it. Apply the same redaction policy you'd use for hosted APIs.
- Adversarial prompts. Local models don't have the abuse mitigations a hosted service does. If users supply prompts directly, add a content classifier upstream.
- Network egress at load time. `from_pretrained` hits the Hub. Use `HF_HUB_OFFLINE=1` and `TRANSFORMERS_OFFLINE=1` + a pre-baked cache in air-gapped environments.
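The revision-pinning advice can be enforced mechanically before a model id ever reaches `from_pretrained`. The sketch below treats only a full 40-character hexadecimal commit SHA as pinned and rejects branch names and abbreviated hashes; that strictness is an assumption of this example, and `is_pinned_revision` is a hypothetical helper.

```python
import re

# A full git commit hash: exactly 40 lowercase hex characters.
_FULL_SHA = re.compile(r"[0-9a-f]{40}")


def is_pinned_revision(revision):
    """True only for a full 40-character lowercase hex commit SHA."""
    return isinstance(revision, str) and _FULL_SHA.fullmatch(revision) is not None


if __name__ == "__main__":
    for candidate in ("main", "abc123", "0123456789abcdef0123456789abcdef01234567"):
        print(candidate, is_pinned_revision(candidate))
```

A deployment script can refuse to start when the configured revision fails this check, which keeps a moving branch name from silently changing the weights or the remote code.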
Troubleshooting common errors#
- `OSError: Can't load tokenizer for ...`: usually a missing extra. `pip install sentencepiece` (LLaMA/T5/Mistral) or `tiktoken` (some OpenAI-derived tokenizers).
- `ImportError: This requires you to install the latest version of bitsandbytes`: `pip install -U bitsandbytes` and ensure CUDA matches your PyTorch build.
- `device_map="auto"` errors on CPU-only machines. Install `accelerate` even on CPU; without it the device-mapping logic short-circuits to errors.
- `CUDA out of memory`. Try `low_cpu_mem_usage=True`, quantisation (4-bit), gradient checkpointing, smaller batch, shorter sequences, or `device_map="auto"` with offloading.
- Generated text is gibberish for instruct models. You forgot `apply_chat_template`. Decoder-only chat models trained with template tokens produce nonsense if you skip them.
- `pad_token` error during batched generation. Set `tokenizer.pad_token = tokenizer.eos_token` and `model.config.pad_token_id = tokenizer.eos_token_id` before the call.
- Whisper transcriptions repeat indefinitely. Pass `condition_on_prev_tokens=False` to `generate` (the `condition_on_previous_text` spelling belongs to the openai-whisper package) and chunk inputs with the pipeline's `chunk_length_s=30`.
- `trust_remote_code` warning. Either accept and pin a revision SHA, or wait for the architecture to be upstreamed.
- Tokenizer mismatch between training and inference. Always save the tokenizer alongside the model (`tokenizer.save_pretrained(dir)`); loading the base checkpoint's tokenizer after fine-tuning can silently drop special tokens you added.
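The `pad_token` fix above is easy to wrap so it is applied once, right after loading. The sketch below works on any objects exposing the same attribute names; the `SimpleNamespace` stand-ins exist only so the example runs without downloading a model, and `ensure_pad_token` is a hypothetical helper.

```python
from types import SimpleNamespace


def ensure_pad_token(tokenizer, model_config):
    """Reuse EOS for padding when the tokenizer defines no pad token."""
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
        model_config.pad_token_id = tokenizer.eos_token_id
    return tokenizer, model_config


if __name__ == "__main__":
    # Stand-ins for a real tokenizer and model.config.
    tok = SimpleNamespace(pad_token=None, eos_token="</s>", eos_token_id=2)
    cfg = SimpleNamespace(pad_token_id=None)
    ensure_pad_token(tok, cfg)
    print(tok.pad_token, cfg.pad_token_id)
```

With real objects the call would be `ensure_pad_token(tokenizer, model.config)`. A tokenizer that already has a pad token is left untouched.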
When NOT to use this#
- Production LLM inference at scale. Use vLLM or TGI. They add continuous batching, paged attention, and proper request scheduling. `model.generate()` processes one static batch at a time and leaves the GPU idle between calls.
- Edge / mobile deployment. ONNX Runtime, MLC LLM, llama.cpp, and Core ML are the right tools. `transformers` isn't built for that runtime envelope.
- Pure embedding inference at high QPS. `sentence-transformers` is purpose-built; `fastembed` (ONNX) is leaner.
- Hosted-API parity. If you only need a remote model, the provider SDK (`openai`, `anthropic`) is one HTTP call away, with no PyTorch needed.
- Tiny ML. For sub-100 MB classifiers, `scikit-learn` + `transformers.AutoTokenizer` for features is often cleaner than the full pipeline.
Production deployment#
transformers is a library; production deployment means picking the right server pattern around it. For most teams the right answer is “use vLLM” — but plenty of workloads run transformers directly in production.
Single-worker GPU server. One AutoModelForCausalLM instance held by one FastAPI process per GPU. Simple, debuggable, and adequate for low-QPS internal services. Use a process supervisor (gunicorn / supervisord) — multiple workers per GPU just thrash VRAM.
CPU encoder server. BERT-class encoders served through `optimum` with the ONNX Runtime or OpenVINO (`optimum-intel`) backend on Xeon CPUs are the canonical "embedding micro-service" shape. This can be ~10-100× cheaper than a GPU box for the same QPS, depending on model size and latency target.
Container shape. Build the image with the model pre-downloaded (saves cold-start time and reduces Hub dependency at runtime). Use multi-stage builds — the base CUDA image is large; copy only the runtime layer.
Model versioning. from_pretrained("name", revision="abc123") pins the exact Hub commit. Track the SHA in requirements.txt (as a comment) so rollback is auditable.
Inference batching. Either implement server-side micro-batching (collect ~10ms of requests, run one generate) or hand the workload to vLLM, which does this automatically and far better.
GPU utilisation telemetry. Export nvidia-smi metrics through dcgm-exporter to Prometheus. If GPU utilisation sits below 30%, you’re paying for idle silicon — batch harder, or stop using a GPU.
Health checks. /health should run a trivial 1-token generate — this catches NaN-loaded models that a process-level check wouldn’t.
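The micro-batching idea described under "Inference batching" can be sketched with `asyncio` alone. The example collects requests for a short window and runs them through one batch function; `MicroBatcher` and its parameters are hypothetical, and the upper-casing lambda stands in for a real tokenise-and-generate call.

```python
import asyncio


class MicroBatcher:
    """Collect requests for a short window, then run them as one batch."""

    def __init__(self, run_batch, window_s=0.01, max_batch=16):
        self.run_batch = run_batch  # callable: list of inputs -> list of outputs
        self.window_s = window_s
        self.max_batch = max_batch
        self.queue = asyncio.Queue()

    async def submit(self, item):
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((item, future))
        return await future

    async def worker(self):
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]  # block until the first request
            deadline = loop.time() + self.window_s
            while len(batch) < self.max_batch:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            inputs = [item for item, _ in batch]
            try:
                # Run the blocking batch call in a thread so the loop stays responsive.
                outputs = await loop.run_in_executor(None, self.run_batch, inputs)
            except Exception as exc:
                for _, future in batch:
                    if not future.done():
                        future.set_exception(exc)
                continue
            for (_, future), output in zip(batch, outputs):
                if not future.done():
                    future.set_result(output)


async def demo():
    batcher = MicroBatcher(lambda prompts: [p.upper() for p in prompts])
    task = asyncio.create_task(batcher.worker())
    results = await asyncio.gather(*(batcher.submit(p) for p in ["a", "b", "c"]))
    task.cancel()
    return results


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Each caller awaits its own future, so results come back to the right request regardless of how the batch was assembled. A production version would also need backpressure and cancellation handling, which this sketch leaves out.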
Cost & rate-limit management#
Self-hosted transformers swaps API-side rate limits for GPU-side rate limits. The economics change, but the bookkeeping doesn’t disappear.
- GPU-hour budgets. Track cost per request for each model, roughly `dollars_per_GPU_hour / (3600 × sustained_QPS)`. For an embedding model on an A10, that's pennies per million calls; for an 8B chat model on an H100, it's measurable.
- Choose the smallest model that meets quality. A 1.5B model that hits your accuracy bar beats a 70B model on TCO by 20-50×.
- Quantise. 4-bit NF4 with `BitsAndBytesConfig` cuts VRAM ~4× and often serves 2× more concurrent requests on the same GPU.
- Batch. Server-side batching is where most production savings come from. vLLM ships with continuous batching; if you're sticking with `transformers`, group requests within 20-50ms windows.
- Shut down cold GPUs. Spot/preemptible instances are 60-90% cheaper. For workloads tolerant of restart latency, that's the easy win.
- Embedding caching. For embedding workloads, cache by content hash; many corpora have 10-30% duplicate documents.
- Local quotas. A single misbehaving caller can saturate your GPU. Rate-limit per tenant at the application layer.
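One way to turn a GPU-hour price into a per-request figure is to divide it by the number of requests a GPU sustains per hour. The sketch below does that arithmetic; the prices and throughputs in the example calls are made-up illustrations, not measurements, and `cost_per_million_requests` is a hypothetical helper.

```python
def cost_per_million_requests(dollars_per_gpu_hour, sustained_qps):
    """Dollar cost of one million requests on one GPU at a steady request rate."""
    requests_per_hour = sustained_qps * 3600
    return dollars_per_gpu_hour / requests_per_hour * 1_000_000


if __name__ == "__main__":
    # Assumed numbers for illustration only.
    print(round(cost_per_million_requests(1.00, 500), 2))  # embedding-style workload
    print(round(cost_per_million_requests(4.00, 5), 2))    # chat-style workload
```

With these assumed inputs the two calls come to about 0.56 and 222.22 dollars per million requests, which shows how strongly sustained throughput dominates the bill.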
Multi-provider patterns#
transformers itself isn’t multi-provider — but it sits next to provider SDKs in many production architectures. The common patterns:
- Routing cheap workloads to local, expensive workloads to hosted. A `route(query) -> provider` function based on query complexity, latency budget, or model strengths. Local embedding + hosted chat is the classic split.
- Fine-tune local, serve from `transformers`, fall back to hosted on rate-limit. For high-availability services, treat the local model as primary and a hosted API as failover.
- vLLM as the OpenAI-compatible front door. vLLM serves `transformers`-format models over an OpenAI-compatible HTTP API. Any client that speaks OpenAI's wire format (LangChain, LiteLLM, openai-python) talks to it unchanged.
- LiteLLM proxy in front of mixed local + hosted. Same routing logic, centralised: one base URL for your application, multiple providers behind it.
- Tokenizer parity. Cross-provider routing needs tokenizer compatibility for token-budget calculations. `tiktoken` handles OpenAI; `transformers.AutoTokenizer` handles everything else.
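The first two patterns combine naturally into one routing function. The sketch below sends short queries to a local callable and fails over to a hosted callable when the query is too long or the local call raises; the character threshold, the function signature, and the stub providers are all assumptions made for illustration.

```python
def route(query, local, hosted, max_local_chars=2000):
    """Send short queries to the local model; fall back to hosted on failure."""
    if len(query) > max_local_chars:
        return "hosted", hosted(query)
    try:
        return "local", local(query)
    except Exception:
        # Local model overloaded or unavailable: fail over to the hosted API.
        return "hosted", hosted(query)


if __name__ == "__main__":
    # Stub providers standing in for a transformers model and a provider SDK.
    def local_stub(q):
        return f"local answer to {q!r}"

    def hosted_stub(q):
        return f"hosted answer to {q!r}"

    print(route("short question", local_stub, hosted_stub))
```

Returning the provider name alongside the answer makes it straightforward to log how often the fallback path is taken.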
See also#
- AI: transformers — pipelines, generation, fine-tuning
- Packages: pip-sentence-transformers — embedding-focused sibling
- Concept: api — client-library design