
transformers — Hugging Face Model Library

Package-level reference for the Hugging Face transformers library on PyPI — install extras, backend choice, versioning, and alternatives.


transformers#

What it is#

transformers is Hugging Face’s flagship Python library for loading and running pre-trained neural networks — language models, vision models, speech models, and multimodal models. It provides a unified AutoModel / AutoTokenizer / pipeline API on top of PyTorch, TensorFlow, JAX/Flax, and (increasingly) ONNX Runtime backends.

The library is tightly coupled to the Hugging Face Hub: from_pretrained("model-id") downloads weights, tokenizer files, and config from a Hub repo. The Hub now hosts well over a million model checkpoints.

Install#

pip install transformers

Output: installs the library but no ML backend — you still need PyTorch, TF, or JAX separately

pip install "transformers[torch]"

Output: installs transformers plus a compatible PyTorch wheel

pip install "transformers[torch]" accelerate

Output: the standard “modern LLM” combo — adds device_map="auto" and multi-GPU support

uv add transformers torch accelerate

Output: dependencies resolved + added to pyproject.toml

poetry add transformers torch

Output: updated lockfile + virtualenv install

Versioning & Python support#

  • Current stable is the 4.x series (and has been for the entire LLM era). Major bumps are rare; minor releases (4.45, 4.46, …) ship roughly monthly and frequently add new model architectures.
  • Python 3.9+ on current releases; 3.10+ recommended.
  • Loose semver — minor releases may add new APIs and deprecate old ones, but rarely break existing model loading. Patch releases are pure bug fixes.
  • The library lags slightly behind the latest models on the Hub for architecture support (a brand-new model often needs trust_remote_code=True until its class is upstreamed).
  • Pinning matters: transformers==4.X matched with torch>=Y is the typical compat matrix; the Hugging Face release notes call out the floors.
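The pinning advice above can be enforced with a small start-up check. The sketch below is a minimal illustration using only the standard library; the floor values in FLOORS are hypothetical placeholders (take real ones from the release notes), and the version parser deliberately ignores pre-release suffixes.

```python
from importlib import metadata


def parse_version(text: str) -> tuple:
    """Turn '4.46.1' or '2.4.0+cu121' into a tuple of leading integers."""
    core = text.split("+")[0]          # drop local build tags such as +cu121
    parts = []
    for piece in core.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:                 # stop at segments like 'dev0'
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_floor(installed: str, floor: str) -> bool:
    """True when the installed version is at or above the floor."""
    a, b = parse_version(installed), parse_version(floor)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))       # pad so '4.45' compares equal to '4.45.0'
    b += (0,) * (width - len(b))
    return a >= b


# Hypothetical floors for illustration only.
FLOORS = {"transformers": "4.45", "torch": "2.1"}


def check_floors(floors: dict) -> dict:
    """Map each package to True/False, or None when it is not installed."""
    report = {}
    for name, floor in floors.items():
        try:
            report[name] = meets_floor(metadata.version(name), floor)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


print(check_floors(FLOORS))
```

Running this at service start-up turns a silent version drift into an explicit failure you can act on.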

Package metadata#

Optional dependencies & extras#

transformers defines many [extra] groups. The most relevant:

Extra | Pulls in
transformers[torch] | PyTorch wheel matched to the library’s tested floor
transformers[tf] | TensorFlow 2.x
transformers[flax] | JAX + Flax
transformers[sentencepiece] | The sentencepiece tokenizer (required for LLaMA, T5, Mistral, etc.)
transformers[tokenizers] | The Rust-backed tokenizers library (pulled in by default in most paths)
transformers[onnxruntime] | ONNX export + inference with onnxruntime
transformers[serving] | Adds FastAPI for transformers serve
transformers[vision] | Pillow + image-processing deps for vision models
transformers[audio] | librosa, soundfile for speech models
transformers[all] | Everything above (very large install)

Companion packages from the same org:

  • accelerate — multi-GPU / mixed-precision / device_map="auto"
  • peft — parameter-efficient fine-tuning (LoRA, QLoRA, adapters)
  • bitsandbytes — 8-bit and 4-bit quantisation
  • datasets — Hugging Face dataset loading and streaming
  • safetensors — fast, memory-safe checkpoint format (now default)
  • huggingface-hub — Hub client, used by from_pretrained

Alternatives#

Package | Trade-off
vllm | Production-grade LLM inference server with much higher throughput than raw model.generate(). Use when serving at scale.
text-generation-inference (TGI) | Hugging Face’s own production serving stack.
onnxruntime | Run exported ONNX models with no Python ML framework. Smaller deploy footprint.
tensorflow-hub | TF-native pre-trained model hub. Mostly superseded by Hugging Face Hub today.
mlx (Apple Silicon) | Native Apple Silicon inference; some HF models mirrored. Use for Mac-local LLMs.
llama-cpp-python | GGUF-quantised CPU/GPU inference. Use when you need llama.cpp’s quant formats.

Common gotchas#

  1. Model card vs Hub repo vs Inference API are different things. The model “card” is the README; the Hub repo holds weights; the Inference API is a separate hosted service. from_pretrained only touches the Hub repo.
  2. trust_remote_code=True is code execution. Many cutting-edge models ship custom modeling code that loads via this flag — it runs arbitrary Python from the Hub. Only enable for repos you trust, ideally pinned to a revision SHA.
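  3. device_map="auto" needs accelerate. Without it, from_pretrained raises an error asking you to install accelerate; if you drop device_map to get past the error, the model loads to CPU and you wonder why inference is glacial.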
  4. FlashAttention is opt-in. Pass attn_implementation="flash_attention_2" to from_pretrained and install flash-attn separately — it’s not in any extra.
  5. Tokenizer mismatch with sentencepiece-based models. Loading LLaMA / Mistral / T5 without transformers[sentencepiece] raises a cryptic ImportError. Install the extra.
  6. Big models OOM silently on Windows. device_map="auto" will happily spill to disk via accelerate on Linux, but pagefile semantics on Windows hit hard. Use Linux / WSL for >7B models.
  7. pipeline() is slow per-call. It re-runs framework overhead each invocation. Build the model + tokenizer manually and batch inputs for throughput.
  8. Hub auth required for gated models. LLaMA, Gemma, and others need huggingface-cli login and an approved license click on the website before from_pretrained works.
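Several of the gotchas above come down to an optional package that is missing at import time. A minimal pre-flight sketch, assuming the import names accelerate, sentencepiece, and flash_attn (the OPTIONAL mapping and the helper name are illustrative, not part of transformers):

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be found, without importing them."""
    return [name for name in names if find_spec(name) is None]


# Illustrative mapping: import name -> feature that depends on it.
OPTIONAL = {
    "accelerate": 'device_map="auto"',
    "sentencepiece": "sentencepiece-based tokenizers (LLaMA / T5 / Mistral)",
    "flash_attn": 'attn_implementation="flash_attention_2"',
}

for name in missing_modules(OPTIONAL):
    print(f"missing {name}: needed for {OPTIONAL[name]}")
```

Failing fast here is usually cheaper than discovering the gap after a multi-gigabyte download.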

Ecosystem integrations#

transformers is the trunk of a wider Hugging Face ecosystem. The companion packages overlap less than their names suggest — each owns a slice of the lifecycle.

Package | What it owns
accelerate | Multi-GPU placement, mixed-precision, device_map="auto", distributed launch (accelerate launch …).
peft | Parameter-efficient fine-tuning: LoRA, QLoRA, prefix tuning, IA3, adapters.
bitsandbytes | 8-bit and 4-bit quantisation kernels for CUDA. Powers BitsAndBytesConfig.
optimum | Hardware-specific backends: ONNX Runtime, OpenVINO, TensorRT, Intel Habana, AWS Neuron.
safetensors | Memory-mapped, pickle-free checkpoint format. Default for new uploads.
datasets | Streaming / mapped tabular dataset library. The standard pre-Trainer data layer.
tokenizers | Rust-backed fast tokenizers; AutoTokenizer instantiates these by default.
huggingface-hub | Hub client. from_pretrained calls it transitively; CLI tools (huggingface-cli) come from here.
evaluate | Standard metrics (accuracy, f1, rouge, bleu, bertscore). Compatible with Trainer.compute_metrics.
trl | Reinforcement learning from human feedback: SFTTrainer, DPOTrainer, PPOTrainer.
transformers.js | Browser-side inference (different package; same model files).

Inference-server siblings (not strict integrations but the same model files):

  • vllm — high-throughput LLM serving with continuous batching and paged attention.
  • text-generation-inference (TGI) — Hugging Face’s own production server.
  • llama-cpp-python — GGUF-quantised inference on CPU + small GPUs.

Most production setups mix these — fine-tune with transformers/peft, serve with vLLM, observe with langsmith or OpenTelemetry.

Real-world recipes#

Recipe: chat-template inference server#

A FastAPI service that wraps a chat model with proper chat templating, streaming, and prompt-cache-friendly batching.

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from transformers import AutoTokenizer, AutoModelForCausalLM, TextIteratorStreamer
from threading import Thread
import torch

MODEL = "meta-llama/Llama-3.1-8B-Instruct"

tok   = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="sdpa",   # safe default; flash_attention_2 if installed
)
tok.pad_token = tok.eos_token

app = FastAPI()

@app.post("/chat")
def chat(payload: dict):
    prompt = tok.apply_chat_template(payload["messages"], tokenize=False, add_generation_prompt=True)
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(tok, skip_prompt=True, skip_special_tokens=True)
    Thread(target=model.generate, kwargs=dict(
        **inputs, streamer=streamer, max_new_tokens=512, do_sample=True, temperature=0.7,
    )).start()
    return StreamingResponse((t for t in streamer), media_type="text/plain")

Output: clients receive tokens incrementally; one worker per GPU is the right shape.

Recipe: LoRA fine-tune + merge for serving#

Train a small adapter with PEFT, merge it back into the base, then export to a serving format:

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct",
                                            torch_dtype=torch.bfloat16, device_map="auto")
adapter = PeftModel.from_pretrained(base, "./lora-out/final")
merged  = adapter.merge_and_unload()
merged.save_pretrained("./llama3-merged", safe_serialization=True)
AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct").save_pretrained("./llama3-merged")

Output: ./llama3-merged is a complete fine-tuned checkpoint ready for vLLM / TGI / any HF-compatible loader.

Recipe: streaming embedding pipeline over a parquet corpus#

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModel
import torch, torch.nn.functional as F

tok = AutoTokenizer.from_pretrained("BAAI/bge-small-en-v1.5")
model = AutoModel.from_pretrained("BAAI/bge-small-en-v1.5").to("cuda").eval()

@torch.no_grad()
def embed(batch):
    enc = tok(batch["text"], padding=True, truncation=True, return_tensors="pt", max_length=512).to("cuda")
    h = model(**enc).last_hidden_state[:, 0]  # CLS pooling for BGE
    h = F.normalize(h, p=2, dim=1)
    return {"embedding": h.cpu().tolist()}

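import pyarrow as pa
import pyarrow.parquet as pq

ds = load_dataset("parquet", data_files="docs-*.parquet", streaming=True)["train"]
out = ds.map(embed, batched=True, batch_size=64)

# A streaming (iterable) dataset is lazy and may not expose to_parquet,
# so write the embedded batches to disk incrementally instead.
writer = None
for batch in out.iter(batch_size=1024):
    table = pa.Table.from_pydict(batch)
    if writer is None:
        writer = pq.ParquetWriter("docs-embedded.parquet", table.schema)
    writer.write_table(table)
if writer is not None:
    writer.close()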

Output: memory-bounded embedding pipeline that scales to corpora larger than RAM.

Recipe: zero-shot classification ladder#

Triage incoming support tickets cheaply with a zero-shot model, escalate to an LLM only if confidence is low.

from transformers import pipeline
zsc = pipeline("zero-shot-classification", model="MoritzLaurer/deberta-v3-base-zeroshot-v2.0", device=0)

CATEGORIES = ["billing", "bug report", "feature request", "account access", "other"]

def route(ticket: str):
    res = zsc(ticket, candidate_labels=CATEGORIES, multi_label=False)
    label, score = res["labels"][0], res["scores"][0]
    return label if score >= 0.75 else "escalate-to-llm"

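Output: roughly 10× cheaper than running every ticket through an LLM, with a confidence threshold that decides when to fall back. Zero-shot scores are not guaranteed to be calibrated, so tune the 0.75 cut-off on labelled tickets.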

Recipe: offline-first deployment#

Pre-bake the model into the container; lock down Hub access at runtime:

FROM python:3.12-slim
RUN pip install --no-cache-dir transformers torch accelerate
RUN python -c "from transformers import AutoModel, AutoTokenizer; \
    AutoModel.from_pretrained('intfloat/e5-small-v2'); \
    AutoTokenizer.from_pretrained('intfloat/e5-small-v2')"
ENV HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1
COPY . /app
CMD ["python", "/app/serve.py"]

Output: model weights baked into the image; runtime has no Hub dependency.

Performance tuning#

The default model.generate(...) call leaves a lot of throughput on the table. The biggest wins, in rough order of impact:

  • Batch on the server, not the client. For inference servers, batch concurrent requests inside one generate call when possible. Even mismatched-length prompts benefit from padded batching.
  • Attention implementation. attn_implementation="sdpa" (PyTorch’s scaled-dot-product attention) is a safe default on modern PyTorch. flash_attention_2 is faster and lower-memory but needs pip install flash-attn and supported GPUs (Ampere+).
  • KV cache is on by default for generate. Don’t disable it; confirm that model.config.use_cache is True.
  • torch.compile for stable shapes. model = torch.compile(model, mode="reduce-overhead") improves throughput materially on Ampere+ GPUs once the graph stabilises. Compile time is heavy — only worth it for long-running servers.
  • Quantisation. 4-bit NF4 (BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")) cuts weight memory roughly 4× versus 16-bit, usually with a small quality loss that is worth measuring on your own evals. Pre-quantised GPTQ / AWQ checkpoints skip the quantise-on-load step.
  • Mixed precision. torch_dtype=torch.bfloat16 on Ampere+; float16 on older cards. Either halves weight memory versus float32.
  • Gradient checkpointing for training. Trade recompute for memory: TrainingArguments(gradient_checkpointing=True) lets larger batches/sequences fit.
  • Use optim="adamw_bnb_8bit" for training. 8-bit optimiser states take roughly a quarter of the memory of 32-bit AdamW states; weights and activations are unaffected.
  • DataLoader workers. For training, num_workers=4 (or higher) + pin_memory=True removes CPU bottlenecks.
  • TORCH_LOGS=recompiles while developing — catches silent recompilation pitfalls under torch.compile.
  • Multi-GPU. device_map="auto" splits layers across GPUs naively: the devices run one after another rather than in parallel. Use accelerate launch --multi_gpu + Trainer for proper data-parallel training; FSDP or DeepSpeed ZeRO for sharded training of models that don’t fit on one GPU.
  • CPU inference. optimum-intel + OpenVINO accelerates BERT-class encoders by 3-5× on Xeon CPUs. ONNX Runtime is a portable alternative.

For production inference at scale, hand off to vLLM or TGI. model.generate() was never optimised for serving throughput; those servers add continuous batching, paged attention, and tensor parallelism that you’d otherwise reimplement.
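The memory figures quoted for mixed precision and quantisation follow from simple arithmetic on bits per parameter. A back-of-envelope sketch, assuming an 8-billion-parameter model and counting weights only (KV cache, activations, and quantisation metadata are ignored):

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate weight memory in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9


# Hypothetical 8B-parameter model at common precisions.
for label, bits in [("float32", 32), ("bfloat16", 16), ("int8", 8), ("nf4", 4)]:
    print(f"{label:>8}: {weight_memory_gb(8e9, bits):5.1f} GB")
```

Real footprints come out somewhat higher, so treat the result as a lower bound when sizing a GPU.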

Version migration guide#

The 4.x line has lasted the entire modern LLM era. Within 4.x, churn comes from monthly minor releases that frequently add models and occasionally tighten APIs.

Roughly | What tends to break
<4.30 | Pre-safetensors default. Many tutorials still call .bin loading explicitly.
~4.35 | attn_implementation arg standardised. Older code passes attention via separate flags.
~4.40 | device_map="auto" paths assume accelerate is installed (it isn’t always).
~4.45 | Chat template defaults tightened; some older instruct models require apply_chat_template explicitly.
Latest | Steady drumbeat of new model architectures; trust_remote_code=True is the bridge until the class is upstreamed.

Migration discipline:

  1. Pin both library and torch. transformers==4.X + torch==Y.Z are the supported pair listed in release notes.
  2. Re-test fine-tunes after upgrades. New defaults (loss-fns, gradient clipping, LR scheduler) occasionally affect downstream metrics.
  3. from_pretrained(...) arg deprecations are noisy but generally non-breaking for a release or two. Address before they go silent.
  4. Tokenizer special-token handling has tightened over time. Pin the tokenizer JSON when reproducibility matters.

Hedge: when in doubt, the Hugging Face Transformers Release Notes on GitHub list every behavioral change per release — search there rather than guessing.

Security considerations#

transformers runs arbitrary model weights, occasionally arbitrary model code, and pulls files from the public Hub. The threat surface deserves attention.

  • trust_remote_code=True is code execution. Many cutting-edge models ship custom modeling code that loads via this flag. Audit the repo and pin a revision SHA (revision="abc123") before enabling. Treat unknown repos like unknown PyPI packages.
  • Pickle in .bin checkpoints. Pre-safetensors .bin files are Python pickles, so loading an untrusted one can run arbitrary code. Prefer safetensors, and pass use_safetensors=True to from_pretrained so the load fails rather than falling back to a pickle file.
  • Hub supply chain. Anyone can publish to the Hub. Pin revision=<commit-sha> for production use, not branch names.
  • Tokenizer files. tokenizer.json is data, not code, but malicious vocab can produce token IDs that confuse downstream code. Defence-in-depth: validate vocab size against model.config.vocab_size.
  • Gated model licenses. LLaMA, Gemma, some Mistral releases, and others carry license obligations beyond Apache-2.0. Track which models touch your build pipeline.
  • PII in prompts and outputs. Local inference avoids sending data to a third party — but logs, traces, and dataset caches will still capture it. Apply the same redaction policy you’d use for hosted APIs.
  • Adversarial prompts. Local models don’t have the abuse mitigations a hosted service does. If users supply prompts directly, add a content classifier upstream.
  • Network egress at load time. from_pretrained hits the Hub. Use TRANSFORMERS_OFFLINE=1 + a pre-baked cache in air-gapped environments.
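The revision-pinning advice in this list can be enforced in code instead of left to convention. A minimal sketch, assuming Hub commit IDs are 40-character hexadecimal git SHAs; pinned_load_kwargs is a hypothetical helper, and the keys it returns (revision, trust_remote_code, use_safetensors) are passed through to from_pretrained.

```python
import re

_FULL_SHA = re.compile(r"[0-9a-f]{40}")


def pinned_load_kwargs(revision: str, trust_remote_code: bool = False) -> dict:
    """Build from_pretrained kwargs, refusing remote code on a movable ref."""
    if trust_remote_code and not _FULL_SHA.fullmatch(revision):
        raise ValueError(
            "trust_remote_code=True requires a full commit SHA, "
            f"not a branch or tag such as {revision!r}"
        )
    return {
        "revision": revision,
        "trust_remote_code": trust_remote_code,
        "use_safetensors": True,   # fail instead of falling back to pickle files
    }


# Intended use (not executed here, since it needs network access):
#   AutoModel.from_pretrained(repo_id, **pinned_load_kwargs(sha, trust_remote_code=True))
```

Branch names such as main can move after you audit them; a full SHA cannot, which is why the helper rejects anything else when remote code is enabled.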

Troubleshooting common errors#

  • OSError: Can't load tokenizer for ... — usually a missing extra. pip install sentencepiece (LLaMA/T5/Mistral) or tiktoken (some OpenAI-derived tokenizers).
  • ImportError: This requires you to install the latest version of bitsandbytes. Run pip install -U bitsandbytes and ensure CUDA matches your PyTorch build.
  • device_map="auto" errors on CPU-only machines. Install accelerate even on CPU; without it the device-mapping logic short-circuits to errors.
  • CUDA out of memory. Try low_cpu_mem_usage=True, quantisation (4-bit), gradient checkpointing, smaller batch, shorter sequences, or device_map="auto" with offloading.
  • Generated text is gibberish for instruct models. You forgot apply_chat_template. Decoder-only chat models trained with template tokens produce nonsense if you skip them.
  • pad_token error during batched generation. Set tokenizer.pad_token = tokenizer.eos_token and model.config.pad_token_id = tokenizer.eos_token_id before the call.
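  • Whisper transcriptions repeat indefinitely. With the ASR pipeline, chunk inputs with chunk_length_s=30; with long-form generate, try condition_on_prev_tokens=False (condition_on_previous_text is the openai-whisper package’s name for the equivalent option).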
  • trust_remote_code warning. Either accept and pin a revision SHA, or wait for the architecture to be upstreamed.
  • Tokenizer mismatch between training and inference. Always save the tokenizer alongside the model (tokenizer.save_pretrained(dir)); reloading the tokenizer from the base checkpoint after fine-tuning can drop added special tokens or shift their IDs.
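One cheap guard against a tokenizer/model mismatch is to confirm that every token ID fits inside the model's embedding table. The sketch below reads the files directly, assuming the checkpoint directory holds a fast-tokenizer tokenizer.json and a config.json with a vocab_size field; both function names are hypothetical.

```python
import json
from pathlib import Path


def max_token_id(tokenizer_json: dict) -> int:
    """Largest token ID defined by a tokenizer.json payload."""
    vocab = tokenizer_json["model"]["vocab"]
    if isinstance(vocab, dict):        # BPE / WordPiece: token -> id
        ids = list(vocab.values())
    else:                              # Unigram: list of [token, score], id = index
        ids = list(range(len(vocab)))
    ids += [tok["id"] for tok in tokenizer_json.get("added_tokens", [])]
    return max(ids)


def tokenizer_fits_model(checkpoint_dir: str) -> bool:
    """True when every token ID is a valid row in the embedding matrix."""
    root = Path(checkpoint_dir)
    tok = json.loads((root / "tokenizer.json").read_text(encoding="utf-8"))
    cfg = json.loads((root / "config.json").read_text(encoding="utf-8"))
    return max_token_id(tok) < cfg["vocab_size"]
```

The check uses a strict less-than rather than equality because some checkpoints pad vocab_size beyond the tokenizer's vocabulary.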

When NOT to use this#

  • Production LLM inference at scale. Use vLLM or TGI. They add continuous batching, paged attention, and proper request scheduling. model.generate() processes one static batch at a time with no request scheduling, so the GPU idles between calls.
  • Edge / mobile deployment. ONNX Runtime, MLC LLM, llama.cpp, and Core ML are the right tools. transformers isn’t built for that runtime envelope.
  • Pure embedding inference at high QPS. sentence-transformers is purpose-built; fastembed (ONNX) is leaner.
  • Hosted-API parity. If you only need a remote model, the provider SDK (openai, anthropic) is one HTTP call away — no PyTorch needed.
  • Tiny ML. For sub-100 MB classifiers, scikit-learn + transformers.AutoTokenizer for features is often cleaner than the full pipeline.

Production deployment#

transformers is a library; production deployment means picking the right server pattern around it. For most teams the right answer is “use vLLM” — but plenty of workloads run transformers directly in production.

Single-worker GPU server. One AutoModelForCausalLM instance held by one FastAPI process per GPU. Simple, debuggable, and adequate for low-QPS internal services. Use a process supervisor (gunicorn / supervisord) — multiple workers per GPU just thrash VRAM.

CPU encoder server. BERT-class encoders served through optimum’s ONNX Runtime backend or optimum-intel’s OpenVINO backend on Xeon CPUs are the canonical “embedding micro-service” shape. At modest QPS this can cost far less than a GPU box; benchmark your own workload before committing.

Container shape. Build the image with the model pre-downloaded (saves cold-start time and reduces Hub dependency at runtime). Use multi-stage builds — the base CUDA image is large; copy only the runtime layer.

Model versioning. from_pretrained("name", revision="abc123") pins the exact Hub commit. Track the SHA in requirements.txt (as a comment) so rollback is auditable.

Inference batching. Either implement server-side micro-batching (collect ~10ms of requests, run one generate) or hand the workload to vLLM, which does this automatically and far better.
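The micro-batching idea can be sketched with asyncio: hold the first request for a short window, gather whatever else arrives, then make one batched call. This is an illustration under stated assumptions, not a production scheduler. MicroBatcher and run_batch are hypothetical names, and run_batch here is a stand-in for a batched generate call (a real one should run in a worker thread so it does not block the event loop).

```python
import asyncio


class MicroBatcher:
    """Collect requests for a short window, then run them as one batch."""

    def __init__(self, run_batch, window_s=0.01, max_batch=8):
        self.run_batch = run_batch          # hypothetical: list[str] -> list[str]
        self.window_s = window_s
        self.max_batch = max_batch
        self.queue = asyncio.Queue()

    async def submit(self, prompt):
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((prompt, fut))
        return await fut                     # resolved by the worker

    async def worker(self):
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]            # wait for the first request
            deadline = loop.time() + self.window_s
            while len(batch) < self.max_batch:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            try:
                results = self.run_batch([p for p, _ in batch])  # one call per batch
            except Exception as exc:
                for _, fut in batch:
                    fut.set_exception(exc)
                continue
            for (_, fut), result in zip(batch, results):
                fut.set_result(result)


async def demo():
    batch_sizes = []

    def fake_generate(prompts):              # stand-in for a batched model call
        batch_sizes.append(len(prompts))
        return [p.upper() for p in prompts]

    batcher = MicroBatcher(fake_generate, window_s=0.05)
    task = asyncio.create_task(batcher.worker())
    outs = await asyncio.gather(*(batcher.submit(p) for p in ["a", "b", "c"]))
    task.cancel()
    return outs, batch_sizes


outs, batch_sizes = asyncio.run(demo())
print(outs, batch_sizes)
```

The window length trades latency for batch size: a longer window yields fuller batches but adds that delay to every request.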

GPU utilisation telemetry. Export nvidia-smi metrics through dcgm-exporter to Prometheus. If GPU utilisation sits below 30%, you’re paying for idle silicon — batch harder, or stop using a GPU.

Health checks. /health should run a trivial 1-token generate — this catches NaN-loaded models that a process-level check wouldn’t.

Cost & rate-limit management#

Self-hosted transformers swaps API-side rate limits for GPU-side rate limits. The economics change, but the bookkeeping doesn’t disappear.

  • GPU-hour budgets. Track cost per request for each model: dollars_per_GPU_hour ÷ (sustained QPS × 3600). For an embedding model on an A10, that’s pennies per million calls; for an 8B chat model on an H100, it’s measurable.
  • Choose the smallest model that meets quality. A 1.5B model that hits your accuracy bar beats a 70B model on TCO by 20-50×.
  • Quantise. 4-bit NF4 with BitsAndBytesConfig cuts VRAM ~4× and often serves 2× more concurrent requests on the same GPU.
  • Batch. Server-side batching is where most production savings come from. vLLM ships with continuous batching; if you’re sticking with transformers, group requests within 20-50ms windows.
  • Shut down cold GPUs and use spot capacity. Idle GPUs bill at the full rate, so scale them to zero; spot/preemptible instances are typically 60-90% cheaper. For workloads tolerant of restart latency, that’s the easy win.
  • Embedding caching. For embedding workloads, cache by content hash — many corpora have 10-30% duplicate documents.
  • Local quotas. A single misbehaving caller can saturate your GPU. Rate-limit per tenant at the application layer.
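The GPU-hour bookkeeping above reduces to one division. A worked sketch, where the prices and throughput figures are hypothetical placeholders and not benchmarks:

```python
def cost_per_million_requests(gpu_hourly_usd: float, sustained_qps: float) -> float:
    """Dollars per one million requests on one GPU at a sustained request rate."""
    requests_per_hour = sustained_qps * 3600
    return gpu_hourly_usd / requests_per_hour * 1_000_000


# Hypothetical numbers for illustration only.
embedding = cost_per_million_requests(gpu_hourly_usd=1.00, sustained_qps=500)
chat = cost_per_million_requests(gpu_hourly_usd=4.00, sustained_qps=2)
print(f"embedding model: ${embedding:.2f} per million requests")
print(f"chat model:      ${chat:.2f} per million requests")
```

Use sustained QPS measured under real traffic, not peak benchmark throughput, or the estimate will flatter the GPU.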

Multi-provider patterns#

transformers itself isn’t multi-provider — but it sits next to provider SDKs in many production architectures. The common patterns:

  • Routing cheap workloads to local, expensive workloads to hosted. A route(query) -> provider function based on query complexity, latency budget, or model strengths. Local embedding + hosted chat is the classic split.
  • Fine-tune local, serve from transformers, fall back to hosted on rate-limit. For high-availability services, treat the local model as primary and a hosted API as failover.
  • vLLM as the OpenAI-compatible front door. vLLM serves transformers-format models over an OpenAI-compatible HTTP API. Any client that speaks OpenAI’s wire format (LangChain, LiteLLM, openai-python) talks to it unchanged.
  • LiteLLM proxy in front of mixed local + hosted. Same routing logic, centralised — one base URL for your application, multiple providers behind it.
  • Tokenizer parity. Cross-provider routing needs tokenizer compatibility for token-budget calculations. tiktoken handles OpenAI; transformers.AutoTokenizer handles everything else.
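The routing and failover patterns above fit in a few lines. A minimal sketch, assuming each provider is wrapped as a plain callable; the length-based rule, the 2000-character threshold, and the provider names are hypothetical stand-ins for whatever complexity signal you actually use.

```python
from typing import Callable, Dict, List

Provider = Callable[[str], str]


def route(query: str, max_local_chars: int = 2000) -> List[str]:
    """Return provider names in the order they should be tried."""
    if len(query) <= max_local_chars:
        return ["local", "hosted"]      # cheap query: local first, hosted as failover
    return ["hosted", "local"]          # expensive query: hosted first


def complete(query: str, providers: Dict[str, Provider]) -> str:
    """Try providers in routed order, falling through on any failure."""
    last_error = None
    for name in route(query):
        try:
            return providers[name](query)
        except Exception as exc:        # e.g. rate limit, OOM, timeout
            last_error = exc
    raise RuntimeError("all providers failed") from last_error


# Stand-in providers; real ones would wrap a local model and a hosted SDK.
def flaky_local(prompt: str) -> str:
    raise TimeoutError("local GPU busy")


def hosted(prompt: str) -> str:
    return f"hosted:{prompt}"


print(complete("hello", {"local": flaky_local, "hosted": hosted}))
```

Catching every exception keeps the sketch short; a real router would fail over only on retryable errors and log the rest.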

See also#