qdrant-client — Python Client for Qdrant

Package-level reference for qdrant-client on PyPI — install variants, server version matching, gRPC vs HTTP, fastembed extras, and alternatives.

qdrant-client#

What it is#

qdrant-client is the official Python SDK for Qdrant, a Rust-written vector database focused on production-grade similarity search with rich payload filtering. The client speaks both REST (over HTTP) and gRPC to a remote Qdrant server, and ships a built-in in-memory mode (QdrantClient(":memory:")) plus an on-disk single-node mode for local development without a running server.

Reach for qdrant-client when you want strong filtering on structured payloads alongside dense (or sparse, or hybrid) vector search, and you are comfortable running a separate Qdrant server in production. Reach for chromadb when you prefer an embedded, zero-infrastructure store; reach for weaviate-client when you want hybrid search with BM25 baked in.

Install#

pip install qdrant-client

Output: (none — exits 0 on success)

uv add qdrant-client

Output: dependency resolved + added to pyproject.toml

poetry add qdrant-client

Output: updated lockfile + virtualenv install

pip install "qdrant-client[fastembed]"           # adds on-device embedding via fastembed
pip install "qdrant-client[fastembed-gpu]"       # GPU build of fastembed (CUDA)

Output: Qdrant client plus the chosen embedding bundle

Versioning & Python support#

  • The client follows Qdrant server major/minor versions closely — qdrant-client~=1.10 pairs with Qdrant server 1.10.x. Cross-major combinations should be avoided; cross-minor usually works but new server features (e.g. multivector, sparse indices, named vectors) only land in the client a release or two later.
  • Recent versions support Python 3.9+. Wheels are pure-Python; gRPC support pulls in grpcio (a binary wheel).
  • The Python client occasionally lags the server — a feature visible in the server’s REST API may not yet have a typed Python wrapper. Workaround: drop to client.http (the auto-generated REST methods) or post the raw payload via requests.
  • The 0.x to 1.0 jump (early 2023) introduced breaking changes around collection-config schemas; the 1.10 line further reshaped VectorParams and OptimizersConfig. Pin a tight range.
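
The pairing rule above can be checked mechanically at startup. The sketch below is a minimal illustration, not part of qdrant-client: check_pairing and minor_series are hypothetical helper names, and it assumes you already have both version strings in hand (the installed client version can be read with importlib.metadata.version("qdrant-client"); how you obtain the server version is left to you).

```python
def minor_series(version: str) -> tuple[int, int]:
    """Return (major, minor) from a version string such as '1.10.3' or 'v1.10.0'."""
    parts = version.lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


def check_pairing(client_version: str, server_version: str) -> str:
    """Classify a client/server pair as 'ok', 'minor-drift', or 'major-mismatch'."""
    c_major, c_minor = minor_series(client_version)
    s_major, s_minor = minor_series(server_version)
    if c_major != s_major:
        return "major-mismatch"   # avoid this combination entirely
    if c_minor != s_minor:
        return "minor-drift"      # usually works, but newer features may be missing
    return "ok"                   # patch-level differences are ignored


print(check_pairing("1.10.1", "1.10.4"))  # ok
```

Logging the result at service start is usually enough; failing hard on "minor-drift" is a policy choice rather than a requirement.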

Package metadata#

Optional dependencies & extras#

  • qdrant-client[fastembed] — adds the fastembed library so you can produce embeddings on the client side without pulling in sentence-transformers or calling a remote API. Uses ONNX Runtime under the hood.
  • qdrant-client[fastembed-gpu] — same, but with the CUDA build of ONNX Runtime for GPU-accelerated embedding.
  • The base install already supports both HTTP (via httpx) and gRPC (via grpcio) — you opt into gRPC at runtime with QdrantClient(prefer_grpc=True), no extra is needed.

Common companions installed alongside:

  • fastembed — standalone version of the same embedding library, if you want to share it across multiple processes.
  • sentence-transformers — alternative client-side embeddings.
  • openai / cohere / voyageai — remote embedding APIs you can wire into client.upsert(...).
  • langchain-qdrant and llama-index-vector-stores-qdrant — framework adapters.

Alternatives#

| Package | Trade-off |
| --- | --- |
| chromadb | Embedded, zero-infrastructure. Use for prototypes or small RAG apps. |
| weaviate-client | Hybrid vector + BM25 with schema-first GraphQL. Use when keyword search matters as much as vectors. |
| pymilvus | Milvus client. Use for very large multi-billion-vector workloads. |
| pinecone-client | Fully-hosted SaaS. Use when you want to outsource ops entirely. |
| lancedb | Embedded columnar DB on Lance/Arrow. Use when your data is already columnar. |
| pgvector (via psycopg/SQLAlchemy) | Postgres extension. Use when you already run Postgres and want one less moving part. |

Common gotchas#

  1. HTTP vs gRPC ports. The default Qdrant server exposes 6333 (REST) and 6334 (gRPC). prefer_grpc=True requires the gRPC port to be reachable; many Docker examples publish only 6333, in which case gRPC calls fail with a connection error or hang until the timeout rather than falling back to HTTP.
  2. Collection-config schema reshape in 1.10. VectorParams, OptimizersConfigDiff, and HnswConfigDiff were tightened; old recreate_collection(...) calls written for 1.7-era examples now raise validation errors. Regenerate from current docs.
  3. recreate_collection deletes data. It is delete + create in one call, not a no-op when the schema already matches. Use create_collection with a try/except, or check collection_exists, in code that should be idempotent.
  4. In-memory :memory: is single-process only. It exists for unit tests; do not use it as a “lightweight production” mode. The on-disk single-node mode (path=...) is more durable but still single-writer.
  5. Client occasionally lags server features. Sparse vectors, multivector storage, and quantization parameters often appear in the server REST API a release before they get typed Python wrappers. Drop to client.http for the gap.
  6. grpcio wheel size is non-trivial. Adds ~10 MB to the install. Slim Docker images that don’t actually use gRPC can pin a constraint to skip it, or use a build without it.
  7. API key vs JWT auth. Self-hosted Qdrant supports a static API key; Qdrant Cloud uses JWT-style tokens. Both go in the api_key= constructor argument, but cluster-scoped permissions differ.
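
For gotcha 3, an idempotent "create if missing" helper can be sketched as follows. ensure_collection is a hypothetical name, not a qdrant-client function; it only assumes the client object exposes collection_exists() and create_collection(), as QdrantClient does in recent releases. The check and the create are not atomic, so two processes racing on the same name can still see create_collection raise.

```python
def ensure_collection(client, name, **create_kwargs):
    """Create `name` only if it is missing; never touch an existing collection.

    Returns True if the collection was created, False if it already existed.
    Extra keyword arguments (e.g. vectors_config=...) go to create_collection.
    """
    if client.collection_exists(name):
        return False              # existing data and config are left untouched
    client.create_collection(name, **create_kwargs)
    return True
```

With a real client this would be called as ensure_collection(client, "kb", vectors_config=VectorParams(size=384, distance=Distance.COSINE)). Note that it does not verify that an existing collection matches the requested config.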

Real-world recipes#

The recipes below focus on the install / transport / collection-config choices each pattern implies — the sections/ai/qdrant companion covers the points/filters API in depth.

In-memory client for unit tests. QdrantClient(":memory:") boots an in-process Qdrant simulation. Useful for CI; not durable, not the same code path as the production server.

from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, PointStruct

client = QdrantClient(":memory:")
client.create_collection(
    "kb",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)
client.upsert(
    "kb",
    points=[PointStruct(id=1, vector=[0.1] * 384, payload={"src": "intro"})],
)
print(client.search("kb", query_vector=[0.1] * 384, limit=1))

Output: the upserted point with its score and payload; no server is running — everything is in-process

gRPC client against a real server — for batch uploads and high-throughput query loads, gRPC is materially faster than REST. Make sure port 6334 is exposed.

from qdrant_client import QdrantClient

client = QdrantClient(
    host="qdrant.internal",
    grpc_port=6334,
    prefer_grpc=True,
    api_key="...",     # or jwt token for Qdrant Cloud
    https=True,
    timeout=30,
)

Output: a client that uses gRPC for batch operations and falls back to HTTP for endpoints not yet on gRPC

Batch upload with progress and retries. upload_points and upload_collection are the bulk-load entry points. They batch internally and recover from transient errors.

from qdrant_client.models import PointStruct

client.upload_points(
    collection_name="kb",
    points=(PointStruct(id=i, vector=emb, payload=meta)
            for i, emb, meta in iter_rows()),
    batch_size=512,
    parallel=4,
    max_retries=3,
)

Output: points stream into the collection in 4 parallel batches of 512; the call blocks until the generator is exhausted

HNSW + quantization for a billion-scale collection — Qdrant supports scalar (int8) and product (PQ) quantization at the index level, trading recall for memory and disk.

from qdrant_client.models import (
    VectorParams, Distance, HnswConfigDiff,
    ScalarQuantization, ScalarQuantizationConfig, ScalarType,
    OptimizersConfigDiff,
)

client.create_collection(
    "huge",
    vectors_config=VectorParams(
        size=768,
        distance=Distance.COSINE,
        on_disk=True,
    ),
    hnsw_config=HnswConfigDiff(m=32, ef_construct=256, on_disk=True),
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(type=ScalarType.INT8, always_ram=True),
    ),
    optimizers_config=OptimizersConfigDiff(memmap_threshold=50_000),
)

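Output: a collection that holds vectors on disk, keeps the int8-quantized vectors in RAM for fast search, and converts segments to memmapped storage once they exceed memmap_threshold (specified in kilobytes, so 50_000 is roughly 50 MB per segment, not 50k points)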

Hybrid (sparse + dense) with named vectors — Qdrant supports multiple named vectors per point. The classic pattern is a dense vector for semantics and a sparse vector (BM42 / SPLADE) for keyword match, fused server-side or in application code.

from qdrant_client.models import (
    VectorParams, Distance, SparseVectorParams, SparseVector, NamedVector,
)

client.create_collection(
    "hybrid",
    vectors_config={"dense": VectorParams(size=384, distance=Distance.COSINE)},
    sparse_vectors_config={"keywords": SparseVectorParams()},
)
# Query with a dense vector, then rerank with a sparse query (RRF in app code)
dense_hits = client.search("hybrid", query_vector=NamedVector(name="dense", vector=q_dense), limit=50)

Output: the collection has both a dense and a sparse vector slot per point; the search call uses the named dense vector explicitly

Production deployment#

Qdrant in production is almost always a separate Docker / Kubernetes deployment with the Python client speaking gRPC over a private network. The :memory: and on-disk single-node modes are for development; do not scale them.

Topology checklist:

| Concern | Single-node Docker | Cluster (multi-node) | Qdrant Cloud |
| --- | --- | --- | --- |
| Replicas | 1 | configurable per collection | configurable per collection |
| Shards | 1 | per-collection shard count | per-collection shard count |
| Backups | filesystem snapshot of /qdrant/storage | per-shard snapshot API | managed |
| Auth | static API key via env | static API key per node | JWT tokens with role claims |
| Transport | 6333 (HTTP) + 6334 (gRPC) | same per-node | TLS-only |
| Telemetry | sends anonymous pings | same | managed |

Sharding and replication. Set at collection creation:

from qdrant_client.models import VectorParams, Distance

client.create_collection(
    "kb",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    shard_number=4,
    replication_factor=2,
)

Output: the collection is split across 4 shards (per-node by default) with 2 replicas of each shard, distributed across cluster nodes

Snapshots. Snapshots are per-collection and per-shard. client.create_snapshot("kb") produces a tar archive on the server’s snapshot directory; restore via recover_snapshot. Coordinate with the writer to get a consistent view — Qdrant does not freeze writes during snapshot creation.

Multi-tenancy. The robust pattern is filter-per-tenant with a tenant_id payload field plus a payload index for fast filtering. Collection-per-tenant works for low tenant counts but the cluster’s collection-metadata overhead caps the practical limit at low thousands.

from qdrant_client.models import FieldCondition, Filter, MatchValue

client.create_payload_index(
    "kb",
    field_name="tenant_id",
    field_schema="keyword",
)
# Every query carries a tenant filter as a non-optional precondition
hits = client.search(
    "kb",
    query_vector=q,
    query_filter=Filter(must=[FieldCondition(key="tenant_id", match=MatchValue(value=tid))]),
    limit=10,
)

Output: the payload index makes the prefilter fast; the application enforces tenant isolation by injecting the filter on every call

Index tuning & retrieval quality#

Qdrant’s HNSW parameters can be set at collection creation and updated later (a key difference from Chroma — the index is rebuilt incrementally). The three knobs are m, ef_construct, and ef.

from qdrant_client.models import HnswConfigDiff, SearchParams

client.update_collection(
    "kb",
    hnsw_config=HnswConfigDiff(m=48, ef_construct=400),
)
# Per-query ef is the search-time knob; pass it as a typed SearchParams model
hits = client.search(
    "kb",
    query_vector=q,
    limit=10,
    search_params=SearchParams(hnsw_ef=256),
)

Output: the collection rebuilds its HNSW index in the background; per-query hnsw_ef controls recall vs latency on each call

Trade-off table:

| Parameter | Default | Higher value | Effect |
| --- | --- | --- | --- |
| m | 16 | 32–64 | better recall, more RAM per vector |
| ef_construct | 100 | 256–512 | better index quality, slower build |
| hnsw_ef (query) | 128 | 256–1024 | better recall, slower query |
| quantization | off | int8 / PQ | smaller index, slight recall hit |

On-disk vs in-memory. Vectors can live on_disk=True (memmapped) while the HNSW graph stays in RAM. Past a few million vectors this is the only practical setup; the memory budget is dominated by the graph, not the raw vectors.

Quantization. Scalar int8 quantization cuts RAM ~4× with typically <1% recall loss. Product Quantization (PQ) goes further (~16×) at higher recall cost. Use always_ram=True to pin the quantized index in memory while raw vectors page from disk.
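
The ~4× figure follows from component width: float32 uses 4 bytes per dimension and int8 uses 1. A back-of-the-envelope estimate, assuming a hypothetical collection of 10 million 768-dimensional vectors and ignoring the HNSW graph, payloads, and per-point overhead:

```python
def raw_vector_bytes(num_vectors: int, dim: int, bytes_per_component: int) -> int:
    """Bytes needed for the vector components alone (no graph, payload, or overhead)."""
    return num_vectors * dim * bytes_per_component


n, dim = 10_000_000, 768                               # illustrative sizes, not from the source
float32_gib = raw_vector_bytes(n, dim, 4) / 2**30      # original vectors
int8_gib = raw_vector_bytes(n, dim, 1) / 2**30         # scalar-quantized copy

print(f"float32: {float32_gib:.1f} GiB, int8: {int8_gib:.1f} GiB")
# float32: 28.6 GiB, int8: 7.2 GiB
```

With on_disk=True and always_ram=True, the smaller figure is roughly what has to stay resident for the vectors, while the float32 originals page in from disk.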

Hybrid recipes. Qdrant supports server-side Reciprocal Rank Fusion in recent versions; for older clients/servers, fuse client-side. The reference pattern: dense top-50 + sparse top-50, fuse with RRF, take top-10.
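
Client-side fusion is a few lines of plain Python. The sketch below implements standard Reciprocal Rank Fusion over lists of point IDs ordered best-first; rrf_fuse is a hypothetical helper name, and k=60 is the commonly used constant rather than anything Qdrant prescribes. In practice the ID lists would come from the id attribute of each hit returned by the dense and sparse searches.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=10):
    """Fuse ranked ID lists: score(id) = sum over lists of 1 / (k + rank), rank from 1."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, point_id in enumerate(ranking, start=1):
            scores[point_id] += 1.0 / (k + rank)
    # Highest fused score first; IDs present in several lists accumulate score.
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [point_id for point_id, _ in ordered[:top_n]]


dense_ids = [3, 1, 7, 9]    # illustrative dense ranking
sparse_ids = [1, 4, 3, 8]   # illustrative sparse ranking
print(rrf_fuse([dense_ids, sparse_ids], top_n=3))  # [1, 3, 4]
```

RRF uses only ranks, so the dense and sparse scores never need to be on a comparable scale.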

Version migration guide#

The qdrant-client library tracks server major/minor closely. Cross-major combinations should be avoided; cross-minor usually works but newer server features lag in typed Python wrappers.

0.x → 1.0 (early 2023):

  • Collection-config schemas reshaped. VectorParams, OptimizersConfigDiff, and HnswConfigDiff moved under qdrant_client.models with stricter validation. Old recreate_collection(...) calls now raise.
  • recreate_collection is equivalent to delete_collection followed by create_collection. The combined helper still exists but is data-destructive; use the explicit pair for idempotent code.

1.x minor-to-minor (notable):

  • 1.7 → 1.10: VectorParams, sparse-vector params, and quantization_config shapes tightened. Many previously optional fields became required.
  • Multivector support (multiple vectors per point with a single name) and named vectors evolved independently — read the changelog when adopting either.
  • points_batch deprecated in favour of upload_points / upload_collection.
  • Server features lead the client. Sparse vectors, multivector, and quantization typically appear in the server REST API a release before they get typed Python wrappers. Fall back to client.http (the auto-generated REST methods) or post raw JSON when the typed wrapper is missing.

Pinning strategy. Match qdrant-client minor to server minor exactly in production. qdrant-client~=1.10 paired with Qdrant server 1.10.x is the safe path; ~=1.10 allows patch upgrades but not minor drift.

Performance tuning#

The two transports — REST (HTTP) and gRPC — have very different cost profiles. Use gRPC for batch upload and any hot query loop; REST is fine for ad-hoc calls.

| Lever | Mechanism | When it helps |
| --- | --- | --- |
| prefer_grpc=True | gRPC over port 6334 | batch uploads, query throughput |
| upload_points(parallel=N) | concurrent batches | initial bulk-load |
| wait=False on upsert | fire-and-forget | high-throughput ingestion; consistency relaxed |
| on_disk=True for vectors | memmap raw vectors | large collections, RAM-bound |
| always_ram=True for quantization | pin quantized index | latency-critical reads |
| payload_index on filter fields | hashed lookup | filtered queries with selective predicates |
| AsyncQdrantClient | concurrent queries from asyncio | many concurrent users |

Async client. AsyncQdrantClient mirrors the sync surface with await semantics — required for high-concurrency FastAPI / aiohttp apps.

import asyncio
from qdrant_client import AsyncQdrantClient

async def main():
    client = AsyncQdrantClient(host="qdrant.internal", prefer_grpc=True)
    hits = await client.search("kb", query_vector=q, limit=10)
    await client.close()

asyncio.run(main())

Output: a coroutine that runs the search without blocking the event loop; always await client.close() to release gRPC channels
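
The benefit of the async client shows up when several queries are in flight at once. A minimal sketch of bounded fan-out, assuming a client object whose search() coroutine takes the same arguments as in the snippet above; search_many and max_in_flight are hypothetical names, and the cap of 8 is an arbitrary starting point rather than a tuned value:

```python
import asyncio


async def search_many(client, collection, query_vectors, limit=10, max_in_flight=8):
    """Run one search per query vector, with at most `max_in_flight` running at once."""
    gate = asyncio.Semaphore(max_in_flight)

    async def one(vector):
        async with gate:  # wait here if the cap is already reached
            return await client.search(collection, query_vector=vector, limit=limit)

    # gather() returns results in the same order as query_vectors
    return await asyncio.gather(*(one(v) for v in query_vectors))
```

Capping concurrency keeps a burst of requests from flooding the server; without the semaphore, gather would launch every query immediately.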

Batching guidance. For initial bulk-load: 256–1024 points per batch, 4–8 parallel batches. Smaller batches are network-overhead-bound; larger batches saturate server memory and trigger optimiser pauses.

Troubleshooting common errors#

  • ConnectionRefusedError on port 6334 — gRPC port not exposed. Many Docker Compose examples publish only 6333 (REST). Either expose 6334 or drop prefer_grpc=True.
  • ValidationError on VectorParams — collection-config schema changed. Regenerate the config call from current docs; old 1.7-era examples no longer validate on 1.10+.
  • Service Unavailable mid-upload — server is in optimiser pause (index rebuild). Lower parallel= and add max_retries=; the client retries with backoff.
  • recreate_collection deleted my data — by design. Use create_collection with a try/except on UnexpectedResponse, or check client.collection_exists("name") first.
  • Unauthorized against Qdrant Cloud — JWT token expired, or you used a Cloud token against a self-hosted instance. Tokens are tenant-scoped and time-bounded.
  • Typed wrapper missing for a server feature — drop to client.http.points_api.upsert_points(...) (the auto-generated REST client) or post raw JSON via requests.
  • Slow first query after restart — HNSW graph loads lazily from disk. Warm the cache with a synthetic query at startup.
  • gRPC client leaks file descriptors — always client.close() in long-running processes; the channel pool does not auto-clean on GC.
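
For the first item in the list, a quick TCP probe can tell you whether the gRPC port is published before you commit to prefer_grpc=True. This is a sketch using only the standard library; port_reachable is a hypothetical helper, and a successful TCP connect shows only that something is listening, not that it is a healthy Qdrant gRPC endpoint.

```python
import socket


def port_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unresolved host, etc.
        return False
```

A startup check could then read use_grpc = port_reachable("qdrant.internal", 6334) and pass prefer_grpc=use_grpc to the client constructor.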

Security considerations#

Qdrant ships with auth disabled by default — appropriate for localhost development, dangerous for any networked deployment.

  • API key auth. Set QDRANT__SERVICE__API_KEY on the server and pass api_key= to the client. Read-only keys are available via QDRANT__SERVICE__READ_ONLY_API_KEY.
  • JWT (Qdrant Cloud). Cloud customers use JWT tokens with role claims (read/write per collection). Tokens are issued from the Cloud console; rotate alongside other secrets.
  • TLS. Configure QDRANT__SERVICE__ENABLE_TLS=true with cert/key files, or terminate TLS at an ingress proxy. Without TLS, both API keys and vectors are visible on the wire.
  • mTLS for cluster traffic. Multi-node Qdrant clusters can require mutual TLS on the gossip and replication channels.
  • Multi-tenant isolation. Filter-per-tenant is the recommended pattern; enforce the filter in a wrapper rather than trusting every call site.
  • Payload size limits. Large per-point payloads (>10 KB) are accepted but slow queries down; consider storing only IDs in Qdrant and fetching full content from a separate store. That also shrinks the blast radius if the vector DB is compromised.
  • Prompt injection via retrieved content. Documents returned to the LLM may carry attack payloads. Sanitise before prompt assembly.
  • Snapshots are plaintext. Encrypt at the filesystem / object-store layer (s3://... SSE, EBS encryption, etc.).
  • Telemetry. Qdrant sends anonymous usage telemetry by default; disable via QDRANT__TELEMETRY_DISABLED=true if your compliance regime forbids it.
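
The "enforce the filter in a wrapper" advice from the multi-tenant item can be sketched as a small facade that binds the tenant once, so call sites cannot forget the filter. TenantScopedSearch and qdrant_tenant_filter are hypothetical names; the facade assumes the wrapped client exposes search() with the query_filter argument used earlier in this page, and the filter builder is injectable so the class can be exercised without a server.

```python
def qdrant_tenant_filter(tenant_id):
    """Build the tenant_id match filter with qdrant-client's typed models."""
    # Imported lazily so the facade itself carries no hard dependency.
    from qdrant_client.models import FieldCondition, Filter, MatchValue
    return Filter(must=[FieldCondition(key="tenant_id", match=MatchValue(value=tenant_id))])


class TenantScopedSearch:
    """Search facade bound to one tenant; the tenant filter is always applied."""

    def __init__(self, client, collection, tenant_id, build_filter=qdrant_tenant_filter):
        self._client = client
        self._collection = collection
        self._filter = build_filter(tenant_id)   # fixed at construction time

    def search(self, query_vector, limit=10):
        return self._client.search(
            self._collection,
            query_vector=query_vector,
            query_filter=self._filter,            # callers cannot override or drop it
            limit=limit,
        )
```

This version deliberately offers no way to pass extra filter conditions; if you need them, combine them with the tenant condition inside the wrapper rather than exposing query_filter to callers.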

Ecosystem integrations#

  • LangChain: langchain-qdrant package; QdrantVectorStore retriever.
  • LlamaIndex: llama-index-vector-stores-qdrant.
  • Haystack 2.x: qdrant-haystack from the integrations namespace.
  • Semantic Kernel: semantic-kernel[qdrant] extra; QdrantVectorStore via the new VectorStore abstraction.
  • DSPy: Qdrant retriever module ships with dspy-ai.
  • fastembed: embedding library by the same team; runs ONNX models for fast client-side embeddings without torch.
  • MCP: community MCP servers expose Qdrant collections as tools for agentic use.

When NOT to use this#

Qdrant earns its keep when production filtering, sharding, or hybrid search matter. The trade-offs below are where another tool fits better.

  • Notebook prototypes with no infra. chromadb in-process is friendlier — one pip install, no server. Move to Qdrant when latency, filtering, or scale demand it.
  • You want hybrid search out of the box, not in app code. Weaviate fuses BM25 + vector server-side without RRF stitching.
  • Postgres is already your operational database. pgvector adds vector search to an existing operational store; one less moving piece.
  • Very large clusters (>10B vectors). Milvus and Vespa have more battle-tested distributed stories at that scale.
  • Fully-managed-only deployments. Qdrant Cloud exists but you may prefer Pinecone if you do not want any awareness of the engine internals.

See also#