Qwen3 Embedding 4B
Qwen3 Embedding 4B is a mid-tier, 4-billion-parameter text embedding model that produces 2560-dimensional vectors over a 32.8K-token context. It is designed for multilingual semantic search and code retrieval, balancing quality with operational cost.
import { embed } from 'ai';

const result = await embed({
  model: 'alibaba/qwen3-embedding-4b',
  value: 'Sunny day at the beach',
});

What To Consider When Choosing a Provider
Zero Data Retention
AI Gateway supports Zero Data Retention for this model via direct gateway requests (BYOK is not included). To configure this, check the documentation.
Authentication
AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.
Provider selection is most consequential when your use case combines high query volume with data-sovereignty requirements; verify each provider's regional availability before finalizing your architecture.
When to Use Qwen3 Embedding 4B
Best For
Enterprise multilingual search:
Applications that require high multilingual precision but can't justify the full cost of an 8B model
Semantic similarity and clustering:
Datasets that span many languages or mix natural language with code
Quality-sensitive RAG:
Pipelines serving diverse user populations where retrieval quality visibly affects answer accuracy
Cross-lingual alignment:
Document alignment tasks where 2560-dimensional vectors provide better discriminability than smaller alternatives
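For the similarity and clustering use cases above, retrieved vectors are typically compared with cosine similarity. A minimal sketch of that scoring step (the helper below is illustrative, written here for clarity rather than taken from the `ai` SDK, which also ships its own `cosineSimilarity` utility):

```typescript
// Cosine similarity between two embedding vectors.
// Illustrative helper; assumes both vectors have the same length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Scores close to 1 indicate near-identical meaning; scores near 0 indicate unrelated content.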
Consider Alternatives When
Throughput and cost first:
When throughput and cost are the dominant constraints and slightly lower recall is acceptable, the 0.6B variant may be sufficient
Maximum retrieval precision:
Specialized domains with dense, technical vocabulary may be better served by the 8B model
Generative output required:
This model produces embeddings only; use a generative model when you need text output
Conclusion
Qwen3 Embedding 4B is a well-positioned middle-ground choice for teams building multilingual retrieval systems that need better-than-baseline precision without committing to the full resource footprint of larger models. Its 2560-dimensional output and flexible dimension truncation via MRL give engineers a range of tradeoff points across the storage-vs-quality curve.
FAQ
How many dimensions do the embeddings have, and can they be reduced?
The model outputs 2560-dimensional vectors by default. Matryoshka Representation Learning (MRL) allows prefix truncation to smaller sizes if storage or query-speed budgets require it.
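MRL truncation is a client-side operation: keep the first k components of the vector and re-normalize before indexing. A sketch of that step (illustrative helper, not an SDK function):

```typescript
// Truncate a Matryoshka (MRL) embedding to its first `dim` components
// and re-normalize to unit length so cosine similarity stays meaningful.
function truncateEmbedding(vector: number[], dim: number): number[] {
  const prefix = vector.slice(0, dim);
  const norm = Math.sqrt(prefix.reduce((sum, x) => sum + x * x, 0));
  return prefix.map((x) => x / norm);
}

// e.g. shrink a 2560-dim vector to 1024 dims before writing to the index
```

Apply the same truncation to both documents and queries; mixing dimensionalities in one index will produce meaningless scores.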
How does the 4B variant compare with the 0.6B and 8B models?
All three variants use a dual-encoder structure and share the same context window of 32.8K tokens. The 4B model uses 36 layers (compared to 28 in the 0.6B) and produces 2560-dimensional vectors, wider than the 0.6B's 1024 dimensions but narrower than the 8B's 4096.
Which languages does the model support?
Qwen3 Embedding 4B covers more than 100 natural languages plus multiple programming languages, enabling cross-lingual and code-retrieval tasks within a single embedding space.
Can embeddings be tailored to a specific retrieval task?
Yes. The model supports custom instruction prefixes at query time to guide the embedding toward a specific retrieval task, such as legal document search vs. general knowledge retrieval.
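An instruction prefix is applied by prepending a task description to the query text before embedding. The `Instruct:`/`Query:` template below follows the convention published for the Qwen3 Embedding family; verify the exact format against the model card for your deployment:

```typescript
// Build an instruction-prefixed query string for task-aware retrieval.
// Template follows the Qwen3 Embedding convention; confirm against the
// model card before relying on it in production.
function withInstruction(task: string, query: string): string {
  return `Instruct: ${task}\nQuery: ${query}`;
}

// Hypothetical legal-search task description for illustration:
const legalQuery = withInstruction(
  'Given a legal question, retrieve relevant statutes and case law',
  'statute of limitations for breach of contract',
);
```

Documents are typically embedded without a prefix; only queries carry the instruction.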
Can it embed source code alongside prose?
Yes. The Qwen3 Embedding models explicitly include code in their language coverage, so hybrid corpora of code and prose can be embedded in the same vector space.
How long should input passages be?
Longer passages embedded as single units can yield better recall for complex queries, but very long inputs near the ceiling of 32.8K tokens may dilute specificity. Experimenting with paragraph-level vs. section-level chunking is worthwhile for your specific domain.
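For chunking experiments, a paragraph-level splitter on blank lines is a reasonable baseline to compare against coarser section-level chunks. An illustrative helper:

```typescript
// Split a document into paragraph-level chunks on blank lines,
// trimming whitespace and dropping empty fragments. A baseline for
// chunking experiments; section-level chunking would instead group
// several consecutive paragraphs into one unit before embedding.
function paragraphChunks(doc: string): string[] {
  return doc
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);
}
```

Each returned chunk would then be embedded separately and indexed with a pointer back to its source document.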