Providers
Quantized routes each request to a provider based on the endpoint and your configuration. You don’t need to manage separate API keys or accounts for each provider.
Supported providers
| Provider | Slug | Capabilities |
|---|---|---|
| OpenRouter | openrouter | Chat completions, Responses, Models, Embeddings |
| OpenAI Direct | openai | Embeddings, Image generation (DALL-E 2/3, gpt-image-1) |
| Anthropic | anthropic | Chat completions, Models |
| AWS Bedrock | bedrock | Chat completions, Responses, Bedrock-native embeddings, Image generation (Titan / Nova Canvas / Stability) |
| Google Gemini | gemini | Gemini-native embeddings, Image generation (Imagen 4, Gemini Flash Image) |
| Exa | exa | Web search, Content fetch |
| Tavily | tavily | Web search, Content fetch |
Default routing
Each capability has a default provider:
| Capability | Default Provider |
|---|---|
| Chat completions | OpenRouter |
| Responses | OpenRouter |
| Models | OpenRouter |
| Embeddings (/v1/embeddings) | OpenAI Direct |
| Bedrock-native embeddings (/v1/aws-bedrock/embeddings) | AWS Bedrock |
| Gemini-native embeddings (/v1/gemini/embeddings) | Google Gemini |
| Image generation (/v1/images/generations) | OpenAI Direct (resolved per-model) |
| Web search | Exa |
| Content fetch | Exa |
Choosing a provider
Use the X-Quantized-Provider header to override the default:
```bash
# Use Anthropic directly instead of OpenRouter
curl -X POST https://api.quantized.us/v1/chat/completions \
  -H "Authorization: Bearer sk-quantized-YOUR-KEY" \
  -H "X-Quantized-Provider: anthropic" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "anthropic/claude-sonnet-4",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# Use Tavily instead of Exa for web search
curl -X POST https://api.quantized.us/v1/web-search \
  -H "Authorization: Bearer sk-quantized-YOUR-KEY" \
  -H "X-Quantized-Provider: tavily" \
  -H "Content-Type: application/json" \
  -d '{"query": "latest AI news"}'
```
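Chat and Responses models use the author/model format (e.g., openai/gpt-4.1-mini, anthropic/claude-sonnet-4). Use the Models endpoint to list available model IDs. The embeddings and image-generation endpoints use the model ids listed in their own sections below (e.g., text-embedding-3-small, amazon.titan-embed-text-v2:0, dall-e-3).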
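For clients not using curl, the same override can be set from Python. This is a minimal standard-library sketch: build_chat_request is a hypothetical helper name, and the URL, headers, and body simply mirror the curl example above. The request is built but not sent, so it can be inspected first.

```python
import json
import urllib.request
from typing import Optional

API_BASE = "https://api.quantized.us/v1"


def build_chat_request(api_key: str, model: str, prompt: str,
                       provider: Optional[str] = None) -> urllib.request.Request:
    """Build (without sending) a chat-completions request, optionally pinning a provider."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    if provider is not None:
        # Provider slug from the Supported providers table, e.g. "anthropic".
        headers["X-Quantized-Provider"] = provider
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{API_BASE}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


req = build_chat_request("sk-quantized-YOUR-KEY", "anthropic/claude-sonnet-4",
                         "Hello!", provider="anthropic")
# To send it: urllib.request.urlopen(req)
```

Omitting provider leaves the header off, so the default route for the endpoint applies.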
Capability matrix
| Endpoint | OpenAI | OpenRouter | Anthropic | Bedrock | Gemini | Exa | Tavily |
|---|---|---|---|---|---|---|---|
| POST /v1/chat/completions | — | Yes (default) | Yes | Yes | — | — | — |
| POST /v1/responses | — | Yes (default) | — | Yes | — | — | — |
| POST /v1/embeddings | Yes (default) | Yes | — | — | — | — | — |
| POST /v1/aws-bedrock/embeddings | — | — | — | Yes (default, only) | — | — | — |
| POST /v1/gemini/embeddings | — | — | — | — | Yes (default, only) | — | — |
| POST /v1/images/generations | Yes (default) | — | — | Yes: Titan / Nova / Stability (inactive, pending ops) | Yes: Imagen 4 / Flash Image (inactive, pending paid tier) | — | — |
| GET /v1/models | — | Yes (default) | Yes | — | — | — | — |
| POST /v1/web-search | — | — | — | — | — | Yes (default) | Yes |
| POST /v1/fetch | — | — | — | — | — | Yes (default) | Yes |
Chat-completions modalities
Not every provider accepts every content part on POST /v1/chat/completions. Requests are additionally gated by the target model’s declared modalities — see the Models endpoint for per-model input_modality flags.
| Content part | OpenRouter | Anthropic | Bedrock |
|---|---|---|---|
| text | Yes | Yes | Yes |
| image_url | Yes | Yes | — |
| input_audio | Yes | — | — |
| video_url | Yes | — | — |
| file (PDF) | Yes (universal: works on all models via OpenRouter’s PDF parser) | — | — |
Sending an unsupported modality returns 400 with a descriptive error message (e.g. "Model 'openai/gpt-4.1-nano' does not support audio input") before the request reaches the provider.
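The provider columns above can be turned into a quick client-side pre-check before setting X-Quantized-Provider. A minimal sketch: SUPPORTED_PARTS, part_types, and compatible_providers are hypothetical names, the sets are copied from the table, and plain-string message content is assumed to count as a text part. The router still applies the per-model input_modality gate, so this only narrows the provider choice.

```python
from typing import Dict, Iterable, List, Set

# Content-part support per provider, copied from the table above.
SUPPORTED_PARTS: Dict[str, Set[str]] = {
    "openrouter": {"text", "image_url", "input_audio", "video_url", "file"},
    "anthropic": {"text", "image_url"},
    "bedrock": {"text"},
}


def part_types(messages: Iterable[dict]) -> Set[str]:
    """Collect the content-part types used across all messages."""
    types: Set[str] = set()
    for message in messages:
        content = message.get("content")
        if isinstance(content, str):
            types.add("text")  # plain-string content is treated as a text part
        elif isinstance(content, list):
            types.update(part["type"] for part in content)
    return types


def compatible_providers(messages: Iterable[dict]) -> List[str]:
    """Providers whose supported parts cover every part in the request."""
    needed = part_types(messages)
    return sorted(name for name, parts in SUPPORTED_PARTS.items() if needed <= parts)


request_messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "What is in this picture?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
    ],
}]
print(compatible_providers(request_messages))  # ['anthropic', 'openrouter']
```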
Anthropic-specific behavior
When routing through X-Quantized-Provider: anthropic, the router adapts OpenAI-shaped requests to Anthropic’s native /v1/messages format:
response_format — JSON output normalization
Anthropic’s API does not natively support the response_format parameter. The router emulates it by injecting a system-prompt instruction telling the model to return raw JSON. Some Claude models (notably Claude Haiku 4.5) still wrap their output in ```json ... ``` markdown fences despite this instruction.
To uphold the response_format contract (“callers asking for JSON get parseable JSON”), the router strips a single wrapping markdown fence from the response content when:
- the request specified response_format: {"type": "json_object" | "json_schema"}, and
- the response content is wrapped entirely in ```json ... ``` or ``` ... ``` (a fence embedded inside prose is not stripped).
This stripping is only applied to the Anthropic provider path — OpenRouter responses are forwarded as-is because OpenRouter handles response_format server-side.
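The stripping rule can be sketched as follows, assuming it is exactly the two conditions listed above. This illustrates the rule rather than reproducing the router’s code: strip_wrapping_fence and _WRAPPING_FENCE are hypothetical names, and the sketch also declines to strip when a second fence appears inside the captured body, which is how it stays limited to a single wrapping fence.

```python
import json
import re
from typing import Optional

TICKS = "`" * 3  # a markdown code-fence marker
# Matches content that is *entirely* one triple-backtick fence, optionally tagged
# "json", with only whitespace around it. Fences inside prose do not match.
_WRAPPING_FENCE = re.compile(
    r"\A\s*" + TICKS + r"(?:json)?[ \t]*\n(.*?)\n[ \t]*" + TICKS + r"\s*\Z",
    re.DOTALL,
)
_JSON_MODES = {"json_object", "json_schema"}


def strip_wrapping_fence(content: str, response_format: Optional[dict]) -> str:
    """Strip one wrapping fence, but only for JSON-mode requests."""
    if not response_format or response_format.get("type") not in _JSON_MODES:
        return content
    match = _WRAPPING_FENCE.match(content)
    if match is None or TICKS in match.group(1):
        return content  # no wrapping fence, or more than one fence: leave as-is
    return match.group(1)


raw = TICKS + 'json\n{"answer": 42}\n' + TICKS
print(json.loads(strip_wrapping_fence(raw, {"type": "json_object"})))  # {'answer': 42}
```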
Out-of-scope content parts
Anthropic’s chat endpoint currently receives only text and image_url content parts from the router. Requests containing input_audio, video_url, or file parts are accepted by the router’s serializer but would fail upstream if routed to Anthropic. Use OpenRouter (the default for chat completions) for these modalities.
OpenAI Direct (Embeddings)
OpenAI Direct is the default provider for POST /v1/embeddings. It calls OpenAI’s /v1/embeddings endpoint with Quantized’s pooled OPENAI_API_KEY. Clients don’t need their own OpenAI account — billing is unified through Quantized’s per-key credit balance.
Models in scope
| Model id | Native dimension | Public list rate |
|---|---|---|
| text-embedding-3-small | 1536 | $0.02 / 1M tokens |
| text-embedding-3-large | 3072 | $0.13 / 1M tokens |
| text-embedding-ada-002 | 1536 | $0.10 / 1M tokens |
Unknown OpenAI model ids fall back to a conservative default rate so the router never bills at $0 on a misconfigured request.
OpenRouter passthrough
X-Quantized-Provider: openrouter routes the same request through OpenRouter, which exposes OpenAI’s embedding models with an openai/ prefix. Because OpenRouter’s embedding response does not include a per-call cost, the router falls back to the OpenAI rate table after stripping the prefix.
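Billing for an embeddings call is prompt tokens times the per-1M rate. The sketch below applies the table above, strips the openai/ prefix for the OpenRouter path, and falls back to a default for unknown ids. The router’s actual fallback rate is not documented here, so FALLBACK_RATE (set to the highest listed rate) is a hypothetical placeholder, as is the function name embedding_cost_usd.

```python
# Public list rates from the table above, in USD per 1M tokens.
OPENAI_EMBEDDING_RATES = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
    "text-embedding-ada-002": 0.10,
}
# Hypothetical placeholder: the router's real default rate is not documented here.
# Using the highest listed rate keeps unknown ids from ever billing at $0.
FALLBACK_RATE = max(OPENAI_EMBEDDING_RATES.values())


def embedding_cost_usd(model: str, prompt_tokens: int) -> float:
    """Cost of one embeddings call under the OpenAI rate table."""
    # OpenRouter exposes these models as "openai/<id>"; strip the prefix first.
    if model.startswith("openai/"):
        model = model[len("openai/"):]
    rate = OPENAI_EMBEDDING_RATES.get(model, FALLBACK_RATE)
    return prompt_tokens * rate / 1_000_000


print(embedding_cost_usd("openai/text-embedding-3-large", 500_000))
```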
Bedrock-specific behavior
When routing through X-Quantized-Provider: bedrock, the router calls AWS Bedrock’s Converse API on Quantized’s AWS account. Clients don’t need their own AWS credentials — billing is unified through Quantized’s per-key credit balance.
Model resolution
Only models with a bedrock row in the model catalog are routable through this header. The catalog row’s model_name field carries the full Bedrock model id (e.g., amazon.nova-micro-v1:0); clients always reference the canonical Quantized id (e.g., amazon/nova-micro).
If a model has no Bedrock row, requests with X-Quantized-Provider: bedrock fail at resolution with 400 Provider 'bedrock' not available for model '<id>' before reaching AWS. Use GET /v1/models and look for providers[].provider == "bedrock" to list eligible models.
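A sketch for listing Bedrock-routable models from a GET /v1/models response. It assumes an OpenAI-style {"data": [...]} envelope in which each model carries the providers array mentioned above; the envelope key and the function name bedrock_routable are assumptions, so adjust them to the actual payload.

```python
from typing import List


def bedrock_routable(models_response: dict) -> List[str]:
    """Ids of models whose providers array includes a "bedrock" entry.

    Assumes an OpenAI-style {"data": [...]} envelope; adjust if the payload differs.
    """
    return [
        model["id"]
        for model in models_response.get("data", [])
        if any(p.get("provider") == "bedrock" for p in model.get("providers", []))
    ]


sample = {  # illustrative payload, not real catalog data
    "data": [
        {"id": "amazon/nova-micro", "providers": [{"provider": "bedrock"}]},
        {"id": "openai/gpt-4.1-mini", "providers": [{"provider": "openrouter"}]},
    ]
}
print(bedrock_routable(sample))  # ['amazon/nova-micro']
```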
Model access gates
Some Bedrock model families require AWS-side approval before they can be invoked, even when the catalog has a bedrock row for them:
| Family | Catalog model_name prefix | AWS-side approval |
|---|---|---|
| Amazon Nova | amazon.nova-* | None (invokable immediately) |
| Meta Llama, Mistral, Cohere | meta.*, mistral.*, cohere.* | One-line click-through, instant |
| Anthropic Claude | anthropic.claude-* | Use-case form (5 fields, manual approval) |
If the AWS account behind the router lacks access for a model, you’ll see a 404 with a message like "Model use case details have not been submitted for this account..." — that’s AWS, not the router. The fix is operator-side: enable the model in the AWS Console under Bedrock → Model access for the region. Until that’s done, route the same call to OpenRouter (the default) or pick a Nova model — Amazon’s own family has no gating.
Request translation
| OpenAI field | Bedrock Converse field |
|---|---|
| messages (with role: "system") | Split: system text becomes top-level system: [{"text": "..."}]; user/assistant messages stay in messages |
| max_tokens / max_completion_tokens | inferenceConfig.maxTokens |
| temperature | inferenceConfig.temperature |
| top_p | inferenceConfig.topP |
| stop (string or array) | inferenceConfig.stopSequences (always a list; entries must not be whitespace-only, see below) |
| tools + tool_choice | toolConfig.tools (toolSpec) + toolConfig.toolChoice |
| Assistant messages with tool_calls[] | Assistant content with toolUse blocks |
| role: "tool" (with tool_call_id) | User content with a toolResult block |
Stop-sequence quirk
Bedrock rejects whitespace-only stop sequences with 400 The stop sequence value at inferenceConfig.stopSequences.0 is blank. Other providers (OpenRouter, Anthropic native) accept them. If you need a request body that works across all providers, use printable stop sequences such as "###", "END", or "---" instead of "\n\n".
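To preview what Bedrock receives, the text-only rows of the translation table plus the stop-sequence check can be sketched client-side. Assumptions: message content is a plain string, tools are out of scope, max_completion_tokens wins when both token limits are set, and raising ValueError locally (instead of waiting for Bedrock’s 400) is this sketch’s own choice. to_converse is a hypothetical name.

```python
from typing import Any, Dict, List


def to_converse(openai_body: Dict[str, Any]) -> Dict[str, Any]:
    """Map a text-only OpenAI chat body onto Bedrock Converse fields (no tools)."""
    system: List[Dict[str, str]] = []
    messages: List[Dict[str, Any]] = []
    for msg in openai_body["messages"]:
        if msg["role"] == "system":
            system.append({"text": msg["content"]})  # hoisted to top-level system
        else:
            messages.append({"role": msg["role"], "content": [{"text": msg["content"]}]})

    config: Dict[str, Any] = {}
    # Assumption: max_completion_tokens takes precedence when both are present.
    max_tokens = openai_body.get("max_completion_tokens", openai_body.get("max_tokens"))
    if max_tokens is not None:
        config["maxTokens"] = max_tokens
    if "temperature" in openai_body:
        config["temperature"] = openai_body["temperature"]
    if "top_p" in openai_body:
        config["topP"] = openai_body["top_p"]
    stop = openai_body.get("stop")
    if stop is not None:
        stops = [stop] if isinstance(stop, str) else list(stop)  # always a list
        if any(not s.strip() for s in stops):
            # Bedrock would reject this with a 400 ("... stopSequences.0 is blank").
            raise ValueError("whitespace-only stop sequence; use e.g. '###' instead")
        config["stopSequences"] = stops

    converse: Dict[str, Any] = {"messages": messages}
    if system:
        converse["system"] = system
    if config:
        converse["inferenceConfig"] = config
    return converse


body = {
    "messages": [
        {"role": "system", "content": "Answer in one word."},
        {"role": "user", "content": "Capital of France?"},
    ],
    "max_tokens": 16,
    "stop": "###",
}
print(to_converse(body))
```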
Out-of-scope today
The following are accepted by the router but are not forwarded to Bedrock; they will produce unexpected behavior or no-ops on this provider path. Use OpenRouter (the default for chat completions) when you need them:
- Streaming (stream: true): the Bedrock provider does not implement streaming; requests will hang or error.
- Vision / multimodal content parts (image_url, input_audio, video_url, file): not forwarded, even when the underlying model supports them.
- response_format: JSON-mode emulation is not implemented for Bedrock; the parameter is silently dropped.
- Reasoning (reasoning.effort): Claude extended thinking via Bedrock is not yet wired through.
- frequency_penalty, presence_penalty, repetition_penalty, seed, logprobs, top_logprobs, logit_bias: Bedrock’s Converse API doesn’t accept them; they are silently dropped.
Bedrock-native Embeddings
POST /v1/aws-bedrock/embeddings is a native-shape passthrough — distinct from the OpenAI-compatible /v1/embeddings. It uses bedrock-runtime.invoke_model (NOT Converse — embedding models don’t speak Converse) and preserves Bedrock’s request/response shape byte-for-byte. See the full reference at AWS Bedrock Embeddings.
Models in scope
| Model id | Vendor | Native dimension | Public list rate |
|---|---|---|---|
| amazon.titan-embed-text-v2:0 | Amazon Titan | 256, 512, 1024 | $0.02 / 1M tokens |
| cohere.embed-english-v3 | Cohere | 1024 | $0.10 / 1M tokens |
| cohere.embed-multilingual-v3 | Cohere | 1024 | $0.10 / 1M tokens |
The endpoint accepts two distinct request bodies discriminated by the model prefix:
- amazon.titan-* → { model, inputText, dimensions?, normalize?, embeddingTypes? }
- cohere.* → { model, texts, input_type, embedding_types?, truncate? }
Mismatching the body shape and the model prefix (e.g. Cohere fields on a Titan model id) is rejected with 422 before reaching upstream.
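Building the right body for each model prefix can be sketched like this. bedrock_embedding_body is a hypothetical helper; defaulting Cohere’s required input_type to "search_document" (one of Cohere’s v3 input types) is an assumption, and extra keyword arguments such as dimensions or truncate are passed through unchanged.

```python
from typing import Any, Dict, List


def bedrock_embedding_body(model: str, texts: List[str], **options: Any) -> Dict[str, Any]:
    """Build the request body whose shape matches the model prefix."""
    if model.startswith("amazon.titan-"):
        # Titan takes a single string via inputText.
        if len(texts) != 1:
            raise ValueError("Titan takes exactly one inputText per request")
        return {"model": model, "inputText": texts[0], **options}
    if model.startswith("cohere."):
        # input_type is required for Cohere; this default is an assumption.
        input_type = options.pop("input_type", "search_document")
        return {"model": model, "texts": texts, "input_type": input_type, **options}
    raise ValueError(f"not a Titan or Cohere embedding model: {model}")


print(bedrock_embedding_body("amazon.titan-embed-text-v2:0", ["hello"], dimensions=512))
print(bedrock_embedding_body("cohere.embed-english-v3", ["a", "b"], truncate="END"))
```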
Token estimation for Cohere
Cohere’s response does not include a token count. The router estimates input tokens at ~4 chars per token (floored at 1) — conservative and rarely under-bills natural-language input. Titan returns inputTextTokenCount directly and is billed against the upstream count.
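The estimate can be reproduced to predict a Cohere bill. This sketch assumes the ~4 chars/token rule is applied with integer division to the total character count of all texts in the request and then floored at 1; the router may round or aggregate differently. estimate_tokens and estimate_cohere_cost are hypothetical names.

```python
from typing import List

COHERE_RATE_PER_M = 0.10  # USD per 1M tokens, from the table above


def estimate_tokens(texts: List[str]) -> int:
    """~4 characters per token over the whole request, never below 1."""
    total_chars = sum(len(t) for t in texts)
    return max(1, total_chars // 4)


def estimate_cohere_cost(texts: List[str]) -> float:
    """Estimated USD cost of one Cohere embedding request."""
    return estimate_tokens(texts) * COHERE_RATE_PER_M / 1_000_000


print(estimate_tokens(["Quantized routes each request."]))  # 7
```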
Google Gemini (Embeddings)
POST /v1/gemini/embeddings is a native-shape passthrough to Google’s generativelanguage.googleapis.com/v1beta endpoints. Clients don’t need their own Gemini API key — billing is unified through Quantized’s per-key credit balance. See the full reference at Gemini Embeddings.
Single vs batch routing
The router picks the upstream endpoint based on the cardinality of contents:
- 1 content → :embedContent ($0.15 per 1M tokens)
- N > 1 contents → :batchEmbedContents ($0.075 per 1M tokens, half-priced)
The endpoint field in the response confirms which upstream URL was used.
Sending multi-part content.parts to :embedContent (Gemini’s single-content endpoint) makes Gemini silently concatenate the parts into one string and return ONE vector for the concatenation — no error, 200 OK, wrong shape. The router always dispatches multi-content requests to :batchEmbedContents to avoid this. Treat any unexpected endpoint value as a router bug.
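The routing rule, plus a client-side guard against the one-vector-for-N-contents failure, can be sketched as follows. gemini_route and vector_count are hypothetical names, and vector_count assumes the response keeps Gemini’s native keys (embedding for :embedContent, embeddings for :batchEmbedContents).

```python
from typing import List, Tuple


def gemini_route(contents: List[dict]) -> Tuple[str, float]:
    """Upstream method and USD rate per 1M tokens for a given number of contents."""
    if not contents:
        raise ValueError("at least one content is required")
    if len(contents) == 1:
        return ":embedContent", 0.15
    return ":batchEmbedContents", 0.075  # batch is half-priced


def vector_count(response: dict) -> int:
    """Vectors returned, assuming Gemini's native response keys are preserved."""
    if "embeddings" in response:  # :batchEmbedContents
        return len(response["embeddings"])
    if "embedding" in response:  # :embedContent
        return 1
    return 0


contents = [{"parts": [{"text": "first"}]}, {"parts": [{"text": "second"}]}]
method, rate = gemini_route(contents)
response = {"embeddings": [{"values": [0.1, 0.2]}, {"values": [0.3, 0.4]}]}  # illustrative
# One vector per content; a mismatch means the wrong endpoint was used.
assert vector_count(response) == len(contents)
print(method, rate)  # :batchEmbedContents 0.075
```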
Models in scope
| Model id | Native dimension | Truncatable to |
|---|---|---|
| gemini-embedding-001 | 3072 | 768 |
Token estimation
Gemini’s embedding endpoints do not return token counts. Same heuristic as Cohere (~4 chars/token, floored at 1).
Image Generation
POST /v1/images/generations is a unified endpoint — there are no native passthroughs (no /v1/aws-bedrock/images/generations, no /v1/gemini/images/generations). All providers adapt to the same OpenAI-shape request/response.
Provider matrix
| Provider | Models | Native body shape | Status |
|---|---|---|---|
| OpenAI Direct | dall-e-2, dall-e-3, gpt-image-1 | OpenAI /v1/images/generations | Active |
| AWS Bedrock | amazon.titan-image-generator-v2:0, amazon.nova-canvas-v1:0 | { taskType, textToImageParams, imageGenerationConfig } | Catalog seeded, inactive (re-enable in AWS Console → Bedrock → Model access) |
| AWS Bedrock (Stability) | stability.stable-image-{core,ultra}-v1:0, stability.sd3-5-large-v1:0 | { prompt, aspect_ratio, output_format, seed?, negative_prompt? } | Catalog seeded, inactive (region-restricted to us-west-2) |
| Google Gemini (Imagen) | imagen-4.0-{fast-generate,generate,ultra-generate}-001 | :predict with instances + parameters | Catalog seeded, inactive (paid-tier Gemini API project required) |
| Google Gemini (Flash Image) | gemini-2.5-flash-image | :generateContent with responseModalities: [TEXT, IMAGE] | Catalog seeded, inactive |
Output transport
Forced to b64_json for every provider. Read images from data[].b64_json. There is no data[].url field — DALL-E URLs expire in ~60 minutes and would need a CDN rehost subsystem.
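Reading the images therefore means base64-decoding each data[].b64_json. A minimal standard-library sketch; decode_images is a hypothetical name, and skipping entries that are flagged or lack b64_json (such as moderation-blocked ones) is an assumption about their shape.

```python
import base64
from typing import List


def decode_images(response: dict) -> List[bytes]:
    """Decode every data[].b64_json entry into raw image bytes."""
    images = []
    for item in response.get("data", []):
        b64 = item.get("b64_json")
        if item.get("flagged") or not b64:
            continue  # assumption: flagged/moderated entries carry no image
        images.append(base64.b64decode(b64))
    return images


# Illustrative response; real payloads carry full PNG/JPEG/WebP data.
response = {"data": [{"b64_json": base64.b64encode(b"\x89PNG...").decode("ascii")}]}
for i, image in enumerate(decode_images(response)):
    # Save with e.g. open(f"image-{i}.png", "wb").write(image),
    # matching the extension to the requested output_format.
    print(i, len(image))
```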
Provider-specific field handling
| Field | DALL-E 2 | DALL-E 3 | gpt-image-1 | Bedrock Titan/Nova | Bedrock Stability | Imagen | Gemini Flash Image |
|---|---|---|---|---|---|---|---|
| prompt | Yes | Yes (auto-rewritten) | Yes | Yes | Yes | Yes | Yes (chat-style) |
| n | 1–10 | 1 only | 1 only | 1–5 | 1 only | 1–4 | 1 only |
| size | 256x256, 512x512, 1024x1024 | 1024x1024, 1024x1792, 1792x1024 | up to 3840px | WxH (varies per model) | mapped to nearest aspect_ratio | mapped to aspectRatio + imageSize (1K / 2K) | chat-style (no size knob) |
| quality | — | standard, hd | low, medium, high, auto | standard, premium | — | — | — |
| style | (ignored) | vivid, natural | (stripped) | (stripped) | (stripped) | (stripped) | (stripped) |
| background | (stripped) | (stripped) | transparent, opaque, auto | (stripped) | (stripped) | (stripped) | (stripped) |
| output_format | (stripped) | (stripped) | png, jpeg, webp | (stripped) | png, jpeg, webp | (stripped) | (stripped) |
| seed | (stripped) | (stripped) | (stripped) | Yes | Yes | (stripped) | (stripped) |
| negative_prompt | (stripped) | (stripped) | (stripped) | negativeText | negative_prompt | (stripped) | (stripped) |
Pricing
| Model | Pricing model | Source |
|---|---|---|
| dall-e-2 | $0.016 / $0.018 / $0.020 per image (by size) | OpenAI public list |
| dall-e-3 | $0.040 to $0.120 per image (by size × quality) | OpenAI public list |
| gpt-image-1 | Token-priced: $5/M input + $40/M output tokens | OpenAI public list |
| Bedrock Titan | $0.008 to $0.014 per image (by size × quality) | AWS Bedrock public list |
| Bedrock Nova Canvas | $0.040 to $0.080 per image (by size × quality) | AWS Bedrock public list |
| Bedrock Stability | $0.030 / $0.065 / $0.080 per image (flat per model) | AWS Bedrock public list |
| Imagen 4 Fast / Standard / Ultra | $0.020 / $0.040 / $0.060 per image (flat per tier) | Google AI Studio public list |
| Gemini 2.5 Flash Image | ~$0.039 per 1024² image (token-priced) | Google AI Studio public list |
The router computes the per-call cost in the provider code (see Errors for how a $0 cost is recorded when upstream returns an error).
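A sketch of the per-call arithmetic for the rows whose prices are fully specified above. Assumptions: the three dall-e-2 prices map to 256x256, 512x512, and 1024x1024 in that order, and the Imagen 4 tiers map to the three imagen-4.0 ids listed in the provider matrix. dall-e-3, Titan, Nova Canvas, and Stability are omitted because their exact price grids are not given here. PER_IMAGE_USD, per_image_cost, and gpt_image_1_cost are hypothetical names.

```python
from typing import Dict, Optional, Tuple

# USD per image. Assumption: dall-e-2 prices map to sizes in ascending order.
PER_IMAGE_USD: Dict[Tuple[str, Optional[str]], float] = {
    ("dall-e-2", "256x256"): 0.016,
    ("dall-e-2", "512x512"): 0.018,
    ("dall-e-2", "1024x1024"): 0.020,
    # Imagen 4 tiers are flat per image, regardless of size.
    ("imagen-4.0-fast-generate-001", None): 0.020,
    ("imagen-4.0-generate-001", None): 0.040,
    ("imagen-4.0-ultra-generate-001", None): 0.060,
}


def per_image_cost(model: str, n: int, size: Optional[str] = None) -> float:
    """Flat per-image pricing; falls back to the size-independent entry."""
    key = (model, size) if (model, size) in PER_IMAGE_USD else (model, None)
    return n * PER_IMAGE_USD[key]


def gpt_image_1_cost(input_tokens: int, output_tokens: int) -> float:
    """Token pricing: $5 per 1M input tokens plus $40 per 1M output tokens."""
    return (input_tokens * 5 + output_tokens * 40) / 1_000_000


print(per_image_cost("dall-e-2", n=2, size="512x512"))
print(gpt_image_1_cost(input_tokens=1_000, output_tokens=2_000))
```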
Watermarking
The unified response includes a watermark enum on each data[] entry:
- c2pa: gpt-image-1 (always)
- provenance: Amazon Titan, Nova Canvas (always)
- synthid: Google Imagen, Gemini Flash Image (always)
- none: DALL-E 2, DALL-E 3, Bedrock Stability
Consumers targeting education customers should consider rendering a disclosure when watermark != "none".
Content moderation
Bedrock Titan/Nova return 200 OK with an empty images array and an inline error string when content moderation triggers — they do not return a 4xx. The router surfaces this as a successful response with data: [{flagged: true, ...}] and usage.images: 0. Billing is $0 for moderation-blocked generations.
DALL-E 2/3, gpt-image-1, Stability, and Imagen all return standard error responses on moderation blocks (400 with the upstream message), which the router maps to its standard error hierarchy.
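Consumers can handle both the moderation flag and the watermark enum in one pass over data[]. This sketch counts usable and flagged entries and reports whether any usable image carries a watermark; summarize_images is a hypothetical name, and treating a missing watermark field as "none" is an assumption.

```python
from typing import Tuple


def summarize_images(response: dict) -> Tuple[int, int, bool]:
    """Return (usable images, moderation-flagged entries, disclosure needed)."""
    usable = flagged = 0
    needs_disclosure = False
    for item in response.get("data", []):
        if item.get("flagged"):
            flagged += 1  # moderation-blocked: no image, billed at $0
            continue
        usable += 1
        if item.get("watermark", "none") != "none":
            needs_disclosure = True  # c2pa, provenance, or synthid
    return usable, flagged, needs_disclosure


# Illustrative response mixing a watermarked image and a blocked one.
response = {"data": [{"b64_json": "...", "watermark": "synthid"}, {"flagged": True}]}
print(summarize_images(response))  # (1, 1, True)
```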
Provider errors
If the upstream provider fails (timeout, rate limit, authentication error), Quantized returns a 503 with a generic message:
```json
{
  "error": {
    "message": "Service temporarily unavailable"
  }
}
```
Internal provider errors are masked to avoid leaking infrastructure details. See Errors for the full error reference.