Gemini Embeddings

POST /v1/gemini/embeddings

Native passthrough to Google Gemini’s embedding API. The request shape mirrors models/{model}:embedContent and models/{model}:batchEmbedContents; the response is normalized to a single list-of-embeddings format, with a usage block added.

When to use this endpoint vs /v1/embeddings

Use /v1/embeddings for the unified OpenAI-compatible shape. Use this endpoint when you need Gemini-specific fields — task_type, output_dimensionality truncation, optional document title — that the OpenAI shape doesn’t expose.

Headers

| Header | Required | Description |
| --- | --- | --- |
| Authorization | Yes | Bearer <api-key-or-jwt> |
| Content-Type | Yes | application/json |

Request body

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| model | string | Yes | | Model id (e.g. gemini-embedding-001) |
| contents | array | Yes | | One or more content objects. Each item produces one embedding |
| contents[].parts | array | Yes | | At least one {text: string} part per content |
| task_type | string | No | null | One of RETRIEVAL_QUERY, RETRIEVAL_DOCUMENT, SEMANTIC_SIMILARITY, CLASSIFICATION, CLUSTERING, QUESTION_ANSWERING, FACT_VERIFICATION, CODE_RETRIEVAL_QUERY |
| output_dimensionality | integer | No | 3072 | Vector dimension. Must be >= 1. Common values: 768, 3072 |
| title | string | No | null | Document title; only meaningful when task_type is RETRIEVAL_DOCUMENT |

Strict validation

The serializer uses extra="forbid". Fields outside the list above are rejected with 422. Notable rejects: OpenAI’s input, dimensions, encoding_format, user; Cohere’s texts, input_type, embedding_types.
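Because unknown fields are rejected outright, it can help to check a payload on the client before sending it. The sketch below is a hypothetical helper: the names `ALLOWED_FIELDS` and `unknown_fields` are not part of the API. It only mirrors the top-level field list from the request-body table and does not reproduce the server's full validation.

```python
# Top-level fields accepted by /v1/gemini/embeddings, per the request-body table.
# ALLOWED_FIELDS and unknown_fields are illustrative names, not part of the API.
ALLOWED_FIELDS = {"model", "contents", "task_type", "output_dimensionality", "title"}


def unknown_fields(payload: dict) -> list[str]:
    """Return the top-level keys that strict validation would reject."""
    return sorted(set(payload) - ALLOWED_FIELDS)


# An OpenAI-style body: "input" and "dimensions" are not accepted by this endpoint.
openai_style = {"model": "gemini-embedding-001", "input": ["hello"], "dimensions": 768}
print(unknown_fields(openai_style))  # ['dimensions', 'input']
```

An empty list means the top-level keys are acceptable; nested shape and value checks still happen on the server.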

Single vs batch routing

The router picks the upstream endpoint based on the cardinality of contents:

  • 1 content → :embedContent ($0.15 per 1M tokens)
  • N > 1 contents → :batchEmbedContents ($0.075 per 1M tokens, 50% cheaper)

Why this matters

Gemini’s :embedContent endpoint silently concatenates multi-part content into a single string and returns ONE vector for everything — no error, 200 OK, wrong shape. The router defends against this by always dispatching multi-content requests to :batchEmbedContents. The endpoint field in the response confirms which upstream URL was used; treat any unexpected value (e.g. embedContent for a multi-content request) as a router bug.
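A client can verify the routing rule itself. The following is a minimal sketch, assuming the response fields described on this page (`endpoint`, `embeddings`); the function names `expected_endpoint` and `check_routing` are hypothetical, and the response dict is a hand-written stand-in rather than real output.

```python
def expected_endpoint(payload: dict) -> str:
    """Upstream endpoint the router should pick for this request body."""
    return "embedContent" if len(payload["contents"]) == 1 else "batchEmbedContents"


def check_routing(payload: dict, response: dict) -> None:
    """Raise if the response contradicts the documented routing rule."""
    expected = expected_endpoint(payload)
    actual = response.get("endpoint")
    if actual != expected:
        raise RuntimeError(f"expected {expected!r}, router reported {actual!r}")
    # Guard against the silent-concatenation failure mode as well:
    # one embedding must come back per input content.
    if len(response["embeddings"]) != len(payload["contents"]):
        raise RuntimeError("embedding count does not match number of contents")


payload = {
    "model": "gemini-embedding-001",
    "contents": [{"parts": [{"text": "first"}]}, {"parts": [{"text": "second"}]}],
}
# Hand-written stand-in for a real response, for illustration only.
response = {
    "endpoint": "batchEmbedContents",
    "embeddings": [{"values": [0.1]}, {"values": [0.2]}],
}
check_routing(payload, response)  # passes silently
```

Running a check like this after every call costs almost nothing and turns a silent shape mismatch into an explicit error.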

Examples

cURL: single input

curl -X POST https://api.quantized.us/v1/gemini/embeddings \
  -H "Authorization: Bearer sk-quantized-YOUR-KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini-embedding-001",
    "contents": [
      {"parts": [{"text": "The quick brown fox jumps over the lazy dog."}]}
    ],
    "task_type": "RETRIEVAL_DOCUMENT"
  }'

cURL: batch (3 inputs)

curl -X POST https://api.quantized.us/v1/gemini/embeddings \
  -H "Authorization: Bearer sk-quantized-YOUR-KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini-embedding-001",
    "contents": [
      {"parts": [{"text": "first"}]},
      {"parts": [{"text": "second"}]},
      {"parts": [{"text": "third"}]}
    ],
    "task_type": "RETRIEVAL_DOCUMENT"
  }'

Python (httpx)

import httpx

resp = httpx.post(
    "https://api.quantized.us/v1/gemini/embeddings",
    headers={"Authorization": "Bearer sk-quantized-YOUR-KEY"},
    json={
        "model": "gemini-embedding-001",
        "contents": [
            {"parts": [{"text": "What is a fox?"}]}
        ],
        "task_type": "RETRIEVAL_QUERY",
        "output_dimensionality": 768,
    },
)
resp.raise_for_status()  # fail on 4xx/5xx instead of parsing an error body
data = resp.json()
print(data["endpoint"])              # "embedContent"
print(len(data["embeddings"][0]["values"]))  # 768

Response

{
  "model": "gemini-embedding-001",
  "embeddings": [
    {"values": [0.0123, -0.0456, 0.0789, ...]}
  ],
  "endpoint": "embedContent",
  "usage": {
    "credits_used": 12,
    "credits_remaining": 999988
  }
}

| Field | Type | Description |
| --- | --- | --- |
| model | string | Echoed from the request |
| embeddings | array | One entry per input content, in order. Single-input requests return a 1-element list |
| embeddings[].values | array of floats | The vector |
| endpoint | string | Either "embedContent" or "batchEmbedContents": the upstream URL the router called |
| usage.credits_used | integer | Micro-credits consumed (billed at the rate corresponding to endpoint) |
| usage.credits_remaining | integer or null | Micro-credits remaining (null for unlimited licenses) |

Why is the response shape different from native Gemini?

Gemini’s :embedContent returns {embedding: {values: [...]}} (singular), while :batchEmbedContents returns {embeddings: [{values: [...]}, ...]} (plural). The router normalizes both to the plural list shape so client code doesn’t branch on cardinality, and surfaces the upstream choice via the endpoint field.
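The normalization amounts to wrapping the singular shape in a list. The sketch below illustrates the idea using the two native shapes quoted above; `normalize` is a hypothetical name and this is not the router's actual code.

```python
def normalize(upstream: dict) -> list[dict]:
    """Convert either native Gemini response shape to a list of {"values": [...]}."""
    if "embeddings" in upstream:
        # :batchEmbedContents already returns the plural list shape.
        return list(upstream["embeddings"])
    # :embedContent returns a single object under "embedding"; wrap it in a list.
    return [upstream["embedding"]]


single = {"embedding": {"values": [0.1, 0.2]}}
batch = {"embeddings": [{"values": [0.1]}, {"values": [0.2]}]}
print(len(normalize(single)), len(normalize(batch)))  # 1 2
```

With both shapes reduced to a list, client code can always iterate over `embeddings` without checking how many inputs it sent.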

Gemini token estimation

Gemini’s embedding endpoints do not return token counts. Quantized estimates input tokens at roughly 4 characters per token, with a minimum of 1. The estimate is conservative: for natural-language input it rarely comes out below the true token count.
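To budget a request before sending it, the same heuristic can be approximated on the client. This is a rough sketch: the function name `estimate_tokens` is hypothetical, and the use of integer division is an assumption, since the rounding rule is not specified here.

```python
def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Approximate token count: about 4 characters per token, never below 1."""
    # Integer division is an assumption; the exact rounding rule is not documented.
    return max(1, len(text) // chars_per_token)


print(estimate_tokens("hi"))  # 1 (the minimum applies)
print(estimate_tokens("The quick brown fox jumps over the lazy dog."))  # 11 (44 characters)
```

Treat the result as an approximation only; the credits actually charged are reported in usage.credits_used.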

Models

| Model id | Native dimension | Truncatable to | Public list rate (single / batch) |
| --- | --- | --- | --- |
| gemini-embedding-001 | 3072 | 768 | $0.15 / $0.075 per 1M tokens |

Filter for Gemini embedding models via GET /v1/models:

# `models` is assumed to be the list of model objects parsed from GET /v1/models.
gemini_embed_models = [
    m for m in models
    if "gemini_embeddings" in m.get("supported_features", [])
]

Providers

| Provider | Slug | Default? |
| --- | --- | --- |
| Google Gemini | gemini | Yes (and only) |

This endpoint is Gemini-only — X-Quantized-Provider is ignored.

Errors

| Status | Condition |
| --- | --- |
| 401 | Invalid or missing API key |
| 402 | Insufficient credits |
| 404 | Unknown model id (catalog gate) |
| 422 | Validation: missing contents, empty contents, unsupported field, invalid task_type value, output_dimensionality < 1 |
| 400 | Upstream Gemini error (e.g. invalid task type combination); sanitized message returned |
| 503 | Upstream throttling, transient unavailability, or auth failure on the Gemini API key |

Out of scope today

The following are not accepted by v1 and may be added in a future release:

  • Multimodal Gemini embeddings (image, audio, video parts)
  • Vertex AI endpoint variant (this endpoint talks to generativelanguage.googleapis.com, not the Vertex API)
  • Concurrent fan-out to :countTokens for precise token counts