Latency

Understand how Minds API response times compare to calling foundation models directly, and what contributes to the difference.

Overview

When you send a message through the Minds API, the request involves more than a raw LLM call. The API orchestrates several steps to ground the response in your mind's knowledge base, which yields higher-quality, contextual answers.

Typical response times:

Scenario	Latency
Direct foundation model call (no context)	1-3s
Minds API (with knowledge grounding)	5-12s
Minds API (simple greeting / no RAG)	2-4s

The additional time is spent on knowledge retrieval and grounding, which is what makes Minds responses more accurate and contextual than raw LLM calls.

What Happens During a Request

When you call POST /api/v1/sparks/{sparkId}/completion, the API performs these steps:

1. Authentication & mind loading          ~50ms
2. Knowledge retrieval (RAG)              ~1-3s
   - Semantic search across embeddings
   - Retrieve relevant knowledge chunks
3. Tool orchestration                     ~1-3s
   - Web search (if needed)
   - Knowledge grounding & citations
4. LLM generation                         ~1-3s
   - Same latency as calling the model directly
5. Response formatting & citations         ~50ms

Steps 2-3 are what differentiate Minds from a raw API call. They provide your mind with relevant context from its knowledge base, web search results, and grounded citations.
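As a rough sanity check, the per-step estimates above can be added up. The sketch below assumes the steps run one after another (the list does not say whether any of them overlap), and the stage names are illustrative labels, not API fields.

```python
# Approximate per-stage latency ranges (seconds) from the step list above.
# Stage names are illustrative labels, not fields returned by the API.
STAGES = {
    "auth_and_mind_loading": (0.05, 0.05),
    "knowledge_retrieval": (1.0, 3.0),
    "tool_orchestration": (1.0, 3.0),
    "llm_generation": (1.0, 3.0),
    "formatting_and_citations": (0.05, 0.05),
}

def total_range(stages):
    """Sum the lower and upper bounds across all stages (assumes sequential steps)."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high

def grounding_share(stages):
    """Fraction of the upper-bound total spent in steps 2-3 (retrieval + tools)."""
    _, high = total_range(stages)
    grounding = stages["knowledge_retrieval"][1] + stages["tool_orchestration"][1]
    return grounding / high

low, high = total_range(STAGES)
print(f"estimated total: {low:.1f}s to {high:.1f}s")
print(f"steps 2-3 share of upper bound: {grounding_share(STAGES):.0%}")
```

The summed range overlaps with, but sits somewhat below, the 5-12s typical figure quoted in the Overview, so the per-step numbers are best read as rough estimates.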

Benchmark Results

Measured on March 12, 2026. Each test ran 3 times with the same prompt. Minds API calls include full RAG pipeline and tool orchestration.
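To reproduce this kind of measurement, a minimal timing harness might look like the sketch below. It is not the script used for the published numbers; `benchmark` and `fake_request` are hypothetical names, and the stand-in workload should be replaced with a function that sends a real request.

```python
import statistics
import time

def benchmark(call, runs=3):
    """Time `call` `runs` times and return avg/min/max in milliseconds."""
    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples_ms.append((time.perf_counter() - start) * 1000)
    return {
        "avg_ms": statistics.mean(samples_ms),
        "min_ms": min(samples_ms),
        "max_ms": max(samples_ms),
    }

# Stand-in workload; replace with a function that sends the real request.
def fake_request():
    time.sleep(0.01)

result = benchmark(fake_request, runs=3)
print(result)
```

With only three runs per endpoint, the average is sensitive to a single slow call, so a larger `runs` value gives a steadier picture.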

Response Times by Model

Endpoint	Avg	Min	Max
Minds API (default)	12,166ms	10,951ms	13,910ms
Minds API (gpt-4o)	7,013ms	5,900ms	8,203ms
Minds API (gpt-4o-mini)	6,651ms	4,702ms	7,975ms
Minds API (gemini-2.5-flash)	7,553ms	5,170ms	11,198ms
Direct OpenAI (gpt-4o)	1,461ms	1,139ms	1,720ms
Direct OpenAI (gpt-4o-mini)	1,784ms	1,589ms	1,925ms
Direct Google (gemini-2.5-flash)	1,593ms	1,466ms	1,701ms

Overhead Breakdown

Model	Minds API	Direct	Overhead
gpt-4o-mini	6,651ms	1,784ms	+4,867ms
gpt-4o	7,013ms	1,461ms	+5,551ms
gemini-2.5-flash	7,553ms	1,593ms	+5,960ms

Average overhead: ~5.5 seconds across the three models that have a direct baseline. This overhead covers:

Semantic search across the mind's vector embeddings
Knowledge chunk retrieval and ranking
Web search validation (when applicable)
Citation mapping and response grounding
Tool orchestration pipeline
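The overhead figures can be rederived from the averages in the Response Times by Model table. The sketch below does that arithmetic and also reports the slowdown ratio, which the tables do not list. Working from the rounded averages, gpt-4o comes out 1ms higher than the table's +5,551ms; presumably (an assumption) the published value was computed from unrounded measurements.

```python
# Average latencies in ms, copied from the "Response Times by Model" table.
MINDS_MS = {"gpt-4o-mini": 6651, "gpt-4o": 7013, "gemini-2.5-flash": 7553}
DIRECT_MS = {"gpt-4o-mini": 1784, "gpt-4o": 1461, "gemini-2.5-flash": 1593}

def overhead_table(minds, direct):
    """Return {model: (overhead_ms, slowdown_ratio)} for models in both maps."""
    return {
        model: (minds[model] - direct[model], minds[model] / direct[model])
        for model in minds
        if model in direct
    }

rows = overhead_table(MINDS_MS, DIRECT_MS)
for model, (extra_ms, ratio) in rows.items():
    print(f"{model}: +{extra_ms:,}ms ({ratio:.1f}x direct)")

average_ms = sum(extra for extra, _ in rows.values()) / len(rows)
print(f"average overhead: {average_ms / 1000:.1f}s")
```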

What the Overhead Gives You

The additional latency buys context. A raw LLM call knows nothing about your domain; Minds adds:

Knowledge grounding: Responses are based on your mind's specific knowledge base, not just the model's training data
Automatic citations: Know exactly which sources informed the response
Web search validation: Cross-reference knowledge with live web data
Persona consistency: Responses maintain the mind's personality and communication patterns
Thinking patterns: Psychological modeling that shapes how the mind reasons

Optimizing Latency

Choose the Right Model

Use the model parameter to select faster models when appropriate:

# Fastest: lightweight models (replace {sparkId} and the API key with your own values)
curl -X POST "https://api.getminds.ai/v1/sparks/{sparkId}/completion" \
  -H "Authorization: Bearer minds_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Quick question"}],
    "model": "gpt-4o-mini"
  }'
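For callers who are not using curl, the same request can be sketched in Python with only the standard library. The URL, headers, and body mirror the curl example; `build_completion_request` is a hypothetical helper name, and the idea that omitting `model` selects the server default is an assumption based on the "server-selected" wording in this page.

```python
import json
import urllib.request

BASE_URL = "https://api.getminds.ai/v1"  # taken from the curl example above

def build_completion_request(spark_id, api_key, content, model=None):
    """Build (but do not send) a POST request mirroring the curl example."""
    payload = {"messages": [{"role": "user", "content": content}]}
    if model is not None:
        # Assumption: leaving "model" out lets the server pick its default.
        payload["model"] = model
    return urllib.request.Request(
        url=f"{BASE_URL}/sparks/{spark_id}/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

request = build_completion_request(
    "your_spark_id", "minds_your_api_key", "Quick question", model="gpt-4o-mini"
)
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request, timeout=30)
```

The 30-second timeout in the closing comment is a suggestion only, chosen to leave headroom over the slowest benchmark run shown in this page.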

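Model speed ranking (fastest to slowest):

1. gpt-4o-mini / gemini-2.5-flash - Best for speed-critical use cases
2. gpt-4o / claude-sonnet-4-5 - Balanced speed and quality
3. Default (server-selected) - Optimized for quality

Treat this ranking as a rough guide: in the benchmark above, gemini-2.5-flash averaged 7,553ms, slightly slower than gpt-4o at 7,013ms, and claude-sonnet-4-5 was not measured.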
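One way to apply the ranking is to pick a model from a latency budget. The sketch below uses the benchmark averages from this page; `pick_model` and `CANDIDATES` are hypothetical names, the candidate order is an assumed quality preference (the page does not rank quality within a tier), and claude-sonnet-4-5 is left out because no measurement is given for it.

```python
# (model, benchmark average in ms), ordered from the quality-oriented end of the
# ranking to the speed-oriented end. None means "omit the model parameter".
# The order within a tier is an assumption, not something this page specifies.
CANDIDATES = [
    (None, 12166),               # default (server-selected)
    ("gpt-4o", 7013),
    ("gemini-2.5-flash", 7553),
    ("gpt-4o-mini", 6651),
]

def pick_model(budget_ms, candidates=CANDIDATES):
    """Return the first candidate whose average latency fits within budget_ms.

    Falls back to the fastest candidate when nothing fits the budget.
    """
    for model, avg_ms in candidates:
        if avg_ms <= budget_ms:
            return model
    return min(candidates, key=lambda item: item[1])[0]

for budget in (15000, 7200, 5000):
    print(budget, "->", pick_model(budget))
```

Averages hide tail latency, so a stricter variant could compare the budget against the Max column instead.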

Next Steps

Chat API - Send messages and get responses
Knowledge API - Manage your spark's knowledge base
API Overview - Full endpoint reference
