Minds Team

Latency

Understand response latency, what contributes to it, and how it compares to calling foundation models directly.

Overview

When you send a message through the Minds API, the request involves more than a raw LLM call. The API orchestrates several steps that ground the response in your mind's knowledge base, which yields more accurate, contextual answers.

Typical response times:

Scenario                                    Latency
Direct foundation model call (no context)   1-3s
Minds API (with knowledge grounding)        5-12s
Minds API (simple greeting / no RAG)        2-4s

The additional time is spent on knowledge retrieval and grounding, which is what makes Minds responses more accurate and contextual than raw LLM calls.

What Happens During a Request

When you call POST /api/v1/sparks/{sparkId}/completion, the API performs these steps:

1. Authentication & mind loading          ~50ms
2. Knowledge retrieval (RAG)              ~1-3s
   - Semantic search across embeddings
   - Retrieve relevant knowledge chunks
3. Tool orchestration                     ~1-3s
   - Web search (if needed)
   - Knowledge grounding & citations
4. LLM generation                         ~1-3s
   - Same latency as calling the model directly
5. Response formatting & citations         ~50ms

Steps 2-3 are what differentiate Minds from a raw API call. They provide your mind with relevant context from its knowledge base, web search results, and grounded citations.
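
As a rough sanity check, the per-step estimates above can be added up. The sketch below is illustrative only: the dictionary and function names are made up for this example, and the numbers are the approximate ranges quoted in the list, not measured values.

```python
# Approximate per-step ranges from the list above, as (low, high) in milliseconds.
# The keys are illustrative labels, not API fields.
STEP_RANGES_MS = {
    "auth_and_mind_loading": (50, 50),
    "knowledge_retrieval": (1000, 3000),
    "tool_orchestration": (1000, 3000),
    "llm_generation": (1000, 3000),
    "formatting_and_citations": (50, 50),
}


def total_range_ms(steps):
    """Sum the low and high ends of every step range."""
    low = sum(lo for lo, _ in steps.values())
    high = sum(hi for _, hi in steps.values())
    return low, high


low, high = total_range_ms(STEP_RANGES_MS)
print(f"Estimated end-to-end latency: {low / 1000:.1f}s to {high / 1000:.1f}s")
```

The step figures are approximate, so the sum (about 3.1s to 9.1s) only loosely matches the 5-12s typical range quoted earlier. Steps 2-3 alone account for roughly 2s to 6s of the total.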

Benchmark Results

Measured on March 12, 2026. Each test ran 3 times with the same prompt. Minds API calls include the full RAG pipeline and tool orchestration.

Response Times by Model

Endpoint                            Avg        Min        Max
Minds API (default)                 12,166ms   10,951ms   13,910ms
Minds API (gpt-4o)                  7,013ms    5,900ms    8,203ms
Minds API (gpt-4o-mini)             6,651ms    4,702ms    7,975ms
Minds API (gemini-2.5-flash)        7,553ms    5,170ms    11,198ms
Direct OpenAI (gpt-4o)              1,461ms    1,139ms    1,720ms
Direct OpenAI (gpt-4o-mini)         1,784ms    1,589ms    1,925ms
Direct Google (gemini-2.5-flash)    1,593ms    1,466ms    1,701ms

Overhead Breakdown

Model              Minds API   Direct    Overhead
gpt-4o-mini        6,651ms     1,784ms   +4,867ms
gpt-4o             7,013ms     1,461ms   +5,551ms
gemini-2.5-flash   7,553ms     1,593ms   +5,960ms

Average overhead: ~5.5 seconds across all tested models. This overhead covers:

  • Semantic search across the mind's vector embeddings
  • Knowledge chunk retrieval and ranking
  • Web search validation (when applicable)
  • Citation mapping and response grounding
  • Tool orchestration pipeline
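
The overhead figures are the Minds API average minus the direct-call average for the same model. A minimal sketch of that arithmetic follows, using the averages from the benchmark tables; the variable and function names are illustrative. Because it subtracts the rounded averages, the gpt-4o result comes out 1ms above the table value, presumably a rounding difference.

```python
# Average latencies in milliseconds, copied from the benchmark tables above.
MINDS_AVG_MS = {"gpt-4o-mini": 6651, "gpt-4o": 7013, "gemini-2.5-flash": 7553}
DIRECT_AVG_MS = {"gpt-4o-mini": 1784, "gpt-4o": 1461, "gemini-2.5-flash": 1593}


def overhead_ms(minds, direct):
    """Per-model overhead: Minds API average minus direct-call average."""
    return {model: minds[model] - direct[model] for model in minds}


overheads = overhead_ms(MINDS_AVG_MS, DIRECT_AVG_MS)
mean_overhead_s = sum(overheads.values()) / len(overheads) / 1000

for model, ms in overheads.items():
    print(f"{model}: +{ms:,}ms")
print(f"Mean overhead: ~{mean_overhead_s:.1f}s")
```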

What the Overhead Gives You

The additional latency buys context. A raw LLM call knows nothing about your domain beyond the model's training data. Minds provides:

  1. Knowledge grounding: Responses are based on your mind's specific knowledge base, not just the model's training data
  2. Automatic citations: Know exactly which sources informed the response
  3. Web search validation: Cross-reference knowledge with live web data
  4. Persona consistency: Responses maintain the mind's personality and communication patterns
  5. Thinking patterns: Psychological modeling that shapes how the mind reasons

Optimizing Latency

Choose the Right Model

Use the model parameter to select faster models when appropriate:

# Fastest: lightweight models
curl -X POST "https://api.getminds.ai/v1/sparks/{sparkId}/completion" \
  -H "Authorization: Bearer minds_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Quick question"}],
    "model": "gpt-4o-mini"
  }'

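Approximate model speed ranking (fastest to slowest). In the benchmark above, the three named models averaged within about 1 second of each other through the Minds API, so treat the order as a rough guide: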

  1. gpt-4o-mini / gemini-2.5-flash - Best for speed-critical use cases
  2. gpt-4o / claude-sonnet-4-5 - Balanced speed and quality
  3. Default (server-selected) - Optimized for quality
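
To compare models on your own prompts, you can time a few calls per model. The sketch below uses only the Python standard library. The URL, header, and body shape are taken from the curl example above; the function names and the 60-second timeout are assumptions made for illustration, and the spark ID and API key are placeholders.

```python
import json
import time
import urllib.request

# Endpoint as written in the curl example above; {spark_id} is filled in per call.
API_URL = "https://api.getminds.ai/v1/sparks/{spark_id}/completion"


def build_request(spark_id, api_key, content, model=None):
    """Build the POST request; `model` is omitted to use the server default."""
    body = {"messages": [{"role": "user", "content": content}]}
    if model is not None:
        body["model"] = model
    return urllib.request.Request(
        API_URL.format(spark_id=spark_id),
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send_over_http(request):
    """Send the request and read the whole body so timing covers the full response."""
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read()


def time_call(send, request, runs=3):
    """Return (avg, min, max) wall-clock latency in milliseconds over `runs` calls."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        send(request)
        samples.append((time.perf_counter() - start) * 1000)
    return sum(samples) / len(samples), min(samples), max(samples)
```

For example, `time_call(send_over_http, build_request("your_spark_id", "minds_your_api_key", "Quick question", model="gpt-4o-mini"))` returns the average, minimum, and maximum for three runs. Three runs is a small sample, so expect noticeable variation between sessions.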

Keep Messages Concise

Shorter conversation histories reduce processing time. Only include relevant context in the messages array.
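
One simple way to bound the history is to send only the most recent messages. This is a sketch, not a recommendation from the API itself: the function name and the default cutoff of 6 messages are arbitrary assumptions, and the right cutoff depends on how much earlier context your conversation needs.

```python
def trim_history(messages, max_messages=6):
    """Return only the most recent `max_messages` entries of the messages array."""
    if max_messages <= 0:
        raise ValueError("max_messages must be positive")
    return messages[-max_messages:]


# Example: a 10-message conversation trimmed to its last 4 messages.
history = [{"role": "user", "content": f"message {i}"} for i in range(10)]
recent = trim_history(history, max_messages=4)
print(len(recent))
```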

Warm Minds

The first request to a mind after a period of inactivity may be slightly slower due to cold-start effects. Subsequent requests benefit from cached embeddings and warmed connections.
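
To check whether cold starts matter for your workload, compare the first request after an idle period with the requests that follow it. The helper below is a hypothetical sketch: it assumes you have already recorded per-request latencies in send order and simply reports how far the first one sits above the median of the rest.

```python
from statistics import median


def cold_start_penalty_ms(latencies_ms):
    """Return the first request's latency minus the median of the later requests.

    `latencies_ms` holds per-request latencies in send order, starting with the
    first request after a period of inactivity.
    """
    if len(latencies_ms) < 2:
        raise ValueError("need the first request plus at least one later request")
    first, *rest = latencies_ms
    return first - median(rest)
```

A result near zero suggests the mind was already warm; a consistently large positive value suggests the cold-start effect is worth planning around.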

Streaming (Coming Soon)

We're working on streaming support for the completion endpoint, which will deliver the first tokens much faster while the full response generates. This will significantly improve perceived latency for interactive applications.

Rate Limits

The API does not currently enforce hard rate limits. However, excessive concurrent requests may increase latency for all users. See Errors & Limits for details.
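
Since heavy concurrency can raise latency, it may help to cap the number of in-flight requests on the client side. The sketch below uses a thread pool from the Python standard library; the function name and the default limit of 4 are assumptions for illustration, not values published for the API. `send` is any callable that performs one request.

```python
from concurrent.futures import ThreadPoolExecutor


def run_with_limit(send, requests, max_concurrent=4):
    """Call send(request) for each request, at most `max_concurrent` at a time.

    Results are returned in the same order as `requests`.
    """
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(send, requests))
```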
