Latency
Understand response latency, what contributes to it, and how it compares to calling foundation models directly.
Overview
When you send a message through the Minds API, the request involves more than a raw LLM call. The API orchestrates several steps to ground the response in your mind's knowledge base, which yields higher-quality, more contextual answers.
Typical response times:
| Scenario | Latency |
|---|---|
| Direct foundation model call (no context) | 1-3s |
| Minds API (with knowledge grounding) | 5-12s |
| Minds API (simple greeting / no RAG) | 2-4s |
The additional time is spent on knowledge retrieval and grounding, which is what makes Minds responses more accurate and contextual than raw LLM calls.
What Happens During a Request
When you call POST /api/v1/sparks/{sparkId}/completion, the API performs these steps:
1. Authentication & mind loading: ~50ms
2. Knowledge retrieval (RAG): ~1-3s
   - Semantic search across embeddings
   - Retrieve relevant knowledge chunks
3. Tool orchestration: ~1-3s
   - Web search (if needed)
   - Knowledge grounding & citations
4. LLM generation: ~1-3s
   - Same latency as calling the model directly
5. Response formatting & citations: ~50ms
Steps 2-3 are what differentiate Minds from a raw API call. They provide your mind with relevant context from its knowledge base, web search results, and grounded citations.
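To see how the per-step figures above combine, here is a minimal sketch that sums the approximate ranges. The `STEP_BUDGET_MS` table and the step names are illustrative, copied from the list above; they are not values returned by the API, and the result is only a rough estimate.

```python
# Approximate per-step budgets in milliseconds, copied from the list above.
# Steps with a single figure use the same value for low and high.
STEP_BUDGET_MS = {
    "auth_and_mind_loading": (50, 50),
    "knowledge_retrieval": (1000, 3000),
    "tool_orchestration": (1000, 3000),
    "llm_generation": (1000, 3000),
    "formatting_and_citations": (50, 50),
}


def total_budget_ms(budget):
    """Return (low, high) totals across all steps."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


def grounding_share(budget, steps=("knowledge_retrieval", "tool_orchestration")):
    """Fraction of the high-end total spent in the given steps."""
    _, high = total_budget_ms(budget)
    return sum(budget[step][1] for step in steps) / high


low, high = total_budget_ms(STEP_BUDGET_MS)
print(f"Estimated end-to-end: {low / 1000:.1f}s to {high / 1000:.1f}s")
print(f"Share spent in steps 2-3 (high end): {grounding_share(STEP_BUDGET_MS):.0%}")
```

Under these figures, retrieval and tool orchestration account for roughly two thirds of the high-end estimate, which is why they dominate the difference from a direct model call.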
Benchmark Results
Measured on March 12, 2026. Each test was run 3 times with the same prompt. Minds API calls include the full RAG pipeline and tool orchestration.
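The measurement method described above (repeat each call and report average, minimum, and maximum) can be reproduced with a small harness. This is a sketch, not the script used for the published numbers; `call` is a placeholder for whatever function sends your request, whether to the Minds API or directly to a model provider.

```python
import statistics
import time


def benchmark(call, runs=3):
    """Time `call()` `runs` times and return avg/min/max in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000)
    return {
        "avg": statistics.mean(samples),
        "min": min(samples),
        "max": max(samples),
    }


# Example with a stand-in workload; replace the lambda with a real request.
result = benchmark(lambda: time.sleep(0.01), runs=3)
print({name: f"{value:,.0f}ms" for name, value in result.items()})
```

Three runs give only a coarse picture; increasing `runs` narrows the uncertainty at the cost of more API calls.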
Response Times by Model
| Endpoint | Avg | Min | Max |
|---|---|---|---|
| Minds API (default) | 12,166ms | 10,951ms | 13,910ms |
| Minds API (gpt-4o) | 7,013ms | 5,900ms | 8,203ms |
| Minds API (gpt-4o-mini) | 6,651ms | 4,702ms | 7,975ms |
| Minds API (gemini-2.5-flash) | 7,553ms | 5,170ms | 11,198ms |
| Direct OpenAI (gpt-4o) | 1,461ms | 1,139ms | 1,720ms |
| Direct OpenAI (gpt-4o-mini) | 1,784ms | 1,589ms | 1,925ms |
| Direct Google (gemini-2.5-flash) | 1,593ms | 1,466ms | 1,701ms |
Overhead Breakdown
| Model | Minds API | Direct | Overhead |
|---|---|---|---|
| gpt-4o-mini | 6,651ms | 1,784ms | +4,867ms |
| gpt-4o | 7,013ms | 1,461ms | +5,552ms |
| gemini-2.5-flash | 7,553ms | 1,593ms | +5,960ms |
Average overhead: ~5.5 seconds across the three models tested both through Minds and directly. This overhead covers:
- Semantic search across the mind's vector embeddings
- Knowledge chunk retrieval and ranking
- Web search validation (when applicable)
- Citation mapping and response grounding
- Tool orchestration pipeline
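The overhead figures are simply the Minds API average minus the direct-call average for the same model. A short sketch of that arithmetic, using the averages from the benchmark table (the dictionary names are illustrative):

```python
# Average latencies in milliseconds, copied from the benchmark table.
MINDS_AVG_MS = {"gpt-4o-mini": 6651, "gpt-4o": 7013, "gemini-2.5-flash": 7553}
DIRECT_AVG_MS = {"gpt-4o-mini": 1784, "gpt-4o": 1461, "gemini-2.5-flash": 1593}


def overhead_ms(minds, direct):
    """Per-model overhead: Minds API latency minus direct-call latency."""
    return {model: minds[model] - direct[model] for model in minds}


overheads = overhead_ms(MINDS_AVG_MS, DIRECT_AVG_MS)
mean_overhead = sum(overheads.values()) / len(overheads)

for model, extra in overheads.items():
    print(f"{model}: +{extra:,}ms")
print(f"Mean overhead: ~{mean_overhead / 1000:.1f}s")
```

Because the inputs are already rounded averages, a computed difference can deviate from a published figure by about a millisecond.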
What the Overhead Gives You
The additional latency buys context. A raw LLM call knows nothing about your domain, whereas Minds provides:
- Knowledge grounding: Responses are based on your mind's specific knowledge base, not just the model's training data
- Automatic citations: Know exactly which sources informed the response
- Web search validation: Cross-reference knowledge with live web data
- Persona consistency: Responses maintain the mind's personality and communication patterns
- Thinking patterns: Psychological modeling that shapes how the mind reasons
Optimizing Latency
Choose the Right Model
Use the model parameter to select faster models when appropriate:
```bash
# Fastest: lightweight models
curl -X POST "https://api.getminds.ai/v1/sparks/{sparkId}/completion" \
  -H "Authorization: Bearer minds_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Quick question"}],
    "model": "gpt-4o-mini"
  }'
```
Model speed ranking (fastest to slowest):
1. gpt-4o-mini / gemini-2.5-flash - Best for speed-critical use cases
2. gpt-4o / claude-sonnet-4-5 - Balanced speed and quality
3. Default (server-selected) - Optimized for quality
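For applications that call the API from code rather than the shell, the same request can be assembled in Python. This sketch only builds the request and leaves sending to you. It assumes the URL, headers, and body fields shown in the curl example; `build_completion_request` and `BASE_URL` are hypothetical names, and the response format is not covered here.

```python
import json
import urllib.request

# Base URL taken from the curl example; adjust if your deployment differs.
BASE_URL = "https://api.getminds.ai/v1"


def build_completion_request(spark_id, api_key, messages, model=None):
    """Build (but do not send) a POST request for the completion endpoint."""
    body = {"messages": messages}
    if model is not None:
        # Assumption: leaving "model" out falls back to the server-selected default.
        body["model"] = model
    return urllib.request.Request(
        url=f"{BASE_URL}/sparks/{spark_id}/completion",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_completion_request(
    spark_id="your_spark_id",
    api_key="minds_your_api_key",
    messages=[{"role": "user", "content": "Quick question"}],
    model="gpt-4o-mini",  # pick a lighter model for speed-critical paths
)
# To send it: urllib.request.urlopen(request, timeout=30)
```

Keeping `model` as an optional argument makes it easy to use a fast model for interactive paths and the default for quality-sensitive ones.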
Keep Messages Concise
Shorter conversation histories reduce processing time. Only include relevant context in the messages array.
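One way to keep the `messages` array short is to trim older turns on the client before sending. The sketch below is an illustration with arbitrary limits; `trim_history`, `max_messages`, and `max_chars` are hypothetical names, and a character count is only a rough stand-in for how much the server has to process.

```python
def trim_history(messages, max_messages=6, max_chars=4000):
    """Keep the newest messages that fit within both limits, in original order.

    The newest message is always kept, even if it alone exceeds max_chars.
    """
    kept = []
    used = 0
    for message in reversed(messages):
        size = len(message["content"])
        if len(kept) >= max_messages or (kept and used + size > max_chars):
            break
        kept.append(message)
        used += size
    kept.reverse()
    return kept


history = [{"role": "user", "content": f"turn {i}"} for i in range(10)]
recent = trim_history(history, max_messages=3)
print([m["content"] for m in recent])
```

Trimming by recency is the simplest policy. If early turns carry essential context, consider summarizing them into a single message instead of dropping them.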
Warm Minds
The first request to a mind after a period of inactivity may be slightly slower due to cold-start effects. Subsequent requests benefit from cached embeddings and warmed connections.
Streaming (Coming Soon)
We're working on streaming support for the completion endpoint, which will deliver the first tokens much faster while the full response generates. This will significantly improve perceived latency for interactive applications.
Rate Limits
The API does not currently enforce hard rate limits. However, excessive concurrent requests may increase latency for all users. See Errors & Limits for details.
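Since heavy concurrency can raise latency, a client can cap the number of in-flight requests itself. The following sketch uses a thread pool for that. `send` is a placeholder for your own request function, and the limit of 4 is an arbitrary example, not a documented threshold.

```python
from concurrent.futures import ThreadPoolExecutor


def run_bounded(send, payloads, max_concurrency=4):
    """Call send(payload) for each payload with at most max_concurrency in flight.

    Results are returned in the same order as the input payloads.
    """
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(send, payloads))


# Example with a stand-in function; replace it with a real request call.
results = run_bounded(lambda payload: payload.upper(), ["a", "b", "c"], max_concurrency=2)
print(results)
```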
Next Steps
- Chat API - Send messages and get responses
- Knowledge API - Manage your spark's knowledge base
- API Overview - Full endpoint reference