Performance Characteristics
Performance depends on infrastructure, Redis latency, Qdrant latency, embedding provider latency, upstream model latency, and cache hit rate.
| Scenario | Expected behavior |
|---|---|
| Exact cache hit | fastest path; no upstream call; no embedding lookup |
| Semantic cache hit | avoids upstream chat call; requires embedding and Qdrant lookup |
| Cache miss | dominated by upstream chat model latency |
Semantic cache latency depends on embedding provider latency and Qdrant search latency.
Cost and latency trade-offs
Exact cache hits usually provide the clearest performance and cost benefit because they avoid the upstream chat call without requiring an embedding lookup.
Semantic cache hits can avoid expensive upstream chat completions, but they introduce embedding overhead. This means semantic caching should be evaluated by both latency and net savings:
net savings = avoided upstream chat cost - embedding overhead
Semantic caching is usually most valuable when prompts are repeated or semantically similar, upstream models are relatively expensive, and cached responses are reused often enough to justify embedding cost.
For very cheap models or low-repeat workloads, semantic cache hit rates and gross savings may look positive while net savings are lower because embedding overhead reduces the benefit.
Useful latency and runtime metrics
aif_upstream_request_duration_seconds
aif_upstream_timeouts_total
aif_upstream_calls_total
aif_cache_exact_hits
aif_cache_semantic_hits
aif_cache_misses
aif_embedding_request_duration_seconds
aif_embedding_timeouts_total
aif_semantic_lookup_duration_seconds
Useful cost and savings metrics
aif_model_cost_micro_usd_total{model="..."}
aif_gross_saved_micro_usd_total{model="...", cache_type="exact|semantic"}
aif_net_saved_micro_usd_total{model="...", cache_type="exact|semantic"}
aif_embedding_overhead_micro_usd_total{model="...", operation="lookup|store"}
aif_request_cost_micro_usd_total{model="...", cost_type="chat|embedding"}
aif_model_requests_total{model="..."}
aif_cache_hits_total{model="...", cache_type="exact|semantic"}
All cost values are reported in micro-USD.
1 USD = 1,000,000 micro-USD