Performance Characteristics

Performance depends on infrastructure, Redis latency, Qdrant latency, embedding provider latency, upstream model latency, and cache hit rate.

Scenario	Expected behavior
Exact cache hit	fastest path; no upstream call; no embedding lookup
Semantic cache hit	avoids upstream chat call; requires embedding and Qdrant lookup
Cache miss	dominated by upstream chat model latency

Semantic cache latency depends on embedding provider latency and Qdrant search latency.
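
The three scenarios in the table can be read as a lookup order. The sketch below is illustrative only: it assumes the exact cache is consulted before the semantic cache, and `resolve`, `exact_lookup`, `semantic_lookup`, and `call_upstream` are hypothetical names, not part of any documented API.

```python
from typing import Callable, Optional, Tuple

# Hypothetical lookup signature: return the cached response, or None on a miss.
Lookup = Callable[[str], Optional[str]]


def resolve(
    prompt: str,
    exact_lookup: Lookup,
    semantic_lookup: Lookup,
    call_upstream: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (response, path), where path is 'exact', 'semantic', or 'miss'."""
    # Exact hit: no embedding request and no upstream call.
    cached = exact_lookup(prompt)
    if cached is not None:
        return cached, "exact"

    # Semantic hit: the lookup is assumed to embed the prompt and search the
    # vector store, so it costs one embedding request plus one search.
    cached = semantic_lookup(prompt)
    if cached is not None:
        return cached, "semantic"

    # Miss: latency is dominated by the upstream chat model.
    return call_upstream(prompt), "miss"


# Stand-in stores and upstream, for illustration only.
exact_store = {"What is Redis?": "An in-memory data store."}
print(resolve("What is Redis?", exact_store.get, lambda p: None, lambda p: "upstream answer"))
```

Passing the lookups in as callables keeps the sketch independent of any particular Redis or Qdrant client.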

Cost and latency trade-offs

Exact cache hits usually provide the clearest performance and cost benefit because they avoid the upstream chat call without requiring an embedding lookup.

Semantic cache hits can avoid expensive upstream chat completions, but they introduce embedding overhead. This means semantic caching should be evaluated by both latency and net savings:

net savings = avoided upstream chat cost - embedding overhead

Semantic caching is usually most valuable when prompts are repeated or semantically similar, upstream models are relatively expensive, and cached responses are reused often enough to justify embedding cost.

For very cheap models or low-repeat workloads, the semantic cache hit rate and gross savings can both look positive while net savings are much smaller, or even negative, because embedding overhead offsets the avoided chat cost.
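
To make the net savings formula concrete, here is a minimal sketch in integer micro-USD. The prices and request counts are invented for illustration, and the model assumes that every semantic lookup pays for one embedding whether it hits or misses; `SemanticCacheWorkload` and its fields are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SemanticCacheWorkload:
    lookups: int                   # requests that reached the semantic cache
    hits: int                      # lookups served from the semantic cache
    chat_cost_micro_usd: int       # upstream chat cost avoided per hit
    embedding_cost_micro_usd: int  # embedding cost paid per lookup (assumed)


def gross_saved(w: SemanticCacheWorkload) -> int:
    # Avoided upstream chat cost.
    return w.hits * w.chat_cost_micro_usd


def embedding_overhead(w: SemanticCacheWorkload) -> int:
    # Assumption: one embedding per lookup, hit or miss.
    return w.lookups * w.embedding_cost_micro_usd


def net_saved(w: SemanticCacheWorkload) -> int:
    # net savings = avoided upstream chat cost - embedding overhead
    return gross_saved(w) - embedding_overhead(w)


# Expensive model, frequently repeated prompts.
expensive = SemanticCacheWorkload(lookups=10_000, hits=3_000,
                                  chat_cost_micro_usd=5_000,
                                  embedding_cost_micro_usd=20)
# Cheap model, rarely repeated prompts.
cheap = SemanticCacheWorkload(lookups=10_000, hits=500,
                              chat_cost_micro_usd=200,
                              embedding_cost_micro_usd=20)

print(gross_saved(expensive), net_saved(expensive))  # 15000000 14800000
print(gross_saved(cheap), net_saved(cheap))          # 100000 -100000
```

In the second workload, gross savings are positive but the embedding overhead exceeds them, so net savings fall below zero.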

Useful latency and runtime metrics

aif_upstream_request_duration_seconds
aif_upstream_timeouts_total
aif_upstream_calls_total
aif_cache_exact_hits
aif_cache_semantic_hits
aif_cache_misses
aif_embedding_request_duration_seconds
aif_embedding_timeouts_total
aif_semantic_lookup_duration_seconds
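
As an example of how these counters can be combined, the sketch below derives hit and miss ratios from the three cache counters. It assumes each request increments exactly one of `aif_cache_exact_hits`, `aif_cache_semantic_hits`, or `aif_cache_misses`, which the metric names suggest but which should be verified against the actual instrumentation. The sample values are invented, and `cache_ratios` is a hypothetical helper.

```python
from typing import Dict, Mapping


def cache_ratios(counters: Mapping[str, float]) -> Dict[str, float]:
    """Return the share of requests served by each path."""
    exact = counters.get("aif_cache_exact_hits", 0.0)
    semantic = counters.get("aif_cache_semantic_hits", 0.0)
    misses = counters.get("aif_cache_misses", 0.0)

    # Assumption: the three counters partition all cache lookups.
    total = exact + semantic + misses
    if total == 0:
        return {"exact": 0.0, "semantic": 0.0, "miss": 0.0}

    return {
        "exact": exact / total,
        "semantic": semantic / total,
        "miss": misses / total,
    }


# Invented sample values, e.g. counter increases over some time window.
sample = {
    "aif_cache_exact_hits": 600,
    "aif_cache_semantic_hits": 150,
    "aif_cache_misses": 250,
}
print(cache_ratios(sample))
```

Because these are cumulative counters, ratios are more informative when computed from increases over a time window than from raw totals since process start.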

Useful cost and savings metrics

aif_model_cost_micro_usd_total{model="..."}
aif_gross_saved_micro_usd_total{model="...", cache_type="exact|semantic"}
aif_net_saved_micro_usd_total{model="...", cache_type="exact|semantic"}
aif_embedding_overhead_micro_usd_total{model="...", operation="lookup|store"}
aif_request_cost_micro_usd_total{model="...", cost_type="chat|embedding"}
aif_model_requests_total{model="..."}
aif_cache_hits_total{model="...", cache_type="exact|semantic"}

All cost values are reported in micro-USD.

1 USD = 1,000,000 micro-USD
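
For dashboards or reports that show dollars, the conversion can be sketched as follows. `Decimal` is used here to avoid binary floating-point rounding on currency values; `micro_usd_to_usd` and `usd_to_micro_usd` are hypothetical helper names.

```python
from decimal import Decimal

MICRO_USD_PER_USD = 1_000_000


def micro_usd_to_usd(micro_usd: int) -> Decimal:
    """Convert an integer micro-USD amount to USD."""
    return Decimal(micro_usd) / MICRO_USD_PER_USD


def usd_to_micro_usd(usd: Decimal) -> int:
    """Convert USD to whole micro-USD, rounding to the nearest unit."""
    return int((usd * MICRO_USD_PER_USD).to_integral_value())


print(micro_usd_to_usd(2_500_000))          # 2.5
print(micro_usd_to_usd(1_234))              # 0.001234
print(usd_to_micro_usd(Decimal("0.0375")))  # 37500
```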
