Metrics
AI Cost Firewall exposes Prometheus-format metrics at:
/metrics
Example:
curl http://localhost:8080/metrics
Request and cache metrics
aif_requests_total
aif_cache_exact_hits
aif_cache_semantic_hits
aif_cache_misses
aif_upstream_calls_total
These metrics show total request volume, cache outcomes, and how often requests are forwarded to the upstream LLM provider.
Per-model request and token metrics
aif_model_requests_total{model="..."}
aif_model_input_tokens_total{model="..."}
aif_model_output_tokens_total{model="..."}
These metrics show which models are being used and how many upstream input/output tokens they consume.
aif_model_requests_total counts upstream chat completion requests by model. It is useful as a denominator for average cost per upstream request.
Cost and savings metrics
AI Cost Firewall reports costs in micro-USD.
1 USD = 1,000,000 micro-USD
Backward-compatible aggregate cost metrics:
aif_chat_cost_saved_micro_usd
aif_embedding_cost_micro_usd
aif_cost_saved_micro_usd
Structured cost-intelligence metrics:
aif_model_cost_micro_usd_total{model="..."}
aif_request_cost_micro_usd_total{model="...", cost_type="chat|embedding"}
aif_gross_saved_micro_usd_total{model="...", cache_type="exact|semantic"}
aif_net_saved_micro_usd_total{model="...", cache_type="exact|semantic"}
aif_embedding_overhead_micro_usd_total{model="...", operation="lookup|store"}
aif_cache_hits_total{model="...", cache_type="exact|semantic"}
Cost accounting model
The main accounting model is:
gross savings = avoided upstream chat completion cost
embedding overhead = cost of semantic lookup/store embedding calls
net savings = gross savings - embedding overhead
For exact cache hits:
net savings ≈ gross savings
For semantic cache hits:
net savings = avoided chat cost - embedding overhead
Exact cache hits do not require embedding lookup. Semantic cache hits avoid upstream chat calls, but require embedding lookup to search for similar cached prompts.
Semantic cache storage can also create embedding overhead because cache misses may be embedded before they are stored for future semantic reuse.
If embedding_price is not configured, embedding overhead is treated as 0, and savings may be overestimated.
Runtime metrics
aif_inflight_requests
aif_readiness_state
aif_shutdown_in_progress
aif_shutdown_rejections_total
Note: aif_inflight_requests includes the /metrics request itself.
Provider diagnostics
Upstream provider diagnostics:
aif_upstream_timeouts_total
aif_upstream_request_duration_seconds
Embedding provider diagnostics:
aif_embedding_request_duration_seconds
aif_embedding_timeouts_total
These metrics help diagnose slow or unavailable upstream and embedding providers.
Error metrics
aif_errors_total{class="..."}
Common error classes include:
validation_error
upstream_error
upstream_timeout
upstream_authentication_error
upstream_not_found
upstream_rate_limited
upstream_tls_error
upstream_dns_error
upstream_connect_error
internal_error
These labels help distinguish request validation issues, provider connectivity problems, TLS/certificate errors, upstream rate limits, and internal failures.
Semantic diagnostics
aif_semantic_candidates_checked_total
aif_semantic_threshold_results_total{result="pass"}
aif_semantic_threshold_results_total{result="fail"}
aif_semantic_expired_entries_skipped_total
aif_semantic_lookup_duration_seconds
aif_semantic_store_total
aif_semantic_store_errors_total
These metrics explain semantic cache behavior:
- how many candidates are checked
- how often candidates pass or fail the similarity threshold
- how often expired entries are skipped
- how long semantic lookup takes
- whether semantic cache writes are succeeding
Useful PromQL examples
Estimated chat spend by model:
sum by (model) (
increase(aif_model_cost_micro_usd_total[$__range])
) / 1000000
Gross savings by cache type:
sum by (cache_type) (
increase(aif_gross_saved_micro_usd_total[$__range])
) / 1000000
Net savings by cache type:
sum by (cache_type) (
increase(aif_net_saved_micro_usd_total[$__range])
) / 1000000
Embedding overhead by operation:
sum by (operation) (
increase(aif_embedding_overhead_micro_usd_total[$__range])
) / 1000000
Average upstream chat cost per request:
sum by (model) (
rate(aif_model_cost_micro_usd_total[$__rate_interval])
)
/
clamp_min(
sum by (model) (
rate(aif_model_requests_total[$__rate_interval])
),
1
)
Average net saved cost per cache hit:
sum by (model, cache_type) (
rate(aif_net_saved_micro_usd_total[$__rate_interval])
)
/
clamp_min(
sum by (model, cache_type) (
rate(aif_cache_hits_total[$__rate_interval])
),
1
)
Semantic lookup overhead per semantic hit:
sum by (model) (
rate(aif_embedding_overhead_micro_usd_total{operation="lookup"}[$__rate_interval])
)
/
clamp_min(
sum by (model) (
rate(aif_cache_hits_total{cache_type="semantic"}[$__rate_interval])
),
1
)
Pricing configuration
Cost metrics depend on model pricing configured in AI Cost Firewall.
Example:
model_price gpt-4o-mini-2024-07-18 0.15 0.60;
embedding_price 0.020;
Model prices are configured in USD per 1M input/output tokens.
Embedding price is configured in USD per 1M embedding tokens.