Skip to main content

Metrics

AI Cost Firewall exposes Prometheus-format metrics at:

/metrics

Example:

curl http://localhost:8080/metrics

Request and cache metrics

aif_requests_total
aif_cache_exact_hits
aif_cache_semantic_hits
aif_cache_misses
aif_upstream_calls_total

These metrics show total request volume, cache outcomes, and how often requests are forwarded to the upstream LLM provider.

Per-model request and token metrics

aif_model_requests_total{model="..."}
aif_model_input_tokens_total{model="..."}
aif_model_output_tokens_total{model="..."}

These metrics show which models are being used and how many upstream input/output tokens they consume.

aif_model_requests_total counts upstream chat completion requests by model. It is useful as a denominator for average cost per upstream request.

Cost and savings metrics

AI Cost Firewall reports costs in micro-USD.

1 USD = 1,000,000 micro-USD

Backward-compatible aggregate cost metrics:

aif_chat_cost_saved_micro_usd
aif_embedding_cost_micro_usd
aif_cost_saved_micro_usd

Structured cost-intelligence metrics:

aif_model_cost_micro_usd_total{model="..."}
aif_request_cost_micro_usd_total{model="...", cost_type="chat|embedding"}
aif_gross_saved_micro_usd_total{model="...", cache_type="exact|semantic"}
aif_net_saved_micro_usd_total{model="...", cache_type="exact|semantic"}
aif_embedding_overhead_micro_usd_total{model="...", operation="lookup|store"}
aif_cache_hits_total{model="...", cache_type="exact|semantic"}

Cost accounting model

The main accounting model is:

gross savings = avoided upstream chat completion cost
embedding overhead = cost of semantic lookup/store embedding calls
net savings = gross savings - embedding overhead

For exact cache hits:

net savings ≈ gross savings

For semantic cache hits:

net savings = avoided chat cost - embedding overhead

Exact cache hits do not require embedding lookup. Semantic cache hits avoid upstream chat calls, but require embedding lookup to search for similar cached prompts.

Semantic cache storage can also create embedding overhead because cache misses may be embedded before they are stored for future semantic reuse.

If embedding_price is not configured, embedding overhead is treated as 0, and savings may be overestimated.

Runtime metrics

aif_inflight_requests
aif_readiness_state
aif_shutdown_in_progress
aif_shutdown_rejections_total

Note: aif_inflight_requests includes the /metrics request itself.

Provider diagnostics

Upstream provider diagnostics:

aif_upstream_timeouts_total
aif_upstream_request_duration_seconds

Embedding provider diagnostics:

aif_embedding_request_duration_seconds
aif_embedding_timeouts_total

These metrics help diagnose slow or unavailable upstream and embedding providers.

Error metrics

aif_errors_total{class="..."}

Common error classes include:

validation_error
upstream_error
upstream_timeout
upstream_authentication_error
upstream_not_found
upstream_rate_limited
upstream_tls_error
upstream_dns_error
upstream_connect_error
internal_error

These labels help distinguish request validation issues, provider connectivity problems, TLS/certificate errors, upstream rate limits, and internal failures.

Semantic diagnostics

aif_semantic_candidates_checked_total
aif_semantic_threshold_results_total{result="pass"}
aif_semantic_threshold_results_total{result="fail"}
aif_semantic_expired_entries_skipped_total
aif_semantic_lookup_duration_seconds
aif_semantic_store_total
aif_semantic_store_errors_total

These metrics explain semantic cache behavior:

  • how many candidates are checked
  • how often candidates pass or fail the similarity threshold
  • how often expired entries are skipped
  • how long semantic lookup takes
  • whether semantic cache writes are succeeding

Useful PromQL examples

Estimated chat spend by model:

sum by (model) (
increase(aif_model_cost_micro_usd_total[$__range])
) / 1000000

Gross savings by cache type:

sum by (cache_type) (
increase(aif_gross_saved_micro_usd_total[$__range])
) / 1000000

Net savings by cache type:

sum by (cache_type) (
increase(aif_net_saved_micro_usd_total[$__range])
) / 1000000

Embedding overhead by operation:

sum by (operation) (
increase(aif_embedding_overhead_micro_usd_total[$__range])
) / 1000000

Average upstream chat cost per request:

sum by (model) (
rate(aif_model_cost_micro_usd_total[$__rate_interval])
)
/
clamp_min(
sum by (model) (
rate(aif_model_requests_total[$__rate_interval])
),
1
)

Average net saved cost per cache hit:

sum by (model, cache_type) (
rate(aif_net_saved_micro_usd_total[$__rate_interval])
)
/
clamp_min(
sum by (model, cache_type) (
rate(aif_cache_hits_total[$__rate_interval])
),
1
)

Semantic lookup overhead per semantic hit:

sum by (model) (
rate(aif_embedding_overhead_micro_usd_total{operation="lookup"}[$__rate_interval])
)
/
clamp_min(
sum by (model) (
rate(aif_cache_hits_total{cache_type="semantic"}[$__rate_interval])
),
1
)

Pricing configuration

Cost metrics depend on model pricing configured in AI Cost Firewall.

Example:

model_price gpt-4o-mini-2024-07-18 0.15 0.60;
embedding_price 0.020;

Model prices are configured in USD per 1M input/output tokens.

Embedding price is configured in USD per 1M embedding tokens.