Grafana Dashboards

AI Cost Firewall includes preconfigured Grafana dashboards for runtime visibility, cache behavior, provider diagnostics, and cost/savings analysis.

The Docker Compose stack includes Prometheus and Grafana configuration.

Start the stack:

docker compose up -d

Open Grafana:

http://localhost:3000

Default dashboard files are located in the main repository under:

deploy/grafana/dashboards/

and are provisioned automatically.
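After starting the stack, it can help to confirm that Grafana is reachable before opening the dashboards. The following is a minimal sketch, assuming Grafana is served at the URL above and exposes its standard /api/health endpoint; the function names are illustrative and not part of AI Cost Firewall.

```python
import json
import urllib.request

GRAFANA_URL = "http://localhost:3000"  # URL from this page; adjust if the port is remapped


def health_url(base_url: str = GRAFANA_URL) -> str:
    """Build the URL of Grafana's health endpoint."""
    return base_url.rstrip("/") + "/api/health"


def is_healthy(payload: dict) -> bool:
    """Grafana's health payload reports database == "ok" when it is ready."""
    return payload.get("database") == "ok"


def check_grafana(base_url: str = GRAFANA_URL, timeout: float = 5.0) -> bool:
    """Return True if Grafana answers its health endpoint with a healthy payload."""
    try:
        with urllib.request.urlopen(health_url(base_url), timeout=timeout) as resp:
            return is_healthy(json.load(resp))
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON body: not ready.
        return False
```

Calling check_grafana() shortly after docker compose up -d may return False for a few seconds while the container starts, so a short retry loop is usually appropriate.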

Overview dashboard

The Overview dashboard is intended for day-to-day operator and business-value visibility.

It answers questions such as:

  • How much traffic is going through the firewall?
  • How many requests are served from cache?
  • How much estimated upstream chat cost was incurred?
  • How much cost was avoided by caching?
  • How much embedding overhead was introduced by semantic caching?
  • What is the net savings after embedding overhead?
  • Which models generate the most spend?
  • Which models generate the most savings?
  • How do exact and semantic cache savings compare?

Main panels include:

  • Total requests
  • Estimated chat cost
  • Gross savings
  • Embedding overhead
  • Net savings
  • Net savings percentage
  • Cache hit rate
  • Readiness
  • Cost and savings breakdown
  • Savings by cache type
  • Spend and savings by model
  • Exact vs semantic hit rate
  • Cost per upstream request
  • Net saved per cache hit
  • Top models by spend
  • Top models by net savings
  • Request cost by cost type
  • Embedding overhead by operation

Diagnostics dashboard

The Diagnostics dashboard is intended for troubleshooting and for explaining why savings are higher or lower than expected.

It answers questions such as:

  • Are semantic lookups happening?
  • Are semantic candidates being checked?
  • Are candidates passing or failing the similarity threshold?
  • Are expired semantic entries being skipped?
  • Are semantic cache writes succeeding?
  • Is embedding overhead significant?
  • Is semantic caching producing positive net savings?
  • Are some models less profitable for semantic caching?
  • Are providers returning authentication, connectivity, DNS, TLS, rate-limit, or timeout errors?

Main panels include:

  • Readiness state
  • Shutdown state
  • In-flight requests
  • Provider timeouts
  • Semantic store errors
  • Error classification rate
  • Provider and semantic latency diagnostics
  • Semantic threshold and expiration decisions
  • Semantic store health
  • Semantic threshold pass share
  • Expired entries skipped
  • Runtime and provider pressure signals
  • Operational signals
  • Embedding overhead over time
  • Gross vs net semantic savings
  • Exact vs semantic savings
  • Semantic cache misses vs passes
  • Semantic net savings by model
  • Potentially low-value semantic models

The Diagnostics dashboard focuses on semantic cache runtime behavior rather than business metrics.

Cost breakdown

AI Cost Firewall separates gross savings, embedding overhead, and net savings.

gross savings = avoided upstream chat completion cost
embedding overhead = cost of semantic lookup/store embedding calls
net savings = gross savings - embedding overhead

For exact cache hits:

net savings ≈ gross savings

For semantic cache hits:

net savings = avoided chat cost - embedding overhead

This distinction is important because semantic caching can avoid expensive chat completions, but it also requires embedding calls. Exact and semantic cache savings should therefore be evaluated separately.
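The arithmetic above can be sketched in a few lines. This is an illustration only: the class name and all figures are hypothetical, and exact-cache embedding overhead is assumed to be zero.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheSavings:
    """Savings for one cache type, in integer micro-USD."""

    gross_saved: int         # avoided upstream chat completion cost
    embedding_overhead: int  # cost of semantic lookup/store embedding calls

    @property
    def net_saved(self) -> int:
        # net savings = gross savings - embedding overhead
        return self.gross_saved - self.embedding_overhead


# Hypothetical figures, chosen only to illustrate the arithmetic.
exact = CacheSavings(gross_saved=400_000, embedding_overhead=0)
semantic = CacheSavings(gross_saved=250_000, embedding_overhead=30_000)
unprofitable = CacheSavings(gross_saved=10_000, embedding_overhead=25_000)

for name, s in [("exact", exact), ("semantic", semantic), ("unprofitable", unprofitable)]:
    print(f"{name}: gross={s.gross_saved} overhead={s.embedding_overhead} net={s.net_saved}")
```

The third case shows why the two cache types should be evaluated separately: when embedding calls cost more than the chat completions they avoid, net savings become negative even though gross savings are positive.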

Main dashboard metrics

The dashboards use the following cost-intelligence metrics:

aif_model_cost_micro_usd_total{model="..."}
aif_model_requests_total{model="..."}
aif_model_input_tokens_total{model="..."}
aif_model_output_tokens_total{model="..."}
aif_gross_saved_micro_usd_total{model="...", cache_type="exact|semantic"}
aif_net_saved_micro_usd_total{model="...", cache_type="exact|semantic"}
aif_embedding_overhead_micro_usd_total{model="...", operation="lookup|store"}
aif_request_cost_micro_usd_total{model="...", cost_type="chat|embedding"}
aif_cache_hits_total{model="...", cache_type="exact|semantic"}

All cost values are reported in micro-USD.

1 USD = 1,000,000 micro-USD
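When reading raw counter values outside Grafana, a small helper can convert micro-USD to dollars. A minimal sketch, with hypothetical function names, using Decimal to avoid floating-point rounding surprises:

```python
from decimal import Decimal

MICRO_USD_PER_USD = 1_000_000


def micro_usd_to_usd(micro_usd: int) -> Decimal:
    """Convert an integer micro-USD amount to USD."""
    return Decimal(micro_usd) / MICRO_USD_PER_USD


def format_usd(micro_usd: int) -> str:
    """Render a micro-USD amount as a dollar string with two decimals."""
    return f"${micro_usd_to_usd(micro_usd):,.2f}"
```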

Useful PromQL examples

Estimated chat spend by model:

sum by (model) (
  increase(aif_model_cost_micro_usd_total[$__range])
) / 1000000

Gross savings by cache type:

sum by (cache_type) (
  increase(aif_gross_saved_micro_usd_total[$__range])
) / 1000000

Net savings by cache type:

sum by (cache_type) (
  increase(aif_net_saved_micro_usd_total[$__range])
) / 1000000

Embedding overhead by operation:

sum by (operation) (
  increase(aif_embedding_overhead_micro_usd_total[$__range])
) / 1000000
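These queries can also be run outside Grafana through the Prometheus HTTP API. Note that $__range and $__rate_interval are Grafana variables, so a concrete window such as 24h has to be substituted. The sketch below assumes Prometheus is reachable at http://localhost:9090 (its default port, which may not match your Compose port mapping); the function names are illustrative.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # assumption: default Prometheus port

# Same query as "Estimated chat spend by model", with the Grafana
# variable $__range replaced by a placeholder for a concrete window.
QUERY_TEMPLATE = (
    "sum by (model) (increase(aif_model_cost_micro_usd_total[{window}])) / 1000000"
)


def build_query_url(base_url: str, window: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    query = QUERY_TEMPLATE.format(window=window)
    return f"{base_url.rstrip('/')}/api/v1/query?" + urllib.parse.urlencode({"query": query})


def parse_spend_by_model(payload: dict) -> dict:
    """Map model name to spend in USD from an instant-query response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    return {
        sample["metric"].get("model", ""): float(sample["value"][1])
        for sample in payload["data"]["result"]
    }


def fetch_spend_by_model(window: str = "24h", base_url: str = PROMETHEUS_URL) -> dict:
    """Run the query against Prometheus and return spend per model in USD."""
    with urllib.request.urlopen(build_query_url(base_url, window), timeout=10) as resp:
        return parse_spend_by_model(json.load(resp))
```

For example, fetch_spend_by_model("7d") would return the estimated chat spend per model over the last seven days, provided Prometheus has retained that much data.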

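Average upstream chat cost per request (in micro-USD):

sum by (model) (
  rate(aif_model_cost_micro_usd_total[$__rate_interval])
)
/
(
  sum by (model) (
    rate(aif_model_requests_total[$__rate_interval])
  ) > 0
)

The > 0 filter drops models with no requests in the window, which avoids division by zero without distorting the average when the request rate is below one per second.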

Average net saved cost per cache hit (in micro-USD):

sum by (model, cache_type) (
  rate(aif_net_saved_micro_usd_total[$__rate_interval])
)
/
(
  sum by (model, cache_type) (
    rate(aif_cache_hits_total[$__rate_interval])
  ) > 0
)

Net savings percentage:

100 *
sum(increase(aif_net_saved_micro_usd_total[$__range]))
/
clamp_min(
  sum(increase(aif_model_cost_micro_usd_total[$__range]))
  +
  sum(increase(aif_net_saved_micro_usd_total[$__range])),
  1
)
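The net savings percentage query can be checked by hand with a small function that mirrors it. This is a sketch with a hypothetical function name and made-up inputs, both in micro-USD:

```python
def net_savings_percent(net_saved_micro_usd: float, chat_cost_micro_usd: float) -> float:
    """Mirror of the query: 100 * net / clamp_min(cost + net, 1)."""
    # clamp_min(..., 1) keeps the denominator at 1 micro-USD or more,
    # so an idle system yields 0 instead of a division by zero.
    denominator = max(chat_cost_micro_usd + net_saved_micro_usd, 1.0)
    return 100.0 * net_saved_micro_usd / denominator


# Hypothetical window: 3 USD of chat spend, 1 USD of net savings.
print(net_savings_percent(1_000_000, 3_000_000))
```

With these inputs the denominator is 4,000,000 micro-USD, so net savings are a quarter of what would have been spent.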

Notes

The dashboards are designed for both real deployments and local synthetic workloads.

For meaningful cost panels, configure model pricing in the firewall configuration:

model_price <model> <input_usd_per_1m_tokens> <output_usd_per_1m_tokens>;
embedding_price <usd_per_1m_tokens>;

If embedding_price is not configured, embedding overhead is treated as 0, and semantic net savings may be overestimated.
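Because prices are given in USD per 1M tokens and costs are reported in micro-USD, the price figure is numerically the cost of one token in micro-USD. The sketch below illustrates that relationship; it is not the firewall's actual implementation, and the rounding mode, function names, and example prices are assumptions.

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import Optional


def chat_cost_micro_usd(input_tokens: int, output_tokens: int,
                        input_usd_per_1m: str, output_usd_per_1m: str) -> int:
    """Estimate chat cost in micro-USD from token counts and model_price values."""
    # USD per 1M tokens == micro-USD per token, so no further scaling is needed.
    cost = (input_tokens * Decimal(input_usd_per_1m)
            + output_tokens * Decimal(output_usd_per_1m))
    return int(cost.to_integral_value(rounding=ROUND_HALF_UP))  # rounding is an assumption


def embedding_cost_micro_usd(tokens: int, usd_per_1m: Optional[str]) -> int:
    """Estimate embedding overhead; an unconfigured price is treated as 0."""
    if usd_per_1m is None:
        return 0
    cost = tokens * Decimal(usd_per_1m)
    return int(cost.to_integral_value(rounding=ROUND_HALF_UP))


# Hypothetical prices: 2.50 USD input, 10.00 USD output, 0.02 USD embedding per 1M tokens.
print(chat_cost_micro_usd(1200, 300, "2.50", "10.00"))
print(embedding_cost_micro_usd(500, "0.02"))
print(embedding_cost_micro_usd(500, None))
```

The last call shows the effect described above: without an embedding price, the overhead is reported as 0 even though the embedding calls were made.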

Most deployment examples include an optional docker-compose.observability.yml overlay for Prometheus and Grafana.

The local-full-stack/ deployment includes observability directly in its main Compose stack.