Grafana Dashboards

AI Cost Firewall includes preconfigured Grafana dashboards for runtime visibility, cache behavior, provider diagnostics, and cost/savings analysis.

The Docker Compose stack includes Prometheus and Grafana configuration.

Start the stack:

docker compose up -d

Open Grafana:

http://localhost:3000

The default dashboard files live in the main repository under the following path and are provisioned automatically:

deploy/grafana/dashboards/

Overview dashboard

The Overview dashboard is intended for day-to-day operator and business-value visibility.

It answers questions such as:

How much traffic is going through the firewall?
How many requests are served from cache?
How much estimated upstream chat cost was incurred?
How much cost was avoided by caching?
How much embedding overhead was introduced by semantic caching?
What is the net savings after embedding overhead?
Which models generate the most spend?
Which models generate the most savings?
How do exact and semantic cache savings compare?

Main panels include:

Total requests
Estimated chat cost
Gross savings
Embedding overhead
Net savings
Net savings percentage
Cache hit rate
Readiness
Cost and savings breakdown
Savings by cache type
Spend and savings by model
Exact vs semantic hit rate
Cost per upstream request
Net saved per cache hit
Top models by spend
Top models by net savings
Request cost by cost type
Embedding overhead by operation

Diagnostics dashboard

The Diagnostics dashboard is intended for troubleshooting and explaining why savings are high, low, or lower than expected.

It answers questions such as:

Are semantic lookups happening?
Are semantic candidates being checked?
Are candidates passing or failing the similarity threshold?
Are expired semantic entries being skipped?
Are semantic cache writes succeeding?
Is embedding overhead significant?
Is semantic caching producing positive net savings?
Are some models less profitable for semantic caching?
Are providers returning authentication, connectivity, DNS, TLS, rate-limit, or timeout errors?

Main panels include:

Readiness state
Shutdown state
In-flight requests
Provider timeouts
Semantic store errors
Error classification rate
Provider and semantic latency diagnostics
Semantic threshold and expiration decisions
Semantic store health
Semantic threshold pass share
Expired entries skipped
Runtime and provider pressure signals
Operational signals
Embedding overhead over time
Gross vs net semantic savings
Exact vs semantic savings
Semantic cache misses vs passes
Semantic net savings by model
Potentially low-value semantic models

The Diagnostics dashboard focuses on semantic cache and provider runtime behavior rather than business metrics.

Cost breakdown

AI Cost Firewall separates gross savings, embedding overhead, and net savings.

gross savings = avoided upstream chat completion cost
embedding overhead = cost of semantic lookup/store embedding calls
net savings = gross savings - embedding overhead

For exact cache hits:

net savings ≈ gross savings

For semantic cache hits:

net savings = avoided chat cost - embedding overhead

This distinction is important because semantic caching can avoid expensive chat completions, but it also requires embedding calls. Exact and semantic cache savings should therefore be evaluated separately.
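The relationship between the three quantities can be sketched in a few lines of Python. The function name and the numbers are illustrative assumptions, not the firewall's internal accounting.

```python
def net_savings_micro_usd(gross_saved: int, embedding_overhead: int = 0) -> int:
    """net savings = gross savings - embedding overhead (all in micro-USD)."""
    return gross_saved - embedding_overhead


# Exact hit: no embedding call is assumed here, so net equals gross.
exact_net = net_savings_micro_usd(gross_saved=12_000)

# Semantic hit: a hypothetical 150 micro-USD lookup embedding is subtracted.
semantic_net = net_savings_micro_usd(gross_saved=12_000, embedding_overhead=150)

print(exact_net, semantic_net)  # 12000 11850
```

With the same avoided chat cost, the semantic hit is worth slightly less, which is why the two cache types are tracked under separate cache_type labels.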

Main dashboard metrics

The dashboards use the following cost-intelligence metrics:

aif_model_cost_micro_usd_total{model="..."}
aif_model_requests_total{model="..."}
aif_model_input_tokens_total{model="..."}
aif_model_output_tokens_total{model="..."}
aif_gross_saved_micro_usd_total{model="...", cache_type="exact|semantic"}
aif_net_saved_micro_usd_total{model="...", cache_type="exact|semantic"}
aif_embedding_overhead_micro_usd_total{model="...", operation="lookup|store"}
aif_request_cost_micro_usd_total{model="...", cost_type="chat|embedding"}
aif_cache_hits_total{model="...", cache_type="exact|semantic"}

All cost values are reported in micro-USD.

1 USD = 1,000,000 micro-USD
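When reading raw counter values outside Grafana, the conversion is a single division. A minimal sketch, with a hypothetical helper name:

```python
MICRO_USD_PER_USD = 1_000_000


def micro_usd_to_usd(value: float) -> float:
    """Convert a micro-USD counter value to USD."""
    return value / MICRO_USD_PER_USD


print(micro_usd_to_usd(2_500_000))  # 2.5
```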

Useful PromQL examples

Estimated chat spend by model:

sum by (model) (
  increase(aif_model_cost_micro_usd_total[$__range])
) / 1000000

Gross savings by cache type:

sum by (cache_type) (
  increase(aif_gross_saved_micro_usd_total[$__range])
) / 1000000

Net savings by cache type:

sum by (cache_type) (
  increase(aif_net_saved_micro_usd_total[$__range])
) / 1000000

Embedding overhead by operation:

sum by (operation) (
  increase(aif_embedding_overhead_micro_usd_total[$__range])
) / 1000000
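The queries above use $__range, which only Grafana expands. To run the same kind of query from a script, the variable can be replaced with a concrete window and sent to the Prometheus HTTP API. The sketch below assumes Prometheus is reachable at http://localhost:9090 (not stated in this page) and that a 24h window is acceptable; the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Prometheus from the Compose stack listens on localhost:9090.
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(promql: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def parse_instant_vector(payload: dict, label: str) -> dict:
    """Map each series' label value to its sample value as a float."""
    return {
        series["metric"].get(label, ""): float(series["value"][1])
        for series in payload["data"]["result"]
    }


def chat_spend_by_model(window: str = "24h") -> dict:
    """Return estimated chat spend in USD per model over the given window."""
    promql = (
        "sum by (model) ("
        f"increase(aif_model_cost_micro_usd_total[{window}])"
        ") / 1000000"
    )
    with urllib.request.urlopen(build_query_url(promql), timeout=10) as resp:
        return parse_instant_vector(json.load(resp), "model")
```

Calling chat_spend_by_model() against a running stack returns a dictionary keyed by model name; the same two helpers work for the other increase-based queries by swapping the metric and label.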

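Average upstream chat cost per request (micro-USD):

sum by (model) (
  rate(aif_model_cost_micro_usd_total[$__rate_interval])
)
/
(
  sum by (model) (
    rate(aif_model_requests_total[$__rate_interval])
  ) > 0
)

The > 0 filter drops models with no traffic in the interval, which avoids division by zero without distorting the result when the request rate is below one per second.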

Average net saved cost per cache hit (micro-USD):

sum by (model, cache_type) (
  rate(aif_net_saved_micro_usd_total[$__rate_interval])
)
/
(
  sum by (model, cache_type) (
    rate(aif_cache_hits_total[$__rate_interval])
  ) > 0
)

Net savings percentage:

100 *
sum(increase(aif_net_saved_micro_usd_total[$__range]))
/
clamp_min(
  sum(increase(aif_model_cost_micro_usd_total[$__range]))
  +
  sum(increase(aif_net_saved_micro_usd_total[$__range])),
  1
)
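The net savings percentage compares net savings with what would have been spent without the cache, that is, actual chat cost plus net savings. The arithmetic can be sketched in Python as follows; the function name and sample values are illustrative assumptions.

```python
def net_savings_percentage(net_saved: float, chat_cost: float) -> float:
    """Share of the hypothetical uncached spend that was saved, in percent.

    Inputs are in micro-USD. The max(..., 1.0) mirrors clamp_min(..., 1)
    and keeps the result at 0 when there has been no traffic yet.
    """
    denominator = max(chat_cost + net_saved, 1.0)
    return 100.0 * net_saved / denominator


print(net_savings_percentage(net_saved=250_000, chat_cost=750_000))  # 25.0
print(net_savings_percentage(net_saved=0, chat_cost=0))              # 0.0
```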

Notes

The dashboards are designed for both real deployments and local synthetic workloads.

For meaningful cost panels, configure model pricing in the firewall configuration:

model_price <model> <input_usd_per_1m_tokens> <output_usd_per_1m_tokens>;
embedding_price <usd_per_1m_tokens>;

If embedding_price is not configured, embedding overhead is treated as 0, and semantic net savings may be overestimated.
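To see how the configured prices relate to the micro-USD counters, note that a price in USD per 1M tokens is numerically equal to micro-USD per token. The sketch below relies on that identity; the function name, the rounding, and the sample prices are assumptions for illustration and may differ from how the firewall computes cost internally.

```python
def estimate_cost_micro_usd(
    input_tokens: int,
    output_tokens: int,
    input_usd_per_1m: float,
    output_usd_per_1m: float,
) -> int:
    """Estimate request cost in micro-USD from token counts and prices.

    USD per 1M tokens equals micro-USD per token, so no extra scaling
    factor is needed.
    """
    cost = input_tokens * input_usd_per_1m + output_tokens * output_usd_per_1m
    return round(cost)


# Hypothetical prices: 2.50 USD per 1M input tokens, 10.00 USD per 1M output tokens.
print(estimate_cost_micro_usd(1200, 300, 2.50, 10.00))  # 6000
```

The same calculation with embedding_price and the embedded token count gives the embedding overhead; with no price configured it evaluates to 0, which is the overestimation case described above.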

Most deployment examples include an optional Compose overlay for Prometheus and Grafana:

docker-compose.observability.yml

The local-full-stack/ deployment includes observability directly in its main Compose stack.
