Architecture Overview
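AI Cost Firewall is a lightweight gateway that sits in front of an OpenAI-compatible chat upstream and answers repeated requests from cache instead of forwarding them.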
Components
AI Cost Firewall
The Rust + Axum gateway that validates requests, checks caches, forwards misses upstream, stores cache entries, and exposes metrics.
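The request path described above (validate, check the cache, forward a miss, store the result) can be sketched with the standard library alone. This is a minimal illustration, not the project's code: the `Gateway` type, the `Source` enum, and the closure standing in for the upstream are hypothetical names, an in-memory `HashMap` stands in for Redis / Valkey, and the semantic cache is left out for brevity.

```rust
use std::collections::HashMap;

/// Where a response came from (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
pub enum Source {
    ExactCache,
    Upstream,
}

/// Hypothetical gateway state; a HashMap stands in for the exact cache.
#[derive(Default)]
pub struct Gateway {
    exact_cache: HashMap<String, String>,
    pub hits: u64,
    pub misses: u64,
}

impl Gateway {
    /// Validate the request, check the cache, forward a miss, store the result.
    pub fn handle<F>(&mut self, prompt: &str, upstream: F) -> Result<(String, Source), String>
    where
        F: FnOnce(&str) -> String,
    {
        // 1. Validate: reject prompts that are empty after trimming.
        let key = prompt.trim();
        if key.is_empty() {
            return Err("empty prompt".to_string());
        }
        // 2. Check the exact cache.
        if let Some(cached) = self.exact_cache.get(key) {
            self.hits += 1;
            return Ok((cached.clone(), Source::ExactCache));
        }
        // 3. Forward the miss upstream.
        self.misses += 1;
        let answer = upstream(key);
        // 4. Store the new entry so the next identical request is a hit.
        self.exact_cache.insert(key.to_string(), answer.clone());
        Ok((answer, Source::Upstream))
    }
}
```

Calling `handle` twice with the same prompt invokes the upstream closure only the first time; the second call is served from the map and increments `hits`.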
Redis / Valkey
Stores exact cache entries.
Qdrant
Stores semantic cache entries and performs vector search. AI Cost Firewall uses Qdrant gRPC on port 6334.
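To illustrate what a semantic cache lookup amounts to, the sketch below does a brute-force cosine-similarity search over in-memory entries. It is a conceptual stand-in only: in the real stack Qdrant performs the vector search, and the `SemanticEntry` type, the `semantic_lookup` function, and the choice of cosine similarity with a fixed threshold are assumptions made for this example.

```rust
/// Cosine similarity of two vectors; None if lengths differ or a norm is zero.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Hypothetical semantic cache entry: an embedding plus the cached response.
pub struct SemanticEntry {
    pub embedding: Vec<f32>,
    pub response: String,
}

/// Return the most similar cached response whose score reaches `threshold`.
pub fn semantic_lookup<'a>(
    entries: &'a [SemanticEntry],
    query: &[f32],
    threshold: f32,
) -> Option<&'a str> {
    entries
        .iter()
        // Score every entry, skipping ones that cannot be compared.
        .filter_map(|e| cosine_similarity(&e.embedding, query).map(|s| (s, e)))
        // Keep only entries that are similar enough to count as a hit.
        .filter(|(s, _)| *s >= threshold)
        // Pick the highest-scoring survivor.
        .max_by(|a, b| a.0.total_cmp(&b.0))
        .map(|(_, e)| e.response.as_str())
}
```

A higher threshold trades hit rate for precision: fewer requests are answered from the semantic cache, but the answers returned are closer matches to the original prompt.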
OpenAI-compatible chat upstream
Receives cache misses.
OpenAI-compatible embedding provider
Semantic caching uses embeddings generated by the configured embedding provider. The embedding provider may be the same service as the chat upstream or a separate OpenAI-compatible endpoint.
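One way to model "same service or separate endpoint" is an optional embedding endpoint that falls back to the chat upstream. The sketch below assumes this shape purely for illustration; `Endpoint`, `ProviderConfig`, and the field names are hypothetical and do not reflect the project's actual configuration keys.

```rust
/// Hypothetical description of an OpenAI-compatible endpoint.
#[derive(Debug, Clone, PartialEq)]
pub struct Endpoint {
    pub base_url: String,
}

/// Hypothetical provider settings: embeddings may reuse the chat upstream.
pub struct ProviderConfig {
    pub chat_upstream: Endpoint,
    /// None means "use the chat upstream for embeddings as well".
    pub embedding_provider: Option<Endpoint>,
}

impl ProviderConfig {
    /// Resolve the endpoint that embedding requests should be sent to.
    pub fn embedding_endpoint(&self) -> &Endpoint {
        self.embedding_provider
            .as_ref()
            .unwrap_or(&self.chat_upstream)
    }
}
```

With `embedding_provider: None`, `embedding_endpoint()` returns the chat upstream; with `Some(...)`, it returns the separate endpoint, so chat and embedding traffic can go to different services.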
OpenAI-Compatible Provider Flexibility
AI Cost Firewall works with common OpenAI-compatible providers without requiring provider-specific configuration blocks. Supported providers include:
- OpenAI
- Ollama
- LM Studio
- vLLM
- LiteLLM
- OpenRouter
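Because these providers share the OpenAI request shape, switching between them can come down to changing a base URL. The helper below is a small sketch of that idea; the function name is hypothetical, and it assumes the provider serves chat requests at the `chat/completions` path under its base URL, as the OpenAI API does.

```rust
/// Build the chat completions URL for an OpenAI-compatible base URL.
/// Assumes the provider serves the `chat/completions` path under that base.
pub fn chat_completions_url(base_url: &str) -> String {
    // Drop trailing slashes so the result never contains a double slash.
    format!("{}/chat/completions", base_url.trim_end_matches('/'))
}
```

For example, a base URL of `https://api.openai.com/v1` yields `https://api.openai.com/v1/chat/completions`, while a local server's base URL (check the provider's documentation for its address and port) yields the same path on that host.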
The deployment stack also includes:
- Prometheus
- Grafana
- Overview dashboard
- Diagnostics dashboard
See deploy/examples/.
Prometheus and Grafana
Prometheus scrapes the gateway's /metrics endpoint; Grafana visualizes cache performance, savings, and diagnostics.
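As a rough picture of what a scrape returns, the sketch below renders two counters in the Prometheus text exposition format and derives a hit ratio from them. The `CacheStats` type and the metric names `firewall_cache_hits_total` and `firewall_cache_misses_total` are invented for this example; the gateway's real metric names may differ.

```rust
/// Hypothetical cache counters kept by the gateway.
pub struct CacheStats {
    pub hits: u64,
    pub misses: u64,
}

impl CacheStats {
    /// Fraction of requests served from cache; 0.0 when nothing was served yet.
    pub fn hit_ratio(&self) -> f64 {
        let total = self.hits + self.misses;
        if total == 0 {
            0.0
        } else {
            self.hits as f64 / total as f64
        }
    }

    /// Render the counters in the Prometheus text exposition format.
    pub fn render_prometheus(&self) -> String {
        let mut out = String::new();
        out.push_str("# TYPE firewall_cache_hits_total counter\n");
        out.push_str(&format!("firewall_cache_hits_total {}\n", self.hits));
        out.push_str("# TYPE firewall_cache_misses_total counter\n");
        out.push_str(&format!("firewall_cache_misses_total {}\n", self.misses));
        out
    }
}
```

Exposing raw counters and leaving ratios to the dashboard is the usual Prometheus pattern: Grafana can compute hit ratio over any time window from the two counters, which a precomputed ratio would not allow.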