Architecture Overview

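AI Cost Firewall is a lightweight gateway that sits between client applications and an OpenAI-compatible LLM upstream. It answers repeated requests from cache and forwards only cache misses upstream.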

Components

AI Cost Firewall

The Rust + Axum gateway that validates requests, checks caches, forwards misses upstream, stores cache entries, and exposes metrics.
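The request path described above can be sketched in a few lines. This is an illustrative Python model, not the gateway's Rust code: the function name `handle_request`, the callback parameters, and the order "exact cache first, then semantic cache" are assumptions made for the example.

```python
from typing import Callable, Optional, Tuple

Lookup = Callable[[dict], Optional[str]]


def handle_request(
    request: dict,
    exact_get: Lookup,
    semantic_get: Lookup,
    upstream: Callable[[dict], str],
    store: Callable[[dict, str], None],
) -> Tuple[str, str]:
    """Return (source, response) for one chat request."""
    # 1. Validate: reject requests missing the basic chat fields.
    if not request.get("model") or not request.get("messages"):
        raise ValueError("request needs 'model' and 'messages'")

    # 2. Check caches (assumed order: exact first, then semantic).
    for source, lookup in (("exact", exact_get), ("semantic", semantic_get)):
        cached = lookup(request)
        if cached is not None:
            return source, cached

    # 3. Forward the miss upstream.
    response = upstream(request)

    # 4. Store the new entry so the next identical request is a hit.
    store(request, response)
    return "upstream", response
```

Passing the caches and the upstream as callables keeps the sketch independent of Redis, Qdrant, or any particular provider; in-memory stand-ins are enough to exercise the flow.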

Redis / Valkey

Stores exact cache entries.
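One common way to build an exact-match cache key is to hash a canonical form of the request. The sketch below assumes that approach; the `exact:` prefix, the choice of SHA-256, and the function name `exact_cache_key` are hypothetical and may differ from what AI Cost Firewall actually stores in Redis or Valkey.

```python
import hashlib
import json


def exact_cache_key(request: dict, prefix: str = "exact:") -> str:
    """Derive a deterministic key from a chat request (illustrative only)."""
    # Sorting keys and fixing separators makes the JSON canonical, so two
    # requests with the same content but different key order hash identically.
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return prefix + digest
```

With a key like this, any change to the model or messages produces a different key, which is what makes the cache "exact".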

Qdrant

Stores semantic cache entries and performs vector search. AI Cost Firewall uses Qdrant gRPC on port 6334.
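To make the idea of a semantic hit concrete, here is a brute-force stand-in for the vector search that Qdrant performs server-side. It assumes cosine similarity and a similarity threshold of 0.9; both the metric and the threshold value are assumptions for illustration, not documented settings.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_lookup(
    query: Sequence[float],
    entries: List[Tuple[Sequence[float], str]],
    threshold: float = 0.9,  # hypothetical value
) -> Optional[str]:
    """Return the cached response of the nearest entry, or None on a miss."""
    best_score, best_response = 0.0, None
    for vector, response in entries:
        score = cosine(query, vector)
        if score > best_score:
            best_score, best_response = score, response
    # Only a sufficiently close neighbour counts as a cache hit.
    return best_response if best_score >= threshold else None
```

A real vector database replaces the linear scan with an approximate nearest-neighbour index, but the hit-or-miss decision against a threshold has the same shape.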

OpenAI-compatible chat upstream

Receives the requests that cannot be served from cache.

OpenAI-compatible embedding provider

Semantic caching uses embeddings generated by the configured embedding provider. The embedding provider may be the same service as the chat upstream or a separate OpenAI-compatible endpoint.
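An OpenAI-compatible embedding call is a POST to the `/embeddings` path under the provider's base URL, with a JSON body containing `model` and `input`. The sketch below only builds such a request without sending it; the helper name `build_embedding_request` is hypothetical, and the base URL and model are whatever your deployment configures.

```python
import json
import urllib.request
from typing import Optional


def build_embedding_request(
    base_url: str,
    model: str,
    text: str,
    api_key: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style embeddings request."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Local providers often need no key; hosted ones expect a bearer token.
        headers["Authorization"] = f"Bearer {api_key}"
    body = json.dumps({"model": model, "input": text}).encode("utf-8")
    url = base_url.rstrip("/") + "/embeddings"
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

Because the base URL is a parameter, the same helper works whether embeddings come from the chat upstream or from a separate endpoint.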

OpenAI-Compatible Provider Flexibility

AI Cost Firewall works with OpenAI-compatible providers without requiring provider-specific configuration blocks. Supported providers include:

  • OpenAI
  • Ollama
  • LM Studio
  • vLLM
  • LiteLLM
  • OpenRouter
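In practice, "OpenAI-compatible" means that switching providers mostly comes down to changing a base URL and, where needed, an API key. The sketch below illustrates this with a few of the providers listed above. The `Upstream` class is hypothetical and is not the gateway's configuration format, and the URLs are the commonly documented defaults for each provider, so verify them against your own deployment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Upstream:
    """Hypothetical provider-agnostic settings: one shape for every provider."""
    base_url: str
    api_key: Optional[str] = None

    def chat_completions_url(self) -> str:
        # Every OpenAI-compatible provider serves chat under this relative path.
        return self.base_url.rstrip("/") + "/chat/completions"


# Commonly documented default base URLs (assumptions; check your setup).
EXAMPLE_UPSTREAMS = {
    "openai": Upstream("https://api.openai.com/v1"),
    "ollama": Upstream("http://localhost:11434/v1"),
    "lm_studio": Upstream("http://localhost:1234/v1"),
    "vllm": Upstream("http://localhost:8000/v1"),
    "openrouter": Upstream("https://openrouter.ai/api/v1"),
}
```

Nothing in the structure is specific to one provider, which is the point: the same two fields describe a hosted API and a local model server alike.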

The deployment stack also includes:

  • Prometheus
  • Grafana
  • Overview dashboard
  • Diagnostics dashboard

See deploy/examples/.

Prometheus and Grafana

Prometheus scrapes the gateway's /metrics endpoint; Grafana visualizes cache performance, savings, and diagnostics.
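The /metrics endpoint serves the Prometheus text exposition format, which is simple enough to inspect by hand. As a rough illustration, the sketch below parses a scrape and computes a cache hit ratio. The metric names passed in are placeholders, since the actual names exposed by AI Cost Firewall are not listed here, and the parser ignores optional timestamps.

```python
from typing import Dict


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse 'name value' sample lines from Prometheus text format.

    Simplified: skips comment lines (# HELP / # TYPE) and does not
    handle the optional trailing timestamp.
    """
    values: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the last space so label sets stay attached to the name.
        name, _, value = line.rpartition(" ")
        values[name] = float(value)
    return values


def hit_ratio(values: Dict[str, float], hits_metric: str, misses_metric: str) -> float:
    """Fraction of requests served from cache; 0.0 if there is no traffic."""
    hits = values.get(hits_metric, 0.0)
    misses = values.get(misses_metric, 0.0)
    total = hits + misses
    return hits / total if total else 0.0
```

In a running deployment the same ratio would normally be computed in Grafana from the scraped series rather than by parsing the endpoint directly.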