AI Cost Firewall v0.2.1
Gateway Control and Observability Polish
AI Cost Firewall v0.2.1 is a focused operational release that builds on the pilot-ready v0.2.0 baseline.
This release improves runtime control over cache behavior, request limits, timeout handling, dashboard visibility, and evaluation workflows. The goal is to make AI Cost Firewall easier to test, tune, and operate in controlled pilot environments before broader production rollout.
The main improvements in v0.2.1 are:
- explicit exact-cache enablement controls
- separate exact-cache and semantic-cache store controls
- clearer timeout configuration for upstream and embedding calls
- request body and prompt-size protection
- cache bypass support for controlled testing
- improved configuration examples
- updated Prometheus and Grafana observability
- better operational diagnostics for cache behavior
Release Positioning
v0.2.1 positions AI Cost Firewall as a more controllable OpenAI-compatible gateway for pilot deployments.
v0.2.0 established the stable baseline. v0.2.1 adds the operational switches needed to run cleaner cache experiments, compare exact and semantic cache behavior, and protect the gateway from oversized requests.
Typical use cases include:
- comparing exact-cache, semantic-cache, and upstream-only behavior
- disabling cache writes during selected tests
- bypassing cache with a request header
- tuning upstream and embedding timeout behavior separately
- protecting the gateway from unusually large request bodies or prompts
- improving dashboard-based demonstrations and diagnostics
What Is New
Explicit Exact Cache Controls
v0.2.1 adds explicit exact-cache controls.
exact_cache_enabled true;
exact_cache_fail_open true;
These controls make exact-cache behavior easier to reason about and document separately from semantic-cache behavior.
exact_cache_enabled controls whether Redis-backed exact-cache lookup is used.
exact_cache_fail_open controls whether exact-cache errors should be skipped so the request can continue upstream.
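The interaction of the two switches can be sketched in Python. This is a minimal illustration of the semantics described above, not the gateway's implementation: `exact_lookup`, `CacheUnavailable`, and the `fetch` callable are hypothetical stand-ins for the Redis-backed lookup, and the sketch assumes that a cache error is surfaced to the caller when fail-open is turned off.

```python
from typing import Callable, Optional


class CacheUnavailable(Exception):
    """Hypothetical error raised when the exact-cache backend cannot be reached."""


def exact_lookup(
    key: str,
    fetch: Callable[[str], Optional[str]],
    enabled: bool = True,
    fail_open: bool = True,
) -> Optional[str]:
    """Return a cached response, or None when the request should go upstream."""
    if not enabled:
        # exact_cache_enabled false: skip the lookup entirely.
        return None
    try:
        return fetch(key)
    except CacheUnavailable:
        if fail_open:
            # exact_cache_fail_open true: treat the error as a miss.
            return None
        # Assumed behavior with fail-open off: surface the error.
        raise
```

A `None` result stands for "continue upstream" in this sketch, so a disabled cache and a failed-open lookup look the same to the caller.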
Separate Cache Store Controls
v0.2.1 adds separate store controls for exact and semantic cache layers.
exact_cache_store_enabled true;
semantic_cache_store_enabled true;
These settings are useful during evaluations where cache lookup should remain enabled but new entries should not be written.
Example use cases:
- measure behavior against an existing warmed cache
- prevent benchmark traffic from polluting cache state
- compare lookup-only behavior against normal lookup-and-store behavior
- temporarily pause semantic-cache writes while keeping exact-cache writes enabled
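The lookup-only idea can be illustrated with a small Python model. This is a sketch under stated assumptions rather than the gateway's code: `CacheLayer` is a hypothetical in-memory stand-in for one cache layer, and its `store_enabled` flag plays the role of exact_cache_store_enabled or semantic_cache_store_enabled.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class CacheLayer:
    """Hypothetical in-memory stand-in for one cache layer (exact or semantic)."""

    store_enabled: bool = True
    entries: Dict[str, str] = field(default_factory=dict)

    def lookup(self, key: str) -> Optional[str]:
        # Lookups are unaffected by the store switch.
        return self.entries.get(key)

    def store(self, key: str, response: str) -> bool:
        # With the store switch off, the write is skipped and state is unchanged.
        if not self.store_enabled:
            return False
        self.entries[key] = response
        return True


# Lookup-only evaluation against a warmed cache.
warmed = CacheLayer(store_enabled=False, entries={"q1": "cached answer"})
hit = warmed.lookup("q1")            # existing entries still serve hits
wrote = warmed.store("q2", "fresh")  # benchmark traffic is not written
```

Keeping reads and writes on separate switches is what lets a benchmark run repeatedly against the same cache state.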
Split Timeout Controls
v0.2.1 introduces dedicated timeout controls for upstream and embedding operations.
upstream_timeout_seconds 120;
embedding_timeout_seconds 120;
The older request_timeout_seconds setting remains available for compatibility.
request_timeout_seconds 120;
Recommended behavior:
- use upstream_timeout_seconds for chat-completion upstream calls
- use embedding_timeout_seconds for embedding calls used by semantic caching
- keep request_timeout_seconds only where compatibility with older configurations is needed
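One way a legacy shared timeout can coexist with the dedicated settings is a simple fallback chain. The Python sketch below is illustrative only: the precedence (dedicated setting first, then request_timeout_seconds, then a default) and the 120-second default are assumptions, not documented gateway behavior, and `effective_timeouts` is a hypothetical helper.

```python
from typing import Dict, Tuple

# Assumed default; 120 is simply the value used in the examples above.
DEFAULT_TIMEOUT_SECONDS = 120


def effective_timeouts(config: Dict[str, int]) -> Tuple[int, int]:
    """Return (upstream_timeout, embedding_timeout) in seconds.

    Assumed precedence: dedicated setting, then request_timeout_seconds,
    then the default.
    """
    shared = config.get("request_timeout_seconds", DEFAULT_TIMEOUT_SECONDS)
    upstream = config.get("upstream_timeout_seconds", shared)
    embedding = config.get("embedding_timeout_seconds", shared)
    return upstream, embedding
```

Under this assumed precedence, an older configuration that sets only request_timeout_seconds keeps its behavior, while a newer one can shorten the embedding timeout without touching upstream calls.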
Request and Prompt Protection
v0.2.1 improves request-size protection with explicit controls.
max_request_body_bytes 1M;
max_prompt_chars 20000;
max_request_body_bytes limits the total HTTP request body size.
max_prompt_chars limits the size, in characters, of the prompt text that the gateway collects from a request before forwarding it.
These controls help protect pilot deployments from accidental oversized requests, benchmark mistakes, and unsupported payload patterns.
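The two limits can be pictured as consecutive checks. The following Python sketch rests on assumptions: that a suffix such as `M` is a binary multiplier (so 1M is 1,048,576 bytes), that the prompt size is the summed length of the collected text parts, and that the body limit is checked first. `parse_size` and `check_request` are hypothetical helpers, not part of the gateway.

```python
from typing import List

# Assumed binary multipliers for size suffixes.
_UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(value: str) -> int:
    """Convert a size such as '1M' or '512' to a number of bytes."""
    value = value.strip().upper()
    if value and value[-1] in _UNITS:
        return int(value[:-1]) * _UNITS[value[-1]]
    return int(value)


def check_request(
    body: bytes,
    prompt_parts: List[str],
    max_body_bytes: int,
    max_prompt_chars: int,
) -> str:
    """Return 'ok' or the name of the first limit that is exceeded."""
    if len(body) > max_body_bytes:
        return "max_request_body_bytes"
    if sum(len(part) for part in prompt_parts) > max_prompt_chars:
        return "max_prompt_chars"
    return "ok"
```

The two limits are independent: a small JSON body can still carry a prompt that exceeds max_prompt_chars, and a large body can be rejected before any prompt text is collected.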
Cache Bypass Header
v0.2.1 adds configurable cache bypass support.
cache_bypass_header X-AIF-Cache-Bypass;
When the configured header is present on a request, AI Cost Firewall skips cache lookup and forwards the request upstream.
This is useful for:
- comparing cached and uncached behavior
- validating upstream availability
- running controlled benchmark probes
- confirming that savings metrics reflect actual cache behavior
Example:
curl -X POST http://localhost:8080/v1/chat/completions \
-H 'Content-Type: application/json' \
-H 'X-AIF-Cache-Bypass: 1' \
-d @request.json
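For scripted comparisons, the same request can be built in Python with and without the bypass header. This is a minimal sketch using only the standard library; the URL and the header value `1` come from the curl example, while `build_request` and `timed_call` are hypothetical helper names, and the payload contents are left to the caller.

```python
import json
import time
import urllib.request
from typing import Optional

# Endpoint taken from the curl example above.
GATEWAY_URL = "http://localhost:8080/v1/chat/completions"


def build_request(
    payload: dict, bypass_header: Optional[str] = None
) -> urllib.request.Request:
    """Build a chat-completion POST; add the bypass header when a name is given."""
    headers = {"Content-Type": "application/json"}
    if bypass_header:
        headers[bypass_header] = "1"
    return urllib.request.Request(
        GATEWAY_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def timed_call(request: urllib.request.Request, timeout: float = 120.0) -> float:
    """Send the request and return elapsed seconds (needs a running gateway)."""
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=timeout) as response:
        response.read()
    return time.perf_counter() - start
```

Passing `build_request(payload)` and `build_request(payload, "X-AIF-Cache-Bypass")` through `timed_call` against a running gateway gives a cached and an upstream-only latency for the same prompt.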
Configuration Example
listen_addr 0.0.0.0:8080;
redis_url redis://redis:6379;
upstream_provider openai_compatible;
upstream_base_url http://host.docker.internal:9000;
upstream_api_key demo-key;
embedding_provider openai_compatible;
embedding_base_url http://host.docker.internal:9000;
embedding_api_key demo-key;
embedding_model text-embedding-3-small;
qdrant_url http://qdrant:6334;
qdrant_collection aif_semantic_cache;
qdrant_vector_size 1536;
cache_ttl_seconds 2592000;
exact_cache_enabled true;
exact_cache_fail_open true;
exact_cache_store_enabled true;
semantic_cache_enabled true;
semantic_cache_fail_open true;
semantic_cache_store_enabled true;
semantic_similarity_threshold 0.92;
upstream_timeout_seconds 120;
embedding_timeout_seconds 120;
max_request_body_bytes 1M;
max_prompt_chars 20000;
cache_bypass_header X-AIF-Cache-Bypass;
allow_unknown_models_pass_through false;
model_price gpt-4o-mini-2024-07-18 0.15 0.60;
model_price gpt-4.1-mini-2025-04-14 0.30 1.20;
embedding_price 0.020;
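The model_price entries suggest how an avoided-cost figure could be derived for a cache hit. The Python sketch below assumes that the two numbers are input and output prices in USD per 1M tokens; this release note does not state the unit, so treat the arithmetic as illustrative. `avoided_cost` is a hypothetical helper.

```python
from typing import Dict, Tuple

# (input_price, output_price) copied from the model_price lines above.
# Unit assumed to be USD per 1M tokens.
MODEL_PRICES: Dict[str, Tuple[float, float]] = {
    "gpt-4o-mini-2024-07-18": (0.15, 0.60),
    "gpt-4.1-mini-2025-04-14": (0.30, 1.20),
}


def avoided_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the upstream cost avoided by serving one response from cache."""
    input_price, output_price = MODEL_PRICES[model]
    return (prompt_tokens * input_price + completion_tokens * output_price) / 1_000_000
```

Under that assumption, a cached gpt-4o-mini response with 1,000 prompt tokens and 500 completion tokens would avoid (1000 × 0.15 + 500 × 0.60) / 1,000,000 = 0.00045 USD of upstream spend.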
Observability Improvements
v0.2.1 improves dashboard usefulness for pilot evaluation.
The dashboard updates focus on:
- clearer cache-hit visibility
- better separation between runtime health and diagnostics
- improved cold-cache and warm-cache evaluation support
- cache bypass visibility
- cleaner Grafana panels for demonstrations
Recommended dashboard interpretation:
- use cache hit rate to evaluate reuse effectiveness
- use bypass request metrics to confirm controlled upstream-only tests
- use latency panels to compare cold-cache and warm-cache behavior
- use diagnostics panels to confirm Redis, Qdrant, upstream, and embedding behavior
Validation and Runtime Checks
Recommended checks after deployment:
curl http://localhost:8080/healthz
curl http://localhost:8080/readyz
curl http://localhost:8080/version
curl http://localhost:8080/metrics
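The four checks above can also be run from a script. This Python sketch uses only the standard library and assumes that an HTTP 200 response means the endpoint is healthy; `http_status` and `check_gateway` are hypothetical helper names.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict

# Endpoint paths taken from the curl checks above.
ENDPOINTS = ("/healthz", "/readyz", "/version", "/metrics")


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status of a GET, or 0 if the gateway is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # Non-2xx responses arrive as HTTPError and still carry a status code.
        return err.code
    except (urllib.error.URLError, OSError):
        return 0


def check_gateway(
    base_url: str, fetch: Callable[[str], int] = http_status
) -> Dict[str, bool]:
    """Map each endpoint path to True when it answered with HTTP 200 (assumed healthy)."""
    base = base_url.rstrip("/")
    return {path: fetch(base + path) == 200 for path in ENDPOINTS}
```

Calling `check_gateway("http://localhost:8080")` after startup returns one boolean per endpoint; the `fetch` parameter exists so the function can be exercised without a running gateway.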
Recommended configuration validation:
./target/release/ai-firewall --config configs/ai-firewall.conf --test-config
Recommended masked configuration review:
./target/release/ai-firewall --config configs/ai-firewall.conf --print-config
Upgrade Notes
When upgrading from v0.2.0 to v0.2.1:
- Add explicit exact-cache controls if they are not already present.
- Decide whether cache store controls should be enabled for your evaluation.
- Prefer upstream_timeout_seconds and embedding_timeout_seconds over the older shared timeout setting.
- Add max_prompt_chars for prompt-size protection.
- Add cache_bypass_header if you want controlled cache-bypass tests.
- Refresh Prometheus and Grafana dashboard provisioning files.
- Check /healthz, /readyz, /version, and /metrics after startup.
Known Non-Goals for v0.2.1
The following items remain outside the v0.2.1 scope:
- provider-specific configuration blocks
- native provider-specific API integrations
- administrative UI
- advanced enterprise policy orchestration
- privacy anonymization modules
- security scanning modules
- compliance reporting modules
- distributed cache coordination beyond Redis and Qdrant
Summary
AI Cost Firewall v0.2.1 strengthens the v0.2.0 pilot-ready baseline with practical runtime controls.
It improves the ability to:
- enable or disable exact and semantic cache behavior
- control cache writes independently from cache reads
- test upstream-only behavior through a bypass header
- tune upstream and embedding timeout handling
- protect the gateway from oversized requests
- run cleaner Grafana-backed pilot demonstrations
This release is recommended for pilots that need more precise cache evaluation and operational control.