AI Cost Firewall v0.2.1

Gateway Control and Observability Polish

AI Cost Firewall v0.2.1 is a focused operational release that builds on the pilot-ready v0.2.0 baseline.

This release improves runtime control over cache behavior, request limits, timeout handling, dashboard visibility, and evaluation workflows. The goal is to make AI Cost Firewall easier to test, tune, and operate in controlled pilot environments before broader production rollout.

The main improvements in v0.2.1 are:

explicit exact-cache enablement controls
separate exact-cache and semantic-cache store controls
clearer timeout configuration for upstream and embedding calls
request body and prompt-size protection
cache bypass support for controlled testing
improved configuration examples
updated Prometheus and Grafana observability
better operational diagnostics for cache behavior

Release Positioning

v0.2.1 positions AI Cost Firewall as a more controllable OpenAI-compatible gateway for pilot deployments.

v0.2.0 established the stable baseline. v0.2.1 adds the operational switches needed to run cleaner cache experiments, compare exact and semantic cache behavior, and protect the gateway from oversized requests.

Typical use cases include:

comparing exact-cache, semantic-cache, and upstream-only behavior
disabling cache writes during selected tests
bypassing cache with a request header
tuning upstream and embedding timeout behavior separately
protecting the gateway from unusually large request bodies or prompts
improving dashboard-based demonstrations and diagnostics

What Is New

Explicit Exact Cache Controls

v0.2.1 adds explicit exact-cache controls.

exact_cache_enabled true;
exact_cache_fail_open true;

These controls make exact-cache behavior easier to reason about and document separately from semantic-cache behavior.

exact_cache_enabled controls whether Redis-backed exact-cache lookup is used.

exact_cache_fail_open controls whether exact-cache errors are ignored so that the request continues upstream instead of failing.
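
The interaction of the two settings can be sketched as follows. This is an illustrative Python model, not the gateway's implementation: lookup_exact, CacheError, and the cache_get callable are hypothetical names, and re-raising the error when exact_cache_fail_open is false is an assumption about the fail-closed path.

```python
from typing import Callable, Optional


class CacheError(Exception):
    """Hypothetical error raised when the exact-cache backend fails."""


def lookup_exact(
    key: str,
    cache_get: Callable[[str], Optional[str]],
    exact_cache_enabled: bool = True,
    exact_cache_fail_open: bool = True,
) -> Optional[str]:
    """Return a cached response, or None when the request should go upstream."""
    if not exact_cache_enabled:
        return None  # lookup disabled: always go upstream
    try:
        return cache_get(key)  # hit -> cached body, miss -> None
    except CacheError:
        if exact_cache_fail_open:
            return None  # fail open: ignore the error and continue upstream
        raise  # assumed fail-closed behavior: surface the error to the caller
```

With both settings at true, a cache backend outage degrades to upstream-only behavior rather than failed requests.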

Separate Cache Store Controls

v0.2.1 adds separate store controls for exact and semantic cache layers.

exact_cache_store_enabled true;
semantic_cache_store_enabled true;

These settings are useful during evaluations where cache lookup should remain enabled but new entries should not be written.

Example use cases:

measure behavior against an existing warmed cache
prevent benchmark traffic from polluting cache state
compare lookup-only behavior against normal lookup-and-store behavior
temporarily pause semantic-cache writes while keeping exact-cache writes enabled
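
A lookup-only evaluation run can be pictured with the small Python model below. It is a sketch under stated assumptions: handle_request, the in-memory dict standing in for Redis, and the call_upstream callable are all hypothetical, and the model assumes the store flag is evaluated independently of the lookup flag.

```python
from typing import Callable, Dict


def handle_request(
    key: str,
    cache: Dict[str, str],
    call_upstream: Callable[[str], str],
    exact_cache_enabled: bool = True,
    exact_cache_store_enabled: bool = True,
) -> str:
    """Serve from the cache when possible; write back only if stores are enabled."""
    if exact_cache_enabled and key in cache:
        return cache[key]  # lookup hit
    response = call_upstream(key)
    if exact_cache_store_enabled:
        cache[key] = response  # normal lookup-and-store behavior
    return response  # with stores disabled the cache is left untouched


# Lookup-only run against a warmed cache: the miss goes upstream but is not stored.
warmed = {"q1": "cached answer"}
handle_request("q2", warmed, lambda k: "fresh answer", exact_cache_store_enabled=False)
```

After the call, the warmed cache still contains only its original entry, which is the property that keeps benchmark traffic from polluting cache state.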

Split Timeout Controls

v0.2.1 introduces dedicated timeout controls for upstream and embedding operations.

upstream_timeout_seconds 120;
embedding_timeout_seconds 120;

The older request_timeout_seconds setting remains available for compatibility.

request_timeout_seconds 120;

Recommended behavior:

use upstream_timeout_seconds for chat-completion upstream calls
use embedding_timeout_seconds for embedding calls used by semantic caching
keep request_timeout_seconds only where compatibility with older configurations is needed
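
One way to reason about the three settings together is shown below. The precedence is an assumption, since these notes do not state it: the sketch treats request_timeout_seconds as a fallback that applies only when a dedicated setting is absent, and uses 120 as a default because that is the value in the examples above. The function name effective_timeouts is hypothetical.

```python
from typing import Dict, Tuple

DEFAULT_TIMEOUT_SECONDS = 120  # assumed default, taken from the examples above


def effective_timeouts(config: Dict[str, int]) -> Tuple[int, int]:
    """Return (upstream, embedding) timeouts in seconds.

    Assumed precedence: a dedicated setting wins; otherwise the older
    request_timeout_seconds value is used; otherwise the default.
    """
    legacy = config.get("request_timeout_seconds", DEFAULT_TIMEOUT_SECONDS)
    upstream = config.get("upstream_timeout_seconds", legacy)
    embedding = config.get("embedding_timeout_seconds", legacy)
    return upstream, embedding
```

Under this model, a configuration that sets only request_timeout_seconds keeps its existing behavior, while adding embedding_timeout_seconds lets embedding calls be tuned without touching the upstream timeout.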

Request and Prompt Protection

v0.2.1 improves request-size protection with explicit controls.

max_request_body_bytes 1M;
max_prompt_chars 20000;

max_request_body_bytes limits the total HTTP request body size.

max_prompt_chars limits the size, in characters, of the prompt text that the gateway collects from a request before forwarding it.

These controls help protect pilot deployments from accidental oversized requests, benchmark mistakes, and unsupported payload patterns.
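
The two limits apply at different levels, which the following Python sketch illustrates. Several details are assumptions rather than documented behavior: that 1M means 1024 * 1024 bytes, that the collected prompt is the concatenation of string message contents in a chat-completion body, and the names parse_size and check_request.

```python
import json

_SUFFIXES = {"K": 1024, "M": 1024**2, "G": 1024**3}  # assumed binary multiples


def parse_size(value: str) -> int:
    """Convert a size such as '1M' or '4096' into a number of bytes."""
    value = value.strip().upper()
    if value and value[-1] in _SUFFIXES:
        return int(value[:-1]) * _SUFFIXES[value[-1]]
    return int(value)


def check_request(
    body: bytes,
    max_request_body_bytes: str = "1M",
    max_prompt_chars: int = 20000,
) -> str:
    """Return 'ok', 'body_too_large', or 'prompt_too_large'."""
    # The raw body limit is checked first, before any JSON parsing.
    if len(body) > parse_size(max_request_body_bytes):
        return "body_too_large"
    # Assumed prompt collection: join all string message contents.
    messages = json.loads(body).get("messages", [])
    prompt = "".join(
        m["content"] for m in messages if isinstance(m.get("content"), str)
    )
    if len(prompt) > max_prompt_chars:
        return "prompt_too_large"
    return "ok"
```

A request can pass the body limit and still be rejected by the prompt limit, for example a 25,000-character prompt in a body well under 1M.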

Cache Bypass Header

v0.2.1 adds configurable cache bypass support.

cache_bypass_header X-AIF-Cache-Bypass;

When the configured header is present on a request, AI Cost Firewall bypasses cache lookup and forwards the request directly upstream.

This is useful for:

comparing cached and uncached behavior
validating upstream availability
running controlled benchmark probes
confirming that savings metrics reflect actual cache behavior

Example:

curl -X POST http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-AIF-Cache-Bypass: 1' \
  -d @request.json
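
For scripted comparisons of cached and uncached behavior, the same request can be built from Python with and without the bypass header. This is a minimal sketch using only the standard library; build_request and send are hypothetical helper names, and the URL and header name are taken from the curl example above.

```python
import json
import urllib.request

GATEWAY_URL = "http://localhost:8080/v1/chat/completions"  # from the curl example


def build_request(payload: dict, bypass_cache: bool = False) -> urllib.request.Request:
    """Build a chat-completion request, optionally marked to bypass the cache."""
    headers = {"Content-Type": "application/json"}
    if bypass_cache:
        headers["X-AIF-Cache-Bypass"] = "1"  # must match cache_bypass_header
    return urllib.request.Request(
        GATEWAY_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 120.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Sending one request built with bypass_cache=False and one with bypass_cache=True for the same payload gives a paired cached and upstream-only measurement.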

Configuration Example

listen_addr 0.0.0.0:8080;

redis_url redis://redis:6379;

upstream_provider openai_compatible;
upstream_base_url http://host.docker.internal:9000;
upstream_api_key demo-key;

embedding_provider openai_compatible;
embedding_base_url http://host.docker.internal:9000;
embedding_api_key demo-key;
embedding_model text-embedding-3-small;

qdrant_url http://qdrant:6334;
qdrant_collection aif_semantic_cache;
qdrant_vector_size 1536;

cache_ttl_seconds 2592000;

exact_cache_enabled true;
exact_cache_fail_open true;
exact_cache_store_enabled true;

semantic_cache_enabled true;
semantic_cache_fail_open true;
semantic_cache_store_enabled true;
semantic_similarity_threshold 0.92;

upstream_timeout_seconds 120;
embedding_timeout_seconds 120;

max_request_body_bytes 1M;
max_prompt_chars 20000;

cache_bypass_header X-AIF-Cache-Bypass;

allow_unknown_models_pass_through false;

model_price gpt-4o-mini-2024-07-18 0.15 0.60;
model_price gpt-4.1-mini-2025-04-14 0.30 1.20;

embedding_price 0.020;

Observability Improvements

v0.2.1 makes the dashboards more useful for pilot evaluation.

The dashboard updates focus on:

clearer cache-hit visibility
better separation between runtime health and diagnostics
improved cold-cache and warm-cache evaluation support
cache bypass visibility
cleaner Grafana panels for demonstrations

Recommended dashboard interpretation:

use cache hit rate to evaluate reuse effectiveness
use bypass request metrics to confirm controlled upstream-only tests
use latency panels to compare cold-cache and warm-cache behavior
use diagnostics panels to confirm Redis, Qdrant, upstream, and embedding behavior
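
When reading hit rate alongside bypass metrics, it helps to be explicit about the denominator. The Python sketch below shows one reasonable definition, in which bypassed requests are excluded because they never consult the cache. This is an assumption about how to interpret the numbers, not a statement of how the shipped dashboards compute them, and the function and parameter names are hypothetical.

```python
def cache_hit_rate(exact_hits: int, semantic_hits: int, misses: int) -> float:
    """Fraction of cache-eligible requests served from either cache layer.

    Bypassed requests are deliberately left out of the inputs: they skip
    cache lookup, so counting them as misses would understate reuse.
    """
    eligible = exact_hits + semantic_hits + misses
    if eligible == 0:
        return 0.0  # no cache-eligible traffic yet
    return (exact_hits + semantic_hits) / eligible
```

If a bypass-only test run changes the reported hit rate, the panel is probably counting bypassed requests in its denominator.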

Validation and Runtime Checks

Recommended checks after deployment:

curl http://localhost:8080/healthz
curl http://localhost:8080/readyz
curl http://localhost:8080/version
curl http://localhost:8080/metrics
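
The same four checks can be scripted, for example as a post-deployment step. The following is a minimal standard-library sketch; check_endpoints is a hypothetical helper, and the base URL matches the curl commands above.

```python
import urllib.error
import urllib.request
from typing import Dict

ENDPOINTS = ("/healthz", "/readyz", "/version", "/metrics")


def check_endpoints(
    base_url: str = "http://localhost:8080", timeout: float = 5.0
) -> Dict[str, str]:
    """Return the HTTP status, or an 'unreachable' note, for each endpoint."""
    results = {}
    for path in ENDPOINTS:
        url = base_url.rstrip("/") + path
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                results[path] = str(response.status)
        except urllib.error.HTTPError as exc:
            results[path] = str(exc.code)  # server answered with a non-2xx status
        except (urllib.error.URLError, OSError) as exc:
            results[path] = f"unreachable: {exc}"  # connection or timeout failure
    return results
```

Calling check_endpoints() against a running gateway returns one entry per endpoint, which makes it easy to fail a deployment script when any value is not "200".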

Recommended configuration validation:

./target/release/ai-firewall --config configs/ai-firewall.conf --test-config

Recommended masked configuration review:

./target/release/ai-firewall --config configs/ai-firewall.conf --print-config

Upgrade Notes

When upgrading from v0.2.0 to v0.2.1:

Add explicit exact-cache controls if they are not already present.
Decide whether cache store controls should be enabled for your evaluation.
Prefer upstream_timeout_seconds and embedding_timeout_seconds over the older shared timeout setting.
Add max_prompt_chars for prompt-size protection.
Add cache_bypass_header if you want controlled cache-bypass tests.
Refresh the Prometheus configuration and the Grafana dashboard provisioning files.
Check /healthz, /readyz, /version, and /metrics after startup.
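
Before upgrading, an existing configuration file can be scanned for the settings listed above. The Python sketch below assumes the flat "name arguments;" syntax seen in the configuration example, with no comments or quoted semicolons; missing_directives is a hypothetical helper and is not part of the ai-firewall binary.

```python
from typing import List

# Settings that the upgrade notes suggest adding or reviewing.
NEW_DIRECTIVES = (
    "exact_cache_enabled",
    "exact_cache_fail_open",
    "exact_cache_store_enabled",
    "semantic_cache_store_enabled",
    "upstream_timeout_seconds",
    "embedding_timeout_seconds",
    "max_prompt_chars",
    "cache_bypass_header",
)


def missing_directives(config_text: str) -> List[str]:
    """List the directives from NEW_DIRECTIVES that the config does not set."""
    present = set()
    for statement in config_text.split(";"):
        tokens = statement.split()
        if tokens:
            present.add(tokens[0])  # first token is the directive name
    return [name for name in NEW_DIRECTIVES if name not in present]
```

An empty result means every setting named in these notes is already present; the --test-config check remains the authoritative validation.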

Known Non-Goals for v0.2.1

The following items remain outside the v0.2.1 scope:

provider-specific configuration blocks
native provider-specific API integrations
administrative UI
advanced enterprise policy orchestration
privacy anonymization modules
security scanning modules
compliance reporting modules
distributed cache coordination beyond Redis and Qdrant

Summary

AI Cost Firewall v0.2.1 strengthens the v0.2.0 pilot-ready baseline with practical runtime controls.

It improves the ability to:

enable or disable exact and semantic cache behavior
control cache writes independently from cache reads
test upstream-only behavior through a bypass header
tune upstream and embedding timeout handling
protect the gateway from oversized requests
run cleaner Grafana-backed pilot demonstrations

This release is recommended for pilots that need more precise cache evaluation and operational control.
