AI Cost Firewall v0.2.0
Pilot-Ready OpenAI-Compatible LLM Gateway
AI Cost Firewall v0.2.0 is the first pilot-ready release of the gateway.
This release stabilizes the core OpenAI-compatible gateway behavior, deployment model, runtime checks, caching flow, and observability needed to evaluate AI Cost Firewall in real pilot environments.
The main goal of v0.2.0 is to define a clear and practical compatibility model:
- OpenAI-compatible chat completions
- OpenAI-compatible embeddings
- Flat configuration
- Exact and semantic caching
- Redis-backed exact cache
- Qdrant-backed semantic cache
- Prometheus metrics
- Grafana dashboards
- Startup diagnostics
- Health and readiness endpoints
- Runtime version reporting
Release Positioning
v0.2.0 positions AI Cost Firewall as a practical gateway layer for OpenAI-compatible LLM traffic.
It can be placed between client applications and OpenAI-compatible model endpoints to reduce repeated upstream calls through exact and semantic caching while exposing cost, savings, cache, and runtime metrics.
Typical pilot use cases include:
- evaluating repeated LLM request patterns
- measuring exact and semantic cache hit rates
- estimating gross and net savings
- comparing local and cloud model endpoint behavior
- testing OpenAI-compatible deployment options
- validating operational readiness before broader adoption
Compatibility Model
AI Cost Firewall v0.2.0 uses a simple OpenAI-compatible configuration model.
Supported API style:
openai_compatible
Supported request endpoint:
/v1/chat/completions
Supported upstream model behavior:
OpenAI-compatible chat completion APIs
Supported embedding behavior:
OpenAI-compatible embedding APIs
This release intentionally keeps configuration simple. Provider-specific configuration blocks are not part of v0.2.0.
What Is Included
v0.2.0 includes the core runtime behavior expected from a pilot-ready AI gateway.
Gateway Runtime
- OpenAI-compatible chat completion gateway
- Forwarding to configurable upstream endpoints
- Support for cloud, local, and self-hosted OpenAI-compatible providers
- Request body size controls
- Runtime startup diagnostics
- Graceful shutdown handling
- Hot reload support through
SIGHUP
Caching
- Exact cache backed by Redis
- Semantic cache backed by Qdrant
- Configurable semantic similarity threshold
- Semantic cache fail-open behavior
- Cache hit and miss metrics
- Separate exact and semantic cache accounting
Cost and Savings Metrics
- Per-model request counters
- Per-model input and output token metrics
- Request cost metrics
- Gross savings metrics
- Net savings metrics
- Embedding overhead metrics
- Exact-vs-semantic savings attribution
Observability
- Prometheus
/metricsendpoint - Grafana dashboard support
- Health endpoint
- Readiness endpoint
- Runtime version endpoint
- Startup dependency checks
New /version Endpoint
v0.2.0 introduces a runtime version endpoint:
GET /version
This endpoint reports release metadata and compatibility information.
Example response:
{
"version": "0.2.0",
"release_title": "Pilot-Ready OpenAI-Compatible LLM Gateway",
"supported_api_style": "openai_compatible",
"provider_specific_config_blocks": false,
"compatibility_model": "OpenAI-compatible chat completions and embeddings through flat configuration",
"native_provider_support": "No native provider-specific API integrations or provider-specific config blocks in v0.2.0"
}
This is useful for:
- confirming the running container or binary version
- checking runtime compatibility expectations
- documenting pilot deployments
- troubleshooting mismatches between docs, images, and deployed services
Example:
curl http://localhost:8080/version
Configuration Model
v0.2.0 continues to use flat configuration directives.
Example upstream configuration:
upstream_provider openai_compatible;
upstream_base_url https://api.openai.com;
upstream_api_key sk-your-api-key;
Example embedding configuration:
embedding_provider openai_compatible;
embedding_base_url https://api.openai.com;
embedding_api_key sk-your-api-key;
embedding_model text-embedding-3-small;
qdrant_vector_size 1536;
For local or self-hosted OpenAI-compatible endpoints, the base URLs can point to local services instead:
upstream_base_url http://ollama:11434;
embedding_base_url http://ollama:11434;
The important requirement is that the upstream and embedding services expose compatible API behavior.
Provider Scope
v0.2.0 supports OpenAI-compatible providers through the same flat configuration model.
Examples of OpenAI-compatible setups include:
- OpenAI
- Ollama with OpenAI-compatible endpoints
- LM Studio
- vLLM
- LiteLLM
- OpenRouter
- other OpenAI-compatible gateways or proxies
Provider-specific configuration blocks are intentionally postponed until after v0.2.0.
This means v0.2.0 does not include separate native blocks such as:
openai { ... }
openrouter { ... }
ollama { ... }
vllm { ... }
Instead, all compatible providers are configured through the same OpenAI-compatible directive style.
Startup Diagnostics
v0.2.0 continues the startup diagnostics flow introduced in the pilot-polish phase.
At startup, AI Cost Firewall reports key runtime information such as:
- loaded configuration file
- upstream provider mode
- upstream base URL
- embedding provider mode
- semantic cache status
- Redis dependency status
- Qdrant dependency status
- upstream dependency status
- server bind address
This helps operators identify configuration and dependency problems before sending traffic through the gateway.
Health and Readiness
v0.2.0 exposes separate health and readiness endpoints.
Health endpoint:
GET /healthz
Readiness endpoint:
GET /readyz
Use /healthz to check whether the process is alive.
Use /readyz to check whether the gateway is ready to serve traffic based on required runtime dependencies.
Example:
curl http://localhost:8080/healthz
curl http://localhost:8080/readyz
Validation and Runtime Checks
Configuration can be validated before startup:
./target/release/ai-firewall --config configs/ai-firewall.conf --test-config
Expected successful output:
configuration OK
The resolved configuration can also be printed with secrets masked:
./target/release/ai-firewall --config configs/ai-firewall.conf --print-config
This is useful before pilot deployments, CI checks, and container rollout.
Operational Notes
Redis
Redis is required for exact caching.
If Redis is unavailable, readiness checks may fail and exact cache behavior will not be available.
Qdrant
Qdrant is required when semantic cache is enabled.
AI Cost Firewall uses Qdrant for semantic cache lookup and storage. The configured qdrant_vector_size must match the embedding model dimension.
Embeddings
Semantic caching depends on the configured embedding provider.
If semantic_cache_fail_open is enabled, embedding or semantic cache failures can be skipped so requests continue upstream instead of failing the whole chat request.
Hot Reload
Configuration can be reloaded without restarting the process:
kill -HUP $(pgrep ai-firewall)
In Docker Compose:
docker compose kill -s HUP ai-firewall
Known Non-Goals for v0.2.0
The following items are intentionally not part of v0.2.0:
- provider-specific configuration blocks
- native provider-specific API integrations
- non-OpenAI-compatible API styles
- advanced multi-tenant policy controls
- authentication and authorization policy layers
- administrative UI
- distributed cache coordination beyond Redis and Qdrant
These may be considered in later releases, but v0.2.0 focuses on a stable pilot-ready OpenAI-compatible gateway.
Upgrade Notes
When upgrading from v0.1.9 to v0.2.0:
- Confirm that your upstream endpoint is OpenAI-compatible.
- Confirm that your embedding endpoint is OpenAI-compatible.
- Confirm that
qdrant_vector_sizematches the embedding model dimension. - Check
/healthz,/readyz, and/versionafter startup. - Review Prometheus and Grafana metrics after sending test traffic.
- Keep using the flat provider configuration model.
Recommended checks:
curl http://localhost:8080/healthz
curl http://localhost:8080/readyz
curl http://localhost:8080/version
Summary
AI Cost Firewall v0.2.0 marks the transition from early feature iteration to a pilot-ready OpenAI-compatible gateway.
It provides a stable baseline for evaluating:
- LLM gateway deployment
- exact cache savings
- semantic cache savings
- embedding overhead
- net savings
- provider compatibility
- operational readiness
This release is intended for pilot deployments, demos, and controlled evaluations before broader production hardening.