AI Cost Firewall v0.2.0

Pilot-Ready OpenAI-Compatible LLM Gateway

AI Cost Firewall v0.2.0 is the first pilot-ready release of the gateway.

This release stabilizes the core OpenAI-compatible gateway behavior, deployment model, runtime checks, caching flow, and observability needed to evaluate AI Cost Firewall in real pilot environments.

The main goal of v0.2.0 is to define a clear and practical compatibility model:

OpenAI-compatible chat completions
OpenAI-compatible embeddings
Flat configuration
Exact and semantic caching
Redis-backed exact cache
Qdrant-backed semantic cache
Prometheus metrics
Grafana dashboards
Startup diagnostics
Health and readiness endpoints
Runtime version reporting

Release Positioning

v0.2.0 positions AI Cost Firewall as a practical gateway layer for OpenAI-compatible LLM traffic.

It can be placed between client applications and OpenAI-compatible model endpoints to reduce repeated upstream calls through exact and semantic caching while exposing cost, savings, cache, and runtime metrics.

Typical pilot use cases include:

evaluating repeated LLM request patterns
measuring exact and semantic cache hit rates
estimating gross and net savings
comparing local and cloud model endpoint behavior
testing OpenAI-compatible deployment options
validating operational readiness before broader adoption

Compatibility Model

AI Cost Firewall v0.2.0 uses a simple OpenAI-compatible configuration model.

Supported API style:

openai_compatible

Supported request endpoint:

/v1/chat/completions

Supported upstream model behavior:

OpenAI-compatible chat completion APIs

Supported embedding behavior:

OpenAI-compatible embedding APIs

This release intentionally keeps configuration simple. Provider-specific configuration blocks are not part of v0.2.0.

What Is Included

v0.2.0 includes the core runtime behavior expected from a pilot-ready AI gateway.

Gateway Runtime

OpenAI-compatible chat completion gateway
Forwarding to configurable upstream endpoints
Support for cloud, local, and self-hosted OpenAI-compatible providers
Request body size controls
Runtime startup diagnostics
Graceful shutdown handling
Hot reload support through SIGHUP

Caching

Exact cache backed by Redis
Semantic cache backed by Qdrant
Configurable semantic similarity threshold
Semantic cache fail-open behavior
Cache hit and miss metrics
Separate exact and semantic cache accounting

Cost and Savings Metrics

Per-model request counters
Per-model input and output token metrics
Request cost metrics
Gross savings metrics
Net savings metrics
Embedding overhead metrics
Exact-vs-semantic savings attribution

Observability

Prometheus /metrics endpoint
Grafana dashboard support
Health endpoint
Readiness endpoint
Runtime version endpoint
Startup dependency checks

New `/version` Endpoint

v0.2.0 introduces a runtime version endpoint:

GET /version

This endpoint reports release metadata and compatibility information.

Example response:

{
  "version": "0.2.0",
  "release_title": "Pilot-Ready OpenAI-Compatible LLM Gateway",
  "supported_api_style": "openai_compatible",
  "provider_specific_config_blocks": false,
  "compatibility_model": "OpenAI-compatible chat completions and embeddings through flat configuration",
  "native_provider_support": "No native provider-specific API integrations or provider-specific config blocks in v0.2.0"
}

This is useful for:

confirming the running container or binary version
checking runtime compatibility expectations
documenting pilot deployments
troubleshooting mismatches between docs, images, and deployed services

Example:

curl http://localhost:8080/version

Configuration Model

v0.2.0 continues to use flat configuration directives.

Example upstream configuration:

upstream_provider openai_compatible;
upstream_base_url https://api.openai.com;
upstream_api_key sk-your-api-key;

Example embedding configuration:

embedding_provider openai_compatible;
embedding_base_url https://api.openai.com;
embedding_api_key sk-your-api-key;
embedding_model text-embedding-3-small;
qdrant_vector_size 1536;

For local or self-hosted OpenAI-compatible endpoints, the base URLs can point to local services instead:

upstream_base_url http://ollama:11434;
embedding_base_url http://ollama:11434;

The important requirement is that the upstream and embedding services expose compatible API behavior.

Provider Scope

v0.2.0 supports OpenAI-compatible providers through the same flat configuration model.

Examples of OpenAI-compatible setups include:

OpenAI
Ollama with OpenAI-compatible endpoints
LM Studio
vLLM
LiteLLM
OpenRouter
other OpenAI-compatible gateways or proxies

Provider-specific configuration blocks are intentionally postponed until after v0.2.0.

This means v0.2.0 does not include separate native blocks such as:

openai { ... }
openrouter { ... }
ollama { ... }
vllm { ... }

Instead, all compatible providers are configured through the same OpenAI-compatible directive style.

Startup Diagnostics

v0.2.0 continues the startup diagnostics flow introduced in the pilot-polish phase.

At startup, AI Cost Firewall reports key runtime information such as:

loaded configuration file
upstream provider mode
upstream base URL
embedding provider mode
semantic cache status
Redis dependency status
Qdrant dependency status
upstream dependency status
server bind address

This helps operators identify configuration and dependency problems before sending traffic through the gateway.

Health and Readiness

v0.2.0 exposes separate health and readiness endpoints.

Health endpoint:

GET /healthz

Readiness endpoint:

GET /readyz

Use /healthz to check whether the process is alive.

Use /readyz to check whether the gateway is ready to serve traffic based on required runtime dependencies.

Example:

curl http://localhost:8080/healthz
curl http://localhost:8080/readyz

Validation and Runtime Checks

Configuration can be validated before startup:

./target/release/ai-firewall --config configs/ai-firewall.conf --test-config

Expected successful output:

configuration OK

The resolved configuration can also be printed with secrets masked:

./target/release/ai-firewall --config configs/ai-firewall.conf --print-config

This is useful before pilot deployments, CI checks, and container rollout.

Operational Notes

Redis

Redis is required for exact caching.

If Redis is unavailable, readiness checks may fail and exact cache behavior will not be available.

Qdrant

Qdrant is required when semantic cache is enabled.

AI Cost Firewall uses Qdrant for semantic cache lookup and storage. The configured qdrant_vector_size must match the embedding model dimension.

Embeddings

Semantic caching depends on the configured embedding provider.

If semantic_cache_fail_open is enabled, embedding or semantic cache failures can be skipped so requests continue upstream instead of failing the whole chat request.

Hot Reload

Configuration can be reloaded without restarting the process:

kill -HUP $(pgrep ai-firewall)

In Docker Compose:

docker compose kill -s HUP ai-firewall

Known Non-Goals for v0.2.0

The following items are intentionally not part of v0.2.0:

provider-specific configuration blocks
native provider-specific API integrations
non-OpenAI-compatible API styles
advanced multi-tenant policy controls
authentication and authorization policy layers
administrative UI
distributed cache coordination beyond Redis and Qdrant

These may be considered in later releases, but v0.2.0 focuses on a stable pilot-ready OpenAI-compatible gateway.

Upgrade Notes

When upgrading from v0.1.9 to v0.2.0:

Confirm that your upstream endpoint is OpenAI-compatible.
Confirm that your embedding endpoint is OpenAI-compatible.
Confirm that qdrant_vector_size matches the embedding model dimension.
Check /healthz, /readyz, and /version after startup.
Review Prometheus and Grafana metrics after sending test traffic.
Keep using the flat provider configuration model.

Recommended checks:

curl http://localhost:8080/healthz
curl http://localhost:8080/readyz
curl http://localhost:8080/version

Summary

AI Cost Firewall v0.2.0 marks the transition from early feature iteration to a pilot-ready OpenAI-compatible gateway.

It provides a stable baseline for evaluating:

LLM gateway deployment
exact cache savings
semantic cache savings
embedding overhead
net savings
provider compatibility
operational readiness

This release is intended for pilot deployments, demos, and controlled evaluations before broader production hardening.

Pilot-Ready OpenAI-Compatible LLM Gateway​

Release Positioning​

Compatibility Model​

What Is Included​

Gateway Runtime​

Caching​

Cost and Savings Metrics​

Observability​

New /version Endpoint​

Configuration Model​

Provider Scope​

Startup Diagnostics​

Health and Readiness​

Validation and Runtime Checks​

Operational Notes​

Redis​

Qdrant​

Embeddings​

Hot Reload​

Known Non-Goals for v0.2.0​

Upgrade Notes​

Summary​