Skip to main content

AI Cost Firewall v0.2.0

Pilot-Ready OpenAI-Compatible LLM Gateway

AI Cost Firewall v0.2.0 is the first pilot-ready release of the gateway.

This release stabilizes the core OpenAI-compatible gateway behavior, deployment model, runtime checks, caching flow, and observability needed to evaluate AI Cost Firewall in real pilot environments.

The main goal of v0.2.0 is to define a clear and practical compatibility model:

  • OpenAI-compatible chat completions
  • OpenAI-compatible embeddings
  • Flat configuration
  • Exact and semantic caching
  • Redis-backed exact cache
  • Qdrant-backed semantic cache
  • Prometheus metrics
  • Grafana dashboards
  • Startup diagnostics
  • Health and readiness endpoints
  • Runtime version reporting

Release Positioning

v0.2.0 positions AI Cost Firewall as a practical gateway layer for OpenAI-compatible LLM traffic.

It can be placed between client applications and OpenAI-compatible model endpoints to reduce repeated upstream calls through exact and semantic caching while exposing cost, savings, cache, and runtime metrics.

Typical pilot use cases include:

  • evaluating repeated LLM request patterns
  • measuring exact and semantic cache hit rates
  • estimating gross and net savings
  • comparing local and cloud model endpoint behavior
  • testing OpenAI-compatible deployment options
  • validating operational readiness before broader adoption

Compatibility Model

AI Cost Firewall v0.2.0 uses a simple OpenAI-compatible configuration model.

Supported API style:

openai_compatible

Supported request endpoint:

/v1/chat/completions

Supported upstream model behavior:

OpenAI-compatible chat completion APIs

Supported embedding behavior:

OpenAI-compatible embedding APIs

This release intentionally keeps configuration simple. Provider-specific configuration blocks are not part of v0.2.0.


What Is Included

v0.2.0 includes the core runtime behavior expected from a pilot-ready AI gateway.

Gateway Runtime

  • OpenAI-compatible chat completion gateway
  • Forwarding to configurable upstream endpoints
  • Support for cloud, local, and self-hosted OpenAI-compatible providers
  • Request body size controls
  • Runtime startup diagnostics
  • Graceful shutdown handling
  • Hot reload support through SIGHUP

Caching

  • Exact cache backed by Redis
  • Semantic cache backed by Qdrant
  • Configurable semantic similarity threshold
  • Semantic cache fail-open behavior
  • Cache hit and miss metrics
  • Separate exact and semantic cache accounting

Cost and Savings Metrics

  • Per-model request counters
  • Per-model input and output token metrics
  • Request cost metrics
  • Gross savings metrics
  • Net savings metrics
  • Embedding overhead metrics
  • Exact-vs-semantic savings attribution

Observability

  • Prometheus /metrics endpoint
  • Grafana dashboard support
  • Health endpoint
  • Readiness endpoint
  • Runtime version endpoint
  • Startup dependency checks

New /version Endpoint

v0.2.0 introduces a runtime version endpoint:

GET /version

This endpoint reports release metadata and compatibility information.

Example response:

{
"version": "0.2.0",
"release_title": "Pilot-Ready OpenAI-Compatible LLM Gateway",
"supported_api_style": "openai_compatible",
"provider_specific_config_blocks": false,
"compatibility_model": "OpenAI-compatible chat completions and embeddings through flat configuration",
"native_provider_support": "No native provider-specific API integrations or provider-specific config blocks in v0.2.0"
}

This is useful for:

  • confirming the running container or binary version
  • checking runtime compatibility expectations
  • documenting pilot deployments
  • troubleshooting mismatches between docs, images, and deployed services

Example:

curl http://localhost:8080/version

Configuration Model

v0.2.0 continues to use flat configuration directives.

Example upstream configuration:

upstream_provider openai_compatible;
upstream_base_url https://api.openai.com;
upstream_api_key sk-your-api-key;

Example embedding configuration:

embedding_provider openai_compatible;
embedding_base_url https://api.openai.com;
embedding_api_key sk-your-api-key;
embedding_model text-embedding-3-small;
qdrant_vector_size 1536;

For local or self-hosted OpenAI-compatible endpoints, the base URLs can point to local services instead:

upstream_base_url http://ollama:11434;
embedding_base_url http://ollama:11434;

The important requirement is that the upstream and embedding services expose compatible API behavior.


Provider Scope

v0.2.0 supports OpenAI-compatible providers through the same flat configuration model.

Examples of OpenAI-compatible setups include:

  • OpenAI
  • Ollama with OpenAI-compatible endpoints
  • LM Studio
  • vLLM
  • LiteLLM
  • OpenRouter
  • other OpenAI-compatible gateways or proxies

Provider-specific configuration blocks are intentionally postponed until after v0.2.0.

This means v0.2.0 does not include separate native blocks such as:

openai { ... }
openrouter { ... }
ollama { ... }
vllm { ... }

Instead, all compatible providers are configured through the same OpenAI-compatible directive style.


Startup Diagnostics

v0.2.0 continues the startup diagnostics flow introduced in the pilot-polish phase.

At startup, AI Cost Firewall reports key runtime information such as:

  • loaded configuration file
  • upstream provider mode
  • upstream base URL
  • embedding provider mode
  • semantic cache status
  • Redis dependency status
  • Qdrant dependency status
  • upstream dependency status
  • server bind address

This helps operators identify configuration and dependency problems before sending traffic through the gateway.


Health and Readiness

v0.2.0 exposes separate health and readiness endpoints.

Health endpoint:

GET /healthz

Readiness endpoint:

GET /readyz

Use /healthz to check whether the process is alive.

Use /readyz to check whether the gateway is ready to serve traffic based on required runtime dependencies.

Example:

curl http://localhost:8080/healthz
curl http://localhost:8080/readyz

Validation and Runtime Checks

Configuration can be validated before startup:

./target/release/ai-firewall --config configs/ai-firewall.conf --test-config

Expected successful output:

configuration OK

The resolved configuration can also be printed with secrets masked:

./target/release/ai-firewall --config configs/ai-firewall.conf --print-config

This is useful before pilot deployments, CI checks, and container rollout.


Operational Notes

Redis

Redis is required for exact caching.

If Redis is unavailable, readiness checks may fail and exact cache behavior will not be available.

Qdrant

Qdrant is required when semantic cache is enabled.

AI Cost Firewall uses Qdrant for semantic cache lookup and storage. The configured qdrant_vector_size must match the embedding model dimension.

Embeddings

Semantic caching depends on the configured embedding provider.

If semantic_cache_fail_open is enabled, embedding or semantic cache failures can be skipped so requests continue upstream instead of failing the whole chat request.

Hot Reload

Configuration can be reloaded without restarting the process:

kill -HUP $(pgrep ai-firewall)

In Docker Compose:

docker compose kill -s HUP ai-firewall

Known Non-Goals for v0.2.0

The following items are intentionally not part of v0.2.0:

  • provider-specific configuration blocks
  • native provider-specific API integrations
  • non-OpenAI-compatible API styles
  • advanced multi-tenant policy controls
  • authentication and authorization policy layers
  • administrative UI
  • distributed cache coordination beyond Redis and Qdrant

These may be considered in later releases, but v0.2.0 focuses on a stable pilot-ready OpenAI-compatible gateway.


Upgrade Notes

When upgrading from v0.1.9 to v0.2.0:

  1. Confirm that your upstream endpoint is OpenAI-compatible.
  2. Confirm that your embedding endpoint is OpenAI-compatible.
  3. Confirm that qdrant_vector_size matches the embedding model dimension.
  4. Check /healthz, /readyz, and /version after startup.
  5. Review Prometheus and Grafana metrics after sending test traffic.
  6. Keep using the flat provider configuration model.

Recommended checks:

curl http://localhost:8080/healthz
curl http://localhost:8080/readyz
curl http://localhost:8080/version

Summary

AI Cost Firewall v0.2.0 marks the transition from early feature iteration to a pilot-ready OpenAI-compatible gateway.

It provides a stable baseline for evaluating:

  • LLM gateway deployment
  • exact cache savings
  • semantic cache savings
  • embedding overhead
  • net savings
  • provider compatibility
  • operational readiness

This release is intended for pilot deployments, demos, and controlled evaluations before broader production hardening.