FAQ

General

What is AI Cost Firewall?

AI Cost Firewall is an OpenAI-compatible API gateway for caching, cost control, and operational visibility.

It sits between your application and an upstream LLM provider. Applications send requests to AI Cost Firewall instead of calling the provider directly. The firewall checks whether a response can be reused from cache and forwards only necessary requests upstream.

What problem does AI Cost Firewall solve?

LLM applications often recompute the same or similar answers many times.

AI Cost Firewall helps reduce:

  • repeated upstream LLM calls
  • unnecessary token usage
  • latency for repeated questions
  • lack of visibility into cache behavior and estimated cost savings

Is AI Cost Firewall a replacement for an LLM provider?

No.

AI Cost Firewall is a gateway layer. It does not generate responses by itself. It forwards requests to an upstream OpenAI-compatible provider when a cached response cannot be used.

Is AI Cost Firewall OpenAI-specific?

AI Cost Firewall uses an OpenAI-compatible API shape.

The primary supported endpoint is:

/v1/chat/completions

The upstream provider must expose a compatible chat completions API.

Is AI Cost Firewall free to use?

Yes. AI Cost Firewall is intended to be used as a free, open-source gateway layer for LLM traffic control, caching, and observability.


Endpoints

Which API endpoint is supported?

AI Cost Firewall currently supports:

/v1/chat/completions

Does the client application need to change?

Usually, only the base URL needs to change.

Instead of calling the upstream provider directly, the application points to AI Cost Firewall:

Application -> AI Cost Firewall -> OpenAI-compatible upstream

The request format remains OpenAI-compatible.
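As a sketch of that change, only the base URL differs between a direct provider call and a call through the firewall. The firewall host and port below are assumptions for illustration (this FAQ mentions port 8080 elsewhere), not fixed defaults.

```python
import json

# Hypothetical endpoints for illustration: a direct provider and a local
# AI Cost Firewall instance. Only the base URL differs; the request body
# stays OpenAI-compatible JSON either way.
UPSTREAM_BASE = "https://api.openai.com"   # direct provider call
FIREWALL_BASE = "http://localhost:8080"    # via AI Cost Firewall (assumed port)

def chat_completions_url(base_url: str) -> str:
    """Build the chat completions URL for any OpenAI-compatible base URL."""
    return base_url.rstrip("/") + "/v1/chat/completions"

# The payload is identical in both cases:
payload = json.dumps({
    "model": "gpt-4o-mini-2024-07-18",
    "messages": [{"role": "user", "content": "Hello"}],
})

assert chat_completions_url(FIREWALL_BASE) == "http://localhost:8080/v1/chat/completions"
assert chat_completions_url(UPSTREAM_BASE) == "https://api.openai.com/v1/chat/completions"
```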

Does AI Cost Firewall expose health endpoints?

Yes.

AI Cost Firewall exposes:

/healthz
/readyz
/metrics

/healthz is a liveness endpoint.

/readyz is a readiness endpoint.

/metrics exposes Prometheus metrics.

What is the difference between /healthz and /readyz?

/healthz is intended to show that the process is alive.

/readyz is intended to show whether the service is ready to handle traffic.

A service can be alive but not ready if dependencies are unavailable or startup initialization has not completed.


Dependencies

Do I need Redis and Qdrant?

Redis or Valkey is required for exact cache.

Qdrant is required only when semantic cache is enabled.

If semantic cache is disabled, AI Cost Firewall can run without Qdrant.

What is Redis used for?

Redis or Valkey is used for exact cache.

Exact cache stores and reuses responses for requests that normalize to the same cache key.
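One way to picture requests that "normalize to the same cache key" is canonical serialization plus hashing. This is an illustrative sketch, not AI Cost Firewall's actual key scheme.

```python
import hashlib
import json

def exact_cache_key(request_body: dict) -> str:
    """Illustrative normalization: serialize the request canonically (sorted
    keys, no insignificant whitespace) and hash it. Requests that differ only
    in JSON key order then map to the same cache key."""
    canonical = json.dumps(request_body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

a = {"model": "gpt-4o-mini-2024-07-18", "messages": [{"role": "user", "content": "Hi"}]}
b = {"messages": [{"role": "user", "content": "Hi"}], "model": "gpt-4o-mini-2024-07-18"}
assert exact_cache_key(a) == exact_cache_key(b)  # same normalized request, same key
```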

What is Qdrant used for?

Qdrant is used for semantic cache.

Semantic cache stores vector embeddings and response metadata so that similar requests can reuse previous responses when the similarity threshold is met.

Which Qdrant port should I use?

AI Cost Firewall connects to Qdrant over gRPC on port:

6334

The Qdrant HTTP port is usually:

6333

but AI Cost Firewall uses the gRPC endpoint for semantic cache operations.

Can another Qdrant container already be running on the same server?

Yes, as long as there is no Docker port conflict.

A separate Qdrant container can run on the same host as long as AI Cost Firewall is configured to connect to the correct container hostname, Docker network, and port.

Can I use Valkey instead of Redis?

Yes, Valkey can be used as a Redis-compatible exact cache backend.


Caching

What is exact cache?

Exact cache reuses a response when the normalized request matches a previously stored request.

This is the fastest and safest cache mode because it does not rely on semantic similarity.

What is semantic cache?

Semantic cache reuses a response when a new request is semantically similar to a previous request.

It uses embeddings and vector search to decide whether a cached response is close enough to reuse.

Can exact cache and semantic cache be enabled together?

Yes.

A typical request flow is:

Exact cache lookup
-> if hit, return cached response
-> if miss, optionally perform semantic lookup
-> if semantic hit, return cached response
-> if miss, forward upstream
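The flow above can be sketched in Python. The callables below are stand-ins for illustration, not AI Cost Firewall internals.

```python
def handle_request(request_key, embedding, exact_cache,
                   semantic_lookup, forward_upstream, semantic_enabled=True):
    """Illustrative lookup order: exact cache first, then optional semantic
    lookup, then upstream forwarding on a full miss."""
    cached = exact_cache.get(request_key)
    if cached is not None:
        return cached, "exact_hit"
    if semantic_enabled:
        similar = semantic_lookup(embedding)
        if similar is not None:
            return similar, "semantic_hit"
    response = forward_upstream()
    exact_cache[request_key] = response  # store for future exact hits
    return response, "miss"

# Stand-in dependencies:
cache = {"k1": "cached answer"}
resp, source = handle_request("k1", None, cache, lambda e: None, lambda: "fresh")
assert source == "exact_hit"
resp, source = handle_request("k2", None, cache, lambda e: None, lambda: "fresh")
assert source == "miss" and resp == "fresh"
```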

What happens when semantic cache is disabled?

AI Cost Firewall skips embedding generation and Qdrant lookup.

Requests still use exact cache and upstream forwarding.

What happens when semantic cache is enabled but Qdrant is unavailable?

Startup behavior depends on configuration and dependency initialization.

At runtime, if semantic cache fail-open behavior is enabled, semantic cache failures can be skipped and the request can continue upstream.

What does semantic_similarity_threshold control?

It controls how similar a new request must be to a cached request before AI Cost Firewall reuses the cached response.

Higher values are stricter.

For example:

0.95 = stricter matching
0.90 = more permissive matching

A stricter threshold reduces the risk of reusing an unsuitable cached answer.
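The threshold comparison can be sketched with cosine similarity (an assumption about the metric for illustration; vector search engines such as Qdrant commonly support it):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def reuse_cached(new_vec, cached_vec, threshold):
    """A higher (stricter) threshold rejects more borderline matches."""
    return cosine_similarity(new_vec, cached_vec) >= threshold

new, cached = [1.0, 0.0], [0.96, 0.28]   # similarity = 0.96
assert reuse_cached(new, cached, 0.90)       # permissive threshold: reuse
assert not reuse_cached(new, cached, 0.97)   # stricter threshold: miss
```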

Can expired semantic cache entries still be returned?

No.

Expired semantic entries are skipped during lookup before final cache hit decisions are made.

Are expired semantic cache entries automatically deleted?

No.

Expired semantic entries are skipped during lookup, but cleanup is handled separately through pruning.

How do I prune expired semantic cache entries?

Use the semantic cache pruning command:

ai-firewall --config configs/ai-firewall.conf --prune-expired-semantic-cache

When using Docker Compose, run the same command through the firewall image, for example:

docker compose run --rm firewall --config /configs/ai-firewall.conf --prune-expired-semantic-cache

The exact config path depends on how the config file is mounted in your Compose file.

Are streaming responses cached?

Streaming requests can be forwarded, but streaming responses are not stored in semantic cache.


Configuration

Where is AI Cost Firewall configured?

AI Cost Firewall uses an nginx-style configuration file.

A typical config file is:

configs/ai-firewall.conf

Can configuration be validated before startup?

Yes:

./target/release/ai-firewall --config configs/ai-firewall.conf --test-config

Expected output:

configuration OK

Does --test-config connect to Redis, Qdrant, or the upstream provider?

No.

--test-config performs static validation only. It checks syntax, required directives, ranges, and configuration consistency.

It does not initialize runtime dependencies.

Can I print the resolved configuration?

Yes:

./target/release/ai-firewall --config configs/ai-firewall.conf --print-config

Secrets are masked in the output.

Can configuration be reloaded without restart?

Yes.

AI Cost Firewall supports nginx-style reload with SIGHUP:

kill -HUP $(pgrep ai-firewall)

Docker Compose:

docker compose kill -s HUP firewall

Does reload interrupt in-flight requests?

Configuration reload is intended to apply updated configuration without a full process restart.

For major dependency or topology changes, a controlled restart may still be simpler and safer.

What is upstream_base_url?

upstream_base_url is the base URL of the OpenAI-compatible provider that AI Cost Firewall forwards cache misses to.

Example:

upstream_base_url https://api.openai.com;

or for a local provider:

upstream_base_url http://ollama:11434;

What is upstream_api_key?

upstream_api_key is the API key AI Cost Firewall uses when forwarding requests to the upstream provider.

This is different from any Authorization header sent by a client to AI Cost Firewall.

Does the client need to send a Bearer token to AI Cost Firewall?

Only if you have implemented or configured authentication in front of AI Cost Firewall.

If AI Cost Firewall itself is not enforcing client authentication, a manual test request may work without a client Bearer token.

The upstream API key in the config is used by AI Cost Firewall when calling the upstream provider.

What is model_price used for?

model_price defines the allowed model names and their input/output token prices.

It is used for cost estimation and model validation.

Example:

model_price gpt-4o-mini-2024-07-18 0.150 0.600;
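Assuming the two numbers are input and output USD prices per million tokens (consistent with common provider pricing and the micro-USD metrics later in this FAQ), the estimate can be sketched as follows. The price table is a hypothetical mirror of the directive above, not the firewall's internal representation.

```python
# Assumption: prices are USD per 1M input / output tokens,
# mirroring the model_price directive above.
MODEL_PRICE = {"gpt-4o-mini-2024-07-18": (0.150, 0.600)}

def estimated_cost_micro_usd(model, input_tokens, output_tokens):
    """USD-per-million-tokens times token count is micro-USD directly:
    tokens * (USD / 1e6 tokens) * 1e6 micro-USD/USD = tokens * price."""
    if model not in MODEL_PRICE:  # unknown models are rejected by default
        raise ValueError(f"model not configured in model_price: {model}")
    in_price, out_price = MODEL_PRICE[model]
    return input_tokens * in_price + output_tokens * out_price

# 1000 prompt tokens + 500 completion tokens:
cost = estimated_cost_micro_usd("gpt-4o-mini-2024-07-18", 1000, 500)
assert cost == 450.0  # 150 + 300 micro-USD = $0.00045
```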

Why are unknown models rejected?

By default, AI Cost Firewall uses model_price entries as a model allowlist.

This prevents accidental use of unpriced or unexpected models.

Can unknown models be passed through?

Yes, if enabled:

allow_unknown_models_pass_through true;

Use this carefully because cost estimation may be incomplete for unknown models.


Docker and deployment

Can AI Cost Firewall run with Docker Compose?

Yes.

Docker Compose is the recommended way to run AI Cost Firewall with Redis, Qdrant, Prometheus, and Grafana for local demos or simple deployments.

How do I validate configuration in Docker?

Use:

docker compose run --rm firewall --config /configs/ai-firewall.conf --test-config

The exact config path depends on how the config file is mounted in your Compose file.

Can I run AI Cost Firewall on the same server as Ollama?

Yes.

For example, if Ollama is reachable from the firewall container as:

http://ollama:11434

then configure the upstream base URL accordingly.

Is HTTP safe for upstream communication?

HTTP is unencrypted.

If AI Cost Firewall connects to an upstream provider over HTTP, any transmitted headers and request content are not encrypted at the transport layer.

For local Docker networks or demo environments this may be acceptable. For production or cross-host communication, HTTPS is strongly preferred.

Is HTTPS traffic to providers encrypted?

Yes.

When AI Cost Firewall connects to an HTTPS upstream such as:

https://api.openai.com

the request, including the upstream Authorization header, is encrypted in transit.

What happens if the upstream HTTPS certificate is invalid?

The request may fail and AI Cost Firewall may return an upstream error.

Certificate issues commonly occur with self-signed certificates or endpoints whose certificate does not match the hostname being used.


Observability

Does AI Cost Firewall expose Prometheus metrics?

Yes.

Metrics are available at:

/metrics

Does AI Cost Firewall include Grafana dashboards?

Yes.

The Docker Compose setup can include Grafana dashboards for overview and diagnostics.

What metrics are most important?

Commonly useful metrics include:

aif_requests_total
aif_cache_exact_hits
aif_cache_semantic_hits
aif_cache_misses
aif_tokens_saved
aif_cost_saved_micro_usd
aif_chat_cost_saved_micro_usd
aif_embedding_cost_micro_usd
aif_readiness_state
aif_shutdown_in_progress

What is the difference between gross and net cost savings?

Gross chat savings estimate the upstream chat completion cost avoided by cache hits.

Net savings subtract semantic embedding overhead from the estimated saved chat cost.
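As a minimal worked sketch of that relationship, in the micro-USD units the metrics above use (the numbers are illustrative, not real measurements):

```python
def net_savings_micro_usd(gross_chat_saved, embedding_cost):
    """Net savings = gross chat savings minus semantic embedding overhead.
    All values in micro-USD, matching the aif_*_micro_usd metrics."""
    return gross_chat_saved - embedding_cost

# Illustrative numbers: frequent semantic misses raise embedding overhead.
assert net_savings_micro_usd(450.0, 120.0) == 330.0
assert net_savings_micro_usd(100.0, 150.0) == -50.0  # overhead can exceed savings
```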

Why does cost saved show zero?

Common reasons include:

  • no cache hits occurred yet
  • the upstream model name does not match a configured model_price
  • the request uses an unknown model
  • token usage is missing or not returned by the upstream provider
  • semantic cache embedding cost offsets the saved chat cost

Why are semantic cache savings lower than expected?

Semantic cache requires embedding generation.

If the number of semantic misses is high, embedding overhead can reduce net savings.

Tuning the semantic threshold and improving cache warm-up can help.

Why are dashboards empty?

Common causes:

  • no traffic has been sent yet
  • Prometheus is not scraping AI Cost Firewall
  • Grafana is connected to the wrong Prometheus datasource
  • the selected dashboard time range does not include recent traffic
  • the firewall container is not exposing /metrics


Logs and troubleshooting

Does AI Cost Firewall save logs by default?

AI Cost Firewall writes logs to standard output and standard error.

The runtime environment decides where those logs are stored.

For Docker, logs can be viewed with:

docker compose logs firewall

Can I save logs to a file?

Yes.

For a binary run:

./target/release/ai-firewall --config configs/ai-firewall.conf > logs.txt 2>&1

For Docker Compose foreground output:

docker compose up > logs.txt 2>&1

For detached Docker Compose runs, use Docker logging:

docker compose logs firewall > logs.txt

Why does /healthz return OK while /readyz fails?

/healthz only means the process is alive.

/readyz means the service is ready to handle traffic.

If /readyz fails, check connectivity to dependencies such as Redis and Qdrant, review the configuration, and inspect the startup logs.

Why does the server not listen on port 8080?

Common causes include:

  • startup failed before binding the listener
  • Redis or Qdrant initialization failed
  • config validation failed
  • the configured listen_addr uses another port
  • another process already uses the port
  • Docker port mapping is missing or incorrect

Why do I get 502 Bad Gateway?

A 502 usually means AI Cost Firewall could not successfully complete an upstream call.

Possible causes include:

  • upstream provider is unavailable
  • upstream URL is wrong
  • upstream API key is missing or invalid
  • HTTPS certificate validation failed
  • request timeout
  • incompatible OpenAI-compatible endpoint behavior

Why do I get model validation errors?

The requested model is probably not configured in model_price.

Add a matching model_price directive or enable unknown model pass-through if appropriate.

Why does semantic cache not hit?

Common causes include:

  • semantic cache is disabled
  • Qdrant is unavailable
  • embedding provider is unavailable
  • similarity threshold is too strict
  • requests are not semantically similar enough
  • semantic cache has not been warmed up
  • cached entries expired
  • vector size does not match the embedding model dimension

Why is Qdrant vector size important?

The Qdrant collection vector size must match the embedding model dimension.

For example, text-embedding-3-small commonly uses vector size:

1536

If the configured vector size does not match the embedding output dimension, semantic cache operations will fail.
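An illustrative pre-flight check of this constraint (a sketch, not an AI Cost Firewall API):

```python
def check_vector_size(embedding, configured_size):
    """Illustrative check: the embedding dimension must equal the Qdrant
    collection's configured vector size, or upserts and searches fail."""
    if len(embedding) != configured_size:
        raise ValueError(
            f"embedding dimension {len(embedding)} != "
            f"collection vector size {configured_size}"
        )

fake_embedding = [0.0] * 1536  # text-embedding-3-small commonly outputs 1536 dims
check_vector_size(fake_embedding, 1536)  # matches: no error

try:
    check_vector_size(fake_embedding, 768)  # mismatch: semantic cache would fail
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
assert mismatch_detected
```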


Security

Does AI Cost Firewall inspect or moderate prompts?

AI Cost Firewall is primarily a gateway, cache, cost-control, and observability layer.

It is not a full prompt security, moderation, or data-loss-prevention system by itself.

Does AI Cost Firewall store prompts and responses?

Caching requires storing data needed to reuse responses.

Exact cache stores request/response data in Redis or Valkey.

Semantic cache stores embeddings and metadata in Qdrant.

Review your deployment, retention settings, and data handling requirements before enabling caching in sensitive environments.

Should AI Cost Firewall be exposed directly to the public internet?

Usually, no.

For production, place it behind appropriate network controls, authentication, TLS termination, and access policies.

Are API keys masked in printed configuration?

Yes.

When using:

./target/release/ai-firewall --config configs/ai-firewall.conf --print-config

secrets are masked in the output.


Releases

What is included in v0.1.6?

v0.1.6 focuses on configuration diagnostics and semantic cache hardening.

Highlights include:

  • --test-config for static configuration validation
  • --print-config with masked secrets
  • stricter startup dependency checks
  • clearer semantic provider diagnostics
  • semantic fail-open behavior
  • Qdrant vector size validation
  • OpenAI-compatible provider flexibility improvements

Should I deploy unreleased changes to production?

For production usage, prefer tagged releases.

Unreleased changes can be useful for testing, demos, or staging environments, but they should be validated carefully before production deployment.