llm-cost

Use this module when the gateway is becoming a shared internal platform or customer-facing product and you need trustworthy usage accounting for budgets, chargeback, billing, or finance reporting.

When to use this module

You need to compute per-request cost from prompt and completion token counts using configurable rate cards.
You need cached-input pricing for providers that discount prompt-cache reads or price cache creation separately.
You want cost events persisted to PostgreSQL for dashboarding, billing rollups, and audit trails.
You need full attribution (org, project, client, user, team, auth key fingerprint) in each cost record.
You need to distinguish accounting states: recorded, skipped_error, usage_missing, no_rate, and persist_failed.
You want translation-aware billing: records include translation_happened and auto-stamp translated traffic cohort.
You need requested-vs-effective routing attribution in cost records for policy auditability.
You want rate-card versioning so pricing changes are auditable over time.

nginx.conf synthesis

Basic cost accounting with log-level output for development.

# Rates are USD per million tokens: prompt/input first, completion/output second.
llm_cost_rate openai gpt-4o 5.00 15.00;
llm_cost_rate anthropic claude 3.00 15.00;
llm_cost_rate_unit anthropic usd;

location /v1 {
    llm_proxy;
    llm_proxy_route openai    openai_upstream;
    llm_proxy_route anthropic anthropic_upstream anthropic;
    llm_proxy_default_provider openai;

    llm_cost;
    llm_cost_backend log;

    access_log /var/log/nginx/llm-cost.log combined;

    proxy_pass https://$llm_provider_upstream;
}

Production configuration with PostgreSQL persistence, full attribution, and rate-card versioning.

# Standard input/output rates are per million tokens.
llm_cost_rate openai gpt-4o        5.00  15.00;
llm_cost_rate openai gpt-4         30.00 60.00;
llm_cost_rate openai gpt-3.5-turbo 0.50  1.50;
llm_cost_rate anthropic claude     3.00  15.00;

# Cached-input rates are also per million tokens.
# OpenAI reports cache reads; cache creation is not configured and falls back to prompt rate.
llm_cost_cached_rate openai gpt-4o 2.50;

# Anthropic reports cache reads and cache creation/write tokens.
llm_cost_cached_rate anthropic claude 0.30 3.75;

llm_cost_rate_unit openai    usd;
llm_cost_rate_unit anthropic usd;
llm_cost_backend postgres;
llm_cost_dsn "host=127.0.0.1 port=5432 dbname=darkanchor user=llm_cost password=changeme";
llm_cost_table public.llm_cost_events;
llm_cost_rate_card_version 2026-q2;

location /v1 {
    llm_proxy;
    llm_proxy_route openai    openai_upstream;
    llm_proxy_route anthropic anthropic_upstream anthropic;
    llm_proxy_default_provider openai;

    llm_auth;
    llm_auth_credential openai    env:OPENAI_KEY;
    llm_auth_tenant $http_x_tenant_id;
    llm_auth_project $http_x_project_id;
    llm_auth_org $http_x_org_id;
    llm_auth_fail_closed on;

    llm_cost;
    llm_cost_identity $llm_auth_key_fingerprint;
    llm_cost_org $llm_auth_org;
    llm_cost_project $llm_auth_project;
    llm_cost_client $llm_auth_client;
    llm_cost_user $http_x_user_id;
    llm_cost_team $http_x_team_id;
    llm_cost_auth_fingerprint $llm_auth_key_fingerprint;
    llm_cost_traffic_cohort $http_x_traffic_cohort;

    proxy_pass https://$llm_provider_upstream;
}

Directive reference

Core directives

Directive	Contexts	Default	Description
`llm_cost`	`location`	—	Enable cost accounting for this location.
`llm_cost_backend`	`http`, `server`	`log`	Persistence backend: `off`, `log`, or `postgres`.

Rate-card directives

Directive	Contexts	Default	Description
`llm_cost_rate`	`http`, `server`	—	Add a standard rate-card entry. Args: `<provider> <model-prefix> <prompt-rate-per-million> <completion-rate-per-million>`. The model prefix is matched against the effective model. Up to 32 entries. Rates must be finite and non-negative.
`llm_cost_cached_rate`	`http`, `server`	—	Add an optional cached-input rate-card entry. Args: `<provider> <model-prefix> <cache-read-rate-per-million> [<cache-create-rate-per-million>]`. When omitted, cached tokens are billed at the normal prompt rate. When only the read rate is supplied, cache-create tokens fall back to the normal prompt rate. Up to 32 entries. Rates must be finite and non-negative.
`llm_cost_rate_unit`	`http`, `server`	`usd`	Bind a provider to a pricing unit such as `usd`, `cny`, or `credits`.
`llm_cost_rate_card_version`	`http`, `server`	—	Version label stamped into every persisted accounting row. Use this to audit when pricing changes took effect.

PostgreSQL directives

Directive	Contexts	Default	Description
`llm_cost_dsn`	`http`, `server`	—	PostgreSQL connection string. Required when `llm_cost_backend postgres`.
`llm_cost_table`	`http`, `server`	`llm_cost_events`	Target table for INSERT. When unset, the module writes to `llm_cost_events`.

Attribution directives

Directive	Contexts	Default	Description
`llm_cost_identity`	`location`, `server`	—	nginx variable for the primary identity/tenant attribution.
`llm_cost_org`	`location`, `server`	—	nginx variable for org attribution.
`llm_cost_project`	`location`, `server`	—	nginx variable for project attribution.
`llm_cost_client`	`location`, `server`	—	nginx variable for client attribution.
`llm_cost_user`	`location`, `server`	—	nginx variable for user attribution.
`llm_cost_team`	`location`, `server`	—	nginx variable for team attribution.
`llm_cost_auth_fingerprint`	`location`, `server`	—	nginx variable carrying a non-secret auth key fingerprint.
`llm_cost_traffic_cohort`	`location`, `server`	—	Optional cohort label for rollout or canary attribution. Falls back to `translated` when translation happened and no explicit cohort is set, otherwise `fallback` when llm-fallback retried, otherwise `default`.

Exported variables

Variable	Description
`$llm_cost_status`	Accounting status: `recorded`, `skipped_error`, `usage_missing`, `no_rate`, or `persist_failed`.
`$llm_cost_prompt`	Computed prompt cost in the configured unit.
`$llm_cost_completion`	Computed completion cost in the configured unit.
`$llm_cost_total`	Computed total cost in the configured unit.
`$llm_cost_requested_provider`	Provider the client requested (before any policy replacement).
`$llm_cost_requested_model`	Model the client requested.
`$llm_cost_translation_happened`	`0`/`1` — whether the request was translated across dialects.
`$llm_cost_resolution_outcome`	How the request was resolved: `as_requested`, `replaced_by_policy`, etc.
`$llm_cost_org`	Org attribution from `llm_cost_org`.
`$llm_cost_project`	Project attribution.
`$llm_cost_client`	Client attribution.

Persisted PostgreSQL columns

When llm_cost_backend postgres is enabled, each request produces one INSERT with these fields:

Column	Description
`provider`	Effective provider that served the request.
`model`	Effective model.
`cost_unit`	Pricing unit for this provider.
`traffic_cohort`	Cohort label for traffic splitting.
`is_streaming`	Whether the request was streaming.
`prompt_tokens`	Total prompt/input token count. For Anthropic this includes `input_tokens + cache_read_input_tokens + cache_creation_input_tokens`; for OpenAI this is the provider-reported total prompt token count.
`completion_tokens`	Completion token count.
`total_tokens`	Total token count.
`prompt_cost`	Computed prompt/input cost. When cached-token rates match, this is the blended cost of regular input, cache reads, and cache creation.
`completion_cost`	Computed completion cost.
`total_cost`	Computed total cost.
`status`	Accounting status string.
`status_code`	HTTP status code returned to the client.
`duration_ms`	Request duration in milliseconds.
`identity`	Primary identity from `llm_cost_identity`.
`user_id`	User attribution.
`team_id`	Team attribution.
`auth_key_fingerprint`	Non-secret auth key fingerprint.
`rate_card_version`	Rate-card version at the time of the request.
`requested_provider`	Provider the client requested.
`requested_model`	Model the client requested.
`translation_happened`	Whether format translation occurred.
`resolution_outcome`	How the routing was resolved.
`org`	Org attribution.
`project`	Project attribution.
`client`	Client attribution.

Behavior notes

Cost is computed in the LOG phase after the response is complete. Usage authority comes only from llm-proxy’s extracted token counts.
Rate-card model matching uses prefix matching: a gpt-4o entry matches gpt-4o-mini unless a more specific entry exists.
All standard input, cached input, and output rates are per million tokens. The module does not support per-kilo pricing units.
Without a matching llm_cost_cached_rate, cost is calculated as: (prompt_tokens / 1,000,000) * prompt_rate + (completion_tokens / 1,000,000) * completion_rate.
With a matching llm_cost_cached_rate, prompt cost is blended as: (regular_input_tokens / 1,000,000) * prompt_rate + (cache_read_tokens / 1,000,000) * cache_read_rate + (cache_create_tokens / 1,000,000) * cache_create_or_prompt_rate.
regular_input_tokens is defensively clamped from prompt_tokens - cache_read_tokens - cache_create_tokens; malformed cache splits cannot make regular input negative.
Cached-token splits are internal pricing buckets from llm-proxy. They are not stored as extra PostgreSQL columns; persisted prompt_tokens remains total input and prompt_cost carries the blended result.
$llm_cost_status = no_rate rows are persisted to PostgreSQL with zero-cost columns so unpriced traffic is auditable.
$llm_cost_status = persist_failed is not re-persisted to avoid re-entry; usage_missing and skipped_error are not persisted by design.
Per-worker PostgreSQL connection caching: each worker maintains its own connection, reconnecting on drop.
PostgreSQL persistence uses parameterized PQexecParams — no SQL injection.
Cost is computed against effective_provider / effective_model, so fallback traffic is priced correctly against the provider that actually served the response.
Only one final usage record is emitted. If the first hop fails and nginx retries to a secondary, the first-hop partial spend is not recorded as a separate event.