llm-cost

Use this module when the gateway is becoming a shared internal platform or customer-facing product and you need trustworthy usage accounting for budgets, chargeback, billing, or finance reporting.

When to use this module

  • You need to compute per-request cost from prompt and completion token counts using configurable rate cards.
  • You need cached-input pricing for providers that discount prompt-cache reads or price cache creation separately.
  • You want cost events persisted to PostgreSQL for dashboarding, billing rollups, and audit trails.
  • You need full attribution (org, project, client, user, team, auth key fingerprint) in each cost record.
  • You need to distinguish accounting states: recorded, skipped_error, usage_missing, no_rate, and persist_failed.
  • You want translation-aware billing: records include translation_happened and auto-stamp translated traffic cohort.
  • You need requested-vs-effective routing attribution in cost records for policy auditability.
  • You want rate-card versioning so pricing changes are auditable over time.

nginx.conf synthesis

Basic cost accounting with log-level output for development.

# Rates are USD per million tokens: prompt/input first, completion/output second.
llm_cost_rate openai gpt-4o 5.00 15.00;
llm_cost_rate anthropic claude 3.00 15.00;
llm_cost_rate_unit anthropic usd;

location /v1 {
    llm_proxy;
    llm_proxy_route openai    openai_upstream;
    llm_proxy_route anthropic anthropic_upstream anthropic;
    llm_proxy_default_provider openai;

    llm_cost;
    llm_cost_backend log;

    access_log /var/log/nginx/llm-cost.log combined;

    proxy_pass https://$llm_provider_upstream;
}

Production configuration with PostgreSQL persistence, full attribution, and rate-card versioning.

# Standard input/output rates are per million tokens.
llm_cost_rate openai gpt-4o        5.00  15.00;
llm_cost_rate openai gpt-4         30.00 60.00;
llm_cost_rate openai gpt-3.5-turbo 0.50  1.50;
llm_cost_rate anthropic claude     3.00  15.00;

# Cached-input rates are also per million tokens.
# OpenAI reports cache reads; cache creation is not configured and falls back to prompt rate.
llm_cost_cached_rate openai gpt-4o 2.50;

# Anthropic reports cache reads and cache creation/write tokens.
llm_cost_cached_rate anthropic claude 0.30 3.75;

llm_cost_rate_unit openai    usd;
llm_cost_rate_unit anthropic usd;
llm_cost_backend postgres;
llm_cost_dsn "host=127.0.0.1 port=5432 dbname=darkanchor user=llm_cost password=changeme";
llm_cost_table public.llm_cost_events;
llm_cost_rate_card_version 2026-q2;

location /v1 {
    llm_proxy;
    llm_proxy_route openai    openai_upstream;
    llm_proxy_route anthropic anthropic_upstream anthropic;
    llm_proxy_default_provider openai;

    llm_auth;
    llm_auth_credential openai    env:OPENAI_KEY;
    llm_auth_tenant $http_x_tenant_id;
    llm_auth_project $http_x_project_id;
    llm_auth_org $http_x_org_id;
    llm_auth_fail_closed on;

    llm_cost;
    llm_cost_identity $llm_auth_key_fingerprint;
    llm_cost_org $llm_auth_org;
    llm_cost_project $llm_auth_project;
    llm_cost_client $llm_auth_client;
    llm_cost_user $http_x_user_id;
    llm_cost_team $http_x_team_id;
    llm_cost_auth_fingerprint $llm_auth_key_fingerprint;
    llm_cost_traffic_cohort $http_x_traffic_cohort;

    proxy_pass https://$llm_provider_upstream;
}

Directive reference

Core directives

DirectiveContextsDefaultDescription
llm_costlocationEnable cost accounting for this location.
llm_cost_backendhttp, serverlogPersistence backend: off, log, or postgres.

Rate-card directives

DirectiveContextsDefaultDescription
llm_cost_ratehttp, serverAdd a standard rate-card entry. Args: <provider> <model-prefix> <prompt-rate-per-million> <completion-rate-per-million>. The model prefix is matched against the effective model. Up to 32 entries. Rates must be finite and non-negative.
llm_cost_cached_ratehttp, serverAdd an optional cached-input rate-card entry. Args: <provider> <model-prefix> <cache-read-rate-per-million> [<cache-create-rate-per-million>]. When omitted, cached tokens are billed at the normal prompt rate. When only the read rate is supplied, cache-create tokens fall back to the normal prompt rate. Up to 32 entries. Rates must be finite and non-negative.
llm_cost_rate_unithttp, serverusdBind a provider to a pricing unit such as usd, cny, or credits.
llm_cost_rate_card_versionhttp, serverVersion label stamped into every persisted accounting row. Use this to audit when pricing changes took effect.

PostgreSQL directives

DirectiveContextsDefaultDescription
llm_cost_dsnhttp, serverPostgreSQL connection string. Required when llm_cost_backend postgres.
llm_cost_tablehttp, serverllm_cost_eventsTarget table for INSERT. When unset, the module writes to llm_cost_events.

Attribution directives

DirectiveContextsDefaultDescription
llm_cost_identitylocation, servernginx variable for the primary identity/tenant attribution.
llm_cost_orglocation, servernginx variable for org attribution.
llm_cost_projectlocation, servernginx variable for project attribution.
llm_cost_clientlocation, servernginx variable for client attribution.
llm_cost_userlocation, servernginx variable for user attribution.
llm_cost_teamlocation, servernginx variable for team attribution.
llm_cost_auth_fingerprintlocation, servernginx variable carrying a non-secret auth key fingerprint.
llm_cost_traffic_cohortlocation, serverOptional cohort label for rollout or canary attribution. Falls back to translated when translation happened and no explicit cohort is set, otherwise fallback when llm-fallback retried, otherwise default.

Exported variables

VariableDescription
$llm_cost_statusAccounting status: recorded, skipped_error, usage_missing, no_rate, or persist_failed.
$llm_cost_promptComputed prompt cost in the configured unit.
$llm_cost_completionComputed completion cost in the configured unit.
$llm_cost_totalComputed total cost in the configured unit.
$llm_cost_requested_providerProvider the client requested (before any policy replacement).
$llm_cost_requested_modelModel the client requested.
$llm_cost_translation_happened0/1 — whether the request was translated across dialects.
$llm_cost_resolution_outcomeHow the request was resolved: as_requested, replaced_by_policy, etc.
$llm_cost_orgOrg attribution from llm_cost_org.
$llm_cost_projectProject attribution.
$llm_cost_clientClient attribution.

Persisted PostgreSQL columns

When llm_cost_backend postgres is enabled, each request produces one INSERT with these fields:

ColumnDescription
providerEffective provider that served the request.
modelEffective model.
cost_unitPricing unit for this provider.
traffic_cohortCohort label for traffic splitting.
is_streamingWhether the request was streaming.
prompt_tokensTotal prompt/input token count. For Anthropic this includes input_tokens + cache_read_input_tokens + cache_creation_input_tokens; for OpenAI this is the provider-reported total prompt token count.
completion_tokensCompletion token count.
total_tokensTotal token count.
prompt_costComputed prompt/input cost. When cached-token rates match, this is the blended cost of regular input, cache reads, and cache creation.
completion_costComputed completion cost.
total_costComputed total cost.
statusAccounting status string.
status_codeHTTP status code returned to the client.
duration_msRequest duration in milliseconds.
identityPrimary identity from llm_cost_identity.
user_idUser attribution.
team_idTeam attribution.
auth_key_fingerprintNon-secret auth key fingerprint.
rate_card_versionRate-card version at the time of the request.
requested_providerProvider the client requested.
requested_modelModel the client requested.
translation_happenedWhether format translation occurred.
resolution_outcomeHow the routing was resolved.
orgOrg attribution.
projectProject attribution.
clientClient attribution.

Behavior notes

  • Cost is computed in the LOG phase after the response is complete. Usage authority comes only from llm-proxy’s extracted token counts.
  • Rate-card model matching uses prefix matching: a gpt-4o entry matches gpt-4o-mini unless a more specific entry exists.
  • All standard input, cached input, and output rates are per million tokens. The module does not support per-kilo pricing units.
  • Without a matching llm_cost_cached_rate, cost is calculated as: (prompt_tokens / 1,000,000) * prompt_rate + (completion_tokens / 1,000,000) * completion_rate.
  • With a matching llm_cost_cached_rate, prompt cost is blended as: (regular_input_tokens / 1,000,000) * prompt_rate + (cache_read_tokens / 1,000,000) * cache_read_rate + (cache_create_tokens / 1,000,000) * cache_create_or_prompt_rate.
  • regular_input_tokens is defensively clamped from prompt_tokens - cache_read_tokens - cache_create_tokens; malformed cache splits cannot make regular input negative.
  • Cached-token splits are internal pricing buckets from llm-proxy. They are not stored as extra PostgreSQL columns; persisted prompt_tokens remains total input and prompt_cost carries the blended result.
  • $llm_cost_status = no_rate rows are persisted to PostgreSQL with zero-cost columns so unpriced traffic is auditable.
  • $llm_cost_status = persist_failed is not re-persisted to avoid re-entry; usage_missing and skipped_error are not persisted by design.
  • Per-worker PostgreSQL connection caching: each worker maintains its own connection, reconnecting on drop.
  • PostgreSQL persistence uses parameterized PQexecParams — no SQL injection.
  • Cost is computed against effective_provider / effective_model, so fallback traffic is priced correctly against the provider that actually served the response.
  • Only one final usage record is emitted. If the first hop fails and nginx retries to a secondary, the first-hop partial spend is not recorded as a separate event.