QuotaKit SDK

Usage tracking and quota enforcement for AI API calls. Wrap any provider call with quotakit.track() and QuotaKit handles cost attribution, quota checks, and async log ingestion — all without proxying your traffic.

Self-reported usage

QuotaKit relies on the usage you report through the SDK or API. For accurate analytics and dependable enforcement, make sure your token counts and success/charge signals are correct.

Quickstart

Install the SDK, initialize with your API key, and wrap your first provider call.

bash
pip install quotakit

python
import quotakit
import openai

quotakit.init(api_key="aisc_...")
client = openai.OpenAI(api_key="sk-openai")

with quotakit.track("app/prod", service="openai", model="gpt-4o") as t:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Summarize this article"}],
    )
    t.result(
        input_tokens=response.usage.prompt_tokens,
        output_tokens=response.usage.completion_tokens,
    )

That's it for the happy path. The call is quota-checked before it executes, and usage is logged asynchronously once .result() is called.

SDK Flow

Understanding the lifecycle of a tracked call helps when handling edge cases.

  1. init() sets your API key and starts a background sync thread that keeps node-state (quota policies and current spend) cached locally. The sync adapts to your quota proximity — syncing every 2 minutes for open-mode paths, down to every 10 seconds as you approach a block or strict limit.
  2. track(path, service, model) opens a context manager. On enter, the SDK checks the local cache for any block-mode quota that would be exceeded — no network call needed.
  3. If a quota would be exceeded in block mode, QuotaExceeded is raised before the provider call ever happens. The prevented attempt is reported as a quota event.
  4. If allowed, your provider call runs inside the with block. You then call t.result() to report token usage and outcome.
  5. On context exit, the entry is queued and batched to /v1/log/batch asynchronously so your main thread is never blocked. The SDK adapts to your throughput: during high-volume periods it fills batches to ~2,000 entries before sending; during quiet periods it flushes within seconds of the last entry arriving.
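The five steps above can be pictured with a toy context manager. This is a minimal sketch of the flow, not the SDK's implementation: `node_state`, `queue`, `Tracker`, and the `estimated_cost` parameter are hypothetical stand-ins, and the real SDK batches and sends entries on a background thread rather than appending to a list.

```python
from contextlib import contextmanager


class QuotaExceeded(Exception):
    """Stand-in for the SDK exception of the same name."""


# Hypothetical local cache: path -> (mode, current_spend, limit), all in USD.
node_state = {"app/prod": ("block", 9.50, 10.00)}
queue = []  # entries waiting to be batched to the ingest endpoint


class Tracker:
    def __init__(self):
        self.reported = None

    def result(self, input_tokens=0, output_tokens=0, success=True):
        self.reported = {
            "status": "success" if success else "failed",
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
        }


@contextmanager
def track(path, estimated_cost=0.0):
    # Steps 2-3: check the cached state before the provider call runs.
    mode, spend, limit = node_state.get(path, ("open", 0.0, float("inf")))
    if mode == "block" and spend + estimated_cost > limit:
        raise QuotaExceeded(path)
    tracker = Tracker()
    try:
        yield tracker  # Step 4: the caller's provider call runs here.
    finally:
        # Step 5: queue the entry; a missing .result() is logged as failed.
        entry = tracker.reported or {"status": "failed"}
        queue.append({"path": path, **entry})


# Allowed call: 9.50 + 0.10 stays under the 10.00 limit.
with track("app/prod", estimated_cost=0.10) as t:
    t.result(input_tokens=5, output_tokens=7)
```

A call with `estimated_cost=1.00` on the same path would raise `QuotaExceeded` on enter, before the body of the `with` block runs.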

SDK Signatures

Full surface of the SDK.

python
quotakit.init(api_key: str)

# Tracking
quotakit.track(path: str, service: str, model: str) -> context manager

# Service configuration
quotakit.create_service(...)
quotakit.update_service(...)
quotakit.list_services()
quotakit.delete_service(service, model)

# Quota management
quotakit.create_quota(...)
quotakit.update_quota(...)
quotakit.list_quotas(...)
quotakit.delete_quota(...)

Service and quota management methods are thin wrappers over the REST endpoints documented in the API Reference below.

.result() Patterns

t.result(input_tokens, output_tokens, success, charged) normalizes usage across any provider. The shape of the response object varies — extract tokens however the provider exposes them.

python
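# OpenAI Chat Completions shape (usage.prompt_tokens / usage.completion_tokens).
# Anthropic's Messages API names these usage.input_tokens / usage.output_tokens.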
t.result(
    input_tokens=response.usage.prompt_tokens,
    output_tokens=response.usage.completion_tokens,
)

# Custom / non-standard provider
with quotakit.track("app/scraper", service="scraper_api", model="standard") as t:
    resp = scraper.fetch(...)
    usage = resp["meta"]["usage"]
    t.result(
        input_tokens=int(usage["in"]),
        output_tokens=int(usage["out"]),
    )

If tokens are omitted, QuotaKit falls back to the per-request estimate configured for that service+model. Cost is computed using the pricing in /api/sdk/services. See status semantics for how success/failure maps to cost handling.

Failed Charged vs Uncharged

Not all failures are the same. Some provider errors still consume tokens and incur cost; others (timeouts, connection drops) don't. Use the charged flag on .result() to explicitly control whether a failed call is billed. By default, charged mirrors success — so a failed call is uncharged unless you say otherwise.

python
# Failed + uncharged (default: charged mirrors success)
try:
    with quotakit.track("app/assistant", service="openai", model="gpt-4o") as t:
        response = provider_call()
        t.result(success=False)   # charged=False by default
except SomeError:
    pass

# No .result() at all also logs as failed + uncharged
try:
    with quotakit.track("app/assistant", service="openai", model="gpt-4o"):
        provider_timeout_call()   # raises before .result() is called
except TimeoutError:
    pass

# Failed + charged: provider processed the request and billed you anyway
with quotakit.track("app/assistant", service="openai", model="gpt-4o") as t:
    response = provider_call()   # e.g. returned 403 but still billed
    t.result(
        success=False,
        charged=True,
        input_tokens=response.usage.prompt_tokens,
        output_tokens=response.usage.completion_tokens,
    )

Pass charged=True when the provider billed you despite the failure (e.g. a 403 that still consumed tokens). In analytics, failed+charged entries count against quota spend; failed+uncharged entries do not. The corresponding ingest payload rules are documented at /v1/log/batch.
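The defaulting rule can be summarized in a few lines. This is a sketch under stated assumptions, not SDK code: `resolve_outcome` is a hypothetical helper, and representing an uncharged entry as `usd=None` is an assumption based on the ingest rule that a failed entry counts as charged only when its cost is non-null and positive.

```python
def resolve_outcome(success=True, charged=None, cost=0.0):
    """Return the status and cost an entry would carry (illustrative only)."""
    if charged is None:
        charged = success  # default: charged mirrors success
    return {
        "status": "success" if success else "failed",
        "usd": cost if charged else None,  # None marks the entry as uncharged
    }


ok = resolve_outcome(success=True, cost=0.001)
failed_free = resolve_outcome(success=False, cost=0.001)
failed_billed = resolve_outcome(success=False, charged=True, cost=0.0007)
```

Here `failed_free` carries no cost and would not count against quota spend, while `failed_billed` keeps its cost.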

Quota Modes (Open / Block / Strict)

Quotas attach to hierarchy nodes and cascade to children. Every policy has a mode that determines how enforcement behaves. The default is open so production traffic is never blocked unless you opt in.

  • Open (default) - always allows the call. Overages are logged and flagged, but no exception is thrown.
  • Block - predicts locally and raises QuotaExceeded before the provider call.
  • Strict - reservation-based enforcement across pods. Blocks if a reservation cannot be acquired.

python
from quotakit import QuotaExceeded

try:
    with quotakit.track("app/team/feature", service="openai", model="gpt-4o") as t:
        response = client.chat.completions.create(...)
        t.result(input_tokens=..., output_tokens=...)
except QuotaExceeded as e:
    print(e.path)           # "app/team/feature"
    print(e.service)        # "openai"
    print(e.current_spend)  # current period spend in USD
    print(e.limit)          # configured limit

An ancestor node policy (e.g. on app) can block calls at any child path (e.g. app/team/feature). Manage policies via the Quota Editor (Projects tab) or /api/sdk/quotas. Node state is synced via /api/sdk/node-state and updated opportunistically in /v1/log/batch responses.
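To illustrate the cascade, the sketch below walks a path from the root down and reports the first block-mode policy that would be exceeded. Both helpers and the shape of `policies` (path mapped to mode, current spend, and limit) are hypothetical; the SDK's actual lookup is not documented here.

```python
def ancestor_paths(path):
    """Return the path and all of its ancestors, root first."""
    parts = path.split("/")
    return ["/".join(parts[:i]) for i in range(1, len(parts) + 1)]


def first_blocking_node(path, policies, estimated_cost=0.0):
    """Return the first node whose block-mode limit would be exceeded, or None."""
    for node in ancestor_paths(path):
        policy = policies.get(node)
        if policy is None:
            continue
        mode, spend, limit = policy
        if mode == "block" and spend + estimated_cost > limit:
            return node
    return None


# Hypothetical policy on the root node only.
policies = {"app": ("block", 100.0, 100.0)}
blocked_by = first_blocking_node("app/team/feature", policies, estimated_cost=0.5)
```

With the policy above, `blocked_by` is `"app"` even though no policy is attached to `app/team/feature` itself.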

Enforcement logic (simplified)

text
if mode == "open":
    allow()
elif mode == "block":
    projected = current_spend + pending_requests * avg_cost + estimated_cost
    if projected > limit:
        raise QuotaExceeded
    allow()
elif mode == "strict":
    if not reservation.acquire(estimated_cost):
        raise QuotaExceeded
    allow()

Block is enforced by the SDK before the provider call: the SDK checks local quota state (refreshed via background sync and flush-response updates) and raises QuotaExceeded if the estimated cost would exceed any block-mode limit. The ingest server records usage but does not reject entries that exceed block-mode quotas — enforcement is client-side.

However, block mode has no cross-server coordination. Two batches arriving simultaneously on different servers will each see the same pre-insert snapshot and may both approve entries that together exceed the limit. The potential overshoot is bounded by the total cost of all concurrent batches running at the moment the limit is crossed.

Strict uses reservations so every server shares one authoritative spend ledger. Availability is computed as:

text
available = limit - logged_spend - sum(all_active_reservations)

Reservations are sized to recent burn rate and renewed automatically. If a server crashes, its reservation expires within a few minutes and the quota is freed. Strict mode adds a network round-trip before the provider call and fails closed if the reservation service is unreachable.
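A toy ledger makes the availability formula concrete. This is a sketch only: `ReservationLedger` and its methods are hypothetical names, reservation sizing by burn rate and time-based expiry are omitted, and `expire` simply drops a server's reservation the way a timeout would.

```python
class ReservationLedger:
    """Single shared ledger: available = limit - logged_spend - reservations."""

    def __init__(self, limit, logged_spend=0.0):
        self.limit = limit
        self.logged_spend = logged_spend
        self.reservations = {}  # server_id -> reserved amount

    def available(self):
        return self.limit - self.logged_spend - sum(self.reservations.values())

    def acquire(self, server_id, amount):
        """Reserve `amount` for a server, or refuse if it does not fit."""
        if amount > self.available():
            return False
        self.reservations[server_id] = self.reservations.get(server_id, 0.0) + amount
        return True

    def expire(self, server_id):
        """Free a crashed or idle server's reservation."""
        self.reservations.pop(server_id, None)


ledger = ReservationLedger(limit=100.0, logged_spend=60.0)
first = ledger.acquire("server-a", 25.0)   # fits: 40 available
second = ledger.acquire("server-b", 20.0)  # refused: only 15 left
ledger.expire("server-a")                  # server-a's reservation lapses
third = ledger.acquire("server-b", 20.0)   # fits again: 40 available
```

Because every acquire goes through the same ledger, two servers cannot both claim the last of the budget, which is the property block mode lacks.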

Tradeoffs by mode

  • Open: zero latency, maximum availability, highest risk of overage (no blocking).
  • Block: zero extra round-trips, cannot overshoot within a single batch, small race window if multiple servers submit batches simultaneously.
  • Strict: strongest correctness across multiple servers, additional latency per call, may block if the reservation service is unavailable.

Worst-case overage estimate (block — concurrent servers only)

text
overage_usd <= C_max * (in_flight_total + R_total * T_sync)

T_sync is adaptive: 10–120 s based on quota proximity.
At 90% usage, T_sync drops to 10 s.
  • C_max: max cost per request (USD or credits).
  • in_flight_total: concurrent requests already in flight across servers.
  • R_total: aggregate request rate (req/sec) across servers.
  • T_sync: SDK sync interval (10–120 s, adaptive based on quota proximity).
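As a worked example of the bound, the sketch below plugs in made-up numbers. The function name and all input values are illustrative assumptions, not measurements.

```python
def worst_case_overage(c_max, in_flight_total, r_total, t_sync):
    """Upper bound on block-mode overage, in the same unit as c_max."""
    return c_max * (in_flight_total + r_total * t_sync)


# Hypothetical workload: $0.02 per request, 50 requests in flight, 5 req/sec.
near_limit = worst_case_overage(0.02, 50, 5, t_sync=10)     # close to the limit
far_from_limit = worst_case_overage(0.02, 50, 5, t_sync=120)  # far from it
```

With these inputs the bound is about 2 USD at the 10 s interval and about 13 USD at the 120 s interval, which shows why the adaptive sync matters most as spend approaches the limit.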

Practical guidance

  • Use Open for experiments and low-risk paths. It is the default.
  • Use Block for most production paths. Single-server deployments get hard enforcement; multi-server deployments get a small concurrent-batch window.
  • Use Strict for hard caps where any overshoot is unacceptable — especially high-concurrency multi-server deployments.
  • Set quotas 5-10% below your real hard limit when using open or block modes with multiple servers.
  • Strict mode limit: each account may hold at most 100 open reservations at one time. If this cap is reached, track() raises QuotaExceeded with reason="reservation_limit_reached". Reservations are released automatically when POST /api/sdk/release-batch is called or when they expire.

Where to set it: Dashboard → Projects → select a node → Quota Editor → Mode (Open, Block, Strict). In the API, mode can be open, block, or strict.

Quota Events

When a call is prevented, QuotaKit records a quota event. These events appear in your analytics dashboard and count as prevented attempts without mixing them into request logs or spend totals.

json
{
  "event_id": "uuid",
  "path": "app/team/feature",
  "service": "openai",
  "model": "gpt-4o",
  "enforcement_mode": "block",
  "limit_type": "usd",
  "reason": "monthly spend limit exceeded"
}

Async + Streaming

v1 does not auto-wrap async or streaming clients. Instrument them manually — the pattern is the same: open track(), run your call, call .result() when you have final usage.

python
# Async
import asyncio, quotakit
from openai import AsyncOpenAI

quotakit.init(api_key="aisc_...")
client = AsyncOpenAI()

async def run():
    with quotakit.track("app/async", service="openai", model="gpt-4o") as t:
        resp = await client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": "Hello"}],
        )
        t.result(
            input_tokens=resp.usage.prompt_tokens,
            output_tokens=resp.usage.completion_tokens,
        )

asyncio.run(run())

python
# Streaming: finalize once the stream ends
with quotakit.track("app/stream", service="openai", model="gpt-4o") as t:
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Stream this"}],
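        stream=True,
        # Without include_usage, OpenAI streams carry no usage and .result() below is never called
        stream_options={"include_usage": True},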
    )

    final_usage = None
    for chunk in stream:
        if getattr(chunk, "usage", None):
            final_usage = chunk.usage

    if final_usage is not None:
        t.result(
            input_tokens=final_usage.prompt_tokens,
            output_tokens=final_usage.completion_tokens,
        )

API Reference

Authentication

All SDK and ingest endpoints authenticate with your QuotaKit API key as a Bearer token.

bash
curl https://ingest.quotakit.io/api/sdk/services \
  -H "Authorization: Bearer aisc_..."

Server-side only — never call this API from a browser

QuotaKit API keys must be kept secret. Embedding a key in browser JavaScript exposes it to anyone who opens devtools — they could log arbitrary usage against your account or read your quota configuration. The ingest API has no CORS support and is not designed for browser clients. Call it only from your backend: a Node.js server, a Python service, or a serverless function.
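From a Python backend, the Bearer header can be attached with the standard library alone. This is a minimal sketch: `build_request` is a hypothetical helper, the base URL is taken from the curl example above, and the request is only constructed here, not sent.

```python
import urllib.request


def build_request(path, api_key, base_url="https://ingest.quotakit.io"):
    """Build an authenticated GET request for a QuotaKit endpoint."""
    return urllib.request.Request(
        base_url + path,
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


# Placeholder key; load the real one from server-side configuration.
req = build_request("/api/sdk/services", api_key="aisc_example")
```

Passing `req` to `urllib.request.urlopen` would perform the call; keep that on the server so the key never reaches a browser.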

Monthly spend cap

Paid plans with metered overage support a customer-set dollar ceiling. Set it from the Billing tab. When your projected ingest overage cost reaches the cap, new SDK calls receive a 402 spend_cap_reached response and batches are dropped until the next billing period (or until you raise/remove the cap). Rejections are served from a Redis fast-path so hitting the cap costs near-zero per request.

Rate limits

All limits are per API key, per 1-minute sliding window. Requests that exceed the limit receive a 429 rate_limit_exceeded response. The counter resets automatically — no backoff is required beyond a brief retry.

| Endpoint | Limit | Notes |
| --- | --- | --- |
| GET /api/sdk/services | 120 req / min | Service pricing reads. Low traffic: called once at SDK init. |
| GET /api/sdk/quotas | 240 req / min | Quota policy reads. |
| GET /api/sdk/node-state | 240 req / min | SDK background sync. Adaptive: syncs every 2 min for open-mode paths, down to 10 s as paths approach block/strict limits. |
| POST /api/sdk/reserve-batch | 1200 req / min | Strict-mode reservation acquire/renew. One request per active (path, service, model) per TTL window. |
| POST /api/sdk/release-batch | 1200 req / min | Strict-mode reservation release. Fire-and-forget; called on quota exhaustion or process exit. |
| /v1/log/batch | 300 req / min | Usage log ingest. The SDK batches entries adaptively, filling to ~2,000 entries per call during high throughput, or flushing sooner when traffic is light. |
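If you call these endpoints directly, a small client-side limiter can keep you under a per-key limit. The sketch below is one way to model a 1-minute sliding window; `SlidingWindowLimiter` is a hypothetical class, and the server's exact counting algorithm may differ.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `limit` calls in any `window_s`-second window."""

    def __init__(self, limit, window_s=60.0, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self._stamps = deque()  # timestamps of recent allowed calls

    def try_acquire(self):
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_s:
            self._stamps.popleft()
        if len(self._stamps) >= self.limit:
            return False
        self._stamps.append(now)
        return True


# 120 req / min matches the GET /api/sdk/services limit in the table.
services_limiter = SlidingWindowLimiter(limit=120)
allowed = services_limiter.try_acquire()
```

When `try_acquire()` returns False, wait briefly and retry instead of sending a request that would receive a 429.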

/api/sdk/services

bash
GET    /api/sdk/services
POST   /api/sdk/services
PUT    /api/sdk/services
DELETE /api/sdk/services?service=<name>&model=<name>

Defines how QuotaKit prices a service+model combination. Used to compute USD cost from token counts.

json
{
  "service": "scraper_api",
  "model": "standard",
  "currency_type": "credits",
  "price_per_request": 3,
  "price_per_input_unit": 0,
  "input_unit_size": 1000000,
  "price_per_output_unit": 0,
  "output_unit_size": 1000000
}

  • 409 on POST with a duplicate service+model.
  • 404 on PUT/DELETE when the service+model doesn't exist.
  • 400 for invalid currency_type or non-numeric price fields.
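The pricing fields suggest a straightforward cost calculation. The formula below is an assumption about how they combine (a flat per-request price plus per-unit token prices), not a documented rule, and the second pricing record uses made-up numbers.

```python
def compute_cost(pricing, input_tokens=0, output_tokens=0):
    """Estimate cost in the service's currency_type (assumed formula)."""
    input_units = input_tokens / pricing["input_unit_size"]
    output_units = output_tokens / pricing["output_unit_size"]
    return (
        pricing["price_per_request"]
        + input_units * pricing["price_per_input_unit"]
        + output_units * pricing["price_per_output_unit"]
    )


# The scraper_api example above: flat 3 credits per request.
scraper = {
    "price_per_request": 3,
    "price_per_input_unit": 0, "input_unit_size": 1000000,
    "price_per_output_unit": 0, "output_unit_size": 1000000,
}
# Hypothetical token-priced model: 2.0 per 1M input, 8.0 per 1M output.
token_priced = {
    "price_per_request": 0,
    "price_per_input_unit": 2.0, "input_unit_size": 1000000,
    "price_per_output_unit": 8.0, "output_unit_size": 1000000,
}

scraper_cost = compute_cost(scraper, input_tokens=1200, output_tokens=300)
token_cost = compute_cost(token_priced, input_tokens=500000, output_tokens=250000)
```

Under this assumption the scraper call costs 3 credits regardless of token counts, and the token-priced call costs 1.0 for input plus 2.0 for output.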

/api/sdk/quotas

bash
GET    /api/sdk/quotas?node_path=app&service=openai&model=gpt-4o
POST   /api/sdk/quotas
PUT    /api/sdk/quotas
DELETE /api/sdk/quotas?node_path=app&service=openai&model=gpt-4o

json
{
  "node_path": "app/api",
  "service": "openai",
  "model": "gpt-4o",
  "limit_dollars": 100,
  "window_type": "monthly",
  "mode": "strict"
}

  • mode: open (default), block, or strict.
  • limit_credits requires service and model.
  • 409 on POST duplicate scope; 404 on PUT/DELETE missing scope.

Strict mode uses reservations for cross-pod coordination and may allow small overshoot within a reservation window.

/api/sdk/node-state

bash
GET /api/sdk/node-state?path=app/api

Used by the SDK's background sync thread. Returns the effective policy state and current spend totals for a node path, which the SDK caches locally for fast quota checks without a per-call network round-trip.

/v1/log/batch

Ingest endpoint for batched usage logs. The SDK calls this asynchronously — you don't call it directly in normal usage, but the schema is useful when building integrations or debugging.

json
{
  "entries": [
    {
      "path": "app/api",
      "service": "openai",
      "model": "gpt-4o",
      "input_tokens": 100,
      "output_tokens": 60,
      "usd": 0.001,
      "status": "success",
      "request_id": "uuid"
    },
    {
      "path": "app/api",
      "service": "openai",
      "model": "gpt-4o",
      "status": "failed",
      "usd": 0.0007,
      "request_id": "uuid2"
    }
  ]
}

  • entries must be non-empty.
  • Each entry requires path and service.
  • Paths are validated for format and max depth.
  • A failed entry is charged if usd or credits is non-null and positive; uncharged otherwise. The SDK sets this automatically based on the charged flag.
  • The response includes quota_state: an array of path + policy objects with approximate current_spend, returned for up to 5 paths in the batch. Values reflect a pre-insert snapshot and may be off by one batch's worth of spend. The SDK uses this to update its local quota state without a separate sync call; for an authoritative figure use the dashboard or a fresh node-state call.
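When building an integration, it can help to check a payload against the request rules above before sending it. This is a client-side sketch with hypothetical helper names; it does not reproduce the server's path format or depth validation.

```python
def validate_batch(payload):
    """Return a list of problems found in a /v1/log/batch payload."""
    entries = payload.get("entries")
    if not isinstance(entries, list) or not entries:
        return ["entries must be a non-empty list"]
    errors = []
    for i, entry in enumerate(entries):
        for field in ("path", "service"):
            if not entry.get(field):
                errors.append(f"entries[{i}] is missing {field}")
    return errors


def failed_entry_is_charged(entry):
    """A failed entry is charged if usd or credits is non-null and positive."""
    amounts = (entry.get("usd"), entry.get("credits"))
    return any(isinstance(a, (int, float)) and a > 0 for a in amounts)


payload = {
    "entries": [
        {"path": "app/api", "service": "openai", "model": "gpt-4o",
         "input_tokens": 100, "output_tokens": 60, "usd": 0.001,
         "status": "success", "request_id": "uuid"},
        {"path": "app/api", "service": "openai", "model": "gpt-4o",
         "status": "failed", "usd": 0.0007, "request_id": "uuid2"},
    ]
}
problems = validate_batch(payload)
second_is_charged = failed_entry_is_charged(payload["entries"][1])
```

For the example payload from this section, `problems` is empty and the failed entry counts as charged because its usd value is positive.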

Status semantics

| Status | Meaning | Cost handling |
| --- | --- | --- |
| success | Provider call completed successfully. | Cost computed from tokens or per-request estimate. |
| failed | Call failed after starting (transport, provider, or app error). | Uncharged by default. Pass charged=True to .result() to log cost. |

Prevented calls (blocked by quota enforcement before the provider call was made) are not recorded as request logs. They appear as quota events instead, with zero charge, so they never inflate your spend totals.

Error matrix

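| Status | Meaning |
| --- | --- |
| 400 | invalid request fields (e.g. invalid currency_type or non-numeric price fields) |
| 401 | Missing or invalid Authorization header |
| 402 | quota_exceeded (strict-mode quota denied) or spend_cap_reached (monthly spend cap reached) |
| 404 | service or quota not found (update / delete) |
| 409 | duplicate service or quota scope (create) |
| 429 | rate_limit_exceeded (API key rate limit) or ingest_api_call_limit_reached (plan call limit) |
| 500 | internal server error |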