LLM Bastion Caching Strategy - Technical Card

🗄️ Hierarchical Caching (L1 & L2)

The bastion uses a two-layer caching mechanism to maximize efficiency while ensuring 100% data integrity.

Layer 1: Exact Match (L1)

  • Goal: Instant response for identical queries.
  • Logic: The cache key is a SHA-256 hash of the normalized request.
  • Normalization:
    • Keys are sorted alphabetically.
    • Metadata like user_id, request_id, and stream are removed.
    • Message content is trimmed and lowercased.
  • Performance: < 1ms latency.
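The normalization and hashing steps above can be sketched as follows. This is a minimal illustration, not the bastion's actual code: it assumes an OpenAI-style request dict with a `messages` list, and the names `IGNORED_KEYS`, `normalize_request`, and `l1_cache_key` are hypothetical.

```python
import hashlib
import json

# Metadata fields the card lists as removed before hashing.
IGNORED_KEYS = {"user_id", "request_id", "stream"}


def normalize_request(request: dict) -> dict:
    """Drop volatile metadata, then trim and lowercase message content."""
    cleaned = {k: v for k, v in request.items() if k not in IGNORED_KEYS}
    if "messages" in cleaned:  # assumed request shape, not confirmed by the card
        cleaned["messages"] = [
            {**m, "content": m["content"].strip().lower()}
            if isinstance(m.get("content"), str)
            else m
            for m in cleaned["messages"]
        ]
    return cleaned


def l1_cache_key(request: dict) -> str:
    """Return the SHA-256 hex digest of the canonical JSON form."""
    # sort_keys=True gives the alphabetical key ordering at every nesting level.
    canonical = json.dumps(
        normalize_request(request), sort_keys=True, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two requests that differ only in key order, whitespace, letter case, or the stripped metadata fields produce the same key under this sketch, which is what makes the L1 lookup a plain dictionary or key-value fetch.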

Layer 2: Semantic Match (L2)

  • Goal: Cache hits for paraphrased queries (e.g. "Hi" vs "Hello").
  • Logic: Vector similarity search using Cosine Similarity.
  • Safety Threshold: > 0.98 similarity is required for a HIT.
  • Performance: ~10ms latency (excluding embedding).
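As a rough sketch of the L2 decision rule, the snippet below computes cosine similarity and accepts the best candidate only when it exceeds 0.98. It assumes embeddings are already available as plain lists of floats and uses a linear scan for clarity; a real deployment would more likely query a vector index. The names `cosine_similarity` and `l2_lookup` are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

SIMILARITY_THRESHOLD = 0.98  # safety threshold stated in the card


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat as no match
    return dot / (norm_a * norm_b)


def l2_lookup(
    query_vec: Sequence[float],
    entries: List[Tuple[Sequence[float], str]],
) -> Optional[str]:
    """Return the cached response of the closest entry, or None on a MISS."""
    best_score, best_response = -1.0, None
    for vec, response in entries:
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_score, best_response = score, response
    # Strictly greater than the threshold, matching "> 0.98" in the card.
    return best_response if best_score > SIMILARITY_THRESHOLD else None
```

Keeping the threshold as a strict inequality means a candidate scoring exactly 0.98 is treated as a MISS, which errs on the side of calling the model again.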

🛡️ Multi-Tenant Security Mask

Your data is private by default. Each cache entry is "salted" with the tenant_id, so an identical query from another tenant does not match it.

Tenant A      | Tenant B      | Result
What is 1+1?  | What is 1+1?  | MISS (for Tenant B)

NOTE

Only explicitly marked "Public" queries can bypass the tenant salt to serve common knowledge answers.
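One way to picture the tenant salt is to mix the tenant_id into the cache key before lookup. The sketch below is an assumption about how such salting could work, not the bastion's documented scheme; the function name `tenant_scoped_key`, the `public` flag, and the `scope|hash` layout are all hypothetical.

```python
import hashlib


def tenant_scoped_key(request_hash: str, tenant_id: str, public: bool = False) -> str:
    """Derive a cache key that is private to one tenant unless marked public."""
    # The "tenant:" prefix keeps a tenant literally named "public" from
    # colliding with the shared public scope.
    scope = "public" if public else f"tenant:{tenant_id}"
    return hashlib.sha256(f"{scope}|{request_hash}".encode("utf-8")).hexdigest()


# Same request hash, different tenants: the keys differ, so Tenant B misses.
key_a = tenant_scoped_key("abc123", "tenant-a")
key_b = tenant_scoped_key("abc123", "tenant-b")

# Explicitly public queries share one key regardless of tenant.
pub_a = tenant_scoped_key("abc123", "tenant-a", public=True)
pub_b = tenant_scoped_key("abc123", "tenant-b", public=True)
```

Because the salt is part of the key itself, isolation does not depend on a filter being applied at query time: an entry written by Tenant A is simply not addressable by Tenant B.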


⏳ Policy & Control

For details on TTL categories, cache bypass headers, and business rules, please refer to the semantic_caching.feature living documentation.

Response headers will include X-Bastion-Cache: HIT-L1, HIT-L2, or MISS.
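A client can read this header to tell which layer served a response. The helper below is a small sketch under the assumption that the header carries exactly one of the three values listed; `parse_cache_status` is a hypothetical name, and header names are matched case-insensitively as HTTP requires.

```python
from typing import Mapping, Optional, Tuple


def parse_cache_status(headers: Mapping[str, str]) -> Tuple[bool, Optional[str]]:
    """Return (is_hit, layer) where layer is "L1", "L2", or None."""
    lowered = {name.lower(): value for name, value in headers.items()}
    value = lowered.get("x-bastion-cache", "").strip().upper()
    if value == "HIT-L1":
        return True, "L1"
    if value == "HIT-L2":
        return True, "L2"
    # MISS, a missing header, or an unrecognized value all count as no hit.
    return False, None
```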