How We Approached Context Caching in Our GenAI Platform

In any GenAI platform, maintaining context continuity across user sessions is a key challenge. Without memory, the AI feels disconnected — it forgets who the user is, what was said earlier, and the overall conversation purpose. That’s where context caching becomes essential.

At our platform, we designed a flexible Context Caching Layer that keeps conversations intelligent, lightweight, and cost-efficient — without overloading the model with long histories.


What Is Context Caching?

When users interact with a conversational or task-based GenAI system, the platform must remember key details such as:

  • Who the user is
  • What topic they are discussing
  • What entities or preferences were mentioned earlier
  • What system instructions or roles were set

Instead of sending the entire conversation to the LLM on each request, we cache key summaries, extracted entities, and structured context. This approach improves response time, reduces cost, and ensures continuity in user experience.


Our Architecture Overview

At a high level, the context caching system is built around a Context Manager, the central component that manages:

  • System prompt caching
  • Conversation summarisation
  • Entity extraction and storage
  • Context assembly before each model call

Redis is used as the fast-access store for all cached information.


Step 1: Structured System Prompting

Every conversation starts with a system prompt — a structured instruction block that defines the AI’s behaviour, tone, and objective.

We designed this prompt in a JSON or YAML format so it can be versioned and cached.

Example:

system_prompt:
  role: "Conversational AI Assistant"
  tone: "Professional yet friendly"
  goal: "Help users complete GenAI tasks efficiently"
  memory_instructions: "Use cached entities from Redis"

This structured approach allows dynamic updates and better control across environments and user types.


Step 2: System Prompt Configuration and Caching

System prompts are stored and retrieved from Redis using a consistent key structure:

redis.set(f"system_prompt:{workspace_id}", json.dumps(prompt_data))

When a user message arrives, the system retrieves the cached prompt:

prompt = redis.get(f"system_prompt:{workspace_id}")

This ensures:

  • Quick retrieval and lower database load
  • Version control across environments
  • Personalisation for users or teams

Step 3: Capturing Conversation Summary

To avoid passing the full conversation each time, the system generates summaries using a lightweight summarisation model.

For example:

“User is building a GenAI chatbot for HR onboarding. They asked about caching and memory persistence.”

The summary is then cached for the session:

redis.set(f"summary:{session_id}", summary_text)

This summary becomes part of the context in subsequent turns, ensuring the model retains continuity without large token usage.
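One way to keep that summary fresh is to fold each new exchange into the previous one. A minimal sketch; `summarise` stands in for the lightweight summarisation model and is injected so the folding logic stays model-agnostic:

```python
def fold_summary(previous_summary: str, new_turns: list[str], summarise) -> str:
    """Merge the latest turns into the running summary.

    `summarise` is any callable mapping text to shorter text, e.g. a
    request to the lightweight summarisation model.
    """
    parts = [previous_summary] if previous_summary else []
    parts.extend(new_turns)
    return summarise("\n".join(parts))
```

The result is then written back with `redis.set(f"summary:{session_id}", summary_text)` exactly as above.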


Step 4: Capturing User Entities and Context

We extract important entities and facts mentioned by the user, such as company names, projects, frameworks, or objectives.

Example:

Entity Type       Example
Full Name         Abhishek Dwivedi
Mobile No         9999999999
Current City      New Delhi

Entity extraction is handled through LLM-based extraction, traditional NER pipelines, or a combination of the two.
The extracted entities are cached in Redis:

redis.hset(f"entities:{session_id}", mapping=entities_dict)

During subsequent messages, these entities are merged into the context so the AI can maintain relevance.
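The merge itself is a shallow overwrite: newly extracted values win, while older facts that were not re-mentioned are preserved. A minimal sketch:

```python
def merge_entities(cached: dict, extracted: dict) -> dict:
    """Combine previously cached entities with freshly extracted ones.

    Newly extracted values overwrite stale ones (e.g. the user moved
    city); keys only present in the cache are kept.
    """
    merged = dict(cached)
    merged.update(extracted)
    return merged
```

On each turn, the cached hash can be read back with `redis.hgetall(f"entities:{session_id}")`, merged with that turn's extractions, and written again with `hset`.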


Step 5: Combining All Layers — The Context Builder

Before sending a query to the model, the Context Builder assembles the final prompt dynamically:

final_prompt = {
    "system": get_cached_system_prompt(),
    "summary": get_conversation_summary(),
    "entities": get_cached_entities(),
    "user_input": latest_message
}

This prompt is sent to the model gateway (vLLM, OpenAI API, or NVIDIA NIM endpoint).
After receiving the model output, any new information or entities are extracted and updated back into Redis.
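The assembly step can be made concrete. A minimal sketch that flattens the cached layers into an OpenAI-style messages list — the message shape here is an assumption; adapt it to whatever your gateway expects:

```python
def build_messages(system_prompt: str, summary: str,
                   entities: dict, user_input: str) -> list[dict]:
    """Turn the cached layers into a chat payload.

    The summary and entities travel as a second system message, so
    the user's own text reaches the model untouched.
    """
    facts = "; ".join(f"{k}: {v}" for k, v in entities.items())
    context = f"Conversation so far: {summary}\nKnown user facts: {facts}"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "system", "content": context},
        {"role": "user", "content": user_input},
    ]
```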

This creates a continuous loop:
Cache → Retrieve → Generate → Update → Cache again.


Flow of Context Caching

  1. User sends a message.
  2. Context Manager retrieves cached system prompt, summary, and entities.
  3. The Context Builder constructs the complete prompt.
  4. The prompt is sent to the model gateway for inference.
  5. Response is processed, and new summaries or entities are cached.
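The five steps above can be sketched as a single turn-handling function. Everything external (cache, model gateway, extractor) is injected, so this is a shape sketch rather than a drop-in implementation:

```python
def handle_turn(session_id: str, user_input: str, cache, llm, extract) -> str:
    """One pass through the Cache -> Retrieve -> Generate -> Update loop.

    cache:   any get/set key-value store (e.g. a Redis client)
    llm:     callable taking the assembled prompt dict, returning reply text
    extract: callable producing the updated summary text (a stand-in for
             the real summariser/entity extractor)
    """
    # Steps 1-3: retrieve the cached summary and assemble the prompt.
    summary = cache.get(f"summary:{session_id}") or ""
    prompt = {"summary": summary, "user_input": user_input}

    # Step 4: inference via the model gateway.
    reply = llm(prompt)

    # Step 5: fold new information back into the cache for the next turn.
    cache.set(f"summary:{session_id}", extract(summary, user_input, reply))
    return reply
```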

Business and Technical Benefits

Aspect            Benefit
Speed             Reduces redundant prompt construction and retrieval time
Cost              Minimises token usage per model call
Personalisation   Retains user context seamlessly
Scalability       Works efficiently across multi-tenant deployments
Maintainability   Simple caching architecture, easy to extend

What’s Next

We are currently exploring:

  • Persistent long-term memory beyond Redis using vector stores
  • Adaptive summarisation that adjusts based on conversation depth
  • Context expiry and recency scoring for dynamic memory management

Context caching is a foundation for building intelligent and scalable AI systems. It bridges the gap between stateless LLMs and stateful user experiences.


Conclusion

In our platform, context caching transformed how users interact with GenAI systems. It enabled faster, more coherent, and more personal responses without increasing model complexity or cost.

Applied thoughtfully, context caching turns generative models into context-aware assistants — capable of remembering, reasoning, and adapting to each user’s needs.