How We Approached Context Caching in Our GenAI Platform

In any GenAI platform, maintaining context continuity across user sessions is a key challenge. Without memory, the AI feels disconnected — it forgets who the user is, what was said earlier, and the overall conversation purpose. That’s where context caching becomes essential.

At our platform, we designed a flexible Context Caching Layer that keeps conversations intelligent, lightweight, and cost-efficient — without overloading the model with long histories.


What Is Context Caching?

When users interact with a conversational or task-based GenAI system, the platform must remember key details such as:

  • Who the user is
  • What topic they are discussing
  • What entities or preferences were mentioned earlier
  • What system instructions or roles were set

Instead of sending the entire conversation to the LLM on each request, we cache key summaries, extracted entities, and structured context. This approach improves response time, reduces cost, and ensures continuity in user experience.


Our Architecture Overview

At a high level, the context caching system is built around a Context Manager, the central component that manages:

  • System prompt caching
  • Conversation summarisation
  • Entity extraction and storage
  • Context assembly before each model call

Redis is used as the fast-access store for all cached information.


Step 1: Structured System Prompting

Every conversation starts with a system prompt — a structured instruction block that defines the AI’s behaviour, tone, and objective.

We designed this prompt in a JSON or YAML format so it can be versioned and cached.

Example:

system_prompt:
  role: "Conversational AI Assistant"
  tone: "Professional yet friendly"
  goal: "Help users complete GenAI tasks efficiently"
  memory_instructions: "Use cached entities from Redis"

This structured approach allows dynamic updates and better control across environments and user types.


Step 2: System Prompt Configuration and Caching

System prompts are stored and retrieved from Redis using a consistent key structure:

redis.set(f"system_prompt:{workspace_id}", json.dumps(prompt_data))

When a user message arrives, the system retrieves the cached prompt:

prompt = redis.get(f"system_prompt:{workspace_id}")

This ensures:

  • Quick retrieval and lower database load
  • Version control across environments
  • Personalisation for users or teams

Step 3: Capturing Conversation Summary

To avoid passing the full conversation each time, the system generates summaries using a lightweight summarisation model.

For example:

“User is building a GenAI chatbot for HR onboarding. They asked about caching and memory persistence.”

The summary is then cached for the session:

redis.set(f"summary:{session_id}", summary_text)

This summary becomes part of the context in subsequent turns, ensuring the model retains continuity without large token usage.
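One way to keep that summary fresh is to fold each new exchange into the previous one. A minimal sketch; `summarise` stands in for the lightweight summarisation model and is injected so the folding logic stays model-agnostic:

```python
def fold_summary(previous_summary: str, new_turns: list[str], summarise) -> str:
    """Merge the latest turns into the running summary.

    `summarise` is any callable mapping text to shorter text, e.g. a
    request to the lightweight summarisation model.
    """
    parts = [previous_summary] if previous_summary else []
    parts.extend(new_turns)
    return summarise("\n".join(parts))
```

The result is then written back with `redis.set(f"summary:{session_id}", summary_text)` exactly as above.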


Step 4: Capturing User Entities and Context

We extract important entities and facts mentioned by the user, such as company names, projects, frameworks, or objectives.

Example:

Entity Type       Example
Full Name         Abhishek Dwivedi
Mobile No         9999999999
Current City      New Delhi

Entity extraction is handled through LLM-based extraction, traditional NER pipelines, or a combination of the two.
The extracted entities are cached in Redis:

redis.hset(f"entities:{session_id}", mapping=entities_dict)

During subsequent messages, these entities are merged into the context so the AI can maintain relevance.
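The merge itself is a shallow overwrite: newly extracted values win, while older facts that were not re-mentioned are preserved. A minimal sketch:

```python
def merge_entities(cached: dict, extracted: dict) -> dict:
    """Combine previously cached entities with freshly extracted ones.

    Newly extracted values overwrite stale ones (e.g. the user moved
    city); keys only present in the cache are kept.
    """
    merged = dict(cached)
    merged.update(extracted)
    return merged
```

On each turn, the cached hash can be read back with `redis.hgetall(f"entities:{session_id}")`, merged with that turn's extractions, and written again with `hset`.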


Step 5: Combining All Layers — The Context Builder

Before sending a query to the model, the Context Builder assembles the final prompt dynamically:

final_prompt = {
    "system": get_cached_system_prompt(),
    "summary": get_conversation_summary(),
    "entities": get_cached_entities(),
    "user_input": latest_message
}

This prompt is sent to the model gateway (vLLM, OpenAI API, or NVIDIA NIM endpoint).
After receiving the model output, any new information or entities are extracted and updated back into Redis.
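The assembly step can be made concrete. A minimal sketch that flattens the cached layers into an OpenAI-style messages list — the message shape here is an assumption; adapt it to whatever your gateway expects:

```python
def build_messages(system_prompt: str, summary: str,
                   entities: dict, user_input: str) -> list[dict]:
    """Turn the cached layers into a chat payload.

    The summary and entities travel as a second system message, so
    the user's own text reaches the model untouched.
    """
    facts = "; ".join(f"{k}: {v}" for k, v in entities.items())
    context = f"Conversation so far: {summary}\nKnown user facts: {facts}"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "system", "content": context},
        {"role": "user", "content": user_input},
    ]
```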

This creates a continuous loop:
Cache → Retrieve → Generate → Update → Cache again.


Flow of Context Caching

  1. User sends a message.
  2. Context Manager retrieves cached system prompt, summary, and entities.
  3. The Context Builder constructs the complete prompt.
  4. The prompt is sent to the model gateway for inference.
  5. Response is processed, and new summaries or entities are cached.
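The five steps above can be sketched as a single turn-handling function. Everything external (cache, model gateway, extractor) is injected, so this is a shape sketch rather than a drop-in implementation:

```python
def handle_turn(session_id: str, user_input: str, cache, llm, extract) -> str:
    """One pass through the Cache -> Retrieve -> Generate -> Update loop.

    cache:   any get/set key-value store (e.g. a Redis client)
    llm:     callable taking the assembled prompt dict, returning reply text
    extract: callable producing the updated summary text (a stand-in for
             the real summariser/entity extractor)
    """
    # Steps 1-3: retrieve the cached summary and assemble the prompt.
    summary = cache.get(f"summary:{session_id}") or ""
    prompt = {"summary": summary, "user_input": user_input}

    # Step 4: inference via the model gateway.
    reply = llm(prompt)

    # Step 5: fold new information back into the cache for the next turn.
    cache.set(f"summary:{session_id}", extract(summary, user_input, reply))
    return reply
```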

Business and Technical Benefits

Aspect            Benefit
Speed             Reduces redundant prompt construction and retrieval time
Cost              Minimises token usage per model call
Personalisation   Retains user context seamlessly
Scalability       Works efficiently across multi-tenant deployments
Maintainability   Simple caching architecture, easy to extend

What’s Next

We are currently exploring:

  • Persistent long-term memory beyond Redis using vector stores
  • Adaptive summarisation that adjusts based on conversation depth
  • Context expiry and recency scoring for dynamic memory management

Context caching is a foundation for building intelligent and scalable AI systems. It bridges the gap between stateless LLMs and stateful user experiences.


Conclusion

In our platform, context caching transformed how users interact with GenAI systems. It enabled faster, more coherent, and more personal responses without increasing model complexity or cost.

Applied thoughtfully, context caching turns generative models into context-aware assistants — capable of remembering, reasoning, and adapting to each user’s needs.