# Hybrid Summarization
Maintain conversation coherence across 128K+ token contexts with intelligent summarization.
## Overview
Hybrid Summarization automatically compresses long conversation histories while preserving:
- Key facts and entities
- Code blocks and technical details
- Conversation thread structure
- Context for ongoing tasks
## How It Works

```mermaid
graph LR
    A[Long Conversation] --> B{Token Threshold}
    B -->|"< 50K tokens"| C[No Compression]
    B -->|"> 50K tokens"| D[Summarize Oldest]
    D --> E[Preserve Code/Entities]
    E --> F[Maintain Thread Structure]
    F --> G[Inject Summary]
    G --> H[Fit Within 128K]
```
## Automatic Triggers
Summarization triggers at these thresholds:
| Context Size | Action |
|---|---|
| < 50K tokens | No compression |
| 50K - 80K tokens | Light compression |
| 80K - 100K tokens | Moderate compression |
| > 100K tokens | Aggressive compression |
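In other words, tier selection is a plain threshold lookup. The sketch below restates the table in code; `CompressionLevel` and `select_level` are illustrative names, not part of the Korad API:

```python
from enum import Enum

class CompressionLevel(Enum):
    NONE = "none"
    LIGHT = "light"
    MODERATE = "moderate"
    AGGRESSIVE = "aggressive"

def select_level(context_tokens: int) -> CompressionLevel:
    """Map a context size in tokens to a compression tier (hypothetical helper)."""
    if context_tokens < 50_000:
        return CompressionLevel.NONE
    if context_tokens < 80_000:
        return CompressionLevel.LIGHT
    if context_tokens < 100_000:
        return CompressionLevel.MODERATE
    return CompressionLevel.AGGRESSIVE
```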
## What Gets Preserved

### 1. Code Blocks

All code is preserved exactly:

```python
# This code is never summarized
def binary_search(arr, target):
    left, right = 0, len(arr) - 1
    while left <= right:
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        # ... rest of function
```
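One way to guarantee this property is to carve fenced code blocks out of the transcript before any summarization runs, then splice them back afterward. A minimal sketch, assuming markdown-style fences; the regex and function name are ours, not Korad's:

```python
import re

# Matches fenced code blocks, including an optional language tag.
CODE_FENCE = re.compile(r"`{3}[\w-]*\n.*?`{3}", re.DOTALL)

def split_code_blocks(text: str) -> tuple[str, list[str]]:
    """Swap each fenced block for an indexed placeholder; return blocks verbatim."""
    blocks: list[str] = []

    def stash(match: re.Match) -> str:
        blocks.append(match.group(0))
        return f"[CODE_BLOCK_{len(blocks) - 1}]"

    prose = CODE_FENCE.sub(stash, text)
    return prose, blocks
```

Only the placeholder-bearing prose reaches the summarizer, so the blocks cannot be altered.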
### 2. Key Entities

Names, dates, and technical terms are extracted:

```
User: John Smith
Project: Quantum Computing
Date: 2026-01-31
Framework: React + TypeScript
```
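Production entity extraction typically relies on a trained NER model; the toy heuristic below only pulls ISO dates and labeled fields of the shape shown above, and every name in it is illustrative:

```python
import re

DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
LABELED = re.compile(r"^(User|Project|Date|Framework):\s*(.+)$", re.MULTILINE)

def extract_entities(text: str) -> dict[str, list[str]]:
    """Collect ISO dates and labeled fields from a transcript (toy heuristic)."""
    entities: dict[str, list[str]] = {"dates": DATE.findall(text)}
    for label, value in LABELED.findall(text):
        entities.setdefault(label.lower(), []).append(value.strip())
    return entities
```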
### 3. Thread Structure

Conversation flow is maintained:

```
Thread 1: API authentication setup (messages 1-15)
  → Summary: User implemented JWT auth, middleware created
Thread 2: Database schema design (messages 16-30)
  → Summary: PostgreSQL schema with users, transactions tables
Thread 3: Stripe integration (messages 31-45) ← Active
  → Full context preserved
```
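A plausible data model for this layout is a list of thread records in which older threads carry only a summary while the active thread keeps its full messages. The class below is a hypothetical shape, not a documented Korad type:

```python
from dataclasses import dataclass, field

@dataclass
class Thread:
    title: str
    message_range: tuple[int, int]      # first and last message index
    summary: str | None = None          # set once the thread is compressed
    messages: list[str] = field(default_factory=list)  # kept only while active

    @property
    def is_active(self) -> bool:
        return self.summary is None
```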
## Example Compression

### Before (15K tokens)

```
[100+ messages about implementing a feature]

User: Can you help me implement OAuth?
Assistant: Sure, let's start with the providers...
[50 messages of implementation details]
User: How do I handle refresh tokens?
Assistant: Here's the refresh token logic...
[50 more messages]
User: What about error handling?
```
### After (3K tokens)

```
Thread: OAuth Implementation (messages 1-75)
Summary: Implemented OAuth with Google and GitHub providers.
  Created AuthController with login/logout/refresh endpoints.
  JWT middleware for protected routes. Refresh token rotation enabled.
  Code preserved: auth_controller.py (450 lines), middleware.py (120 lines)

[Recent messages fully preserved]
User: What about error handling?
Assistant: For error handling, you should...
```
## Quality Metrics
| Metric | Target | Actual |
|---|---|---|
| Factual retention | > 95% | 98% |
| Code preservation | 100% | 100% |
| Entity retention | > 90% | 94% |
| Conversation coherence | > 90% | 93% |
## Configuration

### Via Dashboard

1. Go to korad.ai/dashboard
2. Open Settings → Conversation Management
3. Adjust the summarization threshold
### Via API (Coming Soon)

```python
client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[...],
    korad_settings={
        "summarization": {
            "threshold": 80000,       # tokens before compression
            "preserve_code": True,
            "preserve_entities": True
        }
    }
)
```
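The `threshold` value lines up with the trigger table above: lowering it starts compression earlier, trading retained detail for headroom. Until this API ships, treat the field names as provisional.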
## Best Practices

### 1. For Coding Projects

Code is automatically preserved, so:

```
# Safe: Long coding conversations
# All code blocks stay intact
# Only prose explanations are summarized
```
### 2. For Multi-Turn Tasks

Related messages are grouped:

```
# Thread 1: Implement feature A
# Thread 2: Debug feature B
# Thread 3: Add tests ← Active
# Only Threads 1-2 get summarized
```
### 3. For Document Analysis

Reference documents are preserved:

```
# Upload: 100-page technical spec
# Summarized: Key requirements extracted
# Preserved: Full spec available for reference
```
## Monitoring

Check summarization activity:

```python
response = client.messages.create(...)

# Check whether the context was compressed
if hasattr(response, 'korad_context'):
    print(f"Original tokens: {response.korad_context.original_tokens}")
    print(f"Compressed tokens: {response.korad_context.compressed_tokens}")
    print(f"Compression ratio: {response.korad_context.compression_ratio}")
```
## Technical Details

### Algorithm

1. **Thread Detection**: group related messages
2. **Code Extraction**: preserve all code blocks
3. **Entity Recognition**: extract names, dates, and technical terms
4. **Abstractive Summary**: AI-powered summarization
5. **Structure Preservation**: maintain the thread hierarchy
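Read end to end, the five passes form one pipeline. The toy below compresses the oldest messages while keeping their code blocks and the recent tail intact; thread detection, entity recognition, and the abstractive pass are collapsed into a placeholder digest, and every name here is ours, not Korad's:

```python
import re

FENCE = re.compile(r"`{3}[\w-]*\n.*?`{3}", re.DOTALL)

def toy_summarize(messages: list[str], keep_recent: int = 10,
                  threshold_tokens: int = 80_000) -> list[str]:
    """End-to-end toy: below the threshold nothing changes; above it, older
    messages collapse into a digest while code blocks survive verbatim."""
    total = sum(len(m.split()) for m in messages)   # crude token proxy
    if total < threshold_tokens:
        return messages                             # no compression

    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    code_blocks = [block for m in old for block in FENCE.findall(m)]
    digest = (f"[Summary of {len(old)} earlier messages; "
              f"{len(code_blocks)} code blocks preserved below]")
    return [digest, *code_blocks, *recent]
```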
### Performance

- **Latency**: < 100ms per compression pass
- **Throughput**: 100K tokens/second
- **Memory**: O(n), where n = context size