# Hybrid Summarization
Maintain conversation coherence across 128K+ token contexts with intelligent summarization.
## Overview
Hybrid Summarization automatically compresses long conversation histories while preserving:
- Key facts and entities
- Code blocks and technical details
- Conversation thread structure
- Context for ongoing tasks
## How It Works

```mermaid
graph LR
    A[Long Conversation] --> B{Token Threshold}
    B -->|"< 50K tokens"| C[No Compression]
    B -->|"> 50K tokens"| D[Summarize Oldest]
    D --> E[Preserve Code/Entities]
    E --> F[Maintain Thread Structure]
    F --> G[Inject Summary]
    G --> H[Fit Within 128K]
```
## Automatic Triggers
Summarization triggers at these thresholds:
| Context Size | Action |
|---|---|
| < 50K tokens | No compression |
| 50K - 80K tokens | Light compression |
| 80K - 100K tokens | Moderate compression |
| > 100K tokens | Aggressive compression |
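In other words, tier selection is a plain threshold lookup. The sketch below restates the table in code; `CompressionLevel` and `select_level` are illustrative names, not part of the Korad API:

```python
from enum import Enum

class CompressionLevel(Enum):
    NONE = "none"
    LIGHT = "light"
    MODERATE = "moderate"
    AGGRESSIVE = "aggressive"

def select_level(context_tokens: int) -> CompressionLevel:
    """Map a context size in tokens to a compression tier (hypothetical helper)."""
    if context_tokens < 50_000:
        return CompressionLevel.NONE
    if context_tokens < 80_000:
        return CompressionLevel.LIGHT
    if context_tokens < 100_000:
        return CompressionLevel.MODERATE
    return CompressionLevel.AGGRESSIVE
```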
## What Gets Preserved

### 1. Code Blocks

All code is preserved exactly:

```python
# This code is never summarized
def binary_search(arr, target):
    left, right = 0, len(arr) - 1
    while left <= right:
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        # ... rest of function
```
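One way to guarantee this property is to carve fenced code blocks out of the transcript before any summarization runs, then splice them back afterward. A minimal sketch, assuming markdown-style fences; the regex and function name are ours, not Korad's:

```python
import re

# Matches fenced code blocks, including an optional language tag.
CODE_FENCE = re.compile(r"`{3}[\w-]*\n.*?`{3}", re.DOTALL)

def split_code_blocks(text: str) -> tuple[str, list[str]]:
    """Swap each fenced block for an indexed placeholder; return blocks verbatim."""
    blocks: list[str] = []

    def stash(match: re.Match) -> str:
        blocks.append(match.group(0))
        return f"[CODE_BLOCK_{len(blocks) - 1}]"

    prose = CODE_FENCE.sub(stash, text)
    return prose, blocks
```

Only the placeholder-bearing prose reaches the summarizer, so the blocks cannot be altered.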
### 2. Key Entities

Names, dates, and technical terms are extracted:

```
User: John Smith
Project: Quantum Computing
Date: 2026-01-31
Framework: React + TypeScript
```
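Production entity extraction typically relies on a trained NER model; the toy heuristic below only pulls ISO dates and labeled fields of the shape shown above, and every name in it is illustrative:

```python
import re

DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
LABELED = re.compile(r"^(User|Project|Date|Framework):\s*(.+)$", re.MULTILINE)

def extract_entities(text: str) -> dict[str, list[str]]:
    """Collect ISO dates and labeled fields from a transcript (toy heuristic)."""
    entities: dict[str, list[str]] = {"dates": DATE.findall(text)}
    for label, value in LABELED.findall(text):
        entities.setdefault(label.lower(), []).append(value.strip())
    return entities
```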
### 3. Thread Structure

Conversation flow is maintained:

```
Thread 1: API authentication setup (messages 1-15)
  → Summary: User implemented JWT auth, middleware created
Thread 2: Database schema design (messages 16-30)
  → Summary: PostgreSQL schema with users, transactions tables
Thread 3: Stripe integration (messages 31-45) ← Active
  → Full context preserved
```
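A plausible data model for this layout is a list of thread records in which older threads carry only a summary while the active thread keeps its full messages. The class below is a hypothetical shape, not a documented Korad type:

```python
from dataclasses import dataclass, field

@dataclass
class Thread:
    title: str
    message_range: tuple[int, int]      # first and last message index
    summary: str | None = None          # set once the thread is compressed
    messages: list[str] = field(default_factory=list)  # kept only while active

    @property
    def is_active(self) -> bool:
        return self.summary is None
```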
## Example Compression

### Before (15K tokens)

```
[100+ messages about implementing a feature]

User: Can you help me implement OAuth?
Assistant: Sure, let's start with the providers...
[50 messages of implementation details]
User: How do I handle refresh tokens?
Assistant: Here's the refresh token logic...
[50 more messages]
User: What about error handling?
```
### After (3K tokens)

```
Thread: OAuth Implementation (messages 1-75)
Summary: Implemented OAuth with Google and GitHub providers.
  Created AuthController with login/logout/refresh endpoints.
  JWT middleware for protected routes. Refresh token rotation enabled.
  Code preserved: auth_controller.py (450 lines), middleware.py (120 lines)

[Recent messages fully preserved]
User: What about error handling?
Assistant: For error handling, you should...
```
## Quality Metrics
| Metric | Target | Actual |
|---|---|---|
| Factual retention | > 95% | 98% |
| Code preservation | 100% | 100% |
| Entity retention | > 90% | 94% |
| Conversation coherence | > 90% | 93% |
## Configuration

### Via Dashboard

1. Go to korad.ai/dashboard
2. Open Settings → Conversation Management
3. Adjust the summarization threshold
### Via API (Coming Soon)

```python
client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[...],
    korad_settings={
        "summarization": {
            "threshold": 80000,       # tokens before compression
            "preserve_code": True,
            "preserve_entities": True
        }
    }
)
```
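The `threshold` value lines up with the trigger table above: lowering it starts compression earlier, trading retained detail for headroom. Until this API ships, treat the field names as provisional.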
## Best Practices

### 1. For Coding Projects

Code is automatically preserved, so:

```
# Safe: Long coding conversations
# All code blocks stay intact
# Only prose explanations are summarized
```
### 2. For Multi-Turn Tasks

Related messages are grouped:

```
# Thread 1: Implement feature A
# Thread 2: Debug feature B
# Thread 3: Add tests ← Active
# Only Threads 1-2 get summarized
```
### 3. For Document Analysis

Reference documents are preserved:

```
# Upload: 100-page technical spec
# Summarized: Key requirements extracted
# Preserved: Full spec available for reference
```
## Monitoring

Check summarization activity:

```python
response = client.messages.create(...)

# Check whether the context was compressed
if hasattr(response, 'korad_context'):
    print(f"Original tokens: {response.korad_context.original_tokens}")
    print(f"Compressed tokens: {response.korad_context.compressed_tokens}")
    print(f"Compression ratio: {response.korad_context.compression_ratio}")
```
## Technical Details

### Algorithm

1. **Thread Detection**: group related messages
2. **Code Extraction**: preserve all code blocks
3. **Entity Recognition**: extract names, dates, and technical terms
4. **Abstractive Summary**: AI-powered summarization
5. **Structure Preservation**: maintain the thread hierarchy
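Read end to end, the five passes form one pipeline. The toy below compresses the oldest messages while keeping their code blocks and the recent tail intact; thread detection, entity recognition, and the abstractive pass are collapsed into a placeholder digest, and every name here is ours, not Korad's:

```python
import re

FENCE = re.compile(r"`{3}[\w-]*\n.*?`{3}", re.DOTALL)

def toy_summarize(messages: list[str], keep_recent: int = 10,
                  threshold_tokens: int = 80_000) -> list[str]:
    """End-to-end toy: below the threshold nothing changes; above it, older
    messages collapse into a digest while code blocks survive verbatim."""
    total = sum(len(m.split()) for m in messages)   # crude token proxy
    if total < threshold_tokens:
        return messages                             # no compression

    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    code_blocks = [block for m in old for block in FENCE.findall(m)]
    digest = (f"[Summary of {len(old)} earlier messages; "
              f"{len(code_blocks)} code blocks preserved below]")
    return [digest, *code_blocks, *recent]
```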
### Performance

- **Latency**: < 100ms per compression pass
- **Throughput**: 100K tokens/second
- **Memory**: O(n), where n = context size