CLM Tokenization
Overview
CLM uses three different compression systems, each optimized for its specific content type. These systems are NOT interchangeable and do NOT share token vocabularies.
Core principle: Compress meaning, not characters.
Automatic Intent Detection
CLM features IntentDetectorV2, a detection system that automatically determines the correct REQ (request/action) token from natural language input. The detector analyzes:
- Signals - vocabulary-based phrase matching
- Artifacts - structural patterns in the text
- Epistemic grounding - context distinguishing prediction from generation
- SPEC detection - domain-specific output types
This means you can write natural language prompts and CLM will automatically compress them into the optimal token format.
Three Independent Systems
| Encoder | System | Structure | Compression |
|---|---|---|---|
| System Prompt | 6-Token Hierarchy | Hierarchical instruction flow | 65-90% |
| Transcript | v2 Semantic Blocks | Sequential semantic contract | 85-92% |
| Structured Data | Header + Row Format | Tabular schema + data | 70-85% |
Why Three Different Systems?
Each content type has fundamentally different characteristics:
System Prompts:
- Complex, nested instructions
- Hierarchical relationships (action → target → fields → output)
- Require logical flow preservation
- Solution: 6-token hierarchy (REQ, TARGET, EXTRACT, CTX, OUT, REF)
Transcripts:
- Sequential conversations
- Temporal flow (metadata → intent → actions → resolution → sentiment)
- Explicit semantic contract separating intent, actions, and state
- Solution: v2 semantic blocks (INTERACTION, DOMAIN, CUSTOMER_INTENT, AGENT_ACTIONS, STATE, SENTIMENT)
Structured Data:
- Tabular information
- Schema + records
- Repeated field structure
- Solution: Header + row format (not semantic tokens)
Part 1: System Prompt Tokenization
Purpose
Compress system instructions while preserving:
- What to do (actions/operations)
- What to operate on (data sources)
- What to extract (specific fields)
- How to format output (structure)
The 6-Token Hierarchy
┌─────────────────────────────────────────┐
│ 1. REQ - What to do │ ← Actions
│ [REQ:ANALYZE,EXTRACT] │
└─────────────────────────────────────────┘
↓
┌─────────────────────────────────────────┐
│ 2. TARGET - What to operate on │ ← Data source
│ [TARGET:TRANSCRIPT:DOMAIN=SUPPORT] │
└─────────────────────────────────────────┘
↓
┌─────────────────────────────────────────┐
│ 3. EXTRACT - What fields to get │ ← Specific data
│ [EXTRACT:SENTIMENT,URGENCY] │
└─────────────────────────────────────────┘
↓
┌─────────────────────────────────────────┐
│ 4. CTX - Additional context │ ← Metadata
│ [CTX:LANGUAGE=EN] │
└─────────────────────────────────────────┘
↓
┌─────────────────────────────────────────┐
│ 5. OUT - How to format │ ← Output spec
│ [OUT_JSON:{summary,score}] │
└─────────────────────────────────────────┘
↓
┌─────────────────────────────────────────┐
│ 6. REF - Identifiers │ ← References
│ [REF:TICKET=TKT-123] │
└─────────────────────────────────────────┘
Token Categories
| Token | Purpose | Required | Examples |
|---|---|---|---|
| REQ | Actions/Operations | ✅ Always | [REQ:ANALYZE], [REQ:EXTRACT,SUMMARIZE] |
| TARGET | Objects/Data Sources | ✅ Always | [TARGET:TRANSCRIPT], [TARGET:DOCUMENT:TYPE=INVOICE] |
| EXTRACT | Fields to Extract | ⚠️ When extracting | [EXTRACT:SENTIMENT,COMPLIANCE] |
| CTX | Context/Conditions | ⚠️ When applicable | [CTX:TONE=PROFESSIONAL] |
| OUT | Output Format | ⚠️ When specified | [OUT:JSON], [OUT_JSON:{fields}] |
| REF | References/IDs | ⚠️ When present | [REF:CASE=12345] |
Syntax
Basic structure:
[CATEGORY:VALUE]
[CATEGORY:VALUE:ATTRIBUTE=VALUE]
[CATEGORY:VALUE1,VALUE2,VALUE3]
Examples:
Simple: [REQ:ANALYZE]
Attribute: [TARGET:TRANSCRIPT:DOMAIN=SUPPORT]
Multiple: [REQ:ANALYZE,EXTRACT,SUMMARIZE]
Complex: [EXTRACT:SENTIMENT,URGENCY:TYPE=LIST,DOMAIN=LEGAL]
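The bracket syntax above is regular enough to parse mechanically. A minimal sketch (not the actual CLM parser; `parse_token` is a hypothetical name):

```python
import re

# Hypothetical sketch of parsing the [CATEGORY:VALUE:ATTR=VAL] syntax shown above;
# the real CLM parser may differ.
TOKEN_RE = re.compile(r"\[([A-Z_]+):([^\]]+)\]")

def parse_token(token: str):
    """Split a CL token into category, values, and attributes."""
    m = TOKEN_RE.fullmatch(token)
    if not m:
        raise ValueError(f"not a CL token: {token}")
    category, rest = m.group(1), m.group(2)
    values, attrs = [], {}
    for part in rest.split(":"):
        if "=" in part:
            # a segment like TYPE=LIST,DOMAIN=LEGAL carries attributes
            for pair in part.split(","):
                key, _, val = pair.partition("=")
                attrs[key] = val
        else:
            values.extend(part.split(","))
    return category, values, attrs

print(parse_token("[TARGET:TRANSCRIPT:DOMAIN=SUPPORT]"))
# → ('TARGET', ['TRANSCRIPT'], {'DOMAIN': 'SUPPORT'})
```

The same sketch handles the multiple-value and complex forms, e.g. `[EXTRACT:SENTIMENT,URGENCY:TYPE=LIST,DOMAIN=LEGAL]`.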
REQ Token (Request/Action)
The REQ token represents the primary action or operation to be performed. It is the most critical token in the system prompt hierarchy as it determines the fundamental task type.
REQ Categorization Taxonomy
REQ tokens are organized into six fundamental categories based on their purpose and what they produce:
A. ANALYSIS / EVALUATION
Purpose: Interpret, assess, or explain something. Produces: Insight, judgment, reasoning.
Examples:
- Code review
- Risk assessment
- Performance evaluation
- Root cause analysis
Canonical REQs: ANALYZE, EVALUATE, ASSESS, DIAGNOSE
⚠️ Use sparingly — this is the most generic bucket.
B. GENERATION / CREATION
Purpose: Create a new artifact. Produces: Text, data, structure, plans, content.
Examples:
- Write a report
- Generate a schema
- Produce odds
- Draft an email
Canonical REQs: GENERATE_REPORT, GENERATE_SCHEMA, GENERATE_BETTING_ODDS, GENERATE_SUMMARY
ℹ️ This is where most business prompts land.
C. PREDICTION / FORECASTING
Purpose: Estimate future outcomes or probabilities. Produces: Probabilities, forecasts, predictions.
Examples:
- Match outcome probabilities
- Sales forecast
- Risk likelihood
Canonical REQs: PREDICT_OUTCOME, FORECAST_METRIC, ESTIMATE_PROBABILITY
⚠️ Often overlaps with GENERATION — choose the dominant intent.
For betting:
- Outcome = odds → GENERATION wins
- Time series forecast → PREDICTION wins
D. DECISION / RECOMMENDATION
Purpose: Choose or advise among options. Produces: A decision, ranking, or recommendation.
Examples:
- Best investment option
- Recommended action
- Prioritized list
Canonical REQs: RECOMMEND_ACTION, SELECT_OPTION, RANK_ALTERNATIVES
E. EXTRACTION / TRANSFORMATION
Purpose: Convert or extract from existing input. Produces: Structured data from unstructured input.
Examples:
- Extract entities
- Parse schema
- Normalize text
Canonical REQs: EXTRACT_FIELDS, TRANSFORM_SCHEMA, NORMALIZE_TEXT
ℹ️ This is often machine-facing, not user-facing.
F. VALIDATION / VERIFICATION
Purpose: Check correctness, compliance, or consistency. Produces: Pass/fail, issues, validation results.
Examples:
- Policy compliance
- Schema validation
- Constraint checking
Canonical REQs: VALIDATE_OUTPUT, VERIFY_COMPLIANCE, CHECK_CONSISTENCY
REQ Token Values & Category Mapping
The following REQ tokens are available in CLM, organized by their category:
Category A: ANALYSIS / EVALUATION
- ANALYZE - Examine and evaluate content
- CLASSIFY - Categorize items (now mapped via ANALYZE)
- DEBUG - Find and fix issues
Category B: GENERATION / CREATION
- GENERATE - Create new content (reports, summaries, structured data)
- SUMMARIZE - Condense information (now mapped via GENERATE)
Category C: PREDICTION / FORECASTING
- PREDICT - Make future projections based on uncertainty and real-world grounding
Category D: DECISION / RECOMMENDATION
- RECOMMEND - Provide recommendations (deprecated, use RANK)
- RANK - Order items by priority or preference
Category E: EXTRACTION / TRANSFORMATION
- EXTRACT - Pull out specific data or entities
- TRANSFORM - Convert format or restructure data
- FORMAT - Reformat without changing meaning
Category F: VALIDATION / VERIFICATION
- VALIDATE - Check correctness, compliance, or consistency
Utility / Other:
- SEARCH - Search or find information
- EXECUTE - Execute operations or commands
Examples:
[REQ:ANALYZE]
[REQ:EXTRACT]
[REQ:GENERATE:SPECS:REPORT]
[REQ:VALIDATE]
[REQ:PREDICT:SPECS:FORECAST]
How REQ Tokens are Automatically Detected
CLM uses IntentDetectorV2 to automatically determine the correct REQ token from natural language input. The detection system analyzes three key dimensions:
1. Signals (Vocabulary-Based Detection)
Signals are detected by matching phrases from the input text against a vocabulary dictionary:
| Signal | Trigger Words | Maps to REQ |
|---|---|---|
| ANALYSIS | analyze, assess, review, evaluate | ANALYZE |
| EXTRACTION | extract, pull, get, retrieve | EXTRACT |
| GENERATION | generate, create, produce, summarize, list | GENERATE |
| PREDICTION | predict, forecast, calculate, project | PREDICT |
| TRANSFORMATION | transform, convert, restructure | TRANSFORM |
| FORMATTING | format, reformat, style | FORMAT |
| VALIDATION | validate, verify, check, ensure | VALIDATE |
| RANKING | rank, order, prioritize, best | RANK |
| DEBUGGING | debug, troubleshoot, fix | DEBUG |
| SEARCH | search, find, lookup | SEARCH |
| EXECUTION | execute, run, perform | EXECUTE |
2. Artifacts (Pattern-Based Detection)
Artifacts are structural patterns detected in the text that indicate what type of output is expected:
| Artifact | Detection Pattern | Indicates |
|---|---|---|
| STRUCTURED | JSON objects: {...} | Structured data output |
| PROBABILITY | Keywords: probability, odds, chance, likelihood | Probabilistic output |
| LIST | Markdown lists: - item or * item | List output |
| VALIDATION | Keywords: validate, verify, check compliance, ensure | Validation task |
| DECISION | Keywords: recommend, best option, choose, decision | Decision/ranking task |
| TEXT | Keywords: report, analysis | Text-based output |
3. Epistemic Grounding (Context-Based Detection)
For probabilistic tasks, the system distinguishes between GENERATE and PREDICT based on epistemic grounding:
PREDICT is chosen when:
- Uncertainty indicators are present: "likely", "probably", "might", "could"
- AND either:
  - Future indicators: "will", "tomorrow", "next year", "forecast"
  - OR real-world indicators: "weather", "market", "election", "outcome"
GENERATE is chosen when:
- Probability artifacts exist but without epistemic grounding
- Example: "Generate probability distribution" (synthetic data)
Examples:
"What's the likelihood it will rain tomorrow?"
→ PREDICT (uncertainty + future + real-world)
"Generate a probability distribution for dice rolls"
→ GENERATE (probability but synthetic, no real-world grounding)
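The grounding rule can be sketched as a keyword check. The word lists below are abbreviated stand-ins for the vocabulary files, and the function names are hypothetical:

```python
# Hedged sketch of the epistemic-grounding rule described above; keyword sets
# are abbreviated stand-ins for the EPISTEMIC_KEYWORDS vocabulary, not the real contents.
UNCERTAINTY = {"likely", "probably", "might", "could", "probability", "chance", "likelihood"}
FUTURE = {"will", "tomorrow", "next", "forecast"}
REAL_WORLD = {"weather", "rain", "market", "election", "outcome"}

def has_epistemic_grounding(text: str) -> bool:
    words = set(text.lower().replace("?", "").split())
    uncertain = bool(words & UNCERTAINTY)
    grounded = bool(words & FUTURE) or bool(words & REAL_WORLD)
    # PREDICT requires uncertainty AND (future OR real-world) indicators
    return uncertain and grounded

def resolve_probabilistic_req(text: str) -> str:
    return "PREDICT" if has_epistemic_grounding(text) else "GENERATE"

print(resolve_probabilistic_req("What's the likelihood it will rain tomorrow?"))       # PREDICT
print(resolve_probabilistic_req("Generate a probability distribution for dice rolls"))  # GENERATE
```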
4. REQ Resolution Decision Tree
The system resolves the final REQ token using this priority order:
1. VALIDATE
└─ If: (Artifact.VALIDATION OR Signal.VALIDATION)
AND has_validation_target (STRUCTURED, TEXT, or DECISION artifacts)
2. EXTRACT
└─ If: Signal.EXTRACTION
AND NOT Artifact.PROBABILITY
3. TRANSFORM
└─ If: Signal.TRANSFORMATION
AND has_transform_target (STRUCTURED or TEXT artifacts)
4. FORMAT
└─ If: Signal.FORMATTING
5. PREDICT
└─ If: Artifact.PROBABILITY
AND epistemic_grounding (uncertainty + future/real-world)
6. GENERATE
└─ If: Artifact.PROBABILITY (without epistemic grounding)
OR Artifact.STRUCTURED
OR Artifact.TEXT
OR Artifact.LIST
7. RANK
└─ If: Signal.RANKING
OR Artifact.DECISION
8. DEBUG
└─ If: Signal.DEBUGGING
9. SEARCH
└─ If: Signal.SEARCH
10. EXECUTE
└─ If: Signal.EXECUTION
11. ANALYZE (default)
└─ If: None of the above
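The decision tree above maps directly to a priority-ordered chain of checks. A simplified sketch, assuming signals and artifacts arrive as sets of labels (function and argument names are illustrative, not the IntentDetectorV2 API):

```python
# Simplified sketch of the priority-ordered REQ resolution tree above.
# `signals` and `artifacts` are sets of detected labels; `grounded` is the
# result of the epistemic-grounding check.
def resolve_req(signals: set, artifacts: set, grounded: bool = False) -> str:
    validation_target = artifacts & {"STRUCTURED", "TEXT", "DECISION"}
    if ("VALIDATION" in artifacts or "VALIDATION" in signals) and validation_target:
        return "VALIDATE"
    if "EXTRACTION" in signals and "PROBABILITY" not in artifacts:
        return "EXTRACT"
    if "TRANSFORMATION" in signals and artifacts & {"STRUCTURED", "TEXT"}:
        return "TRANSFORM"
    if "FORMATTING" in signals:
        return "FORMAT"
    if "PROBABILITY" in artifacts:
        # epistemic grounding distinguishes PREDICT from GENERATE
        return "PREDICT" if grounded else "GENERATE"
    if artifacts & {"STRUCTURED", "TEXT", "LIST"}:
        return "GENERATE"
    if "RANKING" in signals or "DECISION" in artifacts:
        return "RANK"
    if "DEBUGGING" in signals:
        return "DEBUG"
    if "SEARCH" in signals:
        return "SEARCH"
    if "EXECUTION" in signals:
        return "EXECUTE"
    return "ANALYZE"  # default when no clear signal wins

print(resolve_req({"ANALYSIS", "VALIDATION"}, {"TEXT"}))  # VALIDATE
```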
5. SPEC Detection (Output Specialization)
In addition to REQ detection, IntentDetectorV2 detects a SPEC (specification), which refines what type of output is being generated, predicted, or extracted:
SPEC detection uses three methods (scored):
| Method | Score | Description |
|---|---|---|
| Explicit patterns | 3 points | Phrases like "generate a REPORT", "return a SUMMARY" |
| Artifact mapping | 2 points | Artifact.VALIDATION → VALIDATION_RESULT, Artifact.DECISION → RECOMMENDATION |
| Keyword matching | 1 point | Domain-specific keywords (see below) |
SPEC Ontology (domain artifacts):
- SUPPORT_RESPONSE - Customer support responses
- TROUBLESHOOTING_GUIDE - Step-by-step troubleshooting
- BETTING_ODDS - Betting or odds information
- PROBABILITY_DISTRIBUTION - Statistical distributions (excluded as non-domain)
- FORECAST - Future projections
- REPORT - Analysis reports
- SUMMARY - Condensed summaries
- RECOMMENDATION - Recommendations or decisions
- RANKING - Ordered lists
- JSON_OBJECT / JSON_SCHEMA - (excluded as format, not domain)
- FIELDS - Field extraction
- ENTITIES - Entity extraction
- VALIDATION_RESULT - Validation outcomes
SPEC keyword mappings:
BETTING_ODDS: ["odds", "betting", "bookmaker"]
FORECAST: ["forecast", "projection"]
SUMMARY: ["summary", "recap", "overview"]
REPORT: ["report", "analysis document"]
SUPPORT_RESPONSE: ["support", "ticket", "issue", "incident"]
TROUBLESHOOTING_GUIDE: ["troubleshoot", "troubleshooting", "steps"]
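The three scoring methods can be combined in a small scorer. A hedged sketch using the keyword mappings above; the 3/2/1 weights follow the scoring table, but the explicit-pattern regex and all implementation details are assumptions:

```python
import re

# Illustrative SPEC scorer for the three detection methods (3/2/1 points);
# keyword lists follow the mappings above, everything else is hypothetical.
SPEC_KEYWORDS = {
    "BETTING_ODDS": ["odds", "betting", "bookmaker"],
    "FORECAST": ["forecast", "projection"],
    "SUMMARY": ["summary", "recap", "overview"],
    "REPORT": ["report", "analysis document"],
    "SUPPORT_RESPONSE": ["support", "ticket", "incident"],
    "TROUBLESHOOTING_GUIDE": ["troubleshoot", "troubleshooting"],
}
ARTIFACT_SPECS = {"VALIDATION": "VALIDATION_RESULT", "DECISION": "RECOMMENDATION"}

def detect_spec(text: str, artifacts: set):
    text = text.lower()
    scores: dict = {}
    for spec, keywords in SPEC_KEYWORDS.items():
        # explicit pattern, e.g. "generate a [detailed] report" → 3 points
        if any(re.search(rf"(generate|return|create)\s+a\s+\w*\s*{kw}", text)
               for kw in keywords):
            scores[spec] = scores.get(spec, 0) + 3
        # plain keyword match → 1 point
        if any(kw in text for kw in keywords):
            scores[spec] = scores.get(spec, 0) + 1
    for artifact, spec in ARTIFACT_SPECS.items():
        if artifact in artifacts:  # artifact mapping → 2 points
            scores[spec] = scores.get(spec, 0) + 2
    return max(scores, key=scores.get) if scores else None
```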
How SPEC appears in tokens:
[REQ:GENERATE:SPECS:REPORT]
[REQ:PREDICT:SPECS:FORECAST]
[REQ:VALIDATE:SPECS:VALIDATION_RESULT]
Complete Detection Example
Input:
"Analyze this customer support transcript and generate a detailed report
with sentiment analysis. Check if the agent followed compliance guidelines."
Detection process:
- Signals detected:
  - "analyze" → Signal.ANALYSIS
  - "generate" → Signal.GENERATION
- Artifacts detected:
  - "report" → Artifact.TEXT
  - "check" / "compliance" → Artifact.VALIDATION
- REQ resolution:
  - Has validation signal + has validation target (TEXT)
  - Result: REQ.VALIDATE (validation takes priority)
- SPEC detection:
  - "generate a...report" → explicit pattern (3 points) → "REPORT"
  - "compliance" keywords → "VALIDATION_RESULT"
  - Highest scorer: REPORT
Final output:
[REQ:VALIDATE:SPECS:REPORT]
More Real-World Examples
Example 1: Weather Prediction
Input: "What's the probability it will rain tomorrow in Seattle?"
Signals: PREDICTION (predict)
Artifacts: PROBABILITY (probability)
Epistemic: Yes (probability + will + real-world:weather)
REQ: PREDICT
SPEC: FORECAST (forecast implied)
Output: [REQ:PREDICT:SPECS:FORECAST]
Example 2: Data Extraction
Input: "Extract all email addresses and phone numbers from this document"
Signals: EXTRACTION (extract)
Artifacts: None (no JSON/list patterns in prompt)
REQ: EXTRACT
SPEC: ENTITIES (email addresses, phone numbers are entities)
Output: [REQ:EXTRACT:SPECS:ENTITIES]
Example 3: Compliance Validation
Input: "Verify that the agent followed all required disclosure steps and
validate compliance with company policies"
Signals: VALIDATION (verify, validate)
Artifacts: VALIDATION (verify, validate, compliance keywords)
Has validation target: Yes (implied TEXT)
REQ: VALIDATE
SPEC: VALIDATION_RESULT (validation context)
Output: [REQ:VALIDATE:SPECS:VALIDATION_RESULT]
Example 4: Report Generation
Input: "Create a summary report of customer feedback trends from Q4"
Signals: GENERATION (create, summary)
Artifacts: TEXT (report)
REQ: GENERATE (has TEXT artifact)
SPEC: REPORT (report explicitly mentioned)
Output: [REQ:GENERATE:SPECS:REPORT]
Example 5: Probability Distribution (Synthetic)
Input: "Generate a probability distribution for rolling two dice"
Signals: GENERATION (generate)
Artifacts: PROBABILITY (probability)
Epistemic: No (synthetic scenario, not real-world prediction)
REQ: GENERATE (probability without epistemic grounding)
SPEC: None (PROBABILITY_DISTRIBUTION excluded as non-domain)
Output: [REQ:GENERATE]
Example 6: Data Transformation
Input: "Convert this CSV data to JSON format {csv_data}"
Signals: TRANSFORMATION (convert)
Artifacts: STRUCTURED (JSON mentioned, {csv_data} pattern)
Has transform target: Yes (STRUCTURED)
REQ: TRANSFORM
SPEC: None
Output: [REQ:TRANSFORM]
Example 7: Recommendation/Ranking
Input: "Rank these candidates by best fit for the senior engineer position"
Signals: RANKING (rank)
Artifacts: DECISION (best)
REQ: RANK
SPEC: RANKING
Output: [REQ:RANK:SPECS:RANKING]
Example 8: Troubleshooting Guide
Input: "Generate troubleshooting steps for network connectivity issues"
Signals: GENERATION (generate)
Artifacts: TEXT (guide implied), LIST (steps)
REQ: GENERATE
SPEC: TROUBLESHOOTING_GUIDE (troubleshooting keyword)
Output: [REQ:GENERATE:SPECS:TROUBLESHOOTING_GUIDE]
Key Takeaways
- REQ detection is hierarchical - certain REQs take priority (VALIDATE > EXTRACT > TRANSFORM > PREDICT > GENERATE)
- Signals + Artifacts + Context all contribute to the final decision
- SPEC adds domain specificity to the output type (REPORT, FORECAST, VALIDATION_RESULT, etc.)
- Epistemic grounding distinguishes PREDICT from GENERATE for probabilistic tasks
- Default is ANALYZE when no clear signals are detected
Complete Vocabulary Reference
The complete trigger phrase vocabularies are defined in language-specific vocabulary files:
Location: clm_core/dictionary/{lang}/vocabulary.py
Available languages:
- en - English (ENVocabulary)
- es - Spanish (ESVocabulary)
- pt - Portuguese (PTVocabulary)
- fr - French (FRVocabulary)
Key vocabulary properties:
- REQ_TOKENS - Maps REQ types to trigger phrases:
  "ANALYZE": ["analyze", "review", "examine", "evaluate", "assess", ...]
  "EXTRACT": ["extract", "pull out", "identify", "find", "retrieve", ...]
  "GENERATE": ["generate", "create", "write", "draft", "compose", ...]
  "VALIDATE": ["validate", "verify", "check", "confirm", "ensure", ...]
  "TRANSFORM": ["convert", "transform", "change", "rewrite", ...]
  "FORMAT": ["format", "structure", "organize", "layout", ...]
  "DEBUG": ["debug", "troubleshoot", "diagnose", "fix bug", ...]
  "SEARCH": ["search", "query", "lookup", "find", "look for", ...]
  "RANK": ["prioritize", "order", "sort by", "rate", "rank", ...]
  "PREDICT": ["predict", "forecast", "project", "estimate future", ...]
  "CALCULATE": ["calculate", "compute", "figure out", "quantify", ...]
  "EXECUTE": ["use", "apply", "implement", "run", "perform", ...]
- EPISTEMIC_KEYWORDS - Keywords for epistemic grounding:
  "future": ["next", "upcoming", "future", "will", "expected", "forecast", ...]
  "uncertainty": ["chance", "likelihood", "probability", "odds", "risk", ...]
  "real_world": ["match", "season", "weather", "election", "market", ...]
- Other useful vocabularies:
  - ACTION_VERBS - General action verbs
  - COMPOUND_PHRASES - Multi-word phrases (e.g., "customer support" → "TICKET")
  - TYPE_MAP - Document type mappings
  - CONTEXT_MAP - Domain context mappings
Example usage:
from clm_core.dictionary.en.vocabulary import ENVocabulary
vocab = ENVocabulary()
# Get all trigger phrases for EXTRACT
extract_phrases = vocab.REQ_TOKENS["EXTRACT"]
# ["extract", "pull out", "identify", "find", ...]
# Get epistemic keywords
future_keywords = vocab.EPISTEMIC_KEYWORDS["future"]
# ["next", "upcoming", "future", "will", ...]
TARGET Token (Object/Source)
Common values:
- TRANSCRIPT - Conversation record
- DOCUMENT - General document
- TICKET - Support ticket
- CODE - Source code
- DATA - Dataset
- EMAIL - Email message
- INVOICE - Invoice document
- REPORT - Analysis report
Attributes:
- DOMAIN - Subject area: DOMAIN=SUPPORT, DOMAIN=FINANCE
- TYPE - Specific subtype: TYPE=INVOICE, TYPE=CONTRACT
- TOPIC - Subject matter: TOPIC=BILLING, TOPIC=TECHNICAL
Examples:
[TARGET:TRANSCRIPT]
[TARGET:TRANSCRIPT:DOMAIN=SUPPORT]
[TARGET:DOCUMENT:TYPE=INVOICE:DOMAIN=FINANCE]
EXTRACT Token (Fields to Extract)
Common values:
Customer Service:
- SENTIMENT, URGENCY, ISSUE, RESOLUTION, COMPLIANCE, DISCLOSURES
Entities:
- NAMES, DATES, AMOUNTS, EMAILS, PHONES, ADDRESSES
Technical:
- BUGS, ERRORS, PERFORMANCE, SECURITY
Business:
- METRICS, DECISIONS, ACTIONS, NEXT_STEPS, OWNERS
Attributes:
- TYPE - Data structure: TYPE=LIST, TYPE=TABLE
- DOMAIN - Context: DOMAIN=LEGAL, DOMAIN=FINANCE
- SOURCE - Origin: SOURCE=AGENT, SOURCE=CUSTOMER
Examples:
[EXTRACT:SENTIMENT,URGENCY,RESOLUTION]
[EXTRACT:COMPLIANCE:SOURCE=AGENT]
[EXTRACT:NAMES,EMAILS,PHONES:TYPE=LIST]
CTX Token (Context)
Common patterns:
[CTX:CUSTOMER_SERVICE]
[CTX:LANGUAGE=EN]
[CTX:TONE=PROFESSIONAL]
[CTX:ESCALATE_IF=BASIC_FAILED:TARGET=TIER2]
OUT Token (Output Format)
Simple format:
[OUT:JSON]
[OUT:MARKDOWN]
[OUT:TABLE]
[OUT:LIST]
Structured JSON:
Basic:
[OUT_JSON:{field1,field2,field3}]
With types (infer_types=True):
[OUT_JSON:{summary:STR,score:FLOAT,items:[STR]}]
Nested:
[OUT_JSON:{summary:STR,scores:{accuracy:FLOAT,compliance:FLOAT}}]
With enums (add_attrs=True):
[OUT_JSON:{score:FLOAT}:ENUMS={"ranges":[{"min":0.0,"max":0.49,"label":"FAIL"}]}]
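Building an OUT_JSON token from a schema is mechanical. A hypothetical helper (not the actual encoder API) that mirrors the typed and untyped formats above:

```python
# Hypothetical helper that emits the [OUT_JSON:{...}] token from a schema dict,
# mirroring the typed (infer_types=True) and name-only formats shown above.
def out_json_token(schema: dict, infer_types: bool = True) -> str:
    def render(node) -> str:
        if isinstance(node, dict):
            parts = []
            for key, value in node.items():
                if isinstance(value, (dict, list)):
                    parts.append(f"{key}:{render(value)}")  # keep nested structure
                elif infer_types:
                    parts.append(f"{key}:{value}")          # leaf with type name
                else:
                    parts.append(key)                       # leaf, name only
            return "{" + ",".join(parts) + "}"
        if isinstance(node, list):
            return "[" + render(node[0]) + "]"
        return node  # a type name like "STR" or "FLOAT"
    return f"[OUT_JSON:{render(schema)}]"

schema = {"summary": "STR", "scores": {"accuracy": "FLOAT", "compliance": "FLOAT"}}
print(out_json_token(schema))
# [OUT_JSON:{summary:STR,scores:{accuracy:FLOAT,compliance:FLOAT}}]
```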
REF Token (References)
Examples:
[REF:CASE=12345]
[REF:TICKET=TKT-789]
[REF:KB=KB-001]
[REF:POLICY=POL-2024-05]
Intent Detection in the Encoding Pipeline
The IntentDetectorV2 is the first step in the system prompt encoding pipeline:
┌─────────────────────────────────────────────────────────────┐
│ 1. INTENT DETECTION (IntentDetectorV2) │
│ Input: Natural language prompt │
│ Output: Intent (REQ + SPEC) │
│ │
│ Process: │
│ • Detect signals from vocabulary │
│ • Detect artifacts from patterns │
│ • Check epistemic grounding │
│ • Resolve REQ token (priority-based) │
│ • Extract SPEC (scoring-based) │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ 2. TARGET DETECTION │
│ Extract what the operation targets (TRANSCRIPT, etc.) │
│ Output: [TARGET:TRANSCRIPT:DOMAIN=SUPPORT] │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ 3. EXTRACTION FIELD DETECTION │
│ Identify fields to extract (SENTIMENT, URGENCY, etc.) │
│ Output: [EXTRACT:SENTIMENT,URGENCY,RESOLUTION] │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ 4. CONTEXT DETECTION │
│ Extract context and conditions │
│ Output: [CTX:LANGUAGE=EN] │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ 5. OUTPUT FORMAT DETECTION │
│ Parse output schema and format requirements │
│ Output: [OUT_JSON:{field:TYPE,...}:ENUMS={...}] │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ 6. REFERENCE DETECTION │
│ Extract IDs and references │
│ Output: [REF:TICKET=TKT-123] │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ FINAL COMPRESSED OUTPUT: │
│ [REQ:VALIDATE:SPECS:REPORT] │
│ [TARGET:TRANSCRIPT:DOMAIN=SUPPORT] │
│ [EXTRACT:SENTIMENT,URGENCY,RESOLUTION] │
│ [CTX:LANGUAGE=EN] │
│ [OUT_JSON:{summary:STR,scores:{...}}:ENUMS={...}] │
│ [REF:TICKET=TKT-123] │
└─────────────────────────────────────────────────────────────┘
Key points:
- Intent detection happens first and determines the REQ token
- REQ token guides the rest of the encoding process
- SPEC provides additional domain context for the output
- All components work together to form the complete compressed prompt
Complete System Prompt Example
Original:
You are a Call QA & Compliance Scoring System for customer service operations.
TASK:
Analyze the transcript and score the agent's compliance across required QA categories.
ANALYSIS CRITERIA:
- Mandatory disclosures and verification steps
- Policy adherence
- Soft-skill behaviors (empathy, clarity, ownership)
OUTPUT FORMAT:
{
"summary": "short_summary",
"qa_scores": {
"verification": 0.0,
"policy_adherence": 0.0,
"soft_skills": 0.0,
"compliance": 0.0
},
"violations": ["list_any_detected"]
}
Compressed (Level 1: No types, no attrs):
[REQ:ANALYZE] [TARGET:TRANSCRIPT:DOMAIN=QA]
[EXTRACT:COMPLIANCE,DISCLOSURES,VERIFICATION,POLICY,SOFT_SKILLS:TYPE=LIST,DOMAIN=LEGAL]
[OUT_JSON:{summary,qa_scores:{verification,policy_adherence,soft_skills,compliance},violations}]
Compression: 70.7%
Compressed (Level 4: Types + attrs):
[REQ:ANALYZE] [TARGET:TRANSCRIPT:DOMAIN=QA]
[EXTRACT:COMPLIANCE,DISCLOSURES,VERIFICATION,POLICY,SOFT_SKILLS:TYPE=LIST,DOMAIN=LEGAL]
[OUT_JSON:{summary:STR,qa_scores:{verification:FLOAT,policy_adherence:FLOAT,soft_skills:FLOAT,compliance:FLOAT},violations:[STR]}:ENUMS={"ranges":[{"min":0.0,"max":0.49,"label":"FAIL"},{"min":0.5,"max":0.74,"label":"NEEDS_IMPROVEMENT"},{"min":0.75,"max":0.89,"label":"GOOD"},{"min":0.9,"max":1.0,"label":"EXCELLENT"}]}]
Compression: 26.6%
Part 1b: Configuration Prompt Tokenization
Purpose
Configuration prompts are template-based system instructions that define an agent's persistent behavior. Unlike task prompts that focus on a specific action, configuration prompts establish identity, rules, and behavioral patterns.
Key differences from task prompts:
- Define agent role and persona (not actions)
- Contain behavioral rules (basic and custom)
- Support runtime placeholders for dynamic values
- Include priority definitions for rule conflicts
Configuration Prompt Token Types
+-----------------------------------------+
| 1. PROMPT_MODE - Prompt type | <- Identifier
| [PROMPT_MODE:CONFIGURATION] |
+-----------------------------------------+
|
v
+-----------------------------------------+
| 2. ROLE - Agent identity | <- Who
| [ROLE:CUSTOMER_SUPPORT_AGENT] |
+-----------------------------------------+
|
v
+-----------------------------------------+
| 3. RULES - Active rule sets | <- Behavior
| [RULES:BASIC,CUSTOM] |
+-----------------------------------------+
|
v
+-----------------------------------------+
| 4. PRIORITY - Conflict resolution | <- Precedence
| [PRIORITY:CUSTOM_OVER_BASIC] |
+-----------------------------------------+
|
v
+-----------------------------------------+
| 5. OUT - Output format (optional) | <- Output spec
| [OUT_JSON:{field:TYPE}] |
+-----------------------------------------+
Token Definitions
| Token | Purpose | Required | Examples |
|---|---|---|---|
| PROMPT_MODE | Identifies prompt type | Yes | [PROMPT_MODE:CONFIGURATION] |
| ROLE | Agent identity/persona | When detected | [ROLE:ASSISTANT], [ROLE:CUSTOMER_SUPPORT_AGENT] |
| RULES | Active rule sets | When detected | [RULES:BASIC], [RULES:BASIC,CUSTOM] |
| PRIORITY | Rule conflict resolution | When detected | [PRIORITY:CUSTOM_OVER_BASIC] |
| OUT | Output format | When specified | [OUT_JSON:{response:STR}] |
PROMPT_MODE Token
Purpose: Identifies this as a configuration prompt (vs task prompt)
Values:
- CONFIGURATION - Template-based agent configuration
- TASK - Action-oriented task prompt (default)
Example:
[PROMPT_MODE:CONFIGURATION]
ROLE Token
Purpose: Captures the agent's identity and persona
Detection patterns:
- <role>You are a...</role> tags
- "You are a..." or "Your role is..." phrases
Examples:
[ROLE:HELPFUL_ASSISTANT]
[ROLE:CUSTOMER_SUPPORT_AGENT]
[ROLE:CONTENT_MODERATOR]
[ROLE:PROFESSIONAL_TRANSLATOR]
Normalization:
- Spaces replaced with underscores
- Converted to uppercase
- Articles (a, an, the) removed
RULES Token
Purpose: Indicates which rule sets are active
Values:
- BASIC - Standard/default rules detected
- CUSTOM - User-specific rules detected
Detection patterns:
- <basic_rules> tags or "basic rules" phrase
- <custom_rules> tags or "custom instructions" phrase
Examples:
[RULES:BASIC]
[RULES:CUSTOM]
[RULES:BASIC,CUSTOM]
PRIORITY Token
Purpose: Defines how rule conflicts should be resolved
Values:
- CUSTOM_OVER_BASIC - Custom rules take precedence
Detection patterns:
- "custom instructions are paramount"
- "prioritize custom instructions"
- "custom instructions override"
Example:
[PRIORITY:CUSTOM_OVER_BASIC]
Configuration Prompt Example
Original:
<role>You are a helpful customer support agent</role>
<basic_rules>
- Be polite and professional
- Verify customer identity
- Document all interactions
</basic_rules>
<custom_rules>
- Address customer as: {{customer_name}}
- Account tier: {{account_tier}}
</custom_rules>
Follow the basic rules as your foundation. If there are conflicts
between basic rules and custom instructions, prioritize custom
instructions. Custom instructions are paramount.
OUTPUT:
{
"response": "message",
"escalate": true/false
}
Compressed CL Token:
[PROMPT_MODE:CONFIGURATION][ROLE:CUSTOMER_SUPPORT_AGENT][RULES:BASIC,CUSTOM][PRIORITY:CUSTOM_OVER_BASIC][OUT_JSON:{response:STR,escalate:BOOL}]
Metadata extracted:
- role: "CUSTOMER_SUPPORT_AGENT"
- rules: {"basic": true, "custom": true}
- priority: "CUSTOM_OVER_BASIC"
- placeholders: ["customer_name", "account_tier"]
- output_format: "{response:STR,escalate:BOOL}"
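Placeholder extraction can be sketched with a regex over the {{...}} syntax. `extract_placeholders` is a hypothetical name; the real extractor may handle more delimiter styles:

```python
import re

# Sketch of extracting {{...}} runtime placeholders from a configuration prompt.
def extract_placeholders(prompt: str) -> list:
    seen = []
    for name in re.findall(r"\{\{\s*(\w+)\s*\}\}", prompt):
        if name not in seen:  # preserve first-seen order, drop duplicates
            seen.append(name)
    return seen

prompt = "Address customer as: {{customer_name}}\nAccount tier: {{account_tier}}"
print(extract_placeholders(prompt))  # ['customer_name', 'account_tier']
```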
Two-Phase Compression
Configuration prompts use a two-phase compression approach:
Phase 1: CL Token Generation
- Extract semantic elements (role, rules, priority)
- Generate compressed CL tokens
- Detect and encode output format
Phase 2: NL Minimization
- Remove redundant meta-instructions
- Suppress priority explanations (encoded in CL)
- Trim verbose rule descriptions
- Remove content already encoded in CL tokens
Result: CL tokens + minimized NL prompt
See Configuration Prompt Encoding for complete documentation.
Part 2: Transcript Tokenization (v2 Schema)
Purpose
Compress customer service conversations into an explicit semantic contract while preserving:
- Interaction metadata (channel, duration, language)
- Domain and service context
- Customer intent (derived from customer utterances)
- Context provided (PII-safe)
- Agent and system actions (separated)
- Resolution outcome and authoritative state
- Commitments and artifacts
- Emotional trajectory
The 14 Semantic Blocks
┌──────────────────────────────────────────────────┐
│ 1. INTERACTION - Interaction metadata │ ← Setup
│ [INTERACTION:SUPPORT:CHANNEL=VOICE] │
└──────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────┐
│ 2. DURATION - Call duration │ ← Time
│ [DURATION=6m] │
└──────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────┐
│ 3. LANG - Language │ ← Language
│ [LANG=EN] │
└──────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────┐
│ 4. DOMAIN - Service area classification │ ← Domain
│ [DOMAIN:BILLING] │
└──────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────┐
│ 5. SERVICE - Service within domain │ ← Service
│ [SERVICE:SUBSCRIPTION] │
└──────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────┐
│ 6. CUSTOMER_INTENT - Customer's goal │ ← Intent
│ [CUSTOMER_INTENT:REPORT_DUPLICATE_CHARGE] │
└──────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────┐
│ 7. CONTEXT - Facts provided (PII-safe) │ ← Context
│ [CONTEXT:EMAIL_PROVIDED] │
└──────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────┐
│ 8. AGENT_ACTIONS - Agent operations chain │ ← Actions
│ [AGENT_ACTIONS:VERIFIED→DIAGNOSED→REFUNDED] │
└──────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────┐
│ 9. SYSTEM_ACTIONS - Automated events │ ← System
│ [SYSTEM_ACTIONS:PAYMENT_RETRY_DETECTED] │
└──────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────┐
│ 10. RESOLUTION - Outcome type │ ← Outcome
│ [RESOLUTION:ISSUE_RESOLVED] │
└──────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────┐
│ 11. STATE - Authoritative status │ ← Status
│ [STATE:RESOLVED] │
└──────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────┐
│ 12. COMMITMENT - SLA / promised actions │ ← Promises
│ [COMMITMENT:REFUND_3-5_DAYS] │
└──────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────┐
│ 13. ARTIFACT - Structured identifiers │ ← IDs
│ [ARTIFACT:REFUND_REF=RFD-908712] │
└──────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────┐
│ 14. SENTIMENT - Emotional trajectory │ ← Feeling
│ [SENTIMENT:NEUTRAL→GRATEFUL] │
└──────────────────────────────────────────────────┘
Token Definitions
| Token | Format | Purpose | Example |
|---|---|---|---|
| INTERACTION | [INTERACTION:TYPE:CHANNEL=ch] | Interaction metadata | [INTERACTION:SUPPORT:CHANNEL=VOICE] |
| DURATION | [DURATION=Xm] | Call duration | [DURATION=6m] |
| LANG | [LANG=XX] | Language | [LANG=EN] |
| DOMAIN | [DOMAIN:TYPE] | Domain classification | [DOMAIN:BILLING] |
| SERVICE | [SERVICE:TYPE] | Service area | [SERVICE:SUBSCRIPTION] |
| CUSTOMER_INTENT | [CUSTOMER_INTENT:INTENT] | Customer's goal | [CUSTOMER_INTENT:REQUEST_REFUND] |
| CONTEXT | [CONTEXT:TYPE] | PII-safe context | [CONTEXT:EMAIL_PROVIDED] |
| AGENT_ACTIONS | [AGENT_ACTIONS:A1→A2→A3] | Agent action chain | [AGENT_ACTIONS:ACCOUNT_VERIFIED→REFUND_INITIATED] |
| SYSTEM_ACTIONS | [SYSTEM_ACTIONS:E1→E2] | System events | [SYSTEM_ACTIONS:PAYMENT_RETRY_DETECTED] |
| RESOLUTION | [RESOLUTION:TYPE] | Outcome type | [RESOLUTION:ISSUE_RESOLVED] |
| STATE | [STATE:STATUS] | Authoritative status | [STATE:RESOLVED] |
| COMMITMENT | [COMMITMENT:PROMISE] | SLA/promised actions | [COMMITMENT:REFUND_3-5_DAYS] |
| ARTIFACT | [ARTIFACT:TYPE=VALUE] | Structured identifiers | [ARTIFACT:REFUND_REF=RFD-908712] |
| SENTIMENT | [SENTIMENT:START→END] | Emotional trajectory | [SENTIMENT:NEUTRAL→GRATEFUL] |
INTERACTION Token
Format: [INTERACTION:TYPE:CHANNEL=channel]
Type values:
- SUPPORT - Customer support
- SALES - Sales interaction
- BILLING - Billing-specific
Channel values:
- VOICE - Phone call
- CHAT - Live chat
- EMAIL - Email
- SLACK - Slack thread
Examples:
[INTERACTION:SUPPORT:CHANNEL=VOICE]
[INTERACTION:SALES:CHANNEL=CHAT]
[INTERACTION:BILLING:CHANNEL=EMAIL]
DURATION Token
Format: [DURATION=Xm]
Purpose: Approximate call duration in minutes (derived from turn count: ~2 turns/minute)
Examples:
[DURATION=6m]
[DURATION=15m]
[DURATION=1m]
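Since the duration is derived from turn count at roughly 2 turns per minute, the token can be produced mechanically. The helper name and the rounding rule below are illustrative assumptions, not part of the CLM API:

```python
def duration_token(turn_count: int, turns_per_minute: int = 2) -> str:
    """Approximate call duration from turn count (~2 turns/minute),
    clamped to a minimum of 1 minute."""
    minutes = max(1, round(turn_count / turns_per_minute))
    return f"[DURATION={minutes}m]"
```

For example, a 12-turn call yields `[DURATION=6m]`.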
LANG Token
Format: [LANG=XX]
Purpose: Language metadata. Schema is language-invariant — extraction normalizes cross-language expressions to the same enums.
Values: EN, ES, PT, FR
DOMAIN Token
Format: [DOMAIN:TYPE]
Purpose: Explicit service area classification
Domain types (20–30 max):
- BILLING - Payment and billing issues
- AUTHENTICATION - Login, access, credentials
- BOOKINGS - Reservations, scheduling
- API - API-related issues
- PERFORMANCE - Speed, reliability
- TECHNICAL - Technical support
- SHIPPING - Delivery and shipping
Examples:
[DOMAIN:BILLING]
[DOMAIN:AUTHENTICATION]
[DOMAIN:TECHNICAL]
SERVICE Token
Format: [SERVICE:TYPE]
Purpose: Service area within the domain
Service types:
- SUBSCRIPTION - Subscription management
- HOST_STAY - Hospitality/hosting
- PAYMENT - Payment processing
- DASHBOARD - Dashboard/UI
- EXPORTS - Data exports
Examples:
[SERVICE:SUBSCRIPTION]
[SERVICE:PAYMENT]
CUSTOMER_INTENT Token
Format: [CUSTOMER_INTENT:INTENT]
Purpose: Primary customer intent derived strictly from customer utterances. Must not be inferred solely from agent actions.
Intent types (20–40 max):
- REQUEST_REFUND
- REPORT_DUPLICATE_CHARGE
- ACCOUNT_UNLOCK
- FEATURE_INQUIRY
- CANCEL_BOOKING
- REPORT_OUTAGE
- REQUEST_UPGRADE
Rules:
- One primary intent required
- Optional secondary intent allowed
- Must not be inferred solely from agent actions
Examples:
[CUSTOMER_INTENT:REPORT_DUPLICATE_CHARGE]
[CUSTOMER_INTENT:REQUEST_REFUND]
[CUSTOMER_INTENT:ACCOUNT_UNLOCK]
CONTEXT Token
Format: [CONTEXT:TYPE]
Purpose: Records the fact that the customer provided a piece of information, without leaking the PII itself
Context types:
- EMAIL_PROVIDED
- PHONE_PROVIDED
- BOOKING_ID_PROVIDED
- PAYMENT_METHOD_PROVIDED
- ACCOUNT_ID_PROVIDED
- ORDER_ID_PROVIDED
- TRACKING_ID_PROVIDED
Redacted variant:
[CONTEXT:PAYMENT_METHOD_REDACTED]
Examples:
[CONTEXT:EMAIL_PROVIDED]
[CONTEXT:BOOKING_ID_PROVIDED]
AGENT_ACTIONS Token
Format: [AGENT_ACTIONS:ACTION1→ACTION2→ACTION3]
Purpose: Operational actions performed by the human agent, joined into an ordered chain with arrows (→)
Action types (30–50 max):
- ACCOUNT_VERIFIED
- DIAGNOSTIC_PERFORMED
- REFUND_INITIATED
- BOOKING_CANCELLED
- ACCOUNT_UNLOCKED
- API_KEY_ROTATED
- ESCALATED_TIER2
Avoid generic verbs: TROUBLESHOOT, CHECKED, ACTION_TAKEN
Examples:
[AGENT_ACTIONS:ACCOUNT_VERIFIED→DIAGNOSTIC_PERFORMED→REFUND_INITIATED]
[AGENT_ACTIONS:ACCOUNT_UNLOCKED]
[AGENT_ACTIONS:API_KEY_ROTATED→ESCALATED_TIER2]
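Building the arrow-chained token and screening out the generic verbs listed above can be done in a few lines. This helper is a sketch (the function name and error handling are assumptions, not the CLM API):

```python
# Generic verbs the guidance above says to avoid
GENERIC_VERBS = {"TROUBLESHOOT", "CHECKED", "ACTION_TAKEN"}

def agent_actions_token(actions: list[str]) -> str:
    """Join ordered agent actions into one arrow-chained token,
    rejecting the generic verbs the spec advises against."""
    bad = GENERIC_VERBS.intersection(actions)
    if bad:
        raise ValueError(f"generic verbs not allowed: {sorted(bad)}")
    return "[AGENT_ACTIONS:" + "\u2192".join(actions) + "]"
```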
SYSTEM_ACTIONS Token
Format: [SYSTEM_ACTIONS:EVENT1→EVENT2]
Purpose: Automated system-level events (optional, only emitted when detected)
System action types:
- PAYMENT_RETRY_DETECTED
- AUTO_ESCALATION_TRIGGERED
- SLA_BREACH_DETECTED
- FRAUD_ALERT_TRIGGERED
- ACCOUNT_AUTO_LOCKED
- NOTIFICATION_SENT
Examples:
[SYSTEM_ACTIONS:PAYMENT_RETRY_DETECTED]
[SYSTEM_ACTIONS:AUTO_ESCALATION_TRIGGERED→SLA_BREACH_DETECTED]
RESOLUTION Token
Format: [RESOLUTION:TYPE]
Purpose: Describes outcome type (not state)
Resolution types:
- ISSUE_RESOLVED - Issue fixed
- ACCOUNT_UNLOCKED - Access restored
- ANSWER_PROVIDED - Information given
- ESCALATED - Sent to higher tier
- PENDING - Awaiting resolution
- CANCELLED - Cancelled
Examples:
[RESOLUTION:ISSUE_RESOLVED]
[RESOLUTION:ESCALATED]
[RESOLUTION:ANSWER_PROVIDED]
STATE Token
Format: [STATE:STATUS]
Purpose: Authoritative interaction status. Mutually exclusive — only one STATE per transcript.
State values (5–10 max):
- RESOLVED - Fully resolved
- PENDING_SETTLEMENT - Awaiting settlement
- PENDING_CUSTOMER - Awaiting customer action
- ESCALATED - Escalated
- UNRESOLVED - Not resolved
Examples:
[STATE:RESOLVED]
[STATE:PENDING_SETTLEMENT]
[STATE:ESCALATED]
COMMITMENT Token
Format: [COMMITMENT:PROMISE]
Purpose: Encodes SLA or promised actions by the agent
Examples:
[COMMITMENT:REFUND_3-5_DAYS]
[COMMITMENT:FOLLOWUP_BY_FRIDAY]
[COMMITMENT:CALLBACK_24h]
[COMMITMENT:TECHNICIAN_VISIT_MONDAY]
ARTIFACT Token
Format: [ARTIFACT:TYPE=VALUE]
Purpose: Structured identifiers extracted from conversation
Artifact types:
- REFUND_REF - Refund reference number
- REFUND_AMT - Refund amount
- BOOKING_ID - Booking identifier
- ORDER_ID - Order identifier
- TRACKING_ID - Tracking number
- TICKET_ID - Support ticket ID
- CASE_ID - Case number
- CLAIM_ID - Claim number
- PRODUCT_ID - Product model
Examples:
[ARTIFACT:REFUND_REF=RFD-908712]
[ARTIFACT:REFUND_AMT=$14.99]
[ARTIFACT:ORDER_ID=ORD-456789]
[ARTIFACT:TRACKING_ID=TRK-1234]
SENTIMENT Token
Format: [SENTIMENT:START→END]
Purpose: Tracks emotional trajectory through conversation
Sentiment values:
- FRUSTRATED, ANGRY, CONCERNED
- NEUTRAL, CALM
- SATISFIED, GRATEFUL, POSITIVE
Special notation: Uses arrows (→) to show progression. Consecutive duplicate states are collapsed so only turning points remain.
Examples:
[SENTIMENT:FRUSTRATED→NEUTRAL→SATISFIED]
[SENTIMENT:NEUTRAL→GRATEFUL]
[SENTIMENT:ANGRY→CALM→GRATEFUL]
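The collapse of consecutive duplicates into turning points can be sketched as follows (the helper name is hypothetical, not part of the CLM API):

```python
def sentiment_token(trajectory: list[str]) -> str:
    """Collapse consecutive duplicate states so only turning points remain,
    then join them with arrows."""
    deduped: list[str] = []
    for state in trajectory:
        if not deduped or deduped[-1] != state:
            deduped.append(state)
    return "[SENTIMENT:" + "\u2192".join(deduped) + "]"
```

A per-turn trajectory like NEUTRAL, NEUTRAL, FRUSTRATED, FRUSTRATED, SATISFIED therefore compresses to `[SENTIMENT:NEUTRAL→FRUSTRATED→SATISFIED]`.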
Complete Transcript Example (v2)
Original conversation (billing dispute call):
Agent Raj: Thank you for calling customer support. My name is Raj. How can I help you today?
Customer: Hi Raj, I have a billing issue. I was charged twice this month for my subscription.
Agent Raj: I'm sorry to hear that. Let me look into your account. Can I have your email?
Customer: Sure, it's melissa.jordan@example.com
Agent Raj: I see two charges. The system retried payment after the first succeeded.
I'll process a full refund. Reference number RFD-908712, 3 to 5 business days.
Customer: Thank you so much for your help!
Compressed (v2):
[INTERACTION:SUPPORT:CHANNEL=VOICE]
[DURATION=6m]
[LANG=EN]
[DOMAIN:BILLING]
[SERVICE:SUBSCRIPTION]
[CUSTOMER_INTENT:REPORT_DUPLICATE_CHARGE]
[CONTEXT:EMAIL_PROVIDED]
[AGENT_ACTIONS:ACCOUNT_VERIFIED→DIAGNOSTIC_PERFORMED→REFUND_INITIATED]
[SYSTEM_ACTIONS:PAYMENT_RETRY_DETECTED]
[RESOLUTION:ISSUE_RESOLVED]
[STATE:RESOLVED]
[COMMITMENT:REFUND_3-5_DAYS]
[ARTIFACT:REFUND_REF=RFD-908712]
[SENTIMENT:NEUTRAL→GRATEFUL]
Original: ~1,450 tokens
Compressed: ~145 tokens
Reduction: 90%
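Because each semantic block sits on its own line in a predictable `[TOKEN:VALUE]` or `[TOKEN=VALUE]` shape, compressed transcripts like the one above can be parsed back into fields with a short regex. This is an illustrative sketch, not part of the CLM API:

```python
import re

# Token name, then ':' or '=' separator, then the remainder of the block
TOKEN_RE = re.compile(r"\[([A-Z_]+)[:=](.+)\]")

def parse_transcript(compressed: str) -> dict[str, str]:
    """Parse v2 semantic-block lines like [DOMAIN:BILLING] into a dict."""
    fields: dict[str, str] = {}
    for line in compressed.strip().splitlines():
        m = TOKEN_RE.fullmatch(line.strip())
        if m:
            fields[m.group(1)] = m.group(2)
    return fields
```

For instance, `parse_transcript("[DOMAIN:BILLING]\n[DURATION=6m]")` yields `{"DOMAIN": "BILLING", "DURATION": "6m"}`.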
Key Differences from System Prompts
| Aspect | System Prompts | Transcripts (v2) |
|---|---|---|
| Token Types | REQ, TARGET, EXTRACT, CTX, OUT, REF | INTERACTION, DOMAIN, CUSTOMER_INTENT, AGENT_ACTIONS, STATE, SENTIMENT, etc. |
| Structure | Hierarchical (instruction flow) | Sequential (semantic blocks) |
| Purpose | Instruction compression | Conversation compression as explicit semantic contract |
| Flow | Logical (what→how→output) | Temporal (metadata→intent→actions→outcome→sentiment) |
| Special Features | Nested JSON with types/enums | PII-safe context, separated agent/system actions |
| Actions | Comma-separated: REQ:ACTION1,ACTION2 | Arrow-chained: AGENT_ACTIONS:A1→A2→A3 |
Part 3: Structured Data Format
Purpose
Compress tabular data (catalogs, products, rules) while preserving:
- Field schema
- Record structure
- Relationships
- Data types
Format Structure
NOT token-based - uses header + row format:
[DATASET_NAME:COUNT]{FIELD1,FIELD2,FIELD3,...}
[value1,value2,value3,...]
[value1,value2,value3,...]
Components
1. Header:
[DATASET_NAME:COUNT]{FIELD_NAMES}
- DATASET_NAME: Data type (e.g., NBA_CATALOG, PRODUCT, RULE)
- COUNT: Number of records
- {FIELD_NAMES}: Comma-separated field list
2. Rows:
[value1,value2,value3,...]
- One row per record
- Values in same order as header
- Nested objects: {KEY:VALUE}
- Arrays: [ITEM1,ITEM2]
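The header + row layout can be emitted mechanically from a list of records. The sketch below assumes all records share the same keys and that values are already normalized strings (the function name is hypothetical, not the CLM API):

```python
def compress_records(name: str, records: list[dict]) -> str:
    """Emit a [NAME:COUNT]{FIELDS} header plus one bracketed row per record."""
    fields = list(records[0])  # field order taken from the first record
    header = f"[{name}:{len(records)}]" + "{" + ",".join(f.upper() for f in fields) + "}"
    rows = ["[" + ",".join(str(r[f]) for f in fields) + "]" for r in records]
    return "\n".join([header] + rows)
```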
Data Type Handling
| Type | Original | Compressed |
|---|---|---|
| String | "Hello World" | HELLO_WORLD |
| Number | 1299.99 | 1299.99 |
| Boolean | true | TRUE |
| Array | ["a", "b", "c"] | [A,B,C] |
| Null | null | NULL |
| Date | "2024-10-15" | 2024-10-15 |
| Object | {"key": "value"} | {KEY:VALUE} |
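The mappings in this table can be expressed as one recursive normalization function. This is a sketch of the rules as documented, not the encoder's actual implementation (the date heuristic, for example, is an assumption):

```python
import re

def normalize(value) -> str:
    """Normalize a Python value per the data-type table above."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):          # check before int: bool is a subclass of int
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, list):
        return "[" + ",".join(normalize(v) for v in value) + "]"
    if isinstance(value, dict):
        return "{" + ",".join(f"{k.upper()}:{normalize(v)}" for k, v in value.items()) + "}"
    # Strings: ISO dates pass through unchanged, free text becomes UPPER_SNAKE
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return value
    return re.sub(r"\s+", "_", value.strip()).upper()
```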
Complete Example: NBA Catalog
Original:
[
{
"nba_id": "NBA-001",
"action": "Offer Premium Upgrade",
"description": "Recommend premium tier to qualified customers",
"conditions": ["tenure > 12 months", "no recent complaints"],
"priority": "high",
"channel": "phone",
"expected_value": 450
},
{
"nba_id": "NBA-002",
"action": "Cross-sell Credit Card",
"description": "Offer co-branded credit card to active users",
"conditions": ["good credit score", "active checking"],
"priority": "medium",
"channel": "email",
"expected_value": 300
}
]
Compressed:
[NBA_CATALOG:2]{NBA_ID,ACTION,DESCRIPTION,CONDITIONS,PRIORITY,CHANNEL,EXPECTED_VALUE}
[NBA-001,OFFER_PREMIUM_UPGRADE,RECOMMEND_PREMIUM_TIER,[TENURE>12M,NO_COMPLAINTS],HIGH,PHONE,450]
[NBA-002,CROSS_SELL_CREDIT_CARD,OFFER_COBRANDED_CARD,[GOOD_CREDIT,ACTIVE_CHECKING],MEDIUM,EMAIL,300]
Compression: ~82%
Nested Structures
Original:
{
"sku": "LAPTOP-001",
"name": "Professional Laptop",
"specifications": {
"processor": "Intel i7",
"ram": "16GB DDR5",
"storage": "512GB SSD"
},
"features": ["Backlit keyboard", "Fingerprint reader"]
}
Compressed:
[PRODUCT:1]{SKU,NAME,SPECIFICATIONS,FEATURES}
[LAPTOP-001,PROFESSIONAL_LAPTOP,{PROCESSOR:I7,RAM:16GB_DDR5,STORAGE:512GB_SSD},[BACKLIT_KB,FINGERPRINT]]
Key Differences from Token Systems
| Aspect | System Prompt / Transcript | Structured Data |
|---|---|---|
| Format | Semantic tokens | Header + rows |
| Categories | 6 or 7 token types | No token categories |
| Purpose | Meaning compression | Schema + data compression |
| Structure | Token hierarchy/flow | Tabular (spreadsheet-like) |
| Nesting | Via token attributes | Via {}, [] notation |
| Semantic | High (preserves meaning) | Medium (preserves structure) |
Common Principles
Despite using different systems, all three encoders share core principles:
1. Semantic Preservation
Goal: Maintain complete meaning in compressed form
All three systems preserve the essential meaning of the original content, just using different methods:
- System Prompts: Hierarchical semantic tokens
- Transcripts: Sequential semantic blocks (v2 schema)
- Structured Data: Schema-based compression
2. LLM-Native Format
Goal: LLMs understand without decompression
All three formats are designed to be understood by modern LLMs (GPT-4, Claude, etc.) without requiring decompression:
System Prompt: [REQ:ANALYZE] [TARGET:TRANSCRIPT] [EXTRACT:SENTIMENT]
Transcript: [DOMAIN:BILLING] [CUSTOMER_INTENT:REQUEST_REFUND] [STATE:RESOLVED]
Structured: [NBA_CATALOG:2]{ID,ACTION} [NBA-001,UPGRADE] [NBA-002,CROSS_SELL]
LLMs can process all three formats directly.
3. Predictable Structure
Goal: Consistent, parseable format
Each system has clear syntax rules:
- System Prompts: [CATEGORY:VALUE:ATTR=VAL]
- Transcripts: [TOKEN:TYPE:ATTR=VAL]
- Structured Data: [HEADER]{FIELDS} + [values]
4. Compression Without Loss
Goal: Dramatic size reduction while preserving information
All three achieve 60-95% token reduction while maintaining semantic completeness.
Best Practices
1. Use the Right System for Your Content
Instructions/Prompts → System Prompt Encoder
Conversations → Transcript Encoder
Tabular Data → Structured Data Encoder
2. Understand System-Specific Features
System Prompts:
- Use hierarchical flow (REQ → TARGET → EXTRACT → OUT)
- Leverage OUT_JSON for structured output
- Use CTX for conditions and escalation rules

Transcripts:
- Follow semantic block order (INTERACTION → DOMAIN → INTENT → ACTIONS → STATE → SENTIMENT)
- Use the AGENT_ACTIONS chain for ordered agent operations
- Separate system events into SYSTEM_ACTIONS
- Use CONTEXT tokens for PII-safe fact-of-information

Structured Data:
- Define a clear field schema in the header
- Use nested notation for complex structures
- Maintain consistent field order across rows
3. Test Compression Quality
# Verify compression preserves meaning
result = encoder.encode(content)
# Check compression ratio
print(f"Reduction: {result.compression_ratio:.1%}")
# Test LLM understanding
llm_response = llm.complete(
system=result.compressed,
user="Test query"
)
# Verify LLM understood the compressed content
Troubleshooting
Issue: Wrong Encoder Used
Symptom: Poor compression or unexpected output
Solution: Use the correct encoder for your content type
# For instructions:
sys_encoder = CLMEncoder(cfg=CLMConfig(lang="en"))
# For conversations:
transcript_encoder = CLMEncoder(cfg=CLMConfig(lang="en"))
# For tabular data:
sd_encoder = CLMEncoder(cfg=CLMConfig(
lang="en",
ds_config=SDCompressionConfig(...)
))
Issue: LLM Doesn't Understand Compressed Format
Symptom: LLM response quality degraded
Cause: Usually malformed token syntax — modern LLMs understand well-formed structured tokens natively
Solution:
- Verify the syntax is correct for the encoder type
- Ensure tokens are well-formed
- Test with different LLM models
Issue: Information Loss
Symptom: Important details missing from compressed output
Solution for System Prompts:
# Use less aggressive compression
config = CLMConfig(
lang="en",
sys_prompt_config=SysPromptConfig(
infer_types=True, # Add type information
add_attrs=True # Include enums/ranges
)
)
Solution for Structured Data:
# Lower importance threshold
config = CLMConfig(
lang="en",
ds_config=SDCompressionConfig(
importance_threshold=0.4, # Include more fields
max_field_length=300 # Preserve more content
)
)
Next Steps
- System Prompt Encoder - Overview of system prompt compression
- Task Prompts - Using the 6-token hierarchy
- Configuration Prompts - Template-based agent configuration
- Structured Data Encoder - Using header + row format
- CLM Vocabulary - Understanding vocabulary mappings
- CLM Configuration - Configuring the encoders