CLM Tokenization
Overview
CLM uses three different compression systems, each optimized for its specific content type. These systems are NOT interchangeable and do NOT share token vocabularies.
Core principle: Compress meaning, not characters.
Automatic Intent Detection
CLM features IntentDetectorV2, a detection system that automatically determines the correct REQ (request/action) token from natural language input. The detector analyzes:
- Signals - vocabulary-based phrase matching
- Artifacts - structural patterns in the text
- Epistemic grounding - context distinguishing prediction from generation
- SPEC detection - domain-specific output types
This means you can write natural language prompts and CLM will automatically compress them into the optimal token format.
Three Independent Systems
| Encoder | System | Structure | Compression |
|---|---|---|---|
| System Prompt | 6-Token Hierarchy | Hierarchical instruction flow | 65-90% |
| Transcript | v2 Semantic Blocks | Sequential semantic contract | 85-92% |
| Structured Data | Header + Row Format | Tabular schema + data | 70-85% |
Why Three Different Systems?
Each content type has fundamentally different characteristics:
System Prompts:
- Complex, nested instructions
- Hierarchical relationships (action → target → fields → output)
- Require logical flow preservation
- Solution: 6-token hierarchy (REQ, TARGET, EXTRACT, CTX, OUT, REF)
Transcripts:
- Sequential conversations
- Temporal flow (metadata → intent → actions → resolution → sentiment)
- Explicit semantic contract separating intent, actions, and state
- Solution: v2 semantic blocks (INTERACTION, DOMAIN, CUSTOMER_INTENT, AGENT_ACTIONS, STATE, SENTIMENT)
Structured Data:
- Tabular information
- Schema + records
- Repeated field structure
- Solution: Header + row format (not semantic tokens)
Part 1: System Prompt Tokenization
Purpose
Compress system instructions while preserving:
- What to do (actions/operations)
- What to operate on (data sources)
- What to extract (specific fields)
- How to format output (structure)
The 6-Token Hierarchy
┌─────────────────────────────────────────┐
│ 1. REQ - What to do │ ← Actions
│ [REQ:ANALYZE,EXTRACT] │
└─────────────────────────────────────────┘
↓
┌─────────────────────────────────────────┐
│ 2. TARGET - What to operate on │ ← Data source
│ [TARGET:TRANSCRIPT:DOMAIN=SUPPORT] │
└─────────────────────────────────────────┘
↓
┌─────────────────────────────────────────┐
│ 3. EXTRACT - What fields to get │ ← Specific data
│ [EXTRACT:SENTIMENT,URGENCY] │
└─────────────────────────────────────────┘
↓
┌─────────────────────────────────────────┐
│ 4. CTX - Additional context │ ← Metadata
│ [CTX:LANGUAGE=EN] │
└─────────────────────────────────────────┘
↓
┌─────────────────────────────────────────┐
│ 5. OUT - How to format │ ← Output spec
│ [OUT_JSON:{summary,score}] │
└─────────────────────────────────────────┘
↓
┌─────────────────────────────────────────┐
│ 6. REF - Identifiers │ ← References
│ [REF:TICKET=TKT-123] │
└─────────────────────────────────────────┘
Token Categories
| Token | Purpose | Required | Examples |
|---|---|---|---|
| REQ | Actions/Operations | ✅ Always | [REQ:ANALYZE], [REQ:EXTRACT,SUMMARIZE] |
| TARGET | Objects/Data Sources | ✅ Always | [TARGET:TRANSCRIPT], [TARGET:DOCUMENT:TYPE=INVOICE] |
| EXTRACT | Fields to Extract | ⚠️ When extracting | [EXTRACT:SENTIMENT,COMPLIANCE] |
| CTX | Context/Conditions | ⚠️ When applicable | [CTX:TONE=PROFESSIONAL] |
| OUT | Output Format | ⚠️ When specified | [OUT:JSON], [OUT_JSON:{fields}] |
| REF | References/IDs | ⚠️ When present | [REF:CASE=12345] |
Syntax
Basic structure:
[CATEGORY:VALUE]
[CATEGORY:VALUE:ATTRIBUTE=VALUE]
[CATEGORY:VALUE1,VALUE2,VALUE3]
Examples:
Simple: [REQ:ANALYZE]
Attribute: [TARGET:TRANSCRIPT:DOMAIN=SUPPORT]
Multiple: [REQ:ANALYZE,EXTRACT,SUMMARIZE]
Complex: [EXTRACT:SENTIMENT,URGENCY:TYPE=LIST,DOMAIN=LEGAL]
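The bracket syntax above is regular enough to parse mechanically. A minimal sketch (not the actual CLM parser; `parse_token` is a hypothetical name):

```python
import re

# Hypothetical sketch of parsing the [CATEGORY:VALUE:ATTR=VAL] syntax shown above;
# the real CLM parser may differ.
TOKEN_RE = re.compile(r"\[([A-Z_]+):([^\]]+)\]")

def parse_token(token: str):
    """Split a CL token into category, values, and attributes."""
    m = TOKEN_RE.fullmatch(token)
    if not m:
        raise ValueError(f"not a CL token: {token}")
    category, rest = m.group(1), m.group(2)
    values, attrs = [], {}
    for part in rest.split(":"):
        if "=" in part:
            # a segment like TYPE=LIST,DOMAIN=LEGAL carries attributes
            for pair in part.split(","):
                key, _, val = pair.partition("=")
                attrs[key] = val
        else:
            values.extend(part.split(","))
    return category, values, attrs

print(parse_token("[TARGET:TRANSCRIPT:DOMAIN=SUPPORT]"))
# → ('TARGET', ['TRANSCRIPT'], {'DOMAIN': 'SUPPORT'})
```

The same sketch handles the multiple-value and complex forms, e.g. `[EXTRACT:SENTIMENT,URGENCY:TYPE=LIST,DOMAIN=LEGAL]`.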
REQ Token (Request/Action)
The REQ token represents the primary action or operation to be performed. It is the most critical token in the system prompt hierarchy as it determines the fundamental task type.
REQ Categorization Taxonomy
REQ tokens are organized into six fundamental categories based on their purpose and what they produce:
A. ANALYSIS / EVALUATION
Purpose: Interpret, assess, or explain something. Produces: Insight, judgment, reasoning.
Examples:
- Code review
- Risk assessment
- Performance evaluation
- Root cause analysis
Canonical REQs: ANALYZE, EVALUATE, ASSESS, DIAGNOSE
⚠️ Use sparingly — this is the most generic bucket.
B. GENERATION / CREATION
Purpose: Create a new artifact. Produces: Text, data, structure, plans, content.
Examples:
- Write a report
- Generate a schema
- Produce odds
- Draft an email
Canonical REQs: GENERATE_REPORT, GENERATE_SCHEMA, GENERATE_BETTING_ODDS, GENERATE_SUMMARY
ℹ️ This is where most business prompts land.
C. PREDICTION / FORECASTING
Purpose: Estimate future outcomes or probabilities. Produces: Probabilities, forecasts, predictions.
Examples:
- Match outcome probabilities
- Sales forecast
- Risk likelihood
Canonical REQs: PREDICT_OUTCOME, FORECAST_METRIC, ESTIMATE_PROBABILITY
⚠️ Often overlaps with GENERATION — choose the dominant intent.
For betting:
- Outcome = odds → GENERATION wins
- Time series forecast → PREDICTION wins
D. DECISION / RECOMMENDATION
Purpose: Choose or advise among options. Produces: A decision, ranking, or recommendation.
Examples:
- Best investment option
- Recommended action
- Prioritized list
Canonical REQs: RECOMMEND_ACTION, SELECT_OPTION, RANK_ALTERNATIVES
E. EXTRACTION / TRANSFORMATION
Purpose: Convert or extract from existing input. Produces: Structured data from unstructured input.
Examples:
- Extract entities
- Parse schema
- Normalize text
Canonical REQs: EXTRACT_FIELDS, TRANSFORM_SCHEMA, NORMALIZE_TEXT
ℹ️ This is often machine-facing, not user-facing.
F. VALIDATION / VERIFICATION
Purpose: Check correctness, compliance, or consistency. Produces: Pass/fail, issues, validation results.
Examples:
- Policy compliance
- Schema validation
- Constraint checking
Canonical REQs: VALIDATE_OUTPUT, VERIFY_COMPLIANCE, CHECK_CONSISTENCY
REQ Token Values & Category Mapping
The following REQ tokens are available in CLM, organized by their category:
Category A: ANALYSIS / EVALUATION
- ANALYZE - Examine and evaluate content
- CLASSIFY - Categorize items (now mapped via ANALYZE)
- DEBUG - Find and fix issues
Category B: GENERATION / CREATION
- GENERATE - Create new content (reports, summaries, structured data)
- SUMMARIZE - Condense information (now mapped via GENERATE)
Category C: PREDICTION / FORECASTING
- PREDICT - Make future projections based on uncertainty and real-world grounding
Category D: DECISION / RECOMMENDATION
- RECOMMEND - Provide recommendations (deprecated, use RANK)
- RANK - Order items by priority or preference
Category E: EXTRACTION / TRANSFORMATION
- EXTRACT - Pull out specific data or entities
- TRANSFORM - Convert format or restructure data
- FORMAT - Reformat without changing meaning
Category F: VALIDATION / VERIFICATION
- VALIDATE - Check correctness, compliance, or consistency
Utility / Other:
- SEARCH - Search or find information
- EXECUTE - Execute operations or commands
Examples:
[REQ:ANALYZE]
[REQ:EXTRACT]
[REQ:GENERATE:SPECS:REPORT]
[REQ:VALIDATE]
[REQ:PREDICT:SPECS:FORECAST]
How REQ Tokens are Automatically Detected
CLM uses IntentDetectorV2 to automatically determine the correct REQ token from natural language input. The detection system analyzes three key dimensions:
1. Signals (Vocabulary-Based Detection)
Signals are detected by matching phrases from the input text against a vocabulary dictionary:
| Signal | Trigger Words | Maps to REQ |
|---|---|---|
| ANALYSIS | analyze, assess, review, evaluate | ANALYZE |
| EXTRACTION | extract, pull, get, retrieve | EXTRACT |
| GENERATION | generate, create, produce, summarize, list | GENERATE |
| PREDICTION | predict, forecast, calculate, project | PREDICT |
| TRANSFORMATION | transform, convert, restructure | TRANSFORM |
| FORMATTING | format, reformat, style | FORMAT |
| VALIDATION | validate, verify, check, ensure | VALIDATE |
| RANKING | rank, order, prioritize, best | RANK |
| DEBUGGING | debug, troubleshoot, fix | DEBUG |
| SEARCH | search, find, lookup | SEARCH |
| EXECUTION | execute, run, perform | EXECUTE |
2. Artifacts (Pattern-Based Detection)
Artifacts are structural patterns detected in the text that indicate what type of output is expected:
| Artifact | Detection Pattern | Indicates |
|---|---|---|
| STRUCTURED | JSON objects: {...} | Structured data output |
| PROBABILITY | Keywords: probability, odds, chance, likelihood | Probabilistic output |
| LIST | Markdown lists: - item or * item | List output |
| VALIDATION | Keywords: validate, verify, check compliance, ensure | Validation task |
| DECISION | Keywords: recommend, best option, choose, decision | Decision/ranking task |
| TEXT | Keywords: report, analysis | Text-based output |
3. Epistemic Grounding (Context-Based Detection)
For probabilistic tasks, the system distinguishes between GENERATE and PREDICT based on epistemic grounding:
PREDICT is chosen when:
- Uncertainty indicators are present: "likely", "probably", "might", "could"
- AND either:
  - Future indicators: "will", "tomorrow", "next year", "forecast"
  - OR real-world indicators: "weather", "market", "election", "outcome"
GENERATE is chosen when:
- Probability artifacts exist but without epistemic grounding
- Example: "Generate probability distribution" (synthetic data)
Examples:
"What's the likelihood it will rain tomorrow?"
→ PREDICT (uncertainty + future + real-world)
"Generate a probability distribution for dice rolls"
→ GENERATE (probability but synthetic, no real-world grounding)
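The grounding rule can be sketched as a keyword check. The word lists below are abbreviated stand-ins for the vocabulary files, and the function names are hypothetical:

```python
# Hedged sketch of the epistemic-grounding rule described above; keyword sets
# are abbreviated stand-ins for the EPISTEMIC_KEYWORDS vocabulary, not the real contents.
UNCERTAINTY = {"likely", "probably", "might", "could", "probability", "chance", "likelihood"}
FUTURE = {"will", "tomorrow", "next", "forecast"}
REAL_WORLD = {"weather", "rain", "market", "election", "outcome"}

def has_epistemic_grounding(text: str) -> bool:
    words = set(text.lower().replace("?", "").split())
    uncertain = bool(words & UNCERTAINTY)
    grounded = bool(words & FUTURE) or bool(words & REAL_WORLD)
    # PREDICT requires uncertainty AND (future OR real-world) indicators
    return uncertain and grounded

def resolve_probabilistic_req(text: str) -> str:
    return "PREDICT" if has_epistemic_grounding(text) else "GENERATE"

print(resolve_probabilistic_req("What's the likelihood it will rain tomorrow?"))       # PREDICT
print(resolve_probabilistic_req("Generate a probability distribution for dice rolls"))  # GENERATE
```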
4. REQ Resolution Decision Tree
The system resolves the final REQ token using this priority order:
1. VALIDATE
└─ If: (Artifact.VALIDATION OR Signal.VALIDATION)
AND has_validation_target (STRUCTURED, TEXT, or DECISION artifacts)
2. EXTRACT
└─ If: Signal.EXTRACTION
AND NOT Artifact.PROBABILITY
3. TRANSFORM
└─ If: Signal.TRANSFORMATION
AND has_transform_target (STRUCTURED or TEXT artifacts)
4. FORMAT
└─ If: Signal.FORMATTING
5. PREDICT
└─ If: Artifact.PROBABILITY
AND epistemic_grounding (uncertainty + future/real-world)
6. GENERATE
└─ If: Artifact.PROBABILITY (without epistemic grounding)
OR Artifact.STRUCTURED
OR Artifact.TEXT
OR Artifact.LIST
7. RANK
└─ If: Signal.RANKING
OR Artifact.DECISION
8. DEBUG
└─ If: Signal.DEBUGGING
9. SEARCH
└─ If: Signal.SEARCH
10. EXECUTE
└─ If: Signal.EXECUTION
11. ANALYZE (default)
└─ If: None of the above
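The decision tree above maps directly to a priority-ordered chain of checks. A simplified sketch, assuming signals and artifacts arrive as sets of labels (function and argument names are illustrative, not the IntentDetectorV2 API):

```python
# Simplified sketch of the priority-ordered REQ resolution tree above.
# `signals` and `artifacts` are sets of detected labels; `grounded` is the
# result of the epistemic-grounding check.
def resolve_req(signals: set, artifacts: set, grounded: bool = False) -> str:
    validation_target = artifacts & {"STRUCTURED", "TEXT", "DECISION"}
    if ("VALIDATION" in artifacts or "VALIDATION" in signals) and validation_target:
        return "VALIDATE"
    if "EXTRACTION" in signals and "PROBABILITY" not in artifacts:
        return "EXTRACT"
    if "TRANSFORMATION" in signals and artifacts & {"STRUCTURED", "TEXT"}:
        return "TRANSFORM"
    if "FORMATTING" in signals:
        return "FORMAT"
    if "PROBABILITY" in artifacts:
        # epistemic grounding distinguishes PREDICT from GENERATE
        return "PREDICT" if grounded else "GENERATE"
    if artifacts & {"STRUCTURED", "TEXT", "LIST"}:
        return "GENERATE"
    if "RANKING" in signals or "DECISION" in artifacts:
        return "RANK"
    if "DEBUGGING" in signals:
        return "DEBUG"
    if "SEARCH" in signals:
        return "SEARCH"
    if "EXECUTION" in signals:
        return "EXECUTE"
    return "ANALYZE"  # default when no clear signal wins

print(resolve_req({"ANALYSIS", "VALIDATION"}, {"TEXT"}))  # VALIDATE
```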
5. SPEC Detection (Output Specialization)
In addition to REQ detection, IntentDetectorV2 detects a SPEC (specification), which refines what type of output is being generated, predicted, or extracted:
SPEC detection uses three methods (scored):
| Method | Score | Description |
|---|---|---|
| Explicit patterns | 3 points | Phrases like "generate a REPORT", "return a SUMMARY" |
| Artifact mapping | 2 points | Artifact.VALIDATION → VALIDATION_RESULT, Artifact.DECISION → RECOMMENDATION |
| Keyword matching | 1 point | Domain-specific keywords (see below) |
SPEC Ontology (domain artifacts):
- SUPPORT_RESPONSE - Customer support responses
- TROUBLESHOOTING_GUIDE - Step-by-step troubleshooting
- BETTING_ODDS - Betting or odds information
- PROBABILITY_DISTRIBUTION - Statistical distributions (excluded as non-domain)
- FORECAST - Future projections
- REPORT - Analysis reports
- SUMMARY - Condensed summaries
- RECOMMENDATION - Recommendations or decisions
- RANKING - Ordered lists
- JSON_OBJECT / JSON_SCHEMA - (excluded as format, not domain)
- FIELDS - Field extraction
- ENTITIES - Entity extraction
- VALIDATION_RESULT - Validation outcomes
SPEC keyword mappings:
BETTING_ODDS: ["odds", "betting", "bookmaker"]
FORECAST: ["forecast", "projection"]
SUMMARY: ["summary", "recap", "overview"]
REPORT: ["report", "analysis document"]
SUPPORT_RESPONSE: ["support", "ticket", "issue", "incident"]
TROUBLESHOOTING_GUIDE: ["troubleshoot", "troubleshooting", "steps"]
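The three scoring methods can be combined in a small scorer. A hedged sketch using the keyword mappings above; the 3/2/1 weights follow the scoring table, but the explicit-pattern regex and all implementation details are assumptions:

```python
import re

# Illustrative SPEC scorer for the three detection methods (3/2/1 points);
# keyword lists follow the mappings above, everything else is hypothetical.
SPEC_KEYWORDS = {
    "BETTING_ODDS": ["odds", "betting", "bookmaker"],
    "FORECAST": ["forecast", "projection"],
    "SUMMARY": ["summary", "recap", "overview"],
    "REPORT": ["report", "analysis document"],
    "SUPPORT_RESPONSE": ["support", "ticket", "incident"],
    "TROUBLESHOOTING_GUIDE": ["troubleshoot", "troubleshooting"],
}
ARTIFACT_SPECS = {"VALIDATION": "VALIDATION_RESULT", "DECISION": "RECOMMENDATION"}

def detect_spec(text: str, artifacts: set):
    text = text.lower()
    scores: dict = {}
    for spec, keywords in SPEC_KEYWORDS.items():
        # explicit pattern, e.g. "generate a [detailed] report" → 3 points
        if any(re.search(rf"(generate|return|create)\s+a\s+\w*\s*{kw}", text)
               for kw in keywords):
            scores[spec] = scores.get(spec, 0) + 3
        # plain keyword match → 1 point
        if any(kw in text for kw in keywords):
            scores[spec] = scores.get(spec, 0) + 1
    for artifact, spec in ARTIFACT_SPECS.items():
        if artifact in artifacts:  # artifact mapping → 2 points
            scores[spec] = scores.get(spec, 0) + 2
    return max(scores, key=scores.get) if scores else None
```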
How SPEC appears in tokens:
[REQ:GENERATE:SPECS:REPORT]
[REQ:PREDICT:SPECS:FORECAST]
[REQ:VALIDATE:SPECS:VALIDATION_RESULT]
Complete Detection Example
Input:
"Analyze this customer support transcript and generate a detailed report
with sentiment analysis. Check if the agent followed compliance guidelines."
Detection process:
- Signals detected:
  - "analyze" → Signal.ANALYSIS
  - "generate" → Signal.GENERATION
- Artifacts detected:
  - "report" → Artifact.TEXT
  - "check" / "compliance" → Artifact.VALIDATION
- REQ resolution:
  - Has validation signal + has validation target (TEXT)
  - Result: REQ.VALIDATE (validation takes priority)
- SPEC detection:
  - "generate a...report" → explicit pattern (3 points) → "REPORT"
  - "compliance" keywords → "VALIDATION_RESULT"
  - Highest scorer: REPORT
Final output:
[REQ:VALIDATE:SPECS:REPORT]
More Real-World Examples
Example 1: Weather Prediction
Input: "What's the probability it will rain tomorrow in Seattle?"
Signals: PREDICTION (predict)
Artifacts: PROBABILITY (probability)
Epistemic: Yes (probability + will + real-world:weather)
REQ: PREDICT
SPEC: FORECAST (forecast implied)
Output: [REQ:PREDICT:SPECS:FORECAST]
Example 2: Data Extraction
Input: "Extract all email addresses and phone numbers from this document"
Signals: EXTRACTION (extract)
Artifacts: None (no JSON/list patterns in prompt)
REQ: EXTRACT
SPEC: ENTITIES (email addresses, phone numbers are entities)
Output: [REQ:EXTRACT:SPECS:ENTITIES]
Example 3: Compliance Validation
Input: "Verify that the agent followed all required disclosure steps and
validate compliance with company policies"
Signals: VALIDATION (verify, validate)
Artifacts: VALIDATION (verify, validate, compliance keywords)
Has validation target: Yes (implied TEXT)
REQ: VALIDATE
SPEC: VALIDATION_RESULT (validation context)
Output: [REQ:VALIDATE:SPECS:VALIDATION_RESULT]
Example 4: Report Generation
Input: "Create a summary report of customer feedback trends from Q4"
Signals: GENERATION (create, summary)
Artifacts: TEXT (report)
REQ: GENERATE (has TEXT artifact)
SPEC: REPORT (report explicitly mentioned)
Output: [REQ:GENERATE:SPECS:REPORT]
Example 5: Probability Distribution (Synthetic)
Input: "Generate a probability distribution for rolling two dice"
Signals: GENERATION (generate)
Artifacts: PROBABILITY (probability)
Epistemic: No (synthetic scenario, not real-world prediction)
REQ: GENERATE (probability without epistemic grounding)
SPEC: None (PROBABILITY_DISTRIBUTION excluded as non-domain)
Output: [REQ:GENERATE]
Example 6: Data Transformation
Input: "Convert this CSV data to JSON format {csv_data}"
Signals: TRANSFORMATION (convert)
Artifacts: STRUCTURED (JSON mentioned, {csv_data} pattern)
Has transform target: Yes (STRUCTURED)
REQ: TRANSFORM
SPEC: None
Output: [REQ:TRANSFORM]
Example 7: Recommendation/Ranking
Input: "Rank these candidates by best fit for the senior engineer position"
Signals: RANKING (rank)
Artifacts: DECISION (best)
REQ: RANK
SPEC: RANKING
Output: [REQ:RANK:SPECS:RANKING]
Example 8: Troubleshooting Guide
Input: "Generate troubleshooting steps for network connectivity issues"
Signals: GENERATION (generate)
Artifacts: TEXT (guide implied), LIST (steps)
REQ: GENERATE
SPEC: TROUBLESHOOTING_GUIDE (troubleshooting keyword)
Output: [REQ:GENERATE:SPECS:TROUBLESHOOTING_GUIDE]
Key Takeaways
- REQ detection is hierarchical - certain REQs take priority (VALIDATE > EXTRACT > TRANSFORM > PREDICT > GENERATE)
- Signals + Artifacts + Context all contribute to the final decision
- SPEC adds domain specificity to the output type (REPORT, FORECAST, VALIDATION_RESULT, etc.)
- Epistemic grounding distinguishes PREDICT from GENERATE for probabilistic tasks
- Default is ANALYZE when no clear signals are detected
Complete Vocabulary Reference
The complete trigger phrase vocabularies are defined in language-specific vocabulary files:
Location: clm_core/dictionary/{lang}/vocabulary.py
Available languages:
- en - English (ENVocabulary)
- es - Spanish (ESVocabulary)
- pt - Portuguese (PTVocabulary)
- fr - French (FRVocabulary)
Key vocabulary properties:
- REQ_TOKENS - Maps REQ types to trigger phrases:
  "ANALYZE": ["analyze", "review", "examine", "evaluate", "assess", ...]
  "EXTRACT": ["extract", "pull out", "identify", "find", "retrieve", ...]
  "GENERATE": ["generate", "create", "write", "draft", "compose", ...]
  "VALIDATE": ["validate", "verify", "check", "confirm", "ensure", ...]
  "TRANSFORM": ["convert", "transform", "change", "rewrite", ...]
  "FORMAT": ["format", "structure", "organize", "layout", ...]
  "DEBUG": ["debug", "troubleshoot", "diagnose", "fix bug", ...]
  "SEARCH": ["search", "query", "lookup", "find", "look for", ...]
  "RANK": ["prioritize", "order", "sort by", "rate", "rank", ...]
  "PREDICT": ["predict", "forecast", "project", "estimate future", ...]
  "CALCULATE": ["calculate", "compute", "figure out", "quantify", ...]
  "EXECUTE": ["use", "apply", "implement", "run", "perform", ...]
- EPISTEMIC_KEYWORDS - Keywords for epistemic grounding:
  "future": ["next", "upcoming", "future", "will", "expected", "forecast", ...]
  "uncertainty": ["chance", "likelihood", "probability", "odds", "risk", ...]
  "real_world": ["match", "season", "weather", "election", "market", ...]
- Other useful vocabularies:
  - ACTION_VERBS - General action verbs
  - COMPOUND_PHRASES - Multi-word phrases (e.g., "customer support" → "TICKET")
  - TYPE_MAP - Document type mappings
  - CONTEXT_MAP - Domain context mappings
Example usage:
from clm_core.dictionary.en.vocabulary import ENVocabulary
vocab = ENVocabulary()
# Get all trigger phrases for EXTRACT
extract_phrases = vocab.REQ_TOKENS["EXTRACT"]
# ["extract", "pull out", "identify", "find", ...]
# Get epistemic keywords
future_keywords = vocab.EPISTEMIC_KEYWORDS["future"]
# ["next", "upcoming", "future", "will", ...]
TARGET Token (Object/Source)
Common values:
- TRANSCRIPT - Conversation record
- DOCUMENT - General document
- TICKET - Support ticket
- CODE - Source code
- DATA - Dataset
- EMAIL - Email message
- INVOICE - Invoice document
- REPORT - Analysis report
Attributes:
- DOMAIN - Subject area: DOMAIN=SUPPORT, DOMAIN=FINANCE
- TYPE - Specific subtype: TYPE=INVOICE, TYPE=CONTRACT
- TOPIC - Subject matter: TOPIC=BILLING, TOPIC=TECHNICAL
Examples:
[TARGET:TRANSCRIPT]
[TARGET:TRANSCRIPT:DOMAIN=SUPPORT]
[TARGET:DOCUMENT:TYPE=INVOICE:DOMAIN=FINANCE]
EXTRACT Token (Fields to Extract)
Common values:
Customer Service:
- SENTIMENT, URGENCY, ISSUE, RESOLUTION, COMPLIANCE, DISCLOSURES
Entities:
- NAMES, DATES, AMOUNTS, EMAILS, PHONES, ADDRESSES
Technical:
- BUGS, ERRORS, PERFORMANCE, SECURITY
Business:
- METRICS, DECISIONS, ACTIONS, NEXT_STEPS, OWNERS
Attributes:
- TYPE - Data structure: TYPE=LIST, TYPE=TABLE
- DOMAIN - Context: DOMAIN=LEGAL, DOMAIN=FINANCE
- SOURCE - Origin: SOURCE=AGENT, SOURCE=CUSTOMER
Examples:
[EXTRACT:SENTIMENT,URGENCY,RESOLUTION]
[EXTRACT:COMPLIANCE:SOURCE=AGENT]
[EXTRACT:NAMES,EMAILS,PHONES:TYPE=LIST]
CTX Token (Context)
Common patterns:
[CTX:CUSTOMER_SERVICE]
[CTX:LANGUAGE=EN]
[CTX:TONE=PROFESSIONAL]
[CTX:ESCALATE_IF=BASIC_FAILED:TARGET=TIER2]
OUT Token (Output Format)
Simple format:
[OUT:JSON]
[OUT:MARKDOWN]
[OUT:TABLE]
[OUT:LIST]
Structured JSON:
Basic:
[OUT_JSON:{field1,field2,field3}]
With types (infer_types=True):
[OUT_JSON:{summary:STR,score:FLOAT,items:[STR]}]
Nested:
[OUT_JSON:{summary:STR,scores:{accuracy:FLOAT,compliance:FLOAT}}]
With enums (add_attrs=True):
[OUT_JSON:{score:FLOAT}:ENUMS={"ranges":[{"min":0.0,"max":0.49,"label":"FAIL"}]}]
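Building an OUT_JSON token from a schema is mechanical. A hypothetical helper (not the actual encoder API) that mirrors the typed and untyped formats above:

```python
# Hypothetical helper that emits the [OUT_JSON:{...}] token from a schema dict,
# mirroring the typed (infer_types=True) and name-only formats shown above.
def out_json_token(schema: dict, infer_types: bool = True) -> str:
    def render(node) -> str:
        if isinstance(node, dict):
            parts = []
            for key, value in node.items():
                if isinstance(value, (dict, list)):
                    parts.append(f"{key}:{render(value)}")  # keep nested structure
                elif infer_types:
                    parts.append(f"{key}:{value}")          # leaf with type name
                else:
                    parts.append(key)                       # leaf, name only
            return "{" + ",".join(parts) + "}"
        if isinstance(node, list):
            return "[" + render(node[0]) + "]"
        return node  # a type name like "STR" or "FLOAT"
    return f"[OUT_JSON:{render(schema)}]"

schema = {"summary": "STR", "scores": {"accuracy": "FLOAT", "compliance": "FLOAT"}}
print(out_json_token(schema))
# [OUT_JSON:{summary:STR,scores:{accuracy:FLOAT,compliance:FLOAT}}]
```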
REF Token (References)
Examples:
[REF:CASE=12345]
[REF:TICKET=TKT-789]
[REF:KB=KB-001]
[REF:POLICY=POL-2024-05]
Intent Detection in the Encoding Pipeline
The IntentDetectorV2 is the first step in the system prompt encoding pipeline:
┌─────────────────────────────────────────────────────────────┐
│ 1. INTENT DETECTION (IntentDetectorV2) │
│ Input: Natural language prompt │
│ Output: Intent (REQ + SPEC) │
│ │
│ Process: │
│ • Detect signals from vocabulary │
│ • Detect artifacts from patterns │
│ • Check epistemic grounding │
│ • Resolve REQ token (priority-based) │
│ • Extract SPEC (scoring-based) │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ 2. TARGET DETECTION │
│ Extract what the operation targets (TRANSCRIPT, etc.) │
│ Output: [TARGET:TRANSCRIPT:DOMAIN=SUPPORT] │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ 3. EXTRACTION FIELD DETECTION │
│ Identify fields to extract (SENTIMENT, URGENCY, etc.) │
│ Output: [EXTRACT:SENTIMENT,URGENCY,RESOLUTION] │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ 4. CONTEXT DETECTION │
│ Extract context and conditions │
│ Output: [CTX:LANGUAGE=EN] │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ 5. OUTPUT FORMAT DETECTION │
│ Parse output schema and format requirements │
│ Output: [OUT_JSON:{field:TYPE,...}:ENUMS={...}] │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ 6. REFERENCE DETECTION │
│ Extract IDs and references │
│ Output: [REF:TICKET=TKT-123] │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ FINAL COMPRESSED OUTPUT: │
│ [REQ:VALIDATE:SPECS:REPORT] │
│ [TARGET:TRANSCRIPT:DOMAIN=SUPPORT] │
│ [EXTRACT:SENTIMENT,URGENCY,RESOLUTION] │
│ [CTX:LANGUAGE=EN] │
│ [OUT_JSON:{summary:STR,scores:{...}}:ENUMS={...}] │
│ [REF:TICKET=TKT-123] │
└─────────────────────────────────────────────────────────────┘
Key points:
- Intent detection happens first and determines the REQ token
- REQ token guides the rest of the encoding process
- SPEC provides additional domain context for the output
- All components work together to form the complete compressed prompt
Complete System Prompt Example
Original:
You are a Call QA & Compliance Scoring System for customer service operations.
TASK:
Analyze the transcript and score the agent's compliance across required QA categories.
ANALYSIS CRITERIA:
- Mandatory disclosures and verification steps
- Policy adherence
- Soft-skill behaviors (empathy, clarity, ownership)
OUTPUT FORMAT:
{
"summary": "short_summary",
"qa_scores": {
"verification": 0.0,
"policy_adherence": 0.0,
"soft_skills": 0.0,
"compliance": 0.0
},
"violations": ["list_any_detected"]
}
Compressed (Level 1: No types, no attrs):
[REQ:ANALYZE] [TARGET:TRANSCRIPT:DOMAIN=QA]
[EXTRACT:COMPLIANCE,DISCLOSURES,VERIFICATION,POLICY,SOFT_SKILLS:TYPE=LIST,DOMAIN=LEGAL]
[OUT_JSON:{summary,qa_scores:{verification,policy_adherence,soft_skills,compliance},violations}]
Compression: 70.7%
Compressed (Level 4: Types + attrs):
[REQ:ANALYZE] [TARGET:TRANSCRIPT:DOMAIN=QA]
[EXTRACT:COMPLIANCE,DISCLOSURES,VERIFICATION,POLICY,SOFT_SKILLS:TYPE=LIST,DOMAIN=LEGAL]
[OUT_JSON:{summary:STR,qa_scores:{verification:FLOAT,policy_adherence:FLOAT,soft_skills:FLOAT,compliance:FLOAT},violations:[STR]}:ENUMS={"ranges":[{"min":0.0,"max":0.49,"label":"FAIL"},{"min":0.5,"max":0.74,"label":"NEEDS_IMPROVEMENT"},{"min":0.75,"max":0.89,"label":"GOOD"},{"min":0.9,"max":1.0,"label":"EXCELLENT"}]}]
Compression: 26.6%
Part 1b: Configuration Prompt Tokenization
Purpose
Configuration prompts are template-based system instructions that define an agent's persistent behavior. Unlike task prompts that focus on a specific action, configuration prompts establish identity, rules, and behavioral patterns.
Key differences from task prompts:
- Define agent role and persona (not actions)
- Contain behavioral rules (basic and custom)
- Support runtime placeholders for dynamic values
- Include priority definitions for rule conflicts
Configuration Prompt Token Types
+-----------------------------------------+
| 1. PROMPT_MODE - Prompt type | <- Identifier
| [PROMPT_MODE:CONFIGURATION] |
+-----------------------------------------+
|
v
+-----------------------------------------+
| 2. ROLE - Agent identity | <- Who
| [ROLE:CUSTOMER_SUPPORT_AGENT] |
+-----------------------------------------+
|
v
+-----------------------------------------+
| 3. RULES - Active rule sets | <- Behavior
| [RULES:BASIC,CUSTOM] |
+-----------------------------------------+
|
v
+-----------------------------------------+
| 4. PRIORITY - Conflict resolution | <- Precedence
| [PRIORITY:CUSTOM_OVER_BASIC] |
+-----------------------------------------+
|
v
+-----------------------------------------+
| 5. OUT - Output format (optional) | <- Output spec
| [OUT_JSON:{field:TYPE}] |
+-----------------------------------------+
Token Definitions
| Token | Purpose | Required | Examples |
|---|---|---|---|
| PROMPT_MODE | Identifies prompt type | Yes | [PROMPT_MODE:CONFIGURATION] |
| ROLE | Agent identity/persona | When detected | [ROLE:ASSISTANT], [ROLE:CUSTOMER_SUPPORT_AGENT] |
| RULES | Active rule sets | When detected | [RULES:BASIC], [RULES:BASIC,CUSTOM] |
| PRIORITY | Rule conflict resolution | When detected | [PRIORITY:CUSTOM_OVER_BASIC] |
| OUT | Output format | When specified | [OUT_JSON:{response:STR}] |
PROMPT_MODE Token
Purpose: Identifies this as a configuration prompt (vs task prompt)
Values:
- CONFIGURATION - Template-based agent configuration
- TASK - Action-oriented task prompt (default)
Example:
[PROMPT_MODE:CONFIGURATION]
ROLE Token
Purpose: Captures the agent's identity and persona
Detection patterns:
- <role>You are a...</role> tags
- "You are a..." or "Your role is..." phrases
Examples:
[ROLE:HELPFUL_ASSISTANT]
[ROLE:CUSTOMER_SUPPORT_AGENT]
[ROLE:CONTENT_MODERATOR]
[ROLE:PROFESSIONAL_TRANSLATOR]
Normalization:
- Spaces replaced with underscores
- Converted to uppercase
- Articles (a, an, the) removed
RULES Token
Purpose: Indicates which rule sets are active
Values:
- BASIC - Standard/default rules detected
- CUSTOM - User-specific rules detected
Detection patterns:
- <basic_rules> tags or "basic rules" phrase
- <custom_rules> tags or "custom instructions" phrase
Examples:
[RULES:BASIC]
[RULES:CUSTOM]
[RULES:BASIC,CUSTOM]
PRIORITY Token
Purpose: Defines how rule conflicts should be resolved
Values:
- CUSTOM_OVER_BASIC - Custom rules take precedence
Detection patterns:
- "custom instructions are paramount"
- "prioritize custom instructions"
- "custom instructions override"
Example:
[PRIORITY:CUSTOM_OVER_BASIC]
Configuration Prompt Example
Original:
<role>You are a helpful customer support agent</role>
<basic_rules>
- Be polite and professional
- Verify customer identity
- Document all interactions
</basic_rules>
<custom_rules>
- Address customer as: {{customer_name}}
- Account tier: {{account_tier}}
</custom_rules>
Follow the basic rules as your foundation. If there are conflicts
between basic rules and custom instructions, prioritize custom
instructions. Custom instructions are paramount.
OUTPUT:
{
"response": "message",
"escalate": true/false
}
Compressed CL Token:
[PROMPT_MODE:CONFIGURATION][ROLE:CUSTOMER_SUPPORT_AGENT][RULES:BASIC,CUSTOM][PRIORITY:CUSTOM_OVER_BASIC][OUT_JSON:{response:STR,escalate:BOOL}]
Metadata extracted:
- role: "CUSTOMER_SUPPORT_AGENT"
- rules: {"basic": true, "custom": true}
- priority: "CUSTOM_OVER_BASIC"
- placeholders: ["customer_name", "account_tier"]
- output_format: "{response:STR,escalate:BOOL}"
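Placeholder extraction can be sketched with a regex over the {{...}} syntax. `extract_placeholders` is a hypothetical name; the real extractor may handle more delimiter styles:

```python
import re

# Sketch of extracting {{...}} runtime placeholders from a configuration prompt.
def extract_placeholders(prompt: str) -> list:
    seen = []
    for name in re.findall(r"\{\{\s*(\w+)\s*\}\}", prompt):
        if name not in seen:  # preserve first-seen order, drop duplicates
            seen.append(name)
    return seen

prompt = "Address customer as: {{customer_name}}\nAccount tier: {{account_tier}}"
print(extract_placeholders(prompt))  # ['customer_name', 'account_tier']
```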
Two-Phase Compression
Configuration prompts use a two-phase compression approach:
Phase 1: CL Token Generation
- Extract semantic elements (role, rules, priority)
- Generate compressed CL tokens
- Detect and encode output format
Phase 2: NL Minimization
- Remove redundant meta-instructions
- Suppress priority explanations (encoded in CL)
- Trim verbose rule descriptions
- Remove content already encoded in CL tokens
Result: CL tokens + minimized NL prompt
See Configuration Prompt Encoding for complete documentation.
Part 2: Transcript Tokenization (v2 Schema)
Purpose
Compress customer service conversations into an explicit semantic contract while preserving:
- Interaction metadata (channel, duration, language)
- Domain and service context
- Customer intent (derived from customer utterances)
- Context provided (PII-safe)
- Agent and system actions (separated)
- Resolution outcome and authoritative state
- Commitments and artifacts
- Emotional trajectory
The 14 Semantic Blocks
┌──────────────────────────────────────────────────┐
│ 1. INTERACTION - Interaction metadata │ ← Setup
│ [INTERACTION:SUPPORT:CHANNEL=VOICE] │
└──────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────┐
│ 2. DURATION - Call duration │ ← Time
│ [DURATION=6m] │
└──────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────┐
│ 3. LANG - Language │ ← Language
│ [LANG=EN] │
└──────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────┐
│ 4. DOMAIN - Service area classification │ ← Domain
│ [DOMAIN:BILLING] │
└──────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────┐
│ 5. SERVICE - Service within domain │ ← Service
│ [SERVICE:SUBSCRIPTION] │
└──────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────┐
│ 6. CUSTOMER_INTENT - Customer's goal │ ← Intent
│ [CUSTOMER_INTENT:REPORT_DUPLICATE_CHARGE] │
└──────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────┐
│ 7. CONTEXT - Facts provided (PII-safe) │ ← Context
│ [CONTEXT:EMAIL_PROVIDED] │
└──────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────┐
│ 8. AGENT_ACTIONS - Agent operations chain │ ← Actions
│ [AGENT_ACTIONS:VERIFIED→DIAGNOSED→REFUNDED] │
└──────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────┐
│ 9. SYSTEM_ACTIONS - Automated events │ ← System
│ [SYSTEM_ACTIONS:PAYMENT_RETRY_DETECTED] │
└──────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────┐
│ 10. RESOLUTION - Outcome type │ ← Outcome
│ [RESOLUTION:ISSUE_RESOLVED] │
└──────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────┐
│ 11. STATE - Authoritative status │ ← Status
│ [STATE:RESOLVED] │
└──────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────┐
│ 12. COMMITMENT - SLA / promised actions │ ← Promises
│ [COMMITMENT:REFUND_3-5_DAYS] │
└──────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────┐
│ 13. ARTIFACT - Structured identifiers │ ← IDs
│ [ARTIFACT:REFUND_REF=RFD-908712] │
└──────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────┐
│ 14. SENTIMENT - Emotional trajectory │ ← Feeling
│ [SENTIMENT:NEUTRAL→GRATEFUL] │
└──────────────────────────────────────────────────┘
Token Definitions
| Token | Format | Purpose | Example |
|---|---|---|---|
| INTERACTION | [INTERACTION:TYPE:CHANNEL=ch] | Interaction metadata | [INTERACTION:SUPPORT:CHANNEL=VOICE] |
| DURATION | [DURATION=Xm] | Call duration | [DURATION=6m] |
| LANG | [LANG=XX] | Language | [LANG=EN] |
| DOMAIN | [DOMAIN:TYPE] | Domain classification | [DOMAIN:BILLING] |
| SERVICE | [SERVICE:TYPE] | Service area | [SERVICE:SUBSCRIPTION] |
| CUSTOMER_INTENT | [CUSTOMER_INTENT:INTENT] | Customer's goal | [CUSTOMER_INTENT:REQUEST_REFUND] |
| CONTEXT | [CONTEXT:TYPE] | PII-safe context | [CONTEXT:EMAIL_PROVIDED] |
| AGENT_ACTIONS | [AGENT_ACTIONS:A1→A2→A3] | Agent action chain | [AGENT_ACTIONS:ACCOUNT_VERIFIED→REFUND_INITIATED] |
| SYSTEM_ACTIONS | [SYSTEM_ACTIONS:E1→E2] | System events | [SYSTEM_ACTIONS:PAYMENT_RETRY_DETECTED] |
| RESOLUTION | [RESOLUTION:TYPE] | Outcome type | [RESOLUTION:ISSUE_RESOLVED] |
| STATE | [STATE:STATUS] | Authoritative status | [STATE:RESOLVED] |
| COMMITMENT | [COMMITMENT:PROMISE] | SLA/promised actions | [COMMITMENT:REFUND_3-5_DAYS] |
| ARTIFACT | [ARTIFACT:TYPE=VALUE] | Structured identifiers | [ARTIFACT:REFUND_REF=RFD-908712] |
| SENTIMENT | [SENTIMENT:START→END] | Emotional trajectory | [SENTIMENT:NEUTRAL→GRATEFUL] |
INTERACTION Token
Format: [INTERACTION:TYPE:CHANNEL=channel]
Type values:
- SUPPORT - Customer support
- SALES - Sales interaction
- BILLING - Billing-specific
Channel values:
- VOICE - Phone call
- CHAT - Live chat
- EMAIL - Email
- SLACK - Slack thread
Examples:
[INTERACTION:SUPPORT:CHANNEL=VOICE]
[INTERACTION:SALES:CHANNEL=CHAT]
[INTERACTION:BILLING:CHANNEL=EMAIL]
DURATION Token
Format: [DURATION=Xm]
Purpose: Approximate call duration in minutes (derived from turn count: ~2 turns/minute)
Examples:
[DURATION=6m]
[DURATION=15m]
[DURATION=1m]
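Since the duration is derived from turn count at roughly 2 turns per minute, the token can be produced mechanically. The helper name and the rounding rule below are illustrative assumptions, not part of the CLM API:

```python
def duration_token(turn_count: int, turns_per_minute: int = 2) -> str:
    """Approximate call duration from turn count (~2 turns/minute),
    clamped to a minimum of 1 minute."""
    minutes = max(1, round(turn_count / turns_per_minute))
    return f"[DURATION={minutes}m]"
```

For example, a 12-turn call yields `[DURATION=6m]`.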
LANG Token
Format: [LANG=XX]
Purpose: Language metadata. Schema is language-invariant — extraction normalizes cross-language expressions to the same enums.
Values: EN, ES, PT, FR
DOMAIN Token
Format: [DOMAIN:TYPE]
Purpose: Explicit service area classification
Domain types (20–30 max):
- BILLING - Payment and billing issues
- AUTHENTICATION - Login, access, credentials
- BOOKINGS - Reservations, scheduling
- API - API-related issues
- PERFORMANCE - Speed, reliability
- TECHNICAL - Technical support
- SHIPPING - Delivery and shipping
Examples:
[DOMAIN:BILLING]
[DOMAIN:AUTHENTICATION]
[DOMAIN:TECHNICAL]
SERVICE Token
Format: [SERVICE:TYPE]
Purpose: Service area within the domain
Service types:
- SUBSCRIPTION - Subscription management
- HOST_STAY - Hospitality/hosting
- PAYMENT - Payment processing
- DASHBOARD - Dashboard/UI
- EXPORTS - Data exports
Examples:
[SERVICE:SUBSCRIPTION]
[SERVICE:PAYMENT]
CUSTOMER_INTENT Token
Format: [CUSTOMER_INTENT:INTENT]
Purpose: Primary customer intent derived strictly from customer utterances. Must not be inferred solely from agent actions.
Intent types (20–40 max):
- REQUEST_REFUND
- REPORT_DUPLICATE_CHARGE
- ACCOUNT_UNLOCK
- FEATURE_INQUIRY
- CANCEL_BOOKING
- REPORT_OUTAGE
- REQUEST_UPGRADE
Rules:
- One primary intent required
- Optional secondary intent allowed
- Must not be inferred solely from agent actions
Examples:
[CUSTOMER_INTENT:REPORT_DUPLICATE_CHARGE]
[CUSTOMER_INTENT:REQUEST_REFUND]
[CUSTOMER_INTENT:ACCOUNT_UNLOCK]
CONTEXT Token
Format: [CONTEXT:TYPE]
Purpose: Records the fact that the customer provided a piece of information, without leaking the PII itself
Context types:
- EMAIL_PROVIDED
- PHONE_PROVIDED
- BOOKING_ID_PROVIDED
- PAYMENT_METHOD_PROVIDED
- ACCOUNT_ID_PROVIDED
- ORDER_ID_PROVIDED
- TRACKING_ID_PROVIDED
Redacted variant:
[CONTEXT:PAYMENT_METHOD_REDACTED]
Examples:
[CONTEXT:EMAIL_PROVIDED]
[CONTEXT:BOOKING_ID_PROVIDED]
AGENT_ACTIONS Token
Format: [AGENT_ACTIONS:ACTION1→ACTION2→ACTION3]
Purpose: Operational actions performed by the human agent, joined into an ordered chain with arrows (→)
Action types (30–50 max):
- ACCOUNT_VERIFIED
- DIAGNOSTIC_PERFORMED
- REFUND_INITIATED
- BOOKING_CANCELLED
- ACCOUNT_UNLOCKED
- API_KEY_ROTATED
- ESCALATED_TIER2
Avoid generic verbs: TROUBLESHOOT, CHECKED, ACTION_TAKEN
Examples:
[AGENT_ACTIONS:ACCOUNT_VERIFIED→DIAGNOSTIC_PERFORMED→REFUND_INITIATED]
[AGENT_ACTIONS:ACCOUNT_UNLOCKED]
[AGENT_ACTIONS:API_KEY_ROTATED→ESCALATED_TIER2]
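Building the arrow-chained token and screening out the generic verbs listed above can be done in a few lines. This helper is a sketch (the function name and error handling are assumptions, not the CLM API):

```python
# Generic verbs the guidance above says to avoid
GENERIC_VERBS = {"TROUBLESHOOT", "CHECKED", "ACTION_TAKEN"}

def agent_actions_token(actions: list[str]) -> str:
    """Join ordered agent actions into one arrow-chained token,
    rejecting the generic verbs the spec advises against."""
    bad = GENERIC_VERBS.intersection(actions)
    if bad:
        raise ValueError(f"generic verbs not allowed: {sorted(bad)}")
    return "[AGENT_ACTIONS:" + "\u2192".join(actions) + "]"
```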
SYSTEM_ACTIONS Token
Format: [SYSTEM_ACTIONS:EVENT1→EVENT2]
Purpose: Automated system-level events (optional, only emitted when detected)
System action types:
- PAYMENT_RETRY_DETECTED
- AUTO_ESCALATION_TRIGGERED
- SLA_BREACH_DETECTED
- FRAUD_ALERT_TRIGGERED
- ACCOUNT_AUTO_LOCKED
- NOTIFICATION_SENT
Examples:
[SYSTEM_ACTIONS:PAYMENT_RETRY_DETECTED]
[SYSTEM_ACTIONS:AUTO_ESCALATION_TRIGGERED→SLA_BREACH_DETECTED]
RESOLUTION Token
Format: [RESOLUTION:TYPE]
Purpose: Describes outcome type (not state)
Resolution types:
- ISSUE_RESOLVED - Issue fixed
- ACCOUNT_UNLOCKED - Access restored
- ANSWER_PROVIDED - Information given
- ESCALATED - Sent to higher tier
- PENDING - Awaiting resolution
- CANCELLED - Cancelled
Examples:
[RESOLUTION:ISSUE_RESOLVED]
[RESOLUTION:ESCALATED]
[RESOLUTION:ANSWER_PROVIDED]
STATE Token
Format: [STATE:STATUS]
Purpose: Authoritative interaction status. Mutually exclusive — only one STATE per transcript.
State values (5–10 max):
- RESOLVED - Fully resolved
- PENDING_SETTLEMENT - Awaiting settlement
- PENDING_CUSTOMER - Awaiting customer action
- ESCALATED - Escalated
- UNRESOLVED - Not resolved
Examples:
[STATE:RESOLVED]
[STATE:PENDING_SETTLEMENT]
[STATE:ESCALATED]
COMMITMENT Token
Format: [COMMITMENT:PROMISE]
Purpose: Encodes SLA or promised actions by the agent
Examples:
[COMMITMENT:REFUND_3-5_DAYS]
[COMMITMENT:FOLLOWUP_BY_FRIDAY]
[COMMITMENT:CALLBACK_24h]
[COMMITMENT:TECHNICIAN_VISIT_MONDAY]
ARTIFACT Token
Format: [ARTIFACT:TYPE=VALUE]
Purpose: Structured identifiers extracted from conversation
Artifact types:
- REFUND_REF - Refund reference number
- REFUND_AMT - Refund amount
- BOOKING_ID - Booking identifier
- ORDER_ID - Order identifier
- TRACKING_ID - Tracking number
- TICKET_ID - Support ticket ID
- CASE_ID - Case number
- CLAIM_ID - Claim number
- PRODUCT_ID - Product model
Examples:
[ARTIFACT:REFUND_REF=RFD-908712]
[ARTIFACT:REFUND_AMT=$14.99]
[ARTIFACT:ORDER_ID=ORD-456789]
[ARTIFACT:TRACKING_ID=TRK-1234]
SENTIMENT Token
Format: [SENTIMENT:START→END]
Purpose: Tracks emotional trajectory through conversation
Sentiment values:
- FRUSTRATED, ANGRY, CONCERNED
- NEUTRAL, CALM
- SATISFIED, GRATEFUL, POSITIVE
Special notation: Uses arrows (→) to show progression. Consecutive duplicate states are collapsed so only turning points remain.
Examples:
[SENTIMENT:FRUSTRATED→NEUTRAL→SATISFIED]
[SENTIMENT:NEUTRAL→GRATEFUL]
[SENTIMENT:ANGRY→CALM→GRATEFUL]
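The collapse of consecutive duplicates into turning points can be sketched as follows (the helper name is hypothetical, not part of the CLM API):

```python
def sentiment_token(trajectory: list[str]) -> str:
    """Collapse consecutive duplicate states so only turning points remain,
    then join them with arrows."""
    deduped: list[str] = []
    for state in trajectory:
        if not deduped or deduped[-1] != state:
            deduped.append(state)
    return "[SENTIMENT:" + "\u2192".join(deduped) + "]"
```

A per-turn trajectory like NEUTRAL, NEUTRAL, FRUSTRATED, FRUSTRATED, SATISFIED therefore compresses to `[SENTIMENT:NEUTRAL→FRUSTRATED→SATISFIED]`.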
Complete Transcript Example (v2)
Original conversation (billing dispute call):
Agent Raj: Thank you for calling customer support. My name is Raj. How can I help you today?
Customer: Hi Raj, I have a billing issue. I was charged twice this month for my subscription.
Agent Raj: I'm sorry to hear that. Let me look into your account. Can I have your email?
Customer: Sure, it's melissa.jordan@example.com
Agent Raj: I see two charges. The system retried payment after the first succeeded.
I'll process a full refund. Reference number RFD-908712, 3 to 5 business days.
Customer: Thank you so much for your help!
Compressed (v2):
[INTERACTION:SUPPORT:CHANNEL=VOICE]
[DURATION=6m]
[LANG=EN]
[DOMAIN:BILLING]
[SERVICE:SUBSCRIPTION]
[CUSTOMER_INTENT:REPORT_DUPLICATE_CHARGE]
[CONTEXT:EMAIL_PROVIDED]
[AGENT_ACTIONS:ACCOUNT_VERIFIED→DIAGNOSTIC_PERFORMED→REFUND_INITIATED]
[SYSTEM_ACTIONS:PAYMENT_RETRY_DETECTED]
[RESOLUTION:ISSUE_RESOLVED]
[STATE:RESOLVED]
[COMMITMENT:REFUND_3-5_DAYS]
[ARTIFACT:REFUND_REF=RFD-908712]
[SENTIMENT:NEUTRAL→GRATEFUL]
Original: ~1,450 tokens
Compressed: ~145 tokens
Reduction: 90%
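Because each semantic block sits on its own line in a predictable `[TOKEN:VALUE]` or `[TOKEN=VALUE]` shape, compressed transcripts like the one above can be parsed back into fields with a short regex. This is an illustrative sketch, not part of the CLM API:

```python
import re

# Token name, then ':' or '=' separator, then the remainder of the block
TOKEN_RE = re.compile(r"\[([A-Z_]+)[:=](.+)\]")

def parse_transcript(compressed: str) -> dict[str, str]:
    """Parse v2 semantic-block lines like [DOMAIN:BILLING] into a dict."""
    fields: dict[str, str] = {}
    for line in compressed.strip().splitlines():
        m = TOKEN_RE.fullmatch(line.strip())
        if m:
            fields[m.group(1)] = m.group(2)
    return fields
```

For instance, `parse_transcript("[DOMAIN:BILLING]\n[DURATION=6m]")` yields `{"DOMAIN": "BILLING", "DURATION": "6m"}`.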
Key Differences from System Prompts
| Aspect | System Prompts | Transcripts (v2) |
|---|---|---|
| Token Types | REQ, TARGET, EXTRACT, CTX, OUT, REF | INTERACTION, DOMAIN, CUSTOMER_INTENT, AGENT_ACTIONS, STATE, SENTIMENT, etc. |
| Structure | Hierarchical (instruction flow) | Sequential (semantic blocks) |
| Purpose | Instruction compression | Conversation compression as explicit semantic contract |
| Flow | Logical (what→how→output) | Temporal (metadata→intent→actions→outcome→sentiment) |
| Special Features | Nested JSON with types/enums | PII-safe context, separated agent/system actions |
| Actions | Comma-separated: REQ:ACTION1,ACTION2 | Arrow-chained: AGENT_ACTIONS:A1→A2→A3 |
Part 3: Structured Data Format
Purpose
Compress tabular data (catalogs, products, rules) while preserving:
- Field schema
- Record structure
- Relationships
- Data types
Format Structure
NOT token-based - uses header + row format:
[DATASET_NAME:COUNT]{FIELD1,FIELD2,FIELD3,...}
[value1,value2,value3,...]
[value1,value2,value3,...]
Components
1. Header:
[DATASET_NAME:COUNT]{FIELD_NAMES}
- DATASET_NAME: Data type (e.g., NBA_CATALOG, PRODUCT, RULE)
- COUNT: Number of records
- {FIELD_NAMES}: Comma-separated field list
2. Rows:
[value1,value2,value3,...]
- One row per record
- Values in same order as header
- Nested objects: {KEY:VALUE}
- Arrays: [ITEM1,ITEM2]
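The header + row layout can be emitted mechanically from a list of records. The sketch below assumes all records share the same keys and that values are already normalized strings (the function name is hypothetical, not the CLM API):

```python
def compress_records(name: str, records: list[dict]) -> str:
    """Emit a [NAME:COUNT]{FIELDS} header plus one bracketed row per record."""
    fields = list(records[0])  # field order taken from the first record
    header = f"[{name}:{len(records)}]" + "{" + ",".join(f.upper() for f in fields) + "}"
    rows = ["[" + ",".join(str(r[f]) for f in fields) + "]" for r in records]
    return "\n".join([header] + rows)
```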
Data Type Handling
| Type | Original | Compressed |
|---|---|---|
| String | "Hello World" | HELLO_WORLD |
| Number | 1299.99 | 1299.99 |
| Boolean | true | TRUE |
| Array | ["a", "b", "c"] | [A,B,C] |
| Null | null | NULL |
| Date | "2024-10-15" | 2024-10-15 |
| Object | {"key": "value"} | {KEY:VALUE} |
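The mappings in this table can be expressed as one recursive normalization function. This is a sketch of the rules as documented, not the encoder's actual implementation (the date heuristic, for example, is an assumption):

```python
import re

def normalize(value) -> str:
    """Normalize a Python value per the data-type table above."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):          # check before int: bool is a subclass of int
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, list):
        return "[" + ",".join(normalize(v) for v in value) + "]"
    if isinstance(value, dict):
        return "{" + ",".join(f"{k.upper()}:{normalize(v)}" for k, v in value.items()) + "}"
    # Strings: ISO dates pass through unchanged, free text becomes UPPER_SNAKE
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return value
    return re.sub(r"\s+", "_", value.strip()).upper()
```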
Complete Example: NBA Catalog
Original:
[
{
"nba_id": "NBA-001",
"action": "Offer Premium Upgrade",
"description": "Recommend premium tier to qualified customers",
"conditions": ["tenure > 12 months", "no recent complaints"],
"priority": "high",
"channel": "phone",
"expected_value": 450
},
{
"nba_id": "NBA-002",
"action": "Cross-sell Credit Card",
"description": "Offer co-branded credit card to active users",
"conditions": ["good credit score", "active checking"],
"priority": "medium",
"channel": "email",
"expected_value": 300
}
]
Compressed:
[NBA_CATALOG:2]{NBA_ID,ACTION,DESCRIPTION,CONDITIONS,PRIORITY,CHANNEL,EXPECTED_VALUE}
[NBA-001,OFFER_PREMIUM_UPGRADE,RECOMMEND_PREMIUM_TIER,[TENURE>12M,NO_COMPLAINTS],HIGH,PHONE,450]
[NBA-002,CROSS_SELL_CREDIT_CARD,OFFER_COBRANDED_CARD,[GOOD_CREDIT,ACTIVE_CHECKING],MEDIUM,EMAIL,300]
Compression: ~82%
Nested Structures
Original:
{
"sku": "LAPTOP-001",
"name": "Professional Laptop",
"specifications": {
"processor": "Intel i7",
"ram": "16GB DDR5",
"storage": "512GB SSD"
},
"features": ["Backlit keyboard", "Fingerprint reader"]
}
Compressed:
[PRODUCT:1]{SKU,NAME,SPECIFICATIONS,FEATURES}
[LAPTOP-001,PROFESSIONAL_LAPTOP,{PROCESSOR:I7,RAM:16GB_DDR5,STORAGE:512GB_SSD},[BACKLIT_KB,FINGERPRINT]]
Key Differences from Token Systems
| Aspect | System Prompt / Transcript | Structured Data |
|---|---|---|
| Format | Semantic tokens | Header + rows |
| Categories | 6 or 7 token types | No token categories |
| Purpose | Meaning compression | Schema + data compression |
| Structure | Token hierarchy/flow | Tabular (spreadsheet-like) |
| Nesting | Via token attributes | Via {}, [] notation |
| Semantic | High (preserves meaning) | Medium (preserves structure) |
Common Principles
Despite using different systems, all three encoders share core principles:
1. Semantic Preservation
Goal: Maintain complete meaning in compressed form
All three systems preserve the essential meaning of the original content, just using different methods:
- System Prompts: Hierarchical semantic tokens
- Transcripts: Sequential semantic blocks (v2 schema)
- Structured Data: Schema-based compression
2. LLM-Native Format
Goal: LLMs understand without decompression
All three formats are designed to be understood by modern LLMs (GPT-4, Claude, etc.) without requiring decompression:
System Prompt: [REQ:ANALYZE] [TARGET:TRANSCRIPT] [EXTRACT:SENTIMENT]
Transcript: [DOMAIN:BILLING] [CUSTOMER_INTENT:REQUEST_REFUND] [STATE:RESOLVED]
Structured: [NBA_CATALOG:2]{ID,ACTION} [NBA-001,UPGRADE] [NBA-002,CROSS_SELL]
LLMs can process all three formats directly.
3. Predictable Structure
Goal: Consistent, parseable format
Each system has clear syntax rules:
- System Prompts: [CATEGORY:VALUE:ATTR=VAL]
- Transcripts: [TOKEN:TYPE:ATTR=VAL]
- Structured Data: [HEADER]{FIELDS} + [values]
4. Compression Without Loss
Goal: Dramatic size reduction while preserving information
All three achieve 60-95% token reduction while maintaining semantic completeness.
Best Practices
1. Use the Right System for Your Content
Instructions/Prompts → System Prompt Encoder
Conversations → Transcript Encoder
Tabular Data → Structured Data Encoder
2. Understand System-Specific Features
System Prompts:
- Use hierarchical flow (REQ → TARGET → EXTRACT → OUT)
- Leverage OUT_JSON for structured output
- Use CTX for conditions and escalation rules

Transcripts:
- Follow semantic block order (INTERACTION → DOMAIN → INTENT → ACTIONS → STATE → SENTIMENT)
- Use the AGENT_ACTIONS chain for ordered agent operations
- Separate system events into SYSTEM_ACTIONS
- Use CONTEXT tokens for PII-safe fact-of-information

Structured Data:
- Define a clear field schema in the header
- Use nested notation for complex structures
- Maintain consistent field order across rows
3. Test Compression Quality
# Verify compression preserves meaning
result = encoder.encode(content)
# Check compression ratio
print(f"Reduction: {result.compression_ratio:.1%}")
# Test LLM understanding
llm_response = llm.complete(
system=result.compressed,
user="Test query"
)
# Verify LLM understood the compressed content
Troubleshooting
Issue: Wrong Encoder Used
Symptom: Poor compression or unexpected output
Solution: Use the correct encoder for your content type
# For instructions:
sys_encoder = CLMEncoder(cfg=CLMConfig(lang="en"))
# For conversations:
transcript_encoder = CLMEncoder(cfg=CLMConfig(lang="en"))
# For tabular data:
sd_encoder = CLMEncoder(cfg=CLMConfig(
lang="en",
ds_config=SDCompressionConfig(...)
))
Issue: LLM Doesn't Understand Compressed Format
Symptom: LLM response quality degraded
Cause: Usually malformed token syntax — modern LLMs understand well-formed structured tokens natively
Solution:
- Verify the syntax is correct for the encoder type
- Ensure tokens are well-formed
- Test with different LLM models
Issue: Information Loss
Symptom: Important details missing from compressed output
Solution for System Prompts:
# Use less aggressive compression
config = CLMConfig(
lang="en",
sys_prompt_config=SysPromptConfig(
infer_types=True, # Add type information
add_attrs=True # Include enums/ranges
)
)
Solution for Structured Data:
# Lower importance threshold
config = CLMConfig(
lang="en",
ds_config=SDCompressionConfig(
importance_threshold=0.4, # Include more fields
max_field_length=300 # Preserve more content
)
)
Next Steps
- System Prompt Encoder - Overview of system prompt compression
- Task Prompts - Using the 6-token hierarchy
- Configuration Prompts - Template-based agent configuration
- Structured Data Encoder - Using header + row format
- CLM Vocabulary - Understanding vocabulary mappings
- CLM Configuration - Configuring the encoders