CLM Configuration

Overview

The CLMConfig object is the central configuration system for CLM compression. It controls language selection, compression behavior, and provides access to language-specific vocabularies and pattern matching rules.

Key components:

- Language selection (4 supported languages)
- Structured data compression configuration
- System prompt compression configuration
- Language-specific vocabularies (semantic tokens)
- Language-specific rules (regex patterns)

CLMConfig Object

Structure

from clm_core import CLMConfig, SDCompressionConfig, SysPromptConfig
from clm_core.types import ThreadConfig

config = CLMConfig(
    lang="en",                              # Language selection
    ds_config=SDCompressionConfig(...),     # Structured data config
    sys_prompt_config=SysPromptConfig(...), # System prompt config
    thread_config=ThreadConfig(...),        # Thread Encoder config
)

Parameters

`lang` (string, default: `"en"`)

Supported languages:

Code	Language	Status	Vocabulary	Rules
`en`	English	✅ Full	Complete	Complete
`pt`	Portuguese	✅ Full	Complete	Partial
`es`	Spanish	✅ Full	Complete	Partial
`fr`	French	✅ Full	Complete	Partial

Usage:

# English (full support)
config = CLMConfig(lang="en")

# Portuguese (full support)
config = CLMConfig(lang="pt")

# Spanish (full support)
config = CLMConfig(lang="es")

# French (full support)
config = CLMConfig(lang="fr")

Note: Languages marked as Beta have limited vocabulary and rule coverage. Production use is recommended only for fully supported languages (en, pt, es, fr).

`ds_config` (SDCompressionConfig)

Configuration for structured data compression. See Structured Data Encoder for complete documentation.

Default behavior:

# Auto-created with defaults if not provided
config = CLMConfig(lang="en")  # Uses default SDCompressionConfig

Custom configuration:

from clm_core import SDCompressionConfig, CLMConfig

config = CLMConfig(
    lang="en",
    ds_config=SDCompressionConfig(
        importance_threshold=0.7,
        max_truncation_length=150
    )
)

`thread_config` (ThreadConfig)

Configuration for Thread Encoder behavior. See Thread Encoder for complete documentation.

Default behavior:

# Auto-created with defaults if not provided
config = CLMConfig(lang="en")  # Uses default ThreadConfig

Custom configuration:

from clm_core import CLMConfig
from clm_core.types import ThreadConfig

config = CLMConfig(
    lang="en",
    thread_config=ThreadConfig(
        detect_lang=True,
        include_ctx_values=True,
        estimate_thread_duration=True,
        include_summary=True,
    )
)

ThreadConfig Parameters

Parameter	Type	Default	Description
`detect_lang`	`bool`	`True`	Detect thread language and include `[LANG=...]` in compressed output
`include_ctx_values`	`bool`	`False`	Append NER-extracted values to context tokens (e.g. `[CONTEXT:EMAIL_PROVIDED:doe@mail.com]`). When `False`, only the fact of detection is emitted
`estimate_thread_duration`	`bool`	`False`	Estimate duration from conversation content, overriding any `duration` value in metadata
`include_summary`	`bool`	`False`	Generate a natural-language summary from the compressed output without an LLM call
`custom_summary_template`	`str | None`	`None`	Jinja2 template for summary generation; uses built-in template when `None`
`redaction_pattern`	`str`	Built-in pattern	Regex to detect redacted PII fields. Defaults to matching `[REDACTED]`, `***`, `<redacted>`, `XXX`, `[PII]`
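
To make the `redaction_pattern` parameter more concrete, here is a minimal sketch of a regex that covers the markers listed in the table and counts them in a piece of text. The pattern is an illustrative assumption, not the library's actual built-in default, and `REDACTION_SKETCH` and `count_redactions` are hypothetical names.

```python
import re

# Hypothetical pattern covering the markers named in the table above.
# The library's real built-in pattern may differ.
REDACTION_SKETCH = r"\[REDACTED\]|\*{3}|<redacted>|XXX|\[PII\]"


def count_redactions(text: str, pattern: str = REDACTION_SKETCH) -> int:
    """Count how many redaction markers appear in the text."""
    return len(re.findall(pattern, text))


sample = "My card is [REDACTED] and my email is *** (SSN: [PII])."
print(count_redactions(sample))  # 3

# A broader custom pattern, such as the one in Example 5, also accepts
# labelled markers like "[PHONE REDACTED]".
custom = r"\[.*?REDACTED.*?\]"
print(count_redactions("Phone: [PHONE REDACTED]", custom))  # 1
```

A custom pattern is worth considering when the upstream PII scrubber emits labelled markers that a fixed list of literals would miss.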

`sys_prompt_config` (SysPromptConfig)

Configuration for system prompt compression. See System Prompt Encoder for complete documentation.

Default behavior:

# Auto-created with defaults if not provided
config = CLMConfig(lang="en")  # Uses default SysPromptConfig

Custom configuration:

from clm_core import SysPromptConfig, CLMConfig

config = CLMConfig(
    lang="en",
    sys_prompt_config=SysPromptConfig(
        infer_types=True,
        add_attrs=False,
        use_structured_output_abstraction=True
    )
)

SysPromptConfig Parameters

Parameter	Type	Default	Description
`infer_types`	bool	`False`	Add type annotations to JSON output fields
`add_attrs`	bool	`False`	Include enums, ranges, and constraints in output
`use_structured_output_abstraction`	bool	`True`	Compress output format to CL tokens

infer_types - When True, adds explicit type information to JSON fields:

# infer_types=False
[OUT_JSON:{summary,score}]

# infer_types=True
[OUT_JSON:{summary:STR,score:FLOAT}]

add_attrs - When True, preserves enums, ranges, and constraints:

# add_attrs=False
[OUT_JSON:{score:FLOAT}]

# add_attrs=True
[OUT_JSON:{score:FLOAT}:ENUMS={"ranges":[{"min":0.0,"max":0.49,"label":"FAIL"}]}]

use_structured_output_abstraction - When True, compresses output format definitions into CL tokens. When False, output format remains in natural language. This is particularly useful for Configuration Prompts where output format should be encoded in CL tokens rather than kept in NL.
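
The effect of `infer_types` and `add_attrs` on the output token can be sketched with a small renderer. This is only an illustration of the token shapes shown above, not the encoder's implementation; `render_out_json` and its parameters are hypothetical.

```python
import json


def render_out_json(fields, infer_types=False, attrs=None):
    """Render an OUT_JSON-style token from a {field_name: type_name} mapping."""
    if infer_types:
        body = ",".join(f"{name}:{type_}" for name, type_ in fields.items())
    else:
        body = ",".join(fields)
    token = f"[OUT_JSON:{{{body}}}"
    if attrs:
        # Compact JSON without spaces, mirroring the add_attrs=True example.
        token += ":ENUMS=" + json.dumps(attrs, separators=(",", ":"))
    return token + "]"


fields = {"summary": "STR", "score": "FLOAT"}
print(render_out_json(fields))                    # [OUT_JSON:{summary,score}]
print(render_out_json(fields, infer_types=True))  # [OUT_JSON:{summary:STR,score:FLOAT}]
```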

Computed Properties

`vocab` Property

Purpose: Provides access to the language-specific vocabulary for semantic token generation.

Type: BaseVocabulary

Usage:

config = CLMConfig(lang="en")

# Access vocabulary
vocab = config.vocab

# Vocabulary contains mappings for:
# - REQ tokens (actions/operations)
# - TARGET tokens (objects/data sources)
# - EXTRACT tokens (fields to extract)
# - CTX tokens (contextual information)
# - OUT tokens (output formats)
# - REF tokens (references/IDs)

Language-specific vocabularies:

# Each language has its own vocabulary
en_config = CLMConfig(lang="en")
en_vocab = en_config.vocab  # ENVocabulary()

pt_config = CLMConfig(lang="pt")
pt_vocab = pt_config.vocab  # PTVocabulary()

es_config = CLMConfig(lang="es")
es_vocab = es_config.vocab  # ESVocabulary()

fr_config = CLMConfig(lang="fr")
fr_vocab = fr_config.vocab  # FRVocabulary()

Vocabulary mapping structure:

Each vocabulary defines mappings from natural language to semantic tokens:

# Example vocabulary mappings (conceptual)
{
    # REQ (Request/Action) tokens
    "analyze": "ANALYZE",
    "extract": "EXTRACT",
    "summarize": "SUMMARIZE",
    "diagnose": "DIAGNOSE",

    # TARGET tokens
    "transcript": "TRANSCRIPT",
    "document": "DOCUMENT",
    "invoice": "INVOICE",

    # EXTRACT tokens
    "sentiment": "SENTIMENT",
    "compliance": "COMPLIANCE",
    "entities": "ENTITIES",

    # ... and more
}

See CLM Vocabulary for complete vocabulary documentation.
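
As a rough illustration of how a vocabulary lookup might behave, the sketch below groups a handful of mappings by token category and resolves a word to a `CATEGORY:TOKEN` string. The data structure and the `lookup` helper are assumptions made for this example; the real `BaseVocabulary` interface may look quite different.

```python
# Hypothetical, heavily reduced vocabulary grouped by token category.
VOCAB_SKETCH = {
    "REQ": {"analyze": "ANALYZE", "summarize": "SUMMARIZE"},
    "TARGET": {"transcript": "TRANSCRIPT", "document": "DOCUMENT"},
    "EXTRACT": {"sentiment": "SENTIMENT", "entities": "ENTITIES"},
}


def lookup(word):
    """Return 'CATEGORY:TOKEN' for a known word, or None when it is unmapped."""
    key = word.lower()
    for category, mapping in VOCAB_SKETCH.items():
        if key in mapping:
            return f"{category}:{mapping[key]}"
    return None


print(lookup("Transcript"))  # TARGET:TRANSCRIPT
print(lookup("sentiment"))   # EXTRACT:SENTIMENT
print(lookup("weather"))     # None
```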

`rules` Property

Purpose: Provides access to language-specific pattern matching rules for intelligent text analysis.

Type: BaseRules

Usage:

config = CLMConfig(lang="en")

# Access rules
rules = config.rules

# Rules contain regex patterns for:
# - Comparison patterns (differences, similarities)
# - Duration patterns (time expressions)
# - Tone/style patterns
# - Explanation patterns
# - And more...

Language-specific rules:

# Currently, only English has complete rules
en_config = CLMConfig(lang="en")
en_rules = en_config.rules  # ENRules() - Complete

pt_config = CLMConfig(lang="pt")
pt_rules = pt_config.rules  # None (uses fallback)

es_config = CLMConfig(lang="es")
es_rules = es_config.rules  # None (uses fallback)

Rule categories:

Rules are organized into pattern categories:

COMPARISON_MAP - Identifying comparison requests
DURATION_PATTERNS - Extracting time durations
TONE_MAP - Detecting tone/style requirements
EXPLAIN_PATTERNS - Recognizing explanation requests
ACTION_PATTERNS - Identifying action verbs
And more...

See Pattern Matching Rules below for details.

Pattern Matching Rules

Overview

Rules use regular expressions to identify patterns in text and map them to compressed representations. This enables intelligent compression that preserves semantic meaning.

Rule Structure

Each rule category contains regex patterns mapped to compressed tokens:

class ENRules(BaseRules):
    @property
    def COMPARISON_MAP(self) -> dict[str, str]:
        return {
            r"\bdifferences?\b": "DIFFERENCES",
            r"\bdistinguish\b": "DIFFERENCES",
            r"\bcontrast\b": "DIFFERENCES",
            r"\bsimilarities?\b": "SIMILARITIES",
            r"\bcommon\b": "SIMILARITIES",
            r"\bpros\s*(and|&)?\s*cons\b": "PROS_CONS",
            r"\badvantages\s*(and|&)?\s*disadvantages\b": "PROS_CONS",
            r"\bbenefits\s*(and|&)?\s*drawbacks\b": "PROS_CONS",
            r"\btrade-?offs?\b": "TRADEOFFS",
        }

COMPARISON_MAP

Identifies requests for comparisons and contrasts:

Pattern examples:

- "What are the differences between X and Y?" → DIFFERENCES
- "Distinguish between A and B" → DIFFERENCES
- "Contrast these approaches" → DIFFERENCES
- "What are the similarities?" → SIMILARITIES
- "What do they have in common?" → SIMILARITIES
- "List the pros and cons" → PROS_CONS
- "Advantages and disadvantages" → PROS_CONS
- "What are the trade-offs?" → TRADEOFFS

Compression example:

Original: "Compare the differences between approach A and approach B"
Compressed: [REQ:COMPARE] [TARGET:APPROACHES] [EXTRACT:DIFFERENCES]
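
To check how the comparison patterns behave on real sentences, they can be exercised directly with Python's `re` module. The sketch below copies a subset of the patterns from `COMPARISON_MAP`; the `match_comparisons` helper and the case-insensitive matching are assumptions for illustration, not part of the library.

```python
import re

# Subset of the COMPARISON_MAP patterns shown above.
COMPARISON_MAP = {
    r"\bdifferences?\b": "DIFFERENCES",
    r"\bsimilarities?\b": "SIMILARITIES",
    r"\bcommon\b": "SIMILARITIES",
    r"\bpros\s*(and|&)?\s*cons\b": "PROS_CONS",
    r"\btrade-?offs?\b": "TRADEOFFS",
}


def match_comparisons(text):
    """Return the distinct comparison tokens whose pattern occurs in text."""
    found = []
    for pattern, token in COMPARISON_MAP.items():
        if re.search(pattern, text, flags=re.IGNORECASE) and token not in found:
            found.append(token)
    return found


print(match_comparisons("What are the differences between X and Y?"))  # ['DIFFERENCES']
print(match_comparisons("List the pros and cons"))                     # ['PROS_CONS']
```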

DURATION_PATTERNS

Extracts and normalizes time duration expressions:

Pattern examples:

- "5 minutes" → 5M
- "2 hours" → 2H
- "3 days" → 3D
- "1 week" → 1W
- "6 months" → 6MO
- "2 years" → 2Y

Compression example:

Original: "The call lasted 15 minutes and 30 seconds"
Compressed: [CALL:DURATION=15M30S]

Regex patterns:

@property
def DURATION_PATTERNS(self) -> dict[str, str]:
    return {
        r"(\d+)\s*minutes?": r"\1M",
        r"(\d+)\s*hours?": r"\1H",
        r"(\d+)\s*days?": r"\1D",
        r"(\d+)\s*weeks?": r"\1W",
        r"(\d+)\s*months?": r"\1MO",
        r"(\d+)\s*years?": r"\1Y",
        r"(\d+)\s*seconds?": r"\1S",
    }
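
One way to see how these patterns could turn "15 minutes and 30 seconds" into `15M30S` is to collect every match, expand its replacement template, and join the results in order of appearance. This is a sketch under that assumption; `normalize_duration` is a hypothetical helper and the encoder may combine matches differently.

```python
import re

DURATION_PATTERNS = {
    r"(\d+)\s*minutes?": r"\1M",
    r"(\d+)\s*hours?": r"\1H",
    r"(\d+)\s*days?": r"\1D",
    r"(\d+)\s*weeks?": r"\1W",
    r"(\d+)\s*months?": r"\1MO",
    r"(\d+)\s*years?": r"\1Y",
    r"(\d+)\s*seconds?": r"\1S",
}


def normalize_duration(text):
    """Collect every duration in text, in order of appearance, as one token."""
    hits = []
    for pattern, template in DURATION_PATTERNS.items():
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            # Match.expand fills the captured number into the template.
            hits.append((match.start(), match.expand(template)))
    return "".join(token for _, token in sorted(hits))


print(normalize_duration("The call lasted 15 minutes and 30 seconds"))  # 15M30S
print(normalize_duration("6 months"))                                   # 6MO
```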

TONE_MAP

Identifies tone and style requirements:

Pattern examples:

- "Use a professional tone" → TONE_PROFESSIONAL
- "Be friendly and approachable" → TONE_FRIENDLY
- "Keep it formal" → TONE_FORMAL
- "Use casual language" → TONE_CASUAL
- "Be empathetic" → TONE_EMPATHETIC
- "Stay neutral" → TONE_NEUTRAL

Compression example:

Original: "Respond in a friendly and empathetic tone"
Compressed: [CTX:TONE=FRIENDLY,EMPATHETIC]

Regex patterns:

@property
def TONE_MAP(self) -> dict[str, str]:
    return {
        r"\bprofessional\b": "PROFESSIONAL",
        r"\bfriendly\b": "FRIENDLY",
        r"\bformal\b": "FORMAL",
        r"\bcasual\b": "CASUAL",
        r"\bempathetic\b": "EMPATHETIC",
        r"\bneutral\b": "NEUTRAL",
        r"\btechnical\b": "TECHNICAL",
        r"\bsimple\b": "SIMPLE",
    }
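
The step from individual tone matches to a single `[CTX:TONE=...]` token can be sketched as follows. The `tone_token` helper, and the choice to order tones by where they appear in the text, are assumptions made for this example.

```python
import re

TONE_MAP = {
    r"\bprofessional\b": "PROFESSIONAL",
    r"\bfriendly\b": "FRIENDLY",
    r"\bformal\b": "FORMAL",
    r"\bcasual\b": "CASUAL",
    r"\bempathetic\b": "EMPATHETIC",
    r"\bneutral\b": "NEUTRAL",
}


def tone_token(text):
    """Build a [CTX:TONE=...] token from the tones mentioned in text."""
    hits = []
    for pattern, tone in TONE_MAP.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            hits.append((match.start(), tone))
    if not hits:
        return None
    tones = ",".join(tone for _, tone in sorted(hits))
    return f"[CTX:TONE={tones}]"


print(tone_token("Respond in a friendly and empathetic tone"))
# [CTX:TONE=FRIENDLY,EMPATHETIC]
```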

EXPLAIN_PATTERNS

Recognizes explanation and elaboration requests:

Pattern examples:

- "Explain how this works" → REQ:EXPLAIN
- "Describe the process" → REQ:DESCRIBE
- "Walk me through the steps" → REQ:GUIDE
- "Elaborate on this topic" → REQ:ELABORATE
- "Detail the requirements" → REQ:DETAIL

Compression example:

Original: "Explain the differences between these two approaches"
Compressed: [REQ:EXPLAIN] [TARGET:APPROACHES] [EXTRACT:DIFFERENCES]

ACTION_PATTERNS

Identifies action verbs and operations:

Pattern examples:

- "Analyze the data" → REQ:ANALYZE
- "Extract key information" → REQ:EXTRACT
- "Summarize the findings" → REQ:SUMMARIZE
- "Classify the issues" → REQ:CLASSIFY
- "Validate the input" → REQ:VALIDATE
- "Generate a report" → REQ:GENERATE

Regex patterns:

@property
def ACTION_PATTERNS(self) -> dict[str, str]:
    return {
        r"\banalyze\b": "ANALYZE",
        r"\bextract\b": "EXTRACT",
        r"\bsummarize\b": "SUMMARIZE",
        r"\bclassify\b": "CLASSIFY",
        r"\bvalidate\b": "VALIDATE",
        r"\bgenerate\b": "GENERATE",
        r"\bdiagnose\b": "DIAGNOSE",
        r"\btroubleshoot\b": "TROUBLESHOOT",
    }

Language Support Details

Full Support Languages (en, pt, es, fr)

Capabilities:

- ✅ Complete vocabulary coverage (all token categories)
- ✅ Pattern matching rules (English has the most comprehensive coverage)
- ✅ Tested on production data
- ✅ High compression ratios (70-95%)
- ✅ High validation accuracy (>90%)

Recommended for:

- Production deployments
- Mission-critical applications
- High-volume processing

Configuration Examples

Example 1: Basic English Configuration

from clm_core import CLMConfig

# Simple English configuration
config = CLMConfig(lang="en")

# Accessing vocabulary and rules
print(config.vocab)  # ENVocabulary()
print(config.rules)  # ENRules()

Example 2: Portuguese with Custom Structured Data Config

from clm_core import CLMConfig, SDCompressionConfig

config = CLMConfig(
    lang="pt",
    ds_config=SDCompressionConfig(
        required_fields=["id", "nome"],
        importance_threshold=0.7
    )
)

Example 3: Spanish with System Prompt Configuration

from clm_core import CLMConfig, SysPromptConfig

config = CLMConfig(
    lang="es",
    sys_prompt_config=SysPromptConfig(
        infer_types=True,
        add_attrs=False
    )
)

Example 4: Multi-Language Processing

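from clm_core import CLMConfig, CLMEncoder

# Placeholder inputs keyed by language code; replace with real content
content = {
    "en": "Summarize the call transcript.",
    "pt": "Resuma a transcrição da chamada.",
    "es": "Resume la transcripción de la llamada.",
    "fr": "Résumez la transcription de l'appel.",
}

# Process content in different languages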
languages = ["en", "pt", "es", "fr"]

results = {}
for lang in languages:
    config = CLMConfig(lang=lang)
    encoder = CLMEncoder(cfg=config)

    # Compress content in this language
    result = encoder.encode(content[lang])
    results[lang] = result.compressed

Example 5: Full Configuration

from clm_core import CLMConfig, CLMEncoder, SDCompressionConfig, SysPromptConfig
from clm_core.types import ThreadConfig

# Complete configuration with all options
config = CLMConfig(
    lang="en",
    ds_config=SDCompressionConfig(
        auto_detect=True,
        required_fields=["id", "name"],
        importance_threshold=0.6,
        max_truncation_length=150,
        preserve_structure=True
    ),
    sys_prompt_config=SysPromptConfig(
        infer_types=True,
        add_attrs=True
    ),
    thread_config=ThreadConfig(
        detect_lang=True,
        include_ctx_values=True,
        estimate_thread_duration=False,
        include_summary=True,
        redaction_pattern=r"\[.*?REDACTED.*?\]",
    )
)

# Use configuration
encoder = CLMEncoder(cfg=config)

Advanced: Understanding Rule Execution

Rule Processing Order

1. Pattern matching - Regex rules identify patterns
2. Token generation - Patterns mapped to semantic tokens
3. Vocabulary lookup - Tokens resolved using language vocabulary
4. Compression - Final compressed output generated

Example: Complete Flow

Input:

"Analyze the transcript and extract sentiment, comparing differences 
between customer and agent responses over a 15 minute call"

Rule Processing:

ACTION_PATTERNS matches "Analyze" → REQ:ANALYZE
ACTION_PATTERNS matches "extract" → REQ:EXTRACT
COMPARISON_MAP matches "differences" → DIFFERENCES
DURATION_PATTERNS matches "15 minute" → 15M

Vocabulary Lookup:

- "transcript" → TARGET:TRANSCRIPT
- "sentiment" → EXTRACT:SENTIMENT
- "customer and agent" → SOURCE=CUSTOMER,AGENT

Compressed Output:

[REQ:ANALYZE,EXTRACT] [TARGET:TRANSCRIPT] 
[EXTRACT:SENTIMENT:SOURCE=CUSTOMER,AGENT] 
[EXTRACT:DIFFERENCES] [DURATION=15M]
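
The rule-processing part of this flow can be approximated with the standard `re` module. The sketch below covers only the regex stage, using reduced copies of the pattern maps; it skips the vocabulary lookup, so the `TARGET` and `EXTRACT:SENTIMENT` tokens are absent from its output. `rule_stage` and `render` are hypothetical helpers, not library functions.

```python
import re

# Reduced copies of the pattern maps documented above.
ACTION_PATTERNS = {r"\banalyze\b": "ANALYZE", r"\bextract\b": "EXTRACT",
                   r"\bsummarize\b": "SUMMARIZE"}
COMPARISON_MAP = {r"\bdifferences?\b": "DIFFERENCES",
                  r"\bsimilarities?\b": "SIMILARITIES"}
DURATION_PATTERNS = {r"(\d+)\s*minutes?": r"\1M", r"(\d+)\s*hours?": r"\1H"}


def rule_stage(text):
    """Run the three rule categories and return their raw matches."""
    flags = re.IGNORECASE
    actions = [tok for pat, tok in ACTION_PATTERNS.items()
               if re.search(pat, text, flags)]
    comparisons = [tok for pat, tok in COMPARISON_MAP.items()
                   if re.search(pat, text, flags)]
    durations = [m.expand(tpl) for pat, tpl in DURATION_PATTERNS.items()
                 for m in re.finditer(pat, text, flags)]
    return {"REQ": actions, "COMPARISON": comparisons, "DURATION": durations}


def render(stage):
    """Assemble a simplified token string from the rule matches."""
    parts = []
    if stage["REQ"]:
        parts.append("[REQ:" + ",".join(stage["REQ"]) + "]")
    parts += [f"[EXTRACT:{tok}]" for tok in stage["COMPARISON"]]
    parts += [f"[DURATION={tok}]" for tok in stage["DURATION"]]
    return " ".join(parts)


text = ("Analyze the transcript and extract sentiment, comparing differences "
        "between customer and agent responses over a 15 minute call")
print(render(rule_stage(text)))
# [REQ:ANALYZE,EXTRACT] [EXTRACT:DIFFERENCES] [DURATION=15M]
```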

Best Practices

1. Use Fully Supported Languages for Production

# Production ✅
config = CLMConfig(lang="en")  # Full support
config = CLMConfig(lang="pt")  # Full support

# Development only ⚠️
config = CLMConfig(lang="de")  # Beta

2. Configure Based on Use Case

from clm_core import CLMConfig, SDCompressionConfig, SysPromptConfig
from clm_core.types import ThreadConfig

# Thread (transcript / free-form) compression
config = CLMConfig(
    lang="en",
    thread_config=ThreadConfig(
        include_ctx_values=True,   # Surface extracted entity values
        include_summary=True,      # Generate a human-readable summary
    )
)

# Structured data
config = CLMConfig(
    lang="en",
    ds_config=SDCompressionConfig(
        importance_threshold=0.7
    )
)

# System prompts
config = CLMConfig(
    lang="en",
    sys_prompt_config=SysPromptConfig(
        infer_types=True  # If you need type hints
    )
)

3. Reuse Configuration Objects

# Create once, use many times
config = CLMConfig(lang="en")
encoder = CLMEncoder(cfg=config)

# Reuse for multiple compressions
result1 = encoder.encode(content1)
result2 = encoder.encode(content2)
result3 = encoder.encode(content3)

4. Test with Representative Data

# Test configuration with real data
config = CLMConfig(lang="en")
encoder = CLMEncoder(cfg=config)

# Validate compression quality
# (load_test_set and validate_output are placeholders for your own helpers)
test_data = load_test_set()
for item in test_data:
    result = encoder.encode(item)
    assert result.compression_ratio >= 0.70
    assert validate_output(result.compressed)
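
The assertion above depends on how `compression_ratio` is defined, which this page does not spell out. The sketch below assumes it is the fraction of the original length that compression removes, measured in characters; the library may instead count tokens or use a different formula, so treat this as an interpretation only.

```python
def compression_ratio(original: str, compressed: str) -> float:
    """Fraction of the original length removed by compression (assumed definition)."""
    if not original:
        return 0.0
    return 1 - len(compressed) / len(original)


# A 100-character input reduced to 25 characters gives a ratio of 0.75,
# which would pass the 0.70 threshold used in the test loop.
print(compression_ratio("a" * 100, "a" * 25))  # 0.75
```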

Extending CLM (Advanced)

Adding Custom Rules

Extending rules is not officially supported, but the following shows the general pattern:

# Example: Custom rule pattern (conceptual)
class CustomENRules(ENRules):
    @property
    def CUSTOM_PATTERNS(self) -> dict[str, str]:
        return {
            r"\bcritical\b": "CRITICAL",
            r"\burgent\b": "URGENT",
            r"\bhigh[-\s]priority\b": "HIGH_PRIORITY",
        }

Note: This requires a deep understanding of CLM internals and is not recommended for production use without consultation.
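
To experiment with the subclassing pattern without touching the library, the custom patterns can be tried against a stand-in base class. In the sketch below, `ENRules` is a hypothetical, reduced substitute for the real class, and `match_custom` is an illustrative helper; how the encoder would discover a new property such as `CUSTOM_PATTERNS` is not covered here.

```python
import re


class ENRules:
    """Stand-in for the library's ENRules; hypothetical and heavily reduced."""

    @property
    def ACTION_PATTERNS(self) -> dict[str, str]:
        return {r"\banalyze\b": "ANALYZE"}


class CustomENRules(ENRules):
    @property
    def CUSTOM_PATTERNS(self) -> dict[str, str]:
        return {
            r"\bcritical\b": "CRITICAL",
            r"\burgent\b": "URGENT",
            r"\bhigh[-\s]priority\b": "HIGH_PRIORITY",
        }


def match_custom(text, rules):
    """Return the custom tokens whose pattern occurs in text."""
    return [token for pattern, token in rules.CUSTOM_PATTERNS.items()
            if re.search(pattern, text, flags=re.IGNORECASE)]


print(match_custom("This is a high-priority, urgent ticket", CustomENRules()))
# ['URGENT', 'HIGH_PRIORITY']
```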

Troubleshooting

Issue: Language Not Fully Supported

Symptom: Lower compression ratios or accuracy in non-English languages

Solution:

# Check language support status
config = CLMConfig(lang="de")  # Beta language

# If compression quality is insufficient:
# 1. Use English if possible
# 2. Wait for language maturity
# 3. Contact support for enterprise language support

Issue: Rules Not Matching

Symptom: Patterns not being recognized

Solution:

# Ensure language has rule support
config = CLMConfig(lang="en")  # Full rules
print(config.rules)  # Should not be None

# Test specific patterns
text = "analyze the differences"
# Should match both ACTION_PATTERNS and COMPARISON_MAP

Issue: Vocabulary Mismatches

Symptom: Unexpected token generation

Solution:

# Inspect vocabulary
config = CLMConfig(lang="en")
vocab = config.vocab

# Check available mappings
# (Vocabulary inspection methods depend on implementation)

Next Steps

CLM Vocabulary - Complete vocabulary reference
Token Hierarchy - Understanding semantic tokens
System Prompt Encoder - Overview of system prompt compression
Task Prompts - Action-oriented instruction compression
Configuration Prompts - Template-based agent configuration
Transcript Encoder - Using transcript compression
Structured Data Encoder - Using structured data compression
