CLM Vocabulary
Overview
The CLM Vocabulary system is the semantic foundation of CLLM compression. It defines mappings from natural language words and phrases to compressed semantic tokens, enabling intelligent compression that preserves meaning while dramatically reducing token count.
Purpose:
- Maps verbose phrases to concise tokens
- Identifies important vs. redundant words
- Provides language-specific semantic understanding
- Enables consistent token generation across compression types
Structure:
Each language has its own Vocabulary class (e.g., ENVocabulary, PTVocabulary) that inherits from BaseVocabulary and defines language-specific word lists, mappings, and patterns.
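To make the inheritance pattern concrete, here is a minimal sketch of what a new language vocabulary could look like. The import path and the DEVocabulary class are assumptions for illustration only; the attribute names mirror the categories documented below.
# Hypothetical sketch of a new language vocabulary (import path assumed)
from clm_core import BaseVocabulary

class DEVocabulary(BaseVocabulary):  # illustrative German vocabulary
    # Class-level word lists and mappings, mirroring the categories below
    STOPWORDS = ("der", "die", "das", "ein", "eine")
    REQ_TOKENS = {
        "ANALYZE": ["analysieren", "prüfen", "untersuchen"],
        "GENERATE": ["generieren", "erstellen", "schreiben"],
    }
    COMPOUND_PHRASES = {"technische dokumentation": "DOCUMENT"}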
Vocabulary Architecture
Language-Specific Vocabularies
# Available vocabularies
ENVocabulary() # English - Complete
PTVocabulary() # Portuguese - Complete
ESVocabulary() # Spanish - Complete
FRVocabulary() # French - Complete
# Others in development
Access via CLMConfig:
from clm_core import CLMConfig, CLMEncoder
config = CLMConfig(lang="en")
vocab = config.vocab # ENVocabulary instance
# Use vocabulary for compression
encoder = CLMEncoder(cfg=config)
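Once the encoder is constructed, compression is a single call. The result attributes used below are the same ones that appear in the coverage-testing example later on this page:
result = encoder.encode("Analyze the customer support transcript")
print(result.compressed)                  # compressed token string
print(f"{result.compression_ratio:.1%}")  # fraction of tokens saved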
Core Vocabulary Categories
The vocabulary system is organized into 25+ categories, each serving a specific purpose in compression.
1. CODE_INDICATORS
Purpose: Identifies code and technical content
Examples:
CODE_INDICATORS = (
"code", "script", "function", "program", "algorithm",
"api", "class", "method", "variable", "git", "commit",
"unittest", "test case", "debug", "refactor"
)
Use case:
Input: "Review the code in the function and check for bugs"
Detection: "code", "function" → CODE domain
Output: [REQ:REVIEW] [TARGET:CODE:DOMAIN=CODE] [EXTRACT:BUGS]
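A sketch of how indicator matching might work; contains_code_indicator is a hypothetical helper, not a documented API:
import re

# Hypothetical helper: true if any CODE_INDICATORS entry (including
# multi-word entries like "test case") appears as a whole word or phrase
def contains_code_indicator(text: str, vocab) -> bool:
    lowered = text.lower()
    return any(re.search(rf"\b{re.escape(ind)}\b", lowered)
               for ind in vocab.CODE_INDICATORS)

contains_code_indicator("Review the code in the function", vocab)  # True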
2. ACTION_VERBS
Purpose: Comprehensive list of action verbs for intent detection
Categories:
- Modification: reduce, increase, improve, optimize, enhance, update, modify
- Problem-solving: fix, solve, resolve, debug, repair, troubleshoot, diagnose
- Creation: create, generate, build, make, produce, design, develop
- Analysis: analyze, examine, review, evaluate, assess, validate, verify
- Explanation: explain, describe, clarify, define, document
- Processing: calculate, compute, determine, find, process
- Organization: compare, contrast, classify, categorize, sort, organize
- Data handling: extract, transform, load, aggregate, filter
- Operations: deploy, release, rollback, scale, provision
Examples (60+ verbs):
ACTION_VERBS = (
"reduce", "increase", "improve", "optimize", "enhance",
"fix", "solve", "resolve", "debug", "repair",
"create", "generate", "build", "make", "produce",
"analyze", "examine", "review", "evaluate", "assess",
"explain", "describe", "clarify", "define", "document",
# ... 40+ more
)
Use case:
Input: "Analyze the data and generate a report"
Verbs detected: "analyze" → ANALYZE, "generate" → GENERATE
Output: [REQ:ANALYZE,GENERATE] [TARGET:DATA] [OUT:REPORT]
3. STOPWORDS
Purpose: Words to remove during compression (no semantic value)
Examples:
STOPWORDS = (
"it", "this", "that", "these", "those",
"a", "an", "the",
"some", "any", "all", "none",
"something", "anything", "everything", "nothing",
"someone", "anyone", "everyone", "no one"
)
Use case:
Input: "Please analyze this data and provide the results"
Removal: "Please", "this", "the" → filtered out
Output: [REQ:ANALYZE] [TARGET:DATA] [OUT:RESULTS]
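The same filtering idea applies to the PRONOUNS, MODALS, and NOISE_VERBS categories below. strip_stopwords is a hypothetical helper, not the encoder's actual internals:
# Hypothetical helper showing the filtering idea
def strip_stopwords(text: str, vocab) -> list[str]:
    return [w for w in text.lower().split() if w not in vocab.STOPWORDS]

strip_stopwords("Please analyze this data and provide the results", vocab)
# ['please', 'analyze', 'data', 'and', 'provide', 'results']
# ("this" and "the" are dropped; fillers like "please" are presumably
# handled by other filters)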
4. PRONOUNS
Purpose: Personal pronouns to filter or simplify
Examples:
PRONOUNS = (
"i", "we", "you", "they", "he", "she", "it",
"me", "us", "them", "him", "her",
"my", "our", "your", "their", "his", "her", "its"
)
Use case:
Input: "I need you to analyze my data"
Simplified: "I", "you", "my" → filtered
Output: [REQ:ANALYZE] [TARGET:DATA]
5. MODALS
Purpose: Modal verbs that add no semantic content
Examples:
MODALS = (
"can", "could", "should", "would", "will", "shall",
"may", "might", "must",
"do", "does", "did"
)
Use case:
Input: "You should analyze the code and you could check for bugs"
Removed: "should", "could" → no impact on meaning
Output: [REQ:ANALYZE] [TARGET:CODE] [EXTRACT:BUGS]
6. COMPOUND_PHRASES
Purpose: Multi-word phrases mapped to single tokens
Examples:
COMPOUND_PHRASES = {
"customer support": "TICKET",
"customer service": "TICKET",
"support ticket": "TICKET",
"help desk": "TICKET",
"chat thread_encoder": "TRANSCRIPT",
"conversation thread_encoder": "TRANSCRIPT",
"source code": "CODE",
"code review": "CODE",
"pull request": "CODE",
"error message": "LOG",
"stack trace": "LOG",
"system log": "LOG",
"api endpoint": "ENDPOINT",
"rest api": "ENDPOINT",
"unit test": "TEST",
"test case": "TEST",
"business plan": "PLAN",
"project plan": "PLAN",
"database query": "QUERY",
"sql query": "QUERY"
}
Use case:
Input: "Analyze the customer support ticket and check the error message"
Mapping: "customer support" → TICKET, "error message" → LOG
Output: [REQ:ANALYZE] [TARGET:TICKET] [EXTRACT:LOG]
Impact: Collapses multi-word phrases into single tokens, yielding significant compression
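To see how this substitution could work in practice, here is a hypothetical helper (replace_compounds is not part of the library API); sorting longest-first keeps a short phrase from shadowing a longer one:
# Hypothetical sketch of compound-phrase substitution
def replace_compounds(text: str, vocab) -> str:
    lowered = text.lower()
    for phrase in sorted(vocab.COMPOUND_PHRASES, key=len, reverse=True):
        lowered = lowered.replace(phrase, vocab.COMPOUND_PHRASES[phrase])
    return lowered

replace_compounds("check the error message in the stack trace", vocab)
# 'check the LOG in the LOG'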
7. TYPE_MAP
Purpose: Document/content type identification
Examples:
TYPE_MAP = {
"call": "CALL",
"phone call": "CALL",
"meeting": "MEETING",
"chat": "CHAT",
"email": "EMAIL",
"message": "EMAIL",
"conversation": "CONVERSATION",
"report": "REPORT",
"document": "DOCUMENT",
"article": "ARTICLE",
"thread_encoder": "TRANSCRIPT",
"ticket": "TICKET",
"case": "TICKET",
"complaint": "COMPLAINT",
"feedback": "FEEDBACK",
"inquiry": "INQUIRY",
"request": "REQUEST"
}
Use case:
Input: "Analyze the customer support call transcript"
Type detection: "call" → CALL, "transcript" → TRANSCRIPT
Output: [REQ:ANALYZE] [TARGET:CALL,TRANSCRIPT:DOMAIN=SUPPORT]
8. CONTEXT_MAP
Purpose: Domain/context identification
Examples:
CONTEXT_MAP = {
"customer": "CUSTOMER",
"support": "SUPPORT",
"sales": "SALES",
"technical": "TECHNICAL",
"engineering": "TECHNICAL",
"product": "PRODUCT",
"marketing": "MARKETING",
"business": "BUSINESS",
"finance": "FINANCE",
"legal": "LEGAL",
"hr": "HR",
"operations": "OPERATIONS"
}
Use case:
Input: "Analyze the technical support ticket"
Context: "technical" → TECHNICAL, "support" → SUPPORT
Output: [REQ:ANALYZE] [TARGET:TICKET:DOMAIN=TECHNICAL,SUPPORT]
9. domain_candidates
Purpose: Keywords that indicate specific domains
Domains (15 total; 13 shown below):
CODE:
"bug", "error", "security", "performance", "code", "script",
"function", "algorithm", "debug", "compile", "library", "api"
ENTITIES:
"names", "dates", "amounts", "addresses", "emails", "phones"
QA:
"verification", "policy", "soft_skills", "accuracy",
"compliance", "disclosures"
SUPPORT:
"issue", "sentiment", "actions", "urgency", "priority",
"ticket", "customer", "agent", "troubleshoot"
TECHNICAL:
"bug", "error", "stacktrace", "api", "server", "log",
"debug", "crash", "deployment", "backend"
DOCUMENT:
"document", "article", "manual", "guide", "thread_encoder",
"notes", "summary", "instructions"
BUSINESS:
"report", "analysis", "executive", "management", "dashboard",
"kpi", "roi", "quarterly", "presentation"
LEGAL:
"contract", "policy", "compliance", "gdpr", "clause",
"agreement", "terms", "privacy"
FINANCE:
"invoice", "billing", "payment", "transaction", "refund",
"expense", "balance", "statement"
SECURITY:
"breach", "risk", "threat", "alert", "malware", "phishing",
"permissions", "access control", "audit"
MEDICAL:
"patient", "diagnosis", "prescription", "clinical", "symptoms",
"treatment", "doctor"
SALES:
"lead", "crm", "opportunity", "pipeline", "prospect", "deal", "quote"
EDUCATION:
"lesson", "curriculum", "teacher", "student", "training", "course"
Use case:
Input: "Analyze the invoice for compliance with GDPR policy"
Keywords: "invoice" → FINANCE, "compliance", "gdpr", "policy" → LEGAL
Output: [REQ:ANALYZE] [TARGET:INVOICE:DOMAIN=FINANCE,LEGAL]
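One plausible way to turn these keyword lists into a domain guess is simple hit counting. score_domains is a hypothetical helper; the real detector may weigh matches differently:
from collections import Counter

# Hypothetical sketch: count keyword hits per domain
def score_domains(text: str, vocab) -> Counter:
    lowered = text.lower()
    scores = Counter()
    for domain, keywords in vocab.domain_candidates.items():
        scores[domain] = sum(kw in lowered for kw in keywords)
    return scores

score_domains("Analyze the invoice for compliance with GDPR policy", vocab)
# Counter({'LEGAL': 3, 'QA': 2, 'FINANCE': 1, ...})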
10. REQ_TOKENS
Purpose: Maps action phrases to standardized request tokens
Main categories (24 total):
ANALYZE:
"analyze", "review", "examine", "evaluate", "assess",
"inspect", "check out", "audit", "investigate"
EXTRACT:
"extract", "pull out", "identify", "find", "locate",
"get", "retrieve", "return", "include", "select"
GENERATE:
"generate", "create", "write", "draft", "compose",
"produce", "build", "develop", "suggest", "formulate"
SUMMARIZE:
"summarize", "condense", "brief", "synopsis", "sum up",
"digest", "recap"
TRANSFORM:
"convert", "transform", "change", "rewrite", "translate",
"modify", "adapt", "rephrase", "edit", "add", "remove"
EXPLAIN:
"explain", "describe", "clarify", "elaborate", "tell me about",
"detail", "illustrate", "discuss", "define"
COMPARE:
"compare", "contrast", "versus", "vs", "difference between",
"differentiate", "distinguish"
CLASSIFY:
"classify", "categorize", "sort", "group", "label",
"organize", "arrange", "segment"
DEBUG:
"debug", "troubleshoot", "diagnose", "fix bug", "investigate bug",
"find bug", "track down", "identify issue"
OPTIMIZE:
"optimize", "improve", "enhance", "refactor", "speed up",
"streamline", "maximize", "minimize", "reduce", "increase"
VALIDATE:
"validate", "verify", "check", "confirm", "test",
"ensure", "certify", "authenticate"
And 13 more categories: SEARCH, RANK, PREDICT, FORMAT, DETECT, CALCULATE, AGGREGATE, DETERMINE, ROUTE, EXECUTE, LIST, MATCH, SELECT
Use case:
Input: "Please summarize the document and extract key entities"
Mapping: "summarize" → SUMMARIZE, "extract" → EXTRACT
Output: [REQ:SUMMARIZE,EXTRACT] [TARGET:DOCUMENT] [EXTRACT:ENTITIES]
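Since REQ_TOKENS maps each token to its trigger phrases, a reverse index gives constant-time lookup from a matched phrase to its token. SYNONYM_TO_REQ is an illustrative name, not library API:
# Invert REQ_TOKENS into a synonym -> token lookup
# (on synonym collisions across categories, later entries win)
SYNONYM_TO_REQ = {
    syn: token
    for token, synonyms in vocab.REQ_TOKENS.items()
    for syn in synonyms
}
SYNONYM_TO_REQ["summarize"]   # 'SUMMARIZE'
SYNONYM_TO_REQ["pull out"]    # 'EXTRACT'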
11. TARGET_TOKENS
Purpose: Maps object descriptions to standardized target tokens
Categories (40+ targets):
Code & Technical:
"CODE": ["code", "script", "program", "function"],
"QUERY": ["query", "sql", "database query"],
"ENDPOINT": ["endpoint", "api", "rest endpoint"],
"COMPONENT": ["component", "module", "package"],
"SYSTEM": ["system", "application", "software"],
"TEST": ["test", "unit test", "test case"],
"LOG": ["log", "logs", "error log"]
Documents:
"DOCUMENT": ["document", "doc", "file", "report"],
"EMAIL": ["email", "message", "correspondence"],
"REPORT": ["report", "analysis", "findings"],
"TRANSCRIPT": ["thread_encoder", "conversation", "chat log"]
Customer Service:
"TICKET": ["ticket", "support ticket", "issue", "case"],
"COMPLAINT": ["complaint", "issue", "problem"],
"REQUEST": ["request", "service request"],
"INQUIRY": ["inquiry", "question", "query"],
"CALL": ["call", "phone call", "support call"]
Business:
"PLAN": ["plan", "business plan", "project plan"],
"POST": ["post", "linkedin post", "blog post"],
"SUMMARY": ["summary", "executive summary"],
"METRICS": ["revenue", "metrics", "statistics"]
Specialized:
"NBA_CATALOG": ["nba", "next best action", "predefined actions"],
"CUSTOMER_INTENT": ["customer intent", "customer need"],
"CORRELATION": ["correlation", "relationship"],
"TRADEOFF": ["trade-off", "tradeoffs"],
"PATTERN": ["pattern", "patterns", "trend"],
"CHURN": ["churn", "customer churn", "attrition"]
Use case:
Input: "Analyze the NBA catalog and identify patterns in customer churn"
Mapping: "nba catalog" → NBA_CATALOG, "patterns" → PATTERN, "customer churn" → CHURN
Output: [REQ:ANALYZE] [TARGET:NBA_CATALOG] [EXTRACT:PATTERN,CHURN]
12. EXTRACT_FIELDS
Purpose: Standardized field names for extraction
Categories (50+ fields):
Customer Service:
"ISSUE", "SENTIMENT", "ACTIONS", "NEXT_STEPS",
"URGENCY", "PRIORITY", "CUSTOMER_INTENT"
Entities:
"NAMES", "DATES", "AMOUNTS", "EMAILS", "PHONES", "ADDRESSES"
Technical:
"BUGS", "SECURITY", "PERFORMANCE", "ERRORS", "WARNINGS"
Analysis:
"KEYWORDS", "TOPICS", "ENTITIES", "FACTS", "DECISIONS",
"REQUIREMENTS", "FEATURES", "PROBLEMS", "SOLUTIONS", "RISKS"
Metrics:
"METRICS", "KPI", "SCORES", "RATINGS", "RELEVANCE_SCORE",
"MATCH_CONFIDENCE", "SEMANTIC_SIMILARITY"
Metadata:
"OWNERS", "ASSIGNEES", "STAKEHOLDERS", "PARTICIPANTS",
"TIMESTAMPS", "DURATIONS", "FREQUENCIES", "QUANTITIES",
"CATEGORIES", "TAGS", "LABELS", "STATUS", "TYPE"
Use case:
Input: "Extract sentiment, urgency, and next steps from the ticket"
Fields: "sentiment" → SENTIMENT, "urgency" → URGENCY, "next steps" → NEXT_STEPS
Output: [REQ:EXTRACT] [TARGET:TICKET] [EXTRACT:SENTIMENT,URGENCY,NEXT_STEPS]
13. OUTPUT_FORMATS
Purpose: Specifies desired output format
Examples:
OUTPUT_FORMATS = {
"JSON": ["json", "json format"],
"MARKDOWN": ["markdown", "md"],
"TABLE": ["table", "tabular"],
"LIST": ["list", "bullet points", "bullets"],
"PLAIN": ["plain text", "text only"],
"CSV": ["csv", "comma-separated"]
}
Use case:
Input: "Analyze the data and return results as JSON"
Format: "json" → JSON
Output: [REQ:ANALYZE] [TARGET:DATA] [OUT:JSON]
14. rank_triggers
Purpose: Identifies ranking/sorting requests
Examples:
rank_triggers = {
"rank", "ranking", "sort", "sort by", "order", "order by",
"prioritize", "priority", "most important", "least important",
"top", "bottom", "first", "last", "highest", "lowest",
"best", "worst", "greatest", "least", "maximum", "minimum",
"arrange"
}
Use case:
Input: "Rank the tickets by priority and sort by urgency"
Triggers: "rank", "sort" → REQ:RANK
Output: [REQ:RANK] [TARGET:TICKETS] [BY:PRIORITY,URGENCY]
15. NOISE_VERBS
Purpose: Common verbs with little semantic value to filter
Examples:
NOISE_VERBS = {
"be", "have", "do", "can", "could", "should", "would",
"may", "might", "must", "will", "shall",
"go", "come", "take", "get", "make", "work", "live",
"know", "think", "feel", "say", "call"
}
Use case:
Input: "I would like you to have the system analyze the data"
Filter: "would", "have" → removed
Output: [REQ:ANALYZE] [TARGET:DATA]
16. IMPERATIVE_PATTERNS
Purpose: Command patterns that map to actions
Examples:
IMPERATIVE_PATTERNS = [
(["list", "enumerate", "itemize"], "LIST", "ITEMS"),
(["name", "identify"], "GENERATE", "ITEMS"),
(["give", "provide", "suggest"], "GENERATE", "ITEMS"),
(["tell", "explain", "describe"], "EXPLAIN", "CONCEPT")
]
Use case:
Input: "List the top 5 issues"
Pattern: "list" → LIST + ITEMS
Output: [REQ:LIST] [TARGET:ITEMS] [LIMIT:5]
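A sketch of imperative matching, including how a numeric limit like "top 5" might be pulled out; match_imperative and the regex are assumptions, not documented behavior:
import re

# Hypothetical sketch: IMPERATIVE_PATTERNS triples are
# (trigger verbs, REQ token, default target)
def match_imperative(text: str, vocab):
    first_word = text.lower().split()[0]
    for verbs, req, target in vocab.IMPERATIVE_PATTERNS:
        if first_word in verbs:
            m = re.search(r"\btop (\d+)\b", text.lower())
            limit = f" [LIMIT:{m.group(1)}]" if m else ""
            return f"[REQ:{req}] [TARGET:{target}]{limit}"
    return None

match_imperative("List the top 5 issues", vocab)
# '[REQ:LIST] [TARGET:ITEMS] [LIMIT:5]'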
17. QUESTION_WORDS
Purpose: Question word detection for query analysis
Examples:
QUESTION_WORDS = ["what", "who", "where", "when", "why", "how", "which"]
Use case:
Input: "What are the main issues?"
Detection: "what" → question pattern
Output: [REQ:EXTRACT] [TARGET:ISSUES:TYPE=MAIN]
Vocabulary Usage in Compression
Flow: Natural Language → Compressed Tokens
Step 1: Text Input
"Please analyze the customer support transcript and extract
sentiment, urgency, and next steps"
Step 2: Vocabulary Matching
"analyze" → REQ_TOKENS["ANALYZE"]
"customer support transcript" → COMPOUND_PHRASES["customer support"] + TYPE_MAP["transcript"]
"extract" → REQ_TOKENS["EXTRACT"]
"sentiment" → EXTRACT_FIELDS["SENTIMENT"]
"urgency" → EXTRACT_FIELDS["URGENCY"]
"next steps" → EXTRACT_FIELDS["NEXT_STEPS"]
Step 3: Token Generation
[REQ:ANALYZE,EXTRACT]
[TARGET:TRANSCRIPT:DOMAIN=SUPPORT]
[EXTRACT:SENTIMENT,URGENCY,NEXT_STEPS]
Step 4: Compression
Original: 87 tokens
Compressed: 23 tokens
Reduction: 73.6%
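The four steps above can be condensed into a rough sketch. compress below is a hypothetical, simplified version of the flow; the production encoder also handles domains, output formats, and deduplication:
# Hypothetical, simplified matching flow
def compress(text: str, vocab) -> str:
    lowered = text.lower()
    reqs = [tok for tok, syns in vocab.REQ_TOKENS.items()
            if any(s in lowered for s in syns)]
    targets = sorted({tok for word, tok in vocab.TYPE_MAP.items()
                      if word in lowered})
    fields = [f for f in vocab.EXTRACT_FIELDS
              if f.replace("_", " ").lower() in lowered]
    parts = []
    if reqs:
        parts.append(f"[REQ:{','.join(reqs)}]")
    if targets:
        parts.append(f"[TARGET:{','.join(targets)}]")
    if fields:
        parts.append(f"[EXTRACT:{','.join(fields)}]")
    return " ".join(parts)

compress("Please analyze the customer support transcript and extract "
         "sentiment, urgency, and next steps", vocab)
# '[REQ:ANALYZE,EXTRACT] [TARGET:TRANSCRIPT] [EXTRACT:SENTIMENT,URGENCY,NEXT_STEPS]'
# (token order follows dict order; DOMAIN detection omitted for brevity)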
Language-Specific Vocabularies
English (ENVocabulary)
Status: ✅ Complete
Coverage: All 25+ categories fully populated
Quality: Production-ready
Size: 1,000+ mappings
Strengths:
- Comprehensive technical vocabulary
- Extensive business terminology
- Rich customer service terms
- Complete domain coverage
Portuguese (PTVocabulary)
Status: ✅ Complete
Coverage: All categories with Portuguese translations
Quality: Production-ready
Example mappings:
REQ_TOKENS = {
"ANALYZE": ["analisar", "revisar", "examinar", "avaliar"],
"EXTRACT": ["extrair", "obter", "identificar", "localizar"],
"GENERATE": ["gerar", "criar", "escrever", "produzir"]
}
TARGET_TOKENS = {
"DOCUMENTO": ["documento", "doc", "arquivo"],
"TRANSCRIPT": ["transcrição", "conversa", "diálogo"]
}
Spanish (ESVocabulary)
Status: ✅ Complete
Coverage: All categories with Spanish translations
Quality: Production-ready
Example mappings:
REQ_TOKENS = {
"ANALYZE": ["analizar", "revisar", "examinar", "evaluar"],
"EXTRACT": ["extraer", "obtener", "identificar", "localizar"],
"GENERATE": ["generar", "crear", "escribir", "producir"]
}
French (FRVocabulary)
Status: ✅ Complete
Coverage: All categories with French translations
Quality: Production-ready
Example mappings:
REQ_TOKENS = {
"ANALYZE": ["analyser", "examiner", "évaluer", "inspecter"],
"EXTRACT": ["extraire", "obtenir", "identifier", "localiser"],
"GENERATE": ["générer", "créer", "écrire", "produire"]
}
Extending Vocabulary
Understanding Vocabulary Categories
Each category serves a specific purpose:
| Category | Purpose | Impact |
|---|---|---|
| REQ_TOKENS | Action detection | High - Determines primary operation |
| TARGET_TOKENS | Object identification | High - What to operate on |
| EXTRACT_FIELDS | Field names | High - What data to extract |
| COMPOUND_PHRASES | Multi-word compression | Very High - 5-10x compression |
| domain_candidates | Domain detection | Medium - Context setting |
| STOPWORDS | Noise removal | Medium - Cleanup |
| CODE_INDICATORS | Technical detection | Low - Domain hint |
| rank_triggers | Sorting detection | Low - Specialized use |
Best Practices
1. Leverage Compound Phrases
Impact: Highest compression ratio
# Without compound phrase
"customer support ticket" → 3 tokens
# With COMPOUND_PHRASES mapping
"customer support" → TICKET → 1 token
Total: "customer support ticket" → TICKET → 1 token (66% reduction)
2. Use Domain-Specific Vocabularies
# Generic
config = CLMConfig(lang="en")
# Including domain keywords in the prompt helps:
# the domain is inferred from vocabulary matches
3. Understand Token Priority
Priority order:
1. Required tokens (REQ, TARGET) - Always present
2. Extract fields - When extraction requested
3. Context - When domain detected
4. Output - When format specified
5. Metadata - When available
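A minimal sketch of assembling tokens in this priority order; assemble is an illustrative helper, not library API:
# Hypothetical token assembly in priority order
def assemble(req: str, target: str, extract=None, out=None) -> str:
    parts = [f"[REQ:{req}]", f"[TARGET:{target}]"]  # priority 1: always present
    if extract:                                     # priority 2
        parts.append(f"[EXTRACT:{','.join(extract)}]")
    if out:                                         # priority 4
        parts.append(f"[OUT:{out}]")
    return " ".join(parts)   # context and metadata omitted for brevity

assemble("ANALYZE", "DATA", out="JSON")
# '[REQ:ANALYZE] [TARGET:DATA] [OUT:JSON]'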
4. Test Vocabulary Coverage
# Check if vocabulary handles your domain
test_phrases = [
"analyze customer support ticket",
"extract sentiment from thread_encoder",
"generate next best action"
]
for phrase in test_phrases:
result = encoder.encode(phrase)
print(f"{phrase} → {result.compressed}")
print(f"Compression: {result.compression_ratio:.1%}")
Advanced: Vocabulary Statistics
English Vocabulary Size
| Category | Count |
|---|---|
| ACTION_VERBS | 60+ |
| REQ_TOKENS | 24 categories × 5-10 synonyms = 150+ |
| TARGET_TOKENS | 40+ categories × 3-7 variants = 200+ |
| EXTRACT_FIELDS | 50+ |
| COMPOUND_PHRASES | 45+ |
| domain_candidates | 15 domains × 10-20 keywords = 200+ |
| STOPWORDS | 25+ |
| Total mappings | ~1,000+ |
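You can approximate these counts yourself. The attribute names are as documented above; the exact container types are assumptions:
from clm_core import CLMConfig

vocab = CLMConfig(lang="en").vocab
print(len(vocab.ACTION_VERBS))                                 # 60+
print(len(vocab.COMPOUND_PHRASES))                             # 45+
print(sum(len(v) for v in vocab.REQ_TOKENS.values()))          # ~150 synonyms
print(sum(len(v) for v in vocab.domain_candidates.values()))   # ~200 keywords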
Coverage by Domain
Based on domain_candidates:
| Domain | Keywords | Use Cases |
|---|---|---|
| SUPPORT | 18 | Customer service, tickets, complaints |
| CODE | 19 | Software development, debugging |
| TECHNICAL | 12 | IT operations, system issues |
| BUSINESS | 11 | Reports, analytics, presentations |
| FINANCE | 9 | Invoices, billing, transactions |
| LEGAL | 9 | Contracts, compliance, policies |
| SECURITY | 9 | Threats, breaches, access control |
| MEDICAL | 8 | Patient care, diagnoses |
| SALES | 8 | CRM, leads, deals |
| EDUCATION | 7 | Training, courses, students |
| QA | 6 | Quality assurance, compliance |
| ENTITIES | 6 | Names, dates, amounts |
| DOCUMENT | 10 | General documents |
Troubleshooting
Issue: Low Compression in Specific Domain
Symptom: Compression ratio lower than expected for your content type
Solution: Check if vocabulary has coverage for your domain
# Check domain candidates
config = CLMConfig(lang="en")
vocab = config.vocab
# Does your domain exist?
domains = vocab.domain_candidates.keys()
print(f"Available domains: {domains}")
# If missing: content may not compress well
# Consider contributing to vocabulary or using web feedback
Issue: Important Terms Not Recognized
Symptom: Specific jargon or terminology not mapping correctly
Solution: Use explicit field names that match EXTRACT_FIELDS
# Instead of domain jargon
"Get the customer NPS score"
# Use vocabulary terms
"Extract the satisfaction rating" # Maps to RATINGS
Issue: Over-Compression Losing Meaning
Symptom: Compressed output lacks necessary detail
Solution: The vocabulary prioritizes semantic preservation
# The vocabulary is designed to preserve meaning
# If compression seems too aggressive, it may be working correctly
# Test with LLM to verify understanding
Next Steps
- CLM Configuration - Using vocabularies via config
- Token Hierarchy - Understanding token structure
- System Prompt Encoder - Vocabulary in system prompts
- Transcript Encoder - Vocabulary in transcripts