Kolmogorov Complexity Analysis
Overview
KolmogorovAnalyzer approximates Kolmogorov complexity using zlib compression to verify that the CLM token string preserves the structural information density of the original transcript. It is the first of three quality gate checks and carries a 25% weight in the final retention score.
Core insight: K(x) ≈ len(zlib.compress(x))
If two strings encode the same information, compressing them to their theoretical minimum should yield similar sizes. The CLM token string is expected to be simpler and shorter than the original — that is the point of compression — but it should not collapse to near-zero complexity, which would indicate that meaning was discarded rather than restructured.
Theory
Kolmogorov Complexity
Kolmogorov complexity K(x) is the length of the shortest program that produces string x. It is uncomputable in general, but zlib at maximum compression level (level=9) provides a practical approximation:
K(x) ≈ len(zlib.compress(x, level=9))
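This approximation is cheap to compute directly. A minimal standalone sketch (the function name `approx_kolmogorov` is illustrative, not part of `clm_core`):

```python
import os
import zlib

def approx_kolmogorov(data: bytes) -> int:
    """Approximate K(x) as the size of x's zlib compression at level 9."""
    return len(zlib.compress(data, level=9))

low = b"abc" * 200      # 600 bytes of pure repetition: low complexity
high = os.urandom(600)  # 600 incompressible random bytes: high complexity
print(approx_kolmogorov(low))   # far below 600
print(approx_kolmogorov(high))  # close to (or slightly above) 600
```

Repetitive input compresses to a tiny fraction of its raw size, while random bytes barely compress at all, which is exactly the gap the two metrics below exploit.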
Two metrics are derived from this approximation:
complexity_ratio
complexity_ratio = zlib_size(compressed) / zlib_size(original)
Measures how much smaller the token string's compressed form is relative to the original's compressed form. A ratio well below 1.0 is expected and healthy — CLM removes redundancy. A ratio above the threshold suggests the token string is unexpectedly close to the original in raw information content, which may indicate encoding failure.
Pass condition: complexity_ratio ≤ 0.85
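Under the definition above, the ratio takes two zlib calls. A sketch with hypothetical helper names (not the library's API):

```python
import zlib

def complexity_ratio(original: str, clm_tokens: str, threshold: float = 0.85):
    """Return (ratio, passed): zlib size of the token string over zlib
    size of the original, checked against the pass threshold."""
    orig_zlib = len(zlib.compress(original.encode("utf-8"), level=9))
    clm_zlib = len(zlib.compress(clm_tokens.encode("utf-8"), level=9))
    ratio = clm_zlib / orig_zlib
    return ratio, ratio <= threshold
```

An identical pair yields a ratio of exactly 1.0 and fails the gate; a genuinely smaller token string lands well under the threshold.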
information_efficiency
information_efficiency = (zlib_bytes / raw_bytes) for compressed ÷ (zlib_bytes / raw_bytes) for original
Compares information density (compressed bytes per raw character) between the two strings. CLM tokens are designed to pack more meaning per character than natural language, so the compressed output should score higher on this metric than the original.
Higher is better. A value above 1.0 means each character in the token string carries more compressed information than each character in the original.
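The density ratio can be sketched the same way (again a hypothetical helper, not the library's API):

```python
import zlib

def information_efficiency(original: str, clm_tokens: str) -> float:
    """Ratio of compressed-bytes-per-raw-byte: token string vs original."""
    def density(s: str) -> float:
        raw = s.encode("utf-8")
        return len(zlib.compress(raw, level=9)) / len(raw)
    return density(clm_tokens) / density(original)
```

A highly redundant transcript has low density, while a short token string with little internal repetition has high density, so the ratio for healthy output lands above 1.0.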
KolmogorovModel
class KolmogorovModel(BaseModel):
    original_bytes: int            # Raw byte size of original
    compressed_bytes: int          # Raw byte size of CLM token string
    clm_bytes: int                 # Raw byte size of CLM token string (same as compressed_bytes)
    original_zlib_bytes: int       # zlib size of original
    clm_zlib_bytes: int            # zlib size of CLM token string
    complexity_ratio: float        # clm_zlib_bytes / original_zlib_bytes
    information_efficiency: float  # CLM density ÷ original density
    passed: bool                   # True if complexity_ratio ≤ 0.85
Thresholds
| Metric | Threshold | Meaning |
|---|---|---|
| complexity_ratio | ≤ 0.85 | CLM output is meaningfully simpler than source |
| information_efficiency | No threshold | Informational only — higher is better |
The threshold is intentionally lenient. Structurally simpler output is the expected and desired outcome. The check exists to catch pathological cases where the token string has collapsed to near-random content or near-empty output.
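Both pathological modes are easy to reproduce with zlib directly. A sketch under the same K(x) approximation (`zlib_size` and the sample strings are illustrative):

```python
import os
import zlib

def zlib_size(data: bytes) -> int:
    """Compressed size in bytes at zlib's maximum level."""
    return len(zlib.compress(data, level=9))

# A redundant, transcript-like original
original = b"Agent: I can see the duplicate charge. I'll initiate a refund now. " * 10
orig_z = zlib_size(original)

# Near-random output barely compresses, so the ratio blows past 0.85
noisy_ratio = zlib_size(os.urandom(200)) / orig_z

# Near-empty output collapses the ratio toward zero
empty_ratio = zlib_size(b"[X]") / orig_z

print(round(noisy_ratio, 2), round(empty_ratio, 2))
```

The first case trips the ≤ 0.85 gate; the second passes the gate but lands in the `complexity_ratio < 0.1` warning band described below.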
Standalone Usage
from clm_core import KolmogorovAnalyzer
analyzer = KolmogorovAnalyzer()
result = analyzer.analyze(original_text, compressed_token_string)
print(f"Complexity ratio: {result.complexity_ratio:.3f}")
print(f"Information density: {result.information_efficiency:.2f}x")
print(f"Passed: {result.passed}")
Interpreting Results
Typical healthy output
complexity_ratio: 0.41
information_efficiency: 1.18x
passed: True
A ratio around 0.4–0.6 is typical for well-formed CLM output. The token string compresses to about half the zlib footprint of the original, and each character carries more information density.
Warning signs
| Symptom | Likely cause |
|---|---|
| complexity_ratio > 0.85 | Token string nearly as verbose as original; encoder may not have compressed meaningfully |
| complexity_ratio < 0.1 | Token string is extremely short; critical content may have been discarded |
| information_efficiency < 0.5 | Each token character carries far less information density than expected |
Relationship to other checks
Kolmogorov is a structural sanity check. It catches coarse failures (output too long, output too short) but cannot verify that specific semantic fields are present. Use Conditional Entropy for that.
A pass here with a failure in Conditional Entropy is the most common meaningful failure mode: the output looks structurally reasonable but dropped specific critical fields.
Example with verbose output
from clm_core import KolmogorovAnalyzer
original = """
Agent: Thank you for calling support. How can I help you today?
Customer: I was charged twice this month for my subscription.
Agent: I'm sorry about that. Let me look up your account...
Agent: I can see the duplicate charge. I'll initiate a refund now.
Reference number RFD-908712. Allow 3-5 business days.
Customer: Thank you so much.
"""
compressed = (
"[INTERACTION:SUPPORT:CHANNEL=VOICE] [DURATION=6m] [LANG=EN] "
"[DOMAIN:BILLING] [SERVICE:SUBSCRIPTION] "
"[CUSTOMER_INTENT:REPORT_DUPLICATE_CHARGE] "
"[AGENT_ACTIONS:ACCOUNT_VERIFIED→REFUND_INITIATED] "
"[RESOLUTION:ISSUE_RESOLVED] [STATE:RESOLVED] "
"[COMMITMENT:REFUND_3-5_DAYS] [ARTIFACT:REFUND_REF=RFD-908712] "
"[SENTIMENT:NEUTRAL→GRATEFUL]"
)
result = analyzer.analyze(original, compressed)
print(f"Original raw bytes: {result.original_bytes}")
print(f"Original zlib bytes: {result.original_zlib_bytes}")
print(f"CLM raw bytes: {result.clm_bytes}")
print(f"CLM zlib bytes: {result.clm_zlib_bytes}")
print(f"Complexity ratio: {result.complexity_ratio:.3f}")
print(f"Information density: {result.information_efficiency:.2f}x")
print(f"Passed: {result.passed}")
# Original raw bytes: 467
# Original zlib bytes: 312
# CLM raw bytes: 198
# CLM zlib bytes: 128
# Complexity ratio: 0.410
# Information density: 1.18x
# Passed: True
Next Steps
- Conditional Entropy — Semantic slot comparison (50% weight in retention score)
- Perplexity — LLM comprehension test (25% weight in retention score)
- Quality Gate Index — Unified report and verdict logic