Kolmogorov Complexity Analysis

Overview

KolmogorovAnalyzer approximates Kolmogorov complexity using zlib compression to verify that the CLM token string preserves the structural information density of the original transcript. It is the first of three quality gate checks and carries a 25% weight in the final retention score.

Core insight: K(x) ≈ len(zlib.compress(x))

If two strings encode the same information, compressing them to their theoretical minimum should yield similar sizes. The CLM token string is expected to be simpler and shorter than the original — that is the point of compression — but it should not collapse to near-zero complexity, which would indicate that meaning was discarded rather than restructured.


Theory

Kolmogorov Complexity

Kolmogorov complexity K(x) is the length of the shortest program that produces string x. It is uncomputable in general, but zlib at maximum compression level (level=9) provides a practical approximation:

K(x) ≈ len(zlib.compress(x, level=9))
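This approximation can be computed directly with the standard library. The helper name `kolmogorov_len` below is illustrative, not part of clm_core:

```python
import zlib

def kolmogorov_len(text: str) -> int:
    """Approximate K(x) as the size of the zlib-compressed string at level 9."""
    return len(zlib.compress(text.encode("utf-8"), 9))

# Highly redundant text compresses far below its raw size,
# reflecting its low Kolmogorov complexity.
redundant = "the cat sat on the mat. " * 40
print(len(redundant.encode("utf-8")), kolmogorov_len(redundant))
```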

Two metrics are derived from this approximation:

complexity_ratio

complexity_ratio = zlib_size(compressed) / zlib_size(original)

Measures how much smaller the token string's compressed form is relative to the original's compressed form. A ratio well below 1.0 is expected and healthy — CLM removes redundancy. A ratio above the threshold suggests the token string is unexpectedly close to the original in raw information content, which may indicate encoding failure.

Pass condition: complexity_ratio ≤ 0.85
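A minimal sketch of this metric, assuming the formula above (`complexity_ratio` here is a standalone function, not the clm_core implementation):

```python
import zlib

def complexity_ratio(original: str, clm_tokens: str) -> float:
    """Ratio of zlib-compressed sizes: clm_zlib / original_zlib."""
    orig_zlib = len(zlib.compress(original.encode("utf-8"), 9))
    clm_zlib = len(zlib.compress(clm_tokens.encode("utf-8"), 9))
    return clm_zlib / orig_zlib

# Hypothetical inputs for illustration only.
original = "Customer reports a duplicate charge on their subscription. " * 10
clm = "[DOMAIN:BILLING] [INTENT:DUPLICATE_CHARGE]"
ratio = complexity_ratio(original, clm)
print(f"{ratio:.3f}", "PASS" if ratio <= 0.85 else "FAIL")
```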

information_efficiency

information_efficiency = (clm_zlib_bytes / clm_bytes)
                       ÷ (original_zlib_bytes / original_bytes)

Compares compressed bytes per raw character in the token string against the same density in the original. CLM tokens are designed to pack more meaning per character than natural language, so the token string should score higher on this metric.

A value above 1.0 means each character in the token string carries more compressed information than each character in the original. Higher is better.
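The density comparison can be sketched as two standalone functions, assuming the formula above (these are illustrative helpers, not the clm_core API):

```python
import zlib

def density(text: str) -> float:
    """Compressed bytes per raw byte: a proxy for information density."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw, 9)) / len(raw)

def information_efficiency(original: str, clm_tokens: str) -> float:
    """Ratio of the token string's density to the original's."""
    return density(clm_tokens) / density(original)
```

A redundant original (low density) paired with a compact, non-repetitive token string (high density) is what pushes this ratio above 1.0.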


KolmogorovModel

class KolmogorovModel(BaseModel):
    original_bytes: int          # Raw byte size of original
    compressed_bytes: int        # Raw byte size of CLM token string
    clm_bytes: int               # Raw byte size of CLM token string (same as compressed_bytes)
    original_zlib_bytes: int     # zlib size of original
    clm_zlib_bytes: int          # zlib size of CLM token string
    complexity_ratio: float      # clm_zlib / original_zlib
    information_efficiency: float  # density ratio (compressed / original)
    passed: bool                 # True if complexity_ratio ≤ 0.85

Thresholds

Metric                  Threshold   Meaning
complexity_ratio        ≤ 0.85      CLM output is meaningfully simpler than source
information_efficiency  none        Informational only; higher is better

The threshold is intentionally lenient. Structurally simpler output is the expected and desired outcome. The check exists to catch pathological cases where the token string has collapsed to near-random content or near-empty output.
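Putting the formulas and the model fields together, the full check could be re-implemented roughly as follows. This is a sketch based on the documented formulas; the actual clm_core internals may differ, and `KolmogorovResult` is a stand-in for the Pydantic `KolmogorovModel`:

```python
import zlib
from dataclasses import dataclass

@dataclass
class KolmogorovResult:
    original_bytes: int
    clm_bytes: int
    original_zlib_bytes: int
    clm_zlib_bytes: int
    complexity_ratio: float
    information_efficiency: float
    passed: bool

def analyze(original: str, clm_tokens: str, threshold: float = 0.85) -> KolmogorovResult:
    orig_raw = original.encode("utf-8")
    clm_raw = clm_tokens.encode("utf-8")
    orig_zlib = len(zlib.compress(orig_raw, 9))
    clm_zlib = len(zlib.compress(clm_raw, 9))
    ratio = clm_zlib / orig_zlib
    # Density ratio: compressed bytes per raw byte, CLM vs. original.
    efficiency = (clm_zlib / len(clm_raw)) / (orig_zlib / len(orig_raw))
    return KolmogorovResult(
        original_bytes=len(orig_raw),
        clm_bytes=len(clm_raw),
        original_zlib_bytes=orig_zlib,
        clm_zlib_bytes=clm_zlib,
        complexity_ratio=ratio,
        information_efficiency=efficiency,
        passed=ratio <= threshold,
    )
```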


Standalone Usage

from clm_core import KolmogorovAnalyzer

analyzer = KolmogorovAnalyzer()
result = analyzer.analyze(original_text, compressed_token_string)

print(f"Complexity ratio:     {result.complexity_ratio:.3f}")
print(f"Information density:  {result.information_efficiency:.2f}x")
print(f"Passed:               {result.passed}")

Interpreting Results

Typical healthy output

complexity_ratio:     0.41
information_efficiency: 1.18x
passed:               True

A ratio around 0.4–0.6 is typical for well-formed CLM output. The token string compresses to about half the zlib footprint of the original, and each character carries more information density.

Warning signs

Symptom                        Likely cause
complexity_ratio > 0.85        Token string nearly as verbose as original; encoder may not have compressed meaningfully
complexity_ratio < 0.1         Token string is extremely short; critical content may have been discarded
information_efficiency < 0.5   Each token character carries far less information density than expected
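The warning conditions above can be scripted as a small helper; `triage` is a hypothetical function for illustration, not part of clm_core:

```python
def triage(complexity_ratio: float, information_efficiency: float) -> list[str]:
    """Map metric values to the warning signs in the table above."""
    warnings = []
    if complexity_ratio > 0.85:
        warnings.append("token string nearly as verbose as original")
    if complexity_ratio < 0.1:
        warnings.append("token string extremely short; content may be lost")
    if information_efficiency < 0.5:
        warnings.append("low information density per token character")
    return warnings
```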

Relationship to other checks

Kolmogorov is a structural sanity check. It catches coarse failures (output too long, output too short) but cannot verify that specific semantic fields are present. Use Conditional Entropy for that.

A pass here with a failure in Conditional Entropy is the most common meaningful failure mode: the output looks structurally reasonable but dropped specific critical fields.


Example with verbose output

from clm_core import KolmogorovAnalyzer

analyzer = KolmogorovAnalyzer()

original = """
Agent: Thank you for calling support. How can I help you today?
Customer: I was charged twice this month for my subscription.
Agent: I'm sorry about that. Let me look up your account...
Agent: I can see the duplicate charge. I'll initiate a refund now.
         Reference number RFD-908712. Allow 3-5 business days.
Customer: Thank you so much.
"""

compressed = (
    "[INTERACTION:SUPPORT:CHANNEL=VOICE] [DURATION=6m] [LANG=EN] "
    "[DOMAIN:BILLING] [SERVICE:SUBSCRIPTION] "
    "[CUSTOMER_INTENT:REPORT_DUPLICATE_CHARGE] "
    "[AGENT_ACTIONS:ACCOUNT_VERIFIED→REFUND_INITIATED] "
    "[RESOLUTION:ISSUE_RESOLVED] [STATE:RESOLVED] "
    "[COMMITMENT:REFUND_3-5_DAYS] [ARTIFACT:REFUND_REF=RFD-908712] "
    "[SENTIMENT:NEUTRAL→GRATEFUL]"
)

result = analyzer.analyze(original, compressed)

print(f"Original raw bytes:   {result.original_bytes}")
print(f"Original zlib bytes:  {result.original_zlib_bytes}")
print(f"CLM raw bytes:        {result.clm_bytes}")
print(f"CLM zlib bytes:       {result.clm_zlib_bytes}")
print(f"Complexity ratio:     {result.complexity_ratio:.3f}")
print(f"Information density:  {result.information_efficiency:.2f}x")
print(f"Passed:               {result.passed}")
# Original raw bytes:   467
# Original zlib bytes:  312
# CLM raw bytes:        198
# CLM zlib bytes:       128
# Complexity ratio:     0.410
# Information density:  1.18x
# Passed:               True

Next Steps