Perplexity Analysis
Overview
PerplexityAnalyzer validates that an LLM produces equivalent answers when given the compressed token string versus the original transcript. It directly tests CLM's core claim: an LLM can understand compressed tokens natively without fine-tuning.
Core question: Does the compressed token string give an LLM the same information as the original?
Method: Send both the original and compressed input to an LLM (Anthropic or OpenAI) with a fixed structured extraction task, then compare the responses across three dimensions.
This check carries a 25% weight in the final retention score. It supports Anthropic and OpenAI backends, and falls back to heuristic scoring when no API client is configured.
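Per the weights listed here and in Next Steps (25% Kolmogorov, 50% conditional entropy, 25% perplexity), the retention score can be sketched as a linear blend of per-check scores. The linear aggregation form is an assumption; the actual gate may combine checks differently:

```python
def retention_score(kolmogorov: float, conditional: float, perplexity: float) -> float:
    """Blend per-check scores using the documented weights (illustrative sketch;
    the real CompressionQualityGate aggregation may differ in detail)."""
    return 0.25 * kolmogorov + 0.50 * conditional + 0.25 * perplexity
```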
How It Works
API Mode
- Send original + EVALUATION_TASK to the configured LLM
- Send compressed + EVALUATION_TASK to the same LLM
- Parse both responses as JSON
- Compare across three dimensions
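The per-prompt round trip can be sketched as follows. Here llm_call is a hypothetical stand-in for whichever provider client is configured, not a clm_core API:

```python
import json
import time

def run_task(llm_call, text: str, task: str):
    """Send text + task to the LLM, time the round trip, and parse the JSON reply."""
    start = time.perf_counter()
    raw = llm_call(f"{text}\n\n{task}")   # llm_call: provider client stand-in
    latency_ms = (time.perf_counter() - start) * 1000
    return json.loads(raw), latency_ms
```

The same helper runs twice, once per input, which is what makes the latency comparison meaningful.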
Evaluation Task
Both prompts use the same structured JSON extraction task by default:
{
"primary_issue": "<what the customer needed>",
"resolution": "<how it was resolved>",
"sentiment": "<customer sentiment>",
"follow_up_needed": true|false,
"key_facts": ["<fact1>", "<fact2>", "<fact3>"]
}
You can override this with a custom perplexity_task argument in gate.analyze() or by subclassing — see Tuning the evaluation task.
Scoring Dimensions
| Dimension | Weight | Method |
|---|---|---|
| Fact overlap | 40% | Fraction of original key_facts with 2+ word overlap in compressed response |
| Field-level Jaccard | 40% | Jaccard similarity of primary_issue, resolution, sentiment across both responses |
| JSON structure preservation | 20% | All five expected keys present in compressed response |
comprehension_score = fact_score × 0.4
+ response_similarity × 0.4
+ structure_preserved × 0.2
Pass condition: comprehension_score ≥ 0.82
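A minimal sketch of the three dimensions and their blend, assuming the 2+ word-overlap rule and whitespace word splitting described above (the library's exact tokenization may differ):

```python
import json

def word_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercase word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)

def fact_preserved(fact: str, response_text: str) -> bool:
    """A fact counts as preserved when 2+ of its words appear in the response."""
    hits = sum(1 for w in set(fact.lower().split()) if w in response_text.lower())
    return hits >= 2

def comprehension_score(orig: dict, comp: dict) -> float:
    facts = orig.get("key_facts", [])
    comp_text = json.dumps(comp)
    fact_score = sum(fact_preserved(f, comp_text) for f in facts) / len(facts) if facts else 1.0
    fields = ("primary_issue", "resolution", "sentiment")
    response_similarity = sum(
        word_jaccard(str(orig.get(k, "")), str(comp.get(k, ""))) for k in fields
    ) / len(fields)
    expected = {"primary_issue", "resolution", "sentiment", "follow_up_needed", "key_facts"}
    structure_preserved = expected <= comp.keys()
    return fact_score * 0.4 + response_similarity * 0.4 + structure_preserved * 0.2
```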
Heuristic Fallback (Offline Mode)
When no llm_client is configured, or the API key cannot be loaded, the analyzer falls back to token overlap scoring:
orig_tokens = set of uppercase tokens (3+ chars) in original
clm_tokens = set of uppercase tokens (3+ chars) in compressed
coverage = |orig_tokens ∩ clm_tokens| / |orig_tokens|
comprehension_score = coverage
This is a rough proxy that measures how much of the original's vocabulary is present in the compressed output. It is intentionally conservative — token overlap undercounts semantic preservation — and is suitable for CI/CD gating where you want to catch obvious failures without API calls.
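The fallback can be sketched like this; the exact token pattern is an assumption, chosen to match the "uppercase tokens, 3+ characters" description:

```python
import re

def heuristic_comprehension(original: str, compressed: str) -> float:
    """Offline fallback: how much of the original's vocabulary survives compression."""
    orig_tokens = set(re.findall(r"[A-Z0-9]{3,}", original.upper()))
    clm_tokens = set(re.findall(r"[A-Z0-9]{3,}", compressed.upper()))
    if not orig_tokens:
        return 1.0  # nothing to cover
    return len(orig_tokens & clm_tokens) / len(orig_tokens)
```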
Skip perplexity entirely (synthetic perfect score):
report = gate.analyze(
original=transcript,
compressed=compressed,
structured=structured,
run_perplexity=False, # perplexity result is a synthetic pass, no scoring done
)
PerplexityConfig
PerplexityConfig holds the LLM connection settings passed to both PerplexityAnalyzer and CompressionQualityGate.
from clm_core import PerplexityConfig
cfg = PerplexityConfig(
llm_model="claude-haiku-4-5-20251001",
api_key="sk-ant-...",
host_url="https://api.anthropic.com",
temperature=0.0, # optional, default 0.0
)
| Field | Type | Default | Description |
|---|---|---|---|
| llm_model | str | — | Model identifier passed to the API |
| api_key | str | — | API key for the chosen provider |
| host_url | str | — | Base URL for the API endpoint |
| temperature | float | 0.0 | Sampling temperature |
PerplexityResult
class PerplexityResult(BaseModel):
original_response_tokens: int # Tokens in original response
compressed_response_tokens: int # Tokens in compressed response
original_latency_ms: float # LLM latency for original prompt
compressed_latency_ms: float # LLM latency for compressed prompt
latency_improvement: float # % faster with compressed input
response_similarity: float # 0–1, Jaccard field similarity
structure_preserved: bool # All expected JSON keys present
key_facts_preserved: list[str] # Facts present in both responses
facts_lost: list[str] # Facts from original missing in compressed
comprehension_score: float # Composite 0–1 score
passed: bool # True if comprehension_score ≥ 0.82
Standalone Usage
With Anthropic
from clm_core import PerplexityAnalyzer, PerplexityConfig
cfg = PerplexityConfig(
llm_model="claude-haiku-4-5-20251001",
api_key="sk-ant-...",
host_url="https://api.anthropic.com",
)
analyzer = PerplexityAnalyzer(llm_client="anthropic", cfg=cfg)
result = analyzer.analyze(
original=transcript_text,
compressed=clm_token_string,
verbose=True,
)
print(f"Comprehension score: {result.comprehension_score:.2f}")
print(f"Response similarity: {result.response_similarity:.2f}")
print(f"Structure preserved: {result.structure_preserved}")
print(f"Latency improvement: {result.latency_improvement:.1f}%")
print(f"Facts preserved: {result.key_facts_preserved}")
print(f"Facts lost: {result.facts_lost}")
print(f"Passed: {result.passed}")
With OpenAI
from clm_core import PerplexityAnalyzer, PerplexityConfig
cfg = PerplexityConfig(
llm_model="gpt-4o-mini",
api_key="sk-...",
host_url="https://api.openai.com/v1",
)
analyzer = PerplexityAnalyzer(llm_client="openai", cfg=cfg)
result = analyzer.analyze(
original=transcript_text,
compressed=clm_token_string,
)
With a custom task
result = analyzer.analyze(
original=transcript_text,
compressed=clm_token_string,
task="""
Given the above context, respond with a JSON object containing:
{
"billing_issue": "<the specific billing problem>",
"refund_amount": "<amount if applicable>",
"resolution_status": "<resolved|pending|escalated>",
"key_facts": ["<fact1>", "<fact2>", "<fact3>"]
}
Respond ONLY with the JSON object.
""",
)
Latency as a Secondary Signal
latency_improvement is not used in the pass/fail decision, but it is captured as a secondary benefit metric. CLM's compressed tokens are shorter, so the LLM processes them faster. Negative values (compressed is slower) can indicate that the compressed string is unexpectedly verbose or that network variability dominated in a short test run.
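The percentage is presumably the relative latency drop between the two prompts; this formula is an illustrative assumption:

```python
def latency_improvement(original_ms: float, compressed_ms: float) -> float:
    """Percent faster with the compressed input; negative when it was slower."""
    return (original_ms - compressed_ms) / original_ms * 100.0
```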
Interpreting Results
Typical healthy output
comprehension_score: 0.89
response_similarity: 0.86
structure_preserved: True
latency_improvement: 14.2%
facts_preserved: ["duplicate charge confirmed", "refund initiated", "grateful customer"]
facts_lost: []
passed: True
Warning signs
| Symptom | Likely cause |
|---|---|
| structure_preserved: False | LLM produced malformed JSON from compressed input; the token string may be ambiguous |
| response_similarity < 0.5 | Core fields (resolution, sentiment) diverged significantly between responses |
| facts_lost contains critical facts | Key information is not recoverable from the compressed string |
| comprehension_score borderline (0.78–0.82) | Likely LLM variability; re-run, or accept the acceptable verdict |
Relationship to verdict
Perplexity has no veto power. A failed perplexity check alongside a passed conditional entropy check results in acceptable, not high_risk. This is intentional: LLM response variability means perplexity can fail on a correctly compressed string, and that single failure should not block production use.
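The no-veto rule above can be sketched as a decision function. This covers only the documented rule: the real gate also weighs the Kolmogorov check, and any verdict name other than acceptable and high_risk is a placeholder:

```python
def verdict(conditional_passed: bool, perplexity_passed: bool) -> str:
    """Sketch of the documented no-veto rule only; the real CompressionQualityGate
    considers more inputs and may use additional verdict levels."""
    if not conditional_passed:
        return "high_risk"
    if not perplexity_passed:
        return "acceptable"  # perplexity has no veto power
    return "pass"            # placeholder name for the fully-passing verdict
```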
If perplexity consistently fails for a given compression pattern, investigate whether the CLM token format is being misunderstood by the model — the EVALUATION_TASK prompt in PerplexityAnalyzer can be tuned to your specific use case.
Configuration
Default models
| Client | Default model |
|---|---|
| anthropic | claude-haiku-4-5-20251001 |
| openai | gpt-5-nano-2025-08-07 |
Pass a different model via PerplexityConfig.llm_model.
Tuning the evaluation task
The EVALUATION_TASK prompt is designed for general customer service transcripts. For domain-specific validation, pass a custom task via perplexity_task at call time (no subclassing needed):
report = gate.analyze(
original=transcript,
compressed=compressed,
structured=structured,
run_perplexity=True,
perplexity_task="""
Given the above context, respond with a JSON object containing:
{
"billing_issue": "<the specific billing problem>",
"refund_amount": "<amount if applicable>",
"refund_reference": "<reference number if applicable>",
"resolution_status": "<resolved|pending|escalated>",
"key_facts": ["<fact1>", "<fact2>", "<fact3>"]
}
Respond ONLY with the JSON object.
""",
)
Or override the class constant via subclass for a permanent change:
from clm_core import PerplexityAnalyzer
class BillingPerplexityAnalyzer(PerplexityAnalyzer):
EVALUATION_TASK = """
Given the above context, respond with a JSON object containing:
{
"billing_issue": "<the specific billing problem>",
"refund_amount": "<amount if applicable>",
"refund_reference": "<reference number if applicable>",
"resolution_status": "<resolved|pending|escalated>",
"key_facts": ["<fact1>", "<fact2>", "<fact3>"]
}
Respond ONLY with the JSON object.
"""
COMPREHENSION_THRESHOLD = 0.90
CI/CD Integration
For pipelines without API access, gate on Kolmogorov and Conditional Entropy only:
from clm_core import CompressionQualityGate
gate = CompressionQualityGate() # no llm_client → heuristic fallback for perplexity
report = gate.analyze(
original=transcript,
compressed=compressed,
structured=structured,
run_perplexity=False, # skip perplexity entirely — gets synthetic perfect score
)
# Gate on the two deterministic checks
if report.conditional.passed and report.kolmogorov.passed:
print("Quality gate passed (offline mode)")
else:
print(f"Quality gate failed: {report.verdict}")
exit(1)
Next Steps
- Kolmogorov Complexity — Structural information equivalence (25% weight)
- Conditional Entropy — Semantic slot comparison (50% weight)
- Quality Gate Index — Unified report, verdict logic, and retention score