Free-Form Encoder

Part of Thread Encoder — Free-Form is one of the encoding modes within the Thread Encoder component (clm_core/components/thread_encoder). It handles unstructured text that does not follow the canonical Speaker: text turn format — emails, SMS messages, Slack threads, raw case notes, and any other prose-style conversation content.

Overview

Not all conversations arrive as clean two-sided transcripts. The Free-Form Encoder extends Thread Encoder to handle unstructured text by automatically detecting the input format and splitting prose into pseudo-turns before running the standard analysis pipeline.

Typical inputs:
- Email threads
- Slack / Teams message threads
- SMS conversations
- Support ticket notes
- Chat logs without consistent speaker labels

Typical compression: 70–85% token reduction (varies with prose density)

How It Works

Format Detection

When transcript_format="auto" (the default), the encoder runs detect_format() on the raw input before analysis:

Raw input
    │
    ▼
detect_format()
    │
    ├──▶ "turns"      → Standard transcript path (Speaker: text lines)
    │
    └──▶ "free_form"  → Free-Form path
              │
              ▼
         split_free_form()
              │
              ▼
         [Turn(speaker="unknown", text=...), ...]
              │
              ▼
         TranscriptAnalyzer.analyze(turns=...)
              │
              ▼
         Token assembly → ThreadOutput

Detection rule: If at least 50% of non-empty lines carry a recognized speaker prefix (Agent:, Customer:, or language equivalents), the input is classified as "turns". Otherwise it is classified as "free_form". An empty string is always "free_form".
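
The detection rule can be sketched in a few lines. This is a minimal illustration, not the library's implementation: the real detect_format() takes its speaker labels from a TranscriptPatterns object, whereas the hypothetical detect_format_sketch() below hard-codes only the two English labels named above and assumes a case-insensitive match.

```python
import re

# Hypothetical prefix pattern. The real encoder reads speaker labels from
# TranscriptPatterns; "Agent" and "Customer" are the only labels assumed here.
SPEAKER_PREFIX = re.compile(r"^\s*(agent|customer)\s*:", re.IGNORECASE)


def detect_format_sketch(text: str, threshold: float = 0.5) -> str:
    """Return "turns" if enough non-empty lines start with a speaker prefix."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return "free_form"  # an empty input is always free-form
    labeled = sum(1 for line in lines if SPEAKER_PREFIX.match(line))
    return "turns" if labeled / len(lines) >= threshold else "free_form"


transcript = "Agent: Hello, how can I help?\nCustomer: I was charged twice."
email = "Hi support team,\n\nI was charged twice this month.\n\nThanks,\nMelissa"

print(detect_format_sketch(transcript))  # turns
print(detect_format_sketch(email))       # free_form
print(detect_format_sketch(""))          # free_form
```

Blank lines are excluded before the ratio is computed, so the spacing of an email does not dilute the share of labeled lines.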

Splitting

split_free_form() splits unstructured text into chunks that become turns:

Paragraph mode — blank-line-separated blocks are used as chunk boundaries (preferred)
Line-by-line fallback — when no blank lines exist (single paragraph), each non-empty line becomes a separate turn

Each chunk is wrapped as Turn(speaker="unknown", text=chunk). The full NLP analysis pipeline then runs on these turns exactly as it does for a labeled transcript.
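
The two splitting modes can be approximated as follows. This is a sketch under stated assumptions rather than the actual split_free_form(): the Turn dataclass is a stand-in with only the two fields shown above, the name split_free_form_sketch() is hypothetical, and returning an empty list for empty input is an assumption.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    # Stand-in for the library's Turn model; only the two fields shown above.
    speaker: str
    text: str


def split_free_form_sketch(text: str) -> List[Turn]:
    """Split prose into pseudo-turns: paragraphs first, then lines."""
    stripped = text.strip()
    if not stripped:
        return []  # assumption: empty input yields no turns
    # Paragraph mode: one or more blank (whitespace-only) lines separate chunks.
    chunks = [c.strip() for c in re.split(r"\n\s*\n", stripped) if c.strip()]
    if len(chunks) <= 1:
        # Line-by-line fallback: no blank lines, so each non-empty line is a turn.
        chunks = [ln.strip() for ln in stripped.splitlines() if ln.strip()]
    return [Turn(speaker="unknown", text=chunk) for chunk in chunks]


chat = "Hey, my export is stuck.\n\nHi! Can you share the export ID?\n\nSure, EXP-7712."
notes = "Customer called about a locked account.\nRan identity verification.\nUnlocked the account."

print(len(split_free_form_sketch(chat)))   # 3 (paragraph mode)
print(len(split_free_form_sketch(notes)))  # 3 (line-by-line fallback)
```

Paragraph mode keeps a multi-line paragraph together as one turn, which is why it is preferred whenever blank lines are present.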

Quick Start

from clm_core import CLMConfig, CLMEncoder

# Email thread — no speaker labels
email_thread = """
Hi support team,

I noticed my account was charged twice this month — one on the 2nd and another on the 3rd.
Can you please look into this? My account email is melissa.jordan@example.com.

Thanks,
Melissa

---

Hi Melissa,

Thanks for reaching out. I can confirm the duplicate charge — it was caused by a payment
retry that fired after the first transaction already succeeded.

I've initiated a full refund on the second charge. You'll see it within 3–5 business days.
Your reference number is RFD-908712.

Best,
Raj – Support Team
"""

cfg = CLMConfig(lang="en")
encoder = CLMEncoder(cfg=cfg)

result = encoder.encode(
    input_=email_thread,
    metadata={
        "call_id": "EMAIL-0042",
        "channel": "email",
    }
)

print(result.compressed)

Output:

[INTERACTION:SUPPORT:CHANNEL=EMAIL]
[LANG=EN]
[DOMAIN:BILLING]
[SERVICE:SUBSCRIPTION]
[CUSTOMER_INTENT:REPORT_DUPLICATE_CHARGE]
[CONTEXT:EMAIL_PROVIDED]
[AGENT_ACTIONS:REFUND_INITIATED]
[RESOLUTION:ISSUE_RESOLVED]
[STATE:RESOLVED]
[COMMITMENT:REFUND_3-5_DAYS]
[ARTIFACT:REFUND_REF=RFD-908712]
[SENTIMENT:NEUTRAL→GRATEFUL]

The `transcript_format` Parameter

ThreadEncoder.encode() (and therefore CLMEncoder.encode()) exposes a transcript_format parameter:

Value	Behaviour
`"auto"`	Default. Heuristic detection — inspects the input and picks `"turns"` or `"free_form"` automatically
`"turns"`	Skips detection; treats the input as a labeled `Speaker: text` transcript
`"free_form"`	Skips detection; forces the free-form splitting path regardless of content

# Force free-form mode (e.g. input is known to be an email)
result = encoder.encode(
    input_=email_thread,
    metadata={"channel": "email"},
    transcript_format="free_form",
)

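# Force turn mode (e.g. input always has speaker labels)
# Illustrative labeled transcript
transcript = """
Agent: Thanks for calling, how can I help?
Customer: My account is locked.
"""
result = encoder.encode(
    input_=transcript,
    metadata={"channel": "voice"},
    transcript_format="turns",
)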

The resolved format is recorded in result.metadata["transcript_format"]:

print(result.metadata["transcript_format"])  # "free_form" or "turns"
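
How the three parameter values from the table above resolve to one of the two paths can be sketched as below. The function resolve_format() and the stub detector are hypothetical, and raising ValueError for an unrecognized value is an assumption; the source does not say how the encoder treats invalid values.

```python
from typing import Callable

VALID_FORMATS = ("auto", "turns", "free_form")


def resolve_format(
    text: str,
    detector: Callable[[str], str],
    transcript_format: str = "auto",
) -> str:
    """Return the concrete format ("turns" or "free_form") for an input."""
    if transcript_format not in VALID_FORMATS:
        # Assumption: invalid values are rejected rather than silently ignored.
        raise ValueError(f"unknown transcript_format: {transcript_format!r}")
    if transcript_format == "auto":
        return detector(text)  # heuristic detection
    return transcript_format   # forced: detection is skipped


def stub_detector(text: str) -> str:
    # Placeholder for detect_format(): any text containing "Agent:" is a transcript.
    return "turns" if "Agent:" in text else "free_form"


print(resolve_format("Agent: Hello", stub_detector))               # turns
print(resolve_format("Agent: Hello", stub_detector, "free_form"))  # free_form
print(resolve_format("Hi team,", stub_detector, "turns"))          # turns
```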

Supported Input Shapes

Email Thread

email = """
Customer email — no consistent speaker labels, prose paragraphs separated by blank lines.

Each paragraph becomes one turn.
"""

result = encoder.encode(input_=email, metadata={"channel": "email"})

Slack / Chat Thread

slack = """
Hey, my export has been stuck for 2 hours — is something broken?

Hi! Sorry to hear that. Can you share the export ID?

Sure — EXP-7712.

Got it. Our jobs queue had a backlog earlier today. I've manually re-queued yours.
It should complete within the next 15 minutes.
"""

result = encoder.encode(input_=slack, metadata={"channel": "chat"})

Raw Case Notes

notes = """
Customer called about a locked account. Says they never changed their password.
Ran identity verification — passed.
Unlocked the account and reset temporary credentials.
Customer confirmed access was restored.
"""

result = encoder.encode(input_=notes, metadata={"channel": "voice"})

Output

Free-form inputs produce the same ThreadOutput as structured transcripts, with the same attributes:

result.compressed          # CLM token string
result.n_tokens            # Estimated original token count
result.c_tokens            # Estimated compressed token count
result.compression_ratio   # Percent reduction
result.to_dict()           # Typed Python dict (same schema as transcript mode)

See the Thread Encoder index for the full to_dict() schema and field reference.
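
As a worked example of the percent-reduction figure, the sketch below computes it from the two token counts. The formula is the usual definition of percent reduction and is assumed here; the library may round or estimate differently, and percent_reduction() is a hypothetical helper, not part of clm_core.

```python
def percent_reduction(n_tokens: int, c_tokens: int) -> float:
    """Percent of the original tokens removed by compression."""
    if n_tokens <= 0:
        return 0.0  # avoid dividing by zero on empty input
    return (1 - c_tokens / n_tokens) * 100


# 400 original tokens compressed to 100 tokens
print(percent_reduction(400, 100))  # 75.0
```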

Differences from Transcript Mode

Aspect	Transcript (Turns)	Free-Form
Input format	`Speaker: text` per line	Prose paragraphs or lines
Speaker detection	Labeled (`Agent`, `Customer`, …)	All turns attributed to `"unknown"`
Turn splitting	Built-in, one turn per line	`split_free_form()` (blank-line paragraphs first, then line-by-line fallback)
Analysis pipeline	Full NLP pipeline	Same full NLP pipeline
Token output schema	CLM v2	Same CLM v2
`transcript_format`	`"turns"`	`"free_form"`

Because every turn is attributed to "unknown" in free-form mode, the encoder cannot use speaker labels to separate customer utterances from agent utterances. Domain, intent, and action classification therefore rely more heavily on vocabulary and pattern matching. Compression ratios are typically slightly lower than for well-labeled transcripts.

Configuration

Free-Form uses the same CLMConfig as the rest of Thread Encoder. No additional settings are required.

cfg = CLMConfig(
    lang="en",                        # Language: en, pt, es, fr
    redaction_pattern=r"\[.*?\]"      # Optional: detect redacted PII fields
)

Parameter	Type	Default	Description
`lang`	`str`	`"en"`	Language for NLP model and dictionary
`redaction_pattern`	`str`	Built-in	Regex to detect redacted PII in the text

Low-Level API

The two free-form utilities are importable directly from the splitter module of the free_form subpackage:

from clm_core.components.thread_encoder.free_form.splitter import (
    detect_format,
    split_free_form,
)
from clm_core.components.thread_encoder.patterns import TranscriptPatterns

patterns = TranscriptPatterns(...)  # or load from dictionary

# Classify raw text
fmt = detect_format(text, patterns)   # "turns" | "free_form"

# Split into turns
turns = split_free_form(text)
# [Turn(speaker="unknown", text="..."), Turn(speaker="unknown", text="..."), ...]

Next Steps

Thread Encoder Overview — Architecture, ThreadOutput reference, data models
Transcript Encoder — Full reference for labeled two-sided transcripts
Advanced: CLM Dictionary — Language-specific vocabularies
Advanced: Token Hierarchy — Token structure deep-dive