
Structured Data Encoder (SDEncoderV2)

Overview

The Structured Data Encoder (SDEncoderV2) compresses structured datasets like knowledge bases, product catalogs, business rules, and configuration data. Unlike transcript or system prompt compression, this encoder works with tabular or nested structured data in JSON/dictionary format.

Key characteristics:
  • Header-first, row-based format with explicit nested schema scoping
  • Nested schemas defined in header: field:{nested,fields}
  • Values contain only data, no repeated schemas
  • Highly configurable field selection with importance thresholds
  • Maintains data integrity for downstream processing

Typical compression: 40-85% token reduction


What Gets Compressed

Structured data compression targets:

Primary Use Cases

| Data Type | Examples | Compression |
|---|---|---|
| Knowledge Bases | Help articles, FAQs, documentation entries | 70-80% |
| Product Catalogs | SKUs, configurations, specifications | 75-85% |
| Business Rules | Validation rules, workflows, decision trees | 70-80% |
| Configuration Data | System settings, feature flags, parameters | 80-90% |
| Recommendation Catalogs | Offers, suggestions, actions | 75-85% |

What Gets Preserved

✅ Critical fields: IDs, UUIDs, names, titles, types, status
✅ High-importance fields: Categories, tags, descriptions, priority
✅ Relationships: Parent-child, nested structures
✅ Data types: Strings, numbers, dates, arrays
✅ Field order: Configurable prioritization (IDs and priority first)

What Gets Optimized

  • Text fields: Compressed while preserving meaning
  • Long descriptions: Truncated to configurable length (uniform or per-field)
  • Low-importance fields: Excluded based on thresholds
  • Redundant information: Deduplicated across records
  • Metadata: Optional exclusion of timestamps, versions

Basic Usage

Simple Example

from clm_core import CLMEncoder, CLMConfig
from clm_core import SDCompressionConfig

# Knowledge Base articles
kb_catalog = [
    {
        "article_id": "KB-001",
        "title": "How to Reset Password",
        "content": "To reset your password, go to the login page and click...",
        "category": "Account",
        "tags": ["password", "security", "account"],
        "views": 1523,
        "last_updated": "2024-10-15",
    },
    {
        "article_id": "KB-002",
        "title": "Update Email Address",
        "content": "To update your email, navigate to settings...",
        "category": "Account",
        "tags": ["email", "settings"],
        "views": 892,
        "last_updated": "2024-10-12",
    }
]

# Configure compression
config = CLMConfig(
    lang="en",
    ds_config=SDCompressionConfig(
        auto_detect=True,
        required_fields=["article_id", "title"],
        field_importance={"tags": 0.8, "content": 0.9},
        max_truncation_length=100
    )
)

# Compress
encoder = CLMEncoder(cfg=config)
result = encoder.encode(kb_catalog)
print(result.compressed)

Output:

{article_id,title,content,category,tags,views}[KB-001,How to Reset Password,To reset your password; go to the login page and click...,Account,password+security+account,1523][KB-002,Update Email Address,To update your email; navigate to settings...,Account,email+settings,892]

Note: Commas in text values are escaped with semicolons to preserve the delimiter structure.

Compression: ~40% token reduction


Configuration Reference

SDCompressionConfig Parameters

from clm_core import SDCompressionConfig

config = SDCompressionConfig(
    auto_detect=True,                    # Auto-detect field importance based on name patterns and value heuristics
    drop_non_required_fields=True,       # Only keep required_fields (aggressive compression)
    required_fields=["id", "name"],      # Always include these
    excluded_fields=["metadata"],        # Never include these
    field_importance={"desc": 0.9},      # Custom importance scores (0.0-1.0)
    importance_threshold=0.5,            # Include fields above this
    max_truncation_length=200,           # Default truncation for long text (chars)
    max_truncation_mapping={"title": 50, "desc": 100},  # Per-field truncation lengths
    preserve_structure=True,             # Keep nested dicts/lists
)

Core Parameters

drop_non_required_fields (boolean, default: True)

  • Purpose: When enabled, only fields in required_fields are included
  • When True: Aggressive compression - only required fields kept
  • When False: Uses importance thresholds and auto-detection
  • Use case: Maximum compression when you know exactly which fields you need

auto_detect (boolean, default: True)

  • Purpose: Automatically detect and prioritize important fields
  • When True: Uses default importance scores and heuristics
  • When False: Relies solely on explicit configuration
  • Recommendation: Keep True unless you have very specific requirements

required_fields (list[str], optional)

  • Purpose: Fields that must always be included, regardless of importance
  • Examples: ["id", "article_id", "product_id", "title"]
  • Priority: Highest - never excluded
  • Use case: Ensuring critical identifiers are preserved

excluded_fields (list[str], optional)

  • Purpose: Fields that should never be included
  • Examples: ["metadata", "internal_notes", "debug_info"]
  • Priority: Absolute - always excluded
  • Use case: Removing sensitive or irrelevant data

field_importance (dict[str, float], optional)

  • Purpose: Custom importance scores for specific fields (0.0-1.0)
  • Example: {"description": 0.9, "tags": 0.8, "notes": 0.3}
  • Impact: Overrides default importance scores
  • Range: 0.0 (lowest) to 1.0 (highest)

importance_threshold (float, default: 0.5)

  • Purpose: Minimum importance score for field inclusion
  • Range: 0.0-1.0
  • Example: 0.7 = only include fields with importance ≥ 0.7
  • Trade-off: Higher threshold = more compression, less detail
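The threshold rule can be illustrated with a small sketch. This is a conceptual illustration only, not the library's internals; the importance scores below are taken from the default scale documented later in this page.

```python
# Conceptual sketch of threshold filtering (not clm_core's actual code).
DEFAULT_IMPORTANCE = {
    "id": 1.0,           # CRITICAL
    "title": 0.8,        # HIGH
    "subcategory": 0.5,  # MEDIUM
    "notes": 0.2,        # LOW
}

def fields_above(threshold: float) -> list[str]:
    """Return fields whose importance meets or exceeds the threshold."""
    return [f for f, score in DEFAULT_IMPORTANCE.items() if score >= threshold]

print(fields_above(0.5))  # ['id', 'title', 'subcategory']
print(fields_above(0.7))  # ['id', 'title']
```

Raising the threshold from 0.5 to 0.7 drops the MEDIUM field, trading detail for compression.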

max_truncation_length (int, default: 200)

  • Purpose: Default maximum length for text fields (in characters)
  • Impact: Long descriptions are truncated with ...
  • Recommendation: Adjust based on field type (50-500)
  • Note: Acts as the fallback for any string field not listed in max_truncation_mapping. Applied recursively to nested objects.

max_truncation_mapping (dict[str, int], optional)

  • Purpose: Per-field truncation limits that override max_truncation_length
  • Example: {"title": 50, "description": 200, "summary": 100}
  • Impact: Each key maps a field name to its maximum character length. Fields matching a key are truncated to that length (with ... appended); all other string fields fall back to max_truncation_length.
  • Nested objects: Truncation is applied recursively — field names are matched at every nesting level, so "description": 100 truncates a description field whether it appears at the top level or inside a nested object.
  • Use case: Different length limits for different field types (short titles, longer descriptions)
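The lookup-with-fallback behavior can be sketched as follows. This is an illustrative approximation; in particular, whether the appended "..." counts toward the character limit is an assumption, not something the library documents.

```python
# Sketch of per-field truncation with fallback (assumed semantics).
def limit_for(field: str, mapping: dict[str, int], default: int) -> int:
    """Per-field limit if present in the mapping, else the global default."""
    return mapping.get(field, default)

def truncate(value: str, limit: int) -> str:
    """Truncate long strings, appending '...' (assumed not to count toward limit)."""
    return value if len(value) <= limit else value[:limit] + "..."

mapping = {"title": 10, "description": 20}
print(truncate("A very long product title", limit_for("title", mapping, 200)))
print(limit_for("warranty", mapping, 200))  # falls back to the default: 200
```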

preserve_structure (boolean, default: True)

  • Purpose: Maintain nested dictionaries and lists
  • When True: Nested data preserved as-is
  • When False: Flattens nested structures
  • Use case: Keep True for complex hierarchical data

Default Field Importance

The encoder has built-in importance scores for common fields:

| Field | Importance | Priority | Typical Use |
|---|---|---|---|
| id, uuid, external_id, status | CRITICAL (1.0) | Always included | Identifiers, state |
| name, title, type, category, tags, description, priority, severity, resolution, owner, channel | HIGH (0.8) | Usually included | Core attributes |
| subcategory, details, assignee, department, language | MEDIUM (0.5) | Often included | Secondary info |
| notes, source, metadata, created_at, updated_at, version | LOW (0.2) | Rarely included | Metadata |

Default Simple Fields

These fields are formatted without transformation and appear first in the output:
  • id, uuid, title, name, type, priority, article_id, product_id

Default Field Order

Fields are ordered in the output as follows:
  1. id, uuid, priority, article_id, product_id, title, name, type
  2. Then other fields in their original order
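The default ordering described above can be sketched like this. The logic is assumed from the documented behavior, not copied from the library's implementation.

```python
# Sketch of the default field ordering (assumed logic, not clm_core's code).
PRIORITY = ["id", "uuid", "priority", "article_id", "product_id",
            "title", "name", "type"]

def order_fields(fields: list[str]) -> list[str]:
    """Priority fields first (in PRIORITY order), then the rest as given."""
    first = [f for f in PRIORITY if f in fields]
    rest = [f for f in fields if f not in PRIORITY]
    return first + rest

print(order_fields(["category", "title", "id", "views"]))
# → ['id', 'title', 'category', 'views']
```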

Importance Scale:
  • CRITICAL: 1.0 (always included unless explicitly excluded)
  • HIGH: 0.8 (included with default threshold)
  • MEDIUM: 0.5 (included with threshold ≤ 0.5)
  • LOW: 0.2 (excluded with default threshold)
  • NEVER: 0.0 (always excluded)

You can override these with the field_importance parameter:

config = SDCompressionConfig(
    field_importance={
        "created_at": 0.9,  # Override: LOW → HIGH
        "description": 0.4  # Override: HIGH → MEDIUM
    }
)

Configuration Examples

Example 1: Knowledge Base (Balanced)

# Balanced compression for help articles
config = SDCompressionConfig(
    auto_detect=True,
    required_fields=["article_id", "title"],
    field_importance={
        "content": 0.9,      # Very important
        "tags": 0.8,         # Important
        "views": 0.5         # Moderate importance
    },
    importance_threshold=0.5,
    max_truncation_length=150
)

Result: Includes ID, title, content (truncated), category, tags, views
Compression: ~50%


Example 2: Product Catalog (Conservative)

# Preserve more detail for product specs
config = SDCompressionConfig(
    auto_detect=True,
    required_fields=["sku", "name", "price"],
    field_importance={
        "specifications": 1.0,  # Critical
        "description": 0.9,
        "features": 0.8
    },
    importance_threshold=0.4,  # Lower threshold
    max_truncation_length=300,      # Longer fields
    preserve_structure=True    # Keep nested specs
)

Result: Comprehensive product information preserved
Compression: ~45%


Example 3: Business Rules (Minimal)

# Lightweight compression for rules
config = SDCompressionConfig(
    auto_detect=False,  # Explicit control
    required_fields=["rule_id", "condition", "action"],
    excluded_fields=["author", "created_at", "metadata"],
    field_importance={
        "condition": 1.0,
        "action": 1.0,
        "priority": 0.9
    },
    importance_threshold=0.8
)

Result: Only rule logic, no metadata
Compression: ~60%


Example 4: Nested Tables (Actions / Steps)

When your data contains fields that are arrays of objects with a consistent schema (e.g. actions, steps, line items), the encoder represents them as inline tables — the nested schema appears once in the header and each item becomes a compact row.

# Data with nested list-of-dicts
data = [
    {
        "id": "NBA-001",
        "title": "Billing Issue",
        "actions": [
            {"name": "Verify", "description": "Check the account status"},
            {"name": "Refund", "description": "Issue a partial refund"}
        ]
    },
    {
        "id": "NBA-002",
        "title": "Login Problem",
        "actions": [
            {"name": "Reset", "description": "Reset the password"},
            {"name": "Escalate", "description": "Forward to tier 2"}
        ]
    }
]

config = SDCompressionConfig(
    auto_detect=False,           # Explicit control over fields
    drop_non_required_fields=False,
    preserve_structure=True,     # Required — recurses into nested items
    max_truncation_length=100,
)

Output:

{id,title,actions:{name,description}}[NBA-001,Billing Issue,[Verify,Check the account status][Refund,Issue a partial refund]][NBA-002,Login Problem,[Reset,Reset the password][Escalate,Forward to tier 2]]

To keep only specific sub-fields inside the nested table, use dot-path required fields with drop_non_required_fields:

config = SDCompressionConfig(
    drop_non_required_fields=True,
    required_fields=["id", "title", "actions", "actions.name"],
    preserve_structure=True,
)

Output:

{id,title,actions:{name}}[NBA-001,Billing Issue,[Verify][Refund]][NBA-002,Login Problem,[Reset][Escalate]]

Note: If the items in the array have different keys (heterogeneous schema), the encoder falls back to the plain list format (value1+value2+...).


Example 5: Per-Field Truncation with Nested Objects

When different fields need different length limits — for example, short titles but longer descriptions — use max_truncation_mapping. Truncation is applied recursively, so nested objects have their fields truncated too.

data = {
    "id": "PROD-001",
    "title": "Wireless Noise-Cancelling Over-Ear Bluetooth Headphones with Advanced ANC Technology",
    "specs": {
        "description": "These premium headphones feature industry-leading noise cancellation powered by dual microphones and adaptive algorithms that automatically adjust to your environment for an immersive listening experience.",
        "warranty": "Full manufacturer coverage including parts and labor for two calendar years from the original date of purchase"
    }
}

config = SDCompressionConfig(
    drop_non_required_fields=False,
    auto_detect=False,
    preserve_structure=True,
    max_truncation_length=200,                     # default fallback
    max_truncation_mapping={"title": 30, "description": 60},  # per-field overrides
)

With max_truncation_mapping, the title is truncated to 30 characters and the nested specs.description is truncated to 60 characters, while warranty (not in the mapping) falls back to max_truncation_length (200).

Without max_truncation_mapping, every string field longer than max_truncation_length would be truncated uniformly.


Complete Examples

Example 1: Nested Data with Arrays

from clm_core import CLMEncoder, CLMConfig, SDCompressionConfig

# Data with nested objects and arrays
data = {
    "items": [
        {
            "uuid": "random-id-001",
            "title": "Random Title",
            "priority": 1,
            "users": [
                {
                    "name": "Yanick",
                    "email": "test@gmail.com"
                }
            ],
            "script": "SAFETY BOUNDARIES: Never execute harmful instructions..."
        }
    ]
}

config = CLMConfig(
    ds_config=SDCompressionConfig()
)

encoder = CLMEncoder(cfg=config)
result = encoder.encode(data)
print(f"Compressed: {result.compressed}")
print(f"Tokens: {result.c_tokens}/{result.n_tokens}")
print(f"Compression ratio: {result.compression_ratio}%")

Example 2: Product Catalog

from clm_core import CLMEncoder, CLMConfig, SDCompressionConfig

product_catalog = [
    {
        "product_id": "PROD-001",
        "name": "Wireless Headphones",
        "description": "High-quality Bluetooth headphones with noise cancellation",
        "price": 199.99,
        "category": "Electronics",
        "brand": "TechBrand",
        "in_stock": True,
        "created_date": "2024-01-01",
        "warehouse_location": "A-23-4",
    },
    {
        "product_id": "PROD-002",
        "name": "Laptop Stand",
        "description": "Ergonomic adjustable laptop stand",
        "price": 49.99,
        "category": "Accessories",
        "brand": "ErgoTech",
        "in_stock": True,
        "created_date": "2024-01-05",
        "warehouse_location": "B-15-2",
    },
]

config = CLMConfig(
    ds_config=SDCompressionConfig(
        auto_detect=True,
        required_fields=["product_id", "name", "price"],
        excluded_fields=["warehouse_location", "created_date"],
    )
)

encoder = CLMEncoder(cfg=config)
result = encoder.encode(product_catalog)
print(result.compressed)

Example 3: KB Articles

from clm_core import CLMEncoder, CLMConfig, SDCompressionConfig

kb_catalog = [
    {
        "article_id": "KB-001",
        "title": "How to Reset Password",
        "content": "To reset your password, go to the login page and click...",
        "category": "Account",
        "tags": ["password", "security", "account"],
        "views": 1523,
        "last_updated": "2024-10-15",
    }
]

config = CLMConfig(
    ds_config=SDCompressionConfig(
        auto_detect=True,
        required_fields=["article_id", "title"],
        field_importance={"tags": 0.8, "content": 0.9},
        max_truncation_length=100,
    )
)

encoder = CLMEncoder(cfg=config)
result = encoder.encode(kb_catalog)
print(result.compressed)

Output Format

Structure

Compressed structured data uses different formats for single items vs. arrays:

Single Item:

[value1,value2,value3]

Array of Items (header + rows):

{field1,field2,field3}[value1,value2,value3][value1,value2,value3]

Header (arrays only):
  • {fields}: Comma-separated list of included field names in order

Rows:
  • One bracketed row per record: [values]
  • Values comma-separated in the same order as the header
  • Values preserved with minimal transformation
  • Commas in values are escaped with semicolons

Data Type Handling

| Type | Original | Compressed |
|---|---|---|
| String | "Hello World" | Hello World |
| String with comma | "Hello, World" | Hello; World |
| Number | 1299.99 | 1299.99 |
| Boolean | true | True |
| Null | null | (excluded) |
| Date | "2024-10-15" | 2024-10-15 |
| Long text | "Very long description..." | Truncated to max_truncation_mapping[field] or max_truncation_length with ... |
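The value rules in the table above can be sketched as a single function. This is a conceptual approximation of the documented behavior, not the encoder's actual code.

```python
# Sketch of per-value serialization (assumed, not clm_core's implementation):
# booleans keep Python casing, lists join with "+", commas become semicolons,
# and None is excluded from the output entirely.
def serialize(value):
    if value is None:
        return None  # null values are excluded
    if isinstance(value, bool):
        return str(value)  # True / False
    if isinstance(value, list):
        return "+".join(str(v) for v in value)
    return str(value).replace(",", ";")  # escape delimiter commas

print(serialize("Hello, World"))   # Hello; World
print(serialize(["a", "b", "c"]))  # a+b+c
print(serialize(True))             # True
```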

Nested Structure Handling

Nested schemas are defined in the header, values contain only data:

| Structure | Header | Values |
|---|---|---|
| Nested object (multi-field) | specs:{cpu,ram} | [i7,16GB] |
| Nested object (single-field) | context:{task} | My task |
| Array of objects | items:{a,b} | [1,x][2,y] |
| Simple array | tags | tag1+tag2+tag3 |

Single-field bracket elision: When a nested object has only one field, the brackets are omitted because the header already makes the structure unambiguous. For example, context:{task} with value My task instead of [My task]. Multi-field nested objects and nested table rows always keep their brackets.

Example with nested actions:

Input:  {"id": "1", "actions": [{"name": "A", "desc": "X"}, {"name": "B", "desc": "Y"}]}
Output: {id,actions:{name,desc}}[1,[A,X][B,Y]]

The nested schema {name,desc} appears only once in the header as actions:{name,desc}, not repeated for each item.

Example Output

Input:

[
  {"article_id": "KB-001", "title": "Reset Password", "category": "Account"},
  {"article_id": "KB-002", "title": "Update Email", "category": "Account"}
]

Output:

{article_id,title,category}[KB-001,Reset Password,Account][KB-002,Update Email,Account]

CLMOutput Fields

The encoder returns a CLMOutput object with:

| Field | Type | Description |
|---|---|---|
| compressed | str | The compressed string output |
| original | str \| dict \| list | The original input data |
| n_tokens | int | Estimated input token count (~4 chars/token) |
| c_tokens | int | Estimated compressed token count |
| compression_ratio | float | Percentage of token reduction |
| component | str | Component name ("ds_compression") |
| metadata | dict | Additional compression metadata |

Token Estimation

Tokens are estimated at approximately 4 characters per token:

result = encoder.encode(data)
print(f"Input tokens: {result.n_tokens}")      # Estimated from original
print(f"Output tokens: {result.c_tokens}")     # Estimated from compressed
print(f"Compression: {result.compression_ratio}%")  # (1 - c_tokens/n_tokens) * 100
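The estimate above can be reproduced with a small sketch. The ~4 chars/token heuristic comes from this page; the exact rounding the library uses is an assumption.

```python
# Sketch of the documented token estimate (~4 characters per token).
# The rounding choices here are assumptions, not clm_core's exact code.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def compression_ratio(original: str, compressed: str) -> float:
    n = estimate_tokens(original)
    c = estimate_tokens(compressed)
    return round((1 - c / n) * 100, 2)

print(compression_ratio("a" * 400, "a" * 100))  # 75.0
```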

Automatic Fallback

If the compressed output would be larger than the original input, the encoder automatically falls back to the original:

result = encoder.encode(small_data)
# If compression increases size, result.compressed == original
# result.compression_ratio will be 0.0
# result.metadata["description"] will explain the fallback

This ensures compression never increases token usage.
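The fallback rule amounts to a simple size comparison. The sketch below shows the idea in isolation; the function name and return shape are illustrative, not clm_core's API.

```python
# Minimal sketch of the automatic fallback (assumed shape, not clm_core's API):
# if compression does not shrink the data, return the original with ratio 0.0.
def choose_output(original: str, compressed: str) -> tuple[str, float]:
    if len(compressed) >= len(original):
        return original, 0.0  # fallback: never increase size
    ratio = (1 - len(compressed) / len(original)) * 100
    return compressed, ratio

out, ratio = choose_output("short", "a-much-longer-compressed-form")
print(ratio)  # 0.0
```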

Whitespace Normalization

All compressed output is automatically normalized:
  • Multiple spaces, tabs, and newlines collapsed to single spaces
  • Leading and trailing whitespace trimmed

This ensures consistent, compact output regardless of input formatting.
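The normalization described above is equivalent to a one-line regex substitution, sketched here for reference (the library's actual implementation may differ in detail):

```python
import re

# Collapse all runs of whitespace to single spaces and trim the ends,
# matching the normalization behavior described above.
def normalize(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip()

print(normalize("  hello\t\n  world  "))  # hello world
```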


Best Practices

1. Define Required Fields

Always specify critical identifiers:

config = SDCompressionConfig(
    required_fields=["id", "name"],  # Never compress these out
)

2. Tune Importance Threshold

Balance compression vs. detail:

# High-volume, cost-sensitive
importance_threshold=0.7  # More compression

# Detail-critical applications
importance_threshold=0.4  # More detail

3. Set Appropriate Field Lengths

Use max_truncation_mapping for per-field control, or max_truncation_length as a uniform default:

# Per-field truncation (recommended when fields have different lengths)
config = SDCompressionConfig(
    max_truncation_mapping={
        "title": 50,           # Short fields
        "description": 200,    # Medium fields
        "content": 500,        # Long fields
    },
    max_truncation_length=200,  # Fallback for fields not in the mapping
)

# Uniform truncation (simpler, one limit for all)
config = SDCompressionConfig(
    max_truncation_length=200   # Applied to all string fields
)

4. Exclude Unnecessary Metadata

Remove timestamps and internal fields:

config = SDCompressionConfig(
    excluded_fields=[
        "created_at",
        "updated_at",
        "created_by",
        "internal_notes",
        "debug_info"
    ]
)

5. Test with Representative Data

Validate compression with real examples:

# Test with sample data
sample = catalog[:5]
result = encoder.encode(sample)

# Verify critical fields present
assert "ID" in result.compressed or "id" in result.compressed

# Check compression ratio
assert result.compression_ratio >= 30

Troubleshooting

Issue: Too Much Compression

Symptom: Critical fields missing

Solution:

config = SDCompressionConfig(
    required_fields=["id", "name", "missing_field"],
    # OR lower threshold
    importance_threshold=0.4
)

Issue: Not Enough Compression

Symptom: Compression ratio too low

Solution:

config = SDCompressionConfig(
    importance_threshold=0.7,  # Higher
    excluded_fields=["notes", "metadata", "timestamps"],
    max_truncation_length=100
)

Issue: Nested Data Lost

Symptom: Hierarchical structure flattened

Solution:

config = SDCompressionConfig(
    preserve_structure=True  # Must be True for nested data
)

Issue: Nested Tables Rendered as Stringified Dicts

Symptom: A list-of-dicts field shows as {'id': 1, ...}+{'id': 2, ...} instead of [1,...][2,...]

Cause: auto_detect uses field values to determine importance. If a field like wasSunny: False is dropped (falsy values get NEVER importance) while wasSunny: True is kept, the row schemas diverge and nested table encoding falls back to plain list stringification.

Solution: This is fixed in v0.0.10 — nested table filtering now derives the schema from the first item and applies it uniformly to all rows. If you're on an older version, specify required_fields explicitly:

config = SDCompressionConfig(
    required_fields=["hikes.id", "hikes.name", "hikes.distanceKm"],
    preserve_structure=True
)
