Parsing Options

Configure how Rugo reads and processes Parquet metadata.

Available Options

All metadata reading functions accept these keyword arguments:

schema_only

Type: bool
Default: False

Return only the top-level schema without row group details.

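# Assumes parquet_meta has already been imported from the Rugo package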
metadata = parquet_meta.read_metadata(
    "file.parquet",
    schema_only=True
)

# Schema is available
print(metadata["schema_columns"])

# Row groups list is empty
print(len(metadata["row_groups"]))  # 0

Use when:

  • You only need to understand the file structure
  • Inspecting very large files quickly
  • Building schema catalogs

include_statistics

Type: bool
Default: True

Include min/max/distinct_count statistics in column metadata.

metadata = parquet_meta.read_metadata(
    "file.parquet",
    include_statistics=False
)

# Statistics fields will be None
for col in metadata["row_groups"][0]["columns"]:
    assert col["min"] is None
    assert col["max"] is None
    assert col["distinct_count"] is None

Use when:

  • Statistics aren't needed for your use case
  • Maximizing parsing speed
  • Reducing memory usage

max_row_groups

Type: int
Default: -1 (unlimited)

Limit the number of row groups to read.

metadata = parquet_meta.read_metadata(
    "large_file.parquet",
    max_row_groups=5
)

# Only first 5 row groups are read
print(len(metadata["row_groups"]))  # 5

Use when:

  • Sampling large files
  • Quick inspection of file characteristics
  • Processing files incrementally

Combining Options

Options can be combined to reduce parsing work even further:

# Fast schema-only read
metadata = parquet_meta.read_metadata(
    "file.parquet",
    schema_only=True,
    include_statistics=False  # Ignored when schema_only=True
)

# Sample first row group without statistics
metadata = parquet_meta.read_metadata(
    "file.parquet",
    max_row_groups=1,
    include_statistics=False
)

Performance Impact

schema_only

Enabling schema_only=True provides significant speedup for large files:

File Size   Row Groups   Normal   schema_only   Speedup
100 MB      1            10 ms    5 ms          2x
1 GB        10           50 ms    5 ms          10x
10 GB       100          500 ms   5 ms          100x
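
Actual timings depend on storage, file layout, and row group count. To check the effect on your own files, here is a minimal timing sketch; the import path and filename are assumptions, not part of this page's documented examples:

from time import perf_counter

from rugo import parquet_meta  # assumed import path; adjust to your installation

def time_read(path, **options):
    """Time a single metadata read, in milliseconds."""
    start = perf_counter()
    parquet_meta.read_metadata(path, **options)
    return (perf_counter() - start) * 1000

path = "large_file.parquet"  # placeholder; point at one of your own files
print(f"full read:   {time_read(path):.1f} ms")
print(f"schema only: {time_read(path, schema_only=True):.1f} ms")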

include_statistics

Disabling statistics has a more modest impact:

File Size   Row Groups   Normal   No Stats   Speedup
100 MB      1            10 ms    8 ms       1.25x
1 GB        10           50 ms    40 ms      1.25x

max_row_groups

Parsing time scales linearly with the number of row groups read:

# Full file: 1000 row groups @ 100ms
metadata = parquet_meta.read_metadata("file.parquet")

# First 10: ~1ms
metadata = parquet_meta.read_metadata("file.parquet", max_row_groups=10)

# First 100: ~10ms
metadata = parquet_meta.read_metadata("file.parquet", max_row_groups=100)

Common Patterns

Quick Schema Check

def get_schema(filename):
    """Get just the schema, as fast as possible."""
    return parquet_meta.read_metadata(
        filename,
        schema_only=True
    )["schema_columns"]

Sample Analysis

def analyze_sample(filename, num_groups=5):
    """Analyze a sample of row groups."""
    return parquet_meta.read_metadata(
        filename,
        max_row_groups=num_groups,
        include_statistics=True
    )
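
As a usage sketch, the sampled statistics can be folded into an overall value range per column. The filename below is a placeholder, and column chunks without statistics are skipped:

sample = analyze_sample("events.parquet", num_groups=3)  # placeholder filename

# Fold per-row-group min/max into one overall range per column
ranges = {}
for rg in sample["row_groups"]:
    for col in rg["columns"]:
        if col["min"] is None or col["max"] is None:
            continue  # no statistics recorded for this column chunk
        lo, hi = ranges.get(col["name"], (col["min"], col["max"]))
        ranges[col["name"]] = (min(lo, col["min"]), max(hi, col["max"]))

for name, (lo, hi) in ranges.items():
    print(f"{name}: [{lo}, {hi}]")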

Minimal Overhead

def minimal_metadata(filename):
    """Read with minimal processing."""
    return parquet_meta.read_metadata(
        filename,
        include_statistics=False
    )

Examples

Inspecting File Structure

# Quick look at what's in the file
schema = parquet_meta.read_metadata(
    "unknown_file.parquet",
    schema_only=True
)

for col in schema["schema_columns"]:
    print(f"{col['name']}: {col['logical_type']}")

Sampling Data Distribution

# Check statistics in first few row groups
sample = parquet_meta.read_metadata(
    "large_file.parquet",
    max_row_groups=3
)

for rg_idx, rg in enumerate(sample["row_groups"]):
    print(f"Row Group {rg_idx}:")
    for col in rg["columns"]:
        print(f"  {col['name']}: [{col['min']}, {col['max']}]")

Building File Catalog

def catalog_file(path):
    """Create catalog entry with minimal data."""
    meta = parquet_meta.read_metadata(
        path,
        schema_only=True
    )

    return {
        "path": path,
        "num_rows": meta["num_rows"],
        "columns": [c["name"] for c in meta["schema_columns"]],
        "types": {c["name"]: c["logical_type"] 
                 for c in meta["schema_columns"]}
    }
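
A usage sketch for the catalog builder above, walking a directory of Parquet files (the directory name is a placeholder):

from pathlib import Path

# Catalog every Parquet file under a data directory
catalog = [catalog_file(str(p)) for p in sorted(Path("data").glob("*.parquet"))]

for entry in catalog:
    print(f"{entry['path']}: {entry['num_rows']} rows, {len(entry['columns'])} columns")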

Best Practices

  1. Use schema_only for initial inspection - Get a quick overview before detailed analysis (see the sketch after this list)
  2. Limit row groups for large files - Sample first few groups instead of full scan
  3. Skip statistics when not needed - Faster parsing for structural analysis
  4. Consider file size - Options matter more for large files
  5. Profile your use case - Measure actual impact in your workflow
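
As an example of the first practice, a two-phase read pays for full metadata only when the schema looks relevant. This is a sketch, not part of Rugo's documented examples; the import path, filename, and column name are assumptions:

from rugo import parquet_meta  # assumed import path; adjust to your installation

def detailed_if_relevant(path, required_column):
    """Read full metadata only when the schema contains a required column."""
    schema = parquet_meta.read_metadata(path, schema_only=True)
    names = {c["name"] for c in schema["schema_columns"]}
    if required_column not in names:
        return None  # skip the more expensive full read
    return parquet_meta.read_metadata(path)

meta = detailed_if_relevant("file.parquet", "user_id")  # "user_id" is a placeholder
if meta is not None:
    print(f"{len(meta['row_groups'])} row groups to analyze")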

Next Steps