Parsing Options

Configure how Rugo reads and processes Parquet metadata.

Available Options

All metadata reading functions accept these keyword arguments:

schema_only

Type: bool
Default: False

Return only the top-level schema without row group details.

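# Assumes parquet_meta has already been imported from the Rugo package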
metadata = parquet_meta.read_metadata(
    "file.parquet",
    schema_only=True
)

# Schema is available
print(metadata["schema_columns"])

# Row groups list is empty
print(len(metadata["row_groups"]))  # 0

Use when:

  • You only need to understand the file structure
  • Inspecting very large files quickly
  • Building schema catalogs

include_statistics

Type: bool
Default: True

Include min/max/distinct_count statistics in column metadata.

metadata = parquet_meta.read_metadata(
    "file.parquet",
    include_statistics=False
)

# Statistics fields will be None
for col in metadata["row_groups"][0]["columns"]:
    assert col["min"] is None
    assert col["max"] is None
    assert col["distinct_count"] is None

Use when:

  • Statistics aren't needed for your use case
  • Maximizing parsing speed
  • Reducing memory usage

max_row_groups

Type: int
Default: -1 (unlimited)

Limit the number of row groups to read.

metadata = parquet_meta.read_metadata(
    "large_file.parquet",
    max_row_groups=5
)

# Only first 5 row groups are read
print(len(metadata["row_groups"]))  # 5

Use when:

  • Sampling large files
  • Quick inspection of file characteristics
  • Processing files incrementally

Combining Options

Options can be combined to reduce parsing work even further:

# Fast schema-only read
metadata = parquet_meta.read_metadata(
    "file.parquet",
    schema_only=True,
    include_statistics=False  # Ignored when schema_only=True
)

# Sample first row group without statistics
metadata = parquet_meta.read_metadata(
    "file.parquet",
    max_row_groups=1,
    include_statistics=False
)

Performance Impact

schema_only

Enabling schema_only=True provides significant speedup for large files:

File Size   Row Groups   Normal   schema_only   Speedup
100 MB      1            10 ms    5 ms          2x
1 GB        10           50 ms    5 ms          10x
10 GB       100          500 ms   5 ms          100x
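
Actual timings depend on storage, file layout, and row group count. To check the effect on your own files, here is a minimal timing sketch; the import path and filename are assumptions, not part of this page's documented examples:

from time import perf_counter

from rugo import parquet_meta  # assumed import path; adjust to your installation

def time_read(path, **options):
    """Time a single metadata read, in milliseconds."""
    start = perf_counter()
    parquet_meta.read_metadata(path, **options)
    return (perf_counter() - start) * 1000

path = "large_file.parquet"  # placeholder; point at one of your own files
print(f"full read:   {time_read(path):.1f} ms")
print(f"schema only: {time_read(path, schema_only=True):.1f} ms")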

include_statistics

Disabling statistics has a more modest impact:

File Size   Row Groups   Normal   No Stats   Speedup
100 MB      1            10 ms    8 ms       1.25x
1 GB        10           50 ms    40 ms      1.25x

max_row_groups

Parsing time scales linearly with the number of row groups read:

# Full file: 1000 row groups @ 100ms
metadata = parquet_meta.read_metadata("file.parquet")

# First 10: ~1ms
metadata = parquet_meta.read_metadata("file.parquet", max_row_groups=10)

# First 100: ~10ms
metadata = parquet_meta.read_metadata("file.parquet", max_row_groups=100)

Common Patterns

Quick Schema Check

def get_schema(filename):
    """Get just the schema, as fast as possible."""
    return parquet_meta.read_metadata(
        filename,
        schema_only=True
    )["schema_columns"]

Sample Analysis

def analyze_sample(filename, num_groups=5):
    """Analyze a sample of row groups."""
    return parquet_meta.read_metadata(
        filename,
        max_row_groups=num_groups,
        include_statistics=True
    )
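
As a usage sketch, the sampled statistics can be folded into an overall value range per column. The filename below is a placeholder, and column chunks without statistics are skipped:

sample = analyze_sample("events.parquet", num_groups=3)  # placeholder filename

# Fold per-row-group min/max into one overall range per column
ranges = {}
for rg in sample["row_groups"]:
    for col in rg["columns"]:
        if col["min"] is None or col["max"] is None:
            continue  # no statistics recorded for this column chunk
        lo, hi = ranges.get(col["name"], (col["min"], col["max"]))
        ranges[col["name"]] = (min(lo, col["min"]), max(hi, col["max"]))

for name, (lo, hi) in ranges.items():
    print(f"{name}: [{lo}, {hi}]")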

Minimal Overhead

def minimal_metadata(filename):
    """Read with minimal processing."""
    return parquet_meta.read_metadata(
        filename,
        include_statistics=False
    )

Examples

Inspecting File Structure

# Quick look at what's in the file
schema = parquet_meta.read_metadata(
    "unknown_file.parquet",
    schema_only=True
)

for col in schema["schema_columns"]:
    print(f"{col['name']}: {col['logical_type']}")

Sampling Data Distribution

# Check statistics in first few row groups
sample = parquet_meta.read_metadata(
    "large_file.parquet",
    max_row_groups=3
)

for rg_idx, rg in enumerate(sample["row_groups"]):
    print(f"Row Group {rg_idx}:")
    for col in rg["columns"]:
        print(f"  {col['name']}: [{col['min']}, {col['max']}]")

Building File Catalog

def catalog_file(path):
    """Create catalog entry with minimal data."""
    meta = parquet_meta.read_metadata(
        path,
        schema_only=True
    )

    return {
        "path": path,
        "num_rows": meta["num_rows"],
        "columns": [c["name"] for c in meta["schema_columns"]],
        "types": {c["name"]: c["logical_type"] 
                 for c in meta["schema_columns"]}
    }
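
A usage sketch for the catalog builder above, walking a directory of Parquet files (the directory name is a placeholder):

from pathlib import Path

# Catalog every Parquet file under a data directory
catalog = [catalog_file(str(p)) for p in sorted(Path("data").glob("*.parquet"))]

for entry in catalog:
    print(f"{entry['path']}: {entry['num_rows']} rows, {len(entry['columns'])} columns")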

Best Practices

  1. Use schema_only for initial inspection - Get a quick overview before detailed analysis (see the sketch after this list)
  2. Limit row groups for large files - Sample first few groups instead of full scan
  3. Skip statistics when not needed - Faster parsing for structural analysis
  4. Consider file size - Options matter more for large files
  5. Profile your use case - Measure actual impact in your workflow
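
As an example of the first practice, a two-phase read pays for full metadata only when the schema looks relevant. This is a sketch, not part of Rugo's documented examples; the import path, filename, and column name are assumptions:

from rugo import parquet_meta  # assumed import path; adjust to your installation

def detailed_if_relevant(path, required_column):
    """Read full metadata only when the schema contains a required column."""
    schema = parquet_meta.read_metadata(path, schema_only=True)
    names = {c["name"] for c in schema["schema_columns"]}
    if required_column not in names:
        return None  # skip the more expensive full read
    return parquet_meta.read_metadata(path)

meta = detailed_if_relevant("file.parquet", "user_id")  # "user_id" is a placeholder
if meta is not None:
    print(f"{len(meta['row_groups'])} row groups to analyze")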

Next Steps