Parsing Options
Configure how Rugo reads and processes Parquet metadata.
Available Options
All metadata reading functions accept these keyword arguments:
schema_only
Type: bool
Default: False
Return only the top-level schema without row group details.
metadata = parquet_meta.read_metadata(
    "file.parquet",
    schema_only=True
)
# Schema is available
print(metadata["schema_columns"])
# Row groups list is empty
print(len(metadata["row_groups"]))  # 0
Use when:
- You only need to understand the file structure
- Inspecting very large files quickly
- Building schema catalogs
include_statistics
Type: bool
Default: True
Include min/max/distinct_count statistics in column metadata.
metadata = parquet_meta.read_metadata(
    "file.parquet",
    include_statistics=False
)
# Statistics fields will be None
for col in metadata["row_groups"][0]["columns"]:
    assert col["min"] is None
    assert col["max"] is None
    assert col["distinct_count"] is None
Use when:
- Statistics aren't needed for your use case (see the sketch below)
- Maximizing parsing speed
- Reducing memory usage
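For purely structural work, a minimal sketch such as the following (the file name is illustrative; the fields used are the ones documented above) reads the metadata without statistics and summarizes the file layout:

# Sketch: structural summary without statistics; the file name is illustrative.
meta = parquet_meta.read_metadata("file.parquet", include_statistics=False)
print(f"{len(meta['schema_columns'])} columns, {len(meta['row_groups'])} row groups")
for rg_idx, rg in enumerate(meta["row_groups"]):
    print(f"  row group {rg_idx}: {len(rg['columns'])} column chunks")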
max_row_groups
Type: int
Default: -1 (unlimited)
Limit the number of row groups to read.
metadata = parquet_meta.read_metadata(
    "large_file.parquet",
    max_row_groups=5
)
# Only first 5 row groups are read
print(len(metadata["row_groups"]))  # 5
Use when:
- Sampling large files
- Quick inspection of file characteristics
- Processing files incrementally (see the sketch below)
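For incremental processing, one possible pattern (a sketch; the step sizes and file name are illustrative, not part of the API) is to grow the sample until the returned row group count falls below the requested limit, which means the whole file has been covered:

# Sketch: enlarge the sample step by step; stop once the whole file is covered.
for limit in (1, 10, 100):
    meta = parquet_meta.read_metadata("large_file.parquet", max_row_groups=limit)
    if len(meta["row_groups"]) < limit:
        break  # fewer row groups than requested: nothing left to read
print(f"inspected {len(meta['row_groups'])} row groups")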
Combining Options
Options can be combined to minimize parsing work:
# Fast schema-only read
metadata = parquet_meta.read_metadata(
    "file.parquet",
    schema_only=True,
    include_statistics=False  # Ignored when schema_only=True
)

# Sample first row group without statistics
metadata = parquet_meta.read_metadata(
    "file.parquet",
    max_row_groups=1,
    include_statistics=False
)
Performance Impact
schema_only
Enabling schema_only=True provides significant speedup for large files:
| File Size | Row Groups | Normal | schema_only | Speedup |
|---|---|---|---|---|
| 100 MB | 1 | 10 ms | 5 ms | 2x |
| 1 GB | 10 | 50 ms | 5 ms | 10x |
| 10 GB | 100 | 500 ms | 5 ms | 100x |
include_statistics
Disabling statistics has a more modest impact:
| File Size | Row Groups | Normal | No Stats | Speedup |
|---|---|---|---|---|
| 100 MB | 1 | 10 ms | 8 ms | 1.25x |
| 1 GB | 10 | 50 ms | 40 ms | 1.25x |
max_row_groups
Parsing time scales linearly with the number of row groups read:
# Full file: 1000 row groups @ 100ms
metadata = parquet_meta.read_metadata("file.parquet")
# First 10: ~1ms
metadata = parquet_meta.read_metadata("file.parquet", max_row_groups=10)
# First 100: ~10ms
metadata = parquet_meta.read_metadata("file.parquet", max_row_groups=100)
Common Patterns
Quick Schema Check
def get_schema(filename):
    """Get just the schema, as fast as possible."""
    return parquet_meta.read_metadata(
        filename,
        schema_only=True
    )["schema_columns"]
Sample Analysis
def analyze_sample(filename, num_groups=5):
    """Analyze a sample of row groups."""
    return parquet_meta.read_metadata(
        filename,
        max_row_groups=num_groups,
        include_statistics=True
    )
Minimal Overhead
def minimal_metadata(filename):
    """Read with minimal processing."""
    return parquet_meta.read_metadata(
        filename,
        include_statistics=False
    )
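A brief usage sketch for the three helpers above (the file name is illustrative):

# Assumes the helpers defined above are in scope; the file name is illustrative.
schema = get_schema("events.parquet")
sample = analyze_sample("events.parquet", num_groups=3)
slim = minimal_metadata("events.parquet")
print([c["name"] for c in schema])
print(len(sample["row_groups"]))  # at most 3
print(len(slim["row_groups"]))    # all row groups, statistics omitted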
Examples
Inspecting File Structure
# Quick look at what's in the file
schema = parquet_meta.read_metadata(
    "unknown_file.parquet",
    schema_only=True
)
for col in schema["schema_columns"]:
    print(f"{col['name']}: {col['logical_type']}")
Sampling Data Distribution
# Check statistics in first few row groups
sample = parquet_meta.read_metadata(
    "large_file.parquet",
    max_row_groups=3
)
for rg_idx, rg in enumerate(sample["row_groups"]):
    print(f"Row Group {rg_idx}:")
    for col in rg["columns"]:
        print(f"  {col['name']}: [{col['min']}, {col['max']}]")
Building File Catalog
def catalog_file(path):
    """Create catalog entry with minimal data."""
    meta = parquet_meta.read_metadata(
        path,
        schema_only=True
    )
    return {
        "path": path,
        "num_rows": meta["num_rows"],
        "columns": [c["name"] for c in meta["schema_columns"]],
        "types": {c["name"]: c["logical_type"]
                  for c in meta["schema_columns"]},
    }
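For example, the helper can be mapped over a directory of Parquet files to build a small catalog (the directory path is illustrative):

from pathlib import Path

# Catalog every Parquet file under a directory; the path is illustrative.
catalog = [catalog_file(str(p)) for p in Path("data").glob("*.parquet")]
for entry in catalog:
    print(entry["path"], entry["num_rows"], len(entry["columns"]))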
Best Practices
- Use schema_only for initial inspection - Get a quick overview before detailed analysis
- Limit row groups for large files - Sample the first few groups instead of reading every row group
- Skip statistics when not needed - Faster parsing for purely structural analysis
- Consider file size - Options matter more for large files
- Profile your use case - Measure the actual impact in your workflow (see the timing sketch below)
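As a starting point for profiling, the sketch below times a full read against a schema-only read; the file name is illustrative and only the standard-library time module is assumed:

import time

def time_read(**kwargs):
    """Time a single metadata read with the given options."""
    start = time.perf_counter()
    parquet_meta.read_metadata("large_file.parquet", **kwargs)
    return time.perf_counter() - start

print(f"full read:   {time_read():.4f} s")
print(f"schema only: {time_read(schema_only=True):.4f} s")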
Next Steps
- In-Memory Data - Work with bytes and memoryview
- API Reference - Complete function signatures