Reading Metadata
This guide covers all the ways to read Parquet metadata using Rugo.
Reading from Files
The most common use case is reading metadata from a Parquet file on disk with read_metadata.
Path Types
Rugo accepts various path formats:
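# parquet_meta is the alias used throughout this guide (import path assumed)
import rugo.parquet as parquet_meta
# String path
metadata = parquet_meta.read_metadata("data.parquet")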
# Relative path
metadata = parquet_meta.read_metadata("../data/file.parquet")
# Absolute path
metadata = parquet_meta.read_metadata("/home/user/data/file.parquet")
# Path objects (pathlib)
from pathlib import Path
metadata = parquet_meta.read_metadata(Path("data.parquet"))
Reading from Bytes
Read metadata from in-memory byte strings:
# Read file into memory
with open("example.parquet", "rb") as f:
    data = f.read()
# Parse metadata from bytes
metadata = parquet_meta.read_metadata_from_bytes(data)
This is useful when:
- Working with files from network sources
- Processing data from blob storage
- Testing with generated Parquet data
Reading from Memoryview
For zero-copy parsing with memory-mapped files or buffers:
# From bytes (the context manager closes the file once the data is read)
with open("example.parquet", "rb") as f:
    data = f.read()
metadata = parquet_meta.read_metadata_from_memoryview(memoryview(data))
# With memory mapping
import mmap
with open("large_file.parquet", "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        metadata = parquet_meta.read_metadata_from_memoryview(memoryview(mm))
Performance Benefit
A memoryview lets the parser read the underlying buffer directly instead of working on a copy. This avoids duplicating the file contents in memory, which matters most for large or memory-mapped files.
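The zero-copy behaviour can be checked with the standard library alone. The sketch below does not call Rugo. It writes a throwaway file (the payload is a placeholder, not valid Parquet), memory-maps it, and shows that slicing a memoryview shares the mapped buffer while `bytes(...)` makes an explicit copy.

```python
import mmap
import os
import tempfile

# Throwaway file so the example is self-contained (placeholder content only).
payload = b"PAR1" + bytes(1024) + b"PAR1"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        view = memoryview(mm)
        tail = view[-4:]                 # slicing a memoryview does not copy
        shares_buffer = tail.obj is mm   # the slice still points at the mmap
        tail_bytes = bytes(tail)         # an explicit copy, made only on request
        # Release the views before the mmap closes; closing a mapping with
        # live exports raises BufferError.
        tail.release()
        view.release()

os.unlink(path)
print(shares_buffer, tail_bytes)
```

Note the explicit `release()` calls: any memoryview that outlives the `with mmap.mmap(...)` block will make closing the mapping fail.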
Function Comparison
All three reading functions accept the same keyword arguments and differ only in the input they take:
| Function | Input | Use Case |
|---|---|---|
| `read_metadata` | File path | Most common, reading from disk |
| `read_metadata_from_bytes` | bytes object | In-memory data, network sources |
| `read_metadata_from_memoryview` | memoryview | Zero-copy parsing, memory-mapped files |
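Because the three functions differ only in input type, the choice can be made by inspecting the source. The sketch below is one way to do that. `read_any` is a hypothetical helper, not part of Rugo, and `fake_reader` is a stand-in so the example runs without a Parquet file; in real use you would pass the `parquet_meta` module as `reader`.

```python
import os
from types import SimpleNamespace


def read_any(reader, source, **options):
    """Route `source` to the matching reader function based on its type.

    `reader` is any object exposing the three functions from the table.
    Keyword options are forwarded unchanged.
    """
    if isinstance(source, (str, os.PathLike)):
        return reader.read_metadata(source, **options)
    if isinstance(source, memoryview):
        return reader.read_metadata_from_memoryview(source, **options)
    if isinstance(source, bytes):
        return reader.read_metadata_from_bytes(source, **options)
    raise TypeError(f"unsupported source type: {type(source).__name__}")


# Stand-in reader that only reports which function was chosen.
fake_reader = SimpleNamespace(
    read_metadata=lambda src, **kw: ("path", kw),
    read_metadata_from_bytes=lambda src, **kw: ("bytes", kw),
    read_metadata_from_memoryview=lambda src, **kw: ("memoryview", kw),
)

print(read_any(fake_reader, "data.parquet", schema_only=True))
print(read_any(fake_reader, b"raw bytes"))
print(read_any(fake_reader, memoryview(b"raw bytes")))
```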
Return Value
All functions return the same metadata structure: a dictionary. See Metadata Structure for complete details.
Error Handling
Rugo raises exceptions for invalid inputs:
try:
    metadata = parquet_meta.read_metadata("nonexistent.parquet")
except FileNotFoundError:
    print("File not found")
try:
    metadata = parquet_meta.read_metadata_from_bytes(b"not parquet data")
except Exception as e:
    print(f"Invalid Parquet data: {e}")
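When bytes come from an untrusted source, a cheap structural check before parsing can give a clearer error message. The sketch below relies on the Parquet file layout, in which a file with a plaintext footer starts and ends with the 4-byte magic `PAR1`. `looks_like_parquet` is a hypothetical helper, not part of Rugo. It does not validate the footer contents, and it does not cover files with an encrypted footer, which use a different magic.

```python
def looks_like_parquet(data: bytes) -> bool:
    """Return True if `data` has the Parquet magic at both ends."""
    magic = b"PAR1"
    # Smallest possible layout: 4-byte leading magic, 4-byte footer length,
    # 4-byte trailing magic.
    min_size = 12
    return len(data) >= min_size and data[:4] == magic and data[-4:] == magic


print(looks_like_parquet(b"not parquet data"))
print(looks_like_parquet(b"PAR1" + bytes(4) + b"PAR1"))
```

A `False` result means the parser would certainly fail; a `True` result only means the data is worth handing to the parser.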
Best Practices
Choose the Right Function
- Use `read_metadata` for files on disk
- Use `read_metadata_from_bytes` when you already have data in memory
- Use `read_metadata_from_memoryview` for zero-copy scenarios
Use Schema-Only Mode for Large Files
When you only need schema information:
# Faster - skips row group details
schema = parquet_meta.read_metadata(
    "huge_file.parquet",
    schema_only=True,
)
Limit Row Groups for Quick Analysis
For very large files with many row groups:
# Only read first 5 row groups
metadata = parquet_meta.read_metadata(
    "file.parquet",
    max_row_groups=5,
)
Next Steps
- Metadata Structure - Understand the returned data
- Parsing Options - Configure metadata reading