Reading Metadata

This guide covers all the ways to read Parquet metadata using Rugo.

Reading from Files

The most common use case is reading metadata from Parquet files on disk:

import rugo.parquet as parquet_meta

metadata = parquet_meta.read_metadata("path/to/file.parquet")

Path Types

Rugo accepts various path formats:

# String path
metadata = parquet_meta.read_metadata("data.parquet")

# Relative path
metadata = parquet_meta.read_metadata("../data/file.parquet")

# Absolute path
metadata = parquet_meta.read_metadata("/home/user/data/file.parquet")

# Path objects (pathlib)
from pathlib import Path
metadata = parquet_meta.read_metadata(Path("data.parquet"))

Reading from Bytes

Read metadata from in-memory byte strings:

# Read file into memory
with open("example.parquet", "rb") as f:
    data = f.read()

# Parse metadata from bytes
metadata = parquet_meta.read_metadata_from_bytes(data)

This is useful when:

  • Working with files from network sources
  • Processing data from blob storage
  • Testing with generated Parquet data

Reading from Memoryview

For zero-copy parsing with memory-mapped files or buffers:

# From bytes (the with block closes the file handle promptly)
with open("example.parquet", "rb") as f:
    data = f.read()
metadata = parquet_meta.read_metadata_from_memoryview(memoryview(data))

# With memory mapping
import mmap
with open("large_file.parquet", "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        metadata = parquet_meta.read_metadata_from_memoryview(memoryview(mm))

Performance Benefit

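Passing a memoryview lets Rugo parse the buffer in place instead of copying it, which can be more efficient for large files. Combined with mmap, the file does not have to be read into memory in full first.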
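The memory-mapping pattern above can be wrapped in a small helper. This is a minimal sketch, not part of Rugo: the name read_metadata_mmap and its parse parameter are hypothetical, and parse stands in for any callable that accepts a memoryview, such as parquet_meta.read_metadata_from_memoryview.

```python
import mmap
from pathlib import Path
from typing import Any, Callable, Union


def read_metadata_mmap(path: Union[str, Path], parse: Callable[[memoryview], Any]) -> Any:
    """Memory-map `path` read-only and hand the mapping to `parse` as a memoryview.

    `parse` is a hypothetical parameter: any callable that takes a memoryview.
    """
    with open(path, "rb") as f:
        # Length 0 maps the whole file; mmap raises ValueError for an empty file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Releasing the view before the mapping closes avoids a BufferError
            # caused by closing a mapping that still has an exported buffer.
            with memoryview(mm) as view:
                return parse(view)
```

With Rugo installed, a call would look like read_metadata_mmap("large_file.parquet", parquet_meta.read_metadata_from_memoryview). Because the view is released on return, the result must not keep a reference to the buffer; a plain dictionary of parsed values satisfies that.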

Function Comparison

All three reading functions accept the same keyword arguments and differ only in the input they take:

Function                         Input          Use Case
read_metadata                    File path      Most common, reading from disk
read_metadata_from_bytes         bytes object   In-memory data, network sources
read_metadata_from_memoryview    memoryview     Zero-copy parsing, memory-mapped files

Return Value

All functions return the same metadata structure - a dictionary containing:

{
    "num_rows": int,
    "schema_columns": [...],
    "row_groups": [...]
}

See Metadata Structure for complete details.
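As an illustration of working with the returned dictionary, the sketch below builds a one-line summary from the three top-level keys listed above. The summarize helper is hypothetical, and the example dictionary is a hand-built stand-in whose list entries are empty placeholders, not real Rugo output.

```python
from typing import Any, Mapping


def summarize(metadata: Mapping[str, Any]) -> str:
    """Build a one-line summary from the three documented top-level keys."""
    num_rows = metadata["num_rows"]
    # .get() with a default keeps the sketch working if a list is absent.
    columns = metadata.get("schema_columns", [])
    row_groups = metadata.get("row_groups", [])
    return f"{num_rows} rows, {len(columns)} columns, {len(row_groups)} row groups"


# Hand-built stand-in for a real result; the entry contents are placeholders.
example = {"num_rows": 1000, "schema_columns": [{}, {}, {}], "row_groups": [{}, {}]}
print(summarize(example))  # prints "1000 rows, 3 columns, 2 row groups"
```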

Error Handling

Rugo raises exceptions for invalid inputs:

try:
    metadata = parquet_meta.read_metadata("nonexistent.parquet")
except FileNotFoundError:
    print("File not found")

try:
    metadata = parquet_meta.read_metadata_from_bytes(b"not parquet data")
except Exception as e:
    print(f"Invalid Parquet data: {e}")
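A cheap pre-check can reject obviously invalid input before it reaches the parser. The sketch below relies on the Parquet file layout, in which a file starts and ends with the 4-byte magic PAR1. The helper looks_like_parquet is hypothetical and not part of Rugo, and passing the check does not guarantee that the metadata parses.

```python
PARQUET_MAGIC = b"PAR1"
# 4-byte header magic + 4-byte footer length + 4-byte trailer magic
MIN_PARQUET_SIZE = 12


def looks_like_parquet(data: bytes) -> bool:
    """Return True if `data` is long enough and carries the magic at both ends."""
    return (
        len(data) >= MIN_PARQUET_SIZE
        and data[:4] == PARQUET_MAGIC
        and data[-4:] == PARQUET_MAGIC
    )


print(looks_like_parquet(b"not parquet data"))  # prints "False"
```

Files written with an encrypted footer use a different magic and would fail this check, so treat it as a fast filter in front of the try/except, not a replacement for it.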

Best Practices

Choose the Right Function

  • Use read_metadata for files on disk
  • Use read_metadata_from_bytes when you already have data in memory
  • Use read_metadata_from_memoryview for zero-copy scenarios
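The choice between the three functions can be expressed as a small dispatcher on the input type. This is a sketch under stated assumptions: read_any and its api parameter are hypothetical, and api is expected to be the imported rugo.parquet module (or any object exposing the same three functions). Keyword arguments are forwarded unchanged, since all three readers accept the same ones.

```python
import os
from typing import Any


def read_any(source: Any, api: Any, **kwargs: Any) -> Any:
    """Dispatch to one of the three readers on `api` based on the type of `source`."""
    if isinstance(source, (str, os.PathLike)):
        # String paths and pathlib.Path objects: read from disk.
        return api.read_metadata(source, **kwargs)
    if isinstance(source, memoryview):
        # Zero-copy path for buffers and memory-mapped files.
        return api.read_metadata_from_memoryview(source, **kwargs)
    if isinstance(source, bytes):
        # Data that is already in memory.
        return api.read_metadata_from_bytes(source, **kwargs)
    raise TypeError(f"unsupported source type: {type(source).__name__}")
```

With Rugo installed, read_any(Path("data.parquet"), parquet_meta, schema_only=True) would take the first branch.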

Use Schema-Only Mode for Large Files

When you only need schema information:

# Faster - skips row group details
schema = parquet_meta.read_metadata(
    "huge_file.parquet",
    schema_only=True
)

Limit Row Groups for Quick Analysis

For very large files with many row groups:

# Only read first 5 row groups
metadata = parquet_meta.read_metadata(
    "file.parquet",
    max_row_groups=5
)

Next Steps