
Quickstart

Get up and running with Rugo in minutes. This guide covers the essential operations you'll need.

Basic Usage

Reading Metadata from a File

The simplest way to use Rugo is to read metadata from a Parquet file:

import rugo.parquet as parquet_meta

# Read complete metadata
metadata = parquet_meta.read_metadata("example.parquet")

# Access basic information
print(f"Total rows: {metadata['num_rows']}")
print(f"Number of row groups: {len(metadata['row_groups'])}")

Exploring the Schema

Access schema information to understand file structure:

# Iterate over schema columns
print("Schema columns:")
for column in metadata["schema_columns"]:
    print(f"  {column['name']}")
    print(f"    Physical type: {column['physical_type']}")
    print(f"    Logical type: {column['logical_type']}")
    print(f"    Nullable: {column['nullable']}")

Examining Row Groups

Each Parquet file contains one or more row groups. Access their details:

# Look at the first row group
first_rg = metadata["row_groups"][0]
print("Row group 0:")
print(f"  Rows: {first_rg['num_rows']}")
print(f"  Total size: {first_rg['total_byte_size']} bytes")

# Examine columns within the row group
for column in first_rg["columns"]:
    print(f"\n  Column: {column['name']}")
    print(f"    Compression: {column['compression_codec']}")
    print(f"    Null count: {column['null_count']}")
    print(f"    Min: {column['min']}")
    print(f"    Max: {column['max']}")

Schema-Only Mode

When you only need schema information, use schema_only=True for faster execution:

# Read schema without row group details
schema = parquet_meta.read_metadata(
    "large_file.parquet",
    schema_only=True
)

# Schema columns are available
for col in schema["schema_columns"]:
    print(f"{col['name']}: {col['physical_type']}")

# Row groups are empty
print(f"Row groups: {len(schema['row_groups'])}")  # Will be 0
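
Schema-only reads are a natural fit for checking whether two files share a schema before processing them together. The sketch below is one possible way to do that and is not part of Rugo: `schema_differences` is a hypothetical helper, and the two dictionaries are hand-built stand-ins for `read_metadata(..., schema_only=True)` results. The physical type strings are illustrative and may be spelled differently in real output.

```python
def schema_differences(left, right):
    """Map column name -> (left physical type, right physical type) for mismatches."""
    left_types = {c["name"]: c["physical_type"] for c in left["schema_columns"]}
    right_types = {c["name"]: c["physical_type"] for c in right["schema_columns"]}
    differences = {}
    # Walk the union of names so columns missing on either side are reported too.
    for name in sorted(left_types.keys() | right_types.keys()):
        pair = (left_types.get(name), right_types.get(name))
        if pair[0] != pair[1]:
            differences[name] = pair
    return differences


# Hypothetical schema-only results (hand-built, not read from real files).
schema_a = {
    "schema_columns": [
        {"name": "id", "physical_type": "INT64"},
        {"name": "name", "physical_type": "BYTE_ARRAY"},
    ],
    "row_groups": [],
}
schema_b = {
    "schema_columns": [
        {"name": "id", "physical_type": "INT32"},
        {"name": "name", "physical_type": "BYTE_ARRAY"},
        {"name": "ts", "physical_type": "INT64"},
    ],
    "row_groups": [],
}

for name, (left_type, right_type) in schema_differences(schema_a, schema_b).items():
    print(f"{name}: {left_type} vs {right_type}")
```

A missing column shows up with `None` on the side that lacks it, so an empty result means the names and physical types match.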

Working with Statistics

Statistics help you understand data distribution without reading the full dataset:

metadata = parquet_meta.read_metadata("data.parquet")

for rg_idx, rg in enumerate(metadata["row_groups"]):
    print(f"\nRow Group {rg_idx}:")
    for col in rg["columns"]:
        if col["min"] is not None and col["max"] is not None:
            print(f"  {col['name']}: [{col['min']}, {col['max']}]")
            print(f"    Distinct values: {col['distinct_count']}")
            print(f"    Null values: {col['null_count']}")
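
A common use of min/max statistics is deciding which row groups can be skipped for a given lookup. The following is a minimal sketch under stated assumptions, not a Rugo feature: `row_groups_matching` is a hypothetical helper, the metadata is hand-built in the shape shown above, and the statistics are assumed to be directly comparable with the value being searched for.

```python
def row_groups_matching(metadata, column_name, value):
    """Return indexes of row groups whose [min, max] range could contain value."""
    candidates = []
    for rg_idx, rg in enumerate(metadata["row_groups"]):
        for col in rg["columns"]:
            if col["name"] != column_name:
                continue
            lo, hi = col["min"], col["max"]
            # Missing statistics prove nothing, so the row group must be kept.
            if lo is None or hi is None or lo <= value <= hi:
                candidates.append(rg_idx)
    return candidates


# Hypothetical metadata (hand-built, not read from a real file).
sample_metadata = {
    "row_groups": [
        {"columns": [{"name": "id", "min": 1, "max": 100}]},
        {"columns": [{"name": "id", "min": 101, "max": 200}]},
        {"columns": [{"name": "id", "min": None, "max": None}]},
    ]
}

print(row_groups_matching(sample_metadata, "id", 150))  # prints [1, 2]
```

Row group 2 is kept even though it has no statistics: a missing range cannot rule a row group out, so only row group 0 is safely skipped here.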

Skipping Statistics

Pass include_statistics=False for faster parsing when statistics aren't needed:

metadata = parquet_meta.read_metadata(
    "file.parquet",
    include_statistics=False
)

# Min/max values will be None
for col in metadata["row_groups"][0]["columns"]:
    assert col["min"] is None
    assert col["max"] is None

In-Memory Data

Read metadata from bytes or memoryview objects:

# From bytes
with open("example.parquet", "rb") as f:
    data = f.read()

from_bytes = parquet_meta.read_metadata_from_bytes(data)

# From memoryview (zero-copy)
from_view = parquet_meta.read_metadata_from_memoryview(memoryview(data))

Common Patterns

Checking Compression

Determine what compression is used:

metadata = parquet_meta.read_metadata("file.parquet")

codecs = set()
for rg in metadata["row_groups"]:
    for col in rg["columns"]:
        if col["compression_codec"]:
            codecs.add(col["compression_codec"])

print(f"Compression codecs used: {codecs}")
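
When a file mixes codecs, it can help to know how many column chunks use each one rather than just which codecs appear. This sketch counts them with `collections.Counter`. `codec_counts` is a hypothetical helper, and the metadata and codec names below are made up for illustration.

```python
from collections import Counter


def codec_counts(metadata):
    """Count column chunks per compression codec across all row groups."""
    return Counter(
        col["compression_codec"]
        for rg in metadata["row_groups"]
        for col in rg["columns"]
        if col["compression_codec"]  # skip chunks with no codec reported
    )


# Hypothetical metadata (hand-built, not read from a real file).
sample_codecs = {
    "row_groups": [
        {"columns": [
            {"name": "id", "compression_codec": "SNAPPY"},
            {"name": "name", "compression_codec": "ZSTD"},
        ]},
        {"columns": [
            {"name": "id", "compression_codec": "SNAPPY"},
            {"name": "name", "compression_codec": "SNAPPY"},
            {"name": "note", "compression_codec": None},
        ]},
    ]
}

for codec, count in codec_counts(sample_codecs).most_common():
    print(f"{codec}: {count} column chunk(s)")
```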

Finding Column by Name

Locate specific column information:

def find_column(metadata, column_name, row_group_idx=0):
    """Find column metadata by name in specified row group."""
    rg = metadata["row_groups"][row_group_idx]
    for col in rg["columns"]:
        if col["name"] == column_name:
            return col
    return None

col_info = find_column(metadata, "my_column")
if col_info:
    print(f"Found: {col_info['name']}")
    print(f"  Type: {col_info['type']}")
    print(f"  Range: [{col_info['min']}, {col_info['max']}]")
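
The lookup above inspects a single row group. To get a range for the whole file, the per-row-group values can be combined. This is a sketch under assumptions: `column_range` is a hypothetical helper, the metadata is hand-built, and the min/max values are assumed to be mutually comparable.

```python
def column_range(metadata, column_name):
    """Combine per-row-group min/max into one (min, max) pair for the file.

    Returns None if the column is absent or any row group lacks statistics.
    """
    mins, maxs = [], []
    for rg in metadata["row_groups"]:
        for col in rg["columns"]:
            if col["name"] != column_name:
                continue
            if col["min"] is None or col["max"] is None:
                # One row group without statistics makes the overall range unknown.
                return None
            mins.append(col["min"])
            maxs.append(col["max"])
    if not mins:
        return None  # column not found in any row group
    return min(mins), max(maxs)


# Hypothetical metadata (hand-built, not read from a real file).
sample_ranges = {
    "row_groups": [
        {"columns": [{"name": "price", "min": 10, "max": 50}]},
        {"columns": [{"name": "price", "min": 5, "max": 90}]},
    ]
}

print(column_range(sample_ranges, "price"))  # prints (5, 90)
```

Returning `None` when any row group lacks statistics is a deliberately conservative choice: a partial range could understate the true bounds.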

Checking File Size

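Estimate the total data size by summing total_byte_size across row groups. In the Parquet format this field counts uncompressed column data, so the result can differ from the file's size on disk: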

total_bytes = sum(
    rg["total_byte_size"]
    for rg in metadata["row_groups"]
)

print(f"Total data size: {total_bytes / 1024 / 1024:.2f} MB")

Next Steps