Quickstart
Get up and running with Rugo in minutes. This guide covers the essential operations you'll need.
Basic Usage
Reading Metadata from a File
The simplest way to use Rugo is to read metadata from a Parquet file:
import rugo.parquet as parquet_meta
# Read complete metadata
metadata = parquet_meta.read_metadata("example.parquet")
# Access basic information
print(f"Total rows: {metadata['num_rows']}")
print(f"Number of row groups: {len(metadata['row_groups'])}")
Exploring the Schema
Access schema information to understand file structure:
# Iterate over schema columns
print("Schema columns:")
for column in metadata["schema_columns"]:
    print(f"  {column['name']}")
    print(f"    Physical type: {column['physical_type']}")
    print(f"    Logical type: {column['logical_type']}")
    print(f"    Nullable: {column['nullable']}")
Examining Row Groups
Each Parquet file contains one or more row groups. Access their details:
# Look at the first row group
first_rg = metadata["row_groups"][0]
print("Row group 0:")
print(f"  Rows: {first_rg['num_rows']}")
print(f"  Total size: {first_rg['total_byte_size']} bytes")

# Examine columns within the row group
for column in first_rg["columns"]:
    print(f"\n  Column: {column['name']}")
    print(f"    Compression: {column['compression_codec']}")
    print(f"    Null count: {column['null_count']}")
    print(f"    Min: {column['min']}")
    print(f"    Max: {column['max']}")
Schema-Only Mode
When you only need schema information, use schema_only=True for faster execution:
# Read schema without row group details
schema = parquet_meta.read_metadata(
    "large_file.parquet",
    schema_only=True,
)

# Schema columns are available
for col in schema["schema_columns"]:
    print(f"{col['name']}: {col['physical_type']}")

# Row groups are empty
print(f"Row groups: {len(schema['row_groups'])}")  # Will be 0
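Because schema-only reads are cheap, they are handy for comparing the layout of two files. The sketch below is not part of Rugo: the helpers `schema_as_dict` and `schema_differences` are hypothetical names, and the two hand-written dictionaries (including their type strings) only stand in for what `read_metadata(..., schema_only=True)` returns, using the `schema_columns`, `name`, and `physical_type` keys shown above.

```python
def schema_as_dict(metadata):
    """Map each column name to its physical type."""
    return {col["name"]: col["physical_type"] for col in metadata["schema_columns"]}


def schema_differences(left, right):
    """Describe how two schemas differ, keyed by column name."""
    a, b = schema_as_dict(left), schema_as_dict(right)
    return {
        "only_in_left": sorted(a.keys() - b.keys()),
        "only_in_right": sorted(b.keys() - a.keys()),
        "type_changed": sorted(n for n in a.keys() & b.keys() if a[n] != b[n]),
    }


# Hypothetical stand-ins for two schema-only metadata results
old = {"schema_columns": [
    {"name": "id", "physical_type": "INT32"},
    {"name": "name", "physical_type": "BYTE_ARRAY"},
]}
new = {"schema_columns": [
    {"name": "id", "physical_type": "INT64"},
    {"name": "email", "physical_type": "BYTE_ARRAY"},
]}

print(schema_differences(old, new))
```

In real use, you would pass the dictionaries returned by two schema-only reads instead of the sample data.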
Working with Statistics
Statistics help you understand data distribution without reading the full dataset:
metadata = parquet_meta.read_metadata("data.parquet")

for rg_idx, rg in enumerate(metadata["row_groups"]):
    print(f"\nRow Group {rg_idx}:")
    for col in rg["columns"]:
        if col["min"] is not None and col["max"] is not None:
            print(f"  {col['name']}: [{col['min']}, {col['max']}]")
            print(f"    Distinct values: {col['distinct_count']}")
            print(f"    Null values: {col['null_count']}")
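A common use of these statistics is row-group pruning: skipping row groups whose min/max range cannot contain a value you are looking for. The following is a minimal sketch, not a Rugo API. The helper name `row_groups_that_may_contain` and the sample dictionary are hypothetical, and the code assumes that `min` and `max` are directly comparable with the target value.

```python
def row_groups_that_may_contain(metadata, column_name, value):
    """Return indices of row groups whose [min, max] range could include value.

    Row groups with missing statistics are kept because they cannot be
    ruled out. Row groups that lack the column entirely are skipped.
    """
    candidates = []
    for rg_idx, rg in enumerate(metadata["row_groups"]):
        for col in rg["columns"]:
            if col["name"] != column_name:
                continue
            lo, hi = col["min"], col["max"]
            if lo is None or hi is None or lo <= value <= hi:
                candidates.append(rg_idx)
            break  # Only one chunk per column is expected in a row group
    return candidates


# Hypothetical sample shaped like the dictionaries in this guide
sample = {
    "row_groups": [
        {"columns": [{"name": "id", "min": 1, "max": 100}]},
        {"columns": [{"name": "id", "min": 101, "max": 200}]},
        {"columns": [{"name": "id", "min": None, "max": None}]},
    ]
}

print(row_groups_that_may_contain(sample, "id", 150))  # [1, 2]
```

Keeping row groups without statistics is the conservative choice: pruning should only discard data that provably cannot match.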
Skipping Statistics
When you don't need statistics, pass include_statistics=False for faster parsing:
metadata = parquet_meta.read_metadata(
    "file.parquet",
    include_statistics=False,
)

# Min/max values will be None
for col in metadata["row_groups"][0]["columns"]:
    assert col["min"] is None
    assert col["max"] is None
In-Memory Data
Read metadata from bytes or memoryview objects:
# From bytes
with open("example.parquet", "rb") as f:
    data = f.read()
from_bytes = parquet_meta.read_metadata_from_bytes(data)

# From memoryview (zero-copy)
from_view = parquet_meta.read_metadata_from_memoryview(memoryview(data))
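When a buffer comes from a network or cache, it can be useful to sanity-check it before parsing. The sketch below is independent of Rugo and relies only on the Parquet file layout: an unencrypted file begins and ends with the 4-byte magic `PAR1`, and the four bytes before the trailing magic hold the footer length as a little-endian unsigned integer. The helper name `footer_length` is hypothetical, and the sample buffer is a fake with filler bytes, not a real Parquet file.

```python
import struct

PARQUET_MAGIC = b"PAR1"


def footer_length(buffer):
    """Return the footer length in bytes, or None if the buffer
    does not look like an unencrypted Parquet file."""
    view = memoryview(buffer)  # No copy of the underlying data
    # Smallest plausible file: leading magic + 4-byte length + trailing magic
    if len(view) < 12:
        return None
    if bytes(view[:4]) != PARQUET_MAGIC or bytes(view[-4:]) != PARQUET_MAGIC:
        return None
    (length,) = struct.unpack("<I", view[-8:-4])
    return length


# Fake buffer with the right framing; the 10 filler bytes are not valid metadata
fake = PARQUET_MAGIC + b"\x00" * 10 + struct.pack("<I", 10) + PARQUET_MAGIC

print(footer_length(fake))            # 10
print(footer_length(b"not parquet"))  # None
```

A check like this only validates the framing; the metadata itself still has to be parsed by the reader.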
Common Patterns
Checking Compression
Determine which compression codecs a file uses:
metadata = parquet_meta.read_metadata("file.parquet")

codecs = set()
for rg in metadata["row_groups"]:
    for col in rg["columns"]:
        if col["compression_codec"]:
            codecs.add(col["compression_codec"])

print(f"Compression codecs used: {codecs}")
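If you also want to know how widely each codec is used, a counter works better than a set. This is a small sketch on top of the same loop. The helper name `codec_counts`, the `"UNKNOWN"` placeholder label, and the codec strings in the sample data are all hypothetical; the actual strings depend on what Rugo reports.

```python
from collections import Counter


def codec_counts(metadata):
    """Count column chunks per compression codec across all row groups."""
    counts = Counter()
    for rg in metadata["row_groups"]:
        for col in rg["columns"]:
            # Use a placeholder label when no codec is reported
            counts[col["compression_codec"] or "UNKNOWN"] += 1
    return counts


# Hypothetical sample shaped like the dictionaries in this guide
sample = {
    "row_groups": [
        {"columns": [
            {"name": "id", "compression_codec": "SNAPPY"},
            {"name": "payload", "compression_codec": "ZSTD"},
        ]},
        {"columns": [
            {"name": "id", "compression_codec": "SNAPPY"},
            {"name": "payload", "compression_codec": None},
        ]},
    ]
}

for codec, count in codec_counts(sample).most_common():
    print(f"{codec}: {count} column chunk(s)")
```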
Finding Column by Name
Locate specific column information:
def find_column(metadata, column_name, row_group_idx=0):
    """Find column metadata by name in the specified row group."""
    rg = metadata["row_groups"][row_group_idx]
    for col in rg["columns"]:
        if col["name"] == column_name:
            return col
    return None

col_info = find_column(metadata, "my_column")
if col_info is not None:
    print(f"Found: {col_info['name']}")
    print(f"  Type: {col_info['type']}")
    print(f"  Range: [{col_info['min']}, {col_info['max']}]")
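The lookup above inspects a single row group. To get a column's overall value range, you can combine the statistics from every row group. The helper `column_range` below is a hypothetical sketch, not a Rugo function, and the sample data is hand-written. Row groups without statistics are ignored, so the result may be narrower than the true range of the data.

```python
def column_range(metadata, column_name):
    """Return (min, max) for a column across all row groups,
    or None when no row group has statistics for it."""
    mins, maxs = [], []
    for rg in metadata["row_groups"]:
        for col in rg["columns"]:
            if col["name"] != column_name:
                continue
            if col["min"] is not None and col["max"] is not None:
                mins.append(col["min"])
                maxs.append(col["max"])
    if not mins:
        return None
    return min(mins), max(maxs)


# Hypothetical sample shaped like the dictionaries in this guide
sample = {
    "row_groups": [
        {"columns": [{"name": "price", "min": 5, "max": 40}]},
        {"columns": [{"name": "price", "min": 2, "max": 25}]},
        {"columns": [{"name": "price", "min": None, "max": None}]},
    ]
}

print(column_range(sample, "price"))    # (2, 40)
print(column_range(sample, "missing"))  # None
```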
Checking File Size
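Estimate the total data size from metadata. In the Parquet format, a row group's total_byte_size is the uncompressed size of its column data, so the result can differ from the size of the file on disk: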
total_bytes = sum(
    rg["total_byte_size"]
    for rg in metadata["row_groups"]
)
print(f"Total data size: {total_bytes / 1024 / 1024:.2f} MB")
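The same fields can give a rough bytes-per-row figure, which helps when comparing files of different lengths. The helper `size_summary` is a hypothetical sketch built only on the `num_rows` and `total_byte_size` keys used in this guide, and the sample numbers are made up.

```python
def size_summary(metadata):
    """Summarize row count, data size, and average bytes per row."""
    total_rows = sum(rg["num_rows"] for rg in metadata["row_groups"])
    total_bytes = sum(rg["total_byte_size"] for rg in metadata["row_groups"])
    # Guard against empty files to avoid dividing by zero
    avg = total_bytes / total_rows if total_rows else 0.0
    return {
        "total_rows": total_rows,
        "total_bytes": total_bytes,
        "avg_bytes_per_row": avg,
    }


# Hypothetical sample shaped like the dictionaries in this guide
sample = {
    "row_groups": [
        {"num_rows": 1000, "total_byte_size": 4000},
        {"num_rows": 500, "total_byte_size": 2000},
    ]
}

summary = size_summary(sample)
print(f"{summary['total_rows']} rows, {summary['avg_bytes_per_row']:.1f} bytes/row")
```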
Next Steps
- Reading Metadata - Detailed guide to metadata operations
- Metadata Structure - Complete structure reference
- Parsing Options - Advanced configuration