API Functions

Complete reference for all Rugo functions.

Metadata Reading Functions

read_metadata

Read metadata from a Parquet file.

read_metadata(
    path: str | Path,
    schema_only: bool = False,
    include_statistics: bool = True,
    max_row_groups: int = -1
) -> dict

Parameters:

  • path (str | Path): Path to Parquet file
  • schema_only (bool): Return only schema, skip row groups (default: False)
  • include_statistics (bool): Include min/max statistics (default: True)
  • max_row_groups (int): Limit row groups read, -1 for all (default: -1)

Returns: Dictionary with metadata structure

Example:

import rugo.parquet as parquet_meta

metadata = parquet_meta.read_metadata("file.parquet")
schema = parquet_meta.read_metadata("file.parquet", schema_only=True)
sample = parquet_meta.read_metadata("file.parquet", max_row_groups=5)

read_metadata_from_bytes

Read metadata from a bytes object holding a complete Parquet file.

read_metadata_from_bytes(
    data: bytes,
    schema_only: bool = False,
    include_statistics: bool = True,
    max_row_groups: int = -1
) -> dict

Parameters:

  • data (bytes): Parquet file as bytes
  • schema_only (bool): Return only schema (default: False)
  • include_statistics (bool): Include statistics (default: True)
  • max_row_groups (int): Limit row groups, -1 for all (default: -1)

Returns: Dictionary with metadata structure

Example:

with open("file.parquet", "rb") as f:
    data = f.read()

metadata = parquet_meta.read_metadata_from_bytes(data)
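When the bytes come from an untrusted source (a network response, an object store), a cheap sanity check before parsing can give a clearer error. The sketch below relies only on the Parquet file layout (a 4-byte `PAR1` magic at both ends and a 4-byte little-endian footer length before the trailing magic). `parquet_footer_length` is a hypothetical helper, not part of Rugo, and it rejects files written with an encrypted footer, which use a different magic.

```python
import struct

MAGIC = b"PAR1"


def parquet_footer_length(data):
    """Return the footer length in bytes, or None if data does not look like Parquet.

    Hypothetical helper for illustration; not a Rugo function.
    """
    # Smallest possible layout: head magic (4) + footer length (4) + tail magic (4).
    if len(data) < 12:
        return None
    if data[:4] != MAGIC or data[-4:] != MAGIC:
        return None
    # The footer length is stored as an unsigned little-endian 32-bit integer.
    (length,) = struct.unpack("<I", data[-8:-4])
    # The footer must fit between the two magic markers.
    if length > len(data) - 12:
        return None
    return length
```

A caller might skip `read_metadata_from_bytes` and report "not a Parquet file" when this returns None.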

read_metadata_from_memoryview

Read metadata from a memoryview without copying the underlying buffer (zero-copy).

read_metadata_from_memoryview(
    view: memoryview,
    schema_only: bool = False,
    include_statistics: bool = True,
    max_row_groups: int = -1
) -> dict

Parameters:

  • view (memoryview): Memory view of Parquet data
  • schema_only (bool): Return only schema (default: False)
  • include_statistics (bool): Include statistics (default: True)
  • max_row_groups (int): Limit row groups, -1 for all (default: -1)

Returns: Dictionary with metadata structure

Example:

with open("file.parquet", "rb") as f:
    data = f.read()

metadata = parquet_meta.read_metadata_from_memoryview(memoryview(data))
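Reading the file into a bytes object still loads all of it into memory before the view is created. One way to avoid that, sketched below, is to memory-map the file and wrap the mapping in a memoryview. `mapped_view` is a hypothetical helper built only on the standard library; that Rugo accepts a view over an `mmap` object is an assumption based on the `memoryview` parameter type, not something this page confirms.

```python
import mmap
from contextlib import contextmanager


@contextmanager
def mapped_view(path):
    """Yield a read-only memoryview over a memory-mapped file.

    Hypothetical helper for illustration; not a Rugo function.
    """
    with open(path, "rb") as f:
        # Length 0 maps the whole file; this raises ValueError for an empty file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            view = memoryview(mapped)
            try:
                yield view
            finally:
                # The view must be released before the mapping can be closed.
                view.release()


# Intended use (assumes Rugo is installed and imported as parquet_meta):
# with mapped_view("file.parquet") as view:
#     metadata = parquet_meta.read_metadata_from_memoryview(view)
```

Any slices taken from the view should be dropped before the `with` block ends, otherwise closing the mapping raises `BufferError`.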

Decoding Functions

Experimental

Decoder functions are experimental and support only a limited subset of Parquet files (see the limitations under decode_column).

can_decode

Check whether a file can be decoded with Rugo.

can_decode(path: str | Path) -> bool

Parameters:

  • path (str | Path): Path to Parquet file

Returns: True if file can be decoded, False otherwise

Example:

if parquet_meta.can_decode("file.parquet"):
    values = parquet_meta.decode_column("file.parquet", "column")

decode_column

Decode the values of one column from a Parquet file.

decode_column(
    path: str | Path,
    column_name: str
) -> list

Parameters:

  • path (str | Path): Path to Parquet file
  • column_name (str): Name of column to decode

Returns: List of Python values

Limitations:

  • Only uncompressed data
  • Only PLAIN encoding
  • Only int32, int64, string types
  • Only first row group
  • Only required (non-nullable) columns
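The limitations above can be expressed as a small checklist, which is handy when deciding in advance which columns to route to another reader. This is a minimal sketch: `ColumnInfo` and `unmet_limits` are hypothetical names, and the field values are assumed to be filled in by the caller from whatever metadata they have. In practice, `can_decode` remains the authoritative check.

```python
from dataclasses import dataclass

# Types listed as supported in the limitations above.
SUPPORTED_TYPES = {"int32", "int64", "string"}


@dataclass(frozen=True)
class ColumnInfo:
    """Hypothetical description of a column chunk; not a Rugo type."""
    compression: str
    encoding: str
    type_name: str
    nullable: bool
    row_group: int = 0


def unmet_limits(col):
    """Return the documented decoder limitations that this column violates."""
    problems = []
    if col.compression.lower() != "uncompressed":
        problems.append("compressed data")
    if col.encoding.upper() != "PLAIN":
        problems.append("non-PLAIN encoding")
    if col.type_name.lower() not in SUPPORTED_TYPES:
        problems.append("unsupported type")
    if col.row_group != 0:
        problems.append("not in first row group")
    if col.nullable:
        problems.append("nullable column")
    return problems
```

An empty list means the column fits every documented limitation; a non-empty list names the reasons to fall back to another reader.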

Example:

values = parquet_meta.decode_column("file.parquet", "user_id")
print(values)  # [1, 2, 3, 4, 5]

Orso Conversion Functions

Available when installed with pip install rugo[orso].

rugo_to_orso_schema

Convert Rugo metadata to an Orso Relation.

from rugo.converters.orso import rugo_to_orso_schema

rugo_to_orso_schema(
    metadata: dict,
    table_name: str
) -> Relation

Parameters:

  • metadata (dict): Metadata from read_metadata()
  • table_name (str): Name for the relation

Returns: Orso Relation object

Example:

metadata = parquet_meta.read_metadata("file.parquet")
relation = rugo_to_orso_schema(metadata, "my_table")

extract_schema_only

Extract schema in simplified format.

from rugo.converters.orso import extract_schema_only

extract_schema_only(metadata: dict) -> list

Parameters:

  • metadata (dict): Metadata from read_metadata()

Returns: List of column dictionaries

Example:

metadata = parquet_meta.read_metadata("file.parquet")
schema = extract_schema_only(metadata)

for col in schema:
    print(f"{col['name']}: {col['type']}")
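For lookups by column name, the list can be turned into a mapping. The sketch below assumes only the `name` and `type` keys used in the example above; `schema_to_mapping` is a hypothetical helper, and the sample list is hand-written stand-in data rather than real Rugo output.

```python
def schema_to_mapping(schema):
    """Map column name to type for a list of column dictionaries.

    Hypothetical helper; relies only on the 'name' and 'type' keys.
    """
    mapping = {}
    for col in schema:
        name = col["name"]
        if name in mapping:
            # Refuse to silently overwrite a duplicate column name.
            raise ValueError(f"duplicate column name: {name}")
        mapping[name] = col["type"]
    return mapping


# Stand-in data shaped like the example above; the type strings are placeholders.
sample_schema = [
    {"name": "user_id", "type": "int64"},
    {"name": "email", "type": "string"},
]
types_by_name = schema_to_mapping(sample_schema)
```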

Common Patterns

Read and Inspect

# Full metadata
metadata = parquet_meta.read_metadata("file.parquet")

# Schema only (faster)
schema = parquet_meta.read_metadata("file.parquet", schema_only=True)

# Sample (even faster)
sample = parquet_meta.read_metadata(
    "file.parquet",
    max_row_groups=5,
    include_statistics=False
)

From Memory

# From bytes
with open("file.parquet", "rb") as f:
    metadata = parquet_meta.read_metadata_from_bytes(f.read())

# From memoryview (zero-copy)
with open("file.parquet", "rb") as f:
    data = f.read()
metadata = parquet_meta.read_metadata_from_memoryview(memoryview(data))

Decode Data

# Check first
if parquet_meta.can_decode("file.parquet"):
    values = parquet_meta.decode_column("file.parquet", "column")
else:
    # Fall back to PyArrow and read only the column we need
    import pyarrow.parquet as pq
    table = pq.read_table("file.parquet", columns=["column"])
    values = table.column("column").to_pylist()
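The same check-then-fall-back pattern can be wrapped in one function so callers always get a list back. In this sketch the three callables are passed in rather than imported, which keeps it independent of any particular fallback library. `decode_or_fallback` and its parameter names are hypothetical, and catching a failure after `can_decode` returns True is a defensive assumption, given that the decoder is experimental.

```python
def decode_or_fallback(path, column, can_decode, decode_column, fallback):
    """Try the fast decoder first and use the fallback reader otherwise.

    Hypothetical helper. Expected callables:
      can_decode(path) -> bool
      decode_column(path, column) -> list
      fallback(path, column) -> list
    """
    if can_decode(path):
        try:
            return list(decode_column(path, column))
        except Exception:
            # The decoder is experimental, so treat any failure as "use the fallback".
            pass
    return list(fallback(path, column))


# Intended use (assumes Rugo and PyArrow are installed):
# import pyarrow.parquet as pq
# values = decode_or_fallback(
#     "file.parquet", "column",
#     parquet_meta.can_decode, parquet_meta.decode_column,
#     lambda p, c: pq.read_table(p, columns=[c]).column(c).to_pylist(),
# )
```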

Error Handling

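Any of these functions may raise an exception, for example FileNotFoundError when the path does not exist. Catch specific exceptions first and keep a general handler as a last resort: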

try:
    metadata = parquet_meta.read_metadata("file.parquet")
except FileNotFoundError:
    print("File not found")
except Exception as e:
    print(f"Error reading metadata: {e}")

try:
    values = parquet_meta.decode_column("file.parquet", "col")
except Exception as e:
    print(f"Cannot decode: {e}")
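When scanning many files, it is often more convenient to log failures and continue than to repeat the try/except at every call site. The following is a sketch under that assumption: `try_read` is a hypothetical wrapper, and the reader is passed in as an argument (for example `parquet_meta.read_metadata`), so nothing here depends on which exception types Rugo actually raises.

```python
import logging

logger = logging.getLogger(__name__)


def try_read(reader, path, **kwargs):
    """Call reader(path, **kwargs) and return its result, or None on failure.

    Hypothetical wrapper; failures are logged instead of raised.
    """
    try:
        return reader(path, **kwargs)
    except FileNotFoundError:
        logger.warning("File not found: %s", path)
    except Exception as exc:
        logger.warning("Could not read %s: %s", path, exc)
    return None


# Intended use (assumes Rugo is installed and imported as parquet_meta):
# metadata = try_read(parquet_meta.read_metadata, "file.parquet", schema_only=True)
```

Returning None keeps the loop simple, at the cost of hiding the reason for a failure from the caller; re-raising a custom exception is a reasonable alternative.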

Type Hints

from typing import Any, Union
from pathlib import Path

# Metadata return type
MetadataDict = dict[str, Any]

# Path types
PathLike = Union[str, Path]

# Example with type hints
def process_file(path: PathLike) -> MetadataDict:
    return parquet_meta.read_metadata(path)

Next Steps