Return Types

Detailed type information for Rugo's return values.

Metadata Dictionary

The primary return type from all metadata reading functions:

MetadataDict = {
    "num_rows": int,
    "schema_columns": List[SchemaColumn],
    "row_groups": List[RowGroup]
}

Type Definitions

SchemaColumn

SchemaColumn = {
    "name": str,
    "physical_type": str,
    "logical_type": str,
    "nullable": bool
}

Fields:

  • name: Column name
  • physical_type: Physical storage type (INT32, INT64, BYTE_ARRAY, etc.)
  • logical_type: Logical/semantic type (STRING, INT(64,true), TIMESTAMP, etc.)
  • nullable: Whether column can contain null values

Example:

{
    "name": "user_id",
    "physical_type": "INT64",
    "logical_type": "INT(64,false)",
    "nullable": False
}

RowGroup

RowGroup = {
    "num_rows": int,
    "total_byte_size": int,
    "columns": List[ColumnChunk]
}

Fields:

  • num_rows: Number of rows in this row group
  • total_byte_size: Total compressed size in bytes
  • columns: List of column chunk metadata

Example:

{
    "num_rows": 10000,
    "total_byte_size": 524288,
    "columns": [...]
}
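Because row groups nest directly under the metadata dictionary, the per-group counts can be cross-checked against the top-level num_rows. A minimal sketch using a hand-built dict in the shape shown above (the values are illustrative, not from a real file):

```python
# Illustrative metadata dict shaped like Rugo's return value.
metadata = {
    "num_rows": 25000,
    "schema_columns": [],
    "row_groups": [
        {"num_rows": 10000, "total_byte_size": 524288, "columns": []},
        {"num_rows": 10000, "total_byte_size": 498432, "columns": []},
        {"num_rows": 5000, "total_byte_size": 261120, "columns": []},
    ],
}

# The sum of per-row-group counts should match the file-level total.
total = sum(rg["num_rows"] for rg in metadata["row_groups"])
assert total == metadata["num_rows"]
print(total)  # 25000
```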

ColumnChunk

ColumnChunk = {
    "name": str,
    "path_in_schema": str,
    "type": str,
    "logical_type": str,
    "num_values": Optional[int],
    "total_uncompressed_size": Optional[int],
    "total_compressed_size": Optional[int],
    "data_page_offset": Optional[int],
    "index_page_offset": Optional[int],
    "dictionary_page_offset": Optional[int],
    "min": Any,
    "max": Any,
    "null_count": Optional[int],
    "distinct_count": Optional[int],
    "bloom_offset": Optional[int],
    "bloom_length": Optional[int],
    "encodings": List[str],
    "compression_codec": Optional[str],
    "key_value_metadata": Optional[Dict[str, str]]
}

Fields:

  • name: Column name
  • path_in_schema: Full path for nested columns
  • type: Physical type
  • logical_type: Logical type
  • num_values: Number of values (including nulls)
  • total_uncompressed_size: Size before compression
  • total_compressed_size: Size after compression
  • data_page_offset: File offset to data pages
  • index_page_offset: File offset to index page (if present)
  • dictionary_page_offset: File offset to dictionary page (if present)
  • min: Minimum value (Python type or hex string)
  • max: Maximum value (Python type or hex string)
  • null_count: Number of null values
  • distinct_count: Approximate distinct value count
  • bloom_offset: Bloom filter file offset (if present)
  • bloom_length: Bloom filter size in bytes (if present)
  • encodings: List of encoding schemes used
  • compression_codec: Compression algorithm name (if used)
  • key_value_metadata: Custom key-value pairs (if present)

Example:

{
    "name": "user_id",
    "path_in_schema": "user_id",
    "type": "INT64",
    "logical_type": "INT(64,false)",
    "num_values": 10000,
    "total_uncompressed_size": 80000,
    "total_compressed_size": 45000,
    "data_page_offset": 4096,
    "index_page_offset": None,
    "dictionary_page_offset": None,
    "min": 1,
    "max": 999999,
    "null_count": 0,
    "distinct_count": 9876,
    "bloom_offset": None,
    "bloom_length": None,
    "encodings": ["PLAIN", "RLE"],
    "compression_codec": "SNAPPY",
    "key_value_metadata": None
}
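The two size fields make it easy to derive a compression ratio per chunk. A minimal sketch against a trimmed-down chunk dict; since both fields are Optional, the None case is guarded:

```python
# Illustrative column chunk with only the size fields shown above.
chunk = {
    "name": "user_id",
    "total_uncompressed_size": 80000,
    "total_compressed_size": 45000,
}

# Both sizes are Optional, so guard against None before dividing.
u = chunk.get("total_uncompressed_size")
c = chunk.get("total_compressed_size")
ratio = (u / c) if u and c else None
print(f"{chunk['name']}: {ratio:.2f}x")  # user_id: 1.78x
```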

Value Types

Statistics (min/max)

Statistics values are decoded to appropriate Python types:

Parquet Type         Python Type   Example
INT32                int           42
INT64                int           1234567890
FLOAT                float         3.14
DOUBLE               float         2.718281828
BYTE_ARRAY (string)  str           "hello"
BYTE_ARRAY (other)   str (hex)     "0x48656c6c6f"
BOOLEAN              bool          True

When decoding fails, values are returned as hexadecimal strings.
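A caller therefore has to handle both decoded values and the hex fallback. A minimal sketch, assuming the "0x" prefix shown in the table above; note a genuine string statistic that itself starts with "0x" would be misclassified by this naive check:

```python
def stat_value(v):
    """Normalise a min/max statistic: pass decoded values through,
    convert the hex-string fallback back to raw bytes."""
    if isinstance(v, str) and v.startswith("0x"):
        return bytes.fromhex(v[2:])  # undecoded raw bytes
    return v

print(stat_value(42))              # 42
print(stat_value("0x48656c6c6f"))  # b'Hello'
```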

Optional vs Required Fields

Fields marked Optional[T] may be None:

# These may be None
column["min"]  # None if include_statistics=False
column["max"]  # None if include_statistics=False
column["null_count"]  # None if not recorded
column["distinct_count"]  # None if not recorded
column["bloom_offset"]  # None if no bloom filter
column["compression_codec"]  # None if uncompressed
column["key_value_metadata"]  # None if no custom metadata

Fields without Optional are always present:

# Always present
column["name"]  # str
column["type"]  # str
column["encodings"]  # List[str], may be empty
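In practice this means distinguishing "value is None because it was not recorded" from a real zero or empty value. A minimal sketch against an illustrative chunk dict with some Optional fields unset:

```python
# Illustrative chunk with some Optional fields left as None.
column = {
    "name": "user_id",
    "type": "INT64",
    "encodings": ["PLAIN", "RLE"],
    "null_count": None,          # not recorded
    "compression_codec": None,   # uncompressed
}

# Treat a missing null_count as "unknown", not as zero.
nulls = column["null_count"]
label = "unknown" if nulls is None else str(nulls)
codec = column["compression_codec"] or "UNCOMPRESSED"
print(f"{column['name']}: nulls={label}, codec={codec}")
```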

Orso Types

When using Orso integration:

Relation

from orso.schema import Relation

relation: Relation = rugo_to_orso_schema(metadata, "table_name")

Properties:

  • relation.name (str): Table name
  • relation.schema (Schema): Orso schema object
  • relation.schema.columns (List[Column]): Column definitions

Schema Info

schema_info: List[Dict[str, Any]] = extract_schema_only(metadata)

Each entry:

{
    "name": str,
    "type": str,  # Orso type name
    "nullable": bool
}
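Since each entry is a flat dict, the list is easy to filter. A minimal sketch over a hand-written schema_info list; the Orso type names here ("INTEGER", "VARCHAR") are placeholders for illustration, not output captured from extract_schema_only:

```python
# Illustrative schema_info list in the shape shown above.
schema_info = [
    {"name": "user_id", "type": "INTEGER", "nullable": False},
    {"name": "email", "type": "VARCHAR", "nullable": True},
]

# Names of columns that can never be null.
required = [c["name"] for c in schema_info if not c["nullable"]]
print(required)  # ['user_id']
```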

Decoding Return Types

decode_column

values: Union[List[int], List[str]] = decode_column(path, column_name)

Returns:

  • List[int] for INT32/INT64 columns
  • List[str] for BYTE_ARRAY (string) columns

can_decode

result: bool = can_decode(path)

Returns:

  • True if file can be decoded
  • False otherwise

Type Aliases

For cleaner type hints:

from typing import TypedDict, List, Optional, Any, Union, Dict
from pathlib import Path

# Path types
PathLike = Union[str, Path]

# Metadata structure
class SchemaColumn(TypedDict):
    name: str
    physical_type: str
    logical_type: str
    nullable: bool

class ColumnChunk(TypedDict, total=False):
    name: str
    path_in_schema: str
    type: str
    logical_type: str
    num_values: Optional[int]
    total_uncompressed_size: Optional[int]
    total_compressed_size: Optional[int]
    data_page_offset: Optional[int]
    index_page_offset: Optional[int]
    dictionary_page_offset: Optional[int]
    min: Any
    max: Any
    null_count: Optional[int]
    distinct_count: Optional[int]
    bloom_offset: Optional[int]
    bloom_length: Optional[int]
    encodings: List[str]
    compression_codec: Optional[str]
    key_value_metadata: Optional[Dict[str, str]]

class RowGroup(TypedDict):
    num_rows: int
    total_byte_size: int
    columns: List[ColumnChunk]

class Metadata(TypedDict):
    num_rows: int
    schema_columns: List[SchemaColumn]
    row_groups: List[RowGroup]
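TypedDicts are erased at runtime: a plain dict literal satisfies them, and checkers like mypy verify the keys and value types statically. A minimal sketch, redeclaring only a trimmed subset of the aliases above to stay self-contained:

```python
from typing import List, TypedDict

# Trimmed versions of the aliases above, for illustration only.
class RowGroup(TypedDict):
    num_rows: int
    total_byte_size: int

class Metadata(TypedDict):
    num_rows: int
    row_groups: List[RowGroup]

# A plain dict literal satisfies the TypedDict at runtime; the
# annotation exists purely for static type checkers.
md: Metadata = {
    "num_rows": 100,
    "row_groups": [{"num_rows": 100, "total_byte_size": 4096}],
}
print(type(md) is dict)  # True
```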

Usage Examples

With Type Hints

from typing import Dict, Any, List

def analyze_schema(metadata: Dict[str, Any]) -> List[str]:
    """Extract column names from metadata."""
    return [col["name"] for col in metadata["schema_columns"]]

def get_row_group_sizes(metadata: Dict[str, Any]) -> List[int]:
    """Get size of each row group."""
    return [rg["total_byte_size"] for rg in metadata["row_groups"]]

Type Checking

import rugo.parquet as parquet_meta

metadata = parquet_meta.read_metadata("file.parquet")

# Type checking
assert isinstance(metadata, dict)
assert isinstance(metadata["num_rows"], int)
assert isinstance(metadata["schema_columns"], list)
assert isinstance(metadata["row_groups"], list)

# Schema columns
for col in metadata["schema_columns"]:
    assert isinstance(col["name"], str)
    assert isinstance(col["physical_type"], str)
    assert isinstance(col["nullable"], bool)

# Row groups
for rg in metadata["row_groups"]:
    assert isinstance(rg["num_rows"], int)
    assert isinstance(rg["columns"], list)

Next Steps