# Return Types

Detailed type information for Rugo's return values.
## Metadata Dictionary

The primary return type from all metadata reading functions:
```python
MetadataDict = {
    "num_rows": int,
    "schema_columns": List[SchemaColumn],
    "row_groups": List[RowGroup]
}
```
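For instance, the dictionary returned by `read_metadata` (shown later under Type Checking) can be inspected directly; the file name here is illustrative:

```python
import rugo.parquet as parquet_meta

metadata = parquet_meta.read_metadata("example.parquet")

print(metadata["num_rows"])             # total row count
print(len(metadata["schema_columns"]))  # number of columns
print(len(metadata["row_groups"]))      # number of row groups
```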
## Type Definitions

### SchemaColumn

Fields:

- `name`: Column name
- `physical_type`: Physical storage type (`INT32`, `INT64`, `BYTE_ARRAY`, etc.)
- `logical_type`: Logical/semantic type (`STRING`, `INT(64,true)`, `TIMESTAMP`, etc.)
- `nullable`: Whether the column can contain null values
Example (illustrative values):
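```python
# Mirrors the user_id ColumnChunk example later on this page
{
    "name": "user_id",
    "physical_type": "INT64",
    "logical_type": "INT(64,false)",
    "nullable": False
}
```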
### RowGroup

Fields:

- `num_rows`: Number of rows in this row group
- `total_byte_size`: Total compressed size in bytes
- `columns`: List of column chunk metadata
Example (illustrative values):
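```python
{
    "num_rows": 10000,
    "total_byte_size": 45000,
    "columns": [...]  # one ColumnChunk dict per column; see below
}
```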
### ColumnChunk
```python
ColumnChunk = {
    "name": str,
    "path_in_schema": str,
    "type": str,
    "logical_type": str,
    "num_values": Optional[int],
    "total_uncompressed_size": Optional[int],
    "total_compressed_size": Optional[int],
    "data_page_offset": Optional[int],
    "index_page_offset": Optional[int],
    "dictionary_page_offset": Optional[int],
    "min": Any,
    "max": Any,
    "null_count": Optional[int],
    "distinct_count": Optional[int],
    "bloom_offset": Optional[int],
    "bloom_length": Optional[int],
    "encodings": List[str],
    "compression_codec": Optional[str],
    "key_value_metadata": Optional[Dict[str, str]]
}
```
Fields:
- `name`: Column name
- `path_in_schema`: Full path for nested columns
- `type`: Physical type
- `logical_type`: Logical type
- `num_values`: Number of values (including nulls)
- `total_uncompressed_size`: Size before compression
- `total_compressed_size`: Size after compression
- `data_page_offset`: File offset to data pages
- `index_page_offset`: File offset to index page (if present)
- `dictionary_page_offset`: File offset to dictionary page (if present)
- `min`: Minimum value (Python type or hex string)
- `max`: Maximum value (Python type or hex string)
- `null_count`: Number of null values
- `distinct_count`: Approximate distinct value count
- `bloom_offset`: Bloom filter file offset (if present)
- `bloom_length`: Bloom filter size in bytes (if present)
- `encodings`: List of encoding schemes used
- `compression_codec`: Compression algorithm name (if used)
- `key_value_metadata`: Custom key-value pairs (if present)
Example:
```python
{
    "name": "user_id",
    "path_in_schema": "user_id",
    "type": "INT64",
    "logical_type": "INT(64,false)",
    "num_values": 10000,
    "total_uncompressed_size": 80000,
    "total_compressed_size": 45000,
    "data_page_offset": 4096,
    "index_page_offset": None,
    "dictionary_page_offset": None,
    "min": 1,
    "max": 999999,
    "null_count": 0,
    "distinct_count": 9876,
    "bloom_offset": None,
    "bloom_length": None,
    "encodings": ["PLAIN", "RLE"],
    "compression_codec": "SNAPPY",
    "key_value_metadata": None
}
```
## Value Types

### Statistics (min/max)

Statistics values are decoded to appropriate Python types:
| Parquet Type | Python Type | Example |
|---|---|---|
| INT32 | int | 42 |
| INT64 | int | 1234567890 |
| FLOAT | float | 3.14 |
| DOUBLE | float | 2.718281828 |
| BYTE_ARRAY (string) | str | "hello" |
| BYTE_ARRAY (other) | str (hex) | "0x48656c6c6f" |
| BOOLEAN | bool | True |
When decoding fails, values are returned as hexadecimal strings.
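Downstream code can branch on this fallback. A minimal sketch, assuming any statistic string with a `0x` prefix is an undecoded raw value (a genuine string statistic could, in principle, also start with `0x`):

```python
def format_stat(value) -> str:
    """Render a min/max statistic, flagging undecoded hex fallbacks."""
    if isinstance(value, str) and value.startswith("0x"):
        return f"<raw bytes {value}>"  # decoding failed; raw hex string
    return repr(value)
```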
### Optional vs Required Fields

Fields marked `Optional[T]` may be `None`:
```python
# These may be None
column["min"]                 # None if include_statistics=False
column["max"]                 # None if include_statistics=False
column["null_count"]          # None if not recorded
column["distinct_count"]      # None if not recorded
column["bloom_offset"]        # None if no bloom filter
column["compression_codec"]   # None if uncompressed
column["key_value_metadata"]  # None if no custom metadata
```
Fields without `Optional` are always present:
```python
# Always present
column["name"]       # str
column["type"]       # str
column["encodings"]  # List[str], may be empty
```
## Orso Types

When using Orso integration:

### Relation

Properties:
- `relation.name` (str): Table name
- `relation.schema` (Schema): Orso schema object
- `relation.schema.columns` (List[Column]): Column definitions
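A short sketch built on these properties (the `name` attribute on individual column objects is an assumption, not documented here):

```python
def describe(relation) -> str:
    """Summarize an Orso relation as 'table: col1, col2, ...'."""
    names = [col.name for col in relation.schema.columns]  # col.name assumed
    return f"{relation.name}: {', '.join(names)}"
```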
### Schema Info
Each entry:
## Decoding Return Types

### decode_column

Returns:
- `List[int]` for INT32/INT64 columns
- `List[str]` for BYTE_ARRAY (string) columns
### can_decode

Returns:
- `True` if the file can be decoded
- `False` otherwise
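A typical guard pairs the two calls. This sketch assumes `decode_column` takes a path and a column name, which is not confirmed on this page:

```python
import rugo.parquet as parquet_meta

path = "example.parquet"
if parquet_meta.can_decode(path):
    values = parquet_meta.decode_column(path, "user_id")  # signature assumed
    print(values[:5])
```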
## Type Aliases

For cleaner type hints:
```python
from typing import TypedDict, List, Optional, Any, Union
from pathlib import Path

# Path types
PathLike = Union[str, Path]

# Metadata structure
class SchemaColumn(TypedDict):
    name: str
    physical_type: str
    logical_type: str
    nullable: bool

class ColumnChunk(TypedDict, total=False):
    name: str
    path_in_schema: str
    type: str
    logical_type: str
    num_values: Optional[int]
    total_uncompressed_size: Optional[int]
    total_compressed_size: Optional[int]
    data_page_offset: Optional[int]
    index_page_offset: Optional[int]
    dictionary_page_offset: Optional[int]
    min: Any
    max: Any
    null_count: Optional[int]
    distinct_count: Optional[int]
    bloom_offset: Optional[int]
    bloom_length: Optional[int]
    encodings: List[str]
    compression_codec: Optional[str]
    key_value_metadata: Optional[dict[str, str]]

class RowGroup(TypedDict):
    num_rows: int
    total_byte_size: int
    columns: List[ColumnChunk]

class Metadata(TypedDict):
    num_rows: int
    schema_columns: List[SchemaColumn]
    row_groups: List[RowGroup]
```
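With these aliases in place, functions over the metadata can be annotated precisely, for example:

```python
def largest_row_group(meta: Metadata) -> RowGroup:
    """Return the row group with the largest compressed size."""
    return max(meta["row_groups"], key=lambda rg: rg["total_byte_size"])
```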
## Usage Examples

### With Type Hints
```python
from typing import Dict, Any, List

def analyze_schema(metadata: Dict[str, Any]) -> List[str]:
    """Extract column names from metadata."""
    return [col["name"] for col in metadata["schema_columns"]]

def get_row_group_sizes(metadata: Dict[str, Any]) -> List[int]:
    """Get size of each row group."""
    return [rg["total_byte_size"] for rg in metadata["row_groups"]]
```
### Type Checking
```python
import rugo.parquet as parquet_meta

metadata = parquet_meta.read_metadata("file.parquet")

# Top-level structure
assert isinstance(metadata, dict)
assert isinstance(metadata["num_rows"], int)
assert isinstance(metadata["schema_columns"], list)
assert isinstance(metadata["row_groups"], list)

# Schema columns
for col in metadata["schema_columns"]:
    assert isinstance(col["name"], str)
    assert isinstance(col["physical_type"], str)
    assert isinstance(col["nullable"], bool)

# Row groups
for rg in metadata["row_groups"]:
    assert isinstance(rg["num_rows"], int)
    assert isinstance(rg["columns"], list)
```
## Next Steps
- Functions - Complete function reference
- Metadata Structure - Detailed field descriptions