Metadata Structure
Complete reference for the metadata dictionary returned by Rugo.
Top Level Structure
Fields
num_rows
Total number of rows in the Parquet file.
schema_columns
List of schema column definitions. Each entry describes a column in the file.
row_groups
List of row group metadata. Each Parquet file contains one or more row groups.
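How the three top-level fields relate can be sketched as follows. The dictionary is hand-built to mirror the fields described above, with illustrative values rather than output from a real file. `check_row_counts` is a hypothetical helper, not part of Rugo.

```python
# Hand-built stand-in for the metadata dictionary (illustrative values only).
metadata = {
    "num_rows": 15000,
    "schema_columns": [
        {"name": "user_id", "physical_type": "INT64",
         "logical_type": "INT(64,false)", "nullable": False},
    ],
    "row_groups": [
        {"num_rows": 10000, "total_byte_size": 45000, "columns": []},
        {"num_rows": 5000, "total_byte_size": 23000, "columns": []},
    ],
}


def check_row_counts(metadata):
    """Return True if the per-row-group counts add up to the file total."""
    group_total = sum(rg["num_rows"] for rg in metadata["row_groups"])
    return group_total == metadata["num_rows"]


print(check_row_counts(metadata))
```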
Schema Column Structure
Each entry in schema_columns contains:
{
    "name": str,            # Column name
    "physical_type": str,   # Physical storage type
    "logical_type": str,    # Logical/semantic type
    "nullable": bool        # Whether column allows nulls
}
Example
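A minimal sketch of working with schema_columns, assuming a hand-built list shaped like the structure above. The column names and values are illustrative, not taken from a real file.

```python
# Hand-built schema_columns list (illustrative values only).
schema_columns = [
    {"name": "user_id", "physical_type": "INT64",
     "logical_type": "INT(64,false)", "nullable": False},
    {"name": "email", "physical_type": "BYTE_ARRAY",
     "logical_type": "STRING", "nullable": True},
]

# Map each column name to its physical storage type.
types_by_name = {col["name"]: col["physical_type"] for col in schema_columns}

# Collect the names of columns that allow nulls.
nullable_names = [col["name"] for col in schema_columns if col["nullable"]]

print(types_by_name)
print(nullable_names)
```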
Physical Types
Common physical types:
- BOOLEAN
- INT32
- INT64
- FLOAT
- DOUBLE
- BYTE_ARRAY
- FIXED_LEN_BYTE_ARRAY
Logical Types
Common logical types:
- STRING - UTF-8 strings
- INT(bitWidth, signed) - Integer types
- TIMESTAMP(isAdjustedToUTC, unit) - Timestamps
- DATE - Date values
- DECIMAL(precision, scale) - Decimal numbers
- JSON - JSON data
- UUID - UUID values
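Parameterized logical types arrive as strings, so code that needs the parameters has to split them out. The sketch below assumes the string always has the form NAME or NAME(arg1, arg2), as in "INT(64,false)". `parse_logical_type` is a hypothetical helper, not part of Rugo.

```python
import re


def parse_logical_type(text):
    """Split 'NAME(arg1, arg2)' into ('NAME', ['arg1', 'arg2']).

    Strings that do not match the assumed form are returned unchanged
    with an empty argument list.
    """
    match = re.fullmatch(r"\s*([A-Za-z_]+)\s*(?:\((.*)\))?\s*", text)
    if match is None:
        return text, []
    name, args = match.group(1), match.group(2)
    if args is None or not args.strip():
        return name, []
    return name, [arg.strip() for arg in args.split(",")]


print(parse_logical_type("INT(64,false)"))
print(parse_logical_type("STRING"))
```

The arguments are kept as strings, because their meaning depends on the type (a bit width for INT, a precision and scale for DECIMAL).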
Row Group Structure
Each row group contains:
{
    "num_rows": int,          # Rows in this row group
    "total_byte_size": int,   # Total compressed size in bytes
    "columns": [...]          # List of column chunks
}
Example
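A minimal sketch of summarizing row groups, assuming a hand-built list shaped like the structure above. The values are illustrative and the columns lists are left empty for brevity.

```python
# Hand-built row_groups list (illustrative values only).
row_groups = [
    {"num_rows": 10000, "total_byte_size": 45000, "columns": []},
    {"num_rows": 5000, "total_byte_size": 23000, "columns": []},
]

# Total size across all row groups.
total_bytes = sum(rg["total_byte_size"] for rg in row_groups)

# Index of the row group holding the most rows.
largest = max(range(len(row_groups)), key=lambda i: row_groups[i]["num_rows"])

# Average bytes per row over the whole file.
avg_bytes_per_row = total_bytes / sum(rg["num_rows"] for rg in row_groups)

print(total_bytes, largest, round(avg_bytes_per_row, 2))
```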
Column Chunk Structure
Each column in a row group contains detailed metadata:
{
    "name": str,                                    # Column name
    "path_in_schema": str,                          # Full path for nested columns
    "type": str,                                    # Physical type
    "logical_type": str,                            # Logical type
    "num_values": Optional[int],                    # Number of values
    "total_uncompressed_size": Optional[int],       # Uncompressed size
    "total_compressed_size": Optional[int],         # Compressed size
    "data_page_offset": Optional[int],              # Offset to data pages
    "index_page_offset": Optional[int],             # Offset to index page
    "dictionary_page_offset": Optional[int],        # Offset to dictionary page
    "min": Any,                                     # Minimum value
    "max": Any,                                     # Maximum value
    "null_count": Optional[int],                    # Number of null values
    "distinct_count": Optional[int],                # Number of distinct values
    "bloom_offset": Optional[int],                  # Bloom filter offset
    "bloom_length": Optional[int],                  # Bloom filter length
    "encodings": List[str],                         # Encoding schemes used
    "compression_codec": Optional[str],             # Compression codec
    "key_value_metadata": Optional[Dict[str, str]]  # Custom metadata
}
Field Details
Statistics Fields
- min/max: Decoded to Python types when possible, otherwise hex strings
- null_count: Number of null values in this column chunk
- distinct_count: Approximate number of distinct values
- num_values: Total number of values (including nulls)
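One common use of these statistics is skipping column chunks that cannot contain a value being searched for. The sketch below assumes plain dictionaries shaped like a column chunk. `may_contain` and `null_fraction` are hypothetical helpers, not part of Rugo.

```python
def may_contain(column_chunk, value):
    """Return False only when min/max statistics rule the value out."""
    lo, hi = column_chunk.get("min"), column_chunk.get("max")
    if lo is None or hi is None:
        return True  # no statistics, so nothing can be ruled out
    try:
        return lo <= value <= hi
    except TypeError:
        # Statistics are not comparable with the value, for example when
        # min/max fell back to hex strings.
        return True


def null_fraction(column_chunk):
    """Return null_count / num_values, or None if either is unavailable."""
    nulls = column_chunk.get("null_count")
    total = column_chunk.get("num_values")
    if nulls is None or not total:
        return None
    return nulls / total


chunk = {"min": 1, "max": 999999, "null_count": 0, "num_values": 10000}
print(may_contain(chunk, 500), may_contain(chunk, 0), null_fraction(chunk))
```

Both helpers answer conservatively when statistics are missing: a chunk is kept rather than skipped, and the null fraction is reported as unknown rather than zero.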
Size and Offset Fields
- total_compressed_size: Size after compression
- total_uncompressed_size: Size before compression
- data_page_offset: File offset where data pages begin
- dictionary_page_offset: File offset for dictionary page (if present)
- index_page_offset: File offset for index page (if present)
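The two size fields together give a per-column compression ratio. The sketch below assumes a plain dictionary shaped like a column chunk. `compression_ratio` is a hypothetical helper, not part of Rugo.

```python
def compression_ratio(column_chunk):
    """Return uncompressed / compressed size, or None if it cannot be computed."""
    compressed = column_chunk.get("total_compressed_size")
    uncompressed = column_chunk.get("total_uncompressed_size")
    if not compressed or uncompressed is None:
        return None  # missing sizes, or a zero compressed size
    return uncompressed / compressed


chunk = {"total_uncompressed_size": 80000, "total_compressed_size": 45000}
ratio = compression_ratio(chunk)
print(f"{ratio:.2f}x")
```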
Encoding and Compression
- encodings: List of encoding types used (e.g., ["PLAIN", "RLE"])
- compression_codec: Compression algorithm (e.g., "SNAPPY", "GZIP", "ZSTD")
Bloom Filters
- bloom_offset: File offset to bloom filter
- bloom_length: Size of bloom filter in bytes
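Taken together, the two fields describe a byte range in the file. The sketch below assumes a plain dictionary shaped like a column chunk and treats the range as starting at bloom_offset and spanning bloom_length bytes. `bloom_filter_range` is a hypothetical helper, not part of Rugo.

```python
def bloom_filter_range(column_chunk):
    """Return (start, end) byte offsets of the bloom filter, or None if absent."""
    offset = column_chunk.get("bloom_offset")
    length = column_chunk.get("bloom_length")
    if offset is None or length is None:
        return None
    return offset, offset + length  # end offset is exclusive


print(bloom_filter_range({"bloom_offset": 1000, "bloom_length": 256}))
print(bloom_filter_range({"bloom_offset": None, "bloom_length": None}))
```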
Example Column Chunk
{
    "name": "user_id",
    "path_in_schema": "user_id",
    "type": "INT64",
    "logical_type": "INT(64,false)",
    "num_values": 10000,
    "total_uncompressed_size": 80000,
    "total_compressed_size": 45000,
    "data_page_offset": 4096,
    "index_page_offset": None,
    "dictionary_page_offset": None,
    "min": 1,
    "max": 999999,
    "null_count": 0,
    "distinct_count": 9876,
    "bloom_offset": None,
    "bloom_length": None,
    "encodings": ["PLAIN", "RLE"],
    "compression_codec": "SNAPPY",
    "key_value_metadata": None
}
None Values
Fields may be None when:
- Not present in the source Parquet file
- Disabled by parsing options (include_statistics=False)
- Not applicable for the data type
- Statistics not computed during file creation
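Because legitimate values such as a null_count of 0 are falsy in Python, code that handles missing fields should test for None explicitly rather than for truthiness. A minimal sketch, assuming a plain dictionary shaped like a column chunk; `describe_chunk` is a hypothetical helper, not part of Rugo.

```python
def describe_chunk(column_chunk):
    """Format a column chunk's statistics, showing 'n/a' for missing fields."""

    def show(key):
        value = column_chunk.get(key)
        # Compare against None so that 0 and "" are still displayed.
        return "n/a" if value is None else str(value)

    return (
        f"{column_chunk['name']}: "
        f"min={show('min')} max={show('max')} nulls={show('null_count')}"
    )


print(describe_chunk({"name": "user_id", "min": 1, "max": 999999, "null_count": 0}))
print(describe_chunk({"name": "user_id", "min": None, "max": None, "null_count": None}))
```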
Accessing Nested Data
For nested schemas, use the path_in_schema field to identify the full column path:
# Nested structure example
{
    "name": "address.city",
    "path_in_schema": "address.city",
    "type": "BYTE_ARRAY",
    "logical_type": "STRING",
    ...
}
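Grouping leaf columns under their parent can be sketched as follows. This assumes path components are separated by dots, as in the example above, so a column whose own name contains a dot would be split incorrectly. `group_by_parent` is a hypothetical helper, not part of Rugo.

```python
from collections import defaultdict


def group_by_parent(columns):
    """Group leaf names by parent path; top-level columns use the key ''."""
    groups = defaultdict(list)
    for col in columns:
        # rpartition splits on the last dot: "address.city" -> ("address", ".", "city")
        parent, _, leaf = col["path_in_schema"].rpartition(".")
        groups[parent].append(leaf)
    return dict(groups)


# Hand-built column chunks (illustrative values only).
columns = [
    {"name": "address.city", "path_in_schema": "address.city"},
    {"name": "address.zip", "path_in_schema": "address.zip"},
    {"name": "user_id", "path_in_schema": "user_id"},
]
grouped = group_by_parent(columns)
print(grouped)
```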
Complete Example
import rugo.parquet as parquet_meta

metadata = parquet_meta.read_metadata("example.parquet")

# Top level info
print(f"Total rows: {metadata['num_rows']}")

# Schema information
for col in metadata["schema_columns"]:
    print(f"Column: {col['name']} ({col['physical_type']})")

# Row group details
for idx, rg in enumerate(metadata["row_groups"]):
    print(f"\nRow Group {idx}:")
    print(f" Rows: {rg['num_rows']}")
    print(f" Size: {rg['total_byte_size']} bytes")

    # Column statistics
    for col in rg["columns"]:
        print(f" {col['name']}:")
        print(f" Range: [{col['min']}, {col['max']}]")
        print(f" Nulls: {col['null_count']}")
        print(f" Compression: {col['compression_codec']}")
Next Steps
- Parsing Options - Control what metadata is returned
- API Reference - Function signatures and options