Metadata Structure

Complete reference for the metadata dictionary returned by Rugo.

Top Level Structure

{
    "num_rows": int,
    "schema_columns": [...],
    "row_groups": [...]
}

Fields

num_rows

Total number of rows in the Parquet file.

metadata["num_rows"]  # e.g., 1000000

schema_columns

List of schema column definitions. Each entry describes a column in the file.

metadata["schema_columns"]  # List of column schemas

row_groups

List of row group metadata. Each Parquet file contains one or more row groups.

metadata["row_groups"]  # List of row group details
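
The three top-level fields combine naturally into a quick summary. A minimal sketch, using a hypothetical metadata dict shaped like the structure above (the `summarize` helper and sample values are illustrative, not part of Rugo's API):

```python
def summarize(metadata: dict) -> str:
    """One-line summary built from the three top-level fields."""
    return (f"{metadata['num_rows']} rows, "
            f"{len(metadata['schema_columns'])} columns, "
            f"{len(metadata['row_groups'])} row groups")

# Hypothetical metadata dict shaped like the top-level structure above
meta = {
    "num_rows": 1000000,
    "schema_columns": [{"name": "user_id"}],
    "row_groups": [{}, {}],
}
print(summarize(meta))  # 1000000 rows, 1 columns, 2 row groups
```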

Schema Column Structure

Each entry in schema_columns contains:

{
    "name": str,              # Column name
    "physical_type": str,     # Physical storage type
    "logical_type": str,      # Logical/semantic type
    "nullable": bool          # Whether column allows nulls
}

Example

{
    "name": "user_id",
    "physical_type": "INT64",
    "logical_type": "INT(64,false)",
    "nullable": True
}
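
Since each entry is a plain dict with these four keys, it is straightforward to build a lookup table from column name to type information. A sketch, assuming the schema-column shape documented above (the `column_types` helper is illustrative):

```python
def column_types(schema_columns):
    """Map each column name to its (physical_type, logical_type) pair."""
    return {c["name"]: (c["physical_type"], c["logical_type"])
            for c in schema_columns}

cols = [{"name": "user_id", "physical_type": "INT64",
         "logical_type": "INT(64,false)", "nullable": True}]
print(column_types(cols))  # {'user_id': ('INT64', 'INT(64,false)')}
```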

Physical Types

Common physical types:

  • BOOLEAN
  • INT32
  • INT64
  • FLOAT
  • DOUBLE
  • BYTE_ARRAY
  • FIXED_LEN_BYTE_ARRAY

Logical Types

Common logical types:

  • STRING - UTF-8 strings
  • INT(bitWidth, signed) - Integer types
  • TIMESTAMP(isAdjustedToUTC, unit) - Timestamps
  • DATE - Date values
  • DECIMAL(precision, scale) - Decimal numbers
  • JSON - JSON data
  • UUID - UUID values
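
Parameterized logical types such as `INT(bitWidth, signed)` arrive as strings and can be split apart if you need the parameters. A sketch, assuming the `INT(64,false)` rendering shown in the examples on this page (the `parse_int_logical` helper is illustrative):

```python
def parse_int_logical(logical_type):
    """Split 'INT(bitWidth,signed)' into (bit_width, is_signed); None for other types."""
    if not (logical_type.startswith("INT(") and logical_type.endswith(")")):
        return None
    bits, signed = logical_type[4:-1].split(",")
    return int(bits), signed.strip().lower() == "true"

print(parse_int_logical("INT(64,false)"))  # (64, False)
print(parse_int_logical("STRING"))         # None
```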

Row Group Structure

Each row group contains:

{
    "num_rows": int,           # Rows in this row group
    "total_byte_size": int,    # Total uncompressed size of the column data, in bytes
    "columns": [...]           # List of column chunks
}

Example

{
    "num_rows": 10000,
    "total_byte_size": 524288,
    "columns": [...]
}
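
Per-row-group counts and sizes can be aggregated to cross-check the file-level totals. A sketch over dicts shaped like the row-group structure above (the `row_group_totals` helper and sample values are illustrative):

```python
def row_group_totals(row_groups):
    """Aggregate num_rows and total_byte_size across all row groups."""
    rows = sum(rg["num_rows"] for rg in row_groups)
    size = sum(rg["total_byte_size"] for rg in row_groups)
    return rows, size

groups = [{"num_rows": 10000, "total_byte_size": 524288},
          {"num_rows": 8000, "total_byte_size": 430000}]
print(row_group_totals(groups))  # (18000, 954288)
```

The summed `num_rows` should match the top-level `num_rows` for a well-formed file.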

Column Chunk Structure

Each column in a row group contains detailed metadata:

{
    "name": str,                           # Column name
    "path_in_schema": str,                 # Full path for nested columns
    "type": str,                           # Physical type
    "logical_type": str,                   # Logical type
    "num_values": Optional[int],           # Number of values
    "total_uncompressed_size": Optional[int],  # Uncompressed size
    "total_compressed_size": Optional[int],    # Compressed size
    "data_page_offset": Optional[int],     # Offset to data pages
    "index_page_offset": Optional[int],    # Offset to index page
    "dictionary_page_offset": Optional[int],   # Offset to dictionary page
    "min": Any,                            # Minimum value
    "max": Any,                            # Maximum value
    "null_count": Optional[int],           # Number of null values
    "distinct_count": Optional[int],       # Number of distinct values
    "bloom_offset": Optional[int],         # Bloom filter offset
    "bloom_length": Optional[int],         # Bloom filter length
    "encodings": List[str],                # Encoding schemes used
    "compression_codec": Optional[str],    # Compression codec
    "key_value_metadata": Optional[Dict[str, str]]  # Custom metadata
}

Field Details

Statistics Fields

  • min/max: Decoded to Python types when possible, otherwise hex strings
  • null_count: Number of null values in this column chunk
  • distinct_count: Approximate number of distinct values
  • num_values: Total number of values (including nulls)

Size and Offset Fields

  • total_compressed_size: Size after compression
  • total_uncompressed_size: Size before compression
  • data_page_offset: File offset where data pages begin
  • dictionary_page_offset: File offset for dictionary page (if present)
  • index_page_offset: File offset for column index (if present)
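
The two size fields together give a per-chunk compression ratio, as long as you guard against missing values. A sketch, assuming the column-chunk shape above (the `compression_ratio` helper is illustrative):

```python
def compression_ratio(chunk):
    """Uncompressed-to-compressed ratio; None when either size is missing or zero."""
    u = chunk.get("total_uncompressed_size")
    c = chunk.get("total_compressed_size")
    if not u or not c:
        return None
    return u / c

chunk = {"total_uncompressed_size": 80000, "total_compressed_size": 45000}
print(round(compression_ratio(chunk), 2))  # 1.78
```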

Encoding and Compression

  • encodings: List of encoding types used (e.g., ["PLAIN", "RLE"])
  • compression_codec: Compression algorithm (e.g., "SNAPPY", "GZIP", "ZSTD")

Bloom Filters

  • bloom_offset: File offset to bloom filter
  • bloom_length: Size of bloom filter in bytes
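
Both fields are `None` when a column chunk has no bloom filter, so checking for one reduces to a pair of `None` tests. A sketch (the `has_bloom_filter` helper and sample offsets are illustrative):

```python
def has_bloom_filter(chunk):
    """A chunk carries a bloom filter only when both offset and length are present."""
    return (chunk.get("bloom_offset") is not None
            and chunk.get("bloom_length") is not None)

print(has_bloom_filter({"bloom_offset": None, "bloom_length": None}))  # False
print(has_bloom_filter({"bloom_offset": 8192, "bloom_length": 256}))   # True
```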

Example Column Chunk

{
    "name": "user_id",
    "path_in_schema": "user_id",
    "type": "INT64",
    "logical_type": "INT(64,false)",
    "num_values": 10000,
    "total_uncompressed_size": 80000,
    "total_compressed_size": 45000,
    "data_page_offset": 4096,
    "index_page_offset": None,
    "dictionary_page_offset": None,
    "min": 1,
    "max": 999999,
    "null_count": 0,
    "distinct_count": 9876,
    "bloom_offset": None,
    "bloom_length": None,
    "encodings": ["PLAIN", "RLE"],
    "compression_codec": "SNAPPY",
    "key_value_metadata": None
}
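
The `min`/`max` statistics are what make row-group pruning possible: a chunk whose range excludes the value you are looking for can be skipped without reading its data pages. A minimal sketch, assuming the chunk shape above; note that chunks with missing statistics must be kept, since absence of stats proves nothing (the `chunk_may_contain` helper is illustrative):

```python
def chunk_may_contain(chunk, value):
    """True when the chunk's [min, max] range could contain value.

    Chunks lacking statistics (min or max is None) must be kept.
    """
    lo, hi = chunk.get("min"), chunk.get("max")
    if lo is None or hi is None:
        return True
    return lo <= value <= hi

chunk = {"min": 1, "max": 999999, "null_count": 0}
print(chunk_may_contain(chunk, 500))      # True
print(chunk_may_contain(chunk, 2000000))  # False
```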

None Values

Fields may be None when:

  • Not present in the source Parquet file
  • Disabled by parsing options (include_statistics=False)
  • Not applicable for the data type
  • Statistics not computed during file creation

Accessing Nested Data

For nested schemas, use the path_in_schema field to identify the full column path:

# Nested structure example
{
    "name": "address.city",
    "path_in_schema": "address.city",
    "type": "BYTE_ARRAY",
    "logical_type": "STRING",
    ...
}
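
Because nested paths use dotted notation, all chunks belonging to one struct can be selected by prefix. A sketch over a row-group dict shaped like the structures above (the `columns_under` helper and sample paths are illustrative):

```python
def columns_under(row_group, prefix):
    """Select column chunks whose path_in_schema sits at or below a dotted prefix."""
    return [c for c in row_group["columns"]
            if c["path_in_schema"] == prefix
            or c["path_in_schema"].startswith(prefix + ".")]

rg = {"columns": [{"path_in_schema": "address.city"},
                  {"path_in_schema": "address.zip"},
                  {"path_in_schema": "user_id"}]}
print([c["path_in_schema"] for c in columns_under(rg, "address")])
# ['address.city', 'address.zip']
```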

Complete Example

import rugo.parquet as parquet_meta

metadata = parquet_meta.read_metadata("example.parquet")

# Top level info
print(f"Total rows: {metadata['num_rows']}")

# Schema information
for col in metadata["schema_columns"]:
    print(f"Column: {col['name']} ({col['physical_type']})")

# Row group details
for idx, rg in enumerate(metadata["row_groups"]):
    print(f"\nRow Group {idx}:")
    print(f"  Rows: {rg['num_rows']}")
    print(f"  Size: {rg['total_byte_size']} bytes")

    # Column statistics
    for col in rg["columns"]:
        print(f"  {col['name']}:")
        print(f"    Range: [{col['min']}, {col['max']}]")
        print(f"    Nulls: {col['null_count']}")
        print(f"    Compression: {col['compression_codec']}")

Next Steps