Data Decoding Overview

Experimental Feature

The data decoder is a prototype with limited capabilities. For production use, use PyArrow or FastParquet.

What is the Decoder?

Rugo includes a basic decoder for reading actual column data from Parquet files. This is an experimental feature designed for:

  • Simple testing scenarios
  • Educational purposes
  • Understanding Parquet internals
  • Quick data inspection

Capabilities

Supported Features

  • Uncompressed data only - codec=UNCOMPRESSED
  • PLAIN encoding - Simple encoding scheme
  • Limited types - int32, int64, string (byte_array)
  • Required columns - No nulls or definition levels

Unsupported Features

  • Compression - SNAPPY, GZIP, ZSTD, etc.
  • Advanced encodings - Dictionary, Delta, RLE_DICTIONARY
  • Other data types - float, boolean, date, timestamp, complex types
  • Nullable columns - Columns with definition levels
  • Multiple row groups - Only first row group supported
  • Nested structures - Lists, maps, structs
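
The two lists above amount to a small set of per-column rules. The sketch below restates them as a plain Python check so the constraints are easy to see in one place. It is illustrative only: `ColumnInfo` and `unsupported_reasons` are hypothetical names, not part of Rugo, and the real `can_decode` may apply different or additional checks.

```python
from dataclasses import dataclass

# Physical types the lists above describe as supported.
SUPPORTED_TYPES = {"int32", "int64", "byte_array"}


@dataclass
class ColumnInfo:
    """Hypothetical summary of one column chunk's metadata (not a Rugo type)."""
    name: str
    physical_type: str   # e.g. "int32", "int64", "byte_array", "double"
    codec: str           # e.g. "UNCOMPRESSED", "SNAPPY"
    encoding: str        # data page encoding, e.g. "PLAIN"
    nullable: bool       # True if the column has definition levels
    nested: bool = False # True for lists, maps, structs


def unsupported_reasons(col: ColumnInfo, num_row_groups: int = 1) -> list[str]:
    """Return the reasons a column falls outside the documented limits."""
    reasons = []
    if col.codec.upper() != "UNCOMPRESSED":
        reasons.append(f"compressed with {col.codec}")
    if col.encoding.upper() != "PLAIN":
        reasons.append(f"encoding {col.encoding} is not PLAIN")
    if col.physical_type.lower() not in SUPPORTED_TYPES:
        reasons.append(f"type {col.physical_type} is not supported")
    if col.nullable:
        reasons.append("column is nullable")
    if col.nested:
        reasons.append("column is nested")
    if num_row_groups > 1:
        reasons.append(f"only the first of {num_row_groups} row groups is read")
    return reasons


# Example: a required, uncompressed, PLAIN-encoded int64 column passes.
print(unsupported_reasons(
    ColumnInfo("id", "int64", "UNCOMPRESSED", "PLAIN", nullable=False)
))
```

An empty list means the column fits every documented limit; anything else lists what would need to change in the file.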

Basic Usage

Check if File Can Be Decoded

import rugo.parquet as parquet_meta

if parquet_meta.can_decode("data.parquet"):
    print("File can be decoded")
else:
    print("File cannot be decoded - unsupported features")

Decode a Column

# Decode specific column
values = parquet_meta.decode_column("data.parquet", "column_name")

# Returns Python list
print(values)  # e.g., [1, 2, 3, 4, 5]
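
Since many real-world files use compression or dictionary encoding, it can help to combine the two calls above with a fallback reader. The following is a minimal sketch of that pattern, not a Rugo feature: `read_column` is a hypothetical helper that takes the check, the decoder, and the fallback as arguments, so the logic can be followed without either library installed. How `decode_column` behaves on unsupported files is not covered here.

```python
from typing import Any, Callable, List


def read_column(
    path: str,
    column: str,
    can_decode: Callable[[str], bool],
    decode_column: Callable[[str, str], List[Any]],
    fallback: Callable[[str, str], List[Any]],
) -> List[Any]:
    """Use the prototype decoder when the file qualifies, else the fallback."""
    if can_decode(path):
        return decode_column(path, column)
    return fallback(path, column)


# Possible wiring, assuming both rugo and pyarrow are installed:
#
#   import rugo.parquet as parquet_meta
#   import pyarrow.parquet as pq
#
#   def pyarrow_column(path, column):
#       # Read only the requested column and convert it to a Python list.
#       return pq.read_table(path, columns=[column]).column(column).to_pylist()
#
#   values = read_column(
#       "data.parquet", "column_name",
#       parquet_meta.can_decode, parquet_meta.decode_column, pyarrow_column,
#   )
```

Passing the callables in keeps the helper independent of either library, which also makes it easy to test with stubs.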

When to Use

Good Use Cases

✅ Quick inspection of simple files
✅ Testing with uncompressed data
✅ Learning Parquet format
✅ Debugging simple issues

Bad Use Cases

❌ Production data pipelines
❌ Compressed files
❌ Complex schemas
❌ Large-scale processing

Alternatives

For production use, consider:

PyArrow

import pyarrow.parquet as pq

# Full-featured Parquet reading
table = pq.read_table("file.parquet")
df = table.to_pandas()

FastParquet

from fastparquet import ParquetFile

pf = ParquetFile("file.parquet")
df = pf.to_pandas()

DuckDB

import duckdb

# Query Parquet files directly
result = duckdb.query("SELECT * FROM 'file.parquet'").df()

Implementation Details

The decoder:

  1. Reads metadata to locate column data
  2. Seeks to data page offset
  3. Reads page header
  4. Decodes PLAIN encoded values
  5. Returns as Python list

This works only for:

  • First row group
  • Uncompressed pages
  • PLAIN encoding
  • Required (non-nullable) columns
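
To make the PLAIN decoding step concrete, here is a minimal sketch of how the three supported types are laid out in an uncompressed page body according to the Parquet format: int32 and int64 are fixed-width little-endian integers, and each byte_array value is a 4-byte little-endian length followed by that many bytes. This is not Rugo's implementation; `decode_plain` is a hypothetical helper, and decoding byte arrays as UTF-8 strings is an assumption. It also assumes the buffer starts at the first value, which holds for required, non-nested columns because no definition or repetition levels are stored.

```python
import struct


def decode_plain(buf: bytes, physical_type: str, num_values: int) -> list:
    """Decode PLAIN-encoded values from an uncompressed data page body."""
    if physical_type == "int32":
        # num_values consecutive 4-byte little-endian signed integers
        return list(struct.unpack_from(f"<{num_values}i", buf, 0))
    if physical_type == "int64":
        # num_values consecutive 8-byte little-endian signed integers
        return list(struct.unpack_from(f"<{num_values}q", buf, 0))
    if physical_type == "byte_array":
        values, pos = [], 0
        for _ in range(num_values):
            # Each value: 4-byte little-endian length, then the raw bytes
            (length,) = struct.unpack_from("<I", buf, pos)
            pos += 4
            end = pos + length
            if end > len(buf):
                raise ValueError("page body is shorter than its length prefix")
            values.append(buf[pos:end].decode("utf-8"))
            pos = end
        return values
    raise ValueError(f"unsupported physical type: {physical_type}")


# Example: build small page bodies by hand and decode them.
ints = struct.pack("<3i", 1, 2, 3)
strs = struct.pack("<I", 2) + b"hi" + struct.pack("<I", 3) + b"abc"
print(decode_plain(ints, "int32", 3))
print(decode_plain(strs, "byte_array", 2))
```

Compression, dictionary pages, and definition levels each add a layer before this step, which is why the prototype restricts itself to files that need none of them.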

Next Steps