Data Decoding Overview

Experimental Feature

The data decoder is a prototype with limited capabilities. For production workloads, use PyArrow or FastParquet instead.

What is the Decoder?

Rugo includes a basic decoder for reading actual column data from Parquet files. This is an experimental feature designed for:

Simple testing scenarios
Educational purposes
Understanding Parquet internals
Quick data inspection

Capabilities

Supported Features

✅ Uncompressed data only - codec=UNCOMPRESSED
✅ PLAIN encoding - Simple encoding scheme
✅ Limited types - int32, int64, string (byte_array)
✅ Required columns - No nulls or definition levels

Unsupported Features

❌ Compression - SNAPPY, GZIP, ZSTD, etc.
❌ Advanced encodings - Dictionary (PLAIN_DICTIONARY, RLE_DICTIONARY), delta encodings
❌ Other data types - float, boolean, date, timestamp, complex types
❌ Nullable columns - Columns with definition levels
❌ Multiple row groups - Only the first row group is decoded
❌ Nested structures - Lists, maps, structs
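
The constraints above can be summarised as a small checklist. The sketch below is illustrative only and is not how Rugo's can_decode is implemented: ColumnInfo and unsupported_reasons are hypothetical names, and the field values are assumed to come from whatever metadata reader you already use.

```python
from dataclasses import dataclass
from typing import List

# Physical types listed as supported in this document.
SUPPORTED_TYPES = {"int32", "int64", "byte_array"}


@dataclass
class ColumnInfo:
    """Hypothetical summary of one column chunk's metadata."""
    codec: str          # e.g. "UNCOMPRESSED", "SNAPPY"
    encoding: str       # encoding of the data page values, e.g. "PLAIN"
    physical_type: str  # e.g. "int32", "byte_array", "float"
    nullable: bool      # True if the column has definition levels


def unsupported_reasons(column: ColumnInfo, num_row_groups: int) -> List[str]:
    """Return the reasons a column falls outside the documented limits."""
    reasons = []
    if column.codec.upper() != "UNCOMPRESSED":
        reasons.append(f"compressed with {column.codec}")
    if column.encoding.upper() != "PLAIN":
        reasons.append(f"encoding {column.encoding} is not PLAIN")
    if column.physical_type.lower() not in SUPPORTED_TYPES:
        reasons.append(f"type {column.physical_type} is not supported")
    if column.nullable:
        reasons.append("column is nullable")
    if num_row_groups > 1:
        reasons.append(f"{num_row_groups} row groups; only the first is decoded")
    return reasons


ok = ColumnInfo("UNCOMPRESSED", "PLAIN", "int64", nullable=False)
bad = ColumnInfo("SNAPPY", "RLE_DICTIONARY", "float", nullable=True)
print(unsupported_reasons(ok, 1))   # empty list: within the limits
print(unsupported_reasons(bad, 3))  # one entry per violated constraint
```

An empty list means the column fits every limit listed above; it does not guarantee that the decoder will accept the file.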

Basic Usage

Check if File Can Be Decoded

import rugo.parquet as parquet_meta

if parquet_meta.can_decode("data.parquet"):
    print("File can be decoded")
else:
    print("File cannot be decoded - unsupported features")

Decode a Column

import rugo.parquet as parquet_meta

# Decode a specific column by name
values = parquet_meta.decode_column("data.parquet", "column_name")

# Returns a Python list
print(values)  # e.g., [1, 2, 3, 4, 5]
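
The two calls above combine naturally into a "check, then decode, else fall back" pattern. The following is a minimal sketch under stated assumptions: read_column and pyarrow_fallback are hypothetical helper names, and the checker, decoder, and fallback are passed in as callables so the pattern can be tried without either library installed.

```python
from typing import Any, Callable, List


def read_column(
    path: str,
    column: str,
    can_decode: Callable[[str], bool],
    decode_column: Callable[[str, str], List[Any]],
    fallback: Callable[[str, str], List[Any]],
) -> List[Any]:
    """Use the prototype decoder when the file qualifies, else the fallback."""
    if can_decode(path):
        return decode_column(path, column)
    return fallback(path, column)


def pyarrow_fallback(path: str, column: str) -> List[Any]:
    """Read one column with PyArrow and return it as a Python list."""
    import pyarrow.parquet as pq  # imported lazily; only needed on fallback

    return pq.read_table(path, columns=[column]).column(column).to_pylist()


# Wiring it to Rugo (requires rugo, and pyarrow for the fallback path):
#
#   import rugo.parquet as parquet_meta
#   values = read_column(
#       "data.parquet", "column_name",
#       parquet_meta.can_decode, parquet_meta.decode_column,
#       pyarrow_fallback,
#   )
```

Passing the functions in keeps the helper independent of any one reader, so the fallback can be swapped for FastParquet or DuckDB.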

When to Use

Good Use Cases

✅ Quick inspection of simple files
✅ Testing with uncompressed data
✅ Learning Parquet format
✅ Debugging simple issues

Bad Use Cases

❌ Production data pipelines
❌ Compressed files
❌ Complex schemas
❌ Large-scale processing

Alternatives

For production use, consider:

PyArrow

import pyarrow.parquet as pq

# Full-featured Parquet reading
table = pq.read_table("file.parquet")
df = table.to_pandas()

FastParquet

from fastparquet import ParquetFile

pf = ParquetFile("file.parquet")
df = pf.to_pandas()

DuckDB

import duckdb

# Query Parquet files directly
result = duckdb.query("SELECT * FROM 'file.parquet'").df()

Implementation Details

The decoder:

Reads metadata to locate column data
Seeks to data page offset
Reads page header
Decodes PLAIN encoded values
Returns the values as a Python list

This works only for:

First row group
Uncompressed pages
PLAIN encoding
Required (non-nullable) columns
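
To make the "decodes PLAIN encoded values" step concrete, here is a minimal pure-Python sketch of PLAIN decoding for the three supported types. It is not Rugo's implementation: decode_plain is a hypothetical name, the page header is assumed to have been parsed already so that the buffer starts at the first value, and byte arrays are assumed to hold UTF-8 text. In the Parquet format, PLAIN int32 and int64 values are stored little-endian, and each byte_array is a 4-byte little-endian length followed by that many bytes.

```python
import struct
from typing import List, Union


def decode_plain(buf: bytes, physical_type: str, num_values: int) -> List[Union[int, str]]:
    """Decode num_values PLAIN-encoded values from the start of buf."""
    if physical_type == "int32":
        # Fixed-width 4-byte little-endian signed integers.
        return list(struct.unpack_from(f"<{num_values}i", buf, 0))
    if physical_type == "int64":
        # Fixed-width 8-byte little-endian signed integers.
        return list(struct.unpack_from(f"<{num_values}q", buf, 0))
    if physical_type == "byte_array":
        values = []
        pos = 0
        for _ in range(num_values):
            # Each value: 4-byte little-endian length, then the raw bytes.
            (length,) = struct.unpack_from("<I", buf, pos)
            pos += 4
            if pos + length > len(buf):
                raise ValueError("byte_array value runs past end of buffer")
            values.append(buf[pos:pos + length].decode("utf-8"))
            pos += length
        return values
    raise ValueError(f"unsupported physical type: {physical_type}")


ints = struct.pack("<3i", 1, 2, 3)
print(decode_plain(ints, "int32", 3))  # [1, 2, 3]

strings = struct.pack("<I", 2) + b"hi" + struct.pack("<I", 3) + b"abc"
print(decode_plain(strings, "byte_array", 2))  # ['hi', 'abc']
```

Because required columns carry no definition levels, the values follow the page header directly, which is why this simple loop is enough for the supported cases.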

Next Steps

Limitations - Detailed constraints
Examples - Working code examples