Data Decoding Overview
Experimental Feature
The data decoder is a prototype with limited capabilities. For production workloads, use PyArrow or FastParquet instead.
What is the Decoder?
Rugo includes a basic decoder for reading actual column data from Parquet files. This is an experimental feature designed for:
- Simple testing scenarios
- Educational purposes
- Understanding Parquet internals
- Quick data inspection
Capabilities
Supported Features
✅ Uncompressed data only - codec=UNCOMPRESSED
✅ PLAIN encoding - Simple encoding scheme
✅ Limited types - int32, int64, string (byte_array)
✅ Required columns - No nulls or definition levels
Unsupported Features
❌ Compression - SNAPPY, GZIP, ZSTD, etc.
❌ Advanced encodings - Dictionary, Delta, RLE_DICTIONARY
❌ Other data types - float, boolean, date, timestamp, complex types
❌ Nullable columns - Columns with definition levels
❌ Multiple row groups - Only first row group supported
❌ Nested structures - Lists, maps, structs
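The two lists above amount to a checklist that a file must pass before it can be decoded. The sketch below expresses that checklist as a small function. It is not Rugo's implementation: `ColumnInfo` and `unsupported_reasons` are hypothetical names introduced here for illustration, and treating any non-PLAIN entry in `encodings` as unsupported is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Physical types the prototype decoder is documented to handle.
SUPPORTED_TYPES = {"int32", "int64", "byte_array"}


@dataclass
class ColumnInfo:
    """Hypothetical summary of one column chunk's metadata (not a Rugo type)."""
    physical_type: str          # e.g. "int32", "int64", "byte_array"
    codec: str                  # e.g. "UNCOMPRESSED", "SNAPPY"
    encodings: Tuple[str, ...]  # e.g. ("PLAIN",)
    nullable: bool              # True if the column carries definition levels
    nested: bool = False        # True for lists, maps, structs


def unsupported_reasons(col: ColumnInfo) -> List[str]:
    """Return the reasons a column falls outside the supported feature set."""
    reasons = []
    if col.codec.upper() != "UNCOMPRESSED":
        reasons.append(f"compressed with {col.codec}")
    # Simplification: any encoding other than PLAIN is treated as unsupported.
    if any(enc.upper() != "PLAIN" for enc in col.encodings):
        reasons.append("uses a non-PLAIN encoding")
    if col.physical_type.lower() not in SUPPORTED_TYPES:
        reasons.append(f"unsupported type {col.physical_type}")
    if col.nullable:
        reasons.append("nullable column")
    if col.nested:
        reasons.append("nested structure")
    return reasons
```

An empty result means the column passes every check in this sketch; a non-empty result lists each reason it would be rejected.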
Basic Usage
Check if File Can Be Decoded
import rugo.parquet as parquet_meta
if parquet_meta.can_decode("data.parquet"):
    print("File can be decoded")
else:
    print("File cannot be decoded - unsupported features")
Decode a Column
# Decode specific column
values = parquet_meta.decode_column("data.parquet", "column_name")
# Returns Python list
print(values) # e.g., [1, 2, 3, 4, 5]
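The check and the decode call are typically used together. A minimal sketch of that pattern follows, assuming the `can_decode` and `decode_column` functions shown above. The helper names `decode_if_supported` and `pyarrow_fallback` are hypothetical and not part of Rugo; the decoder functions are passed in as arguments so the helper can be exercised without any Parquet library installed.

```python
from typing import Callable, List, Optional


def decode_if_supported(
    path: str,
    column: str,
    can_decode: Callable[[str], bool],
    decode_column: Callable[[str, str], List],
    fallback: Optional[Callable[[str, str], List]] = None,
) -> List:
    """Use the prototype decoder when the file qualifies, else the fallback."""
    if can_decode(path):
        return decode_column(path, column)
    if fallback is None:
        raise ValueError(
            f"{path!r} uses features the prototype decoder does not support"
        )
    return fallback(path, column)


def pyarrow_fallback(path: str, column: str) -> List:
    """Read one column with PyArrow (third-party; imported only when called)."""
    import pyarrow.parquet as pq

    return pq.read_table(path, columns=[column]).column(column).to_pylist()
```

With Rugo installed, a call might look like `decode_if_supported("data.parquet", "column_name", parquet_meta.can_decode, parquet_meta.decode_column, fallback=pyarrow_fallback)`.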
When to Use
Good Use Cases
✅ Quick inspection of simple files
✅ Testing with uncompressed data
✅ Learning Parquet format
✅ Debugging simple issues
Bad Use Cases
❌ Production data pipelines
❌ Compressed files
❌ Complex schemas
❌ Large-scale processing
Alternatives
For production use, consider:
PyArrow
import pyarrow.parquet as pq
# Full-featured Parquet reading
table = pq.read_table("file.parquet")
df = table.to_pandas()
FastParquet
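A minimal sketch of reading a file with fastparquet, using its `ParquetFile` class and `to_pandas` method. The wrapper function name `read_with_fastparquet` is introduced here only for illustration.

```python
def read_with_fastparquet(path: str):
    """Load a Parquet file into a pandas DataFrame using fastparquet."""
    # Imported inside the function so the sketch can be defined
    # even when fastparquet is not installed.
    from fastparquet import ParquetFile

    return ParquetFile(path).to_pandas()
```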
DuckDB
import duckdb
# Query Parquet files directly
result = duckdb.query("SELECT * FROM 'file.parquet'").df()
Implementation Details
The decoder:
- Reads metadata to locate column data
- Seeks to data page offset
- Reads page header
- Decodes PLAIN encoded values
- Returns the values as a Python list
This works only for:
- The first row group
- Uncompressed pages
- PLAIN encoding
- Required (non-nullable) columns
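To make the "decodes PLAIN encoded values" step concrete, here is a sketch of PLAIN decoding for the three supported types. It follows the Parquet format's PLAIN layout (little-endian fixed-width integers; byte arrays prefixed by a 4-byte little-endian length) and is not Rugo's actual code. The function names are hypothetical, and the input is assumed to be the raw values section of an uncompressed data page with no definition or repetition levels. Decoding byte arrays as UTF-8 is also an assumption.

```python
import struct
from typing import List


def decode_plain_int32(buf: bytes, count: int) -> List[int]:
    """Decode `count` little-endian 32-bit signed integers."""
    return list(struct.unpack_from(f"<{count}i", buf, 0))


def decode_plain_int64(buf: bytes, count: int) -> List[int]:
    """Decode `count` little-endian 64-bit signed integers."""
    return list(struct.unpack_from(f"<{count}q", buf, 0))


def decode_plain_byte_array(buf: bytes, count: int) -> List[str]:
    """Decode `count` length-prefixed byte arrays as UTF-8 strings."""
    values = []
    offset = 0
    for _ in range(count):
        # Each value starts with a 4-byte little-endian length.
        (length,) = struct.unpack_from("<I", buf, offset)
        offset += 4
        if offset + length > len(buf):
            raise ValueError("buffer ends before the declared value length")
        values.append(buf[offset:offset + length].decode("utf-8"))
        offset += length
    return values
```

The fixed-width types can be unpacked in a single call because every value has the same size, whereas byte arrays must be walked one at a time since each value's length is only known after reading its prefix.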
Next Steps
- Limitations - Detailed constraints
- Examples - Working code examples