Decoder Limitations

Complete list of limitations and constraints for Rugo's prototype decoder.

Not Production Ready

This decoder is a prototype for educational and testing purposes only.

Supported Configurations

Data Types

Only these physical types are supported:

  • INT32 - 32-bit integers
  • INT64 - 64-bit integers
  • BYTE_ARRAY - Strings (STRING logical type only)

All other types (FLOAT, DOUBLE, BOOLEAN, FIXED_LEN_BYTE_ARRAY, etc.) are not supported.
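
For reference, a PyArrow schema that stays within the supported types might look like the following sketch (the field names are illustrative):

import pyarrow as pa

# int32 -> INT32, int64 -> INT64, string -> BYTE_ARRAY (STRING)
schema = pa.schema([
    pa.field("id", pa.int32(), nullable=False),
    pa.field("count", pa.int64(), nullable=False),
    pa.field("name", pa.string(), nullable=False),
])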

Encodings

Only PLAIN encoding is supported:

  • ✅ PLAIN - Direct value encoding
  • ❌ DICTIONARY - Dictionary encoding
  • ❌ RLE - Run-length encoding
  • ❌ DELTA_BINARY_PACKED - Delta encoding
  • ❌ DELTA_LENGTH_BYTE_ARRAY - Delta length encoding
  • ❌ DELTA_BYTE_ARRAY - Delta byte array encoding
  • ❌ RLE_DICTIONARY - Combined RLE and dictionary

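The encodings a file actually uses can be checked up front with PyArrow's metadata API (a sketch; encodings is a standard attribute of PyArrow's column-chunk metadata, the filename is illustrative):

import pyarrow.parquet as pq

md = pq.ParquetFile("file.parquet").metadata
# encodings is a tuple of strings, e.g. ('PLAIN', 'RLE')
print(md.row_group(0).column(0).encodings)
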
Compression

Only uncompressed data is supported:

  • ✅ UNCOMPRESSED - No compression
  • ❌ SNAPPY - Snappy compression
  • ❌ GZIP - Gzip compression
  • ❌ LZO - LZO compression
  • ❌ BROTLI - Brotli compression
  • ❌ LZ4 - LZ4 compression
  • ❌ ZSTD - Zstandard compression

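The codec can be checked the same way (compression is likewise a standard PyArrow column-chunk attribute):

import pyarrow.parquet as pq

md = pq.ParquetFile("file.parquet").metadata
# compression is a string such as 'UNCOMPRESSED' or 'SNAPPY'
print(md.row_group(0).column(0).compression)
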
Column Requirements

Only flat, required columns are supported:

  • ✅ REQUIRED - Non-nullable columns
  • ❌ OPTIONAL - Columns with definition levels are not supported
  • ❌ REPEATED - Repeated columns are not supported
  • ❌ Nested structures are not supported

Row Groups

  • ✅ First row group only
  • ❌ Multiple row groups are not supported
  • ❌ A specific row group cannot be selected
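
A multi-group file can be detected with the metadata structure shown under Manual Checking below (assuming read_metadata exposes row_groups as a list, as shown there):

import rugo.parquet as parquet_meta

metadata = parquet_meta.read_metadata("file.parquet")
if len(metadata["row_groups"]) > 1:
    print("file has multiple row groups: use PyArrow instead")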

Detailed Constraints

Type Limitations

Integer Types

# Supported
INT32 with logical type INT(32, true/false)
INT64 with logical type INT(64, true/false)

# Not supported
INT96 (deprecated timestamp)
FLOAT
DOUBLE

String Types

# Supported
BYTE_ARRAY with logical type STRING

# Not supported
BYTE_ARRAY with other logical types (JSON, BSON, etc.)
FIXED_LEN_BYTE_ARRAY

Not Supported At All

  • Boolean values
  • Date and timestamp types
  • Decimal types
  • UUID
  • Time types
  • Interval types
  • Lists and arrays
  • Maps
  • Structs
  • Enums
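
Where one of these types appears, one possible workaround is to cast it to a supported type before writing the file back out (a sketch; the column name ts and the timestamp-to-INT64 cast are illustrative):

import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

table = pq.read_table("input.parquet")
idx = table.schema.get_field_index("ts")  # hypothetical timestamp column
# Timestamps cast losslessly to their underlying INT64 representation
table = table.set_column(idx, "ts", pc.cast(table["ts"], pa.int64()))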

Encoding Limitations

Why PLAIN Only?

The decoder only implements PLAIN encoding because:

  1. Simplicity - PLAIN is the simplest encoding
  2. Educational - Easy to understand and implement
  3. Prototype - Not intended for production use

More complex encodings would require:

  • Dictionary management
  • Bit-packing algorithms
  • Delta decoding logic
  • Additional dependencies
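
By contrast, PLAIN decoding of the supported types amounts to reading fixed-width little-endian values, or a 4-byte length prefix followed by raw bytes for BYTE_ARRAY. A minimal sketch of the idea (not Rugo's actual implementation):

import struct

def decode_plain_int32(buf: bytes, count: int) -> list:
    # PLAIN INT32: consecutive 4-byte little-endian values
    return list(struct.unpack(f"<{count}i", buf[:4 * count]))

def decode_plain_byte_array(buf: bytes, count: int) -> list:
    # PLAIN BYTE_ARRAY: 4-byte little-endian length prefix, then raw bytes
    out, pos = [], 0
    for _ in range(count):
        (n,) = struct.unpack_from("<i", buf, pos)
        out.append(buf[pos + 4:pos + 4 + n])
        pos += 4 + n
    return out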

Compression Limitations

Why No Compression?

Supporting compression requires:

  1. External libraries - Snappy, zlib, zstd, etc.
  2. Additional complexity - Decompression logic
  3. More dependencies - Against Rugo's minimal design
  4. Out of scope - Metadata focus, not full data access

Nullability Limitations

Why No Nulls?

Nullable columns require:

  1. Definition levels - Additional encoding to track nulls
  2. Repetition levels - For nested structures
  3. Complex logic - Level decoding algorithms
  4. Larger scope - Beyond prototype intentions
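
To illustrate what definition levels add: for a flat OPTIONAL column the maximum definition level is 1, non-null values are stored densely, and the levels say where the nulls go. A minimal sketch:

def apply_def_levels(levels, values):
    # level 1 -> value present, level 0 -> null (flat OPTIONAL column)
    it = iter(values)
    return [next(it) if level == 1 else None for level in levels]

print(apply_def_levels([1, 0, 1], [10, 20]))  # [10, None, 20]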

Row Group Limitations

Why First Only?

Supporting multiple row groups requires:

  1. Iteration logic - Loop through all groups
  2. Memory management - Combining results
  3. API design - How to return multi-group data
  4. Performance - Potentially large datasets
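
Until then, PyArrow can iterate the groups itself (read_row_group and num_row_groups are standard PyArrow; the filename is illustrative):

import pyarrow as pa
import pyarrow.parquet as pq

pf = pq.ParquetFile("file.parquet")
tables = [pf.read_row_group(i) for i in range(pf.metadata.num_row_groups)]
table = pa.concat_tables(tables)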

Checking Compatibility

Using can_decode()

import rugo.parquet as parquet_meta

# Check if file meets all requirements
if parquet_meta.can_decode("file.parquet"):
    # Safe to decode
    values = parquet_meta.decode_column("file.parquet", "col")
else:
    # Use PyArrow instead
    import pyarrow.parquet as pq
    table = pq.read_table("file.parquet")

Manual Checking

import rugo.parquet as parquet_meta

metadata = parquet_meta.read_metadata("file.parquet")

# Check compression
rg = metadata["row_groups"][0]
for col in rg["columns"]:
    if col["compression_codec"] != "UNCOMPRESSED":
        print(f"{col['name']} is compressed: {col['compression_codec']}")

# Check encoding
for col in rg["columns"]:
    if "PLAIN" not in col["encodings"]:
        print(f"{col['name']} uses: {col['encodings']}")

# Check types
for col in metadata["schema_columns"]:
    if col["physical_type"] not in ["INT32", "INT64", "BYTE_ARRAY"]:
        print(f"{col['name']} has unsupported type: {col['physical_type']}")

Error Messages

Common errors when attempting to decode unsupported files:

Compression Error

Cannot decode: column uses compression codec SNAPPY

Solution: Use PyArrow or decompress externally

Encoding Error

Cannot decode: column uses DICTIONARY encoding

Solution: Use PyArrow or re-encode as PLAIN

Type Error

Cannot decode: unsupported type FLOAT

Solution: Use PyArrow or convert type

Multiple Row Groups Error

Cannot decode: file has multiple row groups

Solution: Use PyArrow or process first group only

Workarounds

Convert Files for Decoding

import pyarrow as pa
import pyarrow.parquet as pq

# Read with PyArrow
table = pq.read_table("compressed.parquet")

# Write uncompressed with PLAIN encoding
pq.write_table(
    table,
    "uncompressed.parquet",
    compression="none",
    use_dictionary=False
)

# Now decodable with Rugo
import rugo.parquet as parquet_meta
values = parquet_meta.decode_column("uncompressed.parquet", "col")

Future Enhancements

Potential improvements (not guaranteed):

  • Support for more data types (FLOAT, DOUBLE, BOOLEAN)
  • Dictionary encoding support
  • Basic compression codecs (SNAPPY, GZIP)
  • Nullable columns
  • Multiple row groups
  • Configurable row group selection

Recommendations

For production use:

  1. PyArrow - Full Parquet support, actively maintained
  2. FastParquet - Pure Python option
  3. DuckDB - SQL queries on Parquet
  4. Polars - High-performance DataFrame library

Use the Rugo decoder only for:

  • Learning Parquet internals
  • Quick tests with simple files
  • Development and debugging
