Decoder Limitations
Complete list of limitations and constraints for Rugo's prototype decoder.
Not Production Ready
This decoder is a prototype for educational and testing purposes only.
Supported Configurations
Data Types
Only these physical types are supported:
- INT32 - 32-bit integers
- INT64 - 64-bit integers
- BYTE_ARRAY - For strings
All other types (FLOAT, DOUBLE, BOOLEAN, FIXED_LEN_BYTE_ARRAY, etc.) are not supported.
Encodings
Only PLAIN encoding is supported:
- ✅ PLAIN - Direct value encoding
- ❌ DICTIONARY - Dictionary encoding
- ❌ RLE - Run-length encoding
- ❌ DELTA_BINARY_PACKED - Delta encoding
- ❌ DELTA_LENGTH_BYTE_ARRAY - Delta length encoding
- ❌ DELTA_BYTE_ARRAY - Delta byte array encoding
- ❌ RLE_DICTIONARY - Combined RLE and dictionary
Compression
Only uncompressed data is supported:
- ✅ UNCOMPRESSED - No compression
- ❌ SNAPPY - Snappy compression
- ❌ GZIP - Gzip compression
- ❌ LZO - LZO compression
- ❌ BROTLI - Brotli compression
- ❌ LZ4 - LZ4 compression
- ❌ ZSTD - Zstandard compression
Column Requirements
Columns must be:
- ✅ Required (not nullable)
- ❌ Optional columns with definition levels are not supported
- ❌ Repeated columns are not supported
- ❌ Nested structures are not supported
Row Groups
- ✅ First row group only
- ❌ Multiple row groups not supported
- ❌ Cannot select specific row group
Detailed Constraints
Type Limitations
Integer Types
```text
# Supported
INT32 with logical type INT(32, true/false)
INT64 with logical type INT(64, true/false)

# Not supported
INT96 (deprecated timestamp)
FLOAT
DOUBLE
```
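For context on why the supported integer types are easy: PLAIN-encoded integers are just the raw little-endian bytes laid back to back, so a minimal decoder fits in a few lines of the standard `struct` module. This is an illustrative sketch, not Rugo's actual implementation, and the helper names are made up:

```python
import struct

def decode_plain_int32(buf: bytes, n: int) -> list[int]:
    # PLAIN INT32: n signed 32-bit little-endian values, back to back
    return list(struct.unpack(f"<{n}i", buf[: 4 * n]))

def decode_plain_int64(buf: bytes, n: int) -> list[int]:
    # PLAIN INT64: the same layout with 8-byte values
    return list(struct.unpack(f"<{n}q", buf[: 8 * n]))
```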
String Types
```text
# Supported
BYTE_ARRAY with logical type STRING

# Not supported
BYTE_ARRAY with other logical types (JSON, BSON, etc.)
FIXED_LEN_BYTE_ARRAY
```
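PLAIN-encoded BYTE_ARRAY values are also straightforward: each value is a 4-byte little-endian length prefix followed by that many raw bytes. A sketch of decoding that layout (again an illustrative helper, not part of Rugo's API):

```python
import struct

def decode_plain_byte_array(buf: bytes, n: int) -> list[bytes]:
    # Each PLAIN BYTE_ARRAY value: uint32 little-endian length, then payload
    values, off = [], 0
    for _ in range(n):
        (length,) = struct.unpack_from("<I", buf, off)
        off += 4
        values.append(buf[off : off + length])
        off += length
    return values
```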
Not Supported At All
- Boolean values
- Date and timestamp types
- Decimal types
- UUID
- Time types
- Interval types
- Lists and arrays
- Maps
- Structs
- Enums
Encoding Limitations
Why PLAIN Only?
The decoder only implements PLAIN encoding because:
- Simplicity - PLAIN is the simplest encoding
- Educational - Easy to understand and implement
- Prototype - Not intended for production use
More complex encodings require:
- Dictionary management
- Bit-packing algorithms
- Delta decoding logic
- Additional dependencies
Compression Limitations
Why No Compression?
Supporting compression requires:
- External libraries - Snappy, zlib, zstd, etc.
- Additional complexity - Decompression logic
- More dependencies - Against Rugo's minimal design
- Out of scope - Metadata focus, not full data access
Nullability Limitations
Why No Nulls?
Nullable columns require:
- Definition levels - Additional encoding to track nulls
- Repetition levels - For nested structures
- Complex logic - Level decoding algorithms
- Larger scope - Beyond prototype intentions
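To illustrate the missing piece, here is a sketch of what applying definition levels would look like for a flat optional column (where the maximum definition level is 1). The function name is illustrative; Rugo implements none of this:

```python
def apply_def_levels(def_levels: list[int], values: list, max_def: int = 1) -> list:
    # A level equal to max_def means a value is present; any lower level
    # means null at that slot. Values are stored densely, so we consume
    # one only where the level says it exists.
    out, it = [], iter(values)
    for level in def_levels:
        out.append(next(it) if level == max_def else None)
    return out
```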
Row Group Limitations
Why First Only?
Supporting multiple row groups requires:
- Iteration logic - Loop through all groups
- Memory management - Combining results
- API design - How to return multi-group data
- Performance - Potentially large datasets
Checking Compatibility
Using can_decode()
```python
import rugo.parquet as parquet_meta

# Check if file meets all requirements
if parquet_meta.can_decode("file.parquet"):
    # Safe to decode
    values = parquet_meta.decode_column("file.parquet", "col")
else:
    # Use PyArrow instead
    import pyarrow.parquet as pq
    table = pq.read_table("file.parquet")
```
Manual Checking
```python
metadata = parquet_meta.read_metadata("file.parquet")
rg = metadata["row_groups"][0]

# Check compression
for col in rg["columns"]:
    if col["compression_codec"] != "UNCOMPRESSED":
        print(f"{col['name']} is compressed: {col['compression_codec']}")

# Check encoding
for col in rg["columns"]:
    if "PLAIN" not in col["encodings"]:
        print(f"{col['name']} uses: {col['encodings']}")

# Check types
for col in metadata["schema_columns"]:
    if col["physical_type"] not in ["INT32", "INT64", "BYTE_ARRAY"]:
        print(f"{col['name']} has unsupported type: {col['physical_type']}")
```
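The three loops above can be folded into a single predicate. A sketch using only the metadata keys shown in this document (the function name is illustrative, not part of Rugo, and it only approximates what `can_decode()` checks):

```python
SUPPORTED_TYPES = {"INT32", "INT64", "BYTE_ARRAY"}

def looks_decodable(metadata: dict) -> bool:
    # Every column must use a supported physical type, be uncompressed,
    # and include PLAIN among its encodings.
    for col in metadata["schema_columns"]:
        if col["physical_type"] not in SUPPORTED_TYPES:
            return False
    for col in metadata["row_groups"][0]["columns"]:
        if col["compression_codec"] != "UNCOMPRESSED":
            return False
        if "PLAIN" not in col["encodings"]:
            return False
    return True
```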
Error Messages
Common errors when attempting to decode unsupported files:
- Compression error - Solution: use PyArrow or decompress externally
- Encoding error - Solution: use PyArrow or re-encode as PLAIN
- Type error - Solution: use PyArrow or convert the type
- Multiple row groups error - Solution: use PyArrow or process the first group only
Workarounds
Convert Files for Decoding
```python
import pyarrow.parquet as pq

# Read with PyArrow
table = pq.read_table("compressed.parquet")

# Write uncompressed with PLAIN encoding
pq.write_table(
    table,
    "uncompressed.parquet",
    compression="none",
    use_dictionary=False,
)

# Now decodable with Rugo
import rugo.parquet as parquet_meta
values = parquet_meta.decode_column("uncompressed.parquet", "col")
```
Future Enhancements
Potential improvements (not guaranteed):
- Support for more data types (FLOAT, DOUBLE, BOOLEAN)
- Dictionary encoding support
- Basic compression codecs (SNAPPY, GZIP)
- Nullable columns
- Multiple row groups
- Configurable row group selection
Recommendations
For production use:
- PyArrow - Full Parquet support, actively maintained
- FastParquet - Pure Python option
- DuckDB - SQL queries on Parquet
- Polars - High-performance DataFrame library
Use the Rugo decoder only for:
- Learning Parquet internals
- Quick tests with simple files
- Development and debugging
Next Steps
- Examples - See decoder in action
- API Reference - Function documentation