Decoder Limitations

Complete list of limitations and constraints for Rugo's prototype decoder.

Not Production Ready

This decoder is a prototype for educational and testing purposes only.

Supported Configurations

Data Types

Only these physical types are supported:

  • INT32 - 32-bit integers
  • INT64 - 64-bit integers
  • BYTE_ARRAY - Strings (STRING logical type only)

All other types (FLOAT, DOUBLE, BOOLEAN, FIXED_LEN_BYTE_ARRAY, etc.) are not supported.
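
For reference, a PyArrow schema that stays within the supported types might look like the following sketch (the field names are illustrative):

import pyarrow as pa

# int32 -> INT32, int64 -> INT64, string -> BYTE_ARRAY (STRING)
schema = pa.schema([
    pa.field("id", pa.int32(), nullable=False),
    pa.field("count", pa.int64(), nullable=False),
    pa.field("name", pa.string(), nullable=False),
])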

Encodings

Only PLAIN encoding is supported:

  • ✅ PLAIN - Direct value encoding
  • ❌ DICTIONARY - Dictionary encoding
  • ❌ RLE - Run-length encoding
  • ❌ DELTA_BINARY_PACKED - Delta encoding
  • ❌ DELTA_LENGTH_BYTE_ARRAY - Delta length encoding
  • ❌ DELTA_BYTE_ARRAY - Delta byte array encoding
  • ❌ RLE_DICTIONARY - Combined RLE and dictionary

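The encodings a file actually uses can be checked up front with PyArrow's metadata API (a sketch; encodings is a standard attribute of PyArrow's column-chunk metadata, the filename is illustrative):

import pyarrow.parquet as pq

md = pq.ParquetFile("file.parquet").metadata
# encodings is a tuple of strings, e.g. ('PLAIN', 'RLE')
print(md.row_group(0).column(0).encodings)
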
Compression

Only uncompressed data is supported:

  • ✅ UNCOMPRESSED - No compression
  • ❌ SNAPPY - Snappy compression
  • ❌ GZIP - Gzip compression
  • ❌ LZO - LZO compression
  • ❌ BROTLI - Brotli compression
  • ❌ LZ4 - LZ4 compression
  • ❌ ZSTD - Zstandard compression

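The codec can be checked the same way (compression is likewise a standard PyArrow column-chunk attribute):

import pyarrow.parquet as pq

md = pq.ParquetFile("file.parquet").metadata
# compression is a string such as 'UNCOMPRESSED' or 'SNAPPY'
print(md.row_group(0).column(0).compression)
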
Column Requirements

Only flat, required columns are supported:

  • ✅ REQUIRED - Non-nullable columns
  • ❌ OPTIONAL - Columns with definition levels are not supported
  • ❌ REPEATED - Repeated columns are not supported
  • ❌ Nested structures are not supported

Row Groups

  • ✅ First row group only
  • ❌ Multiple row groups are not supported
  • ❌ A specific row group cannot be selected
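
A multi-group file can be detected with the metadata structure shown under Manual Checking below (assuming read_metadata exposes row_groups as a list, as shown there):

import rugo.parquet as parquet_meta

metadata = parquet_meta.read_metadata("file.parquet")
if len(metadata["row_groups"]) > 1:
    print("file has multiple row groups: use PyArrow instead")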

Detailed Constraints

Type Limitations

Integer Types

# Supported
INT32 with logical type INT(32, true/false)
INT64 with logical type INT(64, true/false)

# Not supported
INT96 (deprecated timestamp)
FLOAT
DOUBLE

String Types

# Supported
BYTE_ARRAY with logical type STRING

# Not supported
BYTE_ARRAY with other logical types (JSON, BSON, etc.)
FIXED_LEN_BYTE_ARRAY

Not Supported At All

  • Boolean values
  • Date and timestamp types
  • Decimal types
  • UUID
  • Time types
  • Interval types
  • Lists and arrays
  • Maps
  • Structs
  • Enums
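
Where one of these types appears, one possible workaround is to cast it to a supported type before writing the file back out (a sketch; the column name ts and the timestamp-to-INT64 cast are illustrative):

import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

table = pq.read_table("input.parquet")
idx = table.schema.get_field_index("ts")  # hypothetical timestamp column
# Timestamps cast losslessly to their underlying INT64 representation
table = table.set_column(idx, "ts", pc.cast(table["ts"], pa.int64()))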

Encoding Limitations

Why PLAIN Only?

The decoder only implements PLAIN encoding because:

  1. Simplicity - PLAIN is the simplest encoding
  2. Educational - Easy to understand and implement
  3. Prototype - Not intended for production use

More complex encodings would require:

  • Dictionary management
  • Bit-packing algorithms
  • Delta decoding logic
  • Additional dependencies
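
By contrast, PLAIN decoding of the supported types amounts to reading fixed-width little-endian values, or a 4-byte length prefix followed by raw bytes for BYTE_ARRAY. A minimal sketch of the idea (not Rugo's actual implementation):

import struct

def decode_plain_int32(buf: bytes, count: int) -> list:
    # PLAIN INT32: consecutive 4-byte little-endian values
    return list(struct.unpack(f"<{count}i", buf[:4 * count]))

def decode_plain_byte_array(buf: bytes, count: int) -> list:
    # PLAIN BYTE_ARRAY: 4-byte little-endian length prefix, then raw bytes
    out, pos = [], 0
    for _ in range(count):
        (n,) = struct.unpack_from("<i", buf, pos)
        out.append(buf[pos + 4:pos + 4 + n])
        pos += 4 + n
    return out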

Compression Limitations

Why No Compression?

Supporting compression requires:

  1. External libraries - Snappy, zlib, zstd, etc.
  2. Additional complexity - Decompression logic
  3. More dependencies - Against Rugo's minimal design
  4. Out of scope - Metadata focus, not full data access

Nullability Limitations

Why No Nulls?

Nullable columns require:

  1. Definition levels - Additional encoding to track nulls
  2. Repetition levels - For nested structures
  3. Complex logic - Level decoding algorithms
  4. Larger scope - Beyond prototype intentions
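
To illustrate what definition levels add: for a flat OPTIONAL column the maximum definition level is 1, non-null values are stored densely, and the levels say where the nulls go. A minimal sketch:

def apply_def_levels(levels, values):
    # level 1 -> value present, level 0 -> null (flat OPTIONAL column)
    it = iter(values)
    return [next(it) if level == 1 else None for level in levels]

print(apply_def_levels([1, 0, 1], [10, 20]))  # [10, None, 20]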

Row Group Limitations

Why First Only?

Supporting multiple row groups requires:

  1. Iteration logic - Loop through all groups
  2. Memory management - Combining results
  3. API design - How to return multi-group data
  4. Performance - Potentially large datasets
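
Until then, PyArrow can iterate the groups itself (read_row_group and num_row_groups are standard PyArrow; the filename is illustrative):

import pyarrow as pa
import pyarrow.parquet as pq

pf = pq.ParquetFile("file.parquet")
tables = [pf.read_row_group(i) for i in range(pf.metadata.num_row_groups)]
table = pa.concat_tables(tables)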

Checking Compatibility

Using can_decode()

import rugo.parquet as parquet_meta

# Check if file meets all requirements
if parquet_meta.can_decode("file.parquet"):
    # Safe to decode
    values = parquet_meta.decode_column("file.parquet", "col")
else:
    # Use PyArrow instead
    import pyarrow.parquet as pq
    table = pq.read_table("file.parquet")

Manual Checking

import rugo.parquet as parquet_meta

metadata = parquet_meta.read_metadata("file.parquet")

# Check compression
rg = metadata["row_groups"][0]
for col in rg["columns"]:
    if col["compression_codec"] != "UNCOMPRESSED":
        print(f"{col['name']} is compressed: {col['compression_codec']}")

# Check encoding
for col in rg["columns"]:
    if "PLAIN" not in col["encodings"]:
        print(f"{col['name']} uses: {col['encodings']}")

# Check types
for col in metadata["schema_columns"]:
    if col["physical_type"] not in ["INT32", "INT64", "BYTE_ARRAY"]:
        print(f"{col['name']} has unsupported type: {col['physical_type']}")

Error Messages

Common errors when attempting to decode unsupported files:

Compression Error

Cannot decode: column uses compression codec SNAPPY

Solution: Use PyArrow or decompress externally

Encoding Error

Cannot decode: column uses DICTIONARY encoding

Solution: Use PyArrow or re-encode as PLAIN

Type Error

Cannot decode: unsupported type FLOAT

Solution: Use PyArrow or convert type

Multiple Row Groups Error

Cannot decode: file has multiple row groups

Solution: Use PyArrow or process first group only

Workarounds

Convert Files for Decoding

import pyarrow as pa
import pyarrow.parquet as pq

# Read with PyArrow
table = pq.read_table("compressed.parquet")

# Write uncompressed with PLAIN encoding
pq.write_table(
    table,
    "uncompressed.parquet",
    compression="none",
    use_dictionary=False
)

# Now decodable with Rugo
import rugo.parquet as parquet_meta
values = parquet_meta.decode_column("uncompressed.parquet", "col")

Future Enhancements

Potential improvements (not guaranteed):

  • Support for more data types (FLOAT, DOUBLE, BOOLEAN)
  • Dictionary encoding support
  • Basic compression codecs (SNAPPY, GZIP)
  • Nullable columns
  • Multiple row groups
  • Configurable row group selection

Recommendations

For production use:

  1. PyArrow - Full Parquet support, actively maintained
  2. FastParquet - Pure Python option
  3. DuckDB - SQL queries on Parquet
  4. Polars - High-performance DataFrame library

Use the Rugo decoder only for:

  • Learning Parquet internals
  • Quick tests with simple files
  • Development and debugging
