
Project Structure

This page describes how Rugo's codebase is organized.

Overview

rugo/
├── rugo/                      # Main package
│   ├── __init__.py           # Package initialization
│   ├── parquet/              # Parquet metadata reading
│   │   ├── __init__.py
│   │   ├── metadata_reader.pyx   # Cython bindings
│   │   ├── metadata.cpp          # C++ metadata parser
│   │   ├── metadata.hpp
│   │   ├── decode.cpp            # Data decoder (experimental)
│   │   ├── decode.hpp
│   │   └── thrift.hpp            # Thrift structures
│   └── converters/           # Format converters
│       ├── __init__.py
│       └── orso.py           # Orso integration
├── tests/                    # Test suite
│   ├── data/                 # Test data files
│   ├── test_all_metadata_fields.py
│   ├── test_decode.py
│   ├── test_logical_types.py
│   ├── test_orso_converter.py
│   └── test_statistics.py
├── examples/                 # Example scripts
│   ├── comprehensive_metadata.py
│   ├── decode_example.py
│   └── orso_conversion.py
├── Makefile                  # Build automation
├── pyproject.toml            # Project metadata
├── setup.py                  # Build configuration
├── MANIFEST.in               # Package manifest
├── README.md                 # Main documentation
├── LICENSE                   # Apache 2.0 license
├── IMPLEMENTATION_SUMMARY.md
└── DECODE_IMPLEMENTATION.md

Core Components

Python Package (rugo/)

__init__.py

Package initialization and version information:

__version__ = "0.1.0"

parquet/ Directory

Main Parquet functionality:

  • metadata_reader.pyx - Cython interface to C++ code
  • metadata.cpp/hpp - C++ metadata parser implementation
  • decode.cpp/hpp - Experimental data decoder
  • thrift.hpp - Parquet Thrift structure definitions

Converters (rugo/converters/)

orso.py

Orso schema integration:

def rugo_to_orso_schema(metadata, table_name)
def extract_schema_only(metadata)

C++ Layer

metadata.cpp

Core metadata parsing logic:

  • Reads Parquet footer
  • Parses Thrift structures
  • Extracts schema and row group information
  • Decodes statistics

Key functions:

PyObject* read_parquet_metadata(const char* filename, ...)
PyObject* read_parquet_metadata_from_bytes(const char* data, ...)
PyObject* read_parquet_metadata_from_memoryview(Py_buffer* view, ...)
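
The "Reads Parquet footer" step can be illustrated in Python. This is a minimal sketch of the standard Parquet file layout (leading and trailing `PAR1` magic, with a 4-byte little-endian footer length just before the trailing magic), not a transcription of metadata.cpp. The function name `locate_footer` is hypothetical, and encrypted files (which use a different magic) are not handled.

```python
import struct

MAGIC = b"PAR1"  # magic bytes at the start and end of an unencrypted Parquet file


def locate_footer(buf: bytes) -> tuple[int, int]:
    """Return (offset, length) of the Thrift-encoded footer inside buf.

    Hypothetical helper for illustration; it is not part of rugo's API.
    """
    # Smallest possible file: leading magic + 4-byte length + trailing magic.
    if len(buf) < 12:
        raise ValueError("buffer too small to be a Parquet file")
    if buf[:4] != MAGIC or buf[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic bytes")
    # The footer length is a little-endian uint32 just before the trailing magic.
    (footer_len,) = struct.unpack("<I", buf[-8:-4])
    footer_start = len(buf) - 8 - footer_len
    # Bounds check: the footer must not overlap the leading magic.
    if footer_start < 4:
        raise ValueError("footer length exceeds file size")
    return footer_start, footer_len
```

The slice `buf[start:start + length]` is then what a Thrift parser would consume. Every length read from the file is validated before it is used as an offset, which is the kind of check the C++ layer needs as well.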

decode.cpp

Experimental data decoder:

  • Reads data pages
  • Decodes PLAIN encoding
  • Returns Python lists

Key functions:

bool can_decode_file(const char* filename)
PyObject* decode_column(const char* filename, const char* column_name)
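
To make "Decodes PLAIN encoding" concrete, here is a minimal Python sketch of how PLAIN-encoded values are laid out according to the Parquet format: fixed-width little-endian integers, and byte arrays prefixed with a 4-byte little-endian length. The function names are hypothetical and the sketch covers only the value section of a page; page headers, compression, and repetition/definition levels are assumed to have been handled already.

```python
import struct


def decode_plain_int32(data: bytes, count: int) -> list[int]:
    """Decode `count` PLAIN-encoded INT32 values (4 bytes each, little-endian)."""
    needed = 4 * count
    if len(data) < needed:
        raise ValueError("not enough bytes for the requested value count")
    return list(struct.unpack(f"<{count}i", data[:needed]))


def decode_plain_byte_array(data: bytes, count: int) -> list[bytes]:
    """Decode `count` PLAIN-encoded BYTE_ARRAY values (length prefix + payload)."""
    values = []
    pos = 0
    for _ in range(count):
        if pos + 4 > len(data):
            raise ValueError("truncated length prefix")
        (length,) = struct.unpack_from("<I", data, pos)
        pos += 4
        if pos + length > len(data):
            raise ValueError("truncated value payload")
        values.append(data[pos:pos + length])
        pos += length
    return values
```

Both functions return plain Python lists, matching the "Returns Python lists" behaviour described above.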

thrift.hpp

Parquet Thrift definitions:

  • FileMetaData
  • RowGroup
  • ColumnMetaData
  • SchemaElement
  • Statistics
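
Parquet serializes these structures with Thrift's compact protocol, in which integers are stored as zigzag-encoded variable-length integers. The following Python sketch shows those two primitives. It illustrates the wire format only; how thrift.hpp implements them is not shown in this page, and the function names are hypothetical.

```python
def read_varint(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Read an unsigned LEB128 varint from buf at pos.

    Returns (value, next_position). Hypothetical helper for illustration.
    """
    result = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        # The low 7 bits carry data; the high bit signals continuation.
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def zigzag_decode(n: int) -> int:
    """Map an unsigned zigzag value back to a signed integer."""
    return (n >> 1) ^ -(n & 1)
```

A signed field is read by chaining the two: `value, pos = read_varint(buf, pos)` followed by `zigzag_decode(value)`.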

Cython Layer

metadata_reader.pyx

Python-C++ bridge:

def read_metadata(path, schema_only=False, include_statistics=True, max_row_groups=-1)
def read_metadata_from_bytes(data, ...)
def read_metadata_from_memoryview(view, ...)
def can_decode(path)
def decode_column(path, column_name)

Handles:

  • Type conversion (C++ ↔ Python)
  • Error handling
  • Memory management
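
As a usage sketch, the keyword arguments of `read_metadata` listed above can be combined to limit how much of the footer is processed. The wrapper below is hypothetical, and the reader is passed in as a parameter so the sketch does not depend on a particular import path; `from rugo.parquet import read_metadata` is an assumption about where the function is exposed. The exact meaning of each option value (for example `-1` for `max_row_groups`) should be checked against the API reference.

```python
from typing import Any, Callable


def read_light_metadata(
    path: str,
    read_metadata: Callable[..., Any],
    max_row_groups: int = 1,
) -> Any:
    """Read metadata without statistics and for a limited number of row groups.

    Hypothetical wrapper: `read_metadata` is expected to accept the keyword
    arguments documented for metadata_reader.pyx.
    """
    return read_metadata(
        path,
        schema_only=False,
        include_statistics=False,
        max_row_groups=max_row_groups,
    )


# Assumed usage (import path not confirmed by this page):
# from rugo.parquet import read_metadata
# metadata = read_light_metadata("example.parquet", read_metadata)
```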

Test Suite

Test Organization

  • test_all_metadata_fields.py - Comprehensive metadata validation
  • test_decode.py - Decoder functionality
  • test_logical_types.py - Type system tests
  • test_orso_converter.py - Orso integration tests
  • test_statistics.py - Statistics accuracy

Test Data

Located in tests/data/:

  • Sample Parquet files
  • Various encodings and compressions
  • Different schemas and types

Build System

Makefile

Build automation targets:

update      # Install/update dependencies
compile     # Rebuild extension
test        # Run tests
lint        # Run linters
mypy        # Type checking
clean       # Remove build artifacts

setup.py

Extension build configuration:

  • Cython compilation settings
  • C++ compiler flags
  • Include paths
  • Dependencies

pyproject.toml

Project metadata:

  • Package information
  • Dependencies
  • Build requirements
  • Tool configurations

Documentation

In-Repository

  • README.md - Main documentation
  • IMPLEMENTATION_SUMMARY.md - Implementation details
  • DECODE_IMPLEMENTATION.md - Decoder architecture

Examples

Located in examples/:

  • comprehensive_metadata.py - Full metadata reading example
  • decode_example.py - Data decoding demonstration
  • orso_conversion.py - Orso integration example

Dependencies

Runtime

  • None - The installed package needs only the Python standard library

Build Time

  • Cython - Compiles the .pyx bindings to C++ extension code
  • setuptools - Build system

Development

  • pytest - Testing framework
  • PyArrow - Test comparisons
  • ruff - Linting
  • mypy - Type checking

Optional

  • orso - Schema conversion (rugo[orso])

Data Flow

Metadata Reading

  1. File Path
  2. Python API (metadata_reader.pyx)
  3. C++ Parser (metadata.cpp)
  4. Thrift Parsing (thrift.hpp)
  5. Dictionary Construction
  6. Return to Python

Data Decoding

  1. File Path + Column Name
  2. Python API (metadata_reader.pyx)
  3. Decoder (decode.cpp)
  4. Read Metadata
  5. Locate Data Pages
  6. Decode PLAIN Values
  7. Return Python List

Extension Points

Adding New Features

  1. C++ Implementation - Add to metadata.cpp or decode.cpp
  2. Cython Wrapper - Expose in metadata_reader.pyx
  3. Python API - Document and test

Adding Converters

  1. Create new file in converters/
  2. Implement conversion functions
  3. Add optional dependency if needed
  4. Document in user guide
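
For step 3, one common pattern is to import the optional dependency lazily and fail with an actionable message. The helper below is a sketch of that pattern; `require_optional` is a hypothetical name, and the existing orso converter may handle its import differently.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency, or explain how to install it.

    Hypothetical helper; `extra` is the name of the pip extra, as in rugo[orso].
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this converter; "
            f"install it with: pip install rugo[{extra}]"
        ) from exc
```

A converter would call `require_optional("orso", "orso")` inside its conversion function rather than at module import time, so that importing the converters package never fails when the extra is not installed.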

Adding Tests

  1. Create test file in tests/
  2. Add test data if needed
  3. Use pytest fixtures
  4. Compare with PyArrow when applicable

Code Style

Python

  • Follow PEP 8
  • Use type hints
  • Document with docstrings
  • Lint with ruff

C++

  • Target the C++17 standard
  • Use RAII for memory management
  • Handle errors gracefully
  • Document complex logic

Cython

  • Minimize Python/C++ boundary crossings
  • Use typed declarations
  • Handle reference counting carefully
  • Follow Cython best practices

Build Artifacts

Generated During Build

build/                          # Build directory
*.so, *.pyd, *.dylib           # Compiled extensions
*.cpp (from .pyx)              # Generated C++ from Cython
*.egg-info/                    # Package metadata
__pycache__/                   # Python bytecode

Source Control

Excluded from git:

  • Build artifacts
  • Compiled extensions
  • Python bytecode
  • IDE files

Performance Considerations

Metadata Reading

  • C++ parser - Fast Thrift parsing
  • Minimal copies - Zero-copy with memoryview
  • Lazy loading - schema_only option
  • Limited parsing - max_row_groups option

Memory Usage

  • No data pages - Metadata only
  • Efficient structures - Minimal Python objects
  • Streaming - Can use memory-mapped files
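
The memory-mapped, zero-copy path can be sketched with the standard library alone. The context manager below (a hypothetical helper, not part of rugo) maps a file read-only and exposes it as a `memoryview`, which is the kind of object `read_metadata_from_memoryview` is documented to accept.

```python
import mmap
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def mapped_view(path: str) -> Iterator[memoryview]:
    """Yield a read-only memoryview over a memory-mapped file.

    Hypothetical helper for illustration. The file must not be empty,
    because mmap cannot map a zero-length file.
    """
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        view = memoryview(mm)
        try:
            yield view
        finally:
            # Release the view before the mmap closes; closing a mapping
            # that still has exported buffers raises BufferError.
            view.release()
```

Inside the `with` block the view can be sliced without copying file contents; any slices kept by the caller must also be released or dropped before the block exits.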

Security Considerations

  • Input validation in C++ layer
  • Safe buffer handling
  • No external dependencies at runtime
  • Bounds checking on reads

Next Steps