Project Structure
Understanding Rugo's codebase organization.
Overview
rugo/
├── rugo/                          # Main package
│   ├── __init__.py                # Package initialization
│   ├── parquet/                   # Parquet metadata reading
│   │   ├── __init__.py
│   │   ├── metadata_reader.pyx    # Cython bindings
│   │   ├── metadata.cpp           # C++ metadata parser
│   │   ├── metadata.hpp
│   │   ├── decode.cpp             # Data decoder (experimental)
│   │   ├── decode.hpp
│   │   └── thrift.hpp             # Thrift structures
│   └── converters/                # Format converters
│       ├── __init__.py
│       └── orso.py                # Orso integration
├── tests/                         # Test suite
│   ├── data/                      # Test data files
│   ├── test_all_metadata_fields.py
│   ├── test_decode.py
│   ├── test_logical_types.py
│   ├── test_orso_converter.py
│   └── test_statistics.py
├── examples/                      # Example scripts
│   ├── comprehensive_metadata.py
│   ├── decode_example.py
│   └── orso_conversion.py
├── Makefile                       # Build automation
├── pyproject.toml                 # Project metadata
├── setup.py                       # Build configuration
├── MANIFEST.in                    # Package manifest
├── README.md                      # Main documentation
├── LICENSE                        # Apache 2.0 license
├── IMPLEMENTATION_SUMMARY.md
└── DECODE_IMPLEMENTATION.md
Core Components
Python Package (rugo/)
__init__.py
Package initialization and version information.
parquet/ Directory
Main Parquet functionality:
- metadata_reader.pyx - Cython interface to C++ code
- metadata.cpp/hpp - C++ metadata parser implementation
- decode.cpp/hpp - Experimental data decoder
- thrift.hpp - Parquet Thrift structure definitions
Converters (rugo/converters/)
orso.py
Orso schema integration.
C++ Layer
metadata.cpp
Core metadata parsing logic:
- Reads Parquet footer
- Parses Thrift structures
- Extracts schema and row group information
- Decodes statistics
Key functions:
PyObject* read_parquet_metadata(const char* filename, ...)
PyObject* read_parquet_metadata_from_bytes(const char* data, ...)
PyObject* read_parquet_metadata_from_memoryview(Py_buffer* view, ...)
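The first step listed above, reading the Parquet footer, can be made concrete in Python. Per the Parquet format, an unencrypted file ends with the Thrift-encoded FileMetaData, then a 4-byte little-endian length of that metadata, then the magic bytes PAR1. The sketch below is not Rugo's C++ code; `locate_footer` is a hypothetical helper name used only for illustration.

```python
import struct

MAGIC = b"PAR1"


def locate_footer(buf: bytes) -> tuple[int, int]:
    """Return (offset, length) of the Thrift-encoded footer inside buf.

    Hypothetical helper: it illustrates the file layout, not Rugo's parser.
    """
    # Smallest possible file: leading magic + 4-byte length + trailing magic.
    if len(buf) < 12:
        raise ValueError("buffer too small to be a Parquet file")
    if buf[:4] != MAGIC or buf[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic bytes")
    # The footer length sits immediately before the trailing magic.
    (footer_len,) = struct.unpack("<I", buf[-8:-4])
    footer_start = len(buf) - 8 - footer_len
    # The footer may not overlap the leading magic.
    if footer_start < 4:
        raise ValueError("footer length exceeds file size")
    return footer_start, footer_len
```

The bytes in `buf[offset:offset + length]` are what a Thrift parser would then decode into schema and row group information.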
decode.cpp
Experimental data decoder:
- Reads data pages
- Decodes PLAIN encoding
- Returns Python lists
Key functions:
bool can_decode_file(const char* filename)
PyObject* decode_column(const char* filename, const char* column_name)
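To show what decoding PLAIN encoding involves, here is a minimal Python sketch based on the Parquet format specification: fixed-width numeric types are stored as consecutive little-endian values, and BYTE_ARRAY values carry a 4-byte little-endian length prefix. This page does not say which physical types Rugo's experimental decoder supports, so the type coverage below is an assumption, `decode_plain` is a hypothetical name, and the sketch ignores compression, nulls, and definition levels.

```python
import struct

# struct format codes for fixed-width Parquet physical types (little-endian).
_FIXED = {"INT32": "i", "INT64": "q", "FLOAT": "f", "DOUBLE": "d"}


def decode_plain(data: bytes, physical_type: str, count: int) -> list:
    """Decode `count` PLAIN-encoded values of one physical type (sketch)."""
    if physical_type in _FIXED:
        fmt = "<" + str(count) + _FIXED[physical_type]
        # unpack_from raises struct.error if the page is too short.
        return list(struct.unpack_from(fmt, data, 0))
    if physical_type == "BYTE_ARRAY":
        values, pos = [], 0
        for _ in range(count):
            # Each value is a 4-byte length followed by that many bytes.
            (length,) = struct.unpack_from("<I", data, pos)
            pos += 4
            if pos + length > len(data):
                raise ValueError("BYTE_ARRAY value runs past end of page")
            values.append(data[pos:pos + length])
            pos += length
        return values
    raise NotImplementedError(physical_type)
```

The result is a plain Python list, mirroring the "Returns Python lists" behavior described above.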
thrift.hpp
Parquet Thrift definitions:
- FileMetaData
- RowGroup
- ColumnMetaData
- SchemaElement
- Statistics
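Parquet serializes these structures with the Thrift compact protocol, in which integers are stored as zigzag-encoded variable-length integers (varints). The Python sketch below shows that decoding step in isolation. It is an illustration of the wire format, not a transcription of thrift.hpp, and the function names are hypothetical.

```python
def read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Read an unsigned LEB128 varint; return (value, next_position)."""
    result, shift = 0, 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        # The low 7 bits carry data; the high bit means "more bytes follow".
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7
        if shift > 63:
            raise ValueError("varint longer than 10 bytes")


def zigzag_decode(n: int) -> int:
    """Map 0, 1, 2, 3, ... back to 0, -1, 1, -2, ..."""
    return (n >> 1) ^ -(n & 1)


def read_i64(buf: bytes, pos: int) -> tuple[int, int]:
    """Read one signed integer field value; return (value, next_position)."""
    raw, pos = read_varint(buf, pos)
    return zigzag_decode(raw), pos
```

Fields such as row counts and byte offsets in FileMetaData and RowGroup are encoded this way, which is why truncation and length checks matter when parsing untrusted files.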
Cython Layer
metadata_reader.pyx
Python-C++ bridge:
def read_metadata(path, schema_only=False, include_statistics=True, max_row_groups=-1)
def read_metadata_from_bytes(data, ...)
def read_metadata_from_memoryview(view, ...)
def can_decode(path)
def decode_column(path, column_name)
Handles:
- Type conversion (C++ ↔ Python)
- Error handling
- Memory management
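The keyword arguments in the `read_metadata` signature above can be combined to limit how much work the parser does. The sketch below wraps two such calls. The wrapper names are hypothetical, the reader is passed in as a parameter so the example does not assume an import path, and nothing is assumed about the shape of the returned object.

```python
from typing import Any, Callable


def read_schema_only(path: str, read_metadata: Callable[..., Any]) -> Any:
    """Call the reader with schema_only=True (hypothetical wrapper)."""
    return read_metadata(path, schema_only=True)


def read_first_row_groups(
    path: str, read_metadata: Callable[..., Any], limit: int = 1
) -> Any:
    """Call the reader with statistics, capped at `limit` row groups."""
    return read_metadata(path, include_statistics=True, max_row_groups=limit)
```

With Rugo installed, the callable would be the `read_metadata` function exposed by the compiled module. The exact import path is not stated on this page, so check the package's README before relying on one.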
Test Suite
Test Organization
- test_all_metadata_fields.py - Comprehensive metadata validation
- test_decode.py - Decoder functionality
- test_logical_types.py - Type system tests
- test_orso_converter.py - Orso integration tests
- test_statistics.py - Statistics accuracy
Test Data
Located in tests/data/:
- Sample Parquet files
- Various encodings and compressions
- Different schemas and types
Build System
Makefile
Build automation targets:
update # Install/update dependencies
compile # Rebuild extension
test # Run tests
lint # Run linters
mypy # Type checking
clean # Remove build artifacts
setup.py
Extension build configuration:
- Cython compilation settings
- C++ compiler flags
- Include paths
- Dependencies
pyproject.toml
Project metadata:
- Package information
- Dependencies
- Build requirements
- Tool configurations
Documentation
In-Repository
- README.md - Main documentation
- IMPLEMENTATION_SUMMARY.md - Implementation details
- DECODE_IMPLEMENTATION.md - Decoder architecture
Examples
Located in examples/:
- comprehensive_metadata.py - Full metadata reading example
- decode_example.py - Data decoding demonstration
- orso_conversion.py - Orso integration example
Dependencies
Runtime
- None - The installed package depends only on the Python standard library
Build Time
- Cython - Python to C compiler
- setuptools - Build system
Development
- pytest - Testing framework
- PyArrow - Test comparisons
- ruff - Linting
- mypy - Type checking
Optional
- orso - Schema conversion (rugo[orso])
Data Flow
Metadata Reading
File Path
↓
Python API (metadata_reader.pyx)
↓
C++ Parser (metadata.cpp)
↓
Thrift Parsing (thrift.hpp)
↓
Dictionary Construction
↓
Return to Python
Data Decoding
File Path + Column Name
↓
Python API (metadata_reader.pyx)
↓
Decoder (decode.cpp)
↓
Read Metadata
↓
Locate Data Pages
↓
Decode PLAIN Values
↓
Return Python List
Extension Points
Adding New Features
- C++ Implementation - Add to metadata.cpp or decode.cpp
- Cython Wrapper - Expose in metadata_reader.pyx
- Python API - Document and test
Adding Converters
- Create a new file in converters/
- Implement conversion functions
- Add optional dependency if needed
- Document in user guide
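For the optional-dependency step, one common pattern is to import the third-party package lazily and fail with an actionable message when it is missing. The sketch below assumes a new converter follows the same extras convention as `rugo[orso]`. `require_optional` is a hypothetical helper, not an existing Rugo function.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency, or explain how to install it.

    Hypothetical helper; `extra` is the name of the pip extra that
    provides `module_name`.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is required for this converter; "
            f"install it with: pip install rugo[{extra}]"
        ) from exc
```

Calling the helper inside the conversion function, rather than at module import time, keeps the base package importable when the extra is not installed.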
Adding Tests
- Create a test file in tests/
- Add test data if needed
- Use pytest fixtures
- Compare with PyArrow when applicable
Code Style
Python
- Follow PEP 8
- Use type hints
- Document with docstrings
- Lint with ruff
C++
- Follow C++17 standards
- Use RAII for memory management
- Handle errors gracefully
- Document complex logic
Cython
- Minimize Python/C++ boundary crossings
- Use typed declarations
- Handle reference counting carefully
- Follow Cython best practices
Build Artifacts
Generated During Build
build/ # Build directory
*.so, *.pyd, *.dylib # Compiled extensions
*.cpp (from .pyx) # Generated C++ from Cython
*.egg-info/ # Package metadata
__pycache__/ # Python bytecode
Source Control
Excluded from git:
- Build artifacts
- Compiled extensions
- Python bytecode
- IDE files
Performance Considerations
Metadata Reading
- C++ parser - Fast Thrift parsing
- Minimal copies - Zero-copy with memoryview
- Lazy loading - schema_only option
- Limited parsing - max_row_groups option
Memory Usage
- No data pages - Metadata only
- Efficient structures - Minimal Python objects
- Streaming - Can use memory-mapped files
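The zero-copy and memory-mapped points above can be combined using only the standard library: a memory-mapped file exposes the buffer protocol, so a `memoryview` over it can be sliced without copying the file into Python memory. The sketch below shows only that preparation step. `mapped_view` is a hypothetical helper, and passing the view on to `read_metadata_from_memoryview` is left out because its remaining parameters are not listed on this page.

```python
import mmap
from contextlib import contextmanager


@contextmanager
def mapped_view(path: str):
    """Yield a read-only memoryview over a memory-mapped file (sketch)."""
    with open(path, "rb") as f, mmap.mmap(
        f.fileno(), 0, access=mmap.ACCESS_READ
    ) as mm:
        view = memoryview(mm)
        try:
            yield view
        finally:
            # Release the view before the mapping closes; a live export
            # would make mmap.close() raise BufferError.
            view.release()
```

Slices taken from the view share the mapped pages, so reading the last few bytes of a large file touches only the pages that hold them.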
Security Considerations
- Input validation in C++ layer
- Safe buffer handling
- No external dependencies at runtime
- Bounds checking on reads
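As an illustration of bounds checking on reads, the Python sketch below validates an offset and size before slicing. It is an assumption about the general technique, not a copy of Rugo's C++ checks, and `read_exact` is a hypothetical name. The comparison is written as `size > len(buf) - offset` rather than `offset + size > len(buf)` because the former cannot overflow when ported to fixed-width integers.

```python
def read_exact(buf: bytes, offset: int, size: int) -> bytes:
    """Return buf[offset:offset + size], refusing out-of-range requests."""
    if offset < 0 or size < 0:
        raise ValueError("negative offset or size")
    # Overflow-safe form of: offset + size > len(buf)
    if offset > len(buf) or size > len(buf) - offset:
        raise ValueError(
            f"read of {size} bytes at offset {offset} "
            f"exceeds buffer of {len(buf)} bytes"
        )
    return buf[offset:offset + size]
```

Without the explicit check, a Python slice would silently return fewer bytes than requested, and the equivalent C++ read would access memory outside the buffer.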
Next Steps
- Contributing - How to contribute
- Building - Build from source