Orso Integration
Rugo provides optional helpers for converting Parquet metadata to Orso schema format.
Installation
Install Rugo with Orso support:
Overview
Orso is a schema management library that provides a unified schema representation across different data formats. Rugo's Orso integration allows you to:
- Convert Parquet schemas to Orso format
- Extract schema-only information
- Integrate with Orso-based tools
Functions
rugo_to_orso_schema
Convert complete Parquet metadata to an Orso Relation:
from rugo.converters.orso import rugo_to_orso_schema
import rugo.parquet as parquet_meta
# Read Parquet metadata
metadata = parquet_meta.read_metadata("example.parquet")
# Convert to Orso schema
relation = rugo_to_orso_schema(metadata, table_name="example_table")
# Access Orso schema
print(relation.name) # "example_table"
print(relation.schema) # Orso schema object
extract_schema_only
Extract just the schema information in Orso-compatible format:
import rugo.parquet as parquet_meta
from rugo.converters.orso import extract_schema_only
# Read Parquet metadata
metadata = parquet_meta.read_metadata("example.parquet")
schema_info = extract_schema_only(metadata)
# Returns a simplified schema structure that can be iterated column by column
for col in schema_info:
    print(f"{col['name']}: {col['type']}")
Type Mapping
Parquet types are mapped to Orso types:
| Parquet Physical Type | Parquet Logical Type | Orso Type |
|---|---|---|
| BOOLEAN | - | BOOLEAN |
| INT32 | INT(32, true) | INTEGER |
| INT64 | INT(64, true) | BIGINT |
| FLOAT | - | FLOAT |
| DOUBLE | - | DOUBLE |
| BYTE_ARRAY | STRING | VARCHAR |
| BYTE_ARRAY | JSON | STRUCT |
| BYTE_ARRAY | UUID | VARCHAR |
| INT32 | DATE | DATE |
| INT64 | TIMESTAMP | TIMESTAMP |
| FIXED_LEN_BYTE_ARRAY | DECIMAL | DECIMAL |
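The table can be read as a lookup keyed on the pair of physical and logical type. The following is a minimal sketch of that idea, assuming plain string type names and a hypothetical helper called `lookup_orso_type`; it mirrors the rows above and is not Rugo's actual implementation.

```python
from typing import Optional

# Rows of the type-mapping table, keyed by (physical type, logical type).
# None stands for the "-" entries, i.e. no logical type annotation.
PARQUET_TO_ORSO = {
    ("BOOLEAN", None): "BOOLEAN",
    ("INT32", "INT(32, true)"): "INTEGER",
    ("INT64", "INT(64, true)"): "BIGINT",
    ("FLOAT", None): "FLOAT",
    ("DOUBLE", None): "DOUBLE",
    ("BYTE_ARRAY", "STRING"): "VARCHAR",
    ("BYTE_ARRAY", "JSON"): "STRUCT",
    ("BYTE_ARRAY", "UUID"): "VARCHAR",
    ("INT32", "DATE"): "DATE",
    ("INT64", "TIMESTAMP"): "TIMESTAMP",
    ("FIXED_LEN_BYTE_ARRAY", "DECIMAL"): "DECIMAL",
}


def lookup_orso_type(physical: str, logical: Optional[str] = None) -> Optional[str]:
    """Hypothetical helper: return the Orso type name for a table row, or None if unlisted."""
    return PARQUET_TO_ORSO.get((physical, logical))
```

Returning `None` for pairs that are not in the table keeps unmapped types visible to the caller instead of silently guessing a type.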
Examples
Basic Conversion
import rugo.parquet as parquet_meta
from rugo.converters.orso import rugo_to_orso_schema
# Read metadata
metadata = parquet_meta.read_metadata("users.parquet")
# Convert to Orso
relation = rugo_to_orso_schema(metadata, "users")
# Use Orso schema
print(f"Table: {relation.name}")
print(f"Columns: {len(relation.schema.columns)}")
for column in relation.schema.columns:
    print(f" {column.name}: {column.type}")
Schema Extraction
import rugo.parquet as parquet_meta
from rugo.converters.orso import extract_schema_only
metadata = parquet_meta.read_metadata("data.parquet")
schema = extract_schema_only(metadata)
# Schema is a simplified structure
for col in schema:
    print(f"Column: {col['name']}")
    print(f" Type: {col['type']}")
    print(f" Nullable: {col['nullable']}")
Integration with Orso Tools
from rugo.converters.orso import rugo_to_orso_schema
import rugo.parquet as parquet_meta
# Read Parquet metadata
metadata = parquet_meta.read_metadata("dataset.parquet")
# Convert to Orso
relation = rugo_to_orso_schema(metadata, "dataset")
# Use with Orso-based tools
# relation can now be used with any tool that accepts Orso Relations
Complete Example
See the orso_conversion.py example in the Rugo repository:
import rugo.parquet as parquet_meta
from rugo.converters.orso import rugo_to_orso_schema, extract_schema_only
# Read metadata
metadata = parquet_meta.read_metadata("example.parquet")
# Method 1: Full conversion
relation = rugo_to_orso_schema(metadata, "my_table")
print(f"Relation name: {relation.name}")
print(f"Number of columns: {len(relation.schema.columns)}")
# Method 2: Schema extraction
schema_info = extract_schema_only(metadata)
for col in schema_info:
    print(f"{col['name']}: {col['type']} (nullable: {col['nullable']})")
Use Cases
Schema Catalog
Build a schema catalog using Orso format:
from rugo.converters.orso import rugo_to_orso_schema
import rugo.parquet as parquet_meta
def catalog_parquet_file(filepath, table_name):
    """Add Parquet file to schema catalog."""
    metadata = parquet_meta.read_metadata(filepath, schema_only=True)
    relation = rugo_to_orso_schema(metadata, table_name)
    # Store relation in catalog
    return {
        "name": relation.name,
        "schema": relation.schema,
        "source": filepath,
    }
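Entries of that shape still need somewhere to live. As a rough sketch, a catalog could be an in-memory mapping keyed by table name. The `SchemaCatalog` class below is hypothetical and is not part of Rugo or Orso; it only assumes each entry is a dict with `name`, `schema`, and `source` keys, as returned above.

```python
from typing import Any, Dict, List


class SchemaCatalog:
    """Hypothetical in-memory catalog keyed by table name."""

    def __init__(self) -> None:
        self._entries: Dict[str, Dict[str, Any]] = {}

    def register(self, entry: Dict[str, Any]) -> None:
        """Store an entry shaped like {"name": ..., "schema": ..., "source": ...}."""
        name = entry["name"]
        if name in self._entries:
            # Refuse to overwrite silently; the caller decides how to resolve it.
            raise ValueError(f"Table already registered: {name}")
        self._entries[name] = entry

    def get(self, name: str) -> Dict[str, Any]:
        """Return the entry registered under `name` (raises KeyError if absent)."""
        return self._entries[name]

    def tables(self) -> List[str]:
        """Return registered table names in sorted order."""
        return sorted(self._entries)
```

Rejecting duplicate names is a design choice made for this sketch; a real catalog might instead version entries or replace them.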
Schema Comparison
Compare schemas across files:
import rugo.parquet as parquet_meta
from rugo.converters.orso import extract_schema_only
def compare_schemas(file1, file2):
    """Compare schemas of two Parquet files."""
    meta1 = parquet_meta.read_metadata(file1, schema_only=True)
    meta2 = parquet_meta.read_metadata(file2, schema_only=True)
    schema1 = extract_schema_only(meta1)
    schema2 = extract_schema_only(meta2)
    # Compare column names and types; column order is ignored
    cols1 = {c['name']: c['type'] for c in schema1}
    cols2 = {c['name']: c['type'] for c in schema2}
    return cols1 == cols2
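A boolean result says that two schemas differ but not where. The sketch below reports the differences instead. It assumes the extracted schema is a list of dicts with `name` and `type` keys, as in the examples on this page, and that type values can be compared with `!=`; `diff_schemas` is a hypothetical helper, not a Rugo function.

```python
from typing import Dict, List


def diff_schemas(schema1: List[dict], schema2: List[dict]) -> Dict[str, List[str]]:
    """Hypothetical helper: describe how two extracted column lists differ."""
    cols1 = {c["name"]: c["type"] for c in schema1}
    cols2 = {c["name"]: c["type"] for c in schema2}
    shared = cols1.keys() & cols2.keys()
    return {
        # Columns present in one schema but not the other
        "only_in_first": sorted(cols1.keys() - cols2.keys()),
        "only_in_second": sorted(cols2.keys() - cols1.keys()),
        # Columns present in both whose types disagree
        "type_changed": sorted(name for name in shared if cols1[name] != cols2[name]),
    }
```

Because it takes the extracted schemas rather than file paths, the helper can be exercised with hand-written column lists and no Parquet files.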
Type Validation
Validate Parquet schema against expected types:
import rugo.parquet as parquet_meta
from rugo.converters.orso import extract_schema_only
def validate_schema(filepath, expected_types):
    """Validate Parquet file has expected schema."""
    metadata = parquet_meta.read_metadata(filepath, schema_only=True)
    schema = extract_schema_only(metadata)
    # Columns without an entry in expected_types are not checked
    for col in schema:
        expected = expected_types.get(col['name'])
        if expected and col['type'] != expected:
            raise ValueError(
                f"Type mismatch for {col['name']}: "
                f"expected {expected}, got {col['type']}"
            )
    return True
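The function above stops at the first mismatch and does not notice expected columns that are absent from the file. A variant that collects every problem might look like the following sketch. `find_schema_problems` is a hypothetical helper; it assumes the extracted schema is a list of dicts with `name` and `type` keys and that types compare equal to the values in `expected_types`.

```python
from typing import Dict, List


def find_schema_problems(schema: List[dict], expected_types: Dict[str, str]) -> List[str]:
    """Hypothetical helper: list every mismatch instead of stopping at the first."""
    actual = {col["name"]: col["type"] for col in schema}
    problems = []
    for name, expected in expected_types.items():
        if name not in actual:
            # Expected column is absent from the file
            problems.append(f"Missing column: {name}")
        elif actual[name] != expected:
            problems.append(
                f"Type mismatch for {name}: expected {expected}, got {actual[name]}"
            )
    return problems
```

An empty list means the schema satisfies every expectation; columns in the file that are not listed in `expected_types` are ignored.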
Limitations
- Not all Parquet types have direct Orso equivalents
- Complex nested structures may require manual handling
- Custom logical types might not be supported
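One way to surface these cases early is to check extracted column types against the Orso types listed in the Type Mapping table. The sketch below assumes types appear as the string names used in that table; how Rugo actually reports an unmapped or nested type is not specified here, and `columns_needing_review` is a hypothetical helper.

```python
from typing import List

# Orso type names that appear in the Type Mapping table.
MAPPED_ORSO_TYPES = {
    "BOOLEAN", "INTEGER", "BIGINT", "FLOAT", "DOUBLE",
    "VARCHAR", "STRUCT", "DATE", "TIMESTAMP", "DECIMAL",
}


def columns_needing_review(schema: List[dict]) -> List[str]:
    """Hypothetical helper: names of columns whose type is outside the mapped set."""
    return [col["name"] for col in schema if col["type"] not in MAPPED_ORSO_TYPES]
```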
Next Steps
- Orso Documentation - Learn about Orso
- API Reference - Complete function documentation