Orso Integration

Rugo provides optional helpers for converting Parquet metadata to Orso schema format.

Installation

Install Rugo with Orso support:

pip install "rugo[orso]"
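Because the Orso helpers are optional, code that may run without the extra installed can check for them first. The following is a minimal sketch of that pattern; `orso_support_available` is a hypothetical helper name, not part of Rugo, and it only assumes the `rugo.converters.orso` module path used in the examples on this page.

```python
import importlib


def orso_support_available() -> bool:
    """Return True if the optional converter module can be imported."""
    try:
        # Module path taken from the import statements used on this page.
        importlib.import_module("rugo.converters.orso")
    except ImportError:
        # Raised when rugo itself or the Orso extra is not installed.
        return False
    return True


if not orso_support_available():
    print('Orso helpers unavailable; install with: pip install "rugo[orso]"')
```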

Overview

Orso is a schema management library that provides a unified schema representation across different data formats. Rugo's Orso integration allows you to:

  • Convert Parquet schemas to Orso format
  • Extract schema-only information
  • Integrate with Orso-based tools

Functions

rugo_to_orso_schema

Convert complete Parquet metadata to an Orso Relation:

from rugo.converters.orso import rugo_to_orso_schema
import rugo.parquet as parquet_meta

# Read Parquet metadata
metadata = parquet_meta.read_metadata("example.parquet")

# Convert to Orso schema
relation = rugo_to_orso_schema(metadata, table_name="example_table")

# Access Orso schema
print(relation.name)  # "example_table"
print(relation.schema)  # Orso schema object

extract_schema_only

Extract just the schema information in Orso-compatible format:

import rugo.parquet as parquet_meta
from rugo.converters.orso import extract_schema_only

metadata = parquet_meta.read_metadata("example.parquet")
schema_info = extract_schema_only(metadata)

# Returns simplified schema structure
for col in schema_info:
    print(f"{col['name']}: {col['type']}")
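The loop above reads each entry as a mapping with `name` and `type` keys; later examples on this page also read `nullable`. As a sketch under that assumption, the helper below formats such a structure. The sample list is hand-written and stands in for the result of `extract_schema_only(metadata)`, and `describe_columns` is a hypothetical name.

```python
def describe_columns(schema_info):
    """Return one 'name: type' line per column, marking non-nullable columns."""
    lines = []
    for col in schema_info:
        # .get() keeps the helper usable if the 'nullable' key is absent.
        suffix = "" if col.get("nullable", True) else " NOT NULL"
        lines.append(f"{col['name']}: {col['type']}{suffix}")
    return lines


# Hand-written sample standing in for extract_schema_only(metadata).
sample_schema = [
    {"name": "id", "type": "BIGINT", "nullable": False},
    {"name": "email", "type": "VARCHAR", "nullable": True},
]

for line in describe_columns(sample_schema):
    print(line)
```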

Type Mapping

Parquet types are mapped to Orso types:

Parquet Physical Type   Parquet Logical Type   Orso Type
BOOLEAN                 -                      BOOLEAN
INT32                   INT(32, true)          INTEGER
INT64                   INT(64, true)          BIGINT
FLOAT                   -                      FLOAT
DOUBLE                  -                      DOUBLE
BYTE_ARRAY              STRING                 VARCHAR
BYTE_ARRAY              JSON                   STRUCT
BYTE_ARRAY              UUID                   VARCHAR
INT32                   DATE                   DATE
INT64                   TIMESTAMP              TIMESTAMP
FIXED_LEN_BYTE_ARRAY    DECIMAL                DECIMAL
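The mapping table can be read as a lookup keyed on the pair of physical and logical type. The sketch below transcribes the table into a dictionary purely for illustration; it is not Rugo's internal implementation, and `PARQUET_TO_ORSO` and `lookup_orso_type` are hypothetical names. Pairs that do not appear in the table return `None` rather than a guessed type.

```python
# Keys are (physical type, logical type); None means "no logical type" ("-").
PARQUET_TO_ORSO = {
    ("BOOLEAN", None): "BOOLEAN",
    ("INT32", "INT(32, true)"): "INTEGER",
    ("INT64", "INT(64, true)"): "BIGINT",
    ("FLOAT", None): "FLOAT",
    ("DOUBLE", None): "DOUBLE",
    ("BYTE_ARRAY", "STRING"): "VARCHAR",
    ("BYTE_ARRAY", "JSON"): "STRUCT",
    ("BYTE_ARRAY", "UUID"): "VARCHAR",
    ("INT32", "DATE"): "DATE",
    ("INT64", "TIMESTAMP"): "TIMESTAMP",
    ("FIXED_LEN_BYTE_ARRAY", "DECIMAL"): "DECIMAL",
}


def lookup_orso_type(physical, logical=None):
    """Return the Orso type listed in the table, or None if the pair is not listed."""
    return PARQUET_TO_ORSO.get((physical, logical))


print(lookup_orso_type("INT32", "DATE"))
print(lookup_orso_type("BYTE_ARRAY", "ENUM"))
```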

Examples

Basic Conversion

import rugo.parquet as parquet_meta
from rugo.converters.orso import rugo_to_orso_schema

# Read metadata
metadata = parquet_meta.read_metadata("users.parquet")

# Convert to Orso
relation = rugo_to_orso_schema(metadata, "users")

# Use Orso schema
print(f"Table: {relation.name}")
print(f"Columns: {len(relation.schema.columns)}")

for column in relation.schema.columns:
    print(f"  {column.name}: {column.type}")

Schema Extraction

import rugo.parquet as parquet_meta
from rugo.converters.orso import extract_schema_only

metadata = parquet_meta.read_metadata("data.parquet")
schema = extract_schema_only(metadata)

# Schema is a simplified structure
for col in schema:
    print(f"Column: {col['name']}")
    print(f"  Type: {col['type']}")
    print(f"  Nullable: {col['nullable']}")

Integration with Orso Tools

from rugo.converters.orso import rugo_to_orso_schema
import rugo.parquet as parquet_meta

# Read Parquet metadata
metadata = parquet_meta.read_metadata("dataset.parquet")

# Convert to Orso
relation = rugo_to_orso_schema(metadata, "dataset")

# Use with Orso-based tools
# relation can now be used with any tool that accepts Orso Relations

Complete Example

See the orso_conversion.py example in the Rugo repository:

import rugo.parquet as parquet_meta
from rugo.converters.orso import rugo_to_orso_schema, extract_schema_only

# Read metadata
metadata = parquet_meta.read_metadata("example.parquet")

# Method 1: Full conversion
relation = rugo_to_orso_schema(metadata, "my_table")
print(f"Relation name: {relation.name}")
print(f"Number of columns: {len(relation.schema.columns)}")

# Method 2: Schema extraction
schema_info = extract_schema_only(metadata)
for col in schema_info:
    print(f"{col['name']}: {col['type']} (nullable: {col['nullable']})")

Use Cases

Schema Catalog

Build a schema catalog using Orso format:

from rugo.converters.orso import rugo_to_orso_schema
import rugo.parquet as parquet_meta

def catalog_parquet_file(filepath, table_name):
    """Add Parquet file to schema catalog."""
    metadata = parquet_meta.read_metadata(filepath, schema_only=True)
    relation = rugo_to_orso_schema(metadata, table_name)

    # Store relation in catalog
    return {
        "name": relation.name,
        "schema": relation.schema,
        "source": filepath
    }
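If the catalog needs to be written to disk, one option is to store the simplified column list rather than the Orso schema object. The sketch below assumes the list-of-dicts shape that the `extract_schema_only` examples on this page iterate over; `catalog_entry` and `dump_catalog` are hypothetical helpers, and the sample columns are hand-written.

```python
import json


def catalog_entry(table_name, source, schema_info):
    """Build a plain-dict catalog entry from a simplified column list."""
    return {
        "name": table_name,
        "source": source,
        # Copy each column so later edits to the input do not leak in.
        "columns": [dict(col) for col in schema_info],
    }


def dump_catalog(entries):
    """Serialize entries to JSON, keyed by table name."""
    return json.dumps({e["name"]: e for e in entries}, indent=2, sort_keys=True)


# Hand-written sample standing in for extract_schema_only(metadata).
users_columns = [{"name": "id", "type": "BIGINT", "nullable": False}]
catalog_json = dump_catalog([catalog_entry("users", "users.parquet", users_columns)])
print(catalog_json)
```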

Schema Comparison

Compare schemas across files:

import rugo.parquet as parquet_meta
from rugo.converters.orso import extract_schema_only

def compare_schemas(file1, file2):
    """Compare schemas of two Parquet files."""
    meta1 = parquet_meta.read_metadata(file1, schema_only=True)
    meta2 = parquet_meta.read_metadata(file2, schema_only=True)

    schema1 = extract_schema_only(meta1)
    schema2 = extract_schema_only(meta2)

    # Compare columns
    cols1 = {c['name']: c['type'] for c in schema1}
    cols2 = {c['name']: c['type'] for c in schema2}

    return cols1 == cols2
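A boolean result says whether two schemas differ but not where. As a possible extension, the sketch below reports which columns exist on only one side and which changed type. It works on the same simplified structure (dicts with `name` and `type` keys) and uses hand-written samples; `diff_schemas` is a hypothetical name.

```python
def diff_schemas(schema1, schema2):
    """Describe the differences between two simplified schemas."""
    cols1 = {c["name"]: c["type"] for c in schema1}
    cols2 = {c["name"]: c["type"] for c in schema2}
    shared = cols1.keys() & cols2.keys()
    return {
        "only_in_first": sorted(cols1.keys() - cols2.keys()),
        "only_in_second": sorted(cols2.keys() - cols1.keys()),
        # Map each changed column to its (first type, second type) pair.
        "type_changed": {
            name: (cols1[name], cols2[name])
            for name in sorted(shared)
            if cols1[name] != cols2[name]
        },
    }


# Hand-written samples standing in for two extract_schema_only() results.
old = [{"name": "id", "type": "INTEGER"}, {"name": "email", "type": "VARCHAR"}]
new = [{"name": "id", "type": "BIGINT"}, {"name": "created", "type": "TIMESTAMP"}]
print(diff_schemas(old, new))
```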

Type Validation

Validate Parquet schema against expected types:

import rugo.parquet as parquet_meta
from rugo.converters.orso import extract_schema_only

def validate_schema(filepath, expected_types):
    """Validate Parquet file has expected schema."""
    metadata = parquet_meta.read_metadata(filepath, schema_only=True)
    schema = extract_schema_only(metadata)

    for col in schema:
        expected = expected_types.get(col['name'])
        if expected and col['type'] != expected:
            raise ValueError(
                f"Type mismatch for {col['name']}: "
                f"expected {expected}, got {col['type']}"
            )

    return True
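Note that `validate_schema` checks only the columns present in the file, so an expected column that is missing passes silently, and it stops at the first mismatch. A stricter variant might collect every problem instead. The sketch below is one way to do that on the simplified schema structure; `find_schema_problems` is a hypothetical helper and the sample data is hand-written.

```python
def find_schema_problems(schema, expected_types):
    """Return a list of problems; an empty list means the schema matches."""
    actual = {col["name"]: col["type"] for col in schema}
    problems = []
    for name, expected in expected_types.items():
        if name not in actual:
            problems.append(f"missing column {name}")
        elif actual[name] != expected:
            problems.append(
                f"type mismatch for {name}: expected {expected}, got {actual[name]}"
            )
    return problems


# Hand-written sample standing in for extract_schema_only(metadata).
schema = [{"name": "id", "type": "INTEGER"}, {"name": "email", "type": "VARCHAR"}]
expected = {"id": "BIGINT", "email": "VARCHAR", "created": "TIMESTAMP"}
for problem in find_schema_problems(schema, expected):
    print(problem)
```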

Limitations

  • Not all Parquet types have direct Orso equivalents
  • Complex nested structures may require manual handling
  • Custom logical types might not be supported

Next Steps