Working with In-Memory Data

Learn how to read Parquet metadata from data already loaded in memory.

Overview

Rugo provides two functions for working with in-memory data:

  • read_metadata_from_bytes() - parses metadata from a bytes object
  • read_metadata_from_memoryview() - parses metadata from a memoryview without copying the data

Both functions accept the same options as read_metadata().

Reading from Bytes

Basic Usage

import rugo.parquet as parquet_meta

# Load file into memory
with open("example.parquet", "rb") as f:
    data = f.read()

# Parse metadata
metadata = parquet_meta.read_metadata_from_bytes(data)

Use Cases

Network Sources

import urllib.request

# Download Parquet file
url = "https://example.com/data.parquet"
with urllib.request.urlopen(url) as response:
    data = response.read()

# Parse metadata
metadata = parquet_meta.read_metadata_from_bytes(data)

Cloud Storage

# AWS S3
import boto3

s3 = boto3.client('s3')
response = s3.get_object(Bucket='my-bucket', Key='data.parquet')
data = response['Body'].read()

metadata = parquet_meta.read_metadata_from_bytes(data)

# Google Cloud Storage
from google.cloud import storage

client = storage.Client()
bucket = client.bucket('my-bucket')
blob = bucket.blob('data.parquet')
data = blob.download_as_bytes()

metadata = parquet_meta.read_metadata_from_bytes(data)

In-Memory Testing

# Generate test data with PyArrow
import pyarrow as pa
import pyarrow.parquet as pq
import io

# Create test table
table = pa.table({'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']})

# Write to bytes
buf = io.BytesIO()
pq.write_table(table, buf)
data = buf.getvalue()

# Read metadata
metadata = parquet_meta.read_metadata_from_bytes(data)
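
As a quick sanity check, the parsed row count can be compared with the table PyArrow just wrote (num_rows is the same field used in the examples later on this page):

# The parsed metadata should agree with the source table
assert metadata['num_rows'] == table.num_rows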

Reading from Memoryview

Zero-Copy Parsing

For maximum efficiency, use a memoryview to avoid copying data:

# Load data
with open("example.parquet", "rb") as f:
    data = f.read()

# Parse with zero-copy
metadata = parquet_meta.read_metadata_from_memoryview(memoryview(data))

Memory-Mapped Files

Memory mapping is ideal for large files, since the OS pages data in on demand:

import mmap

with open("large_file.parquet", "rb") as f:
    # Memory map the file
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Parse without loading the whole file into memory
        metadata = parquet_meta.read_metadata_from_memoryview(memoryview(mm))

Shared Memory

Work with data from multiprocessing:

from multiprocessing import shared_memory

# Assume 'shm' is a shared memory block with Parquet data
# shm = shared_memory.SharedMemory(name='parquet_data')

# Parse from shared memory
metadata = parquet_meta.read_metadata_from_memoryview(
    memoryview(shm.buf)
)
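
For completeness, a minimal sketch of the producer side that publishes Parquet bytes into a named block (the block name 'parquet_data' is illustrative). The OS may round the block up to a whole page, so slice the view down to the real data length:

from multiprocessing import shared_memory

# Load the Parquet bytes to publish
with open("example.parquet", "rb") as f:
    data = f.read()

# Producer: copy the bytes into a named shared memory block
shm = shared_memory.SharedMemory(name='parquet_data', create=True, size=len(data))
shm.buf[:len(data)] = data

# Slice to the actual length; the block may be larger than the data
view = shm.buf[:len(data)]
metadata = parquet_meta.read_metadata_from_memoryview(view)

# Release the view before closing, then remove the block
view.release()
shm.close()
shm.unlink()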

Comparison: Bytes vs Memoryview

When to Use Bytes

✅ Data is already a bytes object
✅ Working with network/API responses
✅ Simple use cases
✅ Data size is manageable

When to Use Memoryview

✅ Large files (> 100 MB)
✅ Memory-mapped files
✅ Shared memory scenarios
✅ Need zero-copy access
✅ Performance critical paths
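
These guidelines can be folded into a small dispatch helper; a minimal sketch (the function name is illustrative):

def read_inmemory_metadata(buf):
    """Send bytes down the bytes path; everything else goes zero-copy."""
    if isinstance(buf, bytes):
        return parquet_meta.read_metadata_from_bytes(buf)
    # bytearray, mmap, shared memory, and other buffer-protocol
    # objects can be wrapped in a memoryview without a copy
    return parquet_meta.read_metadata_from_memoryview(memoryview(buf))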

Performance Considerations

Memory Usage

import sys

# bytes: copies data
with open("file.parquet", "rb") as f:
    data_bytes = f.read()
print(f"Bytes size: {sys.getsizeof(data_bytes)} bytes")

# memoryview: references data (no copy)
data_view = memoryview(data_bytes)
print(f"Memoryview size: {sys.getsizeof(data_view)} bytes")  # Much smaller!

Speed Comparison

For a 100 MB file:

Method                           Time   Memory
read_metadata()                  10 ms  Low
read_metadata_from_bytes()       10 ms  Medium
read_metadata_from_memoryview()  8 ms   Low

Complete Examples

Reading from S3

import boto3

def read_s3_metadata(bucket, key):
    """Read metadata from an S3 object loaded into memory."""
    s3 = boto3.client('s3')

    # Fetch the object
    response = s3.get_object(Bucket=bucket, Key=key)

    # Read the body into memory
    data = response['Body'].read()

    # Parse metadata
    return parquet_meta.read_metadata_from_bytes(data)
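
Usage, with placeholder bucket and key:

metadata = read_s3_metadata('my-bucket', 'data.parquet')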

HTTP Range Requests

For very large files, you might want to read only the footer:

import requests

def read_http_metadata(url):
    """Read metadata from HTTP source."""
    # Download entire file (simplified)
    response = requests.get(url)
    data = response.content

    return parquet_meta.read_metadata_from_bytes(data)

# Note: production code should use range requests to fetch
# only the footer, not the entire file; see the sketch below
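
A sketch of that range-request approach, built on the Parquet footer layout (a file ends with the metadata block, a 4-byte little-endian length, then the PAR1 magic). Whether rugo will parse a buffer containing only the footer is an assumption; verify it against your version:

import struct
import requests

def read_remote_footer_metadata(url):
    """Fetch only the Parquet footer via HTTP range requests.

    Assumes the server honors Range headers and that rugo can parse
    a footer-only buffer; verify both before relying on this.
    """
    # Last 8 bytes of the file: 4-byte little-endian footer length + b"PAR1"
    resp = requests.get(url, headers={"Range": "bytes=-8"})
    total_size = int(resp.headers["Content-Range"].split("/")[1])
    footer_len = struct.unpack("<I", resp.content[:4])[0]
    assert resp.content[4:] == b"PAR1", "not a Parquet file"

    # Fetch the footer plus the trailing 8 bytes
    start = total_size - 8 - footer_len
    footer = requests.get(
        url, headers={"Range": f"bytes={start}-{total_size - 1}"}
    ).content

    # Prepend the header magic so the buffer resembles a whole file;
    # this trick only works if the parser reads from the end
    return parquet_meta.read_metadata_from_bytes(b"PAR1" + footer)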

Memory-Mapped Processing

import mmap

def process_large_file(filename):
    """Efficiently process large Parquet file."""
    with open(filename, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Zero-copy parsing
            metadata = parquet_meta.read_metadata_from_memoryview(
                memoryview(mm)
            )

            # Process metadata
            print(f"Rows: {metadata['num_rows']}")
            print(f"Columns: {len(metadata['schema_columns'])}")

            return metadata
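
Usage:

metadata = process_large_file("large_file.parquet")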

Error Handling

def safe_read_bytes(data):
    """Safely read metadata from bytes."""
    try:
        metadata = parquet_meta.read_metadata_from_bytes(data)
        return metadata
    except Exception as e:
        print(f"Failed to parse: {e}")
        return None

# Validate data first
if data.startswith(b'PAR1'):
    metadata = safe_read_bytes(data)
else:
    print("Not a Parquet file")

Best Practices

  1. Use memoryview for large files - Avoid unnecessary copies
  2. Memory map when possible - Don't load entire file
  3. Validate data first - Check for PAR1 magic bytes
  4. Handle errors gracefully - Invalid data should not crash
  5. Clean up resources - Close files and release memory

Next Steps