# Working with In-Memory Data

Learn how to read Parquet metadata from data already loaded in memory.

## Overview

Rugo provides two functions for working with in-memory data:

- `read_metadata_from_bytes()` - for `bytes` objects
- `read_metadata_from_memoryview()` - for zero-copy parsing

Both functions accept the same options as `read_metadata()`.
## Reading from Bytes

### Basic Usage

```python
import rugo.parquet as parquet_meta

# Load the file into memory
with open("example.parquet", "rb") as f:
    data = f.read()

# Parse metadata
metadata = parquet_meta.read_metadata_from_bytes(data)
```
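The `num_rows` and `schema_columns` keys used throughout this page make a quick sanity check:

```python
# Inspect the parsed metadata
print(f"Rows: {metadata['num_rows']}")
print(f"Columns: {len(metadata['schema_columns'])}")
```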
### Use Cases

#### Network Sources

```python
import urllib.request

# Download the Parquet file
url = "https://example.com/data.parquet"
with urllib.request.urlopen(url) as response:
    data = response.read()

# Parse metadata
metadata = parquet_meta.read_metadata_from_bytes(data)
```
#### Cloud Storage

```python
# AWS S3
import boto3

s3 = boto3.client('s3')
response = s3.get_object(Bucket='my-bucket', Key='data.parquet')
data = response['Body'].read()

metadata = parquet_meta.read_metadata_from_bytes(data)
```

```python
# Google Cloud Storage
from google.cloud import storage

client = storage.Client()
bucket = client.bucket('my-bucket')
blob = bucket.blob('data.parquet')
data = blob.download_as_bytes()

metadata = parquet_meta.read_metadata_from_bytes(data)
```
#### In-Memory Testing

```python
# Generate test data with PyArrow
import io

import pyarrow as pa
import pyarrow.parquet as pq

# Create a test table
table = pa.table({'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']})

# Write it to an in-memory buffer
buf = io.BytesIO()
pq.write_table(table, buf)
data = buf.getvalue()

# Read metadata
metadata = parquet_meta.read_metadata_from_bytes(data)
```
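As a follow-up, the parsed metadata should agree with the table that was just written (`Table.num_rows` is PyArrow's row count):

```python
# Round-trip check: three rows in, three rows reported
assert metadata['num_rows'] == table.num_rows
```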
## Reading from Memoryview

### Zero-Copy Parsing

For maximum efficiency, use `memoryview` to avoid copying data:

```python
# Load data
with open("example.parquet", "rb") as f:
    data = f.read()

# Parse with zero copy
metadata = parquet_meta.read_metadata_from_memoryview(memoryview(data))
```
### Memory-Mapped Files

Memory mapping is ideal for large files:

```python
import mmap

with open("large_file.parquet", "rb") as f:
    # Memory-map the file
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Parse without reading the whole file into memory
        metadata = parquet_meta.read_metadata_from_memoryview(memoryview(mm))
```
### Shared Memory

Work with data from `multiprocessing`:

```python
from multiprocessing import shared_memory

# Assume 'shm' is a shared memory block containing Parquet data
# shm = shared_memory.SharedMemory(name='parquet_data')

# Parse directly from shared memory
metadata = parquet_meta.read_metadata_from_memoryview(
    memoryview(shm.buf)
)
```
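For context, here is a minimal sketch of the producer side, using the standard-library `shared_memory` API; the block name `'parquet_data'` matches the comment above and is purely illustrative:

```python
from multiprocessing import shared_memory

# Producer: copy Parquet file bytes into a named block that other
# processes can attach to by name
with open("example.parquet", "rb") as f:
    payload = f.read()

shm = shared_memory.SharedMemory(
    name='parquet_data', create=True, size=len(payload)
)
shm.buf[:len(payload)] = payload

# Once every process is finished with the block:
# shm.close()   # detach from this process
# shm.unlink()  # free the block (creator only)
```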
## Comparison: Bytes vs Memoryview

### When to Use Bytes

- ✅ Data is already a `bytes` object
- ✅ Working with network/API responses
- ✅ Simple use cases
- ✅ Data size is manageable

### When to Use Memoryview

- ✅ Large files (> 100 MB)
- ✅ Memory-mapped files
- ✅ Shared memory scenarios
- ✅ Need zero-copy access
- ✅ Performance-critical paths
## Performance Considerations

### Memory Usage

```python
import sys

# bytes: holds its own copy of the data
with open("file.parquet", "rb") as f:
    data_bytes = f.read()
print(f"Bytes size: {sys.getsizeof(data_bytes)} bytes")

# memoryview: references the existing buffer (no copy)
data_view = memoryview(data_bytes)
print(f"Memoryview size: {sys.getsizeof(data_view)} bytes")  # Much smaller!
```
### Speed Comparison

For a 100 MB file:

| Method | Time | Memory |
|---|---|---|
| `read_metadata()` | 10 ms | Low |
| `read_metadata_from_bytes()` | 10 ms | Medium |
| `read_metadata_from_memoryview()` | 8 ms | Low |
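Exact numbers depend on your machine and file layout. A minimal sketch for reproducing the comparison on your own files (the filename is a placeholder, and `parquet_meta` is imported as above):

```python
import time

def best_time(fn, *args, repeats=10):
    """Best wall-clock time over several runs, in seconds."""
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

with open("file.parquet", "rb") as f:
    data = f.read()

print(f"bytes:      {best_time(parquet_meta.read_metadata_from_bytes, data):.4f} s")
print(f"memoryview: {best_time(parquet_meta.read_metadata_from_memoryview, memoryview(data)):.4f} s")
```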
## Complete Examples

### S3 with Streaming

```python
import boto3

def read_s3_metadata(bucket, key):
    """Read metadata from an object stored in S3."""
    s3 = boto3.client('s3')

    # Fetch the object
    response = s3.get_object(Bucket=bucket, Key=key)

    # Read the body into memory
    data = response['Body'].read()

    # Parse metadata
    return parquet_meta.read_metadata_from_bytes(data)
```
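For large objects, `get_object` also accepts an HTTP `Range`, so only the footer needs to be downloaded; the footer-length logic is shown in the HTTP sketch in the next section. Bucket and key here are placeholders:

```python
# Suffix range: download only the last 8 bytes of the object,
# which hold the 4-byte footer length and the b'PAR1' magic
tail = s3.get_object(
    Bucket='my-bucket', Key='data.parquet', Range='bytes=-8'
)['Body'].read()
```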
### HTTP Range Requests

For very large files, you might want to read only the footer:

```python
import requests

def read_http_metadata(url):
    """Read metadata from an HTTP source."""
    # Download the entire file (simplified)
    response = requests.get(url)
    data = response.content

    return parquet_meta.read_metadata_from_bytes(data)

# Note: production code should use range requests
# to fetch only the footer, not the entire file
```
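A range-request version might look like the sketch below. It assumes the server honors `Range` headers and that `read_metadata_from_bytes()` accepts a buffer containing just the footer plus the trailing length and magic; verify the latter against the API reference:

```python
import struct

import requests

def read_http_footer_metadata(url):
    """Sketch: fetch only the Parquet footer via HTTP range requests."""
    # The last 8 bytes are a 4-byte little-endian footer length
    # followed by the b'PAR1' magic
    tail = requests.get(url, headers={'Range': 'bytes=-8'}).content
    if tail[4:] != b'PAR1':
        raise ValueError("Not a Parquet file")
    footer_len = struct.unpack('<I', tail[:4])[0]

    # Fetch the footer together with the trailing 8 bytes
    footer = requests.get(
        url, headers={'Range': f'bytes=-{footer_len + 8}'}
    ).content
    return parquet_meta.read_metadata_from_bytes(footer)
```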
### Memory-Mapped Processing

```python
import mmap

def process_large_file(filename):
    """Efficiently process a large Parquet file."""
    with open(filename, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Zero-copy parsing
            metadata = parquet_meta.read_metadata_from_memoryview(
                memoryview(mm)
            )

            # Process the metadata
            print(f"Rows: {metadata['num_rows']}")
            print(f"Columns: {len(metadata['schema_columns'])}")

            return metadata
```
## Error Handling

```python
def safe_read_bytes(data):
    """Safely read metadata from bytes."""
    try:
        return parquet_meta.read_metadata_from_bytes(data)
    except Exception as e:
        print(f"Failed to parse: {e}")
        return None

# Validate the data first
if data.startswith(b'PAR1'):
    metadata = safe_read_bytes(data)
else:
    print("Not a Parquet file")
```
## Best Practices

- Use memoryview for large files - avoid unnecessary copies
- Memory-map when possible - don't read the entire file into memory
- Validate data first - check for the PAR1 magic bytes
- Handle errors gracefully - invalid data should not crash your program
- Clean up resources - close files and release memoryviews before the buffers they reference (see the sketch below)
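A minimal sketch of the last point: an exported memoryview keeps its buffer pinned, and closing an `mmap` while a view is still alive raises `BufferError`. This assumes `read_metadata_from_memoryview()` does not hold on to the view after returning:

```python
import mmap

with open("example.parquet", "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        view = memoryview(mm)
        try:
            metadata = parquet_meta.read_metadata_from_memoryview(view)
        finally:
            # Release the view before the mmap closes; otherwise
            # closing mm raises BufferError because the buffer is
            # still exported
            view.release()
```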
## Next Steps

- Orso Integration - schema conversion helpers
- API Reference - complete function documentation