Users attempting to upload large datasets (2.7GB+) were encountering:
- Server-side rejection: PHP upload limits too restrictive (2MB max)
- Client-side OverflowError: Python SSL limitation when sending >2GB as single buffer
| Setting | Old Value | New Value | Purpose |
|---|---|---|---|
| `upload_max_filesize` | 2M | 5G | Maximum size per uploaded file |
| `post_max_size` | 8M | 5G | Maximum total POST request size |
| `max_execution_time` | 30 | 3600 | Maximum script runtime (1 hour) |
| `memory_limit` | 16G | 16G | Already sufficient ✓ |
After making these changes, you must restart the OpenML Docker container:
```bash
docker-compose down
docker-compose up -d --build
```

Or, if using plain Docker:

```bash
docker stop <container_name>
docker start <container_name>
```

The client-side failure looks like this:

```
OverflowError: string longer than 2147483647 bytes
```
Root cause: Python's SSL layer uses a signed 32-bit integer for write buffer length. This limits a single send() call to 2,147,483,647 bytes (2^31-1 ≈ 2GB).
Why it happens: The openml-python client or requests library may be:
- Reading the entire 2.7GB file into memory as one bytes object
- Building the entire multipart POST body in memory
- Attempting to send it in one SSL write operation
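A quick pre-flight check (a hypothetical helper, not part of openml-python) to confirm whether a file is large enough to hit this limit if sent as a single buffer:

```python
import os

SSL_SINGLE_WRITE_LIMIT = 2**31 - 1  # 2,147,483,647 bytes

def exceeds_ssl_limit(path):
    """Return True if sending this file as one bytes object would overflow."""
    return os.path.getsize(path) > SSL_SINGLE_WRITE_LIMIT

print(exceeds_ssl_limit('dataset.arff'))  # True for a 2.7GB file
```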
Modify how the file is passed to the OpenML client. Instead of:

```python
# BAD - loads entire file into memory
with open('dataset.arff', 'rb') as f:
    data = f.read()  # 2.7GB in RAM!

openml_dataset.publish()  # triggers OverflowError
```

Use streaming (requires patching openml-python or using direct requests):
```python
# GOOD - streams in chunks
import requests

with open('dataset.arff', 'rb') as f:
    files = {'dataset': ('dataset.arff', f)}  # Pass file handle, not bytes
    response = requests.post(
        'https://openml.org/api/v1/data',
        files=files,
        data={'api_key': 'YOUR_KEY', 'description': xml_description}
    )
```

Note: If openml-python internally calls `f.read()`, you'll need to patch it or use Option 2/3.
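Also note that `requests` may still assemble the complete multipart body in memory even when given a file handle, so for files near or above the 2GB limit a streaming multipart encoder is safer. A minimal sketch using the `requests-toolbelt` package (an assumption: it must be installed separately with `pip install requests-toolbelt`; URL and field names mirror the example above):

```python
import requests
from requests_toolbelt import MultipartEncoder

# MultipartEncoder streams the multipart body from the file handle instead of
# materializing the whole payload in memory before the SSL write.
with open('dataset.arff', 'rb') as f:
    encoder = MultipartEncoder(fields={
        'api_key': 'YOUR_KEY',
        'description': xml_description,              # XML string as before
        'dataset': ('dataset.arff', f, 'text/plain'),
    })
    response = requests.post(
        'https://openml.org/api/v1/data',
        data=encoder,
        headers={'Content-Type': encoder.content_type},
        timeout=3600,
    )
```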
Reduce file size below 2GB:

```bash
# ARFF supports gzip compression
gzip dataset.arff
# Result: dataset.arff.gz (often 10-50x smaller for sparse data)
```

Then upload the `.arff.gz` file. OpenML should accept compressed ARFF.
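If you prefer to compress from Python (for example, as part of an upload script), a sketch using only the standard library; file names are placeholders:

```python
import gzip
import os
import shutil

SSL_LIMIT = 2**31 - 1  # single-write ceiling in Python's SSL layer

# Compress dataset.arff to dataset.arff.gz without loading it into memory
with open('dataset.arff', 'rb') as src, \
        gzip.open('dataset.arff.gz', 'wb', compresslevel=9) as dst:
    shutil.copyfileobj(src, dst)

compressed = os.path.getsize('dataset.arff.gz')
print(f"Compressed size: {compressed / 1e9:.2f} GB")
if compressed >= SSL_LIMIT:
    print("Still above the 2GB single-buffer limit; use streaming or external hosting.")
```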
Upload to a service that handles large files:
- Zenodo: Free, DOI-based, handles 50GB+
- AWS S3: Pay-per-use, unlimited size
- Institutional repository: Check your university
Then register the dataset in OpenML by URL:
```python
import openml

dataset = openml.datasets.OpenMLDataset(
    name="My Large Dataset",
    description="...",
    url="https://zenodo.org/record/12345/files/dataset.arff.gz",
    format="arff",
    version_label="1.0"
)
dataset.publish()
```

If you control the client environment, patch the library to use streaming:
File to patch: `<python_site_packages>/openml/_api_calls.py`

Find the section that builds `file_elements` and ensure it passes file handles, not bytes:

```python
# In _perform_api_call or _read_url_files

# BEFORE (bad):
file_data = open(filepath, 'rb').read()  # Loads all into memory
file_elements = {'dataset': (filename, file_data)}

# AFTER (good):
file_handle = open(filepath, 'rb')  # Keep handle open
file_elements = {'dataset': (filename, file_handle)}
```
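To locate `_api_calls.py` in your environment before editing it, a quick check (assumes openml-python is importable):

```python
import os
import openml

# Path to the installed openml package, from which _api_calls.py can be found
package_dir = os.path.dirname(openml.__file__)
print(os.path.join(package_dir, '_api_calls.py'))
```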
- Check PHP configuration is loaded:

  ```bash
  docker exec <container_name> php -i | grep -E 'upload_max_filesize|post_max_size|max_execution_time'
  ```

  Should show:

  ```
  upload_max_filesize => 5G
  post_max_size => 5G
  max_execution_time => 3600
  ```

- Try a test upload via curl:

  ```bash
  curl -X POST https://your-openml-server.org/api/v1/data \
    -F "api_key=YOUR_KEY" \
    -F "description=@description.xml" \
    -F "dataset=@test_large_file.arff"
  ```
- Try uploading a 1GB file first (below the 2GB SSL limit); see the sketch after this list for generating a dummy test file
- Monitor memory usage with `htop` or Task Manager
- If successful, the client is streaming properly
- For 2.7GB files, use compression or external hosting
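A sketch for generating a roughly 1GB dummy file to use as the test upload; the file name and size are placeholders, and the content is random bytes rather than valid ARFF, so use it only to exercise the transport path:

```python
import os

TARGET_SIZE = 1 * 1024**3   # ~1 GiB test payload
CHUNK = 64 * 1024**2        # write in 64 MiB chunks to keep memory flat

with open('test_large_file.bin', 'wb') as f:
    written = 0
    while written < TARGET_SIZE:
        block = os.urandom(min(CHUNK, TARGET_SIZE - written))
        f.write(block)
        written += len(block)

print(f"Wrote {written / 1024**3:.2f} GiB")
```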
Best approach combining all solutions:

1. Compress the dataset (reduces transfer time and bypasses the SSL limit):

   ```bash
   gzip -9 dataset.arff  # Maximum compression
   ```

2. Verify server config (already fixed in this repo):
   - Restart the Docker container to load the new php.ini

3. Upload via direct HTTP streaming (bypassing the openml-python client):

   ```python
   import requests

   api_key = "YOUR_API_KEY"
   url = "https://openml.org/api/v1/data"

   # Prepare XML description
   xml_desc = """<?xml version="1.0" encoding="UTF-8"?>
   <oml:data_set_description xmlns:oml="http://openml.org/openml">
     <oml:name>Dataset Name</oml:name>
     <oml:description>Description here</oml:description>
     <oml:format>arff</oml:format>
   </oml:data_set_description>"""

   # Stream upload
   with open('dataset.arff.gz', 'rb') as f:
       response = requests.post(
           url,
           data={'api_key': api_key, 'description': xml_desc},
           files={'dataset': ('dataset.arff.gz', f)},
           timeout=3600  # 1 hour timeout for large uploads
       )

   print(response.text)
   ```

4. Monitor upload progress (optional):

   ```python
   import os

   import requests
   from tqdm import tqdm

   # Wrapper that reports bytes read as a progress bar
   class TqdmUploader:
       def __init__(self, filename):
           self.filename = filename
           self.size = os.path.getsize(filename)
           self.progress = tqdm(total=self.size, unit='B', unit_scale=True)

       def __enter__(self):
           self.f = open(self.filename, 'rb')
           return self

       def __exit__(self, *args):
           self.f.close()
           self.progress.close()

       def read(self, size=-1):
           chunk = self.f.read(size)
           self.progress.update(len(chunk))
           return chunk

   with TqdmUploader('dataset.arff.gz') as uploader:
       response = requests.post(url, files={'dataset': uploader}, ...)
   ```
If you're using nginx as a reverse proxy (not present in the current setup), also add:

```nginx
client_max_body_size 5G;
proxy_read_timeout 3600s;
```

For very large uploads over slow connections:

- Client timeout: set `timeout=7200` in requests (2 hours); see the sketch after this list
- Server timeout: already set via `max_execution_time = 3600`
- Load balancer timeout: check cloud provider settings (AWS ALB, GCP LB, etc.)
Uploading 2.7GB datasets requires adequate disk space; a quick check is sketched after this list:

- Temporary space: `/tmp` needs ~2.7GB during upload
- Final storage: `DATA_PATH` needs ~2.7GB per dataset
- Recommended: 50GB+ free space on the server
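A quick sketch to confirm both locations have room before starting the upload; the paths are placeholders for your actual temp directory and DATA_PATH:

```python
import shutil

REQUIRED_BYTES = 3 * 1024**3  # ~3 GiB of headroom for a 2.7GB dataset

for label, path in [('temporary space', '/tmp'), ('final storage', '/path/to/DATA_PATH')]:
    free = shutil.disk_usage(path).free
    status = 'OK' if free >= REQUIRED_BYTES else 'INSUFFICIENT'
    print(f"{label} ({path}): {free / 1024**3:.1f} GiB free [{status}]")
```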
If all else fails, consider splitting into multiple smaller datasets:

```python
# Split dataset into chunks
import pandas as pd

df = pd.read_csv('dataset.csv')
chunk_size = 1_000_000  # 1M rows per chunk

for i, start in enumerate(range(0, len(df), chunk_size)):
    chunk = df.iloc[start:start + chunk_size]
    # to_csv writes CSV; convert each part to ARFF before uploading
    chunk.to_csv(f'dataset_part{i}.csv', index=False, header=True)

# Upload each part separately
```

✅ Server-side limits fixed (this repo)

Client-side workarounds:
- File compression (easiest)
- Streaming upload (most robust)
- External hosting (most flexible)
For your 2.7GB file: compress with gzip first; for typical datasets this should bring it below 500MB.