```
OverflowError: string longer than 2147483647 bytes
```
First, compress the dataset:

```bash
gzip -9 your_dataset.arff
```

This typically reduces file size by 80-95% for sparse datasets. Upload the .arff.gz file instead.
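If you would rather compress from Python (for example, right after generating the ARFF), here is a minimal sketch using only the standard library; the filename matches the shell command above:

```python
# A minimal sketch, assuming your_dataset.arff sits in the working directory.
# shutil.copyfileobj streams in chunks, so no multi-GB string is ever built.
import gzip
import os
import shutil

src = "your_dataset.arff"
dst = src + ".gz"

with open(src, "rb") as f_in, gzip.open(dst, "wb", compresslevel=9) as f_out:
    shutil.copyfileobj(f_in, f_out)

print(f"{os.path.getsize(src) / 1e9:.2f} GB -> {os.path.getsize(dst) / 1e9:.2f} GB")
```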
Replace your publish_dataset.py with this:
```python
import requests
import os

# Configuration
API_KEY = "your_api_key_here"
DATASET_FILE = "your_dataset.arff"  # or .arff.gz
DATASET_NAME = "Your Dataset Name"
DATASET_DESCRIPTION = "Description of your dataset"

# Create the XML description the OpenML API expects
xml_description = f"""<?xml version="1.0" encoding="UTF-8"?>
<oml:data_set_description xmlns:oml="http://openml.org/openml">
  <oml:name>{DATASET_NAME}</oml:name>
  <oml:description>{DATASET_DESCRIPTION}</oml:description>
  <oml:format>arff</oml:format>
</oml:data_set_description>"""

# POST the file as multipart form data; after gzip compression the body
# stays under the 2 GB SSL write limit that triggered the OverflowError
print(f"Uploading {DATASET_FILE} ({os.path.getsize(DATASET_FILE) / 1e9:.2f} GB)...")
with open(DATASET_FILE, 'rb') as f:
    response = requests.post(
        'https://www.openml.org/api/v1/data',
        data={
            'api_key': API_KEY,
            'description': xml_description,
        },
        files={'dataset': (os.path.basename(DATASET_FILE), f)},
        timeout=7200,  # 2-hour timeout for large uploads
    )

print(response.status_code)
print(response.text)
```
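Note that requests assembles the whole multipart body in memory before sending it, so if even the compressed file is still over ~2 GB you need a genuinely streaming encoder. A minimal sketch, assuming the third-party requests-toolbelt package (`pip install requests-toolbelt`) and reusing API_KEY, DATASET_FILE, and xml_description from the script above:

```python
# A streaming variant: MultipartEncoder feeds the socket in chunks, so the
# multipart body is never materialized as one giant string.
# Assumes: pip install requests-toolbelt; API_KEY, DATASET_FILE, and
# xml_description are defined as in the script above.
import os

import requests
from requests_toolbelt.multipart.encoder import MultipartEncoder

with open(DATASET_FILE, 'rb') as f:
    encoder = MultipartEncoder(fields={
        'api_key': API_KEY,
        'description': xml_description,
        'dataset': (os.path.basename(DATASET_FILE), f, 'application/octet-stream'),
    })
    response = requests.post(
        'https://www.openml.org/api/v1/data',
        data=encoder,
        headers={'Content-Type': encoder.content_type},
        timeout=7200,
    )

print(response.status_code, response.text)
```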
If the direct upload still fails, host the file elsewhere and register only its URL with OpenML:

- Upload to Zenodo, Figshare, or S3
- Get the permanent URL
- Register in OpenML:
```python
import openml

# Register the externally hosted file by URL; OpenML fetches it from there
dataset = openml.datasets.OpenMLDataset(
    name="Your Dataset Name",
    description="Your description",
    url="https://zenodo.org/record/XXXXX/files/dataset.arff.gz",
    format="arff",  # newer openml-python versions call this parameter data_format
)
dataset.publish()
```
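publish() authenticates with your client-side API key, so set it before calling publish(). A minimal setup sketch, assuming openml-python reads the key from its standard config attribute:

```python
# A minimal auth sketch: openml-python reads the API key from its config
# module (the key can also be set in the openml config file on disk).
import openml

openml.config.apikey = "your_api_key_here"  # same key as in publish_dataset.py
```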
Why the original upload failed:

- Python limitation: the SSL write buffer cannot exceed 2 GB (the signed 32-bit integer maximum)
- Client bug: openml-python loads the entire file into memory instead of streaming it
- Server limits: the default OpenML server upload limit was 2 MB (now raised to 5 GB)
See LARGE_DATASET_UPLOAD_FIX.md for complete details.