
Large Dataset Upload Fix

Problem Summary

Users attempting to upload large datasets (2.7GB+) were encountering:

  1. Server-side rejection: PHP upload limits too restrictive (2MB max)
  2. Client-side OverflowError: Python SSL limitation when sending >2GB as single buffer

Server-Side Fix (COMPLETED ✓)

Changes Made to /docker/config/php.ini:

Setting              Old Value  New Value  Purpose
upload_max_filesize  2M         5G         Maximum size per uploaded file
post_max_size        8M         5G         Maximum total POST request size
max_execution_time   30         3600       Maximum script runtime (1 hour)
memory_limit         16G        16G        Already sufficient ✓
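
For reference, the corresponding lines in /docker/config/php.ini after the change:

upload_max_filesize = 5G
post_max_size = 5G
max_execution_time = 3600
memory_limit = 16G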

Deployment Required

After making these changes, you must restart the OpenML Docker container:

docker-compose down
docker-compose up -d --build

Or if using plain Docker:

docker restart <container_name>

Note: a plain restart reloads php.ini only if the config file is bind-mounted into the container; if it is copied in at image build time, rebuild the image first.

Client-Side Issue (Still Needs Addressing)

The OverflowError Explained

OverflowError: string longer than 2147483647 bytes

Root cause: Python's SSL layer uses a signed 32-bit integer for write buffer length. This limits a single send() call to 2,147,483,647 bytes (2^31-1 ≈ 2GB).

Why it happens: The openml-python client or requests library may be:

  1. Reading the entire 2.7GB file into memory as one bytes object
  2. Building the entire multipart POST body in memory
  3. Attempting to send it in one SSL write operation
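
As a quick sanity check before uploading, you can compare the file size against the SSL write limit. A minimal sketch using only the standard library:

import os

SSL_MAX_WRITE = 2**31 - 1  # 2,147,483,647 bytes: largest single SSL write CPython accepts

size = os.path.getsize('dataset.arff')
if size >= SSL_MAX_WRITE:
    print(f"{size} bytes exceeds the 2GB single-buffer limit; stream or compress the upload.")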

Solutions for Client-Side

Option 1: Stream the Upload (RECOMMENDED)

Modify how the file is passed to the OpenML client. Instead of:

# BAD - loads entire file into memory
with open('dataset.arff', 'rb') as f:
    data = f.read()  # 2.7GB in RAM!
    openml_dataset.publish()  # triggers OverflowError

Use a file handle instead (via direct requests, or by patching openml-python):

# BETTER - passes a file handle instead of a bytes object
import requests

with open('dataset.arff', 'rb') as f:
    files = {'dataset': ('dataset.arff', f)}  # Pass file handle, not bytes
    response = requests.post(
        'https://openml.org/api/v1/data',
        files=files,
        data={'api_key': 'YOUR_KEY', 'description': xml_description}
    )

Note: If openml-python internally calls f.read(), you'll need to patch it or use Option 2/3. Also be aware that plain requests still assembles the full multipart body in memory before sending, so for files near the 2GB limit use a streaming encoder, as sketched below.
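
For genuinely chunked uploads, the requests-toolbelt package streams the multipart body instead of building it in memory. A minimal sketch (assumes pip install requests-toolbelt and reuses xml_description from above):

import requests
from requests_toolbelt.multipart.encoder import MultipartEncoder

with open('dataset.arff', 'rb') as f:
    encoder = MultipartEncoder(fields={
        'api_key': 'YOUR_KEY',
        'description': xml_description,
        'dataset': ('dataset.arff', f, 'application/octet-stream'),
    })
    response = requests.post(
        'https://openml.org/api/v1/data',
        data=encoder,  # requests reads the encoder in chunks
        headers={'Content-Type': encoder.content_type},
    )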

Option 2: Compress Before Upload

Reduce file size below 2GB:

# ARFF supports gzip compression
gzip dataset.arff
# Result: dataset.arff.gz (often 10-50x smaller for sparse data)

Then upload the .arff.gz file. OpenML should accept compressed ARFF.
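
The same compression can be done from Python, streamed in fixed-size chunks so the 2.7GB file never sits in memory at once:

import gzip
import shutil

with open('dataset.arff', 'rb') as src:
    with gzip.open('dataset.arff.gz', 'wb', compresslevel=9) as dst:
        shutil.copyfileobj(src, dst)  # copies in small chunks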

Option 3: Host Externally and Register by URL

Upload to a service that handles large files:

  • Zenodo: Free, DOI-based, handles 50GB+
  • AWS S3: Pay-per-use, unlimited size
  • Institutional repository: Check your university

Then register the dataset in OpenML by URL:

import openml

# Note: constructor argument names (e.g. format vs. data_format) differ
# between openml-python versions; check the docs for your installed version.
dataset = openml.datasets.OpenMLDataset(
    name="My Large Dataset",
    description="...",
    url="https://zenodo.org/record/12345/files/dataset.arff.gz",
    format="arff",
    version_label="1.0"
)
dataset.publish()

Option 4: Patch openml-python

If you control the client environment, patch the library to use streaming:

File to patch: <python_site_packages>/openml/_api_calls.py

Find the section that builds file_elements and ensure it passes file handles, not bytes:

# In _perform_api_call or _read_url_files
# BEFORE (bad):
file_data = open(filepath, 'rb').read()  # loads the whole file into memory
file_elements = {'dataset': (filename, file_data)}

# AFTER (good):
file_handle = open(filepath, 'rb')  # pass the handle; close it once the request completes
file_elements = {'dataset': (filename, file_handle)}
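
A safer shape for the same patch keeps the handle in a context manager so it is closed once the request finishes. The call site below is a hypothetical sketch; the exact function and its arguments vary across openml-python versions, so adapt it to the actual code in _api_calls.py:

# Hypothetical call-site shape; function name and arguments are assumptions
with open(filepath, 'rb') as file_handle:
    file_elements = {'dataset': (filename, file_handle)}
    response = _read_url_files(url, data=data, file_elements=file_elements)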

Testing Your Fix

Server-Side Test

  1. Check PHP configuration is loaded:

    docker exec <container_name> php -i | grep -E 'upload_max_filesize|post_max_size|max_execution_time'

    Should show: upload_max_filesize => 5G, post_max_size => 5G, max_execution_time => 3600

  2. Try a test upload via curl:

    curl -X POST https://your-openml-server.org/api/v1/data \
      -F "api_key=YOUR_KEY" \
      -F "description=@description.xml" \
      -F "dataset=@test_large_file.arff"

Client-Side Test

  1. Try uploading a 1GB file first (below the 2GB SSL limit)
  2. Monitor memory usage with htop, Task Manager, or the snippet after this list
  3. If peak memory stays well below the file size, the client is streaming properly
  4. For 2.7GB files, use compression or external hosting
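
One way to check peak memory from inside the upload script itself is the standard library's resource module (Unix only). A peak far below the file size indicates the body is being streamed rather than buffered:

import resource

peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
# ru_maxrss is reported in kilobytes on Linux (bytes on macOS)
print(f"Peak RSS: {peak / 1024:.0f} MB")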

Recommended Workflow for 2.7GB Dataset

Best approach combining all solutions:

  1. Compress the dataset (reduces transfer time and bypasses SSL limit):

    gzip -9 dataset.arff  # Maximum compression
  2. Verify server config (already fixed in this repo):

    • Restart Docker container to load new php.ini
  3. Upload via direct HTTP (bypassing the openml-python client):

    import requests
    
    api_key = "YOUR_API_KEY"
    url = "https://openml.org/api/v1/data"
    
    # Prepare XML description
    xml_desc = """<?xml version="1.0" encoding="UTF-8"?>
    <oml:data_set_description xmlns:oml="http://openml.org/openml">
      <oml:name>Dataset Name</oml:name>
      <oml:description>Description here</oml:description>
      <oml:format>arff</oml:format>
    </oml:data_set_description>"""
    
    # Upload (plain requests buffers the multipart body in memory,
    # which is fine once compression brings the file under 2GB)
    with open('dataset.arff.gz', 'rb') as f:
        response = requests.post(
            url,
            data={'api_key': api_key, 'description': xml_desc},
            files={'dataset': ('dataset.arff.gz', f)},
            timeout=3600  # 1 hour timeout for large uploads
        )
    
    print(response.text)
  4. Monitor upload progress (optional):

    from tqdm import tqdm
    import os
    import requests
    
    # Wrapper that updates a progress bar as requests reads the file
    class TqdmUploader:
        def __init__(self, filename):
            self.filename = filename
            self.size = os.path.getsize(filename)
            self.progress = tqdm(total=self.size, unit='B', unit_scale=True)
        
        def __enter__(self):
            self.f = open(self.filename, 'rb')
            return self
        
        def __exit__(self, *args):
            self.f.close()
            self.progress.close()
        
        def read(self, size=-1):
            chunk = self.f.read(size)
            self.progress.update(len(chunk))
            return chunk
    
    with TqdmUploader('dataset.arff.gz') as uploader:
        response = requests.post(
            url,
            data={'api_key': api_key, 'description': xml_desc},
            files={'dataset': ('dataset.arff.gz', uploader)},  # explicit filename; the wrapper has no .name
            timeout=3600
        )
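
Caveat: with plain requests, the progress bar above advances while the multipart body is being assembled in memory, not necessarily as bytes cross the network. requests-toolbelt's MultipartEncoderMonitor ties progress to actual reads of a streamed body; a sketch assuming the api_key, url, and xml_desc variables from step 3:

from requests_toolbelt.multipart.encoder import MultipartEncoder, MultipartEncoderMonitor
from tqdm import tqdm
import requests

with open('dataset.arff.gz', 'rb') as f:
    encoder = MultipartEncoder(fields={
        'api_key': api_key,
        'description': xml_desc,
        'dataset': ('dataset.arff.gz', f, 'application/octet-stream'),
    })
    bar = tqdm(total=encoder.len, unit='B', unit_scale=True)
    monitor = MultipartEncoderMonitor(
        encoder, lambda m: bar.update(m.bytes_read - bar.n)  # advance by newly read bytes
    )
    response = requests.post(
        url,
        data=monitor,
        headers={'Content-Type': monitor.content_type},
        timeout=3600,
    )
    bar.close()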

Additional Considerations

Web Server Configuration

If you're using nginx as a reverse proxy (not present in the current setup), also add:

client_max_body_size 5G;
proxy_read_timeout 3600s;
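
By default nginx also spools the request body to disk before passing it upstream, which doubles disk traffic for multi-GB uploads. If your nginx is 1.7.11 or newer, you can optionally disable that (only relevant if the proxy exists):

proxy_request_buffering off;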

Network Timeouts

For very large uploads over slow connections:

  • Client timeout: Set timeout=7200 in requests (2 hours)
  • Server timeout: Already set via max_execution_time = 3600
  • Load balancer timeout: Check cloud provider settings (AWS ALB, GCP LB, etc.)

Storage Space

Uploading 2.7GB datasets requires adequate disk space:

  • Temporary space: /tmp needs ~2.7GB during upload
  • Final storage: DATA_PATH needs ~2.7GB per dataset
  • Recommend: 50GB+ free space on server
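
A quick pre-flight check of both locations from Python (the second path is a placeholder for wherever DATA_PATH points on your server):

import shutil

for path in ('/tmp', '/path/to/DATA_PATH'):  # placeholder: substitute your DATA_PATH
    free_gib = shutil.disk_usage(path).free / 2**30
    print(f"{path}: {free_gib:.1f} GiB free")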

Alternative: Split Dataset

If all else fails, consider splitting into multiple smaller datasets:

# Split the dataset into chunks
import pandas as pd

df = pd.read_csv('dataset.csv')
chunk_size = 1_000_000  # 1M rows per chunk

for i, start in enumerate(range(0, len(df), chunk_size)):
    chunk = df.iloc[start:start + chunk_size]
    chunk.to_csv(f'dataset_part{i}.csv', index=False)  # convert each part to ARFF before uploading
    # Upload each part separately

Summary

✓ Server-side limits fixed (this repo)
⚠️ Client-side requires one of:

  • File compression (easiest)
  • Streaming upload (most robust)
  • External hosting (most flexible)

For your 2.7GB file: compress with gzip first; typical datasets often shrink to well under 500MB, especially sparse ones.