Merged
43 commits
3f73738
added deploy script with uploading to given rclone remote
gg46ixav Jul 3, 2025
9edc0dc
added webdav-url argument
gg46ixav Jul 4, 2025
a56f01d
added deploying to the databus without upload to nextcloud
gg46ixav Jul 25, 2025
5fdf78b
Merge branch 'download-capabilities' into nextcloudclient
gg46ixav Oct 21, 2025
800256c
updated pyproject.toml and content-hash
gg46ixav Oct 21, 2025
66f1c8e
Merge branch 'main' into nextcloudclient
gg46ixav Oct 28, 2025
4259229
Merge remote-tracking branch 'origin/main' into nextcloudclient
gg46ixav Oct 28, 2025
b179f90
updated README.md
gg46ixav Oct 28, 2025
a504b9d
Merge remote-tracking branch 'origin/nextcloudclient' into nextcloudc…
gg46ixav Oct 28, 2025
0ce0c24
added checksum validation
gg46ixav Oct 28, 2025
6596cbc
updated upload_to_nextcloud function to accept list of source_paths
gg46ixav Oct 28, 2025
b9f9854
only add result if upload successful
gg46ixav Oct 28, 2025
2f8493d
use os.path.basename instead of .split("/")[-1]
gg46ixav Oct 28, 2025
07359cc
added __init__.py and updated README.md
gg46ixav Oct 28, 2025
8047968
changed append to extend (no nested list)
gg46ixav Oct 28, 2025
0172450
fixed windows separators and added rclone error message
gg46ixav Oct 28, 2025
f957512
moved deploy.py to cli upload_and_deploy
gg46ixav Nov 3, 2025
607f527
changed metadata to dict list
gg46ixav Nov 3, 2025
6cb7e11
removed python-dotenv
gg46ixav Nov 3, 2025
7651c31
small updates
gg46ixav Nov 3, 2025
df17a7c
refactored upload_and_deploy function
gg46ixav Nov 3, 2025
7492531
updated README.md
gg46ixav Nov 3, 2025
c985603
updated metadata_string for new metadata format
gg46ixav Nov 3, 2025
62a3611
updated README.md
gg46ixav Nov 3, 2025
22ac02f
updated README.md
gg46ixav Nov 3, 2025
3faaf4d
Changed context url back
gg46ixav Nov 3, 2025
5dfebe5
added check for known compressions
gg46ixav Nov 3, 2025
f9367c0
updated checksum to sha256
gg46ixav Nov 3, 2025
5d474db
updated README.md
gg46ixav Nov 3, 2025
bef78ef
size check
gg46ixav Nov 3, 2025
529f2ae
updated checksum validation
gg46ixav Nov 3, 2025
77dca5a
added doc
gg46ixav Nov 3, 2025
02b1873
- refactored deploy, upload_and_deploy and deploy_with_metadata to on…
gg46ixav Nov 4, 2025
04c0b6e
updated README.md
gg46ixav Nov 4, 2025
fb93bc9
fixed docstring
gg46ixav Nov 4, 2025
8e6167b
removed metadata.json
gg46ixav Nov 4, 2025
943e30b
moved COMPRESSION_EXTS out of loop
gg46ixav Nov 4, 2025
1274cbc
removed unnecessary f-strings
gg46ixav Nov 4, 2025
02481b3
set file_format and compression to None
gg46ixav Nov 4, 2025
a5ec24d
get file_format and compression from metadata file
gg46ixav Nov 4, 2025
f95155f
updated README.md
gg46ixav Nov 4, 2025
274f252
chores
Integer-Ctrl Nov 5, 2025
f22c71d
updated metadata format (removed filename - used url instead)
gg46ixav Nov 5, 2025
38 changes: 38 additions & 0 deletions README.md
@@ -64,6 +64,44 @@ docker run --rm -v $(pwd):/data dbpedia/databus-python-client download https://d
A docker image is available at [dbpedia/databus-python-client](https://hub.docker.com/r/dbpedia/databus-python-client). See [download section](#usage-of-docker-image) for details.


## Deploy to Databus
Add your Databus API_KEY to a .env file.
Use a metadata.json file listing all files that should be added to the Databus.

The script registers all listed files on the Databus.
### Example Call
```bash
python -m databusclient.deploy \
--no-upload \
--metadata ./metadata.json \
--version-id https://databus.org/user/dataset/version/1.0 \
--title "Test Dataset" \
--abstract "This is a short abstract of the test dataset." \
--description "This dataset was uploaded for testing the Nextcloud → Databus deployment pipeline." \
--license https://dalicc.net/licenselibrary/Apache-2.0

```
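The file passed via `--metadata` is a JSON list of `[filename, checksum, size, url]` entries (see `databusclient/metadata.json` later in this PR). A minimal sketch for generating such a file — `sha256_and_length`, `build_metadata`, and the example URL are illustrative helpers introduced here, not part of the client:

```python
import hashlib
import json
import os


def sha256_and_length(filepath):
    # Stream the file in chunks so large datasets need not fit in memory.
    h = hashlib.sha256()
    size = 0
    with open(filepath, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            h.update(chunk)
            size += len(chunk)
    return h.hexdigest(), size


def build_metadata(files, base_url):
    # One entry per file: [filename, sha256 checksum, size in bytes, download URL].
    entries = []
    for path in files:
        checksum, size = sha256_and_length(path)
        name = os.path.basename(path)
        entries.append([name, checksum, size, f"{base_url}/{name}"])
    return entries


# Usage (paths hypothetical):
#   json.dump(build_metadata(["example.ttl"], base_url), fh, indent=2)
```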

## Upload to Nextcloud and Deploy to Databus
Add your Databus API_KEY to a .env file.

The script uploads all given files, and all files inside the given folders, to the given rclone remote, then registers them on the Databus.
### Example Call
```bash
python -m databusclient.deploy \
--webdav-url https://cloud.scadsai.uni-leipzig.de/remote.php/webdav \
--remote scads-nextcloud \
--path test \
--version-id https://databus.dbpedia.org/gg46ixav/test_group/test_artifact/2023-07-03 \
--title "Test Dataset" \
--abstract "This is a short abstract of the test dataset." \
--description "This dataset was uploaded for testing the Nextcloud → Databus deployment pipeline." \
--license https://dalicc.net/licenselibrary/Apache-2.0 \
/home/CSVTest/newtestoutputfolder \
/home/CSVTest/output.csv.bz2

```
## CLI Usage

**Installation**
2 changes: 1 addition & 1 deletion databusclient/client.py
@@ -343,7 +343,7 @@ def append_to_dataset_graph_if_existent(add_key: str, add_value: str):
graphs.append(dataset_graph)

dataset = {
"@context": "https://downloads.dbpedia.org/databus/context.jsonld",
"@context": "https://databus.dbpedia.org/res/context.jsonld",
"@graph": graphs,
}
return dataset
110 changes: 110 additions & 0 deletions databusclient/deploy.py
@@ -0,0 +1,110 @@
import os
import sys
import argparse
import json

from databusclient import create_distribution, create_dataset, deploy
from dotenv import load_dotenv

from nextcloudclient.upload import upload_to_nextcloud

⚠️ Potential issue | 🔴 Critical

Import path breaks when running from databusclient/; add fallback.

README shows cd databusclient && python deploy.py, which won’t find top‑level package nextcloudclient. Add a fallback import from databusclient.upload (or raise a clear error).

-from nextcloudclient.upload import upload_to_nextcloud
+try:
+    from nextcloudclient.upload import upload_to_nextcloud
+except ModuleNotFoundError:
+    try:
+        from databusclient.upload import upload_to_nextcloud
+    except ModuleNotFoundError as e:
+        raise ModuleNotFoundError(
+            "upload_to_nextcloud not found. Install/provide 'nextcloudclient' "
+            "or place 'upload.py' under 'databusclient' (importable as databusclient.upload)."
+        ) from e
🤖 Prompt for AI Agents
In databusclient/deploy.py around lines 9 to 10, the top-level import from
nextcloudclient.upload will fail when running python deploy.py from the
databusclient/ directory; add a fallback import that first tries from
nextcloudclient.upload import upload_to_nextcloud and if ImportError falls back
to from databusclient.upload import upload_to_nextcloud (or raise a clear
ImportError with guidance), so the module can be executed both as a package and
as a script.

def deploy_to_databus(
    metadata,
    version_id,
    title,
    abstract,
    description,
    license_url
):
⚠️ Potential issue | 🔴 Critical

Clarify expected metadata structure.

The metadata parameter is used at line 26 as for filename, checksum, size, url in metadata:, which expects a flat iterable of 4-tuples. However, upload_to_nextcloud (called at line 101) returns List[List[Tuple]] (a nested structure where each source path produces a list of tuples). This type mismatch will cause a runtime ValueError: too many values to unpack.

Apply this diff to flatten the metadata when it comes from upload_to_nextcloud:

In the main block at line 101, flatten the result:

-        metadata = upload_to_nextcloud(args.files, args.remote, args.path, args.webdav_url)
+        metadata_nested = upload_to_nextcloud(args.files, args.remote, args.path, args.webdav_url)
+        # Flatten the nested list: upload_to_nextcloud returns List[List[Tuple]]
+        metadata = [item for sublist in metadata_nested for item in sublist]

Alternatively, update the function signature and add type hints to make the expected structure explicit:

def deploy_to_databus(
    metadata: list[tuple[str, str, int, str]],  # list of (filename, checksum, size, url)
    version_id: str,
    ...
):

    load_dotenv()
    api_key = os.getenv("API_KEY")
    if not api_key:
        raise ValueError("API_KEY not found in .env")

    distributions = []
    counter = 0
    for filename, checksum, size, url in metadata:
        # Expect a SHA-256 hex digest (64 chars). Reject others.
        if not isinstance(checksum, str) or len(checksum) != 64:
            raise ValueError(f"Invalid checksum for '{filename}': expected SHA-256 hex (64 chars), got '{checksum}'")
        parts = filename.split(".")
        if len(parts) == 1:
            file_format = "none"
            compression = "none"
        elif len(parts) == 2:
            file_format = parts[-1]
            compression = "none"
        else:
            file_format = parts[-2]
            compression = parts[-1]

        distributions.append(
            create_distribution(
                url=url,
                cvs={"count": f"{counter}"},
                file_format=file_format,
                compression=compression,
                sha256_length_tuple=(checksum, size)
            )
        )
        counter += 1

    dataset = create_dataset(
        version_id=version_id,
        title=title,
        abstract=abstract,
        description=description,
        license_url=license_url,
        distributions=distributions
    )

    deploy(dataset, api_key)
    metadata_string = ",\n".join([entry[-1] for entry in metadata])

    print(f"Successfully deployed\n{metadata_string}\nto databus {version_id}")


def parse_args():
    parser = argparse.ArgumentParser(description="Upload files to Nextcloud and deploy to DBpedia Databus.")
    parser.add_argument("files", nargs="*", help="Path(s) to file(s) or folder(s) to upload")
    parser.add_argument("--webdav-url", help="WebDAV URL (e.g., https://cloud.example.com/remote.php/webdav)")
    parser.add_argument("--remote", help="rclone remote name (e.g., 'nextcloud')")
    parser.add_argument("--path", help="Remote path on Nextcloud (e.g., 'datasets/mydataset')")
    parser.add_argument("--no-upload", action="store_true", help="Skip file upload and use existing metadata")
    parser.add_argument("--metadata", help="Path to metadata JSON file (required if --no-upload is used)")

    parser.add_argument("--version-id", required=True, help="Databus version URI")
    parser.add_argument("--title", required=True, help="Title of the dataset")
    parser.add_argument("--abstract", required=True, help="Short abstract of the dataset")
    parser.add_argument("--description", required=True, help="Detailed description of the dataset")
    parser.add_argument("--license", required=True, help="License URL (e.g., https://dalicc.net/licenselibrary/Apache-2.0)")

    return parser.parse_args()


if __name__ == '__main__':
    args = parse_args()

    if args.no_upload:
        if not args.metadata:
            print("Error: --metadata is required when using --no-upload")
            sys.exit(1)
        if not os.path.isfile(args.metadata):
            print(f"Error: Metadata file not found: {args.metadata}")
            sys.exit(1)
        with open(args.metadata, 'r') as f:
            metadata = json.load(f)
    else:
        if not (args.webdav_url and args.remote and args.path):
            print("Error: --webdav-url, --remote, and --path are required unless --no-upload is used")
            sys.exit(1)
        metadata = upload_to_nextcloud(args.files, args.remote, args.path, args.webdav_url)

    deploy_to_databus(
        metadata,
        version_id=args.version_id,
        title=args.title,
        abstract=args.abstract,
        description=args.description,
        license_url=args.license
    )
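The extension handling in `deploy_to_databus` can be restated as a standalone helper for illustration (`split_format_and_compression` is a name introduced here, not part of the client): the last dot-suffix is taken as the compression, the suffix before it as the file format, and a single suffix is treated as an uncompressed file of that format.

```python
def split_format_and_compression(filename):
    # Mirrors deploy_to_databus: one suffix means an uncompressed file of
    # that format; two or more suffixes mean <format>.<compression>.
    parts = filename.split(".")
    if len(parts) == 1:
        return "none", "none"
    if len(parts) == 2:
        return parts[-1], "none"
    return parts[-2], parts[-1]


print(split_format_and_compression("output.csv.bz2"))  # ('csv', 'bz2')
print(split_format_and_compression("example.ttl"))     # ('ttl', 'none')
print(split_format_and_compression("README"))          # ('none', 'none')
```

Note that a name like `data.backup.json` would be read as format `backup` with compression `json`; the later commit "added check for known compressions" targets exactly this class of mis-detection.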
14 changes: 14 additions & 0 deletions databusclient/metadata.json
@@ -0,0 +1,14 @@
[
  [
    "example.ttl",
    "6e340b9cffb37a989ca544e6bb780a2c7e5d7dcb",
    12345,
    "https://cloud.example.com/remote.php/webdav/datasets/mydataset/example.ttl"
  ],
  [
    "example.csv.gz",
    "3f786850e387550fdab836ed7e6dc881de23001b",
    54321,
    "https://cloud.example.com/remote.php/webdav/datasets/mydataset/example.csv.gz"
  ]
]
Empty file added nextcloudclient/__init__.py
Empty file.
77 changes: 77 additions & 0 deletions nextcloudclient/upload.py
@@ -0,0 +1,77 @@
import hashlib
import os
import subprocess
import posixpath
from urllib.parse import urljoin, quote


def compute_sha256_and_length(filepath):
    sha256 = hashlib.sha256()
    total_length = 0
    with open(filepath, 'rb') as f:
        while True:
            chunk = f.read(4096)
            if not chunk:
                break
            sha256.update(chunk)
            total_length += len(chunk)
    return sha256.hexdigest(), total_length

def get_all_files(path):
    if os.path.isfile(path):
        return [path]
    files = []
    for root, _, filenames in os.walk(path):
        for name in filenames:
            files.append(os.path.join(root, name))
    return files

def upload_to_nextcloud(source_paths: list[str], remote_name: str, remote_path: str, webdav_url: str):
    result = []
    for path in source_paths:
        if not os.path.exists(path):
            print(f"Path not found: {path}")
            continue

        abs_path = os.path.abspath(path)
        basename = os.path.basename(abs_path)
        files = get_all_files(abs_path)

        tmp_results = []

        for file in files:
            checksum, size = compute_sha256_and_length(file)

            if os.path.isdir(path):
                rel_file = os.path.relpath(file, abs_path)
                # Normalize to POSIX for WebDAV/URLs
                rel_file = rel_file.replace(os.sep, "/")
                remote_webdav_path = posixpath.join(remote_path, basename, rel_file)
            else:
                remote_webdav_path = posixpath.join(remote_path, os.path.basename(file))

            # Preserve scheme/host and percent-encode path segments
            url = urljoin(webdav_url.rstrip("/") + "/", quote(remote_webdav_path.lstrip("/"), safe="/"))

            filename = os.path.basename(file)
            tmp_results.append((filename, checksum, size, url))

        dest_subpath = posixpath.join(remote_path.lstrip("/"), basename)
        if os.path.isdir(path):
            destination = f"{remote_name}:{dest_subpath}"
            command = ["rclone", "copy", abs_path, destination, "--progress"]
        else:
            destination = f"{remote_name}:{dest_subpath}"
            command = ["rclone", "copyto", abs_path, destination, "--progress"]

        print(f"Upload: {path} → {destination}")
        try:
            subprocess.run(command, check=True)
            result.extend(tmp_results)
            print("✅ Uploaded successfully.\n")
        except subprocess.CalledProcessError as e:
            print(f"❌ Error uploading {path}: {e}\n")
        except FileNotFoundError:
            print("❌ rclone not found on PATH. Install rclone and retry.")

    return result
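The URL construction used in `upload_to_nextcloud` can be sketched in isolation (`build_remote_url` is a hypothetical helper introduced here): joining with `posixpath` keeps WebDAV paths stable regardless of the local operating system's separator, and `quote(..., safe="/")` percent-encodes each segment without destroying the separators.

```python
import posixpath
from urllib.parse import urljoin, quote


def build_remote_url(webdav_url, remote_path, rel_file):
    # Normalize Windows separators, join with POSIX semantics, then
    # percent-encode everything except the "/" separators.
    remote_webdav_path = posixpath.join(remote_path, rel_file.replace("\\", "/"))
    return urljoin(webdav_url.rstrip("/") + "/",
                   quote(remote_webdav_path.lstrip("/"), safe="/"))


print(build_remote_url("https://cloud.example.com/remote.php/webdav",
                       "datasets/mydataset", "sub dir\\file 1.csv"))
# → https://cloud.example.com/remote.php/webdav/datasets/mydataset/sub%20dir/file%201.csv
```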
21 changes: 18 additions & 3 deletions poetry.lock

Some generated files are not rendered by default.

1 change: 1 addition & 0 deletions pyproject.toml
@@ -12,6 +12,7 @@ click = "^8.0.4"
requests = "^2.28.1"
tqdm = "^4.42.1"
SPARQLWrapper = "^2.0.0"
python-dotenv = "^1.1.1"
rdflib = "^7.2.1"

[tool.poetry.group.dev.dependencies]