
Commit 0c99ced

Merge branch 'main' into panc_pdo

2 parents: b5b1743 + bd584cc
45 files changed: 1174 additions & 633 deletions


.github/workflows/main.yml

Lines changed: 2 additions & 1 deletion
```diff
@@ -4,6 +4,7 @@ on:
   push:
     tags:
       - '*' # Triggers the workflow only on version tags
+  workflow_dispatch: # Allows manual triggering of the workflow

 # Sets permissions of the GITHUB_TOKEN to allow deployment to GitHub Pages
 permissions:
@@ -44,4 +45,4 @@ jobs:
     steps:
       - name: Deploy to GitHub Pages
         id: deployment
-        uses: actions/deploy-pages@v4
+        uses: actions/deploy-pages@v4
```

build/README.md

Lines changed: 12 additions & 13 deletions
````diff
@@ -10,11 +10,10 @@ are added.
 
 ## build_all.py script
 
-This script initializes all docker containers, builds all datasets, validates them, and uploads them to figshare and pypi.
+This script initializes all docker containers, builds all datasets, validates them, and uploads them to figshare.
 
 It requires the following authorization tokens to be set in the local environment depending on the use case:
 `SYNAPSE_AUTH_TOKEN`: Required for beataml and mpnst datasets. Join the [CoderData team](https://www.synapse.org/#!Team:3503472) on Synapse and generate an access token.
-`PYPI_TOKEN`: This token is required to upload to PyPI.
 `FIGSHARE_TOKEN`: This token is required to upload to Figshare.
 `GITHUB_TOKEN`: This token is required to upload to GitHub.
 
@@ -25,21 +24,20 @@ It requires the following authorization tokens to be set in the local environmen
 - `--omics`: Processes and builds the omics data files.
 - `--drugs`: Processes and builds the drug data files.
 - `--exp`: Processes and builds the experiment data files.
-- `--all`: Executes all available processes above (docker, samples, omics, drugs, exp). This does not run the validate, figshare, or pypi commands.
+- `--all`: Executes all available processes above (docker, samples, omics, drugs, exp). This does not run the validate or figshare commands.
 - `--validate`: Validates the generated datasets using the schema check scripts. This is automatically included if data upload occurs.
 - `--figshare`: Uploads the datasets to Figshare. FIGSHARE_TOKEN must be set in local environment.
-- `--pypi`: Uploads the package to PyPI. PYPI_TOKEN must be set in local environment.
 - `--high_mem`: Utilizes high memory mode for concurrent data processing. This has been successfully tested using 32 or more vCPUs.
 - `--dataset`: Specifies the datasets to process (default='broad_sanger,hcmi,beataml,mpnst,cptac').
-- `--version`: Specifies the version number for the PyPI package and Figshare upload title (e.g., "0.1.29"). This is required for figshare and PyPI upload steps. This must be a higher version than previously published versions.
+- `--version`: Specifies the version number for the Figshare upload title (e.g., "0.1.29"). This must be a higher version than previously published versions.
 - `--github-username`: GitHub username matching the GITHUB_TOKEN. Required to push the new Tag to the GitHub Repository.
 - `--github-email`: GitHub email matching the GITHUB_TOKEN. Required to push the new Tag to the GitHub Repository.
 
 **Example usage**:
-- Build all datasets and upload to Figshare and PyPI and GitHub.
-  Required tokens for the following command: `SYNAPSE_AUTH_TOKEN`, `PYPI_TOKEN`, `FIGSHARE_TOKEN`, `GITHUB_TOKEN`.
+- Build all datasets and upload to Figshare and GitHub.
+  Required tokens for the following command: `SYNAPSE_AUTH_TOKEN`, `FIGSHARE_TOKEN`, `GITHUB_TOKEN`.
 ```bash
-python build/build_all.py --all --high_mem --validate --pypi --figshare --version 0.1.41 --github-username jjacobson95 --github-email jeremy.jacobson3402@gmail.com
+python build/build_all.py --all --high_mem --validate --figshare --version 0.1.41 --github-username jjacobson95 --github-email jeremy.jacobson3402@gmail.com
 ```
 
 - Build only the experiment files.
@@ -56,21 +54,22 @@ It requires the following authorization tokens to be set in the local environmen
 `SYNAPSE_AUTH_TOKEN`: Required for beataml and mpnst datasets. Follow the directions above to gain access.
 
 Available arguments:
-- `--dataset`: Required. Name of the dataset to build.
+- `--dataset`: Required. Name of the dataset to build. At a minimum, this will build the docker images.
 - `--use_prev_dataset`: Optional. Prefix of the previous dataset for sample and drug ID continuation. The previous dataset files must be in the "local" directory.
-- `--validate`: Optional. Runs the schema checker on the built files.
+- `--build`: Optional. Build the desired dataset.
+- `--validate`: Optional. Run the schema checker on the built files.
 - `--continue`: Optional. Continues from where the build left off by skipping existing files in the "local" directory.
 Example usage:
 
 Build the broad_sanger dataset:
 ```bash
-python build/build_dataset.py --dataset broad_sanger
+python build/build_dataset.py --build --dataset broad_sanger
 ```
 Build the mpnst dataset continuing from broad_sanger sample and drug IDs:
 ```bash
-python build/build_dataset.py --dataset mpnst --use_prev_dataset broad_sanger
+python build/build_dataset.py --build --dataset mpnst --use_prev_dataset broad_sanger
 ```
-Build the hcmi dataset and run validation:
+Build the hcmi dataset and run schema validation:
 ```bash
 python build/build_dataset.py --dataset hcmi --validate
 ```
````

build/beatAML/GetBeatAML.py

Lines changed: 6 additions & 0 deletions
```diff
@@ -466,8 +466,14 @@ def map_and_combine(df, data_type, entrez_map_file, improve_map_file, map_file=N
                              right_on='other_id',
                              how='left')
     mapped_df.insert(0, 'improve_sample_id', mapped_df.pop('improve_sample_id'))
+
+    # Replace NaNs, round values, and convert to integers for specified columns
+    columns_to_convert = ['improve_sample_id', 'entrez_id']
+    mapped_df[columns_to_convert] = mapped_df[columns_to_convert].fillna(0).round().astype('int32')
+
     mapped_df['source'] = 'synapse'
     mapped_df['study'] = 'BeatAML'
+    mapped_df = mapped_df.drop_duplicates()
 
     final_dataframe = mapped_df.dropna()
     return final_dataframe
```
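The conversion added here addresses a standard pandas pitfall: a left merge leaves NaN in unmatched ID columns, which silently promotes them to float64, so the columns are filled, rounded, and cast back to integers before writing. A minimal standalone sketch of the same pattern (values are illustrative, not BeatAML data):

```python
import numpy as np
import pandas as pd

# A left merge with missing keys leaves NaN in ID columns and
# silently promotes them to float64.
df = pd.DataFrame({'improve_sample_id': [1.0, np.nan, 3.0],
                   'entrez_id': [7157.0, 672.0, np.nan]})

columns_to_convert = ['improve_sample_id', 'entrez_id']
# fillna(0) gives unmatched rows a sentinel value, round() guards
# against float noise, and astype('int32') restores integer dtype.
df[columns_to_convert] = df[columns_to_convert].fillna(0).round().astype('int32')

print(df['entrez_id'].tolist())  # [7157, 672, 0]
```

Note that rows filled with the 0 sentinel survive the later `dropna()`, so downstream consumers must treat 0 as "unmapped".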

build/broad_sanger/03a-nci60Drugs.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -70,7 +70,7 @@ def main():
     smiles= pl.DataFrame({'NSC':smiles['NSC'],'upper':upper})#smiles.with_columns(upper=upper)
     ##reduce to smiles only in current drugs
     # ssmiles = smiles.filter(~pl.col('upper').is_in(curdrugs['isoSMILES']))
-    ssmiles = ssmiles.filter(~pl.col('upper').is_in(curdrugs['canSMILES']))
+    ssmiles = smiles.filter(~pl.col('upper').is_in(curdrugs['canSMILES']))
     pubchems = pubchems.filter(pl.col('NSC').is_in(ssmiles['NSC']))
     arr = set(pubchems['CID'])
 
```

build/broad_sanger/04b-nci60-updated.py

Lines changed: 3 additions & 2 deletions
```diff
@@ -107,10 +107,11 @@ def main():
 
     finaldf = pl.DataFrame(
         {
-            'source':['NCI60' for a in molar['improve_drug_id']], ##2024 build
+            'source':['NCI60_24' for a in molar['improve_drug_id']], ##2024 build
             'improve_sample_id':molar['improve_sample_id'],
             'Drug':molar['improve_drug_id'],
-            'study': molar['EXPID'],#['NCI60' for a in nonulls['improve_drug_id']],
+            # 'study': molar['EXPID'],#['NCI60' for a in nonulls['improve_drug_id']],
+            'study': "NCI60",
             'time':molar['time'],
             'time_unit':molar['time_unit'],
             'DOSE': [(10**a)*1000000 for a in molar['CONCENTRATION']], ##move from molar to uM to match pharmacoDB
```
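The DOSE expression above assumes `CONCENTRATION` stores log10(molar concentration): `10**a` recovers molarity, and the factor of 1,000,000 rescales to micromolar (uM) to match PharmacoDB. A quick check of that arithmetic with hand-picked values:

```python
# log10(M) values corresponding to 1 uM, 10 uM, and 100 uM.
log_molar = [-6.0, -5.0, -4.0]

# Same conversion as the build script: molar -> uM.
dose_um = [(10**a) * 1000000 for a in log_molar]
print(dose_um)  # approximately [1.0, 10.0, 100.0]
```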

build/build_all.py

Lines changed: 16 additions & 31 deletions
```diff
@@ -10,39 +10,37 @@
 import shutil
 import gzip
 from glob import glob
-from packaging import version
 import sys
 
 def main():
     parser=argparse.ArgumentParser(
-        description="This script initializes all docker containers, builds datasets, validates them, and uploads to Figshare and PyPI.",
+        description="This script initializes all docker containers, builds datasets, validates them, and uploads to Figshare.",
         epilog="""Examples of usage:
 
-Build all datasets in a high memory environment, validate them, and upload to Figshare and PyPI:
-python build/build_all.py --all --high_mem --validate --pypi --figshare --version 0.1.29
+Build all datasets in a high memory environment, validate them, and upload to Figshare:
+python build/build_all.py --all --high_mem --validate --figshare --version 0.1.29
 
 Build only experiment files. This assumes preceding steps (docker images, samples, omics, and drugs) have already been completed:
 python build/build_all.py --exp
 
 Validate all local files without building or uploading. These files must be located in ./local. Includes compression/decompression steps.
 python build/build_all.py --validate
 
-Upload the latest data to Figshare and PyPI (ensure tokens are set in the local environment):
-python build/build_all.py --figshare --pypi --version 0.1.30
+Upload the latest data to Figshare (ensure tokens are set in the local environment):
+python build/build_all.py --figshare --version 0.1.30
 """
     )
     parser.add_argument('--docker',dest='docker',default=False,action='store_true', help="Build all docker images.")
     parser.add_argument('--samples',dest='samples',default=False,action='store_true', help="Build all sample files.")
     parser.add_argument('--omics',dest='omics',default=False,action='store_true', help="Build all omics files.")
     parser.add_argument('--drugs',dest='drugs',default=False,action='store_true', help="Build all drug files")
     parser.add_argument('--exp',dest='exp',default=False,action='store_true', help="Build all experiment files.")
-    parser.add_argument('--validate', action='store_true', help="Run schema checker on all local files. Note this will be run, whether specified or not, if figshare or pypi arguments are included.")
+    parser.add_argument('--validate', action='store_true', help="Run schema checker on all local files. Note this will be run, whether specified or not, if the figshare argument is included.")
     parser.add_argument('--figshare', action='store_true', help="Upload all local data to Figshare. FIGSHARE_TOKEN must be set in local environment.")
-    parser.add_argument('--pypi', action='store_true', help="Update PYPI Package with latest Figshare data. PYPI_TOKEN must be set in local environment.")
-    parser.add_argument('--all',dest='all',default=False,action='store_true', help="Run all data build commands. This includes docker, samples, omics, drugs, exp arguments. This does not run the validate, figshare, or pypi commands.")
+    parser.add_argument('--all',dest='all',default=False,action='store_true', help="Run all data build commands. This includes docker, samples, omics, drugs, exp arguments. This does not run the validate or figshare commands.")
     parser.add_argument('--high_mem',dest='high_mem',default=False,action='store_true',help = "If you have 32 or more CPUs, this option is recommended. It will run many code portions in parallel. If you don't have enough memory, this will cause a run failure.")
-    parser.add_argument('--dataset',dest='datasets',default='broad_sanger,hcmi,beataml,mpnst,cptac',help='Datasets to process. Defaults to all available.')
-    parser.add_argument('--version', type=str, required=False, help='Version number for the PyPI package and Figshare upload title (e.g., "0.1.29"). This is required for Figshare and PyPI upload. This must be a higher version than previously published versions.')
+    parser.add_argument('--dataset',dest='datasets',default='broad_sanger,hcmi,beataml,cptac,mpnst,mpnstpdx',help='Datasets to process. Defaults to all available.')
+    parser.add_argument('--version', type=str, required=False, help='Version number for the Figshare upload title (e.g., "0.1.29"). This is required for Figshare upload. This must be a higher version than previously published versions.')
     parser.add_argument('--github-username', type=str, required=False, help='GitHub username for the repository.')
     parser.add_argument('--github-email', type=str, required=False, help='GitHub email for the repository.')
 
@@ -120,6 +118,7 @@ def process_docker(datasets):
         'hcmi': ['hcmi'],
         'beataml': ['beataml'],
         'mpnst': ['mpnst'],
+        'mpnstpdx': ['mpnstpdx'],
         'cptac': ['cptac'],
         'genes': ['genes'],
         'upload': ['upload']
@@ -266,8 +265,6 @@ def run_docker_upload_cmd(cmd_arr, all_files_dir, name, version):
     docker_run = ['docker', 'run', '--rm', '-v', f"{env['PWD']}/local/{all_files_dir}:/tmp", '-e', f"VERSION={version}"]
 
     # Add Appropriate Environment Variables
-    if 'PYPI_TOKEN' in env and name == 'PyPI':
-        docker_run.extend(['-e', f"PYPI_TOKEN={env['PYPI_TOKEN']}", 'upload'])
     if 'FIGSHARE_TOKEN' in env and name == 'Figshare':
         docker_run.extend(['-e', f"FIGSHARE_TOKEN={env['FIGSHARE_TOKEN']}", 'upload'])
     if name == "validate":
@@ -308,16 +305,13 @@ def compress_file(file_path):
     #####
 
     figshare_token = os.getenv('FIGSHARE_TOKEN')
-    pypi_token = os.getenv('PYPI_TOKEN')
     synapse_auth_token = os.getenv('SYNAPSE_AUTH_TOKEN')
     github_token = os.getenv('GITHUB_TOKEN')
 
 
     # Error handling for required tokens
     if args.figshare and not figshare_token:
         raise ValueError("FIGSHARE_TOKEN environment variable is not set.")
-    if args.pypi and not pypi_token:
-        raise ValueError("PYPI_TOKEN environment variable is not set.")
     if ('beataml' in args.datasets or 'mpnst' in args.datasets) and not synapse_auth_token:
         if args.docker or args.samples or args.omics or args.drugs or args.exp or args.all: # Token only required if building data, not upload or validate.
             raise ValueError("SYNAPSE_AUTH_TOKEN is required for accessing MPNST and beatAML datasets.")
@@ -394,7 +388,7 @@ def compress_file(file_path):
     ### Begin Upload and/or validation
     #####
 
-    if args.pypi or args.figshare or args.validate:
+    if args.figshare or args.validate:
         # FigShare File Prefixes:
         prefixes = ['beataml', 'hcmi', 'cptac', 'mpnst', 'genes', 'drugs']
         broad_sanger_datasets = ["ccle","ctrpv2","fimm","gdscv1","gdscv2","gcsi","prism","nci60"]
@@ -405,23 +399,18 @@ def compress_file(file_path):
 
 
         figshare_token = os.getenv('FIGSHARE_TOKEN')
-        pypi_token = os.getenv('PYPI_TOKEN')
 
         all_files_dir = 'local/all_files_dir'
         if not os.path.exists(all_files_dir):
             os.makedirs(all_files_dir)
-
-        # Ensure pypi tokens are available
-        if args.pypi and not pypi_token:
-            raise ValueError("Required tokens (PYPI) are not set in environment variables.")
 
         # Ensure figshare tokens are available
         if args.figshare and not figshare_token:
             raise ValueError("Required tokens (FIGSHARE) are not set in environment variables.")
 
         # Ensure version is specified
-        if (args.figshare or args.pypi) and not args.version:
-            raise ValueError("Version must be specified when pushing to pypi or figshare")
+        if args.figshare and not args.version:
+            raise ValueError("Version must be specified when pushing to figshare")
 
         # Move relevant files to a designated directory
         for file in glob(os.path.join("local", '*.*')):
@@ -433,7 +422,7 @@ def compress_file(file_path):
             decompress_file(file)
 
         # Run schema checker - This will always run if uploading data.
-        schema_check_command = ['python3', 'check_schema.py', '--datasets'] + datasets
+        schema_check_command = ['python3', 'scripts/check_schema.py', '--datasets'] + datasets
         run_docker_upload_cmd(schema_check_command, 'all_files_dir', 'validate', args.version)
 
         print("Validation complete. Proceeding with file compression/decompression adjustments")
@@ -453,13 +442,9 @@ def compress_file(file_path):
         figshare_command = ['python3', 'scripts/push_to_figshare.py', '--directory', "/tmp", '--title', f"CODERData{args.version}", '--token', os.getenv('FIGSHARE_TOKEN'), '--project_id', '189342', '--publish']
         run_docker_upload_cmd(figshare_command, 'all_files_dir', 'Figshare', args.version)
 
-        # Upload to PyPI using Docker
-        if args.pypi and args.version and pypi_token:
-            pypi_command = ['python3', 'scripts/push_to_pypi.py', '-y', '/tmp/figshare_latest.yml', '-d', 'coderdata/download/downloader.py', "-v", args.version]
-            run_docker_upload_cmd(pypi_command, 'all_files_dir', 'PyPI', args.version)
 
         # Push changes to GitHub using Docker
-        if args.version and args.figshare and args.pypi and pypi_token and figshare_token and github_token and args.github_username and args.github_email:
+        if args.version and args.figshare and figshare_token and github_token and args.github_username and args.github_email:
             git_command = [
                 'bash', '-c', (
                     f'git config --global user.name "{args.github_username}" '
@@ -476,4 +461,4 @@ def compress_file(file_path):
 
 
 if __name__ == '__main__':
-    main()
+    main()
```
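With the PyPI branches removed, the upload guards collapse to a single dependency rule: pushing to Figshare requires both a token and an explicit `--version`. A sketch of how that guard behaves in isolation (flag names taken from the script; the parsed input is illustrative):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--figshare', action='store_true')
parser.add_argument('--version', type=str, required=False)

# Simulate: python build/build_all.py --figshare --version 0.1.30
args = parser.parse_args(['--figshare', '--version', '0.1.30'])

# Mirrors the guard in build_all.py: --figshare without --version fails fast.
if args.figshare and not args.version:
    raise ValueError("Version must be specified when pushing to figshare")
print(args.version)  # 0.1.30
```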

build/build_dataset.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -220,7 +220,7 @@ def run_docker_validate_cmd(cmd_arr, all_files_dir, name):
     Wrapper for 'docker run' command used during validation and uploads.
     '''
     env = os.environ.copy()
-    docker_run = ['docker', 'run', '-v', f"{env['PWD']}/local/{all_files_dir}:/tmp"]
+    docker_run = ['docker', 'run', '-v', f"{env['PWD']}/local/{all_files_dir}:/tmp", '--platform=linux/amd64']
     docker_run.extend(['upload'])
     docker_run.extend(cmd_arr)
     print('Executing:', ' '.join(docker_run))
```
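The added `--platform=linux/amd64` pins the container to an amd64 image, which lets the same `upload` image run (under emulation) on arm64 hosts such as Apple Silicon. A sketch of the resulting command construction, with illustrative stand-ins for `os.environ` and the function arguments:

```python
# Illustrative values; in the script these come from os.environ and
# the run_docker_validate_cmd arguments.
env = {'PWD': '/home/user/coderdata'}
all_files_dir = 'all_files_dir'

docker_run = ['docker', 'run', '-v',
              f"{env['PWD']}/local/{all_files_dir}:/tmp",
              '--platform=linux/amd64']
docker_run.extend(['upload'])
print(' '.join(docker_run))
```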

build/cptac/getCptacData.py

Lines changed: 4 additions & 0 deletions
```diff
@@ -380,11 +380,15 @@ def main():
             dat_files[dtype_key] = fdf2
         else:
             dat_files[dtype_key] = fdf.dropna()
+
         print(dtype_key)
 
     # Now concatenate all the cancers into a single file
     for dtype_key, df in dat_files.items():
         print('Saving ' + "cptac_" + dtype_key + '.csv.gz' + ' file')
+        print(df.to_string())
+        df['entrez_id'] = df['entrez_id'].fillna(0)
+        df['entrez_id'] = df['entrez_id'].astype(int)
         df.to_csv("/tmp/" + "cptac_" + dtype_key + '.csv.gz', sep=',', index=False, compression='gzip')
 
 if __name__ == '__main__':
```

build/docker/Dockerfile.upload

Lines changed: 3 additions & 3 deletions
```diff
@@ -2,10 +2,10 @@ FROM python:3.9
 
 WORKDIR /usr/src/app
 
-RUN python -m pip install --upgrade pip setuptools wheel twine packaging pyyaml requests linkml
+RUN python -m pip install --upgrade pip pyyaml requests linkml
 
 RUN apt-get update && apt-get install -y git
 
 
-COPY ./schema /usr/src/app/schema
-ADD scripts/check_schema.py ./
+RUN git clone https://github.com/PNNL-CompBio/coderdata.git
+WORKDIR /usr/src/app/coderdata
```
