Commit 0d74576

Merge branch 'nci-fix' of github.com:PNNL-CompBio/coderdata into nci-fix
2 parents: 55e4b39 + a9232a4

24 files changed: 1156 additions & 583 deletions

build/README.md

Lines changed: 57 additions & 13 deletions
@@ -12,30 +12,75 @@ are added.
 
 This script initializes all docker containers, builds all datasets, validates them, and uploads them to figshare and pypi.
 
-It requires the following authorization tokens to be set in the local environment depending on the use case:
-`SYNAPSE_AUTH_TOKEN`: Required for beataml and mpnst datasets. Join the [CoderData team](https://www.synapse.org/#!Team:3503472) on Synapse and generate an access token.
-`PYPI_TOKEN`: This token is required to upload to PyPI.
-`FIGSHARE_TOKEN`: This token is required to upload to Figshare.
+It requires the following authorization tokens to be set in the local environment depending on the use case:
+`SYNAPSE_AUTH_TOKEN`: Required for beataml and mpnst datasets. Join the [CoderData team](https://www.synapse.org/#!Team:3503472) on Synapse and generate an access token.
+`PYPI_TOKEN`: This token is required to upload to PyPI.
+`FIGSHARE_TOKEN`: This token is required to upload to Figshare.
+`GITHUB_TOKEN`: This token is required to upload to GitHub.
 
-Available arguments:
+**Available arguments**:
 
 - `--docker`: Initializes and builds all docker containers.
 - `--samples`: Processes and builds the sample data files.
 - `--omics`: Processes and builds the omics data files.
 - `--drugs`: Processes and builds the drug data files.
 - `--exp`: Processes and builds the experiment data files.
-- `--all`: Executes all available processes above (docker, samples, omics, drugs, exp).
-- `--validate`: Validates the generated datasets using the schema check scripts.
-- `--figshare`: Uploads the datasets to Figshare.
-- `--pypi`: Uploads the package to PyPI.
-- `--high_mem`: Utilizes high memory mode for concurrent data processing.
+- `--all`: Executes all available processes above (docker, samples, omics, drugs, exp). This does not run the validate, figshare, or pypi commands.
+- `--validate`: Validates the generated datasets using the schema check scripts. This is automatically included if data upload occurs.
+- `--figshare`: Uploads the datasets to Figshare. `FIGSHARE_TOKEN` must be set in the local environment.
+- `--pypi`: Uploads the package to PyPI. `PYPI_TOKEN` must be set in the local environment.
+- `--high_mem`: Utilizes high memory mode for concurrent data processing. This has been successfully tested using 32 or more vCPUs.
 - `--dataset`: Specifies the datasets to process (default='broad_sanger,hcmi,beataml,mpnst,cptac').
-- `--version`: Specifies the version number for the package and data upload title. This is required to upload to figshare and PyPI
+- `--version`: Specifies the version number for the PyPI package and Figshare upload title (e.g., "0.1.29"). This is required for the Figshare and PyPI upload steps and must be higher than any previously published version.
+- `--github-username`: GitHub username matching the `GITHUB_TOKEN`. Required to push the new tag to the GitHub repository.
+- `--github-email`: GitHub email matching the `GITHUB_TOKEN`. Required to push the new tag to the GitHub repository.
+
+**Example usage**:
+- Build all datasets and upload to Figshare, PyPI, and GitHub.
+Required tokens for the following command: `SYNAPSE_AUTH_TOKEN`, `PYPI_TOKEN`, `FIGSHARE_TOKEN`, `GITHUB_TOKEN`.
+```bash
+python build/build_all.py --all --high_mem --validate --pypi --figshare --version 0.1.41 --github-username jjacobson95 --github-email jeremy.jacobson3402@gmail.com
+```
+
+- Build only the experiment files.
+**Note**: Preceding steps will not automatically be run. This assumes that docker images, samples, omics, and drugs were all previously built. Ensure all required tokens are set.
+```bash
+python build/build_all.py --exp
+```
 
+## build_dataset.py script
+This script builds a single dataset for **debugging purposes only**. It can help determine if a dataset will build correctly in isolation. Note that the sample and drug identifiers generated may not align with those from other datasets, so this script is not suitable for building production datasets.
+
+It requires the following authorization tokens to be set in the local environment depending on the dataset:
+
+`SYNAPSE_AUTH_TOKEN`: Required for beataml and mpnst datasets. Follow the directions above to gain access.
+
+Available arguments:
+- `--dataset`: Required. Name of the dataset to build.
+- `--use_prev_dataset`: Optional. Prefix of the previous dataset for sample and drug ID continuation. The previous dataset files must be in the "local" directory.
+- `--validate`: Optional. Runs the schema checker on the built files.
+- `--continue`: Optional. Continues from where the build left off by skipping existing files in the "local" directory.
 Example usage:
+
+Build the broad_sanger dataset:
 ```bash
-python build/build_all.py --all --high_mem --validate --pypi --figshare --version 0.1.29
+python build/build_dataset.py --dataset broad_sanger
 ```
+Build the mpnst dataset continuing from broad_sanger sample and drug IDs:
+```bash
+python build/build_dataset.py --dataset mpnst --use_prev_dataset broad_sanger
+```
+Build the hcmi dataset and run validation:
+```bash
+python build/build_dataset.py --dataset hcmi --validate
+```
+Build the broad_sanger dataset but skip previously built files in the "local" directory:
+```bash
+python build/build_dataset.py --dataset broad_sanger --continue
+```
+
+
 
 ## Data Source Reference List
 
@@ -66,4 +111,3 @@ python build/build_all.py --all --high_mem --validate --pypi --figshare --version 0.1.29
 | BeatAML | NCI Proteomic Data Commons | Mapping the proteogenomic landscape enables prediction of drug response in acute myeloid leukemia | James Pino et al. | 23
 | MPNST | NF Data Portal | Chromosome 8 gain is associated with high-grade transformation in MPNST | David P Nusinow et al. | 24
 
-
build/beatAML/GetBeatAML.py

Lines changed: 31 additions & 21 deletions
@@ -134,8 +134,11 @@ def generate_samples_file(prev_samples_path):
     prot_samples["other_id_source"] = "beatAML"
 
     all_samples = pd.concat([prot_samples, full_samples])
-    all_samples['species'] = 'Homo sapiens'
-    maxval = max(pd.read_csv(prev_samples_path).improve_sample_id)
+    all_samples['species'] = 'Homo sapiens (Human)'
+    if prev_samples_path == "":
+        maxval = 0
+    else:
+        maxval = max(pd.read_csv(prev_samples_path).improve_sample_id)
     mapping = {labId: i for i, labId in enumerate(all_samples['other_id'].unique(), start=(int(maxval)+1))}
     all_samples['improve_sample_id'] = all_samples['other_id'].map(mapping)
     all_samples.insert(1, 'improve_sample_id', all_samples.pop('improve_sample_id'))
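The new branch above is what lets a debug build run without a previous samples file: numbering simply starts at 1. A standalone sketch of that continuation rule (the helper name is hypothetical; pandas as in the script):

```python
import pandas as pd

def assign_improve_sample_ids(other_ids, prev_samples_path=""):
    """Sketch of the ID rule in generate_samples_file: continue numbering
    after the previous dataset's max improve_sample_id, or start at 1
    when no previous samples file is given."""
    if prev_samples_path == "":
        maxval = 0  # fresh/debug build: IDs start at 1
    else:
        maxval = max(pd.read_csv(prev_samples_path).improve_sample_id)
    # Each unique other_id gets the next free integer ID.
    uniques = pd.Series(other_ids).unique()
    return {oid: i for i, oid in enumerate(uniques, start=int(maxval) + 1)}

# Without a previous file, "A", "B", "C" map to 1, 2, 3.
mapping = assign_improve_sample_ids(["A", "B", "A", "C"])
```

This is also why the README now warns that identifiers from an isolated build may not align with other datasets.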
@@ -282,8 +285,14 @@ def format_drug_map(drug_map_path):
     pd.DataFrame
         Formatted and cleaned drug mapping dataframe.
     """
-    drug_map = pd.read_csv(drug_map_path, sep = "\t")
-    drug_map = drug_map.drop_duplicates(subset='isoSMILES', keep='first')
+    if drug_map_path:
+        drug_map = pd.read_csv(drug_map_path, sep = "\t")
+        drug_map = drug_map.drop_duplicates(subset='isoSMILES', keep='first')
+    else:
+        drug_map = pd.DataFrame(columns=[
+            'improve_drug_id', 'chem_name', 'pubchem_id', 'canSMILES',
+            'isoSMILES', 'InChIKey', 'formula', 'weight'
+        ])
     return drug_map
 
 #Drug Response
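With no mapping file, the function now returns an empty frame that still carries the expected column set, so downstream concat/merge code does not hit KeyErrors. A quick check of that behavior (a sketch, assuming the column list in the diff is the full drugs.tsv schema):

```python
import pandas as pd

# Column list copied from the diff; assumed to be the full drugs.tsv schema.
DRUG_COLUMNS = ['improve_drug_id', 'chem_name', 'pubchem_id', 'canSMILES',
                'isoSMILES', 'InChIKey', 'formula', 'weight']

def format_drug_map(drug_map_path):
    if drug_map_path:
        # Normal path: load the TSV and keep one row per isoSMILES.
        drug_map = pd.read_csv(drug_map_path, sep="\t")
        drug_map = drug_map.drop_duplicates(subset='isoSMILES', keep='first')
    else:
        # Empty but schema-complete, so column lookups still succeed.
        drug_map = pd.DataFrame(columns=DRUG_COLUMNS)
    return drug_map

empty = format_drug_map("")  # empty frame with all eight columns
```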
@@ -326,7 +335,12 @@ def add_improve_id(previous_df, new_df):
     pd.DataFrame
         New dataframe with 'improve_drug_id' added.
     """
-    max_id = max([int(val.replace('SMI_', '')) for val in previous_df['improve_drug_id'].tolist() if pd.notnull(val) and val.startswith('SMI_')])
+    if not previous_df.empty and 'improve_drug_id' in previous_df.columns:
+        id_list = [int(val.replace('SMI_', '')) for val in previous_df['improve_drug_id'].tolist() if pd.notnull(val) and val.startswith('SMI_')]
+        max_id = max(id_list) if id_list else 0  # Default to 0 if the list is empty
+    else:
+        max_id = 0  # Default value if the DataFrame is empty or doesn't have the column
+    # max_id = max([int(val.replace('SMI_', '')) for val in previous_df['improve_drug_id'].tolist() if pd.notnull(val) and val.startswith('SMI_')])
     # Identify isoSMILES in the new dataframe that don't exist in the old dataframe
     unique_new_smiles = set(new_df['isoSMILES']) - set(previous_df['isoSMILES'])
    # Identify rows in the new dataframe with isoSMILES that are unique and where improve_drug_id is NaN
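The guarded version above computes the next `SMI_` index even for a fresh build, where the previous drug frame is empty. A standalone sketch of that numbering rule (the helper name is hypothetical):

```python
import pandas as pd

def next_smi_base(previous_df):
    """Highest numeric suffix among existing SMI_ drug IDs, or 0 when the
    frame is empty or lacks the column (mirrors the guarded diff logic)."""
    if not previous_df.empty and 'improve_drug_id' in previous_df.columns:
        ids = [int(v.replace('SMI_', ''))
               for v in previous_df['improve_drug_id'].tolist()
               if pd.notnull(v) and v.startswith('SMI_')]
        return max(ids) if ids else 0
    return 0

base_empty = next_smi_base(pd.DataFrame())  # 0: nothing to continue from
base_prev = next_smi_base(
    pd.DataFrame({'improve_drug_id': ['SMI_7', None, 'SMI_12']}))  # 12
```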
@@ -552,10 +566,10 @@ def generate_drug_list(drug_map_path,drug_path):
     ##the next three arguments determine what we'll do
 
     parser.add_argument('-s', '--samples', action = 'store_true', help='Only generate samples, requires previous samples',default=False)
-    parser.add_argument('-p', '--prevSamples', type=str, help='Use this to provide previous sample file, will run sample file generation',default='')
+    parser.add_argument('-p', '--prevSamples', nargs='?', type=str, default='', const='', help='Use this to provide previous sample file, will run sample file generation')
 
     parser.add_argument('-d', '--drugs',action='store_true', default=False,help='Query drugs only, requires drug file')
-    parser.add_argument('-r', '--drugFile',type=str,help='Path to existing drugs.tsv file to query')
+    parser.add_argument('-r', '--drugFile', nargs='?', type=str, default='', const='', help='Path to existing drugs.tsv file to query')
 
     parser.add_argument('-o', '--omics',action='store_true',default=False,help='Set this flag to query omics, requires current samples')
     parser.add_argument('-c', '--curSamples', type=str, help='Add path if you want to generate data')
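Switching `--prevSamples` and `--drugFile` to `nargs='?'` with both `default=''` and `const=''` means each flag can be omitted, passed bare, or passed with a path, and the downstream code only has to test for the empty string. A minimal demonstration of that argparse pattern:

```python
import argparse

parser = argparse.ArgumentParser()
# default='' applies when the flag is absent; const='' applies when the
# flag is given with no value (that is what nargs='?' enables).
parser.add_argument('-p', '--prevSamples', nargs='?', type=str,
                    default='', const='',
                    help='optional path to a previous samples file')

args_none = parser.parse_args([])                   # value is '' (default)
args_bare = parser.parse_args(['-p'])               # value is '' (const)
args_path = parser.parse_args(['-p', 'prev.csv'])   # value is 'prev.csv'
```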
@@ -604,27 +618,23 @@ def generate_drug_list(drug_map_path,drug_path):
     supplimentary_file = '1-s2.0-S1535610822003129-mmc2.xlsx'
     download_from_github(supplementary_url, supplimentary_file)
 
-    #prev_samples_path = "hcmi_samples.csv"
-    #improve_map_file = "/tmp/beataml_samples.csv"
 
     if args.samples:
         if args.prevSamples is None or args.prevSamples=='':
-            print("Cannot run sample file generation without previous samples")
-            exit()
+            print("No Previous Samples file was found. Data will not align with other datasets. Use ONLY for testing purposes.")
         else:
-            print("Only running Samples File Generation")
-            prev_samples_path = args.prevSamples
-        #Generate Samples File
-        generate_samples_file(prev_samples_path)
+            print("Previous Samples File Provided. Running BeatAML Sample File Generation")
+        #Generate Samples File
+        generate_samples_file(args.prevSamples)
     if args.drugs:
         if args.drugFile is None or args.drugFile=='':
-            print("Cannot run drug matching without prior drug file")
-            exit()
+            print("Prior Drug File not provided. Data will not align with other datasets. Use ONLY for testing purposes.")
         else:
-            original_drug_file = "beataml_wv1to4_raw_inhibitor_v4_dbgap.txt"
-            original_drug_url = "https://github.com/biodev/beataml2.0_data/raw/main/beataml_wv1to4_raw_inhibitor_v4_dbgap.txt"
-            download_from_github(original_drug_url, original_drug_file)
-            generate_drug_list(args.drugFile, original_drug_file) ##this doesn't exist, need to add
+            print("Drug File Provided. Proceeding with build.")
+        original_drug_file = "beataml_wv1to4_raw_inhibitor_v4_dbgap.txt"
+        original_drug_url = "https://github.com/biodev/beataml2.0_data/raw/main/beataml_wv1to4_raw_inhibitor_v4_dbgap.txt"
+        download_from_github(original_drug_url, original_drug_file)
+        generate_drug_list(args.drugFile, original_drug_file) ##this doesn't exist, need to add
     if args.omics:
         if args.genes is None or args.curSamples is None:
             print('Cannot process omics without sample mapping and gene mapping files')

build/broad_sanger/04b-nci60-updated.py

Lines changed: 1 addition & 0 deletions
@@ -86,6 +86,7 @@ def main():
     # }
     # )
     # newnames = newnames.unique()
+
 
     # fixed = nulls[['AVERAGE_PTC','CONCENTRATION_UNIT','CONCENTRATION','CELL_NAME','EXPID','NSC','time','time_unit']].join(newnames,on='CELL_NAME',how='left')
     # merged.columns = ['AVERAGE_PTC','CONCENTRATION_UNIT','CONCENTRATION','old_CELL_NAME','EXPID','NSC','time','time_unit','CELL_NAME']
