
Commit 19548a9

Updated Readme for build_dataset.py. Removed unused code
1 parent ce8e651 commit 19548a9

2 files changed

Lines changed: 38 additions & 5 deletions

build/README.md

Lines changed: 34 additions & 0 deletions
@@ -48,6 +48,40 @@ python build/build_all.py --all --high_mem --validate --pypi --figshare --versio
 python build/build_all.py --exp
 ```
 
+## build_dataset.py script
+This script builds a single dataset for **debugging purposes only**. It can help determine if a dataset will build correctly in isolation. Note that the sample and drug identifiers generated may not align with those from other datasets, so this script is not suitable for building production datasets.
+
+It requires the following authorization tokens to be set in the local environment depending on the dataset:
+
+`SYNAPSE_AUTH_TOKEN`: Required for beataml and mpnst datasets. Follow the directions above to gain access.
+
+Available arguments:
+- `--dataset`: Required. Name of the dataset to build.
+- `--use_prev_dataset`: Optional. Prefix of the previous dataset for sample and drug ID continuation. The previous dataset files must be in the "local" directory.
+- `--validate`: Optional. Runs the schema checker on the built files.
+- `--continue`: Optional. Continues from where the build left off by skipping existing files in "local" directory.
+Example usage:
+
+Build the broad_sanger dataset:
+```bash
+python build/build_dataset.py --dataset broad_sanger
+```
+Build the mpnst dataset continuing from broad_sanger sample and drug IDs:
+```bash
+python build/build_dataset.py --dataset mpnst --use_prev_dataset broad_sanger
+```
+Build the hcmi dataset and run validation:
+```bash
+python build/build_dataset.py --dataset hcmi --validate
+```
+Build the broad_sanger dataset but skip previously built files in "local" directory:
+```bash
+python build/build_dataset.py --dataset broad_sanger --continue
+```
+
+
+
+
 ## Data Source Reference List
 
 | Dataset | Data Source | Resource | Authors | AACR Reference Number |
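
As a reading aid for the arguments documented in the README hunk above, here is a minimal sketch of how those four flags could be parsed with `argparse`. The `add_argument` calls mirror the ones visible in `build/build_dataset.py` in this commit. The `build_parser` helper and the description string are assumptions made for illustration; the real script sets up its parser inside `main()`.

```python
import argparse


def build_parser():
    # Hypothetical helper for illustration; not part of the repository.
    parser = argparse.ArgumentParser(description="Build a single dataset (sketch)")
    parser.add_argument('--dataset', required=True,
                        help='Name of the dataset to build')
    parser.add_argument('--use_prev_dataset',
                        help='Prefix of the previous dataset for sample and drug ID assignment')
    parser.add_argument('--validate', action='store_true',
                        help='Run schema checker on the built files')
    # `continue` is a Python keyword, so `args.continue` would be a syntax
    # error; `dest` stores the flag under a different attribute name.
    parser.add_argument('--continue', dest='should_continue', action='store_true',
                        help='Continue from where the build left off by skipping existing files')
    return parser


# Equivalent to:
#   python build/build_dataset.py --dataset mpnst --use_prev_dataset broad_sanger --continue
args = build_parser().parse_args(
    ['--dataset', 'mpnst', '--use_prev_dataset', 'broad_sanger', '--continue']
)
print(args.dataset, args.use_prev_dataset, args.validate, args.should_continue)
# prints: mpnst broad_sanger False True
```

Flags declared with `action='store_true'` default to `False` when omitted, which is why `args.validate` is `False` in this run.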

build/build_dataset.py

Lines changed: 4 additions & 5 deletions
@@ -111,7 +111,7 @@ def process_drugs(executor, dataset, use_prev_dataset, should_continue):
     executor.submit(run_docker_cmd, [di, 'sh', 'build_drugs.sh', ','.join(dflist)], filename)
 
 
-def process_omics(executor, dataset, high_mem, should_continue):
+def process_omics(executor, dataset, should_continue):
     '''
     Build the omics files for the specified dataset.
     '''
@@ -158,7 +158,7 @@ def process_omics(executor, dataset, high_mem, should_continue):
     executor.submit(run_docker_cmd, [di, 'sh', 'build_omics.sh', '/tmp/genes.csv', f'/tmp/{dataset}_samples.csv'], filename)
 
 
-def process_experiments(executor, dataset, high_mem, should_continue):
+def process_experiments(executor, dataset, should_continue):
     '''
     Build the experiments files for the specified dataset.
     '''
@@ -236,7 +236,6 @@ def main():
     )
     parser.add_argument('--dataset', required=True, help='Name of the dataset to build')
     parser.add_argument('--use_prev_dataset', help='Prefix of the previous dataset for sample and drug ID assignment')
-    parser.add_argument('--high-mem', action='store_true', help='Use high memory mode for parallel processing')
     parser.add_argument('--validate', action='store_true', help='Run schema checker on the built files')
     parser.add_argument('--continue', dest='should_continue', action='store_true', help='Continue from where the build left off by skipping existing files')
 
@@ -265,8 +264,8 @@ def main():
     with ThreadPoolExecutor() as executor:
 
         # Build omics and experiments
-        omics_future = executor.submit(process_omics, executor, args.dataset, args.high_mem, args.should_continue)
-        experiments_future = executor.submit(process_experiments, executor, args.dataset, args.high_mem, args.should_continue)
+        omics_future = executor.submit(process_omics, executor, args.dataset, args.should_continue)
+        experiments_future = executor.submit(process_experiments, executor, args.dataset, args.should_continue)
 
         omics_future.result()
         experiments_future.result()
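
The last hunk passes arguments to `executor.submit` positionally, so removing `high_mem` from the two function signatures only works if it is also removed from both call sites, as this commit does. The sketch below shows the same submit-then-`result()` pattern in isolation. The function bodies and the `build_file` stand-in (used in place of `run_docker_cmd`) are assumptions for illustration and do not reproduce the repository's logic.

```python
from concurrent.futures import ThreadPoolExecutor


def build_file(name):
    # Hypothetical stand-in for run_docker_cmd; it only returns a label.
    return f"built {name}"


def process_omics(executor, dataset, should_continue):
    # Hypothetical body: queue one job on the shared pool and return its future.
    return [executor.submit(build_file, f"{dataset}_omics")]


def process_experiments(executor, dataset, should_continue):
    # Hypothetical body, same shape as process_omics.
    return [executor.submit(build_file, f"{dataset}_experiments")]


def build(dataset, should_continue=False):
    with ThreadPoolExecutor() as executor:
        # Arguments after the callable are forwarded positionally, so they
        # must line up with the three-parameter signatures above.
        omics_future = executor.submit(process_omics, executor, dataset, should_continue)
        experiments_future = executor.submit(process_experiments, executor, dataset, should_continue)

        # result() blocks until the stage finishes and re-raises any
        # exception raised inside it (including a TypeError from a
        # mismatched argument list).
        inner_futures = omics_future.result() + experiments_future.result()
        return [future.result() for future in inner_futures]


results = build("hcmi")
print(results)
# prints: ['built hcmi_omics', 'built hcmi_experiments']
```

In this sketch the stage functions return their futures without waiting on them, so a worker thread never blocks on work queued to its own pool.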
