Commit 32a17e9

Merge branch 'main' into dataset_statistics
2 parents c87c178 + e00bf3c commit 32a17e9

123 files changed

Lines changed: 12205 additions & 1921 deletions

.dockerignore

Lines changed: 1 addition & 2 deletions
@@ -4,5 +4,4 @@ coderdata/
 dataSummary/
 docs/
 candle_bmd/
-schema/
-build/local/
+build/local/

.github/workflows/build.yml

Lines changed: 24 additions & 0 deletions
@@ -177,6 +177,30 @@ jobs:
           push: true
           platforms: linux/amd64

+  build-pancpdo:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v3
+      - name: Set up QEMU
+        uses: docker/setup-qemu-action@v3
+      - name: Set up Docker Buildx
+        uses: docker/setup-buildx-action@v3
+      - name: Login to DockerHub
+        uses: docker/login-action@v3
+        with:
+          username: ${{ secrets.DOCKERHUB_USERNAME }}
+          password: ${{ secrets.DOCKERHUB_PASSWORD }}
+      - name: Build and push pancpdo
+        uses: docker/build-push-action@v3
+        with:
+          file: ./build/docker/Dockerfile.pancpdo
+          tags: |
+            sgosline/pancpdo:latest
+            sgosline/pancpdo:${{ github.ref_name }}
+          push: true
+          platforms: linux/amd64
+
   build-upload:
     runs-on: ubuntu-latest
     steps:

.github/workflows/main.yml

Lines changed: 2 additions & 1 deletion
@@ -4,6 +4,7 @@ on:
   push:
     tags:
       - '*' # Triggers the workflow only on version tags
+  workflow_dispatch: # Allows manual triggering of the workflow

 # Sets permissions of the GITHUB_TOKEN to allow deployment to GitHub Pages
 permissions:
@@ -44,4 +45,4 @@ jobs:
     steps:
       - name: Deploy to GitHub Pages
         id: deployment
-        uses: actions/deploy-pages@v4
+        uses: actions/deploy-pages@v4

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -17,3 +17,6 @@ __pycache__
 tests/__pycache__
 dist
 build/lib
+build/local
+coderdata/_version.py
+local/

build/README.md

Lines changed: 57 additions & 14 deletions
@@ -10,32 +10,76 @@ are added.

 ## build_all.py script

-This script initializes all docker containers, builds all datasets, validates them, and uploads them to figshare and pypi.
+This script initializes all docker containers, builds all datasets, validates them, and uploads them to figshare.

-It requires the following authorization tokens to be set in the local environment depending on the use case:
-`SYNAPSE_AUTH_TOKEN`: Required for beataml and mpnst datasets. Join the [CoderData team](https://www.synapse.org/#!Team:3503472) on Synapse and generate an access token.
-`PYPI_TOKEN`: This token is required to upload to PyPI.
-`FIGSHARE_TOKEN`: This token is required to upload to Figshare.
+It requires the following authorization tokens to be set in the local environment depending on the use case:
+`SYNAPSE_AUTH_TOKEN`: Required for beataml and mpnst datasets. Join the [CoderData team](https://www.synapse.org/#!Team:3503472) on Synapse and generate an access token.
+`FIGSHARE_TOKEN`: This token is required to upload to Figshare.
+`GITHUB_TOKEN`: This token is required to upload to GitHub.

-Available arguments:
+**Available arguments**:

 - `--docker`: Initializes and builds all docker containers.
 - `--samples`: Processes and builds the sample data files.
 - `--omics`: Processes and builds the omics data files.
 - `--drugs`: Processes and builds the drug data files.
 - `--exp`: Processes and builds the experiment data files.
-- `--all`: Executes all available processes above (docker, samples, omics, drugs, exp).
-- `--validate`: Validates the generated datasets using the schema check scripts.
-- `--figshare`: Uploads the datasets to Figshare.
-- `--pypi`: Uploads the package to PyPI.
-- `--high_mem`: Utilizes high memory mode for concurrent data processing.
+- `--all`: Executes all available processes above (docker, samples, omics, drugs, exp). This does not run the validate or figshare commands.
+- `--validate`: Validates the generated datasets using the schema check scripts. This is automatically included if data upload occurs.
+- `--figshare`: Uploads the datasets to Figshare. FIGSHARE_TOKEN must be set in local environment.
+- `--high_mem`: Utilizes high memory mode for concurrent data processing. This has been successfully tested using 32 or more vCPUs.
 - `--dataset`: Specifies the datasets to process (default='broad_sanger,hcmi,beataml,mpnst,cptac').
-- `--version`: Specifies the version number for the package and data upload title. This is required to upload to figshare and PyPI
+- `--version`: Specifies the version number for the Figshare upload title (e.g., "0.1.29"). This must be a higher version than previously published versions.
+- `--github-username`: GitHub username matching the GITHUB_TOKEN. Required to push the new Tag to the GitHub Repository.
+- `--github-email`: GitHub email matching the GITHUB_TOKEN. Required to push the new Tag to the GitHub Repository.
+
+**Example usage**:
+- Build all datasets and upload to Figshare and GitHub.
+Required tokens for the following command: `SYNAPSE_AUTH_TOKEN`, `FIGSHARE_TOKEN`, `GITHUB_TOKEN`.
+```bash
+python build/build_all.py --all --high_mem --validate --figshare --version 0.1.41 --github-username jjacobson95 --github-email jeremy.jacobson3402@gmail.com
+```
+
+- Build only the experiment files.
+**Note**: Preceding steps will not automatically be run. This assumes that docker images, samples, omics, and drugs were all previously built. Ensure all required tokens are set.
+```bash
+python build/build_all.py --exp
+```

+## build_dataset.py script
+This script builds a single dataset for **debugging purposes only**. It can help determine if a dataset will build correctly in isolation. Note that the sample and drug identifiers generated may not align with those from other datasets, so this script is not suitable for building production datasets.
+
+It requires the following authorization tokens to be set in the local environment depending on the dataset:
+
+`SYNAPSE_AUTH_TOKEN`: Required for beataml and mpnst datasets. Follow the directions above to gain access.
+
+Available arguments:
+- `--dataset`: Required. Name of the dataset to build. At a minimum, this will build the docker images.
+- `--use_prev_dataset`: Optional. Prefix of the previous dataset for sample and drug ID continuation. The previous dataset files must be in the "local" directory.
+- `--build`: Optional. Build the desired Dataset.
+- `--validate`: Optional. Run the schema checker on the built files.
+- `--continue`: Optional. Continues from where the build left off by skipping existing files in "local" directory.
 Example usage:
+
+Build the broad_sanger dataset:
 ```bash
-python build/build_all.py --all --high_mem --validate --pypi --figshare --version 0.1.29
+python build/build_dataset.py --build --dataset broad_sanger
 ```
+Build the mpnst dataset continuing from broad_sanger sample and drug IDs:
+```bash
+python build/build_dataset.py --build --dataset mpnst --use_prev_dataset broad_sanger
+```
+Run schema validation on the hcmi dataset:
+```bash
+python build/build_dataset.py --dataset hcmi --validate
+```
+Build the broad_sanger dataset but skip previously built files in "local" directory:
+```bash
+python build/build_dataset.py --dataset broad_sanger --continue
+```
+
+
+

 ## Data Source Reference List

@@ -66,4 +110,3 @@ python build/build_all.py --all --high_mem --validate --pypi --figshare --versio
 | BeatAML | NCI Proteomic Data Commons | Mapping the proteogenomic landscape enables prediction of drug response in acute myeloid leukemia | James Pino et al. | 23
 | MPNST | NF Data Portal | Chromosome 8 gain is associated with high-grade transformation in MPNST | David P Nusinow et al. | 24

-
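Note on the build/README.md changes: the updated text ties each token to a use case (`SYNAPSE_AUTH_TOKEN` for beataml and mpnst, `FIGSHARE_TOKEN` for Figshare upload, `GITHUB_TOKEN` for the GitHub tag push) and requires `--version` to be higher than previously published versions. A minimal pre-flight sketch of those two rules follows. The helper names `missing_tokens` and `is_newer`, and the `figshare`/`github` keyword arguments, are hypothetical and not part of `build_all.py`. The sketch also assumes versions are plain dotted integers such as "0.1.41".

```python
import os

# Datasets that the README says require SYNAPSE_AUTH_TOKEN.
SYNAPSE_DATASETS = {"beataml", "mpnst"}


def missing_tokens(datasets, figshare=False, github=False, env=None):
    """Return the names of required tokens that are unset or empty.

    Hypothetical helper: `figshare` and `github` stand in for whether the
    run will upload to Figshare or push a tag to GitHub.
    """
    env = os.environ if env is None else env
    needed = []
    if SYNAPSE_DATASETS & set(datasets):
        needed.append("SYNAPSE_AUTH_TOKEN")
    if figshare:
        needed.append("FIGSHARE_TOKEN")
    if github:
        needed.append("GITHUB_TOKEN")
    return [name for name in needed if not env.get(name)]


def is_newer(candidate, published):
    """Compare dotted integer versions numerically, not as strings."""
    def parse(version):
        return tuple(int(part) for part in version.split("."))
    return parse(candidate) > parse(published)
```

Comparing parsed tuples rather than raw strings matters here because a string comparison would rank "0.1.9" above "0.1.10".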