Commit 8412244

Merge pull request #243 from PNNL-CompBio/doc-update
provided framework for coderdata UI
2 parents c4011a4 + bea6c4b commit 8412244

2 files changed

Lines changed: 131 additions & 86 deletions

File tree

README.md

Lines changed: 67 additions & 74 deletions
@@ -1,6 +1,10 @@
 ## Cancer Omics Drug Experiment Response Dataset
 
-There is a recent explosion of deep learning algorithms that tackle the computational problem of predicting drug treatment outcome from baseline molecular measurements. To support this, we have built a benchmark dataset that harmonizes diverse datasets to better assess algorithm performance.
+There is a recent explosion of deep learning algorithms that tackle
+the computational problem of predicting drug treatment outcome from
+baseline molecular measurements. To support this, we have built a
+Python package that enables access to and facile usage of cancer drug
+sensitivity datasets for AI applications.
 
 This package collects diverse sets of paired molecular datasets with corresponding drug sensitivity data. All data here is reprocessed and standardized so it can be easily used as a benchmark dataset for the
 This repository leverages existing datasets to collect the data
@@ -12,81 +16,70 @@ existing models.
 ![Coderdata Motivation](coderdata_overview.jpg?raw=true "Motivation behind
 coderdata development")
 
-
-The goal of this repository is two-fold: First, it aims to collate and
-standardize the data for the broader community. This requires
-running a series of scripts to build and append to a standardized data
-model. Second, it has a series of scripts that pull from the data
-model to create model-specific data files that can be run by the data
-infrastructure.
-
-## Data access
+## Installation
+To install the CoderData Python package:
+```
+pip install coderdata
+```
+
+## Usage
+The Python package is designed to facilitate the training and
+validation of computational models that predict drug
+response. Currently the package supports the following commands:
+
+1. `list`: Lists the names of the datasets available for download. This
+depends on which datasets have been included in the main build.
+2. `download`: Downloads a dataset by name. Case insensitive, but the
+name should match the full name of a dataset returned by the `list` command.
+3. `load`: Returns a `dataset` object that houses all the
+data. This object has the following functions:
+   1. `train_test_validate`: Splits the object into training, test, and
+   validation subsets.
+   2. `types`: Each dataset has different types of data included in
+   it. For all possible data types see the
+   [schema](schema/README.md). These can include:
+      - transcriptomics
+      - mutations
+      - copy_numbers
+      - proteomics
+      - experiments
+      - combinations
+      - drugs
+      - genes
+      - samples
+   3. `format`: Performs data type-specific formatting. The first
+   argument is the name of the data type, the next arguments are
+   data-type-specific, and the last argument is `use_polars` to return a
+   polars instead of a pandas data frame.
+      - transcriptomics: `ds.format('transcriptomics')` returns a
+      pandas or polars data frame with each row representing a gene
+      and each column representing a sample.
+      - mutations:
+      `ds.format('mutations',['Frame_Shift_Del','Frame_Shift_Ins','Missense_Mutation','Start_Codon_SNP'])`
+      returns a binary matrix with rows representing genes and
+      columns representing samples, with a `1` value if there
+      is a mutation in the given gene/sample that falls into a class
+      provided by the second argument.
+      - copy_numbers: `ds.format('copy_number','copy_number')` returns a
+      pandas or polars data frame with each row representing a gene,
+      each column representing a sample, and values giving the average
+      copy_number for each gene/sample pair. If the second argument is
+      `copy_call`, the data frame values are a discrete measurement of
+      copy number as defined by the schema.
+      - proteomics: `ds.format('proteomics')` returns a
+      pandas or polars data frame with each row representing a gene
+      and each column representing a sample.
+      - experiments: `ds.format('experiments', 'fit_auc')` returns a
+      matrix with drugs represented by rows and samples represented by
+      columns; the numeric values represent the measurement provided by
+      the second argument.
+      - combinations: `ds.format('combinations')` returns ???
+      - drugs: Not sure what to return here - just a table? How about descriptors?
+      - samples: just return a table?
+   4. `save`: Saves the object to a file.
+
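The binary gene-by-sample mutation matrix described above can be illustrated with a small, self-contained pandas sketch. The column names (`entrez_id`, `improve_sample_id`, `variant_classification`) and the toy data are assumptions for illustration only, not the actual coderdata implementation:

```python
import pandas as pd

# Toy long-format mutations table (made-up data and column names).
muts = pd.DataFrame({
    "entrez_id": [1, 1, 2, 3],
    "improve_sample_id": ["s1", "s2", "s1", "s2"],
    "variant_classification": ["Missense_Mutation", "Frame_Shift_Del",
                               "Silent", "Missense_Mutation"],
})

# Keep only mutations in the requested classes, then pivot to a binary
# matrix: rows are genes, columns are samples, 1 marks a qualifying hit.
keep = muts["variant_classification"].isin(["Missense_Mutation", "Frame_Shift_Del"])
mat = (muts[keep]
       .assign(hit=1)
       .pivot_table(index="entrez_id", columns="improve_sample_id",
                    values="hit", aggfunc="max", fill_value=0))
print(mat)
```

Note that gene 2's only mutation is `Silent`, so it drops out of the matrix entirely rather than appearing as a row of zeros.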
+## Additional documentation
 For access to the latest version of CoderData, please visit our
 [documentation site](https://pnnl-compbio.github.io/coderdata/) which provides access to Figshare and
 instructions for using the Python package to download the data.
 
-## Data format
-All coderdata files are in text format - either comma delimited or tab
-delimited (depending on data type). Each dataset can be evaluated
-individually according to the CoderData schema that is maintained in [LinkML](schema/coderdata.yaml)
-and can be updated via a commit to the repository. For more details,
-please see the [schema description](schema/README.md).
-
-## Building a local version
-
-The build process can be found in our [build
-directory](build/README.md). Here you can follow the instructions to
-build your own local copy of the data on your machine.
-
-## Adding a new dataset
-
-We have standardized the build process so an additional dataset can be
-built locally or as part of the next version of coder. Here are the
-steps to follow:
-
-1. First visit the [build
-directory](build/README.md) and ensure you can build a local copy of
-CoderData.
-
-2. Check out this repository and create a subdirectory of the
-[build directory](build) with your own build files.
-
-3. Develop your scripts to build the data files according to our
-[LinkML Schema](schema/coderdata.yaml). This will require collecting
-the following metadata:
-   - entrez gene identifiers (or you can use the `genes.csv` file)
-   - sample information such as species and model system type
-   - a drug name that can be searched on PubChem
-
-You can validate each file by
-using the [LinkML
-validator](https://linkml.io/linkml/data/validating-data) together
-with our schema file.
-
-You can use the following scripts as part of your build process:
-- [build/utils/fit_curve.py](build/utils/fit_curve.py): This script
-takes dose-response data and generates the dose-response statistics
-required by CoderData.
-- [build/utils/pubchem_retrieval.py](build/utils/pubchem_retrieval.py):
-This script retrieves structure and drug synonym information
-required to populate the `Drug` table.
-
-4. Wrap your scripts in standard shell scripts with the following names
-and arguments:
-
-| shell script | arguments | description |
-|------------------|--------------------------|---------------------|
-| `build_samples.sh` | [latest_samples] | Latest version of samples generated by the coderdata build |
-| `build_omics.sh` | [gene file] [sample file] | This includes the `genes.csv` that was generated in the original build as well as the sample file generated above. |
-| `build_drugs.sh` | [drugfile1,drugfile2,...] | This includes a comma-delimited list of all drug files generated from previous builds |
-| `build_exp.sh` | [sample file] [drug file] | Sample file and drug file generated by previous scripts |
-
-5. Put the Docker container file inside the [Docker
-directory](./build/docker) with the name
-`Dockerfile.[datasetname]`.
-
-6. Run `build_all.py` from the root directory, which should now add
-your Dockerfile into the mix and call the scripts in your Docker
-container to build the files.
-

build/README.md

Lines changed: 64 additions & 12 deletions
@@ -8,16 +8,15 @@ are added.
 
 ![Build process](coderDataBuild.jpg?raw=true "Build process")
 
-## build_all.py script
+## Build a local version using the `build_all.py` script
 
 This script initializes all docker containers, builds all datasets, validates them, and uploads them to figshare and pypi.
 
 It requires the following authorization tokens to be set in the local environment depending on the use case:
 `SYNAPSE_AUTH_TOKEN`: Required for beataml and mpnst datasets. Join the [CoderData team](https://www.synapse.org/#!Team:3503472) on Synapse and generate an access token.
 `PYPI_TOKEN`: This token is required to upload to PyPI.
 `FIGSHARE_TOKEN`: This token is required to upload to Figshare.
 `GITHUB_TOKEN`: This token is required to upload to GitHub.
 **Available arguments**:
 
 - `--docker`: Initializes and builds all docker containers.
@@ -35,20 +34,20 @@ It requires the following authorization tokens to be set in the local environmen
 - `--github-username`: GitHub username matching the GITHUB_TOKEN. Required to push the new Tag to the GitHub Repository.
 - `--github-email`: GitHub email matching the GITHUB_TOKEN. Required to push the new Tag to the GitHub Repository.
 
 **Example usage**:
 - Build all datasets and upload to Figshare, PyPI, and GitHub.
 Required tokens for the following command: `SYNAPSE_AUTH_TOKEN`, `PYPI_TOKEN`, `FIGSHARE_TOKEN`, `GITHUB_TOKEN`.
 ```bash
 python build/build_all.py --all --high_mem --validate --pypi --figshare --version 0.1.41 --github-username jjacobson95 --github-email jeremy.jacobson3402@gmail.com
 ```
 
 - Build only the experiment files.
 **Note**: Preceding steps will not automatically be run. This assumes that docker images, samples, omics, and drugs were all previously built. Ensure all required tokens are set.
 ```bash
 python build/build_all.py --exp
 ```
 
-## build_dataset.py script
+## Build/test an individual dataset using the `build_dataset.py` script
 This script builds a single dataset for **debugging purposes only**. It can help determine if a dataset will build correctly in isolation. Note that the sample and drug identifiers generated may not align with those from other datasets, so this script is not suitable for building production datasets.
 
 It requires the following authorization tokens to be set in the local environment depending on the dataset:
@@ -79,6 +78,59 @@ Build the broad_sanger dataset but skip previously built files in "local" direct
 python build/build_dataset.py --dataset broad_sanger --continue
 ```
 
+## Adding a new dataset
+
+We have standardized the build process so an additional dataset can be
+built locally or as part of the next version of coder. Here are the
+steps to follow:
+
+1. First visit the [build
+directory](build/README.md) and ensure you can build a local copy of
+CoderData.
+
+2. Check out this repository and create a subdirectory of the
+[build directory](build) with your own build files.
+
+3. Develop your scripts to build the data files according to our
+[LinkML Schema](schema/coderdata.yaml). This will require collecting
+the following metadata:
+   - entrez gene identifiers (or you can use the `genes.csv` file)
+   - sample information such as species and model system type
+   - a drug name that can be searched on PubChem
+
+You can validate each file by
+using the [LinkML
+validator](https://linkml.io/linkml/data/validating-data) together
+with our schema file.
+
+You can use the following scripts as part of your build process:
+- [build/utils/fit_curve.py](build/utils/fit_curve.py): This script
+takes dose-response data and generates the dose-response statistics
+required by CoderData.
+- [build/utils/pubchem_retrieval.py](build/utils/pubchem_retrieval.py):
+This script retrieves structure and drug synonym information
+required to populate the `Drug` table.
+
+4. Wrap your scripts in standard shell scripts with the following names
+and arguments:
+
+| shell script | arguments | description |
+|------------------|--------------------------|---------------------|
+| `build_samples.sh` | [latest_samples] | Latest version of samples generated by the coderdata build |
+| `build_omics.sh` | [gene file] [sample file] | This includes the `genes.csv` that was generated in the original build as well as the sample file generated above. |
+| `build_drugs.sh` | [drugfile1,drugfile2,...] | This includes a comma-delimited list of all drug files generated from previous builds |
+| `build_exp.sh` | [sample file] [drug file] | Sample file and drug file generated by previous scripts |
+
+5. Put the Docker container file inside the [Docker
+directory](./build/docker) with the name
+`Dockerfile.[datasetname]`.
+
+6. Run `build_all.py` from the root directory, which should now add
+your Dockerfile into the mix and call the scripts in your Docker
+container to build the files.
+
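The exact statistics produced by `fit_curve.py` are not spelled out above; as a toy illustration of one common dose-response summary, the sketch below computes a normalized area under a dose-response curve with the trapezoid rule. All data points are made up, and this is not the repository's implementation:

```python
import numpy as np

# Toy dose-response curve: log10 doses and measured viability fractions.
log_dose = np.array([-3.0, -2.0, -1.0, 0.0, 1.0])
response = np.array([1.0, 0.9, 0.5, 0.2, 0.1])

# Trapezoid-rule area, normalized by the dose range so that a flat
# response of 1.0 (no drug effect) yields an AUC of 1.0.
widths = np.diff(log_dose)
area = float(np.sum((response[1:] + response[:-1]) / 2.0 * widths))
auc = area / (log_dose[-1] - log_dose[0])
print(round(auc, 4))
```

A statistic like this is what a `build_exp.sh` step would typically write into the experiments table for each drug/sample pair.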