Merge pull request #250 from PNNL-CompBio/mpnst-readme-update

jjacobson95 · web-flow · commit bc7d75ae014f · 2024-12-06T13:45:33.000-08:00
Update MPNST and build_dataset.py README.md files
diff --git a/build/README.md b/build/README.md
@@ -56,21 +56,22 @@ It requires the following authorization tokens to be set in the local environmen
 `SYNAPSE_AUTH_TOKEN`: Required for beataml and mpnst datasets. Follow the directions above to gain access.
 
 Available arguments:
-- `--dataset`: Required. Name of the dataset to build.
+- `--dataset`: Required. Name of the dataset to build. At a minimum, this will build the Docker images.
 - `--use_prev_dataset`: Optional. Prefix of the previous dataset for sample and drug ID continuation. The previous dataset files must be in the "local" directory.
-- `--validate`: Optional. Runs the schema checker on the built files.
+- `--build`: Optional. Build the desired dataset.
+- `--validate`: Optional. Run the schema checker on the built files.
 - `--continue`: Optional. Continues from where the build left off by skipping existing files in "local" directory.
 Example usage:
 
 Build the broad_sanger dataset:
 ```bash
-python build/build_dataset.py --dataset broad_sanger
+python build/build_dataset.py --build --dataset broad_sanger
 ```
 Build the mpnst dataset continuing from broad_sanger sample and drug IDs:
 ```bash
-python build/build_dataset.py --dataset mpnst --use_prev_dataset broad_sanger
+python build/build_dataset.py --build --dataset mpnst --use_prev_dataset broad_sanger
 ```
-Build the hcmi dataset and run validation:
+Run schema validation on the hcmi dataset:
 ```bash
 python build/build_dataset.py --dataset hcmi --validate
 ```
diff --git a/build/mpnst/README.md b/build/mpnst/README.md
@@ -1,34 +1,63 @@
 ## Build Instructions for MPNST Dataset
 
 To build the MPNST dataset, follow these steps from the coderdata root
-directory. Currently using the test files as input. 
+directory.
 
-1. Build the Docker image:
+### Step 1: Set the SYNAPSE_AUTH_TOKEN Environment Variable.
+This is required to download the data.
+```
+export SYNAPSE_AUTH_TOKEN="Your Synapse Token"
+```
+### Step 2: Choose an option below depending on your needs.
+---
+### Option 1: Quick build of the test dataset using build_dataset.py
+
+This quick build process does not map sample identifiers to previous data versions and is only for personal use.
+```
+python build/build_dataset.py --dataset mpnst --build 
+```
+---
+### Option 2: Build the test dataset using build_dataset.py with a previous dataset.
+
+This build process assumes you have already built, or have access to, a previously built dataset. The previous dataset must be located in `$PWD/local`. The `--validate` argument runs the schema checker to confirm that the output aligns with the schema.
+```
+python build/build_dataset.py --dataset mpnst --build --validate --use_prev_dataset beataml
+```
+---
+### Option 3: Build each test file one at a time.
+This process does not map sample identifiers to previous data versions and is only for personal use.
+
+1. Create an empty local directory in the coderdata root directory.
+   ```
+   mkdir local
+   ```
+2. Build the Docker image with the optional HTTPS_PROXY argument:
    ```
    docker build -f build/docker/Dockerfile.mpnst -t mpnst . --build-arg HTTPS_PROXY=$HTTPS_PROXY
    ```
 
-2. Generate new identifiers for these samples to create a
+3. Generate new identifiers for these samples to create a
    `mpnst_samples.csv` file. This pulls from the latest synapse
    project metadata table.
    ```
-   docker run -v $PWD:/tmp -e -e SYNAPSE_AUTH_TOKEN=$SYNAPSE_AUTH_TOKEN mpnst sh build_samples.sh /tmp/build/build_test/test_samples.csv
-   ```
 
-3. Pull the data and map it to the samples. This uses the metadata
+   docker run -v "$PWD/local":/tmp -e SYNAPSE_AUTH_TOKEN=$SYNAPSE_AUTH_TOKEN mpnst bash build_samples.sh [Previous Samples file or Empty Quotes ("")]
+   ```
+
+4. Pull the data and map it to the samples. This uses the metadata
    table pulled above.
    ```
-   docker run -v $PWD:/tmp -e SYNAPSE_AUTH_TOKEN=$SYNAPSE_AUTH_TOKEN mpnst sh build_omics.sh /tmp/build/build_test/test_genes.csv /tmp/mpnst_samples.csv 
+   docker run -v "$PWD/local":/tmp -e SYNAPSE_AUTH_TOKEN=$SYNAPSE_AUTH_TOKEN mpnst bash build_omics.sh /tmp/genes.csv /tmp/mpnst_samples.csv 
    ```
 
-4. Process drug data
+5. Process drug data:
    ```
-   docker run -v $PWD:/tmp -e SYNAPSE_AUTH_TOKEN=$SYNAPSE_AUTH_TOKEN  mpnst sh build_drugs.sh /tmp/build/build_test/test_drugs.tsv
+   docker run -v "$PWD/local":/tmp -e SYNAPSE_AUTH_TOKEN=$SYNAPSE_AUTH_TOKEN  mpnst bash build_drugs.sh [Previous Drugs file or Empty Quotes ("")]
    ```
    
-5. Process experiment data. This uses the metadata from above as well as the file directory on synapse:
+6. Process experiment data. This uses the metadata from above as well as the file directory on Synapse:
    ```
-   docker run -v $PWD:/tmp -e SYNAPSE_AUTH_TOKEN=$SYNAPSE_AUTH_TOKEN mpnst sh build_exp.sh /tmp/mpnst_samples.csv /tmp/mpnst_drugs.tsv.gz
+   docker run -v "$PWD/local":/tmp -e SYNAPSE_AUTH_TOKEN=$SYNAPSE_AUTH_TOKEN mpnst bash build_exp.sh /tmp/mpnst_samples.csv /tmp/mpnst_drugs.tsv
    ```
 
 Please ensure that each step is followed in order for correct dataset compilation.
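
Reviewer sketch (not part of the patch): since Option 3 only works if the steps run in order, the sequence could be wrapped in a small driver that stops at the first failing step. The following is a minimal sketch under stated assumptions, not code from this repository. The names `option3_commands` and `run_option3` are hypothetical. The Docker commands are copied from the updated README, with two deviations: the previous samples and drugs files default to empty strings (the "Empty Quotes" case), and the token is forwarded with `-e SYNAPSE_AUTH_TOKEN` (no value), which passes the variable through from the calling environment instead of expanding it on the command line. The sketch assumes `/tmp/genes.csv` is already present in `local`, as the README's step 4 does.

```python
import os
import shlex
import subprocess

IMAGE = "mpnst"  # tag given to `docker build -t` in step 2 of Option 3


def option3_commands(local_dir, prev_samples="", prev_drugs=""):
    """Return the Option 3 Docker commands, in README order, as argv lists."""
    build = [
        "docker", "build", "-f", "build/docker/Dockerfile.mpnst", "-t", IMAGE, ".",
        "--build-arg", "HTTPS_PROXY=" + os.environ.get("HTTPS_PROXY", ""),
    ]
    # `-e NAME` with no value forwards NAME from the calling environment,
    # so the token does not appear in the printed command line.
    run = ["docker", "run", "-v", f"{local_dir}:/tmp",
           "-e", "SYNAPSE_AUTH_TOKEN", IMAGE, "bash"]
    return [
        build,
        run + ["build_samples.sh", prev_samples],
        run + ["build_omics.sh", "/tmp/genes.csv", "/tmp/mpnst_samples.csv"],
        run + ["build_drugs.sh", prev_drugs],
        run + ["build_exp.sh", "/tmp/mpnst_samples.csv", "/tmp/mpnst_drugs.tsv"],
    ]


def run_option3(repo_root=".", dry_run=True):
    """Return the commands as shell strings; execute them unless dry_run is set."""
    local_dir = os.path.join(os.path.abspath(repo_root), "local")
    commands = option3_commands(local_dir)
    if not dry_run:
        if not os.environ.get("SYNAPSE_AUTH_TOKEN"):
            raise RuntimeError("SYNAPSE_AUTH_TOKEN must be set before building")
        os.makedirs(local_dir, exist_ok=True)  # step 1: mkdir local
        for cmd in commands:
            # check=True raises CalledProcessError, so later steps never run
            # on top of a failed earlier step.
            subprocess.run(cmd, cwd=repo_root, check=True)
    return [shlex.join(cmd) for cmd in commands]


# Dry run: print the commands without invoking Docker.
for line in run_option3(dry_run=True):
    print(line)
```

Calling `run_option3(dry_run=False)` from the coderdata root would execute the same commands. Keeping the commands as argv lists avoids shell quoting problems with the empty-string arguments.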