Commit 618b21b

Merge pull request #424 from PNNL-CompBio/hcmi-manifest
Hcmi manifest update and fix for duplicated counts
2 parents 23a25cc + c2d9db0 commit 618b21b

4 files changed

Lines changed: 4078 additions & 2146 deletions

build/hcmi/02-getHCMIData.py

Lines changed: 5 additions & 2 deletions
@@ -613,10 +613,13 @@ def write_dataframe_to_csv(dataframe, outname):
     -------
     None
     """
+    dataframe = dataframe.to_pandas()
+    dataframe = dataframe.drop_duplicates()
+
     if('gz' in outname):
-        dataframe.to_pandas().to_csv(outname,compression='gzip',index=False)
+        dataframe.to_csv(outname,compression='gzip',index=False)
     else:
-        dataframe.to_pandas().to_csv(outname,index=False)
+        dataframe.to_csv(outname,index=False)
     return
 
 def main():
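
The hunk above converts the frame to pandas once, drops exact duplicate rows, and only then writes the CSV. A minimal standalone sketch of that behaviour follows. It is not the repository's function: the `hasattr` guard, the `endswith(".gz")` check (the original tests `'gz' in outname`), the returned frame, and the column names in the usage note are assumptions made so the example runs on its own.

```python
import pandas as pd


def write_dataframe_to_csv(dataframe, outname):
    """Drop exact duplicate rows, then write a CSV (gzip if the name ends in .gz)."""
    # Assumption: the real function receives an object exposing .to_pandas(),
    # as the diff shows; this sketch also accepts a plain pandas DataFrame.
    if hasattr(dataframe, "to_pandas"):
        dataframe = dataframe.to_pandas()

    # Rows that are identical across every column are collapsed to one.
    deduped = dataframe.drop_duplicates()

    # Assumption: a stricter suffix check than the original substring test.
    compression = "gzip" if outname.endswith(".gz") else None
    deduped.to_csv(outname, compression=compression, index=False)

    # Returned only so the result is easy to inspect; the original returns None.
    return deduped
```

For example, a frame built from `{"entrez_id": [1, 1, 2], "counts": [10, 10, 5]}` (hypothetical columns) would be written with two data rows instead of three. Note that `drop_duplicates()` with no arguments only removes rows that match in every column, so records that differ in any field are kept.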

build/hcmi/README.md

Lines changed: 27 additions & 6 deletions
@@ -1,17 +1,38 @@
 ## HCMI Data
 
-Here we will store the scripts required to process the data from the [Human Cancer Models Initiative](https://ocg.cancer.gov/programs/HCMI)
+Here we will store the scripts required to process the data from the
+[Human Cancer Models
+Initiative](https://ocg.cancer.gov/programs/HCMI).
+
+Currenlty all data collected is part of the [HCMI-CMDC Project on the
+GDC](https://portal.gdc.cancer.gov/analysis_page?app=Projects). To
+update:
+
+1. Navigate to the [GDC Data
+Portal](https://portal.gdc.cancer.gov/analysis_page?app=Projects),
+and select 'HCMI-CMDC'
+2. Click on the 'Cases' button, and select the download button where
+it lists the number of files.
+3. This will download the ENTIRE Manifes
+4. Filter the manifest for RNASeq, WGS mutations, and copy number
+calls using the following command:
+```
+cat ~gdc_manifest.2025-07-08.091940.txt | grep 'mask\|copy\|rna_seq\|md5'
+| grep 'txt\|maf\|tsv\|md5' > new_manifest.txt
+cp new_manifest.txt full_manifest.txt
+
+```
 
 
-Currently the tool require two steps to build the data:
+Currently the tool require two scripts to build the data:
 ```
 python 01-createHCMISamplesFile.py
 
-python 02-getHCMIData.py -m transcriptomics_gdc_manifest.txt -t transcriptomics -o transcriptomics.csv
+python 02-getHCMIData.py -m full_manifest.txt -t transcriptomics -o transcriptomics.csv
 
-python 02-getHCMIData.py -m mutations_manifest_gdc.txt -t mutations -o mutations.csv
+python 02-getHCMIData.py -m full_manifest.txt -t mutations -o mutations.csv
 
-python 02-getHCMIData.py -m _manifest.txt -t copy_number -o copy_number.csv
+python 02-getHCMIData.py -m full_manifest.txt -t copy_number -o copy_number.csv
 
 
-```
+```
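
The manifest filter described in the README change chains two `grep` calls, so a line is kept only if it matches a term from the first list and a term from the second. The sketch below reproduces that logic in Python as an illustration. The function names are hypothetical and are not part of the repository, and the reading that `md5` is there to keep the manifest's header row is an assumption.

```python
# Terms copied from the two grep stages in the README command.
FIRST_PASS = ("mask", "copy", "rna_seq", "md5")
SECOND_PASS = ("txt", "maf", "tsv", "md5")


def filter_manifest_lines(lines):
    """Keep lines containing a term from each list, like two chained greps."""
    return [
        line
        for line in lines
        if any(term in line for term in FIRST_PASS)
        and any(term in line for term in SECOND_PASS)
    ]


def filter_manifest(in_path, out_path):
    """Filter a downloaded manifest file and write the kept lines to out_path."""
    with open(in_path) as src:
        kept = filter_manifest_lines(src)
    with open(out_path, "w") as dst:
        dst.writelines(kept)
    return len(kept)
```

Calling `filter_manifest("gdc_manifest.txt", "full_manifest.txt")` (hypothetical input file name) would produce the single manifest that all three `02-getHCMIData.py` invocations now share. Separately, the shell command as printed in the README places the second `| grep` on its own line; a shell would need both stages joined onto one line for the pipeline to run.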
