Commit 1426ae8

committed
updating to address suggestions from @bcli4d
1 parent 98ba819 commit 1426ae8

3 files changed

Lines changed: 66 additions & 2 deletions

notebooks/getting_started/part1_prerequisites.ipynb

Lines changed: 36 additions & 0 deletions
@@ -38,6 +38,15 @@
3838
"[NCI Imaging Data Commons (IDC)](https://datacommons.cancer.gov/repository/imaging-data-commons) is a cloud-based repository of publicly available cancer imaging data co-located with the analysis and exploration tools and resources. IDC is a node within the broader NCI Cancer Research Data Commons (CRDC) infrastructure that provides secure access to a large, comprehensive, and expanding collection of cancer research data."
3939
]
4040
},
41+
{
42+
"cell_type": "markdown",
43+
"metadata": {},
44+
"source": [
45+
"## What is an \"IDC collection\"?\n",
46+
"\n",
47+
"IDC contains images and image-derived data (i.e., annotations and analysis results) from a variety of repositories. These data are broadly organized into programs, which correspond to the various data collection initiatives. Programs consist of collections, which group data collected by a specific entity for a specific application. Collections include both the original data collected by the contributing entity and image-derived data that may have been generated by other contributors, extending the original collection."
48+
]
49+
},
4150
{
4251
"cell_type": "markdown",
4352
"metadata": {},
@@ -118,6 +127,33 @@
118127
"cell_type": "markdown",
119128
"metadata": {},
120129
"source": [
130+
"## Locate and add `bigquery-public-data` project\n",
131+
"\n",
132+
"`bigquery-public-data` is a public project that contains BigQuery tables with IDC metadata (we will work with those in part 2 of this series). To navigate those metadata tables, we need to add this project manually.\n",
133+
"\n",
134+
"1. Navigate to the BigQuery console: https://console.cloud.google.com/bigquery, and click the `+ ADD DATA` button.\n",
135+
"\n",
136+
"![add data](https://www.dropbox.com/s/cg99cyn1uzigw7s/add_data.png?raw=1)\n",
137+
"\n",
138+
"2. Choose the \"Star a project\" option from the list.\n",
139+
"\n",
140+
"![star a project](https://www.dropbox.com/s/6688galhthr5vsn/star_a_project.png?raw=1)\n",
141+
"\n",
142+
"3. Type `bigquery-public-data` as the project name and click the `STAR` button.\n",
143+
"\n",
144+
"![star](https://www.dropbox.com/s/nzh7aybkre138g1/star.png?raw=1)\n",
145+
"\n",
146+
"In a few moments, the `bigquery-public-data` project should appear in the list on the left-hand side of the BigQuery console.\n",
147+
"\n",
148+
"![starred](https://www.dropbox.com/s/s2f6vpolbimnyb8/bqpd_added.png?raw=1)"
149+
]
150+
},
151+
{
152+
"cell_type": "markdown",
153+
"metadata": {},
154+
"source": [
155+
"## Check the setup\n",
156+
"\n",
121157
"Finally, let's run a query to confirm that the setup is working for your account."
122158
]
123159
},

notebooks/getting_started/part2_searching_basics.ipynb

Lines changed: 29 additions & 1 deletion
@@ -148,6 +148,29 @@
148148
"In this query we work with the [`dicom_all` table](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=idc_current&t=dicom_all&page=table), which contains the DICOM metadata extracted from IDC images along with collection-level metadata that does not originate from DICOM."
149149
]
150150
},
151+
{
152+
"cell_type": "markdown",
153+
"metadata": {},
154+
"source": [
155+
"## Organization of IDC metadata in BigQuery tables\n",
156+
"\n",
157+
"Let's take a moment to look into the table used in the `FROM` clause of our query: [`bigquery-public-data.idc_current.dicom_all`](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=idc_current&t=dicom_all&page=table).\n",
158+
"\n",
159+
"This name is like an address that lets you locate a specific table in BigQuery. This \"address\" consists of three components: `<project_id>.<dataset_id>.<table_id>`\n",
160+
"\n",
161+
"1. `bigquery-public-data` is a public GCP _project_ that is maintained by the Google Public Datasets Program. IDC-curated BigQuery tables with the metadata about IDC images are included in this project.\n",
162+
"2. `idc_current` is a _dataset_ within the `bigquery-public-data` project. Think of BigQuery datasets as containers that are used to organize and control access to the tables within the project.\n",
163+
"3. `dicom_all` is one of the tables within the `idc_current` dataset. As you spend more time learning about IDC, you will hopefully leverage other tables available in that dataset.\n",
164+
"\n",
165+
"If you now look back at the [BigQuery console](https://console.cloud.google.com/bigquery) and expand the list of datasets under the `bigquery-public-data` project, you will see that in addition to the `idc_current` dataset there are also datasets `idc_v12`, `idc_v11`, etc., all the way down to `idc_v1`. Those datasets correspond to the IDC data release versions, with `idc_current` being an alias for the latest (at the moment, v12) version of IDC data.\n",
166+
"\n",
167+
"We will not spend time discussing how IDC versioning works, but it is important to know that:\n",
168+
"\n",
169+
"1. IDC data is versioned;\n",
170+
"2. queries against the `idc_current` dataset are equivalent to the queries against the latest version (currently, `idc_v12`) of IDC data;\n",
171+
"3. if you want your query results to stay the same across IDC releases, write the queries against a specific `idc_v*` dataset instead of `idc_current`."
172+
]
173+
},
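
Note on the cell added above: the three-part table "address" and the `idc_current` versus `idc_v*` distinction can be illustrated with a few lines of plain Python. This is only a sketch; the helper names `table_address` and `pin_version` are hypothetical and are not part of the notebook or of any BigQuery library. It only builds strings and does not contact BigQuery.

```python
def table_address(project_id: str, dataset_id: str, table_id: str) -> str:
    """Join the three components into a fully qualified BigQuery table name."""
    return f"{project_id}.{dataset_id}.{table_id}"


def pin_version(address: str, version: int) -> str:
    """Replace the `idc_current` alias with a fixed `idc_v<N>` dataset.

    Hypothetical helper for illustration: addresses that already point
    at a versioned dataset are returned unchanged.
    """
    project_id, dataset_id, table_id = address.split(".")
    if dataset_id == "idc_current":
        dataset_id = f"idc_v{version}"
    return table_address(project_id, dataset_id, table_id)


# Address used in the FROM clause of the tutorial query
current = table_address("bigquery-public-data", "idc_current", "dicom_all")

# The same table pinned to release v12 (the latest release per the text above)
pinned = pin_version(current, 12)

print(current)
print(pinned)
```

Using the pinned name in a `FROM` clause keeps the query tied to one IDC release, while the `idc_current` name follows whatever release is latest at the time the query runs.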
151174
{
152175
"cell_type": "markdown",
153176
"metadata": {
@@ -345,6 +368,9 @@
345368
" bigquery-public-data.idc_current.dicom_all\n",
346369
"WHERE\n",
347370
" # write the selection criteria under this line!\n",
371+
" # Use AND operator to combine the filter values for the\n",
372+
" # Modality and tcia_tumorLocation to select collections that\n",
373+
" # include MR images for Lung cancer locations\n",
348374
"\"\"\"\n",
349375
"\n",
350376
"selection_result = bq_client.query(selection_query)\n",
@@ -615,7 +641,9 @@
615641
"* learned about BigQuery as the tool for searching IDC metadata\n",
616642
"* are motivated to start experimenting with the SQL interface to select subsets of IDC data at different levels of data model (collection, patient, study, series)\n",
617643
"\n",
618-
"If you have any questions about this tutorial, or about searching IDC metadata, please send us an email to support@canceridc.dev or posting your question on [IDC User forum](https://discourse.cancer.dev)!"
644+
"If you have any questions about this tutorial, or about searching IDC metadata, please send us an email at support@canceridc.dev or post your question on the [IDC User forum](https://discourse.cancer.dev)!\n",
645+
"\n",
646+
"This tutorial barely scratches the surface of what you can do with BigQuery SQL. If you are interested in a comprehensive tutorial about BigQuery SQL, check out this [\"Intro to SQL\" course on Kaggle](https://www.kaggle.com/learn/intro-to-sql)!"
619647
]
620648
},
621649
{
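
Note on the exercise hint added in the `@@ -345,6 +368,9 @@` hunk: one possible shape of the `WHERE` clause the new comments ask for is sketched below. The column names `Modality` and `tcia_tumorLocation` come from the comments in the diff; the `SELECT` part of the query is not visible in the hunk, so the `collection_id` column used here is an assumption, as are the literal values `"MR"` and `"Lung"` taken from the wording of the hint. The function name `build_selection_query` is hypothetical.

```python
def build_selection_query(modality: str, tumor_location: str) -> str:
    """Build a query that combines two filters with the AND operator.

    Assumption: `collection_id` is the column being selected; the diff
    does not show the SELECT clause of the original exercise.
    """
    return f"""
SELECT
  DISTINCT(collection_id)
FROM
  bigquery-public-data.idc_current.dicom_all
WHERE
  Modality = "{modality}"
  AND tcia_tumorLocation = "{tumor_location}"
"""


# Collections that include MR images for lung cancer locations
selection_query = build_selection_query("MR", "Lung")
print(selection_query)

# With an authenticated client, the notebook then runs:
# selection_result = bq_client.query(selection_query)
```

The values are interpolated directly into the SQL text only to keep the sketch short; for user-supplied values, query parameters are the safer choice.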

notebooks/getting_started/part3_exploring_cohorts.ipynb

Lines changed: 1 addition & 1 deletion
@@ -90,7 +90,7 @@
9090
"\n",
9191
"In IDC, a _cohort_ is a set of objects stored in IDC that share certain characteristics as defined by metadata.\n",
9292
"\n",
93-
"In the previous tutorial you learned how to use IDC metadata and SQL to filter IDC data and define subsets/cohorts based on such metadata characteristics as cancer location or image modality. You also learned about the hierarchy of data organization in IDC, whereas your \"cohorts\" can be defined at the level of collections, patients, imaging studies or individual images (imaging series).\n",
93+
"In the previous tutorial you learned how to use IDC metadata and SQL to filter IDC data and define subsets/cohorts based on such metadata characteristics as cancer location or image modality. You also learned about the hierarchy of data organization in IDC, where your \"cohorts\" can be defined at the level of collections, patients, imaging studies, series, or individual images (files).\n",
9494
"\n",
9595
"In the following cells we will learn:\n",
9696
"1. How to visualize images from your cohort\n",
