|
98 | 98 | "id": "kZMMjky1898-" |
99 | 99 | }, |
100 | 100 | "source": [ |
101 | | - "## What does it mean to search?\n", |
| 101 | + "## How do I search?\n", |
102 | 102 | "\n", |
103 | 103 | "When you search, or _query_, the IDC catalog, you specify the criteria that the metadata describing the selected files should satisfy. \n", |
104 | 104 | "\n", |
|
112 | 112 | "\n", |
113 | 113 | "Although it would be very nice to just state what you need in free form, in practice queries need to be written in a formal way.\n", |
114 | 114 | "\n", |
115 | | - "IDC organizes all of the metadata into large tables, where each row corresponds to one image file (as of writing, IDC indexes ~42 millions of files) and each column represents a metadata attribute present in one or more files in IDC (currently, we have hundreds of such attributes). \n", |
| 115 | + "IDC organizes all of the metadata into large tables, where each row corresponds to one image file (as of IDC data release v12, we index ~42 million files) and each column represents a metadata attribute present in one or more files in IDC (currently, we index hundreds of such attributes). \n", |
116 | 116 | "\n", |
117 | | - "IDC metadata tables are maintained in [GCP BigQuery](https://cloud.google.com/bigquery), with only a tiny subset of the attributes indexed in the catalog available via the [IDC Portal exploration page](https://imaging.datacommons.cancer.gov/explore/). IDC metadata can be queried using Standard Query Language (SQL), and does not require learning any IDC-specific API. " |
| 117 | + "IDC metadata tables are maintained in [GCP BigQuery](https://cloud.google.com/bigquery), with only a tiny subset of the attributes indexed in the catalog available via the [IDC Portal exploration page](https://imaging.datacommons.cancer.gov/explore/). IDC metadata can be queried using Structured Query Language (SQL), so searching it does not require learning any IDC-specific API. \n", |
| 118 | + "\n", |
| 119 | + "In the following steps of the tutorial we will use just a few of the attributes (SQL table columns) to get started. You will be able to use the same principles and SQL queries to extend your search criteria to include any of the other attributes indexed by IDC." |
118 | 120 | ] |
119 | 121 | }, |
120 | 122 | { |
|
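The "one row per file, one column per attribute" layout described above can be illustrated with a small self-contained sketch. It uses Python's built-in `sqlite3` rather than BigQuery, and the `toy_metadata` table, its `file_name` and `modality` columns, and all row values are made up for illustration; only `collection_id` is a column name taken from the tutorial.

```python
import sqlite3

# Toy stand-in for an IDC metadata table: one row per file,
# one column per metadata attribute. All values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_metadata (file_name TEXT, collection_id TEXT, modality TEXT)"
)
conn.executemany(
    "INSERT INTO toy_metadata VALUES (?, ?, ?)",
    [
        ("file_1.dcm", "collection_a", "CT"),
        ("file_2.dcm", "collection_a", "MR"),
        ("file_3.dcm", "collection_b", "CT"),
    ],
)


def find_files(connection, modality):
    """Return the names of files whose metadata satisfies the criterion."""
    cursor = connection.execute(
        "SELECT file_name FROM toy_metadata WHERE modality = ? ORDER BY file_name",
        (modality,),
    )
    return [row[0] for row in cursor]


ct_files = find_files(conn, "CT")
print(ct_files)
```

A query states the criteria as conditions on columns, and the result is the set of rows (files) that satisfy them. The same idea carries over to the much larger BigQuery tables.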
125 | 127 | "source": [ |
126 | 128 | "## First query and BigQuery workspace\n", |
127 | 129 | "\n", |
128 | | - "To get started, let's build the queries that replicate the information about IDC data shown in the IDC Portal.\n", |
| 130 | + "To get started, let's build the queries that replicate some of the information about IDC data shown in the IDC Portal.\n", |
129 | 131 | "\n", |
130 | 132 | "As the very first query, let's get the list of all the image collections available in IDC. Here is that query:\n", |
131 | 133 | "\n", |
|
158 | 160 | "\n", |
159 | 161 | "This name is like an address that allows you to locate the specific table in BigQuery. This \"address\" consists of three components: `<project_id>.<dataset_id>.<table_id>`\n", |
160 | 162 | "\n", |
161 | | - "1. `bigquery-public-data` is a public GCP _project_ that is maintained by Google Public Datasets Program. IDC-curated BigQuery tables with the metadata about IDC images is included in this project.\n", |
| 163 | + "1. `bigquery-public-data` is the ID of a public GCP _project_ that is maintained by the Google Public Datasets Program. IDC-curated BigQuery tables with the metadata about IDC images are included in this project.\n", |
162 | 164 | "2. `idc_current` is a _dataset_ within the `bigquery-public-data` project. Think of BigQuery datasets as containers that are used to organize and control access to the tables within the project.\n", |
163 | 165 | "3. `dicom_all` is one of the tables within the `idc_current` dataset. As you spend more time learning about IDC, you will hopefully leverage other tables available in that dataset.\n", |
164 | 166 | "\n", |
|
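To make the three-part table "address" concrete, here is a minimal sketch that splits a fully qualified name into its components. The helper `parse_table_address` and the `TableAddress` type are hypothetical names introduced for illustration, not part of any BigQuery library, and the sketch assumes the simple dot-separated form used in this tutorial.

```python
from typing import NamedTuple


class TableAddress(NamedTuple):
    project_id: str
    dataset_id: str
    table_id: str


def parse_table_address(address: str) -> TableAddress:
    """Split '<project_id>.<dataset_id>.<table_id>' into its three parts.

    Simplification: assumes none of the components contains a dot.
    """
    # Table names are often wrapped in backticks inside SQL; strip them first.
    parts = address.strip("`").split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected <project_id>.<dataset_id>.<table_id>, got {address!r}")
    return TableAddress(*parts)


address = parse_table_address("`bigquery-public-data.idc_current.dicom_all`")
print(address.project_id)
print(address.dataset_id)
print(address.table_id)
```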
179 | 181 | "source": [ |
180 | 182 | "## Same query using Python SDK\n", |
181 | 183 | "\n", |
182 | | - "BigQuery SQL workspace is a very convenient tool for exploring schemas of the tables, experimenting with writing and debugging queries, profiling their execution. But you can also run those queries programmatically, which is very convenient if you want to load the result the query into a pandas dataframe, or just perform your searches programmatically.\n", |
| 184 | + "BigQuery SQL workspace is a very convenient tool for exploring schemas of the tables, experimenting with writing and debugging queries, and profiling their execution. But you can also run those queries programmatically, which is very convenient if you want to direct the result of the query into a pandas DataFrame, or just perform your searches programmatically.\n", |
183 | 185 | "\n", |
184 | | - "BigQuery API is implemented in a variety of languages, with the python bindings available in the `google-cloud-bigquery` package. Conveniently, this package is pre-installed in Colab!\n", |
| 186 | + "BigQuery API support is implemented in a variety of languages, with the Python bindings available in the `google-cloud-bigquery` package. Conveniently, this package is pre-installed in Colab!\n", |
185 | 187 | "\n", |
186 | | - "HINT: SQL query syntax is not sensitive to indentation or capitalization - although those are quite helpful to make the query more readable!" |
| 188 | + "**HINT**: SQL query syntax is not sensitive to indentation or capitalization - although those are quite helpful to make the query more readable!" |
187 | 189 | ] |
188 | 190 | }, |
189 | 191 | { |
|
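As a rough sketch of the programmatic route described above, the snippet below builds a query string and wraps the `google-cloud-bigquery` calls in a helper. The helper names `build_collections_query` and `run_query` are hypothetical, and the query text is an assumption about what "list all collections" could look like, since the tutorial's own query is not shown in this diff. Running `run_query` requires the `google-cloud-bigquery` package, GCP credentials, and a project ID, so that call is left commented out.

```python
def build_collections_query(table="bigquery-public-data.idc_current.dicom_all"):
    """Build a query listing the distinct collection_id values in `table`.

    The query shape is an illustrative assumption, not the tutorial's exact query.
    """
    return (
        "SELECT DISTINCT collection_id\n"
        f"FROM `{table}`\n"
        "ORDER BY collection_id"
    )


def run_query(sql, project_id):
    """Run `sql` in BigQuery and return the result as a pandas DataFrame."""
    # Imported lazily so the query can be built and inspected
    # even where google-cloud-bigquery is not installed.
    from google.cloud import bigquery

    client = bigquery.Client(project_id)
    return client.query(sql).to_dataframe()


selection_query = build_collections_query()
print(selection_query)

# In an authenticated Colab session, with my_ProjectID set as earlier in the notebook:
# collections_df = run_query(selection_query, my_ProjectID)
```

Keeping the query text separate from the execution step makes it easy to paste the same SQL into the BigQuery workspace for debugging.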
221 | 223 | "source": [ |
222 | 224 | "## Exploring other IDC portal attributes via SQL\n", |
223 | 225 | "\n", |
224 | | - "Next we will explore few other attributes that are available in the IDC portal (with the few exceptions, the mapping is pretty straightforward):\n", |
| 226 | + "Next we will explore a few other attributes that are available in the [IDC portal](https://imaging.datacommons.cancer.gov/) (with a few exceptions, the mapping is pretty intuitive):\n", |
225 | 227 | "\n", |
226 | 228 | "\n", |
227 | 229 | "\n", |
|
301 | 303 | "# we specified in the beginning of the notebook!\n", |
302 | 304 | "bq_client = bigquery.Client(my_ProjectID)\n", |
303 | 305 | "\n", |
304 | | - "# Execution of this cell will fail unless you wrote the query below!\n", |
305 | 306 | "selection_query = \"\"\"\n", |
306 | 307 | "SELECT\n", |
307 | 308 | " collection_id,\n", |
|