Commit a380222

cleanup
1 parent c2a45de commit a380222

2 files changed

Lines changed: 17 additions & 16 deletions

notebooks/getting_started/part1_prerequisites.ipynb

Lines changed: 6 additions & 6 deletions
@@ -68,7 +68,7 @@
 "\n",
 "None of the activities in this tutorial series will require you to pay for use of any GCP services, to have cloud credits, or even to connect your credit card to your account.\n",
 "\n",
-"Egress of IDC data out of the cloud is free. While query of the data is not free, GCP [BigQuery free tier](https://cloud.google.com/bigquery/pricing#free-tier) includes 1 TB of query data per month, which will be sufficient to do a lot of queries of IDC data."
+"Egress of IDC data out of the cloud is free. While query of the data is not free, GCP [BigQuery free tier](https://cloud.google.com/bigquery/pricing#free-tier) includes 1 TB of query data per month, which we believe will be sufficient for most users."
 ]
 },
 {
@@ -77,16 +77,16 @@
 "source": [
 "## What do I need to get started?\n",
 "\n",
-"All you need is a Google account (identity) and a web browser. If you don't have a Google account, you can learn how to get one [here](https://accounts.google.com/signup/v2/webcreateaccount?dsh=308321458437252901&continue=https%3A%2F%2Faccounts.google.com%2FManageAccount&flowName=GlifWebSignIn&flowEntry=SignUp#FirstName=&LastName=). Note that you do NOT need a Gmail email account - [you can use your non-Gmail email address to create one instead](https://support.google.com/accounts/answer/27441?hl=en#existingemail).\n",
+"All you need is a Google account (google identity) and a web browser. If you don't have a Google account, you can learn how to get one [here](https://accounts.google.com/signup/v2/webcreateaccount?dsh=308321458437252901&continue=https%3A%2F%2Faccounts.google.com%2FManageAccount&flowName=GlifWebSignIn&flowEntry=SignUp#FirstName=&LastName=). Note that you do NOT need a Gmail email account - [you can use your non-Gmail email address to create one instead](https://support.google.com/accounts/answer/27441?hl=en#existingemail).\n",
 "\n",
-"<font color='red'>**WARNING**</font>: if you have a Google account that was provided by your organization, it may not be suitable for this tutorial due to the restrictions imposed by your organization. "
+"<font color='red'>**WARNING**</font>: if you have a Google account that was provided by your organization, it may not be suitable for this tutorial if the organization managing your account has restrictions in place related to GCP! If you experience issues using your organization account, please switch to a personal one (you can create one just for the purposes of this tutorial, if you prefer)."
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"# Let's do it!\n",
+"# Activate GCP for your account and create a GCP project\n",
 "\n",
 "1. Go to https://console.cloud.google.com/, and accept Terms and conditions.\n",
 "\n",
@@ -129,9 +129,9 @@
 "source": [
 "## Locate and add `bigquery-public-data` project\n",
 "\n",
-"`bigquery-public-data` is a public project that contains BigQuery tables with IDC metadata (we will work with those in the part 2 of this series). To tavigate those metadata tables we need to manually add this project.\n",
+"`bigquery-public-data` is a public project that contains BigQuery tables with IDC metadata (we will work with those in the part 2 of this series). To navigate those metadata tables you need to manually add this project to your workspace.\n",
 "\n",
-"1. Navigate to the BigQuery console: https://console.cloud.google.com/bigquery, and click the `+ ADD DATA` button.\n",
+"1. Open the BigQuery console: https://console.cloud.google.com/bigquery, and click the `+ ADD DATA` button.\n",
 "\n",
 "<img src=\"https://www.dropbox.com/s/cg99cyn1uzigw7s/add_data.png?raw=1\" alt=\"add data\" width=\"400\"/>\n",
 "\n",

notebooks/getting_started/part2_searching_basics.ipynb

Lines changed: 11 additions & 10 deletions
@@ -98,7 +98,7 @@
 "id": "kZMMjky1898-"
 },
 "source": [
-"## What does it mean to search?\n",
+"## How do I search?\n",
 "\n",
 "When you search, or _query_, the IDC catalog, you specify the criteria that the metadata describing the selected files should satisfy. \n",
 "\n",
@@ -112,9 +112,11 @@
 "\n",
 "Although it would be very nice to just state what you need in free form, in practice queries need to be written in a formal way.\n",
 "\n",
-"IDC organizes all of the metadata into large tables, where each row corresponds to one image file (as of writing, IDC indexes ~42 million files) and each column represents a metadata attribute present in one or more files in IDC (currently, we have hundreds of such attributes). \n",
+"IDC organizes all of the metadata into large tables, where each row corresponds to one image file (as of IDC data release v12, we index ~42 million files) and each column represents a metadata attribute present in one or more files in IDC (currently, we index hundreds of such attributes). \n",
 "\n",
-"IDC metadata tables are maintained in [GCP BigQuery](https://cloud.google.com/bigquery), with only a tiny subset of the attributes indexed in the catalog available via the [IDC Portal exploration page](https://imaging.datacommons.cancer.gov/explore/). IDC metadata can be queried using Structured Query Language (SQL), and does not require learning any IDC-specific API. "
+"IDC metadata tables are maintained in [GCP BigQuery](https://cloud.google.com/bigquery), with only a tiny subset of the attributes indexed in the catalog available via the [IDC Portal exploration page](https://imaging.datacommons.cancer.gov/explore/). IDC metadata can be queried using Structured Query Language (SQL), and does not require learning any IDC-specific API. \n",
+"\n",
+"In the following steps of the tutorial we will use just a few of the attributes (SQL table columns) to get started. You will be able to use the same principles and SQL queries to extend your search criteria to include any of the other attributes indexed by IDC."
 ]
 },
 {
@@ -125,7 +127,7 @@
 "source": [
 "## First query and BigQuery workspace\n",
 "\n",
-"To get started, let's build the queries that replicate the information about IDC data shown in the IDC Portal.\n",
+"To get started, let's build the queries that replicate some of the information about IDC data shown in the IDC Portal.\n",
 "\n",
 "As the very first query, let's get the list of all the image collections available in IDC. Here is that query:\n",
 "\n",
@@ -158,7 +160,7 @@
 "\n",
 "This name is like an address that allows you to locate the specific table in BigQuery. This \"address\" consists of three components: <project_id>.<dataset_id>.<table_id>\n",
 "\n",
-"1. `bigquery-public-data` is a public GCP _project_ that is maintained by Google Public Datasets Program. IDC-curated BigQuery tables with the metadata about IDC images is included in this project.\n",
+"1. `bigquery-public-data` is the ID of a public GCP _project_ that is maintained by Google Public Datasets Program. IDC-curated BigQuery tables with the metadata about IDC images are included in this project.\n",
 "2. `idc_current` is a _dataset_ within the `bigquery-public-data` project. Think of BigQuery datasets as containers that are used to organize and control access to the tables within the project.\n",
 "3. `dicom_all` is one of the tables within the `idc_current` dataset. As you spend more time learning about IDC, you will hopefully leverage other tables available in that dataset.\n",
 "\n",
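As a minimal sketch of the three-part address described in the hunk above, the following Python snippet assembles `<project_id>.<dataset_id>.<table_id>` and splits it back into its components. The helper names `table_address` and `split_table_address` are hypothetical; they are not part of the notebook or of any BigQuery library.

```python
from typing import Tuple


def table_address(project_id: str, dataset_id: str, table_id: str) -> str:
    """Join the three components into a fully qualified BigQuery table name."""
    return f"{project_id}.{dataset_id}.{table_id}"


def split_table_address(address: str) -> Tuple[str, str, str]:
    """Split a fully qualified name back into (project, dataset, table)."""
    parts = address.split(".")
    if len(parts) != 3:
        raise ValueError(
            f"expected <project_id>.<dataset_id>.<table_id>, got {address!r}"
        )
    return parts[0], parts[1], parts[2]


# The identifiers below are the ones named in the notebook text.
address = table_address("bigquery-public-data", "idc_current", "dicom_all")
print(address)
```

Inside a SQL query, an address like this is normally wrapped in backticks, because the project ID contains hyphens.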
@@ -179,11 +181,11 @@
 "source": [
 "## Same query using Python SDK\n",
 "\n",
-"BigQuery SQL workspace is a very convenient tool for exploring schemas of the tables, experimenting with writing and debugging queries, profiling their execution. But you can also run those queries programmatically, which is very convenient if you want to load the result of the query into a pandas dataframe, or just perform your searches programmatically.\n",
+"BigQuery SQL workspace is a very convenient tool for exploring schemas of the tables, experimenting with writing and debugging queries, profiling their execution. But you can also run those queries programmatically, which is very convenient if you want to direct the result of the query into a pandas dataframe, or just perform your searches programmatically.\n",
 "\n",
-"BigQuery API is implemented in a variety of languages, with the python bindings available in the `google-cloud-bigquery` package. Conveniently, this package is pre-installed in Colab!\n",
+"BigQuery API support is implemented in a variety of languages, with the python bindings available in the `google-cloud-bigquery` package. Conveniently, this package is pre-installed in Colab!\n",
 "\n",
-"HINT: SQL query syntax is not sensitive to indentation or capitalization - although those are quite helpful to make the query more readable!"
+"**HINT**: SQL query syntax is not sensitive to indentation or capitalization - although those are quite helpful to make the query more readable!"
 ]
 },
 {
{
@@ -221,7 +223,7 @@
 "source": [
 "## Exploring other IDC portal attributes via SQL\n",
 "\n",
-"Next we will explore a few other attributes that are available in the IDC portal (with a few exceptions, the mapping is pretty straightforward):\n",
+"Next we will explore a few other attributes that are available in the [IDC portal](https://imaging.datacommons.cancer.gov/) (with a few exceptions, the mapping is pretty intuitive):\n",
 "\n",
 "![portal_filters](https://www.dropbox.com/s/qt3dhzara1ap7s3/portal_filters.png?raw=1)\n",
 "\n",
@@ -301,7 +303,6 @@
 "# we specified in the beginning of the notebook!\n",
 "bq_client = bigquery.Client(my_ProjectID)\n",
 "\n",
-"# Execution of this cell will fail unless you wrote the query below!\n",
 "selection_query = \"\"\"\n",
 "SELECT\n",
 " collection_id,\n",