Insights from Data with BigQuery: Challenge Lab Tutorial

This Medium article is a detailed walkthrough of the steps I took to solve the challenge lab of the Insights from Data with BigQuery skill badge on the Google Cloud Platform (Qwiklabs). I got access to this lab through the Google Cloud Ready Facilitator Program. Thanks to Google!

So far, I have completed over 100 labs and 23 quests on Qwiklabs. My profile is linked below for reference.

This lab is only recommended for students who have completed the labs in the Insights from Data with BigQuery Quest. Knowledge of SQL and BigQuery is also needed to solve this challenge lab. Are you up for the challenge? Let’s go!

Dataset Used

The dataset we will be using in this challenge lab is bigquery-public-data.covid19_open_data.covid19_open_data. It contains COVID-19 data on a per-country basis worldwide, and it is the only table we need for this skill badge tutorial.
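
If you want to see which columns are available before writing any queries, you can optionally list the table's schema first. This is a minimal sketch, not a graded step, and it assumes the public dataset's metadata is queryable from your lab project:

-- Optional: list the columns of the covid19_open_data table
SELECT column_name, data_type
FROM `bigquery-public-data.covid19_open_data.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'covid19_open_data'
ORDER BY ordinal_position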

A BigQuery tutorial can be found in the reference below:

Challenge Scenario

There are 10 small tasks in this challenge lab, all of which must be completed to score 100/100: 9 SQL queries and 1 Data Studio report. This tutorial lists the steps I took to solve all ten challenges in the lab. The ten tasks are as follows:

  1. Building a SQL query that outputs the total number of confirmed cases.

  2. Building a SQL query that outputs the worst affected areas.

  3. Building a SQL query that identifies the hotspots in the US.

  4. Building a SQL query that outputs the fatality ratio.

  5. Building a SQL query that identifies a specific day according to the given constraints.

  6. Building a SQL query that outputs the number of days with zero net new cases.

  7. Building a SQL query that outputs the doubling rate.

  8. Building a SQL query that outputs the recovery rate.

  9. Building a SQL query that outputs the CDGR (Cumulative Daily Growth Rate).

  10. Creating a Data Studio report.

Important Note

Before starting this lab, make sure you only do what is required. Allocating extra resources or doing anything that is not asked for may lead to your account being blocked by the Qwiklabs admins. Don't worry, I ran into this problem myself: the account can easily be unblocked by contacting Qwiklabs support.

Loading the Dataset

  1. In the Cloud Console, once you are fully logged in, go to Menu > BigQuery.

  2. Click + Add Data and then click Explore Public Datasets in the left pane.

  3. Search for covid19_open_data and select "COVID-19 Open Data". Click View Dataset to explore further.

  4. Use the filter to locate the table covid19_open_data under the covid19_open_data dataset.
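
To confirm you are looking at the right table, you can optionally run a quick preview of the columns used throughout the tasks below (a sketch, not one of the graded steps):

SELECT country_name, subregion1_name, date, cumulative_confirmed, cumulative_deceased, cumulative_recovered
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
LIMIT 5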

Detailed Tutorial of Task 1

Task 1 requires the user to execute a query that outputs the total count of confirmed cases on Apr 15, 2020. The output should contain only a single row with the sum of confirmed cases across all countries in the dataset, in a column named total_cases_worldwide.

Copy the query below into the query editor and click RUN.

SELECT
  SUM(cumulative_confirmed) AS total_cases_worldwide
FROM
  `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE
  date = "2020-04-15"

Detailed Tutorial of Task 2

Task 2 requires building a query that answers: “How many states in the US had more than 100 deaths on Apr 10, 2020?” The output field should be named count_of_states.

Hint: NULL values should not be included. (Important)

Copy the query below into the query editor and click RUN.

SELECT
  COUNT(*) AS count_of_states
FROM (
  SELECT
    subregion1_name AS state,
    SUM(cumulative_deceased) AS death_count
  FROM
    `bigquery-public-data.covid19_open_data.covid19_open_data`
  WHERE
    country_name = "United States of America"
    AND date = '2020-04-10'
    AND subregion1_name IS NOT NULL
  GROUP BY
    subregion1_name
)
WHERE death_count > 100

Detailed Tutorial of Task 3

Task 3 requires writing a query that answers: “List all the states in the United States of America that had more than 1000 confirmed cases on Apr 10, 2020.” The output should have two columns named state and total_confirmed_cases, corresponding to the state name and the number of confirmed cases, arranged in descending order.

Copy the query below into the query editor and click RUN.

SELECT
  subregion1_name AS state,
  SUM(cumulative_confirmed) AS total_confirmed_cases
FROM
  `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE
  country_name = "United States of America"
  AND date = "2020-04-10"
GROUP BY subregion1_name
HAVING total_confirmed_cases > 1000
ORDER BY total_confirmed_cases DESC

Detailed Tutorial of Task 4

Task 4 requires building a query in the query editor that answers the following question: “What was the case-fatality ratio in Italy for the month of April 2020?”

The case-fatality ratio is defined as (total deaths / total confirmed cases) * 100. The output should have three columns named total_confirmed_cases, total_deaths and case_fatality_ratio.

Copy the query below into the query editor and click RUN.

SELECT
  SUM(cumulative_confirmed) AS total_confirmed_cases,
  SUM(cumulative_deceased) AS total_deaths,
  (SUM(cumulative_deceased) / SUM(cumulative_confirmed)) * 100 AS case_fatality_ratio
FROM
  `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE
  country_name = "Italy"
  AND date BETWEEN "2020-04-01" AND "2020-04-30"

Detailed Tutorial of Task 5

Task 5 requires building a query that answers the following question: “On what day did the total number of deaths cross 10000 in Italy?”

The query should output the date in a column named “date”, formatted as “yyyy-mm-dd”.

Copy the query below into the query editor and click RUN.

SELECT
  date
FROM
  `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE
  country_name = 'Italy'
  AND cumulative_deceased > 10000
ORDER BY date
LIMIT 1

Detailed Tutorial of Task 6

The given query should be updated to output the correct number of days between 21 Feb 2020 and 15 March 2020 on which India saw zero increase in the number of confirmed cases.
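
The key ingredient in this query is the LAG() window function, which pulls the previous day's value onto the current row so the day-over-day difference can be computed. Below is a minimal sketch on a made-up three-row table (hypothetical numbers, not lab data) that shows the idea; counting the rows where net_new_cases = 0 then gives the number of zero-growth days.

WITH t AS (
  SELECT DATE '2020-03-01' AS date, 10 AS cases UNION ALL
  SELECT DATE '2020-03-02', 10 UNION ALL
  SELECT DATE '2020-03-03', 12
)
SELECT
  date,
  cases,
  cases - LAG(cases) OVER (ORDER BY date) AS net_new_cases  -- returns NULL, 0, 2 for the three rows
FROM t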

Copy the query below into the query editor and click RUN.

WITH india_cases_by_date AS (
  SELECT
    date,
    SUM(cumulative_confirmed) AS cases
  FROM
    `bigquery-public-data.covid19_open_data.covid19_open_data`
  WHERE
    country_name = "India"
    AND date BETWEEN '2020-02-21' AND '2020-03-15'
  GROUP BY
    date
  ORDER BY
    date ASC
),
india_previous_day_comparison AS (
  SELECT
    date,
    cases,
    LAG(cases) OVER (ORDER BY date) AS previous_day,
    cases - LAG(cases) OVER (ORDER BY date) AS net_new_cases
  FROM india_cases_by_date
)
SELECT
  COUNT(date)
FROM
  india_previous_day_comparison
WHERE
  net_new_cases = 0

Detailed Tutorial of Task 7

Using the query from Task 6 as a template, build a query to find the dates on which confirmed cases in the US increased by more than 10% compared to the previous day, between March 22, 2020 and April 20, 2020.

There should be four columns named Date, Confirmed_Cases_On_Day, Confirmed_Cases_Previous_Day and Percentage_Increase_In_Cases.

Copy the query below into the query editor and click RUN.

WITH us_cases_by_date AS (
  SELECT
    date,
    SUM(cumulative_confirmed) AS cases
  FROM
    `bigquery-public-data.covid19_open_data.covid19_open_data`
  WHERE
    country_name = "United States of America"
    AND date BETWEEN '2020-03-22' AND '2020-04-20'
  GROUP BY
    date
  ORDER BY
    date ASC
),
us_previous_day_comparison AS (
  SELECT
    date,
    cases,
    LAG(cases) OVER (ORDER BY date) AS previous_day,
    cases - LAG(cases) OVER (ORDER BY date) AS net_new_cases,
    (cases - LAG(cases) OVER (ORDER BY date)) * 100 / LAG(cases) OVER (ORDER BY date) AS percentage_increase
  FROM us_cases_by_date
)
SELECT
  Date,
  cases AS Confirmed_Cases_On_Day,
  previous_day AS Confirmed_Cases_Previous_Day,
  percentage_increase AS Percentage_Increase_In_Cases
FROM
  us_previous_day_comparison
WHERE
  percentage_increase > 10

Detailed Tutorial of Task 8

Task 8 requires building a query that lists the recovery rates of countries on May 10, 2020, restricted to countries with more than 50K confirmed cases, arranged in descending order and limited to 10 rows. To score full marks, the output columns should be named country, recovered_cases, confirmed_cases and recovery_rate.

Copy the query below into the query editor and click RUN.

WITH cases_by_country AS (
  SELECT
    country_name AS country,
    SUM(cumulative_confirmed) AS cases,
    SUM(cumulative_recovered) AS recovered_cases
  FROM
    `bigquery-public-data.covid19_open_data.covid19_open_data`
  WHERE
    date = "2020-05-10"
  GROUP BY
    country_name
),
recovered_rate AS (
  SELECT
    country,
    cases,
    recovered_cases,
    (recovered_cases * 100) / cases AS recovery_rate
  FROM
    cases_by_country
)
SELECT
  country,
  cases AS confirmed_cases,
  recovered_cases,
  recovery_rate
FROM
  recovered_rate
WHERE
  cases > 50000
ORDER BY recovery_rate DESC
LIMIT 10

Detailed Tutorial of Task 9

Task 9 requires building a query that outputs the correct CDGR in the correct format. The CDGR, or Cumulative Daily Growth Rate, is calculated as:

((last_day_cases / first_day_cases) ^ (1 / days_diff)) - 1

where last_day_cases, first_day_cases and days_diff are defined as:

  • last_day_cases corresponds to the number of confirmed cases on May 10, 2020

  • first_day_cases corresponds to the number of confirmed cases on Feb 02, 2020

  • days_diff corresponds to the number of days between Feb 02 and May 10, 2020
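
As a quick sanity check of the formula itself (with made-up numbers, not the lab's data), you can evaluate it directly in the query editor. Note that dividing two integers in BigQuery Standard SQL returns a FLOAT64, which is why 1 / days_diff works as a fractional exponent:

-- Hypothetical example: 10 cases growing to 1000 cases over 30 days
SELECT POWER(1000 / 10, 1 / 30) - 1 AS cdgr_example  -- roughly 0.166, i.e. about 16.6% average daily growth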

Copy the query below into the query editor and click RUN.

WITH france_cases AS (
  SELECT
    date,
    SUM(cumulative_confirmed) AS total_cases
  FROM
    `bigquery-public-data.covid19_open_data.covid19_open_data`
  WHERE
    country_name = "France"
    AND date IN ('2020-01-24', '2020-05-10')
  GROUP BY
    date
  ORDER BY
    date
),
summary AS (
  SELECT
    total_cases AS first_day_cases,
    LEAD(total_cases) OVER (ORDER BY date) AS last_day_cases,
    DATE_DIFF(LEAD(date) OVER (ORDER BY date), date, day) AS days_diff
  FROM
    france_cases
  LIMIT 1
)
SELECT
  first_day_cases,
  last_day_cases,
  days_diff,
  POWER(last_day_cases / first_day_cases, 1 / days_diff) - 1 AS cdgr
FROM summary

Detailed Tutorial of Task 10

To create the Data Studio report, a number of steps should be followed.

1. First, copy the query below into the query editor and click RUN.

SELECT
  date,
  SUM(cumulative_confirmed) AS country_cases,
  SUM(cumulative_deceased) AS country_deaths
FROM
  `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE
  date BETWEEN '2020-03-15' AND '2020-04-30'
  AND country_name = 'United States of America'
GROUP BY date

2. Click on EXPLORE DATA > Explore with Data Studio.

3. Give access to Data Studio and authorize it to control BigQuery.

If you fail to create a report on your very first login to Data Studio, click the + Blank Report option and accept the Terms of Service. Then go back to the BigQuery page and click Explore with Data Studio again.

4. Create a new Time series chart in the new Data Studio report by selecting Add a chart > Time series Chart.

5. Add country_cases and country_deaths to the Metric field.

6. Click Save to commit the change.

Congratulations!!

This is the skill badge I got after completing this challenge lab :P

Google Cloud — Skill Badge (Image by author)

With this, we have come to the end of this challenge lab. Thanks for reading and following along. Hope you loved it!

My Portfolio and LinkedIn :)

Translated from: https://medium.com/swlh/insights-from-data-with-bigquery-challenge-lab-tutorial-f868992ef9dc
