How to carry out data exploration of large datasets using BigQuery, Pandas, and Jupyter?

Originally published April 16, 2018, 14:28:42

Analyzing data using Datalab and BigQuery

Overview

Duration is 1 min

In this lab you analyze a large (70 million rows, 8 GB) airline dataset using Google BigQuery and Cloud Datalab.

What you learn
In this lab, you:

  • Launch Cloud Datalab
  • Invoke a BigQuery query
  • Create graphs in Datalab

This lab illustrates how you can carry out data exploration of large datasets while continuing to use familiar tools like Pandas and Jupyter. The “trick” is to do the first part of your aggregation in BigQuery, get back a Pandas dataset, and then work with the smaller Pandas dataset locally. Cloud Datalab provides a managed Jupyter experience, so you don’t need to run notebook servers yourself.
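The pattern can be sketched in pure Pandas terms. The toy data and names below are illustrative stand-ins for the real BigQuery call used later in this lab; the point is that the aggregation shrinks the data to something comfortable to explore locally.

```python
# Toy illustration of the pattern: aggregate the large "raw" data first
# (BigQuery's job in this lab), then explore the much smaller summary locally.
# All data here is made up; only the shape of the workflow matters.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
raw = pd.DataFrame({                      # stands in for the 70M-row table
    'departure_delay': rng.integers(-10, 60, size=100_000),
    'arrival_delay': rng.integers(-20, 90, size=100_000),
})

# The "first part of the aggregation" -- done server-side by BigQuery in the lab
summary = (raw.groupby('departure_delay')['arrival_delay']
              .agg(num_flights='count', median_arrival='median')
              .reset_index())

print(len(raw), '->', len(summary))       # e.g. 100000 -> a few dozen rows
```

The summary has one row per distinct departure delay, so the data you actually plot and manipulate locally is tiny compared with the source table.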

Launch Cloud Datalab

Duration: 2

To launch Cloud Datalab:

Step 1
Open Cloud Shell. The cloud shell icon is at the top right of the Google Cloud Platform web console:

Step 2
In Cloud Shell, type:

gcloud compute zones list

Note: Please pick a zone in a geographically close region from the following: us-east1, us-central1, asia-east1, europe-west1. These are the regions that currently support Cloud ML Engine jobs. Please verify here, since this list may have changed after this lab was last updated. For example, if you are in the US, you may choose us-east1-c as your zone.

Step 3
In Cloud Shell, type:

datalab create mydatalabvm --zone <ZONE>

Note: Replace <ZONE> with the zone name you picked in the previous step.

Note: follow the prompts during this process.

Datalab will take about 5 minutes to start. Move on to the next step.

Invoke BigQuery

Duration is 3 min

To invoke a BigQuery query:

Step 1
Navigate to the BigQuery console by selecting BigQuery from the top-left-corner (“hamburger”) menu.

Note: If you get an error on the BigQuery console that starts with Error: Access Denied: Project qwiklabs-resources:, then click on the drop-down menu and switch to your Qwiklabs project.

Step 2
In the BigQuery Console, click on Compose Query. Then, select Show Options and ensure that the Legacy SQL box is NOT checked (we will be using Standard SQL).

Step 3
In the query textbox, type:

#standardSQL
SELECT
  departure_delay,
  COUNT(1) AS num_flights,
  APPROX_QUANTILES(arrival_delay, 5) AS arrival_delay_quantiles
FROM
  `bigquery-samples.airline_ontime_data.flights`
GROUP BY
  departure_delay
HAVING
  num_flights > 100
ORDER BY
  departure_delay ASC

What is the median arrival delay for flights that left 35 minutes early? _

(Answer: the typical flight that left 35 minutes early arrived 28 minutes early.)

Step 4
Look back at Cloud Shell, and follow any prompts. If asked for an SSH passphrase, just hit return (for no passphrase).

Step 5 (Optional)
Can you write a query to find the airport pair (departure and arrival airport) that had the maximum number of flights between them?

Hint: you can group by multiple fields.

One possible answer:

#standardSQL
SELECT
  departure_airport,
  arrival_airport,
  COUNT(1) AS num_flights
FROM
  `bigquery-samples.airline_ontime_data.flights`
GROUP BY
  departure_airport,
  arrival_airport
ORDER BY
  num_flights DESC
LIMIT
  10

Draw graphs in Cloud Datalab

Duration is 7 min

Step 1
If necessary, wait for Datalab to finish launching. Datalab is ready when you see a message prompting you to do a “Web Preview”.

Step 2
Click on the Web Preview icon on the top-right corner of the Cloud Shell ribbon. Switch to port 8081.

Note: If the cloud shell used for running the datalab command is closed or interrupted, the connection to your Cloud Datalab VM will terminate. If that happens, you may be able to reconnect using the command ‘datalab connect mydatalabvm’ in your new Cloud Shell.

Step 3
In the Cloud Datalab home page (in the browser), navigate into notebooks. You should now be in datalab/notebooks/.

Step 4

Start a new notebook by clicking on the +Notebook icon. Rename the notebook to be flights.

Step 5
In a cell in Datalab, type the following, then click Run

query = """
SELECT
  departure_delay,
  COUNT(1) AS num_flights,
  APPROX_QUANTILES(arrival_delay, 10) AS arrival_delay_deciles
FROM
  `bigquery-samples.airline_ontime_data.flights`
GROUP BY
  departure_delay
HAVING
  num_flights > 100
ORDER BY
  departure_delay ASC
"""

import google.datalab.bigquery as bq
df = bq.Query(query).execute().result().to_dataframe()
df.head()

Note that we have gotten the results from BigQuery as a Pandas dataframe.
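One way to see how the BigQuery ARRAY column lands in Pandas is to inspect a single cell. The frame below is a local stand-in with made-up numbers, shaped like the real query result (APPROX_QUANTILES with 10 returns 11 boundary values):

```python
# Local stand-in for the query result; all numbers here are made up.
import pandas as pd

df = pd.DataFrame({
    'departure_delay': [-35.0, -30.0],
    'num_flights': [1059, 2000],
    'arrival_delay_deciles': [
        [-58, -44, -38, -34, -30, -28, -25, -22, -18, -12, 3],
        [-50, -40, -35, -31, -28, -25, -22, -19, -15, -10, 8],
    ],
})

# Each cell of the ARRAY column is an ordinary Python list
print(type(df['arrival_delay_deciles'].iloc[0]))   # <class 'list'>
print(len(df['arrival_delay_deciles'].iloc[0]))    # 11
```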

In what Python data structure are the deciles stored?

Step 6
In the next cell in Datalab, type the following, then click Run

import pandas as pd
percentiles = df['arrival_delay_deciles'].apply(pd.Series)
percentiles = percentiles.rename(columns = lambda x : str(x*10) + "%")
df = pd.concat([df['departure_delay'], percentiles], axis=1)
df.head()

What has the above code done to the columns in the Pandas DataFrame?
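To see the transformation in isolation, here is the same three-step recipe run on a tiny made-up frame: apply(pd.Series) splits each list into its own numbered column, rename() turns the integer labels 0..10 into percentile strings, and concat() re-attaches departure_delay.

```python
# Tiny made-up frame, shaped like one row of the real query result
import pandas as pd

toy = pd.DataFrame({
    'departure_delay': [-35.0],
    'arrival_delay_deciles': [[-58, -44, -38, -34, -30, -28, -25, -22, -18, -12, 3]],
})

pct = toy['arrival_delay_deciles'].apply(pd.Series)    # one column per list element
pct = pct.rename(columns=lambda x: str(x * 10) + '%')  # 0..10 -> '0%'..'100%'
wide = pd.concat([toy['departure_delay'], pct], axis=1)

print(list(wide.columns))
# ['departure_delay', '0%', '10%', '20%', ..., '90%', '100%']
```

The list column has been exploded into eleven numeric percentile columns, which is what makes the plotting in the next step possible.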

Step 7
In the next cell in Datalab, type the following, then click Run

without_extremes = df.drop(['0%', '100%'], axis=1)
without_extremes.plot(x='departure_delay', xlim=(-30,50), ylim=(-50,50));

Suppose we were creating a machine learning model to predict the arrival delay of a flight. Do you think departure delay is a good input feature? Is this true at all ranges of departure delays?

Hint: Try removing the xlim and ylim from the plotting command.

Cleanup (Optional)

Duration is 1 min

Step 1

You could leave the Datalab instance running until your class ends. The default machine type is relatively inexpensive. However, if you want to be frugal, you can stop and restart the instance between labs or when you go home for the day. To do so, follow the next two steps.

Step 2

Click on the person icon in the top-right corner of your Datalab window and click on the button to STOP the VM.

Step 3

You are not billed for stopped VMs. Whenever you want to restart Datalab, open Cloud Shell and type in:

datalab connect mydatalabvm

This will restart the virtual machine and launch the Docker image that runs Datalab.

Summary

In this lab, you learned how to carry out data exploration of large datasets using BigQuery, Pandas, and Jupyter. The “trick” is to do the first part of your aggregation in BigQuery, get back a Pandas dataset, and then work with the smaller Pandas dataset locally. Cloud Datalab provides a managed Jupyter experience, so you don’t need to run notebook servers yourself.
