bigquery使用教程_使用PHP和Google BigQuery探索Github的公共活动

bigquery使用教程

If you’ve been following along with my previous articles about Github’s API, you know that Github’s developers are doing their best to ease the pain of interacting with Github data. In this article, we’re going to take a look at the Github public events API and we will build a small demo along the way.

如果您一直在关注我以前关于Github API的文章,那么您就会知道Github的开发人员正在尽最大的努力来减轻与Github数据交互的痛苦。 在本文中,我们将研究Github公共事件API,并在此过程中构建一个小型演示。

Github Logo

什么是Github公共活动? (What are Github Public Events?)

Github events are interactions made by users, like pushing, merging, creating repositories, etc. It’s like storing the history of Github. However, we are limited to the last 300 events, and that’s a problem if you want to search through the data of the whole history.

Github事件是用户进行的交互,例如推,合并,创建存储库等。这就像存储Github的历史一样。 但是,我们仅限于最近的300个事件,如果您要搜索整个历史记录的数据,那将是一个问题。

Ilya Grigorik wanted to keep track of his open source project, but due to the limitation of the Github public events API, he decided to create the GithubArchive website where he queries the Github API and stores the results. You can query the archive with links like http://data.githubarchive.org/2015-01-01-15.json.gz and you’ll get an archive containing the JSON payload. This solution is not really efficient and doesn’t help with querying the history for data, so, when Google BigQuery was released, Grigorik moved all of his data to the new storage and made it public.

Ilya Grigorik希望跟踪他的开源项目,但是由于Github公共事件API的局限性,他决定创建GithubArchive网站,在该网站上他查询Github API并存储结果。 您可以使用http://data.githubarchive.org/2015-01-01-15.json.gz链接查询档案,您将获得一个包含JSON有效负载的档案。 该解决方案并非真正有效,并且无法查询数据历史记录,因此,在发布Google BigQuery时,Grigorik将其所有数据移至新存储中并公开。

为什么选择BigQuery? (Why BigQuery?)

Google BigQuery was created to solve the queries latency problem. When you’re dealing with big data, you need to have the right hardware and infrastructure to handle the load properly. Now, you have the opportunity to move your append-only database to Google’s infrastructure and see millions of data rows processed in seconds. If you’re not familiar with Google APIs, be sure to check this guide about starting a new Google API project. After that, be sure to enable the BigQuery API on your Google developers console.

创建Google BigQuery是为了解决查询延迟问题。 处理大数据时,您需要具有正确的硬件和基础架构来正确处理负载。 现在,您可以将仅附加数据库移动到Google的基础架构,并在几秒钟内查看数百万个数据行。 如果您不熟悉Google API,请务必查看有关启动新Google API项目的指南 。 之后,请确保在您的Google开发人员控制台上启用BigQuery API。

搭建环境 (Setting Up the Environment)

I will be using Laravel as my framework of choice, and Homestead Improved for my VM. We will grab our Google credentials from the config/google.php file, and to follow Laravel’s security conventions I will store those credentials inside the .env file.

我将使用Laravel作为我的选择框架,并为我的VM使用Homestead Improvementd 。 我们将从config/google.php文件中获取Google凭据,并遵循Laravel的安全约定,我将这些凭据存储在.env文件中。

// config/google.php

return [
    'app_name'      => env('app_name'),
    'client_id'     => env('client_id'),
    'client_secret' => env('client_secret'),
    'api_key'       => env('api_key')
];
// .env

APP_DEBUG=true
app_name='Optional app name'
client_id='CHANGE IT'
client_secret='CHANGE IT'
api_key='CHANGE IT'

I will also create a new Service Provider to hold our Google services bindings.

我还将创建一个新的服务提供商来保存我们的Google服务绑定。

// app/Providers/BigQueryServiceProvider.php

class BigQueryServiceProvider extends ServiceProvider 
{

	public function register()
	{
        $this->app->bind('GoogleClient', function () {
            $googleClient = new \Google_Client();
            $googleClient->setAccessToken(\Session::get("token"));

            return $googleClient;
        });

        $this->app->bind('bigquery', function () {
            $googleClient = \App::make('GoogleClient');
            $bigquery = new \Google_Service_Bigquery($googleClient);

            return $bigquery;
        });
	}
}
// config/app.php

// ...
'providers' => [
    // ...
    'App\Providers\BigQueryServiceProvider',
]

Since our application will require the user authorization to query the API, I tried to make the process easier by creating a Middleware for guarding our routes.

由于我们的应用程序需要用户授权才能查询API,因此我试图通过创建用于保护我们的路由的中间件来简化此过程。

// app/Http/Middleware/GoogleLogin.php

class GoogleLogin
{
  public function handle($request, Closure $next)
  {
      $ga = \App::make('\App\Services\GoogleLogin');
      if (!$ga->isLoggedIn()) {
          return redirect('login');
      }

      return $next($request);
  }
}
// app/Kernel.php

protected $routeMiddleware = [
    // ...
    'google_login' => '\App\Http\Middleware\GoogleLogin'
];

Google BigQuery查询控制台 (Google BigQuery Query Console)

Before we start working with BigQuery using the PHP library, lets start exploring the BigQuery Query console. The console has a Compose Query button to let you write your queries. Underneath, we have a Query History link which shows our queries history. Since our queries will take a while to respond, we can execute a query as a job and ask for its stats after a while – the Job History link holds the list of the current jobs.

在开始使用PHP库使用BigQuery之前,让我们开始探索BigQuery查询控制台 。 控制台上有一个“ Compose Query按钮,可让您编写查询。 在下面,我们有一个Query History链接,其中显示了我们的查询历史记录。 由于我们的查询需要一段时间才能响应,因此我们可以将查询作为一项工作执行,并在一段时间后询问其统计信息-“ Job History链接保存了当前工作的列表。

BigQuery Console

If you want to take a look at the table schema and explore the available fields, you can navigate to the left panel and select the Github timeline table.

如果要查看表架构并浏览可用字段,可以导航到左侧面板并选择Github时间轴表。

Table schema

Google BigQuery API和Github (Google BigQuery API and Github)

As an example, let’s try to retrieve the top ten popular projects on Github. I will define popularity as the number of forks. The Google_Service_Bigquery class is responsible for all types of interactions with the Google BigQuery API. I previously introduced the term jobs, which means that we can execute a query in the background and we can fetch its status over time to see if it completed or not.

例如,让我们尝试检索Github上最受欢迎的十个项目。 我将受欢迎程度定义为叉子的数量。 Google_Service_Bigquery类负责与Google BigQuery API的所有类型的交互。 我以前推出的长期jobs ,这意味着我们可以在后台执行查询,如果它完成与否,我们可以随着时间的推移,看获取其状态。

// app/Http/routes.php

Route::any('/topTen', ['middleware' => 'google_login', 'as' => 'topTen', 'uses' => 'BigQueryAPIController@topTen']);

// app/Http/Controllers/BigQueryAPIController.php

class BigQueryAPIController extends Controller
{
    public function __construct()
    {
        $this->bigqueryService = \App::make("bigquery");
    }
    
    public function topTen()
    {
        $projectID = 'modular-robot-22';
        $query_str = "SELECT repository_url, MAX(repository_forks) as max_forks" .
            " FROM githubarchive:github.timeline" .
            " GROUP EACH BY repository_url" .
            " ORDER BY max_forks DESC" .
            " LIMIT 10";

        $query = new \Google_Service_Bigquery_QueryRequest();
        $query->setQuery($query_str);

        $result = $this->bigqueryService->jobs->query($projectID, $query);
        
        $fields = $result->getSchema()->getFields();
        $rows = $result->getRows();

        dump($fields);
        foreach ($rows as $row) {
            dump($row->getF());
        }
    }    
}

The project ID can be found on the Google Developers Console. The query string is fetching the respository_url, and MAX(repository_forks). The getSchema method returns a Google_Service_Bigquery_TableSchema instance which contains the selected column names from the response, and the getRows method returns an iterable list of Google_Service_Bigquery_TableRow. The output looks like the following.

您可以在Google Developers Console上找到项目ID。 查询字符串正在获取respository_urlMAX(repository_forks)getSchema方法返回一个Google_Service_Bigquery_TableSchema实例,该实例包含从响应中选择的列名,而getRows方法返回一个可迭代的Google_Service_Bigquery_TableRow列表。 输出如下所示。

Top Ten Query Dump

The githubarchive:github.timeline dataset has been deprecated and won’t be updated anymore. The new way to work with data is through a specific time range (year dataset, month dataset, day dataset); you can visit githubarchive to learn more about this. Now we must update our query above to work with last year’s data.

githubarchive:github.timeline数据集已被弃用,不再更新。 处理数据的新方法是通过特定的时间范围(年数据集,月数据集,日数据集); 您可以访问githubarchive以了解更多信息。 现在,我们必须更新上面的查询以使用去年的数据。

// app/Http/Controllers/BigQueryAPIController.php

public function topTen()
{
    $projectID = 'modular-robot-22';
    $query_str = "SELECT repository_url, MAX(repository_forks) as max_forks" .
        " FROM githubarchive:year.2014" .
        " GROUP EACH BY repository_url" .
        " ORDER BY max_forks DESC" .
        " LIMIT 10";

    //...
}

按语言列出的十大存储库 (Top Ten Repositories by Language)

We will adapt the previous query to pull repositories of a specific language. Our view markup will contain a form and a set of language options.

我们将调整先前的查询以提取特定语言的存储库。 我们的视图标记将包含一个表单和一组语言选项。

// resources/views/home.blade.php

<form action="/" method="POST" class="form-inline">
    <label for="language">Language: </label>
    <select name="language" id="language" class="form-control">
        <option value="PHP">PHP</option>
        <option value="JavaScript">JavaScript</option>
        <option value="CSS">CSS</option>
        <option value="Python">Python</option>
    </select>

    <input type="submit" class="form-control" value="Submit"/>
</form>

When the user submits the form, we retrieve the selected language and fill the condition in our SQL query. Because we will print our query result as a table, we will split the table header part from the table body.

当用户提交表单时,我们检索选择的语言并在SQL查询中填写条件。 因为我们会将查询结果打印为表格,所以我们将从表格主体中拆分表格标题部分。

// app/Http/Controllers/BigQueryAPIController.php

public function topTen()
{
    if (!\Input::has('language')) {
        return \View::make('home', ['repos' => null]);
    }

    $language = \Input::get('language', 'PHP');
    $projectID = 'modular-robot-647';
    $query_str = "SELECT repository_url, MAX(repository_forks) as max_forks" .
        " FROM githubarchive:year.2014" .
        " WHERE repository_language='$language'" .
        " GROUP EACH BY repository_url" .
        " ORDER BY max_forks DESC" .
        " LIMIT 10";

    $query = new \Google_Service_Bigquery_QueryRequest();
    $query->setQuery($query_str);

    $result = $this->bigqueryService->jobs->query($projectID, $query);
    // getting the table header
    $fields = $result->getSchema()->getFields();
    // query response rows
    $repos = $result->getRows();
    
    return \View::make('home', ['repos' => $repos, 'tableHeader' => $fields]);
}

The only thing changed from the previous query is the where condition, which filters languages by user input. When printing the result table I tried to keep things generic for most queries.

与上一个查询相比,唯一的变化是where条件,该条件通过用户输入过滤语言。 在打印结果表时,我尝试使大多数查询保持通用。

// resources/views/home.blade.php

@if($repos)
    <table class="table table-hover">
        <tr>
            @foreach($tableHeader as $item)
                <th>{{$item->getName()}}</th>
            @endforeach
        </tr>
        @foreach($repos as $repo)
            <tr>
                @foreach($repo->getF() as $item)
                    <td>{{$item->getV()}}</td>
                @endforeach
            </tr>
        @endforeach
    </table>
@else
    <h2>Nothing to show</h2>
@endif

The getF() method returns the row content, while the getV() method returns the cell content. You can inspect result values by dumping the class names and going through properties.

getF()方法返回行内容,而getV()方法返回单元格内容。 您可以通过转储类名称并遍历属性来检查结果值。

PHP top ten

It looks like CodeIgniter was on top when I tested this, but this is not the case at the time of writing – Laravel is the most popular PHP framework on Github. Make sure to check the Github trending repositories page to see almost realtime data.

当我测试此代码时,看起来好像是CodeIgniter在顶部,但在撰写本文时情况并非如此– Laravel是Github上最受欢迎PHP框架。 确保检查Github 趋势存储库页面以查看几乎实时的数据。

If you chose to work with data added post 01/01/2015, the scheme will also contain nested records plus a JSON encoded payload field. This applies to working with the day datasets as well – you can use the Google BigQuery console to inspect the dataset schema. BigQuery has some useful JSON functions to work with JSON encoded data, be sure to check them out.

如果您选择使用01/01/2015日之后添加的数据,则该方案还将包含嵌套记录以及JSON编码的有效负载字段。 这也适用于处理当天数据集-您可以使用Google BigQuery控制台检查数据集架构。 BigQuery具有一些有用的JSON函数 ,可用于JSON编码的数据,请务必将其检出。

结论 (Conclusion)

In this article, we introduced the Github Public Events and how you can use them through BigQuery. The final demo includes a basic usage example, but you may want to experiment with the API and get yourself ready for the annual Github data challenge with some cool Github data facts. What do you think about the Github Events API? Do you have any cool ideas for the data challenge? Let us know!

在本文中,我们介绍了Github公共事件以及如何通过BigQuery使用它们。 最终的演示包括一个基本的用法示例,但是您可能想尝试使用API​​,并通过一些很棒的Github数据事实为年度Github数据挑战做准备。 您如何看待Github Events API? 您对数据挑战有什么好主意吗? 让我们知道!

翻译自: https://www.sitepoint.com/exploring-githubs-public-events-php-google-bigquery/

bigquery使用教程

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值