How Scraping NBA Stats Is Cooler Than Michael Jordan

Introduction

This summer I picked up a new hobby of following the NBA. As a data enthusiast, I wanted to understand how this NBA season was different on paper from the previous ones, since it was played inside the bubble without any fans.

To acquire the relevant data, I started with the Python library Beautiful Soup. To my surprise, however, the data wasn't stored in the HTML source of the page. After some digging, I discovered that the NBA stats website was built with AngularJS, which means the site is rendered client-side instead of server-side.

What Is Client-Side Rendering?

The HTML that is rendered is only a template and doesn't hold any data. The JavaScript in the server response fetches the data from an API and uses it to build the page client-side.

Basically, when you view the page's source code, you won't find the data, only a template of the webpage.

[Image: the NBA Stats website]
[Image: Ctrl + U takes you to the page source]

Let's Get Started

In this article, we will scrape the NBA stats website for the league players' stats. After hours of researching, I settled on a process that is a lot simpler than Beautiful Soup.

Finding the API Endpoint from the Website

The first step is to open the webpage you want to scrape in your web browser (preferably Google Chrome or Firefox) and open the developer tools. To do this, just right-click and select Inspect.

[Image: right-click, then Inspect]

This will open a panel to the right or at the bottom of the page. Select the Network tab, filter by XHR, and reload the page.

[Image: the Inspect panel with Network and XHR selected]

Once we reload the page, all the requests made by the page will be visible. At this point, you should do some digging to find the request you want. Most likely, the endpoint will be named after the webpage you are looking at.

Since we are looking at the league players' stats page, the endpoint is probably named something similar. Select each request and preview the results to find the correct endpoint.

[Image: select each request and preview the response]

Once you find the right endpoint, you are all set to move on to the next step.

Calling the API Endpoint to Get the Data

To call the API, we will use the requests Python package. We need three components for the request: the URL, the headers, and the parameters.

The first part is the URL. In our case, since we are accessing the league player stats, we can get it from the previous step.

Under the Headers tab, select General and copy the first part of the Request URL.

[Image: the Request URL under the Headers tab]

Next, we need the request headers, which can also be found under the same Headers tab, in the "Request Headers" subsection.

[Image: the Request Headers subsection]

Header as a dictionary
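As a sketch, the copied headers might look like the dictionary below. The exact keys and values vary by browser and session, so treat these as illustrative placeholders rather than the canonical set; copy your own from the dev tools.

```python
# Request headers copied from the "Request Headers" panel in the browser's
# dev tools. Illustrative values only: the stats API tends to reject
# requests that don't look like they come from a browser, so copy the
# headers from your own session.
headers = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/85.0.4183.102 Safari/537.36"),
    "Accept": "application/json, text/plain, */*",
    "Referer": "https://www.nba.com/",
    "x-nba-stats-origin": "stats",
    "x-nba-stats-token": "true",
}
```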

The final component we need is the parameters, which can be found in the "Query String Parameters" subsection of the Headers tab.

[Image: the Query String Parameters subsection]

Parameters as a dictionary
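Likewise, the query-string parameters become a plain dictionary. The keys and values below are examples for a 2019-20 regular-season, per-game view; substitute whatever appears in your own dev tools for the page you inspected.

```python
# Query-string parameters copied from the "Query String Parameters" panel.
# Example values only; adjust them to match the page you inspected.
params = {
    "LeagueID": "00",               # "00" denotes the NBA
    "Season": "2019-20",
    "SeasonType": "Regular Season",
    "PerMode": "PerGame",           # per-game averages rather than totals
    "MeasureType": "Base",          # traditional box-score stats
}
```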

Now that we have all three parts, calling the API is simple. The response can then be manipulated into a data frame for analysis.

GET request

The final request will look something like this:
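A self-contained sketch of the whole request, assuming the endpoint found in the dev tools is `stats.nba.com/stats/leaguedashplayerstats` and that the JSON response nests its column names and rows under `resultSets` (the layout the stats API returns); the headers and parameters here are abbreviated stand-ins for the ones copied earlier:

```python
import requests
import pandas as pd

# Endpoint, headers, and parameters gathered in the previous steps.
# Abbreviated stand-in values; copy your own from the dev tools.
URL = "https://stats.nba.com/stats/leaguedashplayerstats"
HEADERS = {
    "User-Agent": "Mozilla/5.0",
    "Referer": "https://www.nba.com/",
    "x-nba-stats-origin": "stats",
    "x-nba-stats-token": "true",
}
PARAMS = {
    "LeagueID": "00",
    "Season": "2019-20",
    "SeasonType": "Regular Season",
    "PerMode": "PerGame",
    "MeasureType": "Base",
}

def result_set_to_frame(payload):
    """Convert the stats API JSON into a DataFrame.

    The API nests column names under resultSets[0]["headers"] and the
    data rows under resultSets[0]["rowSet"].
    """
    result = payload["resultSets"][0]
    return pd.DataFrame(result["rowSet"], columns=result["headers"])

def fetch_player_stats():
    """Call the endpoint and return the league player stats as a DataFrame."""
    response = requests.get(URL, headers=HEADERS, params=PARAMS, timeout=30)
    response.raise_for_status()
    return result_set_to_frame(response.json())
```

From `fetch_player_stats()` you get one row per player with whatever columns the endpoint exposes; everything after that is ordinary pandas.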

[Image: the resulting data frame]

Thank You!!

Congratulations 👏!! We have successfully scraped the NBA stats website.

PS: This process definitely works for stats.nba.com. It should also work for any other website built with a client-side framework such as AngularJS. If the website you are targeting is built with a server-side framework such as Django or Ruby on Rails, then our friend Beautiful Soup will give you a hand.
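For the server-rendered case, a minimal Beautiful Soup sketch is enough; the HTML below is a made-up stand-in for a server-rendered stats table, not a real page:

```python
from bs4 import BeautifulSoup

# A made-up server-rendered page: the data is already in the HTML,
# so no API call is needed.
html = """
<table id="stats">
  <tr><th>Player</th><th>PTS</th></tr>
  <tr><td>LeBron James</td><td>25.3</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# Walk every row of the table and collect the cell texts.
rows = [
    [cell.get_text() for cell in tr.find_all(["th", "td"])]
    for tr in soup.find("table", id="stats").find_all("tr")
]
# rows -> [["Player", "PTS"], ["LeBron James", "25.3"]]
```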

Good luck with your web scraping journey! I hope this post is helpful.

Feel free to reach out to me on Twitter or LinkedIn if you have any questions.

Translated from: https://towardsdatascience.com/how-scraping-nba-stats-is-cooler-than-michael-jordan-49d7562ce3ef
