Client-side web scraping with JavaScript using jQuery and Regex


Original article: https://www.freecodecamp.org/news/client-side-web-scraping-with-javascript-using-jquery-and-regex-5b57a271cb86/


by Codemzy


When I was building my first open-source project, codeBadges, I thought it would be easy to get user profile data from all the main code learning websites.

I was familiar with API calls and GET requests. I thought I could just use jQuery to fetch the data from the various APIs and use it.

    var name = 'codemzy';
    $.get('https://api.github.com/users/' + name, function(response) {
      var followers = response.followers;
    });

Well, that was easy. But it turns out that not every website has a public API that you can just grab the data you want from.

But just because there is no public API doesn’t mean you need to give up! You can use web scraping to grab the data, with only a little extra work.

Let’s see how we can use client-side web scraping with JavaScript.

For an example, I will grab my user information from my public freeCodeCamp profile. But you can use these steps on any public HTML page.

The first step in scraping the data is to grab the full page HTML using a jQuery .get request.

    var name = "codemzy";
    $.get('https://www.freecodecamp.com/' + name, function(response) {
      console.log(response);
    });

Awesome, the whole page source code just logged to the console.

Note: If you get an error at this stage along the lines of "No 'Access-Control-Allow-Origin' header is present on the requested resource", don’t fret. Scroll down to the Don’t Let CORS Stop You section of this post.

That was easy. Using JavaScript and jQuery, the above code requests a page from www.freecodecamp.org, like a browser would. And freeCodeCamp responds with the page. Instead of a browser running the code to display the page, we get the HTML code.

And that’s what web scraping is: extracting data from websites.

Ok, the response is not exactly as neat as the data we get back from an API.

But… we have the data, in there somewhere.

Once we have the source code, the information we need is in there. We just have to grab it!

We can search through the response to find the elements we need.

Let’s say we want to know how many challenges the user has completed, from the user profile response we got back.

At the time of writing, a camper’s completed challenges are organized in tables on the user profile. So to get the total number of challenges completed, we can count the number of rows.

One way is to wrap the whole response in a jQuery object, so that we can use jQuery methods like .find() to get the data.

    // number of challenges completed
    var challenges = $(response).find('tbody tr').length;

This works fine, and we get the right result. But it is not a good way to get the result we are after. Turning the response into a jQuery object actually loads the whole page, including all the external scripts, fonts and stylesheets from that page… Uh oh!

We need a few bits of data. We really don’t need the page to load, and certainly not all the external resources that come with it.

We could strip out the script tags and then run the rest of the response through jQuery. To do this, we could use Regex to look for script patterns in the text and remove them.
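As a rough sketch of that idea, the snippet below removes script blocks from an HTML string with a single regular expression. The helper name `stripScripts` and the sample markup are made up for illustration, and a regex like this is not a full HTML parser: it only targets ordinary `<script>` tags and leaves images and stylesheets in place.

```javascript
// Hypothetical helper: remove <script>...</script> blocks from an HTML string.
function stripScripts(html) {
  // [\s\S]*? lazily matches anything, including newlines, up to the first closing tag
  return html.replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, '');
}

// Made-up sample markup standing in for a real response
var sample = '<table><tbody><tr><td>a</td></tr></tbody></table>' +
  '<script src="app.js"></script><script>alert("hi");</script>';

var cleaned = stripScripts(sample);
console.log(cleaned);
```

In the browser, the cleaned string could then be passed to jQuery, for example `$(stripScripts(response)).find('tbody tr').length`.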

Or better still, why not use Regex to find what we are looking for in the first place?

    // number of challenges completed (table head rows are removed first)
    var challenges = (response.replace(/<thead>[\s\S]*?<\/thead>/g, '').match(/<tr>/g) || []).length;

And it works! By using the Regex code above, we strip out the table head rows (that did not contain any challenges), and then match all table rows to count the number of challenges completed.

It’s even easier if the data you want is just there in the response in plain text. At the time of writing, the user points were in the HTML like <h1 class="flat-top text-primary">[ 1498 ]</h1>, just waiting to be scraped.
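To make the two steps easier to follow, here is the same strip-then-count idea run against a small made-up string instead of a live profile page. The sample markup is an assumption for illustration only; the real page will look different.

```javascript
// Made-up sample: two tables, each with a head row and some challenge rows
var response =
  '<table><thead><tr><th>Challenge</th></tr></thead>' +
  '<tbody><tr><td>One</td></tr><tr><td>Two</td></tr></tbody></table>' +
  '<table><thead>\n<tr><th>Project</th></tr>\n</thead>' +
  '<tbody><tr><td>Three</td></tr></tbody></table>';

// Step 1: remove every <thead>...</thead> block so head rows are not counted
var withoutHeads = response.replace(/<thead>[\s\S]*?<\/thead>/g, '');

// Step 2: collect every remaining <tr>; match() returns null when there are none
var rows = withoutHeads.match(/<tr>/g) || [];

console.log(rows.length); // 3
```

Without step 1 the count would be 5, because the two head rows would be included.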

    var points = response.match(/<h1 class="flat-top text-primary">\[ ([\d]*?) \]<\/h1>/)[1];

In the above Regex pattern, we match the h1 element we are looking for, including the [ ] that surrounds the points, and group any number inside with ([\d]*?). We get an array back: the first element [0] is the entire match and the second [1] is our group match (our points).
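A small sketch of what that array looks like, using the h1 line quoted earlier as the input string. One detail worth noting is that the captured group comes back as a string, so the sketch converts it with `parseInt` before treating it as a number.

```javascript
// Sample line copied from the markup quoted in the text
var html = '<h1 class="flat-top text-primary">[ 1498 ]</h1>';

var result = html.match(/<h1 class="flat-top text-primary">\[ ([\d]*?) \]<\/h1>/);

console.log(result[0]); // the entire match
console.log(result[1]); // the captured group, as a string

// Convert to a number before doing arithmetic with it
var points = parseInt(result[1], 10);
console.log(points + 1);
```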

Regex is useful for matching all sorts of patterns in strings, and it is great for searching through our response to get the data we need.

You can use the same 3 step process to scrape profile data from a variety of websites:

1. Use client-side JavaScript
2. Use jQuery to scrape the data
3. Use Regex to filter the data for the relevant information
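One thing to watch with the filtering step is that `match()` returns null when a page changes its markup, so indexing the result directly throws. The sketch below wraps both patterns from this post in a helper that falls back to safe values instead. The name `parseProfile` and the sample markup are hypothetical, and the patterns are tied to the freeCodeCamp layout described above.

```javascript
// Hypothetical helper: pull the fields we care about out of a profile page.
// Falls back to null or 0 when the page layout no longer matches the patterns.
function parseProfile(html) {
  var pointsMatch = html.match(/<h1 class="flat-top text-primary">\[ (\d+) \]<\/h1>/);
  var rows = html.replace(/<thead>[\s\S]*?<\/thead>/g, '').match(/<tr>/g);
  return {
    points: pointsMatch ? parseInt(pointsMatch[1], 10) : null,
    challenges: rows ? rows.length : 0
  };
}

// Made-up sample markup
var page = '<h1 class="flat-top text-primary">[ 1498 ]</h1>' +
  '<table><thead><tr><th>Challenge</th></tr></thead>' +
  '<tbody><tr><td>One</td></tr></tbody></table>';

console.log(parseProfile(page));
console.log(parseProfile('<p>layout changed</p>'));
```

In the browser, the helper would be called inside the `$.get` callback with the response string as its argument.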

That worked for me until I hit a problem: CORS.

Don’t Let CORS Stop You!

CORS, or Cross-Origin Resource Sharing, can be a real problem with client-side web scraping.

For security reasons, browsers restrict cross-origin HTTP requests initiated from within scripts. And because we are using client-side JavaScript on the front end for web scraping, CORS errors can occur.
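Two URLs share an origin only when their scheme, host, and port all match. The browser makes this comparison itself, so the sketch below is purely illustrative: it uses the standard `URL` class to show which requests count as cross-origin. The helper name `isCrossOrigin` and the `example.com` addresses are placeholders, not part of the original project.

```javascript
// Hypothetical helper: true when two URLs differ in scheme, host, or port
function isCrossOrigin(pageUrl, targetUrl) {
  return new URL(pageUrl).origin !== new URL(targetUrl).origin;
}

// example.com is a placeholder for wherever your own page is served from
var sameSite = isCrossOrigin('https://example.com/app', 'https://example.com/api/users');
var otherSite = isCrossOrigin('https://example.com/app', 'https://www.codewars.com/users/codemzy');
var otherScheme = isCrossOrigin('http://example.com/', 'https://example.com/');

console.log(sameSite);    // false
console.log(otherSite);   // true
console.log(otherScheme); // true
```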

Here’s an example trying to scrape profile data from CodeWars…

    var name = "codemzy";
    $.get('https://www.codewars.com/users/' + name, function(response) {
      console.log(response);
    });

At the time of writing, running the above code gives you a CORS-related error.

If there is no Access-Control-Allow-Origin header from the place you’re scraping, you can run into problems.

The bad news is, you need to run these sorts of requests server-side to get around this issue.

Whaaaaaaaat, this is supposed to be client-side web scraping?!

The good news is, thanks to lots of other wonderful developers that have run into the same issues, you don’t have to touch the back end yourself.

Staying firmly within our front end script, we can use cross-domain tools such as Any Origin, Whatever Origin, All Origins, crossorigin and probably a lot more. I have found that you often need to test a few of these to find the one that will work on the site you are trying to scrape.
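Testing a few tools by hand gets tedious, so one option is to try them in order and stop at the first that responds. The sketch below shows only that control flow. The names `tryInOrder`, `failingTool`, and `workingTool` are hypothetical, and the stand-in functions replace real requests so the example runs without a network.

```javascript
// Hypothetical helper: call each fetcher in turn until one succeeds.
// A fetcher is any function(onSuccess, onError) that wraps one cross-domain tool.
function tryInOrder(fetchers, onSuccess, onAllFailed) {
  var index = 0;
  function next() {
    if (index >= fetchers.length) {
      onAllFailed();
      return;
    }
    var fetcher = fetchers[index];
    index += 1;
    fetcher(onSuccess, next); // on error, move on to the next tool
  }
  next();
}

// Stand-in fetchers so the sketch runs without a network
function failingTool(onSuccess, onError) { onError(); }
function workingTool(onSuccess, onError) { onSuccess('<html>profile page</html>'); }

var received = null;
tryInOrder([failingTool, workingTool], function (html) {
  received = html;
}, function () {
  received = 'all tools failed';
});
console.log(received);
```

In a real page each fetcher would wrap a jQuery request to one tool. Whether a failed request actually reaches the error callback depends on how that tool is called, so treat this as a starting point.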

Back to our CodeWars example, we can send our request via a cross-domain tool to bypass the CORS issue.

    var name = "codemzy";
    // encode the full target URL, username included, before passing it to the cross-domain tool
    var url = "http://anyorigin.com/go?url=" + encodeURIComponent("https://www.codewars.com/users/" + name) + "&callback=?";
    $.get(url, function(response) {
      console.log(response);
    });

And just like magic, we have our response.