Leveraging Chrome Developer Tools for Dynamic Web Scraping

So you have a website you want to scrape, but you don’t necessarily know which package to use or how to go about the process. This is common when you first start out with web scraping. Learning how to efficiently get what you want from a website takes time and many scripts.

In this article, we will go through the process of planning a web scraping script.

In this article, you will learn

  1. The workflow of web scraping
  2. How to quickly analyse a website for data extraction
  3. How to leverage Chrome Tools for web scraping

Understanding the workflow of web scraping

There are three key areas to consider when planning a web scrape:

  1. Inspecting the website
  2. Planning the data you require and its selectors/attributes on the page
  3. Writing the code

In this article, we will focus on inspecting the website. This is the first and most important part of web scraping. It is also the least talked about, which is why you’re here reading this!

1. Is the data on one page, spread across several pages, or reached through multiple click-throughs?

When you first think of a website to extract data from, you will already have some idea of the data you want.

Information on a single page is the easiest case, and the code will be simpler. Nested pages of information require many more HTTP requests, and the code will be more complex as a result. Knowing this helps you plan which functions the scrape will need.
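To make the difference concrete, here is a minimal sketch of how the functions for a nested scrape might be laid out. It uses only the Python standard library. The URLs, the `FAKE_SITE` dictionary, and the `fetch` callable are hypothetical stand-ins so the example runs without a network; in a real script `fetch` would make the HTTP request.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html, base_url):
    """Return absolute URLs for all links found in html."""
    parser = LinkCollector()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


def scrape_nested(start_url, fetch):
    """Fetch a listing page, then fetch every page it links to.

    `fetch` is any callable mapping a URL to an HTML string, so the
    scrape costs one request for the listing plus one per detail page.
    """
    listing_html = fetch(start_url)
    detail_urls = extract_links(listing_html, start_url)
    return {url: fetch(url) for url in detail_urls}


# Hypothetical site held in memory so the sketch runs offline.
FAKE_SITE = {
    "https://example.com/items": '<a href="/items/1">One</a><a href="/items/2">Two</a>',
    "https://example.com/items/1": "<h1>Item one</h1>",
    "https://example.com/items/2": "<h1>Item two</h1>",
}

pages = scrape_nested("https://example.com/items", FAKE_SITE.__getitem__)
print(len(pages), "detail pages fetched")
```

A single-page scrape would need only the parsing function; the nested case adds link extraction and a second round of requests, which is where the extra complexity comes from.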

2. How is the website built? How heavily is JavaScript used?

Knowing how the website is built is vital early on in the process, because it often dictates how easy or difficult the scrape will be. Almost all pages on the internet use HTML and CSS, and there are good frameworks in Python that deal with those easily. However, it is just as important to know whether JavaScript is being used to manipulate the page or to load new information, which may or may not be accessible to your scraper.
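One quick way to check this is to see whether the text you can see in the browser is present in the raw HTML the server returns. The sketch below is a rough heuristic, not a definitive test: the class and function names are made up for illustration, and the two HTML strings are hypothetical examples of a server-rendered page and a JavaScript-rendered one.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Gather text outside <script>/<style> and count <script> tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.script_count = 0
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_count += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore anything that sits inside a script or style element.
        if not self._skip_depth:
            self.chunks.append(data)


def appears_in_raw_html(html, expected_text):
    """True if expected_text is part of the page's visible markup."""
    parser = VisibleText()
    parser.feed(html)
    return expected_text in " ".join(parser.chunks)


# Hypothetical responses: one rendered on the server, one filled in by JavaScript.
STATIC_PAGE = "<html><body><p>Price: 10</p></body></html>"
DYNAMIC_PAGE = (
    '<html><body><div id="app"></div>'
    '<script>render("Price: 10")</script></body></html>'
)

print(appears_in_raw_html(STATIC_PAGE, "Price: 10"))
print(appears_in_raw_html(DYNAMIC_PAGE, "Price: 10"))
```

If the text you want is missing from the raw HTML, it is probably being loaded or generated by JavaScript, and a plain HTTP request plus an HTML parser will not be enough on its own.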

3. Does the page require a login?

Logging in presents a specific challenge in web scraping. If a login is required, it not only slows down the scrape but also makes it far easier for your scraper to be blocked.

4. Is there dynamically generated content?

By this we mean: is there enough, at a quick glance, to tell that the website is interactive and most likely generated by JavaScript? The more interactivity on the website, the more challenging the scrape.

5. Is there infinite scrolling enabled?

Infinite scrolling is a JavaScript-driven feature: as you scroll, new requests are made to a server, and depending on whether those requests are generic or very specific, either the DOM is manipulated directly or data from the server is made available to the page. Infinite scrolling therefore requires HTTP requests to be made and new information to be displayed on the page. This is important to understand because we usually need either to simulate those requests ourselves or to drive a browser that triggers them for us.
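If the requests behind the scroll can be reproduced directly, simulating them often comes down to a loop that asks for the next batch until nothing comes back. The sketch below assumes a hypothetical endpoint that takes an offset and a limit; `fake_fetch_batch` stands in for the real HTTP call so the example runs offline, and real sites may page by cursor or page number instead.

```python
def collect_all(fetch_batch, batch_size=3, max_batches=100):
    """Request successive batches until the server returns an empty one.

    `fetch_batch(offset, limit)` is assumed to return a list of items.
    `max_batches` is a safety cap so a misbehaving endpoint cannot loop forever.
    """
    items = []
    offset = 0
    for _ in range(max_batches):
        batch = fetch_batch(offset, batch_size)
        if not batch:
            break
        items.extend(batch)
        offset += len(batch)
    return items


# Hypothetical feed of eight posts served three at a time.
FAKE_FEED = [f"post-{i}" for i in range(8)]


def fake_fetch_batch(offset, limit):
    """Stand-in for the HTTP request the page makes while scrolling."""
    return FAKE_FEED[offset:offset + limit]


all_posts = collect_all(fake_fetch_batch)
print(len(all_posts), "posts collected")
```

When the requests cannot be reproduced this way, the alternative is to drive a real browser and let it scroll for you.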

6. Are there drop-down menus?

Any sort of drop-down menu can present a particular challenge in web scraping, because more often than not you need to simulate browser activity to get to the data.
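Browser simulation is not always unavoidable, though. When the menu is a plain HTML select element, its options are already in the page source, and you can sometimes read them and request each value directly. The sketch below shows that idea; the HTML, the `region` field, and the search URL are hypothetical, and it assumes the site accepts the option value as a query parameter, which you would need to confirm in the browser first.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class SelectOptions(HTMLParser):
    """Map each <select> name to its list of (value, label) options."""

    def __init__(self):
        super().__init__()
        self.options = {}
        self._select = None
        self._in_option = False
        self._value = None
        self._label = []

    def _flush(self):
        # Store the option being read, if any. Also covers unclosed <option> tags.
        if self._in_option:
            label = "".join(self._label).strip()
            value = self._value if self._value is not None else label
            self.options[self._select].append((value, label))
            self._in_option = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "select":
            self._select = attrs.get("name", "")
            self.options[self._select] = []
        elif tag == "option" and self._select is not None:
            self._flush()
            self._in_option = True
            self._value = attrs.get("value")
            self._label = []

    def handle_endtag(self, tag):
        if tag in ("option", "select"):
            self._flush()
        if tag == "select":
            self._select = None

    def handle_data(self, data):
        if self._in_option:
            self._label.append(data)


# Hypothetical drop-down markup.
MENU_HTML = """
<select name="region">
  <option value="eu">Europe</option>
  <option value="us">United States</option>
</select>
"""

parser = SelectOptions()
parser.feed(MENU_HTML)

# Assumed URL pattern: one request per option value.
urls = [
    "https://example.com/search?" + urlencode({"region": value})
    for value, _label in parser.options["region"]
]
print(urls)
```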

7. Are there forms?

Forms are used on many websites, either to search for data or to log in to part of the website. An HTML form often invokes JavaScript to post data to a server, which authenticates the request and responds with the information you want. JavaScript can make HTTP requests and is often the means of changing the information on a page without reloading it. So you need to understand how the website does this: is there an API that responds to HTTP requests invoked by JavaScript? Can it be used to get the information you want, or will the interaction need to be automated?
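For a simple form that submits without any JavaScript, one way to explore this is to read the form's action, method, and fields from the HTML and rebuild the request yourself. The following is a sketch under that assumption. The form markup, field names, and URLs are hypothetical, and the request is only constructed, not sent.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin
from urllib.request import Request


class FormFields(HTMLParser):
    """Record the first form's action, method, and named <input> defaults."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = "get"
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action", "")
            self.method = (attrs.get("method") or "get").lower()
        elif tag == "input" and self._in_form and attrs.get("name"):
            # Hidden inputs often carry tokens the server expects back.
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def build_form_request(page_url, html, values):
    """Build (but do not send) the request the form would submit."""
    parser = FormFields()
    parser.feed(html)
    payload = {**parser.fields, **values}
    target = urljoin(page_url, parser.action or "")
    if parser.method == "post":
        body = urlencode(payload).encode("utf-8")
        return Request(target, data=body, method="POST")
    return Request(f"{target}?{urlencode(payload)}", method="GET")


# Hypothetical search form with a hidden token field.
FORM_HTML = """
<form action="/search" method="post">
  <input type="hidden" name="token" value="abc123">
  <input type="text" name="q">
</form>
"""

request = build_form_request("https://example.com/home", FORM_HTML, {"q": "laptops"})
print(request.get_method(), request.full_url, request.data)
```

If the form's submit is handled by JavaScript instead, the Network tab in Chrome's developer tools shows the request that is actually sent, which may differ from what the markup suggests.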

8. Are there tables that hold the information you need?

Let’s face it, tables are a pain: painful to create in HTML and painful to scrape. Be wary of table data; you might be in for a headache. Fortunately, there are frameworks that can grab table data quickly, but be prepared: it is not 100% certain you will be able to use them, and you may have to loop over the rows manually to get the data you want.
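As an illustration of the manual route, here is a minimal sketch that loops over table rows with the standard library's HTML parser. It assumes a simple, well-formed table whose first row is the header and whose cells are all explicitly closed; the table contents are made up. Merged cells or nested tables would need more work.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect each <tr> as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
            self._cell = None
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def table_to_dicts(html):
    """Treat the first row as the header and return one dict per data row."""
    parser = TableRows()
    parser.feed(html)
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical table markup.
TABLE_HTML = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Widget</td><td>9.99</td></tr>
  <tr><td>Gadget</td><td>19.50</td></tr>
</table>
"""

records = table_to_dicts(TABLE_HTML)
print(records)
```

All values come back as strings, so converting prices or dates to proper types is a separate step.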
