Learn Web Scraping in 15 Minutes


weixin_26713521


Original article: https://towardsdatascience.com/learn-web-scraping-in-15-minutes-27e5ebb1c28e


What is Web Scraping?

Web scraping, also known as web data extraction, is the process of retrieving, or “scraping”, data from a website. The collected data is then exported into a format that is more useful to the user, such as a spreadsheet, or made available through an API. Although web scraping can be done manually, automated tools are usually preferred because they cost less and work faster.

Is Web Scraping Legal?

The simplest way to find out whether a site permits scraping is to check its robots.txt file. You can find this file by appending “/robots.txt” to the website's domain. If all bots, indicated by “User-agent: *”, are blocked/disallowed in the robots.txt file, then you are not allowed to scrape the site. For this article, I am scraping the Flipkart website, so the robots.txt file is at www.flipkart.com/robots.txt.
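The robots.txt check can also be done in code. Below is a minimal sketch using Python's standard-library urllib.robotparser; the helper name can_scrape, the example.com URLs, and the two sample robots.txt bodies are assumptions made for illustration and are not Flipkart's actual rules.

```python
from urllib.robotparser import RobotFileParser


def can_scrape(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt contents, for illustration only.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
BLOCK_SOME = "User-agent: *\nDisallow: /checkout\n"

print(can_scrape(BLOCK_ALL, "https://example.com/products"))   # everything is disallowed
print(can_scrape(BLOCK_SOME, "https://example.com/products"))  # path not listed
print(can_scrape(BLOCK_SOME, "https://example.com/checkout"))  # path is disallowed
```

To check a live site instead of a string, RobotFileParser also provides set_url() and read(), which download the file over the network before can_fetch() is called.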

Libraries used for Web Scraping

BeautifulSoup: BeautifulSoup is a Python library for pulling data out of HTML and XML files. It works with your favourite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.

Pandas: Pandas is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool, built on top of the Python programming language.

Why BeautifulSoup?

It is an incredible tool for pulling information out of a webpage. You can use it to extract tables, lists, and paragraphs, and you can also apply filters to extract only the information you need. For more info, you can refer to the BeautifulSoup documentation.
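To make the kinds of extraction mentioned above concrete, here is a small sketch that pulls paragraphs, list items, and table rows out of an HTML string, and filters paragraphs by CSS class. The HTML snippet, the product names, and the "intro" class are invented for illustration; a real page would have its own structure.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, standing in for a downloaded page.
HTML = """
<html><body>
  <p class="intro">Top phones this week.</p>
  <p>Prices change often.</p>
  <ul><li>Phone A</li><li>Phone B</li></ul>
  <table>
    <tr><th>Name</th><th>Price</th></tr>
    <tr><td>Phone A</td><td>100</td></tr>
    <tr><td>Phone B</td><td>200</td></tr>
  </table>
</body></html>
"""

# "html.parser" is Python's built-in parser, so no extra install is needed.
soup = BeautifulSoup(HTML, "html.parser")

# Every paragraph on the page.
paragraphs = [p.get_text() for p in soup.find_all("p")]

# A filter: only paragraphs carrying the class "intro".
intro = [p.get_text() for p in soup.find_all("p", class_="intro")]

# List items.
items = [li.get_text() for li in soup.find_all("li")]

# Table rows, each as a list of header or data cell texts.
rows = [
    [cell.get_text() for cell in tr.find_all(["th", "td"])]
    for tr in soup.find_all("tr")
]

print(paragraphs)
print(intro)
print(items)
print(rows)
```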

Scraping Flipkart Website

from bs4 import BeautifulSoup 
import requests 
import csv
import pandas as pd

First, we import BeautifulSoup and requests, the two most important libraries for web scraping, along with csv and pandas.

requests: requests is one of the Python packages that made the language interesting
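As a rough sketch of how requests and BeautifulSoup fit together, the snippet below downloads a page and reads its title. The helper names fetch_html and page_title and the 10-second timeout are assumptions for illustration, not code from the original article.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its HTML, raising on HTTP error statuses."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def page_title(html: str) -> str:
    """Return the text of the page's <title> tag, or an empty string if absent."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.get_text(strip=True) if soup.title else ""
```

A call such as page_title(fetch_html(url)) would chain the two steps. Keeping the download separate from the parsing makes the parsing logic easy to try on saved HTML without touching the network.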
