Python Web Programming: Retrieving Webpages Through Python Programming

weixin_26713521

Original article: https://medium.com/@ODSC/retrieving-webpages-through-python-programming-8f3bae8518a5

The internet and the World Wide Web (WWW) are probably the most prominent sources of information today. Most of that information is retrievable through HTTP. HTTP was originally invented to share pages of hypertext (hence the name Hypertext Transfer Protocol), which eventually started the WWW.

An HTTP retrieval happens every time we request a web page from one of our devices. The exciting part is that we can perform these operations programmatically to automate the retrieval and processing of information.

This article is an excerpt from the book Python Automation Cookbook, Second Edition by Jaime Buelta, a comprehensive and updated edition that helps you develop a sharp understanding of the fundamentals required to automate business processes through real-world tasks, such as developing your first web scraping application, analyzing information to generate spreadsheet reports with graphs, and communicating through automatically generated emails.

In this article, we will learn how to use Python to fetch web pages over HTTP. Python has an HTTP client in its standard library. Further, the fantastic requests module makes obtaining web pages very convenient.
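
As a rough illustration of the standard-library client mentioned above, the following sketch builds a GET request with urllib.request and wraps the download in a small helper. The names build_request and fetch_text, the User-Agent string, and the timeout are assumptions made for this example and do not come from the book. Only the request object is built here; calling fetch_text would perform the actual network access.

```python
from urllib.request import Request, urlopen


def build_request(url):
    """Build a GET request with an explicit (made-up) User-Agent header."""
    return Request(url, headers={"User-Agent": "example-client/0.1"})


def fetch_text(url, timeout=10):
    """Download a page and decode it, falling back to UTF-8."""
    with urlopen(build_request(url), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


# Building the request does not touch the network.
request = build_request("https://httpbin.org/forms/post")
print(request.get_method(), request.full_url)
```

With the requests module, the equivalent download is a single call to requests.get(url), which is why the recipe below uses it.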

[Related article: Web Scraping News Articles in Python]

Interacting with forms

A common element present in web pages is forms. Forms are a way of sending values to a web page, for example, to create a new comment on a blog post, or to submit a purchase.

Browsers present forms so you can input values and send them in a single action after pressing the submit or equivalent button. We’ll see how to create this action programmatically in this recipe.

Getting ready

We’ll work against the test server https://httpbin.org/forms/post, which allows us to send a test form and sends back the submitted information.

The following is an example form to order a pizza:

Figure 1: Rendered form

You can fill the form in manually and see it return the information in JSON format, including extra information such as the browser being used.

The following is the frontend of the web form that is generated:

Figure 2: Filled-in form

The following screenshot shows the backend of the web form that is generated:

Figure 3: Returned JSON content

We need to analyze the HTML to see the accepted data for the form. The source code is as follows:

Figure 4: Source code

Check the names of the inputs: custname, custtel, custemail, size (a radio option), topping (a multiselection checkbox), delivery (time), and comments.

How to do it…

1. Import the requests, BeautifulSoup, and re modules:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> import re

2. Retrieve the form page, parse it, and print the input fields. Check that the posting URL is /post (not /forms/post):

>>> response = requests.get('https://httpbin.org/forms/post')
>>> page = BeautifulSoup(response.text, 'html.parser')
>>> form = page.find('form')
>>> {field.get('name') for field in form.find_all(re.compile('input|textarea'))}
{'delivery', 'topping', 'size', 'custemail', 'comments', 'custtel', 'custname'}
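
The step above asks you to check the posting URL, which lives in the action attribute of the form element. As a minimal sketch of that check, the example below swaps in the standard library's html.parser for Beautiful Soup and runs against a trimmed-down, hypothetical copy of the form, so it needs no network access. The SAMPLE_HTML snippet and the FormInspector class are assumptions made for illustration only.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical, trimmed-down form used only for this illustration.
SAMPLE_HTML = """
<form method="post" action="/post">
  <input name="custname">
  <input type="radio" name="size" value="small">
  <textarea name="comments"></textarea>
</form>
"""


class FormInspector(HTMLParser):
    """Record the form's action and method plus the names of its fields."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = None
        self.field_names = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "form":
            self.action = attributes.get("action")
            self.method = attributes.get("method")
        elif tag in ("input", "textarea") and "name" in attributes:
            self.field_names.add(attributes["name"])


inspector = FormInspector()
inspector.feed(SAMPLE_HTML)

# Resolve the (possibly relative) action against the URL of the form page.
post_url = urljoin("https://httpbin.org/forms/post", inspector.action)
print(post_url)
print(sorted(inspector.field_names))
```

With the Beautiful Soup objects from the recipe, form.get('action') reads the same attribute, and urljoin resolves it the same way.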

3. Note that textarea is a valid input element defined in HTML. Prepare the data to be posted as a dictionary. Check that the values are as defined in the form:

>>> data = {'custname': "Sean O'Connell",
...         'custtel': '123-456-789',
...         'custemail': 'sean@oconnell.ie',
...         'size': 'small',
...         'topping': ['bacon', 'onion'],
...         'delivery': '20:30',
...         'comments': ''}

4. Post the values and check that the response is the same as returned in the browser:

>>> response = requests.post('https://httpbin.org/post', data)
>>> response
<Response [200]>
>>> response.json()
{'args': {},
 'data': '',
 'files': {},
 'form': {'comments': '',
          'custemail': 'sean@oconnell.ie',
          'custname': "Sean O'Connell",
          'custtel': '123-456-789',
          'delivery': '20:30',
          'size': 'small',
          'topping': ['bacon', 'onion']},
 'headers': {'Accept': '*/*',
             'Accept-Encoding': 'gzip, deflate',
             'Connection': 'close',
             'Content-Length': '140',
             'Content-Type': 'application/x-www-form-urlencoded',
             'Host': 'httpbin.org',
             'User-Agent': 'python-requests/2.22.0'},
 'json': None,
 'origin': '89.100.17.159',
 'url': 'https://httpbin.org/post'}

How it works…

The requests module encodes the data and sends it in the configured format. By default, it sends POST data in the application/x-www-form-urlencoded format.
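
To make the encoding concrete, the sketch below reproduces an application/x-www-form-urlencoded body for the recipe's data with the standard library's urllib.parse.urlencode. This is an illustration of the format, not a look inside requests itself; the doseq=True flag is what turns the topping list into repeated topping=... pairs.

```python
from urllib.parse import urlencode

data = {
    "custname": "Sean O'Connell",
    "custtel": "123-456-789",
    "custemail": "sean@oconnell.ie",
    "size": "small",
    "topping": ["bacon", "onion"],
    "delivery": "20:30",
    "comments": "",
}

# doseq=True emits one key=value pair per item of a list value.
body = urlencode(data, doseq=True)
print(body)
print(len(body))
```

The resulting body is 140 characters long, which agrees with the Content-Length of 140 reported in the response above.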

The key here is to respect the format of the form and the values it accepts. Incorrect values can produce an error, typically a 400 error, which indicates a problem with the client's request.
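
As a small sketch of how a script might react to such errors, the helper below classifies a status code with the standard library's http.HTTPStatus. The function name describe_status and the wording of its output are assumptions for this example. Note that httpbin simply echoes whatever is posted, so the recipe's test server is unlikely to return a 400 here.

```python
from http import HTTPStatus


def describe_status(status_code):
    """Return a short description saying which side caused an error."""
    status = HTTPStatus(status_code)
    if 400 <= status_code < 500:
        kind = "client error"
    elif 500 <= status_code < 600:
        kind = "server error"
    else:
        kind = "no error"
    return f"{status_code} {status.phrase}: {kind}"


print(describe_status(200))
print(describe_status(400))
```

With requests, the code to inspect is response.status_code, and response.raise_for_status() raises an exception for 4xx and 5xx responses.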

[Related article: Building a Scraper Using Browser Automation]

There's more…

Other than following the format of forms and inputting valid values, the main problem when working with forms is dealing with the various mechanisms that sites use to prevent spam and abusive behavior. You will often have to ensure that you have downloaded a form before submitting it, a measure that guards against multiple submissions and Cross-Site Request Forgery (CSRF). Such protection usually relies on a token embedded in the downloaded form.

To obtain the CSRF token, you first need to download the form, as shown in the recipe, read the value of the token, and submit it along with the rest of the form data. Note that the token can have different names; this is just an example:

>>> form.find(attrs={'name': 'token'}).get('value')
'ABCEDF12345'
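
The download-then-submit flow can be sketched as follows. This example again uses the standard library's html.parser instead of Beautiful Soup and works on a hypothetical form, so the field name token, the SAMPLE_HTML snippet, and the HiddenFieldCollector class are all assumptions for illustration. It collects every hidden input and merges the values into the data to be posted.

```python
from html.parser import HTMLParser

# Hypothetical form with a hidden anti-CSRF field named "token".
SAMPLE_HTML = """
<form method="post" action="/post">
  <input type="hidden" name="token" value="ABCEDF12345">
  <input name="custname">
</form>
"""


class HiddenFieldCollector(HTMLParser):
    """Collect the name and value of every hidden input in the page."""

    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "input" and attributes.get("type") == "hidden":
            self.hidden[attributes.get("name")] = attributes.get("value")


collector = HiddenFieldCollector()
collector.feed(SAMPLE_HTML)

# Send the hidden values back together with the user-supplied fields.
data = {"custname": "Sean O'Connell"}
data.update(collector.hidden)
print(data)
```

In practice, sites often tie the token to a cookie, so it is probably safest to download and submit the form through the same requests.Session object.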

In this article, we learned how to retrieve a web form, parse it, and print its input fields using Python's HTTP client, and then how to submit values to it. We also explored the role and application of the requests, Beautiful Soup, and re modules.

About the Author

Jaime Buelta has been a full-time Python developer since 2010 and is a regular speaker at PyCon Ireland. He has been a professional programmer for over two decades, with rich exposure to many different technologies throughout his career. He has developed software for a variety of fields and industries, including aerospace, networking and communications, industrial SCADA systems, video game online services, and financial services.

Editor’s note: Interested in learning more about coding beyond just retrieving webpages through Python? Check out some of these upcoming similar ODSC talks:

ODSC Europe: “Programming with Data: Python and Pandas” — In this training, you will learn how to accelerate your data analyses using the Python language and Pandas, a library specifically designed for tabular data analysis.

ODSC Europe: “Introduction to Linear Algebra for Data Science and Machine Learning With Python” — The goal of this session is to show you that you can start learning the math needed for machine learning and data science using code.

Translated from: https://medium.com/@ODSC/retrieving-webpages-through-python-programming-8f3bae8518a5
