Web Scraping with Python: Scraping Baidu Search Results by Keyword with Python

This article walks through scraping Baidu search results with Python: analyzing the scraping rules, disguising the request as a browser to get around anti-scraping measures, and implementing the code. In the end, the scraper saves the search results to a local Markdown file as a table.


I. Introduction

Ever since I started reading the book Web Scraping with Python, I have been inventing small requirements for myself to practice scraping.

I believe everyone who learns web scraping has, at some point, wanted to scrape something interesting from a search engine. After scraping GitHub star counts and CSDN blog information myself, scraping Baidu felt like the natural next step.

If you want to see how the previous two examples were implemented, check these posts:
Web Scraping with Python: Scraping GitHub Star Counts with Python
Web Scraping with Python: Scraping CSDN Blogs with Python

Let's first set a requirement for ourselves:

Given a keyword and a number of result entries, the scraper should write the scraped result information into a local Markdown file, displayed as a table.
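As a rough sketch of the output half of that requirement, a helper like the one below could write scraped results to a Markdown table. The file name, the column headers, and the (title, link) result shape are my own illustrative choices, not fixed by the requirement:

```python
# Minimal sketch: write (title, link) pairs to a local Markdown file as a table.
# File name and column headers are illustrative assumptions.
def save_as_markdown(results, path="baidu_results.md"):
    with open(path, "w", encoding="utf-8") as f:
        f.write("| # | Title | Link |\n")
        f.write("| --- | --- | --- |\n")
        for i, (title, link) in enumerate(results, start=1):
            f.write("| {} | {} | {} |\n".format(i, title, link))

# Example usage:
# save_as_markdown([("Welcome to Python.org", "https://www.python.org/")])
```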

After working through the previous two examples, this requirement doesn't look too difficult.

Now, let's get started!

II. Analysis: Scraping Rules

To scrape the search results for a given keyword, we first need to understand a few rules. With these rules in hand, we can write a scraper that fulfills the requirement.

Here is the information we need:

1. What is the entry URL?
This is a crucial question. Let's open Baidu and run a test: searching for the keyword python produces a rather complicated URL:

https://www.baidu.com/s?wd=python&rsv_spt=1&rsv_iqid=0xf2a1a84300027a1e&issp=1&f=8&rsv_bp=0&rsv_idx=2&ie=utf-8&tn=baiduhome_pg&rsv_enter=1&rsv_sug3=10&rsv_sug1=11&rsv_sug7=100&rsv_sug2=0&inputT=2354&rsv_sug4=2354

That complicated?! After looking into it, it turns out Baidu uses these extra parameters to adapt the search results to your environment. We don't need any of that, so we can use a URL as simple as this:

https://www.baidu.com/s?wd=python

So this is our entry URL: just append the keyword after https://www.baidu.com/s?wd=.
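Here is a minimal sketch of fetching this entry URL from Python, assuming the requests library. The User-Agent string below is just an example browser value used to disguise the request as a normal browser (as mentioned in the summary); any common browser UA string would do:

```python
# Minimal sketch, assuming the requests library is installed.
import urllib.parse

import requests

def fetch_search_page(keyword):
    """Fetch the Baidu result page for a keyword via https://www.baidu.com/s?wd=..."""
    url = "https://www.baidu.com/s?wd=" + urllib.parse.quote(keyword)
    headers = {
        # Example browser User-Agent to look like a normal browser request
        "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/90.0 Safari/537.36"),
    }
    resp = requests.get(url, headers=headers)
    resp.raise_for_status()
    return resp.text

# html = fetch_search_page("python")
```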

2. How do we move across multiple result pages?
This matters once we want to collect a large number of results. I inspected Baidu's "next page" button, and its markup looks like this:

<a href="/s?wd=asdfafdsasdfasdfffffffffffffffffffff&amp;pn=10&amp;oq=asdfafdsasdfasdfffffffffffffffffffff&amp;ie=utf-8&amp;rsv_pq=b5e8070300006cbc&amp;rsv_t=d4cbSiBUdx2J%2BXs6yKLuS3IvH8QA0fg9TcrRwpbwp0WASJ1szyVs5a20HdU&amp;rsv_page=1" class="n">下一页&gt;</a>
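Judging from the href above, the pn parameter appears to be the result offset: 10 results per page, so pn=10 is the second page and, more generally, page N corresponds to pn=(N-1)*10. Based on that assumption (this is an observation, not a documented API), a sketch for generating the URLs of the first few pages could look like this:

```python
# Sketch only: pn is assumed to be the result offset (10 results per page).
import urllib.parse

def build_page_urls(keyword, num_pages):
    """Return the search URLs for the first num_pages result pages."""
    base = "https://www.baidu.com/s?wd=" + urllib.parse.quote(keyword)
    return [base + "&pn={}".format(page * 10) for page in range(num_pages)]

# build_page_urls("python", 3)
# -> ['https://www.baidu.com/s?wd=python&pn=0',
#     'https://www.baidu.com/s?wd=python&pn=10',
#     'https://www.baidu.com/s?wd=python&pn=20']
```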