Python Web Scraping Study Notes, Part 1

xue251248603

Category: Development Technology. Tags: python, web scraping, selenium

Article link: https://blog.csdn.net/xue251248603/article/details/83590072

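Summary (auto-generated by CSDN): This post covers using Python's Selenium library to fetch web page content, and in particular scraping pictures of beautiful women from Zhihu search results. The author first walks through installing Selenium and chromedriver, then provides a modified code snippet for downloading the images. Because Zhihu loads further result pages as JSON through an API, the author notes that paginated results cannot be scraped yet and plans to look into it later.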

Without further ado, here is the code:

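from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

def main():
    chrome_options = Options()
    chrome_options.add_argument('--headless')  # run Chrome without a visible window
    chrome_options.add_argument('--disable-gpu')
    # Selenium 4 takes the driver path through a Service object; the older
    # executable_path keyword was deprecated and later removed in the 4.x series.
    service = Service('G:/pythonLib/chromedriver.exe')
    driver = webdriver.Chrome(service=service, options=chrome_options)
    try:
        driver.get("https://www.baidu.com")
        print(driver.page_source)
        driver.save_screenshot(r'baidu_explorer.png')
    finally:
        # quit() ends the session and stops the chromedriver process,
        # whereas close() only closes the current window
        driver.quit()

if __name__ == '__main__':
    main()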

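Required installs

selenium:
pip install selenium
chromedriver:
Download from
https://sites.google.com/a/chromium.org/chromedriver/downloads (the original site, blocked in mainland China)
http://npm.taobao.org/mirrors/chromedriver (mirror, accessible)
After downloading, remember to replace the path "G:/pythonLib/chromedriver.exe" in the code with the path on your own machine.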
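
Since the chromedriver path differs from machine to machine, it can help to check the path before starting the browser. The following is only a minimal sketch: `find_chromedriver` is a hypothetical helper name, not part of Selenium, and it assumes the executable is called `chromedriver` when it is looked up on PATH.

```python
import os
import shutil

def find_chromedriver(explicit_path=None):
    """Return a usable chromedriver path, or None if none is found.

    Hypothetical helper for illustration; not part of Selenium.
    """
    # An explicitly configured path wins, but only if the file really exists.
    if explicit_path and os.path.isfile(explicit_path):
        return explicit_path
    # Otherwise fall back to whatever `chromedriver` is found on PATH.
    return shutil.which("chromedriver")

if __name__ == '__main__':
    path = find_chromedriver('G:/pythonLib/chromedriver.exe')
    print(path if path else "chromedriver not found; download it first")
```

The returned path can then be handed to the driver in place of the hard-coded string, which turns a confusing startup error into a clear message.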

Here is also a snippet that scrapes the images from a web page (reposted from the linked source):

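import os
import re
import time
from urllib import request

from bs4 import BeautifulSoup

url = "https://www.zhihu.com/question/66313867"
# request.urlopen(url) returns an HTTPResponse object. Its main methods include
# read(), readinto(), getheader(name), getheaders() and fileno(); its attributes
# include msg, version, status, reason, debuglevel and closed.
resp = request.urlopen(url)
buff = resp.read()  # raw page content (bytes)
print(resp.status)  # HTTP status code of the response
html = buff.decode("utf8")
print(html)  # the fetched page source
soup = BeautifulSoup(html, 'html.parser')  # parse the source into a BeautifulSoup object
# print(soup.prettify())

# Use Beautiful Soup with a regular expression to find every <img> tag with the
# given class whose src ends in .jpg (the dot is escaped so it matches literally)
links = soup.find_all('img', "origin_image zh-lightbox-thumb", src=re.compile(r'\.jpg$'))
print(links)

# Directory to save the images into; without it they land in the current working directory
path = r'G:\BeautifulGril'  # the r prefix makes this a raw string, so backslashes are not escapes
os.makedirs(path, exist_ok=True)  # create the directory if it does not exist yet
for link in links:
    print(link.attrs['src'])
    # Save the image, using the time.time() timestamp as the name to avoid collisions
    request.urlretrieve(link.attrs['src'], os.path.join(path, '%s.jpg' % time.time()))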
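
The filter-and-name step of the script above can be looked at on its own, without any network access. In a regular expression an unescaped `.` matches any character, so the sketch below escapes it to match a literal `.jpg` suffix. It also names files after the last URL segment plus a counter, as one possible alternative to timestamps. `plan_downloads` is a hypothetical helper and the URLs are made up for illustration.

```python
import posixpath
import re
from urllib.parse import urlparse

# The dot is escaped so that only a literal ".jpg" suffix matches.
JPG_SUFFIX = re.compile(r'\.jpg$')

def plan_downloads(srcs):
    """Pair each .jpg URL with a unique file name (hypothetical helper)."""
    plan = []
    jpg_srcs = (s for s in srcs if JPG_SUFFIX.search(s))
    for index, src in enumerate(jpg_srcs):
        # Last path segment of the URL, e.g. "one.jpg"
        base = posixpath.basename(urlparse(src).path)
        # The zero-padded counter keeps names unique even if base names repeat.
        plan.append((src, "%04d_%s" % (index, base)))
    return plan

if __name__ == '__main__':
    sample = [
        "https://pic.example.com/a/one.jpg",
        "https://pic.example.com/b/two.png",
        "https://pic.example.com/c/notjpg",   # would slip through an unescaped '.jpg$'
        "https://pic.example.com/d/one.jpg",
    ]
    for src, name in plan_downloads(sample):
        print(src, "->", name)
```

Each resulting pair could then be passed to `request.urlretrieve` together with the target directory.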

urllib is part of the Python 3 standard library and needs no installation; only bs4 has to be installed with pip:
pip install beautifulsoup4
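
To see which of the two modules actually needs installing, one can ask the interpreter directly. This is a small sketch using the standard `importlib` module; `is_installed` is a hypothetical helper name. `urllib` ships with Python 3, so it should always be reported as available.

```python
import importlib.util

def is_installed(module_name):
    """Return True if `module_name` can be imported in this interpreter."""
    return importlib.util.find_spec(module_name) is not None

if __name__ == '__main__':
    for name in ("urllib", "bs4"):
        print(name, "available" if is_installed(name) else "missing")
```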

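With a small change, we can take the results Zhihu returns for the search "美女" ("beautiful women") and scrape the images from each result one by one: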

import os
import re
import time
from urllib import request

from bs4 import BeautifulSoup

url = "https://www.zhihu.com/search?type=content&q=%E7%BE%8E%E5%A5%B3"
# request.urlopen(url) returns an HTTPResponse object. Its main methods include
# read(), readinto(), getheader(name), getheaders() and fileno(); its attributes
# include msg, version, status, reason, debuglevel and closed.
resp = request.urlopen(url)
buff = resp.read()  # raw page content (bytes)
print(resp.status)  # HTTP status code of the response
html = buff.decode("utf8")
print(html)  # the fetched page source
soup = BeautifulSoup(html, 'html.parser')  # parse the source into a BeautifulSoup object
# print(soup.prettify())
# Extract the article links from the search results, for example:
# <meta itemprop="url" content="https://www.zhihu.com/question/25509555">
links = soup.find_all('meta', itemprop='url', content=re.compile(r'^https:'))
print(links)

# Directory to save the images into; without it they land in the current working directory
path = r'G:\BeautifulGril'  # the r prefix makes this a raw string, so backslashes are not escapes
os.makedirs(path, exist_ok=True)  # create the directory if it does not exist yet

for link in links:
    curUrl = link.attrs['content']
    print(curUrl)
    curBuff = request.urlopen(curUrl).read()  # raw page content (bytes)
    curHtml = curBuff.decode("utf8")
    print(curHtml)  # the fetched page source
    curSoup = BeautifulSoup(curHtml, 'html.parser')  # parse the source into a BeautifulSoup object
    # Find every <img> tag with the given class whose src ends in .jpg
    # (the dot is escaped so it matches literally)
    curlinks = curSoup.find_all('img', "origin_image zh-lightbox-thumb", src=re.compile(r'\.jpg$'))
    print(curlinks)

    for jpgLink in curlinks:
        print(jpgLink.attrs['src'])
        # Save the image, using the time.time() timestamp as the name to avoid collisions
        request.urlretrieve(jpgLink.attrs['src'], os.path.join(path, '%s.jpg' % time.time()))
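
The `q=%E7%BE%8E%E5%A5%B3` part of the search URL is simply the keyword "美女" percent-encoded as UTF-8. The sketch below builds the same URL for an arbitrary keyword with the standard library; `zhihu_search_url` is a hypothetical helper name, and the parameter names are taken from the URL used in the script.

```python
from urllib.parse import urlencode

def zhihu_search_url(keyword):
    """Build the search URL for a keyword (hypothetical helper)."""
    # urlencode percent-encodes the UTF-8 bytes of non-ASCII text.
    query = urlencode({"type": "content", "q": keyword})
    return "https://www.zhihu.com/search?" + query

if __name__ == '__main__':
    print(zhihu_search_url("美女"))
```

For the keyword "美女" this reproduces the URL hard-coded in the script, so other searches only need a different argument.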