Python: A Practical Summary of Getting Started with Web Crawlers

Latest recommended article published on 2021-06-21 18:22:55

maidu_xbd

Category: Python

Article link: https://blog.csdn.net/maidu_xbd/article/details/108881245

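1. Introduction to Crawlers

Crawler: a program that collects data from the web.

What is crawled data used for?

(1) Building reference databases

(2) Data analysis

(3) Artificial intelligence: user profiling; recommender systems (Toutiao, Amazon, and so on); image recognition; natural language processing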

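Why write crawlers in Python?

Java: verbose. The large amount of code raises the cost of refactoring.

PHP: not designed with multitasking in mind, so crawling throughput is low.

C/C++: very flexible and fast at runtime, but unfriendly to programmers and expensive to learn.

Python: a mature ecosystem and concise syntax.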

Types of crawlers:

General Purpose Web Crawler, Focused Crawler, Incremental Web Crawler, and Deep Web Crawler.

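2. A Minimal Crawler with urllib

About urllib: urllib is a package in the Python standard library for fetching URLs (Uniform Resource Locators).

It offers a very simple interface in the form of the urlopen function, which can fetch URLs over several different protocols.

It also offers a slightly more complex interface for common situations such as basic authentication, cookies, and proxies. These are provided by objects called handlers and openers.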
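To make the handler/opener idea concrete, here is a minimal sketch that builds an opener with cookie and proxy support. The proxy address and the User-Agent string are hypothetical placeholders, and no request is actually sent.

```python
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, ProxyHandler, build_opener

# Stores cookies set by responses and replays them on later requests.
cookie_jar = CookieJar()

# Hypothetical local proxy address, used only for illustration.
proxy_handler = ProxyHandler({"http": "http://127.0.0.1:8080"})

# build_opener chains the given handlers with the default ones.
opener = build_opener(HTTPCookieProcessor(cookie_jar), proxy_handler)
opener.addheaders = [("User-Agent", "Mozilla/5.0 (example)")]

handler_names = [type(h).__name__ for h in opener.handlers]
print(handler_names)

# opener.open("http://www.baidu.com") would send the request through these handlers.
```

Calling opener.open(url) instead of urlopen(url) is what routes a request through the custom handlers.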

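Case 1: The simplest crawler, using urlopen

# The simplest crawler
from urllib.request import urlopen

# 1. Send the HTTP request
resp = urlopen("http://www.baidu.com")
# 2. Extract data from the response.
#    The body can be read only once, so read it into a variable and reuse it.
data = resp.read()
print(data)        # all of the data
print(data[:100])  # the first 100 bytes
# 3. Save the data to an HTML file. "wb" opens the file for binary writing:
#    an existing file is overwritten, a missing file is created.
with open("D:/Learning/Python/spider/baidu1.html", "wb") as f:  # absolute path
    f.write(data)
with open("./baidu2.html", "wb") as f:  # relative path; ./ is the current working directory (may be omitted)
    f.write(data)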
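One detail of step 2 is worth a short illustration: the object returned by urlopen is a stream, so once read() has consumed the body, later read() calls return empty bytes. The sketch below uses io.BytesIO as a stand-in for the response (an assumption made so that it runs without network access).

```python
import io

# Stand-in for the response object returned by urlopen.
stream = io.BytesIO(b"<html>hello</html>")

first = stream.read()   # consumes the whole body
second = stream.read()  # nothing is left to read

print(first)
print(second)
```

This is why the body should be read once into a variable when it is needed in more than one place.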

Case 2: Crawling with urlretrieve

from urllib.request import urlretrieve
# urlretrieve downloads the URL and saves it straight to a local file
urlretrieve("http://www.baidu.com",
            filename="D:/Learning/Python/spider/baidu3.html")
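urlretrieve returns a tuple of the local file name and the response headers. As a small sketch that can be tried without network access, the example below retrieves a file:// URL pointing at a temporary file; the file names are placeholders chosen for illustration.

```python
import tempfile
from pathlib import Path
from urllib.request import urlretrieve

with tempfile.TemporaryDirectory() as tmp:
    # Create a small local "page" to act as the download source.
    source = Path(tmp) / "source.html"
    source.write_bytes(b"<html><body>demo</body></html>")
    target = Path(tmp) / "copy.html"

    # urlretrieve returns (local_filename, headers).
    saved_path, headers = urlretrieve(source.as_uri(), filename=str(target))
    saved_bytes = target.read_bytes()

print(saved_path)
print(saved_bytes)
```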

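3. Crawling with Requests

About Requests: Requests is a third-party HTTP library for Python, released under the Apache 2.0 license. The project aims to make HTTP requests simpler and more human-friendly.

Features: Requests is built on top of urllib3 (a third-party library, not the standard-library urllib) and is simple to use. Older releases ran on both Python 2 and Python 3; current releases support Python 3 only.

Install Requests: 【pip install requests -i https://mirrors.aliyun.com/pypi/simple/】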

Case 1: Crawling geyanw.com with Requests

import requests
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36"}
# resp = requests.get("https://www.baidu.com", headers=headers)
# print(resp.content.decode("utf-8"))
resp = requests.get("http://www.geyanw.com/", headers=headers)
# resp.content is raw bytes; decode it with the page's own encoding
print(resp.content.decode("gb2312"))
# Headers of the request that was actually sent
print(resp.request.headers)
# Write the raw bytes to a file
with open("D:/Learning/Python/spider/geyan.html", "wb") as f:
    f.write(resp.content)
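The example above hard-codes the page encoding. When the encoding is not known in advance, one option is to try a few candidates in order. The helper below is a hypothetical sketch (decode_body is not part of Requests); it tries UTF-8 first and then GBK, which is a superset of GB2312.

```python
def decode_body(raw, encodings=("utf-8", "gbk")):
    """Return raw decoded with the first encoding that succeeds."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: decode anyway and replace the undecodable bytes.
    return raw.decode(encodings[0], errors="replace")


# Bytes as a GBK-encoded page would send them.
sample = "格言".encode("gbk")
text = decode_body(sample)
print(text)
```

With a real response, the call would be decode_body(resp.content).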

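4. A Handy Tool for Parsing Data: lxml and XPath

XPath syntax:

/: selects from the root node

//: selects matching nodes anywhere below the current node, regardless of their position

.: the current node

..: the parent of the current node

div[@lang]: selects div elements that have a lang attribute

div[@lang='eng']: selects div elements whose lang attribute equals eng

text(): gets the text content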
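The expressions listed above can be tried on a small inline document. To keep the sketch free of third-party dependencies it uses the standard library's xml.etree.ElementTree rather than lxml. ElementTree supports only a subset of XPath: paths must be relative (hence the leading "."), and there is no text() step, so the .text attribute is read instead. The sample markup is made up for illustration.

```python
import xml.etree.ElementTree as ET

DOC = """
<html>
  <body>
    <div lang="eng"><p>Hello</p></div>
    <div lang="fr"><p>Bonjour</p></div>
    <div><p>No language</p></div>
  </body>
</html>
""".strip()

root = ET.fromstring(DOC)

# div[@lang]: every div that has a lang attribute, at any depth
with_lang = root.findall(".//div[@lang]")

# div[@lang='eng']: only divs whose lang attribute equals "eng"
english = root.findall(".//div[@lang='eng']")

# No text() in ElementTree; read .text from the matched elements instead
english_text = [p.text for div in english for p in div.findall("./p")]

# "..": the parent of each matched div
parent_tags = [e.tag for e in root.findall(".//div/..")]

print(len(with_lang), english_text, parent_tags)
```

With lxml, the same document could be queried with full XPath, including absolute paths and text().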

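Install lxml: 【pip install lxml -i https://mirrors.aliyun.com/pypi/simple/】

How to use lxml: open the site you want to crawl in a browser, press 【F12】 to open the developer tools, and press 【Ctrl+F】 to bring up the search box inside the developer tools, where an XPath expression can be typed to test it. Use the element picker to select the content to crawl, and build the expression from the outer elements inward.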

Case 1: Crawl geyanw.com with requests, parse the data with lxml to get each quote's title and text, and save them to a txt file.

【demo_lxml.py】

# Target site: "http://www.geyanw.com/". Crawl the quotes and save them to a txt file.
# XPath for the quote titles: //div[@class='listbox']/dl[@class='tbox']/dd/ul/li/a/text()
# XPath for the quote title links: //div[@class='listbox']/dl[@class='tbox']/dd/ul/li/a/@href

from urllib.parse import urljoin

import requests
from lxml.etree import HTML

BASE_URL = "http://www.geyanw.com/"


def get_content_by_xpath(url, xpath):
    """Fetch url and return the list of results matched by xpath."""
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36"}
    resp = requests.get(url, headers=headers)

    # Parse the raw bytes into an element tree, then evaluate the XPath
    html_tree = HTML(resp.content)
    return html_tree.xpath(xpath)


# Guard the crawl so that importing this module does not run it.
if __name__ == "__main__":
    title_list = get_content_by_xpath(
        BASE_URL, "//div[@class='listbox']/dl[@class='tbox']/dd/ul/li/a/text()")
    link_list = get_content_by_xpath(
        BASE_URL, "//div[@class='listbox']/dl[@class='tbox']/dd/ul/li/a/@href")
    print("Titles:", title_list)
    print("Links:", link_list)

    # XPath for the quote text on each detail page: //div[@class='content']//span/text()
    with open("D:/Learning/Python/spider/geyan.txt", "a+", encoding="utf-8") as f:
        for title, link in zip(title_list, link_list):
            # urljoin copes with hrefs with or without a leading slash
            url = urljoin(BASE_URL, link)
            content_list = get_content_by_xpath(
                url, "//div[@class='content']//span/text()")
            f.write(title)
            f.write("\n")
            for line in content_list:
                f.write(line)
                f.write("\n")
            f.write("\n\n")
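When an extracted href is combined with the site root, plain string concatenation can produce a doubled slash or break on absolute links. A small sketch with urllib.parse.urljoin shows how the different cases resolve; the hrefs below are hypothetical examples, not links taken from the site.

```python
from urllib.parse import urljoin

base = "http://www.geyanw.com/"

# Hypothetical hrefs: root-relative, relative, and already absolute.
links = ["/html/a.html", "html/b.html", "http://www.geyanw.com/c.html"]

full_urls = [urljoin(base, link) for link in links]
for url in full_urls:
    print(url)
```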

Case 2: Reusing 【get_content_by_xpath】

from demo_lxml import get_content_by_xpath

# Avoid naming the variable "list", which would shadow the built-in type
result = get_content_by_xpath(
    "http://fund.eastmoney.com/519674.html?spm=001.1.swh/", "//div[@class='fundDetail-tit']/div/text()")

print(result)

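5. selenium + ChromeDriver

An approach for sites that are harder to crawl, such as pages whose content is rendered by JavaScript: use webdriver from selenium.

(1) Install selenium: pip install selenium -i https://mirrors.aliyun.com/pypi/simple/

(2) Find the ChromeDriver build that matches your Chrome version on the ChromeDriver mirror.

First check the version of Chrome installed locally (for example, on the chrome://version page).

Then open http://npm.taobao.org/mirrors/chromedriver/ and choose the directory for the matching version, for example:

http://npm.taobao.org/mirrors/chromedriver/84.0.4147.30/

Download chromedriver 84.0.4147.30, unzip it to get 【chromedriver.exe】, and copy it into the project directory.

(3) Write the following into a .py file and run it.

【demo_selenium.py】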

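from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time

# Written against the Selenium 4 API. Selenium 3 took the driver path as the first
# positional argument together with chrome_options=..., and offered
# find_elements_by_xpath; those forms were later deprecated and removed.
option = webdriver.ChromeOptions()
# http://quote.eastmoney.com/center/gridlist.html#hs_a_board
driver = webdriver.Chrome(
    service=Service("D:/Learning/Python/spider/chromedriver"), options=option)
try:
    driver.get("http://quote.eastmoney.com/center/gridlist.html#hs_a_board")
    # XPath for the stock codes: //tbody/tr/td[2]/a
    # XPath for the price changes: //tbody/tr/td[6]/span
    time.sleep(5)  # crude wait for the JavaScript-rendered table to load
    code_ele_list = driver.find_elements(By.XPATH, "//tbody/tr/td[2]/a")
    incr_ele_list = driver.find_elements(By.XPATH, "//tbody/tr/td[6]/span")
    # .text returns the visible text of each element
    code_list = [ele.text for ele in code_ele_list]
    incr_list = [ele.text for ele in incr_ele_list]
finally:
    driver.quit()  # always close the browser, even if something fails

print(code_list)
print(incr_list)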
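As a possible follow-up, the two lists can be paired row by row and written out as CSV. The sketch below is only an illustration: the values are made-up placeholders rather than real quotes, and the column names are assumptions. It writes to an in-memory buffer so that it runs without a browser or a file path.

```python
import csv
import io

# Placeholder values standing in for the scraped lists (not real market data).
code_list = ["code-1", "code-2"]
incr_list = ["+0.00%", "-0.00%"]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["code", "change"])          # assumed column names
writer.writerows(zip(code_list, incr_list))  # one row per stock

csv_lines = buffer.getvalue().splitlines()
print(csv_lines)
```

To write to disk instead, the buffer could be replaced by a file opened with open(path, "w", newline="", encoding="utf-8").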