Web Scraping Basics (Part 1)

没蜡笔的小鑫++

Article link: https://blog.csdn.net/weixin_46604741/article/details/118890391

1. Using the API built into Python's standard library

from urllib.request import urlopen

url = "http://www.baidu.com"

response = urlopen(url)

# open the file; with a `with open()` statement there is no need to close the file handle yourself
with open('mybaidu.html', mode='w', encoding='utf-8') as f:
    f.write(response.read().decode("utf-8"))
print("over")

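Note that the Baidu URL here is written with http rather than https; requested over https, this bare request may not return the full page.

urlopen sends the request and returns the response, much like a browser opening a page, although nothing is rendered and no JavaScript runs.

Using a with statement also means you no longer have to manage opening and closing the file yourself, so it is the recommended way to do file operations. For reference, the signature of open: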

def open(file, mode='r', buffering=None, encoding=None, errors=None, newline=None, closefd=True)
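To see why with is the recommended style, here is a minimal sketch that writes and reads back a small file and checks that the file object is closed as soon as the block ends. The file name demo.txt and the temporary directory are placeholders chosen for this example only.

```python
import os
import tempfile

# Hypothetical file location used only for this demo.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with open(path, mode="w", encoding="utf-8") as f:
    f.write("hello")
    inside_closed = f.closed   # the file is still open inside the block

after_closed = f.closed        # it is closed automatically on leaving the block

with open(path, encoding="utf-8") as f:
    content = f.read()

print(inside_closed, after_closed, content)
```

The same guarantee holds if an exception is raised inside the block, which is the main reason to prefer with over a manual close().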

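2. Using requests

First, install the library.

Option 1: through PyCharm

Option 2: run this on the command line:

pip install requests

pip is bundled with Python 3.4 and later, so the command is available out of the box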

2.1 Fetching with a GET request

import requests

url = "https://www.sogou.com/web?query=周杰伦"
resp = requests.get(url)
print(resp.text)

You will find that the site has anti-scraping measures. To look more like a real browser, we can add request headers that include a User-Agent.

import requests

url = "https://www.sogou.com/web?query=周杰伦"
dic = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 Edg/91.0.864.67"
}

resp = requests.get(url, headers=dic)
print(resp.text)

The User-Agent value can be copied from the request headers shown in the browser's network capture tool.
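For comparison, the same idea can be sketched with the standard library's urllib from section 1 instead of requests. The snippet only builds the request object and sends nothing; the shortened User-Agent string is a placeholder, not the full value from the example above.

```python
from urllib.parse import urlencode, unquote
from urllib.request import Request

base = "https://www.sogou.com/web"
# urlencode percent-encodes the non-ASCII query text so it is safe in a URL
query = urlencode({"query": "周杰伦"})
# Shortened placeholder User-Agent, for illustration only
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

req = Request(base + "?" + query, headers=headers)  # built only, nothing is sent

print(req.full_url)
print(req.get_header("User-agent"))  # urllib stores header names in this capitalization
```

requests performs the query encoding for you when the URL contains non-ASCII text, which is why the earlier example could pass the Chinese query string directly.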

2.2 Fetching with a POST request

import requests

url = "https://fanyi.baidu.com/sug"

dat = {
    "kw": "dog"
}

resp = requests.post(url, data=dat)

print(resp.json())
resp.close()

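Note: dat is a dict, passed through the data keyword argument; requests form-encodes it into the request body. The two asterisks seen in function signatures (for example **kwargs) mean the function accepts arbitrary extra keyword arguments, and the same ** syntax unpacks a dict into keyword arguments at the call site.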
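The difference between passing a dict as one argument and unpacking it with ** is easy to mix up, so here is a small sketch. The function describe is a hypothetical stand-in that only mimics the shape of a call like requests.post; it is not part of requests.

```python
def describe(url, data=None, **kwargs):
    """Hypothetical stand-in that just reports what it received."""
    return {"url": url, "data": data, "extra": kwargs}

dat = {"kw": "dog"}

# Passed as one keyword argument: the dict arrives intact as `data`.
as_keyword = describe("https://fanyi.baidu.com/sug", data=dat)

# Unpacked with **: each key becomes its own keyword argument,
# so "kw" lands in **kwargs and `data` stays None.
unpacked = describe("https://fanyi.baidu.com/sug", **dat)

print(as_keyword)
print(unpacked)
```

In the POST example above the first form is the one being used: the whole dict is handed over as data.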

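3. Scraping Baidu Translate

https://fanyi.baidu.com/

Using the browser's network capture tool, we can see that the Baidu Translate request is an asynchronous XHR request that returns JSON directly.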

import requests

url = "https://fanyi.baidu.com/sug"

dat = {
    "kw": "dog"
}

resp = requests.post(url, data=dat)

print(resp.json())
resp.close()

Besides appending parameters to the end of the URL as a string, request parameters can also be passed as a dict.
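To make the dict form concrete, the sketch below uses only the standard library to show what a form-encoded body built from a dict looks like. It is an illustration of the encoding, not a look inside requests, and the request object is built but never sent.

```python
from urllib.parse import urlencode, parse_qs
from urllib.request import Request

dat = {"kw": "dog"}

# Form-encode the dict into the bytes that would travel in a POST body
body = urlencode(dat).encode("utf-8")

# Supplying a body makes urllib treat the request as a POST (built only, not sent)
req = Request("https://fanyi.baidu.com/sug", data=body)

print(body)
print(req.get_method())
print(parse_qs(body.decode("utf-8")))  # decoding recovers the original key and value
```

The string-concatenation style would produce the same kw=dog text by hand; the dict form simply leaves the escaping to the library.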

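4. Extracting useful data with regular expressions

Getting the page source is often not enough; we still need to pull the useful information out of it.

4.1 Regular expressions commonly used in scraping

.*?

This is lazy (non-greedy) matching: it matches as few characters as possible while still letting the rest of the pattern match.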
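A quick sketch of the difference between greedy .* and lazy .*? on a made-up string:

```python
import re

s = "<b>x</b><b>y</b>"

# Greedy: .* runs to the LAST </b> it can reach
greedy = re.findall(r"<b>(.*)</b>", s)

# Lazy: .*? stops at the FIRST </b> that lets the pattern succeed
lazy = re.findall(r"<b>(.*?)</b>", s)

print(greedy)  # ['x</b><b>y']
print(lazy)    # ['x', 'y']
```

When pulling repeated items out of HTML, the lazy form is almost always the one you want.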

Let's look at a simple demo

import re
s =""" 
<div class='jay'><span id='1'>郭麒麟</span></div>
<div class='jj'><span id='2'>宋铁</span></div>
<div class='jolin'><span id='3'>大聪明</span></div>
<div class='sylar'><span id='4'>范思哲</span></div>
<div class='tory'><span id='5'>胡说八道</span></div>
"""

# re.S lets . match newlines as well; the named groups mark the parts to extract
obj = re.compile(r"<div class='.*?'><span id='(?P<id>.*?)'>(?P<wahaha>.*?)</span></div>", re.S)
# finditer returns an iterator of match objects
result = obj.finditer(s)
for it in result:
    print(it.group("id"))
    print(it.group("wahaha"))

What we fetch by URL is usually raw page source like this, so the useful parts have to be extracted from it.

Here is the output:

1
郭麒麟
2
宋铁
3
大聪明
4
范思哲
5
胡说八道

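The extraction succeeded.

The r prefix on the string tells the interpreter to take every character literally, with no escape processing, so it is a good habit to write regular expressions as raw strings.

(?P<id>.*?) is a fixed syntax: the parentheses wrap the part to extract, and ?P<id> names that captured group id so it can be retrieved by name.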
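Both points can be checked in a few lines. The group name name below is chosen just for this sketch.

```python
import re

# A normal string turns \n into one newline character; a raw string keeps both characters
normal_len = len("\n")
raw_len = len(r"\n")

# Named groups can be read one at a time or all at once as a dict
m = re.search(r"<span id='(?P<id>.*?)'>(?P<name>.*?)</span>", "<span id='1'>郭麒麟</span>")
one = m.group("id")
everything = m.groupdict()

print(normal_len, raw_len)
print(one, everything)
```

groupdict() is handy when the captured fields are going straight into a CSV row or a database record.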

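4.2 Some common usages

import re
# findall: returns every match as a list
lst = re.findall(r"\d+", "我的电话号码是：10086,10010")
print(lst)

# finditer: matches everything in the string and returns an iterator
it = re.finditer(r"\d+", "我的电话号码是：10086,10010")
for i in it:
    print(i.group())

# search: returns a match object for the first hit only; call .group() to get the text
search = re.search(r"\d+", "我的电话号码是：10086,10010")
print(search.group())

# match: only matches at the very start of the string
match = re.match(r"\d+", "我的电话号码是：10086,10010")
print(match)  # None here, because the string does not start with a digit

# compile: precompile the pattern so it can be reused efficiently
re_compile = re.compile(r"\d+")
finditer = re_compile.finditer("我的电话号码是：10086,10010")
for it in finditer:
    print(it.group())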

4.3 Scraping the dy2018.com movie site

import requests
import re
import time
domain = "https://www.dy2018.com/"

# verify=False skips TLS certificate verification (requests will print a warning)
response = requests.get(domain, verify=False)
response.encoding = 'gb2312' # set the character set the site uses

obj1 = re.compile(r"2021必看热片.*?<ul>(?P<ul>.*?)</ul>", re.S)
obj2 = re.compile(r"<a href='(?P<href>.*?)'", re.S)
obj3 = re.compile(r'<div class="title_all"><h1>(?P<movie>.*?)</h1></div>.*?<td style="WORD-WRAP: break-word" bgcolor="#fdfddf"><a href='
                  r'"(?P<download>.*?)"', re.S)

finditer = obj1.finditer(response.text)
response.close()

child_href_list = []
for it in finditer:
    ul = it.group('ul')  # only the contents of the <ul> block
    # extract the links to the child pages
    result = obj2.finditer(ul)
    for itt in result:
        child_href = domain + itt.group('href').strip("/")
        child_href_list.append(child_href) # save the child page link in the list

# extract the content of each child page
for href in child_href_list:
    child_response = requests.get(href, verify=False)
    child_response.encoding = 'gb2312'
    search = obj3.search(child_response.text)
    child_response.close()
    time.sleep(2)  # pause between requests
    if search is None:  # the page did not match the pattern, skip it
        continue
    print(search.group("movie"))
    print(search.group("download"))

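Read this alongside the page's source code to follow the patterns. Note how the certificate problem is sidestepped here: verify=False turns off TLS certificate verification entirely, so use it only for practice like this.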
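The two-stage idea (first cut out one block, then pull the links from inside it) can be tried offline. The HTML below is a made-up, heavily simplified stand-in for the real home page, and urljoin is used here in place of the string concatenation above.

```python
import re
from urllib.parse import urljoin

domain = "https://www.dy2018.com/"

# Hypothetical sample markup, not the real page source
sample = """
<h2>2021必看热片</h2>
<ul>
  <li><a href='/i/1.html'>Movie A</a></li>
  <li><a href='/i/2.html'>Movie B</a></li>
</ul>
<ul><li><a href='/other/9.html'>Unrelated</a></li></ul>
"""

obj1 = re.compile(r"2021必看热片.*?<ul>(?P<ul>.*?)</ul>", re.S)
obj2 = re.compile(r"<a href='(?P<href>.*?)'", re.S)

links = []
block = obj1.search(sample)          # stage 1: isolate the first <ul> after the heading
if block is not None:
    for m in obj2.finditer(block.group("ul")):   # stage 2: links inside that block only
        links.append(urljoin(domain, m.group("href")))

print(links)
```

Because the inner pattern only ever sees the captured block, the link in the second list is never picked up.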

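5. Extracting useful data with bs4

As before, install it from the command line first:

pip install bs4

bs4 mainly locates data on the page by tags. (The bs4 package on PyPI is a thin wrapper that pulls in beautifulsoup4.)

5.1 Scraping vegetable prices from Beijing's Xinfadi market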

import requests
from bs4 import BeautifulSoup
import csv

# elements are looked up mainly by their attributes
url = "http://www.xinfadi.com.cn/marketanalysis/0/list/1.shtml"
resp = requests.get(url)

# parse the data
# 1. hand the page source to BeautifulSoup to build a bs4 object; "html.parser" can be
#    omitted, but then bs4 guesses a parser and prints a warning
page = BeautifulSoup(resp.text, "html.parser")
resp.close()

# 2. look up data in the bs4 object
# find(tag, attribute=value) returns the first match
# find_all(tag, attribute=value) returns every match
table = page.find("table", attrs={
    "class": "hq_table"
})
# all data rows, skipping the header row
trs = table.find_all("tr")[1:]

# newline='' keeps the csv module's own line endings from being translated again,
# which would otherwise leave blank rows on Windows
with open("菜价.csv", mode='w', encoding='utf-8', newline='') as f:
    csv_writer = csv.writer(f)
    for tr in trs:
        tds = tr.find_all("td")
        name = tds[0].text   # .text is the text enclosed by a pair of tags
        low = tds[1].text
        avg = tds[2].text
        high = tds[3].text
        gui = tds[4].text
        kind = tds[5].text
        date = tds[6].text
        csv_writer.writerow([name, low, avg, high, gui, kind, date])

Again, read this alongside the page's source code to understand it.
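The newline='' argument in the code above is easier to understand once you see what the csv module writes by itself. This sketch writes two made-up rows into an in-memory buffer instead of a file and prints the raw result.

```python
import csv
import io

buffer = io.StringIO()          # in-memory stand-in for the output file
writer = csv.writer(buffer)
writer.writerow(["name", "low", "high"])
writer.writerow(["大白菜", "0.5", "0.8"])   # made-up sample values

raw = buffer.getvalue()
print(repr(raw))
```

Each row already ends in \r\n, written by the csv module. If the file is opened in text mode without newline='', Windows translates the \n again and every row gains an extra \r, which shows up as blank lines when the file is opened in a spreadsheet.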

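5.2 Scraping the Umei image gallery

On typical wallpaper sites, the images on the index page are compressed thumbnails, not the high-resolution originals; you have to click through to the detail page before you can see and download the original. In other words, the image URL shown on the index page is not the URL of the original.
So we can loop over the detail-page links on the index page and download the image from each one.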

import time
import requests
from bs4 import BeautifulSoup


url = "https://www.umei.net/bizhitupian/weimeibizhi/"
resp = requests.get(url)
resp.encoding = 'utf-8'

# hand the source to bs4
main_page = BeautifulSoup(resp.text, "html.parser")
resp.close()

a_list = main_page.find("div", class_="TypeList").find_all("a")
for a in a_list:
    href = "https://www.umei.net/" + a.get('href')   # .get() reads an attribute value directly
    child_page_resp = requests.get(href)
    child_page_resp.encoding = 'utf-8'
    child_page_content = child_page_resp.text  # source of the child page
    child_page_resp.close()
    # get the image download path from the child page
    child_page = BeautifulSoup(child_page_content, "html.parser")
    p = child_page.find("p", align="center")
    img = p.find("img")
    src = img.get("src")
    # download the image
    img_resp = requests.get(src)
    # img_resp.content holds the raw bytes
    img_name = src.split("/")[-1] # everything after the last / in the url
    # wb means write in binary mode
    with open("picture/" + img_name, mode='wb') as f:
        f.write(img_resp.content)
        print('over!', img_name)
        time.sleep(1)

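Note that class is a Python keyword, so when you need to filter on the class attribute, append an underscore: class_ = xxxx

The downloaded images can go in their own folder, "picture". In JetBrains IDEs you can right-click that folder and choose Mark Directory as, then Excluded. PyCharm builds an index for every file that enters the project so it can search quickly; once the folder is excluded, PyCharm no longer indexes its contents, which reduces lag.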
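The download loop above assumes the picture folder already exists. A small sketch of creating it from code and deriving the file name from the URL follows; the temporary base directory and the fake bytes are placeholders for this example.

```python
import os
import tempfile

src = "http://kr.shanghai-jiuxin.com/file/2020/1031/small191468637cab2f0206f7d1d9b175ac81.jpg"
img_name = src.split("/")[-1]          # everything after the last /

base = tempfile.mkdtemp()              # stand-in for the project directory
folder = os.path.join(base, "picture")
os.makedirs(folder, exist_ok=True)     # create the folder if it is missing
os.makedirs(folder, exist_ok=True)     # calling it again is harmless

target = os.path.join(folder, img_name)
with open(target, mode="wb") as f:
    f.write(b"fake image bytes")       # placeholder for img_resp.content

print(img_name, os.path.isfile(target))
```

With exist_ok=True the script can be rerun without first checking whether the folder is there.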

<img src="http://kr.shanghai-jiuxin.com/file/2020/1031/small191468637cab2f0206f7d1d9b175ac81.jpg">

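<div>content</div>

bs4 works by locating tag objects. Once you have a tag object:

.text returns the text inside it, here content

.get("xxx") returns the value of attribute xxx; for example img.get("src") returns the src of the image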
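Here is a minimal offline sketch of these two accessors, assuming bs4 is installed. The markup and the example.com URL are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely shaped like the gallery index page
html = '<div class="TypeList"><a href="/a/1.htm"><img src="http://example.com/small1.jpg">content</a></div>'
soup = BeautifulSoup(html, "html.parser")

a = soup.find("div", class_="TypeList").find("a")
img = a.find("img")

link = a.get("href")        # attribute value
text = a.text               # text inside the tag
src = img.get("src")
missing = a.get("title")    # .get() returns None when the attribute is absent

print(link, text, src, missing)
```

Since .get() returns None for a missing attribute, it is worth checking the result before concatenating it into a URL.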

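6. Extracting useful data with xpath (recommended)

xpath focuses on nodes. Markup languages such as HTML and XML form a layered DOM tree, and xpath is concerned with where each node sits in that tree.

As before, install the package first:

pip install lxml

6.1 Intro demo: parsing a piece of XML

# xpath is a language for searching content in XML documents
# HTML is a closely related markup language, so xpath works on parsed HTML too
# it finds nodes mainly through the relationships between them
# requires the lxml module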
from lxml import etree
xml = """
<book>
    <id>1</id>
    <name>野花遍地香</name>
    <price>1.23</price>
    <nick>臭豆腐</nick>
    <author>
        <nick id="10086">周大强</nick>
        <nick id="10010">周芷若</nick>
        <nick class="joy">周杰伦</nick>
        <nick class="jolin">蔡依林</nick>
        <div>
            <nick>惹了</nick>
        </div>
    </author>

    <partner>
        <nick id="ppc">胖胖陈</nick>
        <nick id="ppbc">胖胖不陈</nick>
    </partner>
</book>
"""

tree = etree.XML(xml)
# result = tree.xpath("/book") # / expresses hierarchy; the leading / is the root node
# result = tree.xpath("/book/name")
# result = tree.xpath("/book/name/text()")    # text() gets the text
result = tree.xpath("/book/author/*/nick/text()")   # * is a wildcard matching any node at that level

print(result)

etree.XML() builds the tree object. To get a node's text content, use text(), as in the sample code.
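To compare what the different path forms return, here is a sketch on a trimmed-down copy of the XML above, assuming lxml is installed. The // form is not used in the demo; it is added here to contrast with the * wildcard.

```python
from lxml import etree

xml = """<book>
    <name>野花遍地香</name>
    <author>
        <nick id="10086">周大强</nick>
        <nick class="joy">周杰伦</nick>
        <div><nick>惹了</nick></div>
    </author>
</book>"""

tree = etree.XML(xml)

name = tree.xpath("/book/name/text()")                 # text of a direct path
children = tree.xpath("/book/author/nick/text()")      # nick directly under author
wildcard = tree.xpath("/book/author/*/nick/text()")    # nick exactly one level deeper
descendants = tree.xpath("/book/author//nick/text()")  # nick at any depth under author
by_attr = tree.xpath("/book/author/nick[@id='10086']/text()")

print(name, children, wildcard, descendants, by_attr)
```

Every query returns a list, even when only one node matches.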

6.2 Parsing an HTML file

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8"/>
    <title>Title</title>
</head>
<body>
    <ul>
        <li><a href="http://www.baidu.com">百度</a></li>
        <li><a href="http://www.google.com">谷歌</a></li>
        <li><a href="http://www.sogou.com">搜狗</a></li>
    </ul>
    <ol>
        <li><a href="feiji">飞机</a></li>
        <li><a href="dapao">大炮</a></li>
        <li><a href="huoche">火车</a></li>
    </ol>
    <div class="job">李嘉诚</div>
    <div class="common">胡辣汤</div>
</body>
</html>

First save the markup above as an HTML file named a.html

from lxml import etree

# a file can be loaded directly; etree.parse uses the XML parser by default,
# which works here because a.html is well-formed
tree = etree.parse("a.html")
# result = tree.xpath('/html/body/ul/li/a/text()')
# result = tree.xpath('/html/body/ul/li[1]/a/text()')  # xpath positions start at 1; [] is an index
# result = tree.xpath("/html/body/ol/li/a[@href='dapao']/text()")  # [@xx=xxx] filters on an attribute
# print(result)

ol_li_li = tree.xpath("/html/body/ol/li")

for li in ol_li_li:
    result = li.xpath("./a/text()") # keep searching inside li; ./ starts from the current node (a relative lookup)
    print(result)
    result2 = li.xpath("./a/@href")
    print(result2)

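The basic flow is to build a tree object and then look up the nodes you need in it.

text() gets text values; it returns a list, which you need to turn into a string yourself

@ gets attribute values, for example @href returns the value of href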
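The list-to-string step can be sketched like this, assuming lxml is installed. The markup is a cut-down copy of the ol list from a.html, parsed from a string with etree.HTML instead of from a file.

```python
from lxml import etree

html = """<html><body><ol>
<li><a href="feiji">飞机</a></li>
<li><a href="dapao">大炮</a></li>
</ol></body></html>"""

tree = etree.HTML(html)

rows = []
for li in tree.xpath("/html/body/ol/li"):
    text = "".join(li.xpath("./a/text()"))   # list of text nodes -> one string
    hrefs = li.xpath("./a/@href")            # also a list
    rows.append((text, hrefs[0] if hrefs else None))

first = tree.xpath("/html/body/ol/li[1]/a/text()")   # positions start at 1

print(rows, first)
```

Joining with "" rather than indexing [0] avoids an IndexError when a node has no text at all.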

6.3 Scraping zbj.com (Zhubajie)

import requests
from lxml import etree

url = "https://beijing.zbj.com/search/f/?type=new&kw=saas"
resp = requests.get(url)

html = etree.HTML(resp.text)

# get the div of every service provider
divs = html.xpath("/html/body/div[6]/div/div/div[2]/div[5]/div[1]/div")
for div in divs:
    price = "".join(div.xpath("./div/div/a[1]/div[2]/div[1]/span[1]/text()")).strip("¥").strip()
    # joining with "saas" puts the search keyword back where the title text is split
    # (this assumes the keyword is wrapped in its own tag on the page)
    title = "saas".join(div.xpath("./div/div/a[1]/div[2]/div[2]/p/text()")).strip()
    com_name = "".join(div.xpath("./div/div/a[2]/div[1]/p/text()")).strip()
    location = "".join(div.xpath("./div/div/a[2]/div[1]/div/span/text()")).strip()
    print(title)
    print(price)
    print(com_name)
    print(location)

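Note: when building the tree object, use HTML() for HTML and XML() for XML; do not mix them up.

A quick way to get an xpath: open the F12 developer tools on the page, select the element you need, then right-click --> Copy --> Copy XPath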
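The join-and-strip pattern used for price and title in the code above can be tried without a network request. The lists below are made-up stand-ins for what the xpath calls might return, and join_text is a hypothetical helper written for this sketch.

```python
def join_text(parts, sep=""):
    """Hypothetical helper: merge a list of text nodes into one trimmed string."""
    return sep.join(parts).strip()

# Made-up xpath results for illustration
price_parts = ["¥", "1200"]
title_parts = ["企业级 ", " 系统定制开发"]   # text split around the keyword
empty_parts = []                             # the node was not found

price = join_text(price_parts).strip("¥")
title = join_text(title_parts, "saas")
empty = join_text(empty_parts)

print(price, title, repr(empty))
```

An empty result simply becomes an empty string, so a missing field does not crash the loop.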

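7. Summary

In essence, a scraper simulates a browser: it requests a URL and then distills the useful information out of everything that comes back.
The response for a URL can be fetched with the get or post functions of the requests package.
The useful information can then be extracted with regular expressions, bs4, or xpath. Regular expressions are the most efficient but the most fiddly to write; xpath and bs4 are about equally efficient, with xpath being the simpler to use.
Sometimes the useful information is not in the page source at all, because the page is rendered on the client with JS and AJAX rather than on the server. The code seen in the F12 inspector then differs from what right-click View Page Source shows. In that case, use the capture tool to find the asynchronous XHR requests. Remember to tick Preserve log while capturing, otherwise the XHR entries disappear as soon as the page redirects.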
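Once such an XHR request has been found, its JSON body is usually easier to work with than HTML. The payload below is entirely hypothetical (the field names errno, data, k and v are assumptions for this sketch, not a documented response format); it only shows the parsing step.

```python
import json

# Hypothetical JSON text, standing in for the body of an XHR response
raw = '{"errno": 0, "data": [{"k": "dog", "v": "n. 狗"}]}'

payload = json.loads(raw)

# Use .get with a default so a missing "data" key yields an empty list
pairs = [(item["k"], item["v"]) for item in payload.get("data", [])]

print(pairs)
```

With requests, resp.json() performs the same json.loads step on the response body.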