Python (Data Analysis and Visualization), Part 3

孤星1212

Article link: https://blog.csdn.net/qq_50742585/article/details/112746717

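A first try at scraping web page text

Today we will work through a few fun hands-on scraping examples.

1. Jandan (jandan.net) text scraper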

import requests
from lxml import etree

url = 'http://jandan.net/'

# Requesting without a request header returns status code 403
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36',
}
response = requests.get(url,headers=headers)
print(response.status_code)
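if response.status_code == 200:  # first attempt: request directly; a 403 status code means the resource exists but access is forbidden
    html = response.text   # response.text -> the HTML source code
    #print(html)
    # Tip: right-click the part of the page you want to analyze and choose Inspect; the Elements panel in developer tools jumps to that node
    # Tip: in the Elements panel, press Ctrl+F to open the search box and try out XPath expressions there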
    dom = etree.HTML(html)
    xpath_pattern = '//div[@class="post f list-post"]/div[@class="indexs"]/h2/a/text()'
    titles = dom.xpath(xpath_pattern)
    print('titles',titles)
    
    for i in titles:
        print(i)
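
One plausible reason for the 403 without a request header: by default requests announces itself with a `python-requests/...` User-Agent, which some sites reject. The sketch below only builds the request locally (nothing is sent), so you can inspect what would go out. That jandan.net filters on the User-Agent is an assumption here, not something the site documents.

```python
import requests

url = 'http://jandan.net/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36',
}

# The User-Agent that requests uses when no headers are supplied
default_ua = requests.utils.default_user_agent()
print(default_ua)

# Build the request without sending it, to inspect the outgoing headers
prepared = requests.Request('GET', url, headers=headers).prepare()
print(prepared.headers['User-Agent'])
```

If a site still answers 403 with a browser-like User-Agent, it may be checking other headers or cookies, and this sketch will not help.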

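2. NetEase News top headlines text scraper

One thing worth mentioning: news sites pull articles from other sources, get heavy traffic, and are themselves scraped by other sites. In my tests NetEase News was very tolerant: no forged request header was needed, and a large number of requests did not get the IP blocked.
Note: for a while, an XPath expression for NetEase News would match in developer tools but not in code. The site uses a Node.js-style front-end framework, so after the first request returns the HTML, JavaScript modifies the page dynamically. Treat the response shown in the Network panel, which is the same HTML your code receives from its first request, as the source of truth.
Here is a simple example of mine for reference: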

# NetEase News top headlines scraper
import requests
from lxml import etree

url = 'https://news.163.com/'

# headers = ''     # request headers
resp = requests.get(url)
if resp.status_code == 200:
    html = resp.text  # resp.text -> the HTML source code
    #print(html)
    dom = etree.HTML(html)  # parse into a tree structure
    xpath_pattern = '//ul/li[@class="top "]/a/text()'
    titles = dom.xpath(xpath_pattern)
    print('titles',titles)

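3. NetEase News hot ranking scraper

This point deserves a second look. For a while, an XPath expression for NetEase News would match in developer tools but find nothing in code, because the site uses a Node.js-style front-end framework: after the first request returns the HTML, JavaScript modifies the page dynamically. Treat the response shown in the Network panel, which is the same HTML your code receives from its first request, as the source of truth.
The symptom: the XPath expression is verified in the browser's developer tools, yet in code it finds nothing and returns an empty list.
Comparing against the raw HTML from the first request, the class attribute has three values there but four in developer tools. The extra value is generated later by JavaScript, so the raw HTML is what counts.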
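
The class mismatch can be reproduced offline. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for lxml (its XPath support is limited but includes exact attribute matches). The two snippets are hand-written, and the fourth class value `js-added` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Raw HTML as returned by the first request (three class values)
raw = ('<body><div class="mt35 mod_hot_rank clearfix">'
       '<ul><li><a href="/a.html">A</a></li></ul></div></body>')
# What developer tools might show after JavaScript runs ("js-added" is hypothetical)
rendered = ('<body><div class="mt35 mod_hot_rank clearfix js-added">'
            '<ul><li><a href="/a.html">A</a></li></ul></div></body>')

# An expression copied from developer tools: exact match on all four values
exact = ".//div[@class='mt35 mod_hot_rank clearfix js-added']/ul/li/a"

def hrefs_exact(markup, path):
    """Return hrefs found by an exact-match path expression."""
    return [a.get('href') for a in ET.fromstring(markup).findall(path)]

def hrefs_by_class(markup, name):
    """Return hrefs under any div whose class list contains name."""
    root = ET.fromstring(markup)
    found = []
    for div in root.iter('div'):
        if name in div.get('class', '').split():
            found.extend(a.get('href') for a in div.findall('./ul/li/a'))
    return found

print(hrefs_exact(rendered, exact))          # matches the rendered page
print(hrefs_exact(raw, exact))               # empty list on the raw HTML
print(hrefs_by_class(raw, 'mod_hot_rank'))   # matches regardless of extra values
```

With lxml, the same tolerance is usually obtained by writing the predicate as `contains(@class, "mod_hot_rank")` instead of comparing the whole attribute string.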

To make this easier to follow, here is an example:

# NetEase News mid-page hot ranking scraper
import requests
from lxml import etree

url = 'https://news.163.com/'

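# headers = ''     # request headers

# News sites pull articles from other sources, get heavy traffic, and are themselves scraped by other sites.
# In testing, NetEase News was very tolerant: no forged request header was needed, and many requests did not get the IP blocked.
# Note: for a while, XPath would match in developer tools but not in code, because the site uses a Node.js-style front-end framework
# and JavaScript modifies the page dynamically after the first request returns the HTML.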
resp = requests.get(url)
if resp.status_code == 200:
    html = resp.text  # resp.text -> the HTML source code
    #print(html)
    dom = etree.HTML(html)  # parse into a tree structure
    xpath_pattern = '//div[@class="mt35 mod_hot_rank clearfix"]/ul/li/a/@href'
    news_href_pattern = dom.xpath(xpath_pattern)
    print(news_href_pattern)

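    # Find the article detail pages
    # First take the href values of the a tags from the first request, then send a second request for each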
    for news_href in news_href_pattern:
        print(news_href)
        resp2 = requests.get(news_href)
        html2 = resp2.text
        #print(html2)
        dom2 = etree.HTML(html2)
        xpath_pattern2 = '//div[@class="post_body"]/p/text()'
        titles = dom2.xpath(xpath_pattern2)
        print('titles',titles)
        for t in titles:
            print(t)
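
The loop above passes each href straight to requests.get, which only works if every href is already an absolute URL. A minimal sketch of normalizing the list first with the standard library's urljoin follows; the sample hrefs are made up, and whether NetEase ever emits relative or duplicate links is an assumption.

```python
from urllib.parse import urljoin

base_url = 'https://news.163.com/'

def absolute_links(base, hrefs):
    """Resolve hrefs against base, dropping duplicates and non-HTTP links."""
    seen = set()
    result = []
    for href in hrefs:
        full = urljoin(base, href.strip())
        if not full.startswith(('http://', 'https://')):
            continue  # e.g. javascript: pseudo-links
        if full not in seen:
            seen.add(full)
            result.append(full)
    return result

# Hypothetical hrefs, standing in for the list returned by dom.xpath(...)
sample_hrefs = [
    'https://www.163.com/news/article/EXAMPLE1.html',
    '//www.163.com/news/article/EXAMPLE2.html',
    '/news/article/EXAMPLE3.html',
    'javascript:void(0)',
    'https://www.163.com/news/article/EXAMPLE1.html',
]
links = absolute_links(base_url, sample_hrefs)
for link in links:
    print(link)
```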

4. Parsing HTML strings with regular expressions

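① We have already used requests to simulate a request and obtained the page source: a str whose content is HTML and which still has to be parsed.
The string's built-in find method is of limited use, as shown here: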

html = '<html><body><h1>标题</h1></body></html>'
start_index = html.find('<h1>')
end_index = html.find('</h1>')
print(html[start_index + len('<h1>'):end_index])  # skip past the opening tag so only the text is printed
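
To see where find runs out of steam, here is a small sketch that wraps the same idea in a helper. The function name `text_between` and the sample HTML are made up for illustration.

```python
def text_between(text, start_tag, end_tag):
    """Return the text between the first start_tag and the next end_tag, or None."""
    start = text.find(start_tag)
    if start == -1:
        return None
    start += len(start_tag)          # move past the opening tag itself
    end = text.find(end_tag, start)  # search only after the opening tag
    if end == -1:
        return None
    return text[start:end]

html = '<html><body><h1>Title</h1><h1 class="sub">Subtitle</h1></body></html>'
print(text_between(html, '<h1>', '</h1>'))  # only the first heading is found
print(text_between(html, '<h2>', '</h2>'))  # a missing tag gives None
```

The helper finds only the first match, and the second heading is invisible to it because `<h1 class="sub">` is not the literal string `<h1>`. Limits like these are why a real parsing approach is needed.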

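So there are three ways to parse it.
Approach 1: regular expressions (regex), a syntax designed specifically for string processing
(not recommended; just be aware of it)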

import re
text1 = 'ilikepythonbutyouarebeautiful'
pattern1 = re.compile(r'python')
matcher1 = re.search(pattern1,text1)
print(matcher1[0])

text2 = '<h1>i like world</h1>'
pattern2 = re.compile(r'<h1>.+</h1>')  # . matches any single character except a newline; + repeats it one or more times
matcher2 = re.search(pattern2,text2)
print(matcher2[0])

text3 = 'beautiful'
text4 = 'you are a good boy'
text5 = '13243454454@qq.com'
# e.g. validating an email address at registration

# Reference  https://tool.oschina.net/uploads/apidocs/jquery/regexp.html
# Common regexes https://www.cnblogs.com/qq364735538/p/11099572.html

text6 = """
<html>
aaacc<h1>adsd
sss
</h1>
aaaa
</html>
"""
pattern10 = re.compile(r'<h1>(.*?)</h1>',re.S)
print(pattern10.findall(text6))

# Copy the HTML tags of the target region on the page into the code above, and replace the information you want to capture with (.*?)
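
The `?` in `(.*?)` matters as soon as the page contains more than one target. The sketch below applies the copy-and-replace recipe to a made-up list of links and compares the greedy and non-greedy forms; the HTML snippet is hypothetical.

```python
import re

# Hypothetical HTML copied from a page: two links, one title broken across lines
snippet = """
<ul>
  <li><a href="/post/1">First
  post</a></li>
  <li><a href="/post/2">Second post</a></li>
</ul>
"""

# Same template, once with greedy (.*) and once with non-greedy (.*?)
# re.S lets . match newlines, so titles spanning several lines are captured
greedy = re.compile(r'<a href="(.*)">(.*)</a>', re.S)
lazy = re.compile(r'<a href="(.*?)">(.*?)</a>', re.S)

greedy_matches = greedy.findall(snippet)
lazy_matches = lazy.findall(snippet)

print(len(greedy_matches))  # the greedy form swallows both links into one match
for href, title in lazy_matches:
    # Collapse the line break and indentation inside the captured title
    print(href, ' '.join(title.split()))
```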

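5. ivsky (天堂图片网) image scraper

Scraping images is a more complete exercise: it draws on more of what we have covered and the code is longer. I walk through the code for this example below; please bear with any shortcomings.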

# ivsky image scraper; this site has almost no anti-scraping measures
import os
import requests
from lxml import etree

# home_url = 'https://www.ivsky.com/'
# catalog_url = 'https://www.ivsky.com/tupian/dongwutupian/index_2.html'

# Album page, which lists the thumbnails
album_url = 'https://www.ivsky.com/tupian/lugui_v62472/'

headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'
}
# Request the album page (send the same headers that are used for the image requests)
resp = requests.get(album_url,headers=headers)
status_code = resp.status_code
print(status_code)
album_html = resp.text
print(album_html)

# Collect all thumbnail images in one album
album_dom = etree.HTML(album_html)
title_pattern = '//h1/text()'
img_pattern = '//ul[@class="pli"]/li/div/a/img/@src'
album_title = album_dom.xpath(title_pattern)[0]
album_title = album_title.strip()
img_src_list = album_dom.xpath(img_pattern)
print(album_title)
print(len(img_src_list),img_src_list)

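# Create a folder named after the album, next to this script (the same place the images are written below)
album_dir = os.path.join(os.path.dirname(__file__),album_title)
os.makedirs(album_dir,exist_ok=True)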

# Loop over the image URL list and request each image
for i,img_src in enumerate(img_src_list):
    # Build the full image URL
    img_src = 'https:' + img_src
    print(img_src)
    resp = requests.get(img_src,headers=headers)
    print(resp.status_code)
    img_content_bytes = resp.content

    # Write the image's binary data to a local file
    img_path = os.path.join(os.path.dirname(__file__),album_title,f'{i+1}.jpg')
    with open(img_path,mode='wb') as f:
        f.write(img_content_bytes)
        print(f'Image {i+1} saved to {img_path}')
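
Two details of the script above are fragile: the album title is used as a folder name as-is, and every file is saved as .jpg whatever the URL says. A minimal sketch of handling both with the standard library follows. The helper names `safe_dirname` and `image_path`, the sample title, and the example.com URL are all made up for illustration, and the bytes written are a placeholder rather than a downloaded image.

```python
import os
import re
import tempfile
from urllib.parse import urlparse

def safe_dirname(title):
    """Replace characters that Windows forbids in file names; trim whitespace and dots."""
    cleaned = re.sub(r'[\\/:*?"<>|]', '_', title).strip().strip('.')
    return cleaned or 'album'

def image_path(base_dir, album_title, index, img_url):
    """Build the target path, keeping the extension from the URL (default .jpg)."""
    ext = os.path.splitext(urlparse(img_url).path)[1] or '.jpg'
    album_dir = os.path.join(base_dir, safe_dirname(album_title))
    os.makedirs(album_dir, exist_ok=True)  # no error if the folder already exists
    return os.path.join(album_dir, f'{index}{ext}')

# Use a temporary directory so the sketch does not touch the script's own folder
base = tempfile.mkdtemp()
path = image_path(base, 'turtles: set 1/2', 1, 'https://img.example.com/pre/abc.png?x=1')
with open(path, 'wb') as f:
    f.write(b'placeholder bytes')  # in the scraper this would be resp.content
print(path)
```

In the scraper, `image_path` would take the place of the hard-coded os.path.join call, with os.path.dirname(__file__) as base_dir.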