Python Web Crawlers

Latest recommended article published on 2020-07-22 10:54:00

wang转圈儿

Category: python

Article link: https://blog.csdn.net/weixin_42168987/article/details/104588216

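Web crawlers

What is a crawler

A crawler:
simulates a client (a browser) by sending web requests, receiving the responses, and extracting data according to rules.
Where the crawled data goes:
Presentation: displayed on a web page or in an app;
Analysis: finding patterns in the data.
URL requests
URL: protocol + site domain name + resource path + parameters (the query string starts with ? and its parameters are joined with &)
The response for the current URL: find the URL in the Network panel of the browser's developer tools and click Response; or right-click the page and view the page source.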
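
To make the URL structure concrete, here is a minimal sketch that builds a URL from its parts and splits it back apart with the standard library. The Baidu search path is taken from later in this article; the parameter names and values are only illustrative assumptions.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Illustrative query parameters (assumed values, not taken from a real session)
params = {"wd": "python", "pn": "10"}

# protocol + domain + path, then "?" and the parameters joined with "&"
url = "https://www.baidu.com/s?" + urlencode(params)

# Split the URL back into its components
parts = urlsplit(url)
query = parse_qs(parts.query)

print(parts.scheme, parts.netloc, parts.path, parts.query)
```

Here `urlencode` also percent-encodes any characters that are not safe in a URL, which is why it is preferable to joining the strings by hand.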

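Steps for developing a crawler

Identify the target data
- the website
- the page
Analyze how the data is loaded
- work out which URL serves the target data
Download the data
Clean and process the data
Persist the data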
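
The steps above can be sketched as one small function per stage. This is only an outline under stated assumptions: the sample page, the `item` class name, the function names, and the JSON output format are all hypothetical, and the download step is stubbed with a fixed string so the sketch runs offline.

```python
import json
import re

# Hypothetical page content standing in for a real download
SAMPLE_PAGE = """
<ul>
  <li class="item"> First title </li>
  <li class="item">Second title</li>
</ul>
"""


def download(url):
    """Download step, stubbed here. A real crawler would return
    something like requests.get(url, headers=...).text instead."""
    return SAMPLE_PAGE


def extract(html):
    """Pull the raw target strings out of the page."""
    return re.findall(r'<li class="item">(.*?)</li>', html, re.S)


def clean(items):
    """Strip surrounding whitespace and drop empty entries."""
    return [item.strip() for item in items if item.strip()]


def persist(items, path):
    """Write the cleaned data to disk as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(items, f, ensure_ascii=False)


def crawl(url, path):
    """Run download, extraction, cleaning, and persistence in order."""
    items = clean(extract(download(url)))
    persist(items, path)
    return items
```

Keeping each stage in its own function makes it easier to swap the regular expression for another parser, or the JSON file for a database, without touching the other stages.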

Code examples

Example 1: a novel website

import re
import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0"}
base_url = 'https://jingcaiyuedu.com'
url = base_url + '/novel/bbQAZ/list.html'

# Send an HTTP request that looks like it comes from a browser
response = requests.get(url, headers=headers, timeout=10)
# Encoding used to decode the response body
response.encoding = 'utf-8'
# Source of the novel's index page
html = response.text
# Title of the novel
title = re.findall(r'<h1>(.*?)</h1>', html)[0]

# Collect (relative url, chapter title) for every chapter.
# In a regular expression . matches any character except a newline;
# re.S makes . match newlines as well.
dl = re.findall(r'<dl class="panel-body panel-chapterlist">.*?</dl>', html, re.S)[0]
chapter_info_list = re.findall(r'<a href="(.*?)">(.*?)</a>', dl)

# Fragments removed from each chapter, in this order. Spaces are removed
# early, so the later fragments are written without spaces.
junk = [
    '</p>', ' ', '<p>',
    '精彩小说网<ahref="https://www.jingcaiyuedu.com">www.jingcaiyuedu<fontcolor="red">.com</font></a>，最快更新<ahref="https://www.jingcaiyuedu.com/novel/bbQAZ.html">仙道至尊</a>最新章节！',
    '<div>', '<insclass="adsbygoogle"', 'style="display:block"',
    'data-ad-client="ca-pub-5369371977356587"', 'data-ad-slot="8549019189"',
    'data-ad-format="auto"', 'data-full-width-responsive="true"></ins>',
    '<script>', '(adsbygoogle=window.adsbygoogle||[]).push({});',
    '</script>', '</div>',
]

# Create a file for the novel; the with block closes it when the loop ends
with open('%s.txt' % title, 'w', encoding='utf-8') as fb:
    # Download the chapters one by one
    for chapter_url, chapter_title in chapter_info_list:
        chapter_url = base_url + chapter_url
        chapter_response = requests.get(chapter_url, headers=headers, timeout=10)
        chapter_response.encoding = 'utf-8'
        chapter_html = chapter_response.text
        matches = re.findall(r'<div class="panel-body" id="htmlContent">(.*?)</p>                        <div>', chapter_html, re.S)
        if not matches:
            # Skip chapters whose page layout does not match the pattern
            continue
        chapter_content = matches[0]
        # Clean the data
        for fragment in junk:
            chapter_content = chapter_content.replace(fragment, '')
        # Persist the data
        fb.write(chapter_title)
        fb.write(chapter_content)
        fb.write('\n')
        print(chapter_url)
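
The chapter-list pattern in Example 1 passes re.S because the `<dl>` block spans several lines. The short sketch below, run on a made-up HTML fragment, shows what the flag changes: without it `.` matches everything except a newline (ordinary spaces are matched either way), and with it `.` matches newlines too.

```python
import re

# Made-up fragment that spans several lines, like a chapter list
html = '<dl class="chapters">\n<a href="/c/1.html">Chapter 1</a>\n</dl>'

# Without re.S, "." stops at the newline, so the multi-line block is not found
without_flag = re.findall(r'<dl class="chapters">.*?</dl>', html)

# With re.S, "." also matches newlines and the whole block is captured
with_flag = re.findall(r'<dl class="chapters">.*?</dl>', html, re.S)

# "." matches an ordinary space even without re.S
space_match = re.findall(r'a.b', 'a b')

print(len(without_flag), len(with_flag), space_match)
```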

Example 2: scraping the Douban movie top 25

import re
import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0"}
url = 'https://movie.douban.com/top250'
response = requests.get(url, headers=headers, timeout=10)
response.encoding = "utf-8"
html = response.text

# One film per <li>: title, director, cast, year, country, genre, score, one-line quote
film_pattern = re.compile('<li.*?hd.*?title">(.*?)</span>.*?<p.*?>(.*?);&nbsp;&nbsp;(.*?)<br>(.*?)&nbsp;/&nbsp;(.*?)&nbsp;/&nbsp;(.*?)</p>.*?average">(.*?)</span>.*?inq">(.*?)</span>.*?</li>', re.S)


def squeeze(text):
    # Drop the newlines and padding spaces left over from the page layout
    return text.replace('\n', '').replace(' ', '')


with open('douban.txt', 'w', encoding='utf-8') as fb:
    # re.findall returns a list; use the matched text itself, because str(ol)
    # would turn every real newline into the two characters \ and n
    ol = re.findall(r'<ol class="grid_view">(.*?)</ol>', html, re.S)
    ol_html = ol[0] if ol else ''
    for i, film in enumerate(film_pattern.findall(ol_html), start=1):
        title, director, actor, year, country, genre, score, _quote = film
        director = squeeze(director)
        year = squeeze(year)
        genre = squeeze(genre)
        fb.write("%d. " % i + title + '\n' + director + '\n' + actor + '\n' + year + '\n' + country + '\n' + genre + '\n' + 'Score: ' + score + '\n--*---*---*---\n')

douban.txt:
[screenshot of the output file not preserved]

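The requests library

requests.get(url, params=None, **kwargs)
url: the URL of the page to fetch
params: extra parameters added to the URL, as a dictionary or bytes; optional
**kwargs: 12 parameters that control the request
Two important objects in the requests library:
1. Response: holds the content returned to the crawler
2. Request
Attributes of the Response object
Response encoding
Exceptions raised by the requests library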
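
The three topics above (Response attributes, encoding, and exceptions) often appear together in a small fetch helper. The sketch below is one possible shape for it, not a prescribed one: the function name `fetch_text` and the choice to return None on failure are assumptions made for illustration.

```python
import requests


def fetch_text(url, timeout=10):
    """Return the decoded page text, or None if the request fails."""
    try:
        r = requests.get(url, timeout=timeout)
        # r.status_code holds the HTTP status; raise_for_status() raises
        # requests.HTTPError when it is a 4xx or 5xx code
        r.raise_for_status()
        # r.encoding is guessed from the HTTP headers, while
        # r.apparent_encoding is guessed from the body itself
        r.encoding = r.apparent_encoding
        # r.text is the body decoded with r.encoding; r.content is the raw bytes
        return r.text
    except requests.RequestException:
        # Base class of ConnectionError, HTTPError, Timeout, and the
        # other exceptions raised by the library
        return None
```

A call such as `fetch_text("https://www.baidu.com")` would then either return the page source or None, so the caller does not need its own try/except.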
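HTTP operations on resources

Each HTTP method corresponds to one method of the requests library:
Main methods of the requests library
requests.request(method, url, **kwargs):
method: the request method, one of seven: 'GET', 'HEAD', 'POST', 'PUT', 'PATCH', 'DELETE', 'OPTIONS';
url: the URL of the page to fetch;
**kwargs: 13 parameters that control the request:
params: dictionary or bytes, added to the URL as query parameters
data: dictionary, bytes, or file object, sent as the body of the request
json: JSON data, sent as the body of the request
headers: dictionary of custom HTTP headers
cookies: dictionary or CookieJar, the cookies sent with the request
auth: tuple, enables HTTP authentication
files: dictionary, for uploading files
timeout: timeout in seconds
proxies: dictionary of proxy servers to route the request through; login credentials can be included
allow_redirects: True/False, default True; whether redirects are followed
stream: True/False, default False; when False the response body is downloaded immediately
verify: True/False, default True; whether the SSL certificate is verified
cert: path to a local SSL certificate
requests.get(url, params=None, **kwargs):
url: the URL of the page to fetch
params: extra parameters added to the URL, as a dictionary or bytes
**kwargs: 12 parameters that control the request (all except params)
requests.head(url, **kwargs):
url: the URL of the page to fetch
**kwargs: 13 parameters that control the request
requests.post(url, data=None, json=None, **kwargs):
url: the URL of the page to fetch
data: dictionary, bytes, or file object, sent as the body of the request
json: JSON data, sent as the body of the request
**kwargs: 11 parameters that control the request (all except data and json)
requests.put(url, data=None, **kwargs):
url: the URL of the page to fetch
data: dictionary, bytes, or file object, sent as the body of the request
**kwargs: 12 parameters that control the request (all except data)
requests.patch(url, data=None, **kwargs):
url: the URL of the page to fetch
data: dictionary, bytes, or file object, sent as the body of the request
**kwargs: 12 parameters that control the request (all except data)
requests.delete(url, **kwargs):
url: the URL of the page to fetch
**kwargs: 13 parameters that control the request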
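
To see how params, data, and json end up in a request, the sketch below builds three requests and inspects them without sending anything. The endpoint URL and the field names are hypothetical; `requests.Request(...).prepare()` is used here only because it lets the result be examined offline.

```python
import json
import requests

# Hypothetical endpoint; nothing is sent to it in this sketch
url = "https://example.com/api"

# params is encoded into the query string of the URL
get_req = requests.Request("GET", url, params={"page": "2"}).prepare()

# data (a dictionary) is form-encoded into the request body
form_req = requests.Request("POST", url, data={"name": "crawler"}).prepare()

# json is serialized to JSON and sent as the request body
json_req = requests.Request("POST", url, json={"name": "crawler"}).prepare()

print(get_req.url)
print(form_req.headers["Content-Type"], form_req.body)
print(json_req.headers["Content-Type"], json.loads(json_req.body))
```

The convenience functions take the same arguments, so `requests.post(url, data={"name": "crawler"})` would send the form-encoded version directly.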
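Scraping an Amazon product page
Some websites detect that a visit comes from Python's requests library and restrict it; in that case, change the request headers to imitate a browser.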
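
As a rough illustration of why the header change matters, the sketch below compares the User-Agent that requests sends by default with a browser-like one. The URL is a placeholder rather than a real product page, and the requests are only prepared, never sent.

```python
import requests

# Placeholder URL (an assumption, not a real product page)
url = "https://www.example.com/item"

# Browser-like User-Agent, the same string used in the examples of this article
browser_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0"
}

session = requests.Session()

# Without custom headers the library identifies itself as python-requests/<version>
plain = session.prepare_request(requests.Request("GET", url))

# Headers passed with the request override the session defaults
disguised = session.prepare_request(requests.Request("GET", url, headers=browser_headers))

print(plain.headers["User-Agent"])
print(disguised.headers["User-Agent"])
```

In an actual crawl the same dictionary is simply passed as `requests.get(url, headers=browser_headers)`.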
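Submitting search keywords to Baidu and 360
Baidu's keyword interface: https://www.baidu.com/s?wd=keyword
360's keyword interface: https://www.so.com/s?q=keyword
Building the corresponding URL is enough to submit a keyword.
Pass the key-value pair through params to have it added to the URL.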
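
A minimal sketch of the params approach, using the two interfaces listed above. The helper name `search_url` and the keyword "Python" are assumptions for illustration, and the URL is only built, not requested.

```python
import requests


def search_url(base, key, keyword):
    """Build the search URL that requests would use for the given keyword."""
    req = requests.Request("GET", base, params={key: keyword})
    return req.prepare().url


baidu = search_url("https://www.baidu.com/s", "wd", "Python")
so360 = search_url("https://www.so.com/s", "q", "Python")

print(baidu)
print(so360)
```

When fetching for real, `requests.get("https://www.baidu.com/s", params={"wd": "Python"})` builds the same URL, and the final address can be read back from `r.request.url`.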
Scraping and saving images
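
One way to do this, sketched under assumptions: the file name is taken from the last segment of the image URL, files go into a local `images` directory, and the function names are hypothetical. The key point is that an image is binary data, so it is read from `r.content` and written in binary mode.

```python
import os
from urllib.parse import urlsplit

import requests


def image_path(url, root="images"):
    """Derive a local file path from the last segment of the URL path."""
    name = os.path.basename(urlsplit(url).path)
    return os.path.join(root, name)


def save_image(url, root="images", timeout=10):
    """Download the image at url into root and return the local path."""
    path = image_path(url, root)
    os.makedirs(root, exist_ok=True)
    if os.path.exists(path):
        # Already downloaded, nothing to do
        return path
    r = requests.get(url, timeout=timeout)
    r.raise_for_status()
    # r.content holds the raw bytes of the response, so open in binary mode
    with open(path, "wb") as f:
        f.write(r.content)
    return path
```

Calling `save_image` with the address of any picture would then store it under `images/` and skip the download on later runs.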
Looking up the location of an IP address
The site m.ip138.com can be used to look up the location of an IP address.

The BeautifulSoup library

A library for parsing, traversing, and maintaining the "tag tree".

BeautifulSoup(data, "html.parser"):

import requests
from bs4 import BeautifulSoup

url="http://www.baidu.com/s"
r=requests.get(url)
data=r.text
soup=BeautifulSoup(data,"html.parser")

First argument, data: the HTML content to parse.
Second argument: the parser used to parse the HTML.
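
To show what the parsed "tag tree" looks like in use, here is a small sketch that parses a made-up HTML string instead of a downloaded page, so it runs offline. The page content is invented for illustration; the calls themselves (`soup.title`, `soup.a`, `find_all`, `attrs`) are standard BeautifulSoup usage.

```python
from bs4 import BeautifulSoup

# Made-up page used instead of a real download
html = """
<html><head><title>Demo page</title></head>
<body>
<p class="intro">Hello, <a href="/a.html" id="first">first link</a></p>
<a href="/b.html">second link</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# soup.<tag> returns the first tag with that name
title = soup.title.string
first_link = soup.a

# Each tag exposes its name, its attributes, its text, and its parent
print(first_link.name, first_link.attrs, first_link.string, first_link.parent.name)

# find_all returns every matching tag in the document
links = [a.get("href") for a in soup.find_all("a")]
print(title, links)
```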

Parsers:
[image not preserved]
Basic elements of the BeautifulSoup class:
