Workflow of a crawler:
Define the requirement: what information do we want to scrape?
Find the source: which website has it, and what does its page structure look like?
Send the request: start the crawler and issue the HTTP request
Parse the data: extract the target fields from the response
Store the data: save the results
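The steps above can be sketched as a tiny pipeline. This is a minimal illustration, not code from the post: fetch() is stubbed with a hard-coded page so the sketch runs without network access (a real crawler would call requests.get there).

```python
# Minimal sketch of the crawler workflow: fetch -> parse -> store.
# fetch() is stubbed so the pipeline runs offline.

def fetch(url):
    # In a real crawler this would be: requests.get(url).text
    return '<html><title>demo page</title></html>'

def parse(html):
    # Extract the target field; here we pull the <title> text.
    start = html.find('<title>') + len('<title>')
    end = html.find('</title>')
    return html[start:end]

def store(record, bucket):
    # "Storage" here is just appending to a list.
    bucket.append(record)

results = []
store(parse(fetch('http://example.com')), results)
print(results)   # ['demo page']
```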
Environment:
OS: Windows 10
Python version: 3.7
IDE: PyCharm
Installing and using the Requests library:
Fetching a page is usually done with GET, whose parameters are appended to the URL; submitting a form is usually done with POST, whose parameters travel in the request body rather than in the URL.
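You can see this difference without sending anything over the network, by inspecting a prepared request (the httpbin.org URLs below are placeholders; requests.Request(...).prepare() builds the request without sending it):

```python
import requests

# GET: parameters are appended to the URL as a query string
get_req = requests.Request('GET', 'http://httpbin.org/get',
                           params={'q': 'python'}).prepare()
print(get_req.url)    # http://httpbin.org/get?q=python
print(get_req.body)   # None - this GET carries no body

# POST: parameters are form-encoded into the request body
post_req = requests.Request('POST', 'http://httpbin.org/post',
                            data={'kw': 'hello'}).prepare()
print(post_req.url)   # http://httpbin.org/post - URL unchanged
print(post_req.body)  # kw=hello
```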
Installation:
pip install requests
To check that the installation succeeded:
pip show requests
If it succeeded, pip prints the package information (name, version, install location, and so on).
Requesting www.baidu.com with the GET method:
import requests                 # import the library

url = 'http://www.baidu.com'    # the URL to GET
res = requests.get(url)         # send the GET request and collect the response
print(res.content)              # response body as bytes
print(res.status_code)          # status code; 200 means OK
print(res.request.headers)      # the request headers that were sent
Sometimes a site's anti-crawling checks will block the crawler outright. We can set a custom User-Agent header so the request looks like it comes from a normal browser:
import requests

# This site rejects the library's default User-Agent, which identifies the
# client as a Python script, so we present a browser User-Agent instead.
url = 'https://www.xicidaili.com/nn/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18362'
}
res = requests.get(url=url, headers=headers)
code = res.status_code
print(code)
if code == 200:
    with open('./test.html', 'w', encoding='utf-8') as fp:
        fp.write(res.text)
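To see why the unmodified request gets flagged, note that requests announces itself in the User-Agent header by default, and that any header you pass in explicitly replaces the default. A small offline check (the example.com URL is a placeholder; the request is prepared but never sent):

```python
import requests

# The library's default User-Agent identifies the client as a script,
# e.g. 'python-requests/2.x.y' - this is what anti-crawling filters key on.
print(requests.utils.default_user_agent())

# Headers you supply override the defaults, so the server sees your value.
req = requests.Request('GET', 'http://example.com',
                       headers={'User-Agent': 'Mozilla/5.0'}).prepare()
print(req.headers['User-Agent'])   # Mozilla/5.0
```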
Making a request with the POST method:
import requests

url = 'https://fanyi.baidu.com/?aldtype=16047#auto/zh'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18362'
}
data = {
    'kw': '你好'    # the form field to submit
}
res = requests.post(url=url, headers=headers, data=data)
print(res.status_code)
if res.status_code == 200:
    with open('./fanyi.html', 'w', encoding='utf-8') as fp:
        fp.write(res.text)
Complete example:
Scraping the recommended articles from the CSDN homepage:
We use BeautifulSoup to parse the page, with headers that disguise the request as our own browser. The script prints each article's title, author name, and summary.
import requests
from bs4 import BeautifulSoup

url = 'https://www.csdn.net/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18362',
    'Cookie': 'Hm_ct_6bcd52f51e9b3dce32bec4a3997715ac=6525*1*10_18807340900-1577241599852-974584!5744*1*caicai779369786; uuid_tt_dd=10_18807340900-1577241599852-974584; UserNick=cs_yougar; dc_session_id=10_1577241599852.972347; UserName=caicai779369786; UserToken=a69c341ae7324fc485d7e644984332ea; searchHistoryArray=%255B%2522xpath%2522%252C%2522%25E7%2588%25B1%25E5%25A5%2587%25E8%2589%25BA%25E8%25BD%25AC%25E7%25A0%2581%2522%252C%2522dos%2520%25E8%25A7%2586%25E9%25A2%2591%25E8%25BD%25AC%25E7%25A0%2581%2522%252C%2522%25E7%2588%25B1%25E5%25A5%2587%25E8%2589%25BA%25E8%25BD%25AC%25E6%258D%25A2mp4%25E6%25A0%25BC%25E5%25BC%258F%2522%252C%2522python%25E5%259F%25BA%25E7%25A1%2580%25E6%2595%2599%25E7%25A8%258B%2522%255D; BT=1577862438892; p_uid=U000000; UserInfo=a69c341ae7324fc485d7e644984332ea; AU=69D; UN=caicai779369786; Hm_lvt_6bcd52f51e9b3dce32bec4a3997715ac=1579057054,1579057582,1579060359,1579231198; dc_tos=q48ela; Hm_lpvt_6bcd52f51e9b3dce32bec4a3997715ac=1579231198; announcement=%257B%2522isLogin%2522%253Atrue%252C%2522announcementUrl%2522%253A%2522https%253A%252F%252Fblog.csdn.net%252Fblogdevteam%252Farticle%252Fdetails%252F103603408%2522%252C%2522announcementCount%2522%253A0%252C%2522announcementExpire%2522%253A3600000%257D'
}
res = requests.get(url=url, headers=headers)
print(res.status_code)

soup = BeautifulSoup(res.text, "html.parser")
titles = soup.select('#feedlist_id > li > div > div.title > h2 > a')
writers = soup.select('#feedlist_id > li > div > dl > dd > a')
summaries = soup.select('#feedlist_id > li > div > div.summary.oneline')

for title in titles:          # article titles
    print(title.get_text())
for writer in writers:        # author names
    print(writer.get_text())
for summary in summaries:     # article summaries
    print(summary.get_text())
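Since the three selectors return parallel lists, the fields of each article can also be kept together with zip instead of printing them in three separate passes. A sketch on a minimal HTML snippet that imitates the feed structure (the snippet and its values are invented for illustration):

```python
from bs4 import BeautifulSoup

# Invented stand-in for the CSDN feed markup, matching the selectors above.
html = '''
<ul id="feedlist_id">
  <li><div>
    <div class="title"><h2><a>First post</a></h2></div>
    <dl><dd><a>alice</a></dd></dl>
    <div class="summary oneline">A short summary.</div>
  </div></li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')
titles = soup.select('#feedlist_id > li > div > div.title > h2 > a')
writers = soup.select('#feedlist_id > li > div > dl > dd > a')
summaries = soup.select('#feedlist_id > li > div > div.summary.oneline')

# zip pairs each article's title, author and summary in one pass
for title, writer, summary in zip(titles, writers, summaries):
    print(title.get_text(), '|', writer.get_text(), '|', summary.get_text())
```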
For convenience, here is the dataset I put together: https://download.csdn.net/download/caicai779369786/12160954 (please do not use it commercially).