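I'm a complete Python beginner who only knows the basic syntax. After many days of study the results are encouraging: I can correctly print "Hello world!!".
My motivation for learning Python is simple: I want to see how a crawler is implemented. I have taken a Java Web course, so I skipped the parts "I think I already understand" (only half joking).
A crawler simulates a client (a browser) sending network requests, receives the responses, and extracts data from them according to rules.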
-
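A quick word on URL protocols, which I touched on briefly in an information security course. HTTP: Hypertext Transfer Protocol; data travels in plaintext, so it is fast but insecure. HTTPS: HTTP over SSL/TLS; data is encrypted before it is sent and decrypted on arrival using keys the two sides have agreed on, so it is somewhat slower but secure.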
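At the URL level the two protocols differ only in the scheme and the default port. The sketch below reads both with the standard library's `urllib.parse.urlsplit`. The helper name `describe_url` is invented for this example; the default ports (80 for HTTP, 443 for HTTPS) are the standard ones.

```python
from urllib.parse import urlsplit

# Well-known default ports for the two schemes.
DEFAULT_PORTS = {"http": 80, "https": 443}


def describe_url(url):
    """Return (scheme, host, port, encrypted) for an http/https URL."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    if scheme not in DEFAULT_PORTS:
        raise ValueError("not an http/https URL: %r" % url)
    # parts.port is None when the URL does not spell out a port.
    port = parts.port or DEFAULT_PORTS[scheme]
    return scheme, parts.hostname, port, scheme == "https"


print(describe_url("http://www.baidu.com"))
print(describe_url("https://fanyi.baidu.com/basetrans"))
```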
-
The requests library sends two main kinds of request: GET and POST.
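The practical difference between the two is where the parameters travel: GET appends them to the URL as a query string, while POST carries them in the request body. Here is a minimal offline sketch of that difference. It uses the standard library's `urllib` instead of `requests` so that nothing is actually sent, and the parameter values are placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {"query": "hello", "from": "en", "to": "zh"}
encoded = urlencode(params)  # form-encodes the dict into one string

# GET: the parameters become part of the URL itself.
get_req = Request("https://fanyi.baidu.com/basetrans?" + encoded)

# POST: the same parameters are sent as the (bytes) request body.
post_req = Request("https://fanyi.baidu.com/basetrans", data=encoded.encode("ascii"))

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data)
```

With `requests`, the corresponding calls are `requests.get(url, params=params)` and `requests.post(url, data=params)`.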
-
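Ways to get the page source
---response.content (raw bytes; an attribute, not a method)
---response.content.decode() (the bytes decoded to str, UTF-8 by default)
---response.text (str, decoded with the encoding that requests guesses)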
import requests

# Send a GET request
url = "http://www.baidu.com"
response = requests.get(url)
print(response)  # prints the Response object, e.g. <Response [200]>

# Send a POST request
url = "https://fanyi.baidu.com/basetrans"
# Steps: find the URL in the browser's developer tools, then copy the Form Data
params = {
    "query": "人生苦短,我用python",
    "from": "zh",
    "to": "en"
}
response = requests.post(url, data=params)
print(response)
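The difference between raw bytes and decoded text can be seen without any network call. In this small sketch the variable `raw` is a stand-in for what `response.content` would hold; the sample string is simply the one used in the POST example.

```python
# Stand-in for response.content: the raw bytes a server would send.
raw = "人生苦短,我用python".encode("utf-8")

# response.content.decode() does exactly this: bytes -> str, UTF-8 by default.
print(raw.decode())

# Decoding with the wrong codec produces garbled text, which is why
# response.text can look wrong when the encoding is guessed badly.
garbled = raw.decode("latin-1")
print(garbled == raw.decode())
```

When `response.text` comes out garbled, decoding `response.content` explicitly is usually the more predictable route.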
-
Using the retrying module together with the timeout parameter
# coding: utf-8
from retrying import retry
import requests

'''headers = {
    "user-agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Mobile Safari/537.36",
    "referer": "https://fanyi.baidu.com/"
}'''
headers = {
    "User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Mobile Safari/537.36",
    "Referer": "https://m.douban.com/tv/american"
}


# Run the decorated function up to three times. An error is raised only if all
# three attempts fail; as soon as one attempt succeeds, execution continues.
@retry(stop_max_attempt_number=3)
def _parse_url(url):
    print("*" * 10)  # one line per attempt, so the retries are visible
    # timeout=5: raise an exception if the server does not respond within 5 seconds
    response = requests.get(url, headers=headers, timeout=5)
    return response.content.decode()


def parse_url(url):
    try:
        html_str = _parse_url(url)
    except Exception:
        html_str = None
    return html_str


if __name__ == '__main__':
    url = "http://www.baidu.com"
    url1 = "ww.baidu.com"  # invalid URL (no scheme): every attempt fails, so None is printed
    print(parse_url(url1))
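To make the decorator's behaviour concrete, here is a hand-written sketch of what "retry up to N times" means. It is not the `retrying` library's implementation, only an illustration of the idea; `retry_up_to` and `flaky` are names invented for this example.

```python
import functools


def retry_up_to(max_attempts):
    """Call the wrapped function again after a failure, at most max_attempts times."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)  # success: stop retrying
                except Exception:
                    if attempt == max_attempts:
                        raise  # last attempt failed: let the error propagate
        return wrapper
    return decorator


calls = []


@retry_up_to(3)
def flaky():
    """Fail twice, then succeed (hypothetical stand-in for a network call)."""
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("simulated timeout")
    return "ok"


print(flaky(), "after", len(calls), "attempts")
```

The outer `parse_url` wrapper in the code above plays the complementary role: it catches the error that survives the final attempt and turns it into `None`.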
-
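Requests that need cookies can send them in two ways: (1) put the cookie string in headers; (2) put the cookies in a dictionary and pass it through the cookies parameter.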
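Both approaches start from the cookie string copied out of the browser. A minimal sketch follows; the cookie value is a made-up placeholder, and `cookie_string_to_dict` is a helper name invented here.

```python
def cookie_string_to_dict(cookie_str):
    """Turn 'a=1; b=2' (as copied from the browser) into {'a': '1', 'b': '2'}."""
    cookies = {}
    for pair in cookie_str.split(";"):
        if "=" in pair:
            # Split only once: cookie values may themselves contain '='.
            name, value = pair.strip().split("=", 1)
            cookies[name] = value
    return cookies


cookie_str = "sessionid=abc123; theme=dark"  # placeholder, not a real cookie

# Option 1: send the raw string as a Cookie header.
headers = {"Cookie": cookie_str}

# Option 2: send a dictionary through the cookies= argument.
cookies = cookie_string_to_dict(cookie_str)
print(cookies)
```

Either `requests.get(url, headers=headers)` or `requests.get(url, cookies=cookies)` then sends the cookie along with the request.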
-
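JSON parsing (online formatter: http://www.bejson.com/)
Where JSON data tends to be returned: (1) switch the browser to the mobile version of the site; (2) capture the traffic of the mobile app.
Douban parsing example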
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2019/6/26 20:20
# @Author : Knight
# @Site :
# @File : demo_douban.py
# @Software: PyCharm
import json
import requests

url = "https://m.douban.com/rexxar/api/v2/subject_collection/tv_american/items?os=android&for_mobile=1&start=0&count=18&loc_id=108288&_=1561551375897"
headers = {
    "User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Mobile Safari/537.36",
    "Referer": "https://m.douban.com/tv/american"
}
response = requests.get(url, headers=headers)
response.encoding = 'utf-8'  # only affects response.text; the decode() below already uses UTF-8
json_str = response.content.decode()
ret1 = json.loads(json_str)  # JSON string -> Python dict
print(ret1)
with open("douban.txt", "w", encoding="utf-8") as f:
    # ensure_ascii=False writes non-ASCII characters as-is instead of \uXXXX escapes;
    # indent=2 pretty-prints the output with two spaces per nesting level
    f.write(json.dumps(ret1, ensure_ascii=False, indent=2))
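The effect of the two `json.dumps` arguments is easiest to see on a tiny input. The JSON string below is a made-up sample, not real Douban data.

```python
import json

json_str = '{"title": "示例剧集", "info": {"count": 18}}'  # made-up sample

data = json.loads(json_str)  # JSON text -> Python dict
print(data["info"]["count"])

escaped = json.dumps(data)                               # default: non-ASCII becomes \uXXXX
readable = json.dumps(data, ensure_ascii=False)          # keeps the original characters
pretty = json.dumps(data, ensure_ascii=False, indent=2)  # nested, two spaces per level

print(escaped)
print(readable)
print(pretty)
```

All three strings load back into the same dictionary; the options change only how the text looks on disk.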
-
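Using XPath and the lxml module
I ran into a small problem while installing the XPath extension: the downloaded .crx file could not be added to Chrome, which reported
Package is invalid: "CRX_HEADER_INVALID"
Solution: change the extension of the downloaded .crx file to .rar, extract it, then choose "Load unpacked" on Chrome's extensions page and select the extracted folder.
Shortcut for the XPath extension: Ctrl+Shift+X
Straight to an example: scraping data from Qiushibaike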
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2019/6/28 16:47
# @Author : Knight
# @Site :
# @File : demo_qiubai.py
# @Software: PyCharm
from lxml import etree
import requests
import json


class Qiubaispider:
    def __init__(self):
        self.temp_url = "https://www.qiushibaike.com/hot/page/{}/"
        self.headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"}

    def Get_url_list(self):
        # Pages 1 through 13 of the "hot" listing
        url_list = [self.temp_url.format(i) for i in range(1, 14)]
        return url_list

    def parse_url(self, url):
        print("Current URL:", url)
        # headers must be passed by keyword: the second positional argument of requests.get is params
        response = requests.get(url, headers=self.headers)
        return response.content.decode()

    def Get_content_list(self, html_str):
        html = etree.HTML(html_str)
        # Group first (one div per post), then extract each field relative to that div
        div_list = html.xpath("//div[@id='content-left']/div")
        content_list = []
        for div in div_list:
            item = {}
            item["UserName"] = div.xpath(".//h2/text()")[0].strip() if len(div.xpath(".//h2/text()")) > 0 else None
            item["content"] = div.xpath(".//div[@class='content']/span/text()")
            item["content"] = [i.strip() for i in item["content"]]
            item["stats_vote"] = div.xpath(".//span[@class='stats-vote']//i/text()")
            item["stats_vote"] = item["stats_vote"][0] if len(item["stats_vote"]) > 0 else None
            item["stats_comments"] = div.xpath(".//span[@class='stats-comments']//i/text()")
            item["stats_comments"] = item["stats_comments"][0] if len(item["stats_comments"]) > 0 else None
            item["img"] = div.xpath(".//div[@class='thumb']//img/@src")
            # The src attribute is protocol-relative, so prepend the scheme
            item["img"] = ("https:" + item["img"][0]) if len(item["img"]) > 0 else None
            content_list.append(item)
        return content_list

    def Save(self, content_list):
        # Append one JSON object per line
        with open("qiubai.txt", "a", encoding='utf-8') as f:
            for content in content_list:
                f.write(json.dumps(content, ensure_ascii=False))
                f.write("\n")
        print("Saved successfully")

    def run(self):
        # Main loop: build the URL list, then fetch, extract and save page by page
        url_list = self.Get_url_list()
        for url in url_list:
            html_str = self.parse_url(url)
            content_list = self.Get_content_list(html_str)
            self.Save(content_list)


if __name__ == '__main__':
    qiubaispider = Qiubaispider()
    qiubaispider.run()
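The "group first, then extract relative to each group" pattern used in `Get_content_list` can be tried offline. The sketch below swaps in the standard library's `xml.etree.ElementTree` for lxml so that it runs without a third-party install. ElementTree supports only a subset of XPath (no `text()`, for instance) and needs well-formed markup, so real pages still call for lxml. The HTML snippet is made up and merely shaped like the structure the spider assumes.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed snippet shaped like the page structure assumed above.
html_str = """
<div id="content-left">
  <div><h2> alice </h2><div class="content"><span>first post</span></div></div>
  <div><div class="content"><span>anonymous post</span></div></div>
</div>
"""

root = ET.fromstring(html_str)
items = []
for div in root.findall("./div"):  # step 1: one element per post
    # Step 2: query relative to that element, so fields never mix between posts.
    name = div.find(".//h2")
    span = div.find(".//div[@class='content']/span")
    items.append({
        "UserName": name.text.strip() if name is not None else None,
        "content": span.text.strip() if span is not None else None,
    })

print(items)
```

A missing field yields `None` for that post alone instead of shifting every later value, which is the reason for extracting per group rather than running one page-wide query per field.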