I. Preparation
1. Data source
https://www.lmonkey.com/ask
2. Data fields
Question title, time asked, author, question URL
II. Scraping the data
1. Set the request headers
'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.20 Safari/537.36'
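To make the role of this header concrete, here is a minimal sketch that attaches it to a request object without sending anything. It uses the standard library's urllib.request rather than the requests package used in the rest of this article, purely so it runs offline; the names URL and HEADERS are illustrative. With requests, the equivalent call is requests.get(url, headers=headers); the keyword matters because the second positional parameter of requests.get is params, not headers.

```python
import urllib.request

# URL and HEADERS are illustrative names; the values come from this article.
URL = "https://www.lmonkey.com/ask"
HEADERS = {
    "user-agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/87.0.4280.20 Safari/537.36"
    )
}

# Build the request object only; nothing is sent over the network here.
req = urllib.request.Request(URL, headers=HEADERS)

# urllib stores header names via str.capitalize(), hence "User-agent".
ua = req.get_header("User-agent")
print(ua)
```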
2. Inspect the page source (excerpt)
<div class="flex-fill ml-3">
<a href="https://www.lmonkey.com/ask/22988" target="_blank">
<div class="topic_title mb-0 lh-180 ml-n2">【前端】alert-heading继承色这个属性有问题呀老师,子元素本来就继承父元素颜色用不用这个不都是一样吗这个功能作用有什么用了看不出效果的<small class="float-right text-muted" data-toggle="tooltip" data-placement="top" title="0个回复"><i class="fa fas fa-comments"></i> 0</small></div>
</a>
<p class="lh-140 mb-0 topic_info">
•
<strong>笑看今朝</strong>
•
<span data-toggle="tooltip" data-placement="top" title="2020-07-16 13:33:19">
2个月前
</span>
•
<i class="fa fa-thumbs-up" data-toggle="tooltip" data-placement="top" title="获得0个赞"></i> 0
•
<i class="fa fa-eye" data-toggle="tooltip" data-placement="top" title="查看数127"></i> 127
</p>
</div>
3. Define the regular expressions
# Question title
reg = '(?<=<div class="topic_title mb-0 lh-180 ml-n2">)(.*)(?=<small)'
# Time asked (the title attribute of the <span>)
reg = '(?<=<span data-toggle="tooltip" data-placement="top" title=").*(?=">)'
# Author
reg = '(?<=<strong>).*(?=</strong>)'
# Question URL
reg = '(?<= <a href=")(.*)(?=" target="_blank">)'
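The four patterns can be tried out offline against a small piece of HTML. The sketch below runs them with re.findall on a trimmed copy of the list item shown earlier. The names SAMPLE, PATTERNS and extract_fields are illustrative, the title is shortened for readability, and the leading space before the a tag is an assumption about the real page's indentation (the URL pattern's lookbehind starts with a space).

```python
import re

# Trimmed copy of one list item; the title is shortened and the leading
# space before <a href=...> is assumed, since the URL pattern requires it.
SAMPLE = '''<div class="flex-fill ml-3">
 <a href="https://www.lmonkey.com/ask/22988" target="_blank">
<div class="topic_title mb-0 lh-180 ml-n2">【前端】alert-heading继承色<small class="float-right text-muted" title="0个回复"> 0</small></div>
</a>
<p class="lh-140 mb-0 topic_info">
<strong>笑看今朝</strong>
<span data-toggle="tooltip" data-placement="top" title="2020-07-16 13:33:19">
2个月前
</span>
</p>
</div>'''

# Same patterns as in the article, written as raw strings.
PATTERNS = {
    "title": r'(?<=<div class="topic_title mb-0 lh-180 ml-n2">)(.*)(?=<small)',
    "time": r'(?<=<span data-toggle="tooltip" data-placement="top" title=").*(?=">)',
    "author": r'(?<=<strong>).*(?=</strong>)',
    "url": r'(?<= <a href=")(.*)(?=" target="_blank">)',
}


def extract_fields(page_code):
    """Return a dict mapping each field name to the list of matches found."""
    return {name: re.findall(pattern, page_code) for name, pattern in PATTERNS.items()}


fields = extract_fields(SAMPLE)
for name, values in fields.items():
    print(name, values)
```

Because "." does not match a newline by default, each greedy ".*" stays within a single line of the source. The patterns therefore depend on each tag sitting on its own line, as in the excerpt.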
III. Storing the data
Save as a JSON file
Pack the per-field lists together with zip() and return the result as a single list
page_code = self.get_page(url, headers)
queslist,timelist,authorlist,urllist = self.get_infor(page_code)
return list(zip(queslist,timelist,authorlist,urllist))
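One thing to watch with this packing step: zip() stops at the shortest input, so if one pattern matches more or fewer times than the others, records are silently dropped or misaligned. A minimal sketch of a guarded version follows; pack_fields is a hypothetical helper that is not part of the original script, and the titles and authors are placeholder values.

```python
def pack_fields(titles, times, authors, urls):
    """Zip four parallel lists into a list of (title, time, author, url) tuples."""
    lengths = {len(titles), len(times), len(authors), len(urls)}
    if len(lengths) != 1:
        # zip() would truncate to the shortest list without any warning,
        # so fail loudly when the lists do not line up.
        raise ValueError(f"field lists differ in length: {sorted(lengths)}")
    return list(zip(titles, times, authors, urls))


# Placeholder titles and authors; times and URLs follow the article's format.
rows = pack_fields(
    ["Q1", "Q2"],
    ["2020-10-15 14:15:02", "2020-10-15 13:46:40"],
    ["alice", "bob"],
    ["https://www.lmonkey.com/ask/23002", "https://www.lmonkey.com/ask/23001"],
)
print(rows[0])
```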
Call json.dumps() to serialize the data to a JSON-formatted string
datalist = [{'title': i[0], 'time': i[1], 'author': i[2], 'url': i[3]} for i in data ]
json.dumps(datalist,ensure_ascii=False)
Use a with ... as statement to write the serialized data to the JSON file
with open(filename,'w',encoding="utf-8") as f:
    f.write(json.dumps(datalist,ensure_ascii=False))
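As a quick check of this step, the sketch below writes a single record to a temporary file and reads it back with json.load. The record's title and author are placeholders, the temporary path is illustrative, and indent=2 is an addition here (the snippet above writes everything on one line) so that the file is pretty-printed.

```python
import json
import os
import tempfile

# One placeholder record in the (title, time, author, url) tuple layout.
rows = [("示例问题", "2020-10-15 14:15:02", "示例用户", "https://www.lmonkey.com/ask/23002")]
datalist = [{'title': r[0], 'time': r[1], 'author': r[2], 'url': r[3]} for r in rows]

# Write to a throwaway directory so the example leaves the working dir alone.
path = os.path.join(tempfile.mkdtemp(), "ylrc.json")
with open(path, "w", encoding="utf-8") as f:
    # json.dump writes straight to the file; indent=2 pretty-prints it.
    json.dump(datalist, f, ensure_ascii=False, indent=2)

# Read the file back to confirm the round trip preserves the data.
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded == datalist)
```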
The final data looks like this
[
{
"title": "【Python】请问去哪里下载requests的安装库?",
"time": "2020-10-15 14:15:02",
"author": "君_GV14Do",
"url": "https://www.lmonkey.com/ask/23002"
},
{
"title": "【Python】请问去哪里下载requests的安装库?",
"time": "2020-10-15 13:46:40",
"author": "君_GV14Do",
"url": "https://www.lmonkey.com/ask/23001"
},
{
"title": "【PHP】Composer 镜像老是出错 也没查清原因",
"time": "2020-10-15 10:44:33",
"author": "Claramete",
"url": "https://www.lmonkey.com/ask/23000"
},
{
"title": "【Java】aa",
"time": "2020-10-13 20:00:11",
"author": "好的不学回电话",
"url": "https://www.lmonkey.com/ask/22999"
},
{
"title": "【PHP】源码",
"time": "2020-10-13 11:17:02",
"author": "栗子_DV50zH",
"url": "https://www.lmonkey.com/ask/22998"
},
{
"title": "【Java】as",
"time": "2020-10-06 18:39:54",
"author": "好的不学回电话",
"url": "https://www.lmonkey.com/ask/22997"
},
{
"title": "【PHP】处理公共页头和页脚信息 咋看不了啊",
"time": "2020-09-28 17:36:37",
"author": "N.",
"url": "https://www.lmonkey.com/ask/22996"
},
{
"title": "【PHP】为什么我输出的数据没你那么格式化呢?",
"time": "2020-09-15 15:55:47",
"author": "元曦",
"url": "https://www.lmonkey.com/ask/22995"
},
{
"title": "【PHP】为什么我输出的数据没你那么格式化呢?",
"time": "2020-09-15 15:55:20",
"author": "元曦",
"url": "https://www.lmonkey.com/ask/22994"
},
{
"title": "【PHP】后台首页引入资源提示404",
"time": "2020-09-12 10:05:14",
"author": "山丘_rLiHVR",
"url": "https://www.lmonkey.com/ask/22993"
},
{
"title": "【PHP】composer的课程链接发下?谢谢!",
"time": "2020-09-10 21:50:39",
"author": "元曦",
"url": "https://www.lmonkey.com/ask/22992"
},
{
"title": "【PHP】phpstorm工具的激活码求一个?客服发的显示不全--邮箱422262731@qq.com",
"time": "2020-09-10 14:07:52",
"author": "元曦",
"url": "https://www.lmonkey.com/ask/22991"
},
{
"title": "【前端】老师,您好,怎么获取当前视频的课件呢,谢谢。",
"time": "2020-08-20 17:09:29",
"author": "把青春献给黑夜",
"url": "https://www.lmonkey.com/ask/22990"
},
{
"title": "【前端】PWD=ABC是什么含义呢 老师",
"time": "2020-08-06 22:28:54",
"author": "铭扬工作室? 客服",
"url": "https://www.lmonkey.com/ask/22989"
},
{
"title": "【前端】alert-heading继承色这个属性有问题呀老师,子元素本来就继承父元素颜色用不用这个不都是一样吗这个功能作用有什么用了看不出效果的",
"time": "2020-07-16 13:33:19",
"author": "笑看今朝",
"url": "https://www.lmonkey.com/ask/22988"
}
]
IV. Complete code
# -*- coding:utf-8 -*-
# Created by ZhaoWen on 2020/10/15
# 猿来如此 https://www.lmonkey.com/ask
import requests
import re
import json


def save_file(style, filename, data):
    """Write data to <filename>.<style>; style is 'txt' or 'json'."""
    filename = filename + '.' + style
    if style == 'txt':
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(data)
    if style == 'json':
        # data is a list of (title, time, author, url) tuples
        datalist = [{'title': i[0], 'time': i[1], 'author': i[2], 'url': i[3]} for i in data]
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(json.dumps(datalist, ensure_ascii=False))


class ylrc_spider():
    url = 'https://www.lmonkey.com/ask'
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.20 Safari/537.36'
    }

    def run(self, url=url, headers=headers):
        page_code = self.get_page(url, headers)
        queslist, timelist, authorlist, urllist = self.get_infor(page_code)
        return list(zip(queslist, timelist, authorlist, urllist))

    # Fetch the page source
    def get_page(self, url, headers):
        # headers must be passed by keyword: the second positional
        # parameter of requests.get is params, not headers
        rep = requests.get(url, headers=headers)
        rep.encoding = rep.apparent_encoding
        return rep.text

    # Extract question title, time asked, author and question URL
    def get_infor(self, page_code):
        reg = '(?<=<div class="topic_title mb-0 lh-180 ml-n2">)(.*)(?=<small)'
        queslist = re.findall(reg, page_code)
        reg = '(?<=<span data-toggle="tooltip" data-placement="top" title=").*(?=">)'
        timelist = re.findall(reg, page_code)
        reg = '(?<=<strong>).*(?=</strong>)'
        authorlist = re.findall(reg, page_code)
        reg = '(?<= <a href=")(.*)(?=" target="_blank">)'
        urllist = re.findall(reg, page_code)
        return queslist, timelist, authorlist, urllist


if __name__ == '__main__':
    y_s = ylrc_spider()
    save_file('json', 'ylrc', y_s.run())
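V. Troubleshooting
When Chinese text is written to the JSON file, it comes out as Unicode escape sequences (\uXXXX) instead of readable characters.
For example: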
{
"title": "\u3010Python\u3011\u8bf7\u95ee\u53bb\u54ea\u91cc\u4e0b\u8f7drequests\u7684\u5b89\u88c5\u5e93\uff1f",
"time": "2020-10-15 14:15:02",
"author": "\u541b_GV14Do",
"url": "https://www.lmonkey.com/ask/23002"
},
The fix is
json.dumps(datalist,ensure_ascii=False)
When calling dumps(), pass ensure_ascii=False so that non-ASCII characters are written as-is instead of being escaped.
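The difference is easy to see on a single record. In this small sketch the author name is taken from the sample output above, and the variable names are illustrative.

```python
import json

record = {"author": "君_GV14Do"}

# Default behaviour: every non-ASCII character is escaped as \uXXXX.
escaped = json.dumps(record)
# With ensure_ascii=False the characters are kept as they are.
readable = json.dumps(record, ensure_ascii=False)

print(escaped)   # {"author": "\u541b_GV14Do"}
print(readable)  # {"author": "君_GV14Do"}

# Both strings decode to the same Python object; only the text form differs.
print(json.loads(escaped) == json.loads(readable))  # True
```

Since the escaped form is still valid JSON, the choice only affects how readable the file is. When writing the unescaped form, open the file with encoding="utf-8", as the script does.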