Scraping the 猿来如此 (Q&A) module of 学习猿地 with Requests + regular expressions, and saving the results as JSON


I. Preparation

1. Data source

https://www.lmonkey.com/ask


2. Data fields

Question title, post time, author, question URL




II. Scraping the data

1. Preset the request header

'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.20 Safari/537.36'
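A quick sanity check that this header dict actually attaches to a request. This minimal sketch builds the request with requests.Request instead of sending it, so no network is needed; note that the header dict must be passed via the headers= keyword, since the second positional argument of requests.get() is params, not headers.

```python
import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/87.0.4280.20 Safari/537.36'
}

# Build (but do not send) the request, then inspect the prepared headers
req = requests.Request('GET', 'https://www.lmonkey.com/ask', headers=headers).prepare()
```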


2. Inspect the page source (excerpt)

        <div class="flex-fill ml-3">
            <a href="https://www.lmonkey.com/ask/22988" target="_blank">
                <div class="topic_title mb-0 lh-180 ml-n2">【前端】alert-heading继承色这个属性有问题呀老师,子元素本来就继承父元素颜色用不用这个不都是一样吗这个功能作用有什么用了看不出效果的<small class="float-right text-muted"  data-toggle="tooltip" data-placement="top" title="0个回复"><i class="fa fas fa-comments"></i>&nbsp;0</small></div>
            </a>
            <p class="lh-140 mb-0 topic_info">
                                &nbsp;&nbsp;
                <strong>笑看今朝</strong>
                &nbsp;&nbsp;
                <span data-toggle="tooltip" data-placement="top" title="2020-07-16 13:33:19">
                                                2个月前
                                            </span>
                &nbsp;&nbsp;
                <i class="fa fa-thumbs-up" data-toggle="tooltip" data-placement="top" title="获得0个赞"></i> 0
                    &nbsp;&nbsp;
                    <i class="fa fa-eye" data-toggle="tooltip" data-placement="top" title="查看数127"></i> 127
            </p>
        </div>

3. Define the regex rules

# Question title
reg = '(?<=<div class="topic_title mb-0 lh-180 ml-n2">)(.*)(?=<small)'

# Post time
reg = '(?<=<span data-toggle="tooltip" data-placement="top" title=").*(?=">)'

# Author
reg = '(?<=<strong>).*(?=</strong>)'

# Question URL
reg = '(?<= <a href=")(.*)(?=" target="_blank">)'
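These four patterns can be sanity-checked against the page-source fragment from section 2 before running the full spider. The sample string below is abridged from that fragment; the leading spaces matter, because the URL pattern's lookbehind includes one space before `<a href="`.

```python
import re

# Abridged fragment of the page source from section 2
sample = (
    '            <a href="https://www.lmonkey.com/ask/22988" target="_blank">\n'
    '                <div class="topic_title mb-0 lh-180 ml-n2">【前端】alert-heading继承色这个属性有问题<small class="float-right"></small></div>\n'
    '            </a>\n'
    '                <strong>笑看今朝</strong>\n'
    '                <span data-toggle="tooltip" data-placement="top" title="2020-07-16 13:33:19">2个月前</span>\n'
)

# re.findall effectively scans line by line here, because "." does not match newlines
titles  = re.findall('(?<=<div class="topic_title mb-0 lh-180 ml-n2">)(.*)(?=<small)', sample)
times   = re.findall('(?<=<span data-toggle="tooltip" data-placement="top" title=").*(?=">)', sample)
authors = re.findall('(?<=<strong>).*(?=</strong>)', sample)
urls    = re.findall('(?<= <a href=")(.*)(?=" target="_blank">)', sample)

# titles  -> ['【前端】alert-heading继承色这个属性有问题']
# times   -> ['2020-07-16 13:33:19']
# authors -> ['笑看今朝']
# urls    -> ['https://www.lmonkey.com/ask/22988']
```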




III. Saving the data

Save as a JSON file.

Zip the individual field lists together and pack them into one list:
page_code = self.get_page(url, headers)
queslist,timelist,authorlist,urllist = self.get_infor(page_code)
return list(zip(queslist,timelist,authorlist,urllist))

Call the json.dumps() method to convert the data into a JSON string:

datalist = [{'title': i[0], 'time': i[1], 'author': i[2], 'url': i[3]} for i in data ]
json.dumps(datalist,ensure_ascii=False)

Use a with...as statement to write the converted data into the JSON file:

with open(filename,'w',encoding="utf-8") as f:
	f.write(json.dumps(datalist,ensure_ascii=False))
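The dumps-then-write two-step can also be collapsed into a single call with json.dump, which serializes straight into the file object. A minimal equivalent sketch, using one record from the scraped output:

```python
import json

datalist = [{'title': '【Python】请问去哪里下载requests的安装库?',
             'time': '2020-10-15 14:15:02',
             'author': '君_GV14Do',
             'url': 'https://www.lmonkey.com/ask/23002'}]

with open('ylrc.json', 'w', encoding='utf-8') as f:
    # json.dump writes directly to the file object; indent=2 produces
    # the pretty-printed layout shown in the output sample
    json.dump(datalist, f, ensure_ascii=False, indent=2)
```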

The final data looks like this:

[
  {
    "title": "【Python】请问去哪里下载requests的安装库?",
    "time": "2020-10-15 14:15:02",
    "author": "君_GV14Do",
    "url": "https://www.lmonkey.com/ask/23002"
  },
  {
    "title": "【Python】请问去哪里下载requests的安装库?",
    "time": "2020-10-15 13:46:40",
    "author": "君_GV14Do",
    "url": "https://www.lmonkey.com/ask/23001"
  },
  {
    "title": "【PHP】Composer 镜像老是出错 也没查清原因",
    "time": "2020-10-15 10:44:33",
    "author": "Claramete",
    "url": "https://www.lmonkey.com/ask/23000"
  },
  {
    "title": "【Java】aa",
    "time": "2020-10-13 20:00:11",
    "author": "好的不学回电话",
    "url": "https://www.lmonkey.com/ask/22999"
  },
  {
    "title": "【PHP】源码",
    "time": "2020-10-13 11:17:02",
    "author": "栗子_DV50zH",
    "url": "https://www.lmonkey.com/ask/22998"
  },
  {
    "title": "【Java】as",
    "time": "2020-10-06 18:39:54",
    "author": "好的不学回电话",
    "url": "https://www.lmonkey.com/ask/22997"
  },
  {
    "title": "【PHP】处理公共页头和页脚信息 咋看不了啊",
    "time": "2020-09-28 17:36:37",
    "author": "N.",
    "url": "https://www.lmonkey.com/ask/22996"
  },
  {
    "title": "【PHP】为什么我输出的数据没你那么格式化呢?",
    "time": "2020-09-15 15:55:47",
    "author": "元曦",
    "url": "https://www.lmonkey.com/ask/22995"
  },
  {
    "title": "【PHP】为什么我输出的数据没你那么格式化呢?",
    "time": "2020-09-15 15:55:20",
    "author": "元曦",
    "url": "https://www.lmonkey.com/ask/22994"
  },
  {
    "title": "【PHP】后台首页引入资源提示404",
    "time": "2020-09-12 10:05:14",
    "author": "山丘_rLiHVR",
    "url": "https://www.lmonkey.com/ask/22993"
  },
  {
    "title": "【PHP】composer的课程链接发下?谢谢!",
    "time": "2020-09-10 21:50:39",
    "author": "元曦",
    "url": "https://www.lmonkey.com/ask/22992"
  },
  {
    "title": "【PHP】phpstorm工具的激活码求一个?客服发的显示不全--邮箱422262731@qq.com",
    "time": "2020-09-10 14:07:52",
    "author": "元曦",
    "url": "https://www.lmonkey.com/ask/22991"
  },
  {
    "title": "【前端】老师,您好,怎么获取当前视频的课件呢,谢谢。",
    "time": "2020-08-20 17:09:29",
    "author": "把青春献给黑夜",
    "url": "https://www.lmonkey.com/ask/22990"
  },
  {
    "title": "【前端】PWD=ABC是什么含义呢  老师",
    "time": "2020-08-06 22:28:54",
    "author": "铭扬工作室? 客服",
    "url": "https://www.lmonkey.com/ask/22989"
  },
  {
    "title": "【前端】alert-heading继承色这个属性有问题呀老师,子元素本来就继承父元素颜色用不用这个不都是一样吗这个功能作用有什么用了看不出效果的",
    "time": "2020-07-16 13:33:19",
    "author": "笑看今朝",
    "url": "https://www.lmonkey.com/ask/22988"
  }
]




IV. Full code

# -*- coding:utf-8 -*-
# Created by ZhaoWen on 2020/10/15
# 猿来如此  https://www.lmonkey.com/ask

import requests
import re
import json

def save_file(style,filename,data):
    filename = filename + '.' + style

    if style == 'txt':
        with open(filename,'w',encoding='utf-8') as f:
            f.write(data)

    if style == 'json':
        datalist = [{'title': i[0], 'time': i[1], 'author': i[2], 'url': i[3]} for i in data ]
        with open(filename,'w',encoding="utf-8") as f:
            f.write(json.dumps(datalist,ensure_ascii=False))


class ylrc_spider():
    url = 'https://www.lmonkey.com/ask'
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.20 Safari/537.36'
    }

    def run(self,url=url,headers=headers):
        page_code = self.get_page(url, headers)
        queslist,timelist,authorlist,urllist = self.get_infor(page_code)
        return list(zip(queslist,timelist,authorlist,urllist))


    # Fetch the page source
    def get_page(self,url,headers):
        rep = requests.get(url, headers=headers)
        rep.encoding = rep.apparent_encoding
        return rep.text

    # Question title, post time, author, question URL
    def get_infor(self,page_code):
        reg = '(?<=<div class="topic_title mb-0 lh-180 ml-n2">)(.*)(?=<small)'
        queslist = re.findall(reg,page_code)

        reg = '(?<=<span data-toggle="tooltip" data-placement="top" title=").*(?=">)'
        timelist = re.findall(reg,page_code)

        reg = '(?<=<strong>).*(?=</strong>)'
        authorlist = re.findall(reg,page_code)

        reg = '(?<= <a href=")(.*)(?=" target="_blank">)'
        urllist = re.findall(reg,page_code)

        return queslist,timelist,authorlist,urllist


if __name__ == '__main__':
    y_s = ylrc_spider()
    save_file('json','ylrc',y_s.run())




V. Troubleshooting

When Chinese text is written to the JSON file, it comes out as \uXXXX Unicode escape sequences:

  {
    "title": "\u3010Python\u3011\u8bf7\u95ee\u53bb\u54ea\u91cc\u4e0b\u8f7drequests\u7684\u5b89\u88c5\u5e93\uff1f",
    "time": "2020-10-15 14:15:02",
    "author": "\u541b_GV14Do",
    "url": "https://www.lmonkey.com/ask/23002"
  },

The fix:

json.dumps(datalist,ensure_ascii=False)

That is, pass False for the ensure_ascii parameter when calling dumps().
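The difference is easy to see in a minimal demo:

```python
import json

record = {'author': '君_GV14Do'}

# Default: non-ASCII characters are escaped to \uXXXX sequences
escaped = json.dumps(record)                       # '{"author": "\u541b_GV14Do"}'

# With ensure_ascii=False the Chinese text is written as-is
readable = json.dumps(record, ensure_ascii=False)  # '{"author": "君_GV14Do"}'
```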





VI. References

Python crawler tutorial - regex in practice - 猿来如此

Reading and writing txt text files in Python: https://www.cnblogs.com/hackpig/p/8215786.html

Reading and writing Chinese JSON in Python, explained with examples: https://www.jb51.net/article/127030.htm


