Python Web Scraping in Practice: Automatically Scraping Text from Quora and Filtering It with Regular Expressions

Web scraping and regular-expression matching

How it works

Use requests to fetch the page's HTML, then use regular expressions from the re module to match and clean up the text.
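
As a minimal sketch of the matching step, the snippet below runs `re.findall` on a hard-coded HTML string instead of a live response, so it works offline. The sample HTML and the helper name `extract_title` are made up for illustration; in the real script the string would come from `requests.get(url).text`.

```python
import re

# Hypothetical stand-in for the HTML that requests would return.
SAMPLE_HTML = (
    "<html><head><title>Is online education overrated? - Quora</title>"
    "</head><body></body></html>"
)


def extract_title(html):
    """Return the text of every <title> element, joined by spaces."""
    # (.*?) is a non-greedy group; re.S lets '.' also match newlines.
    return " ".join(re.findall(r"<title>(.*?)</title>", html, flags=re.S))


print(extract_title(SAMPLE_HTML))
```

The non-greedy `(.*?)` stops at the first closing tag, which is what keeps the match from running on to the end of the page.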

Code

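# -*- coding: utf-8 -*-
# The line above declares the source file's encoding to the interpreter, so non-ASCII text in the file is safe.
import re
import requests

response = requests.get('https://www.quora.com/Is-online-education-overrated')  # page to scrape
data = response.text  # page source as text; set response.encoding first to use a different codec
title = ' '.join(re.findall('<title>(.*?)</title>', data))  # page title
# Find the contents of the "text" fields, both in plain JSON and in JSON escaped inside a string.
# The patterns depend on the page's markup, so other sites need other patterns.
# Note: the '.' before the closing quote consumes the last character of every match.
result_list = re.findall('"text": "(.*?)."', data) + re.findall(r'''\\\\\\"text\\\\\\": \\\\\\"(.*?).\\\\\\",''', data)
print(title)  # show the title

# "a" opens the file for appending and creates it if it does not exist
with open("words.txt", "a", encoding="utf-8") as f:
    f.write('\n')  # blank line to separate this run from earlier ones
    for result in result_list:  # post-process each match
        result = result.replace(r'\u2019', "'")  # turn the literal \u2019 escape into an apostrophe
        result = result.replace('\\\\\\', "")  # remove runs of three backslashes
        result = result.replace('/', "")  # remove forward slashes
        result = result.replace(r'\n', "")  # remove literal \n escape sequences
        result = result.replace('\\', "")  # remove any remaining backslashes
        # Split into whitespace-separated tokens (words, not whole answers) and drop unwanted ones.
        # A list comprehension is used because removing items from a list while looping over it skips elements.
        check = [ele for ele in result.split()
                 if '"modifiers": {"image": ' not in ele and len(ele) < 13]
        result = ' '.join(check)  # back to a single string
        f.write(result + ", " + title + "\n")  # one line per match, followed by the title
        print(result)  # show what was written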
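
When filtering tokens out of a list, it matters how the removal is done: calling `remove` on a list while a `for` loop is iterating over it shifts the remaining elements left, so the element right after each removed one is never checked. The sketch below contrasts that with a list comprehension. The sample tokens and both function names are invented for illustration.

```python
def filter_in_place(tokens, max_len=13):
    """Buggy variant: removes items from the list being iterated."""
    tokens = list(tokens)  # work on a copy so the caller's list is untouched
    for tok in tokens:
        if len(tok) >= max_len:
            tokens.remove(tok)  # shifts later items left; the next one is skipped
    return tokens


def filter_with_comprehension(tokens, max_len=13):
    """Safe variant: build a new list containing only the short tokens."""
    return [tok for tok in tokens if len(tok) < max_len]


sample = ["ok", "x" * 13, "y" * 14, "fine"]
print(filter_in_place(sample))            # the second long token survives
print(filter_with_comprehension(sample))  # both long tokens are dropped
```

With two long tokens in a row, the in-place version lets the second one through, which is one plausible source of leftover junk in the output.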

Possible extensions

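To demonstrate both regex matching and scraping, the script includes a fair amount of non-essential code, and the matching is not perfect. Most of the results are reasonably accurate, for example: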

Culture of Qualit
"On quality Terry Anderson emphasized that ""learning- knowledge- assessment- and educational experiences will result in high levels of learning by all He also believes that the ""integration of the new tools and affordances of the educational Semantic Web and emerging social software solutions will further enhance and make more accessible and affordable quality online learning experiences"
Since I have titled this observation as
ODeL Xperitu
[from Latin experitu = experienced tested proven] let me say that learning must progress to maturity; to function well as social innovators promoting excellence through Capacity Building and Development. Yes this is the Quality Assurance (QA) principle that defines and determ
"Michael Moore even says that this is a fact of distance education wherein ""teaching is hardly ever an individual act but a process joining together the expertise of a number of specialists."

However, some odd characters still slip through:

#NAME?
#NAME?
#NAME?
#NAME?
#NAME?
For any Query or Enquiry Please Call u2013 Hiren Harwani - 9712186969 (you can join us in Whatsapp also
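
The stray `u2013` in the last line looks like what is left of a `\u2013` escape (an en dash) once its backslash has been stripped; the script only translates `\u2019` by hand. One possible improvement, sketched below under the assumption that the captured text is the body of a JSON string literal, is to let the standard `json` module decode all escapes at once. The helper name `decode_json_text` and the sample input are hypothetical.

```python
import json


def decode_json_text(raw):
    """Decode JSON escapes such as \\u2013 and \\n in a captured string body."""
    try:
        # Wrapping the body in quotes turns it back into a JSON string literal.
        return json.loads('"' + raw + '"')
    except json.JSONDecodeError:
        # Not valid JSON (for example an unescaped quote): keep the raw text.
        return raw


captured = r"For any query \u2013 call us\nsecond line"
print(decode_json_text(captured))
```

This would have to run before any backslashes are removed, since the decoder needs them to recognise the escapes.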

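See whether you can improve the program yourself!
Also, when scraping Chinese-language pages, remember to set the encoding to utf-8.
There are many other third-party libraries out there, such as Beautiful Soup 4, that are well worth exploring.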
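
On the encoding point: garbled characters appear when UTF-8 bytes are decoded with the wrong codec. With requests, the usual remedy is to set `response.encoding = 'utf-8'` before reading `response.text`. The offline sketch below shows the underlying effect on a hand-made byte string; the sample text and variable names are illustrative only.

```python
# Hypothetical page body: Chinese text encoded as UTF-8 bytes.
raw_bytes = "在线教育".encode("utf-8")

# Decoding with the wrong codec yields garbled text instead of raising an error.
wrong = raw_bytes.decode("latin-1")

# Decoding with the codec the page was actually written in recovers the text.
right = raw_bytes.decode("utf-8")

print(repr(wrong))
print(right)
```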

Author: EricFrenzy
