Web Crawling and Regex Matching
How it works
Use requests to fetch a site's HTML, then use re regular expressions to match and process the text.
Code
# -*- coding: utf-8 -*-
# The line above tells the interpreter which source encoding to use, so the
# file is read correctly even when it contains Chinese characters.
import re
import requests
response = requests.get('https://www.quora.com/Is-online-education-overrated') # page to crawl
f = open("words.txt", "a") # create/open the file in append mode
data = response.text # page source as text; the encoding can be changed if needed
title = ' '.join(re.findall('<title>(.*?)</title>', data)) # page title
result_list = re.findall('"text": "(.*?)."', data) + re.findall(r'''\\\\\\"text\\\\\\": \\\\\\"(.*?).\\\\\\",''', data)
# These regexes are fairly involved; they extract the content of "text"
# elements. Which tags to match depends on the HTML of the target page.
f.write('\n') # start on a fresh line
print(title) # print the title
for result in result_list: # post-process each match
    result = result.replace(r'\u2019', "'") # manually unescape the apostrophe
    result = result.replace('\\\\\\', "") # strip doubled backslashes
    result = result.replace('/', "") # strip forward slashes
    result = result.replace(r'\n', "") # strip literal newline escapes
    result = result.replace('\\', "") # strip remaining backslashes
    check = result.split() # split into a list of whitespace-separated tokens
    # Filter into a new list rather than calling remove() while iterating,
    # which silently skips elements.
    check = [ele for ele in check
             if '"modifiers": {"image": ' not in ele and len(ele) < 13]
    result = ' '.join(check) # join back into a single string
    f.write(result + ", " + title + "\n") # append to the file
    print(result) # show what was written
f.close() # always close the file when done
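The chain of replace() calls above can be condensed into a couple of regex passes. A minimal sketch (the clean helper and the sample string are illustrative, not part of the original program):

```python
import re

def clean(text):
    """Condense the escape-stripping replace() chain into regex passes."""
    text = text.replace(r'\u2019', "'")  # unescape the curly apostrophe
    text = re.sub(r'\\n', '', text)      # drop literal "\n" escape sequences
    text = re.sub(r'[\\/]', '', text)    # drop remaining backslashes and slashes
    return text

print(clean(r'It\u2019s \\fine\n here/'))  # → It's fine here
```

Doing the backslash/slash removal in one character-class pass also avoids the ordering pitfalls of stripping '\\' before '\n'.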
Extensions
To demonstrate regex matching and crawling, the code includes quite a bit of incidental processing, and the matching is not perfect. Still, most of the extracted results are fairly accurate, for example:
Culture of Qualit
"On quality Terry Anderson emphasized that ""learning- knowledge- assessment- and educational experiences will result in high levels of learning by all He also believes that the ""integration of the new tools and affordances of the educational Semantic Web and emerging social software solutions will further enhance and make more accessible and affordable quality online learning experiences"
Since I have titled this observation as
ODeL Xperitu
[from Latin experitu = experienced tested proven] let me say that learning must progress to maturity; to function well as social innovators promoting excellence through Capacity Building and Development. Yes this is the Quality Assurance (QA) principle that defines and determ
"Michael Moore even says that this is a fact of distance education wherein ""teaching is hardly ever an individual act but a process joining together the expertise of a number of specialists."
However, some strange characters still slip through the filters:
#NAME?
For any Query or Enquiry Please Call u2013 Hiren Harwani - 9712186969 (you can join us in Whatsapp also
See if you can improve this program yourself!
Also, when crawling Chinese pages, remember to set the encoding to utf-8.
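A minimal sketch of that encoding fix; the URL is a placeholder for whatever Chinese page you actually crawl:

```python
import requests

# Placeholder URL: swap in the Chinese page you want to crawl.
response = requests.get('https://example.com/')
# requests guesses the encoding from the HTTP headers and may fall back to
# ISO-8859-1, which turns Chinese text into mojibake. Override the guess
# before reading response.text:
response.encoding = 'utf-8'
data = response.text  # decoded as utf-8 from here on
```

Setting response.encoding must happen before the first access to response.text, since the text is decoded lazily with whatever encoding is set at that moment.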
There are also many other third-party libraries out there, such as Beautiful Soup 4, that are well worth exploring.
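For instance, the title extraction above could be done with Beautiful Soup 4 (pip install beautifulsoup4) instead of a regex. A small sketch; the sample HTML here is illustrative, not Quora's actual markup:

```python
from bs4 import BeautifulSoup

# Illustrative HTML standing in for a fetched page.
html = ('<html><head><title>Is online education overrated?</title></head>'
        '<body><p class="text">An answer.</p></body></html>')
soup = BeautifulSoup(html, 'html.parser')
print(soup.title.string)                     # → Is online education overrated?
for p in soup.find_all('p', class_='text'):  # structured lookup, no regex needed
    print(p.get_text())                      # → An answer.
```

Because Beautiful Soup parses the HTML into a tree, lookups by tag and class survive markup changes that would break a hand-written regex.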