-
Crawling
import requests  # import the requests module
1. Sending a request
import requests
r=requests.get('http://www.dianping.com/')
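2. Customizing headers
Use this when the returned text contains phrases such as "抱歉" ("Sorry") or "无法访问" ("Access denied"). In that case the request should imitate an ordinary browser visit, which is done here by sending a browser User-Agent header.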
import requests
headers={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36"}
r=requests.get("http://www.dianping.com/",headers=headers)
print(r.text)
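One way to decide whether the custom header is needed is to check the returned text for the phrases "抱歉" and "无法访问". A minimal sketch, assuming a simple substring check is enough; `looks_blocked` and `BLOCK_PHRASES` are hypothetical names introduced only for illustration, and the sample strings are made up:

```python
# Hypothetical phrase list, taken from the two examples mentioned in these notes.
BLOCK_PHRASES = ("抱歉", "无法访问")

def looks_blocked(text):
    """Return True if the page text contains any of the block phrases."""
    return any(phrase in text for phrase in BLOCK_PHRASES)

# Made-up page texts standing in for r.text
blocked_page = "<html>抱歉,页面无法访问</html>"
normal_page = "<html>Welcome</html>"

print(looks_blocked(blocked_page))  # True
print(looks_blocked(normal_page))   # False
```

In practice the check would be applied to r.text, and the request retried with the headers dictionary if it returns True.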
3. Customizing URL parameters
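The notes leave this item empty. A minimal sketch, assuming the goal is to let requests build the query string instead of concatenating it by hand. `params` is a real keyword argument of requests.get; the Douban URL and the page parameter `p` are borrowed from the regular-expression example in these notes purely as an illustration. The request is only prepared, not sent, so the snippet runs without network access:

```python
import requests

# Illustrative URL and parameter, borrowed from the Douban example in these notes.
url = "https://book.douban.com/subject/1084336/comments/hot"
params = {"p": 2}  # page number

# Build the request without sending it, to inspect the final URL.
prepared = requests.Request("GET", url, params=params).prepare()
print(prepared.url)

# To actually send it: r = requests.get(url, params=params)
```

Passing a dictionary also takes care of URL-encoding values that contain spaces or non-ASCII characters.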
-
Parsing with BeautifulSoup
from bs4 import BeautifulSoup  # import the BeautifulSoup module
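The notes stop at the import. A minimal sketch of what parsing could look like, assuming the target is the same `<span class="short">` comment markup used in the regular-expression section; the HTML fragment below is made up, and `html.parser` is Python's built-in parser:

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment imitating the comment markup targeted elsewhere in these notes.
html = ('<div><span class="short">First comment</span>'
        '<span class="short">Second comment</span></div>')

soup = BeautifulSoup(html, "html.parser")

# find_all returns every matching tag; class_ filters on the CSS class.
comments = [span.get_text() for span in soup.find_all("span", class_="short")]
print(comments)
```

With a real page, r.text from requests would take the place of the `html` string.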
-
Parsing with regular expressions
import re  # import the re module
1. Use while crawling web pages
e.g.:
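import requests
import re
r = []  # responses, one per page
p = []  # lists of matched comments, one per page
# Non-greedy capture of the text inside each short-comment span
pattern = re.compile('<span class="short">(.*?)</span>')
for i in range(5):  # crawl the first 5 pages of short comments on "The Little Prince"
    r.append(requests.get('https://book.douban.com/subject/1084336/comments/hot?p=' + str(i + 1)))
    p.append(re.findall(pattern, r[i].text))
count = 1  # running comment number across all pages
for item in p:
    for item_content in item:
        print(str(count) + item_content)
        count += 1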
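The pattern in this example relies on the non-greedy quantifier `(.*?)`. A small offline sketch of why that matters, using a made-up HTML string in place of a downloaded page; the variable names are illustrative only:

```python
import re

# Made-up HTML standing in for a downloaded page.
html = '<span class="short">Good book</span><span class="short">A classic</span>'

lazy = re.compile('<span class="short">(.*?)</span>')   # stops at the first </span>
greedy = re.compile('<span class="short">(.*)</span>')  # runs on to the last </span>

lazy_matches = lazy.findall(html)
greedy_matches = greedy.findall(html)

print(lazy_matches)    # one entry per comment
print(greedy_matches)  # a single entry swallowing both spans
```

Note that `.` does not match a newline by default, so comments spanning several lines would need the re.S flag.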
2. Replacing some of the fields in a file
e.g.:
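import re
with open('taglines.list', encoding='utf-8') as fp:  # taglines.list is the file to match against
    data = fp.read()
# An original field looks like: # "2091" (2016); the literal '(' and ')' must be escaped
pattern = re.compile(r'# "(.*?)" \((.*?)\)')  # raw string so the backslashes reach the regex engine
p = re.findall(pattern, data)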
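The snippet above only extracts the matched fields. A minimal sketch of the replacement step itself, assuming the aim is to rewrite each `# "title" (year)` header in a different layout; the target layout `title [year]` and the sample text are made up for illustration:

```python
import re

# Made-up sample in the format described in these notes: # "title" (year)
data = '# "2091" (2016)\n\tSome tagline\n# "Example" (1999)\n\tAnother tagline\n'

pattern = re.compile(r'# "(.*?)" \((.*?)\)')

# \1 and \2 refer to the two captured groups (title and year).
replaced = pattern.sub(r'\1 [\2]', data)
print(replaced)
```

To apply this to the real file, `data` would come from fp.read() as above and `replaced` would be written back with a second open() call in write mode.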
-
Word cloud display