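Simulating a Browser with a Python Crawler
Goal: scrape the names of all publishers listed on https://read.douban.com/provider/all. The page looks like this:
Page to scrape: ![Screenshot of the page to scrape](https://i-blog.csdnimg.cn/blog_migrate/3e694435fba08a51ece9bb603122775f.png)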
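Inspecting the page source
In the page source, each publisher name sits inside a `<div class="name">` element, which is what the regular expression in the implementation matches.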
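As a quick offline check of what the crawler will be matching, the sketch below runs the regular expression from the implementation on a small hand-written HTML fragment. The fragment and the publisher names in it are made up for illustration; only the `<div class="name">` structure is taken from the pattern used in this post.

```python
import re

# Hypothetical fragment imitating the structure of the page source;
# the real page contains many more attributes and elements.
sample_html = '''
<ul>
  <li><div class="name">Publisher A</div><div class="works">12 works</div></li>
  <li><div class="name">Publisher B</div><div class="works">3 works</div></li>
</ul>
'''

# (.*?) is non-greedy, so each match stops at the first closing </div>
pattern = re.compile(r'<div class="name">(.*?)</div>')
names = pattern.findall(sample_html)
print(names)
```

Because the pattern has a single capture group, `findall` returns only the captured text, not the surrounding tags.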
Implementation
Method 1: pass the headers to a Request object
import re
import urllib.request
from urllib.request import urlopen, Request
# Regular expression for extraction; findall returns only the part inside ()
pattern = '<div class="name">(.*?)</div>'
# Browser-like User-Agent so the request looks like it comes from Chrome
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'}
req = Request("https://read.douban.com/provider/all", headers=headers)
data = urlopen(req).read().decode('utf-8')
result = re.compile(pattern).findall(data)  # find every match in the page
# Write one name per line; the with block closes the file when done
with open("C:/Users/Echo/Desktop/result.txt", "w", encoding="utf-8") as fh:
    for name in result:
        fh.write(name + "\n")
print(result)
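Method 1 can also be wrapped in small functions so that the header is easy to check before anything is sent. This is only a sketch: `build_request` and `fetch_names` are hypothetical helper names, and the 10-second timeout is an arbitrary choice, not something the original code specifies.

```python
import re
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

URL = "https://read.douban.com/provider/all"
USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/83.0.4103.97 Safari/537.36")


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_names(url=URL, timeout=10):
    """Download the page and return the names, or [] on an HTTP or URL error."""
    try:
        with urlopen(build_request(url), timeout=timeout) as resp:
            html = resp.read().decode("utf-8")
    except (HTTPError, URLError) as exc:
        print("request failed:", exc)
        return []
    return re.findall(r'<div class="name">(.*?)</div>', html)


# No network access needed: inspect the header that would be sent.
# Request stores header names capitalized, hence "User-agent".
req = build_request(URL)
print(req.get_header("User-agent"))
```

Calling `fetch_names()` performs the actual download; it is left uncalled here so the snippet runs without a connection.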
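Method 2: install a global opener. Once it is installed, the rest of the crawl is the same as an ordinary one, using plain urlopen calls: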
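import urllib.request
# addheaders expects a list of (name, value) tuples, so the header is a tuple, not a set or dict
headers = ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)"
           " Chrome/83.0.4103.116 Safari/537.36")
opener = urllib.request.build_opener()
opener.addheaders = [headers]
# Every later urllib.request.urlopen() call goes through this opener
urllib.request.install_opener(opener)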
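With the opener installed, a plain `urlopen` call needs no Request object. The sketch below shows that flow end to end; `install_browser_opener` and `crawl` are hypothetical helper names, and `crawl` is defined but not called here because it needs network access.

```python
import re
import urllib.request

USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/83.0.4103.116 Safari/537.36")


def install_browser_opener(user_agent=USER_AGENT):
    """Build an opener with a browser-like User-Agent and install it globally."""
    opener = urllib.request.build_opener()
    # addheaders is a list of (name, value) tuples; assigning it replaces
    # the default "Python-urllib/x.y" User-Agent
    opener.addheaders = [("User-Agent", user_agent)]
    urllib.request.install_opener(opener)
    return opener


def crawl(url="https://read.douban.com/provider/all"):
    """Fetch the page through the installed opener and extract the names."""
    data = urllib.request.urlopen(url).read().decode("utf-8")
    return re.findall(r'<div class="name">(.*?)</div>', data)


opener = install_browser_opener()
print(opener.addheaders)
```

Because `install_opener` changes process-wide state, Method 1 may be the safer choice when only some requests should carry the browser header.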