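Getting the request information for the page:
I'll use a joke website as an example:
In Chrome, press F12 to open the developer tools and refresh the page. In the Network tab, find the file you want to request and note its request URL, request method, headers, and so on:
Here is a small helper I wrote that converts the raw headers into dictionary-entry format; feel free to copy it if it is useful: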
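    def get_headers(header_raw):
        # Turn "Name: value" lines copied from DevTools into 'Name': 'value', lines
        lines = []
        for line in header_raw.strip().splitlines():
            # Split on the first colon only, so values such as URLs stay intact
            name, _, value = line.partition(':')
            lines.append("'%s': '%s'," % (name.strip(), value.strip()))
        print("\n".join(lines))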
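If you would rather get a dict back than paste printed text into your script, a variant along these lines may help. This is only a sketch: the name `parse_headers` and the sample header text are made up for illustration, and it assumes each header sits on its own line in `Name: value` form.

```python
def parse_headers(header_raw):
    """Parse 'Name: value' lines copied from DevTools into a dict."""
    headers = {}
    for line in header_raw.strip().splitlines():
        # partition() splits on the first colon only, so URLs in values survive
        name, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that are not in 'Name: value' form
        headers[name.strip()] = value.strip()
    return headers


# Sample text for illustration only
raw = """
Host: xiaohua.zol.com.cn
Referer: http://xiaohua.zol.com.cn/
Upgrade-Insecure-Requests: 1
"""
headers = parse_headers(raw)
print(headers)
```

The resulting dict can be passed straight to `requests.get(url, headers=headers)`.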
Using the requests module in Python:
    #!/usr/bin/python
    # -*- coding: UTF-8 -*-
    import requests
    import re


    class Attain_data():
        def attain_data(self):
            self.file_name = 'dump_txt' + '.txt'
            self.url = 'http://xiaohua.zol.com.cn/baoxiao/'
            # Headers copied from the Network tab in Chrome DevTools
            self.headers = {
                "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
                "Accept-Encoding": "gzip, deflate",
                "Accept-Language": "zh-CN,zh;q=0.9,zh-TW;q=0.8,en;q=0.7",
                "Cache-Control": "no-cache",
                "Connection": "keep-alive",
                "Cookie": "gr_user_id=22558fa6-1747-417a-b221-9a3df417655b; ip_ck=1Y+t0YS9v8EuMTIxNDg3LjE1MzkxNDg3MTk%3D; z_pro_city=s_provice%3Dbeijing%26s_city%3Dbeijing; userProvinceId=1; userCityId=478; userCountyId=0; userLocationId=1; lv=1565232767; vn=2; Hm_lvt_ae5edc2bc4fc71370807f6187f0a2dd0=1565232769; _ga=GA1.3.1107658235.1565232769; _gid=GA1.3.1557263593.1565232769; bdshare_firstime=1565232769504; z_day=ixgo20%3D1; questionnaire_pv=1565222403; Hm_lpvt_ae5edc2bc4fc71370807f6187f0a2dd0=1565232835",
                "Host": "xiaohua.zol.com.cn",
                "Pragma": "no-cache",
                "Referer": "http://xiaohua.zol.com.cn/",
                "Upgrade-Insecure-Requests": "1",
                "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36"
            }
            # Send the request together with the browser headers
            self.res = requests.get(url=self.url, headers=self.headers)
            # Guess the encoding from the body in case the server does not declare one;
            # otherwise the Chinese text may come back garbled
            self.res.encoding = self.res.apparent_encoding
            # print(self.res.text)
            # Match every run of Chinese characters in the page
            self.pat = re.compile(r'[\u4e00-\u9fa5]+')
            self.result = self.pat.findall(self.res.text)
            print(self.result)
            # Save the matches to the dump file, one per line
            with open(self.file_name, "w", encoding="utf-8") as fout:
                fout.write("\n".join(self.result))


    if __name__ == '__main__':
        a = Attain_data()
        a.attain_data()

If you want to filter the scraped content, it is worth studying regular expressions.
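As a rough illustration of what that filtering could look like, the sketch below runs a few patterns over a small HTML string. The HTML snippet is invented for this example and is not the real markup of the joke site, so the tag pattern would need adjusting against the actual page source.

```python
import re

# Made-up HTML for illustration; the real page structure will differ
sample_html = (
    '<div class="joke"><a href="/detail/1.html">程序员的笑话</a>'
    '<p>今天又加班了,真开心!</p></div>'
)

# 1. Every run of Chinese characters, as in the script above
chinese_run = re.compile(r'[\u4e00-\u9fa5]+')
runs = chinese_run.findall(sample_html)

# 2. Keep only runs of at least 4 characters to drop short fragments
long_runs = [r for r in runs if len(r) >= 4]

# 3. Capture groups pull out the link target and the link text together
link_pat = re.compile(r'<a href="([^"]+)">([^<]+)</a>')
links = link_pat.findall(sample_html)

print(runs)
print(long_runs)
print(links)
```

Capture groups (the parenthesised parts of a pattern) are usually the handiest tool here, because `findall` then returns just the pieces you asked for instead of the whole match.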