Scraping Page Data with Python

Latest recommended article published on 2024-05-14 06:48:47

王大毛__


Article link: https://blog.csdn.net/qq_31114133/article/details/98873121


Getting the request information for the page:

I'll use a joke website as an example:

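In Chrome, press F12 to open Developer Tools, refresh the page, and in the Network tab find the file you want to request. Note its request URL, request method, headers, and so on.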
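The details read off the Network tab (request URL, method, headers) map directly onto a `requests` request object. Here is a minimal sketch, assuming the joke-site URL from this post and a trimmed, illustrative header set. Nothing is sent over the network; the request is only built and inspected:

```python
import requests

# URL from this post; the headers are a shortened, illustrative subset.
url = 'http://xiaohua.zol.com.cn/baoxiao/'
headers = {
    'Referer': 'http://xiaohua.zol.com.cn/',
    'User-Agent': 'Mozilla/5.0',
}

# Build the request without sending it, so each piece can be inspected.
prepared = requests.Request('GET', url, headers=headers).prepare()

print(prepared.method)   # request method
print(prepared.url)      # request URL
for name, value in prepared.headers.items():
    print(name, '=', value)
```

Comparing this output with what the browser shows is a quick way to check that the script asks for the same thing the browser does.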

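Here is a small helper I wrote that converts the raw headers copied from the browser into the quoted key/value form used in a Python dictionary. Feel free to copy it if it is useful: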

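def get_headers(header_raw):
    # Convert raw "Name: value" lines into quoted dictionary entries.
    entries = []
    for line in header_raw.strip().splitlines():
        # Split on the first colon only, so values such as URLs stay intact.
        name, _, value = line.partition(':')
        entries.append("'%s': '%s'," % (name.strip(), value.strip()))
    print('\n'.join(entries))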
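As an alternative to printing text for pasting, the raw headers can be parsed straight into a dictionary and handed to `requests`. The sketch below is only an illustration: `parse_headers` is a hypothetical helper name, and the sample headers are a shortened subset of the ones used in this post.

```python
def parse_headers(header_raw):
    """Turn raw 'Name: value' lines into a dict (hypothetical helper)."""
    headers = {}
    for line in header_raw.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first colon only, so values containing ':' (URLs) stay intact.
        name, sep, value = line.partition(':')
        if not sep or not name:
            continue  # skip lines that are not in 'Name: value' form
        headers[name.strip()] = value.strip()
    return headers


raw = """
Host: xiaohua.zol.com.cn
Referer: http://xiaohua.zol.com.cn/
Upgrade-Insecure-Requests: 1
"""
headers = parse_headers(raw)
print(headers)
```

The returned dictionary can be passed as the `headers` argument of `requests.get`.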

Using the requests module in Python:

#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

import requests
import re


class Attain_data():

    def attain_data(self):

        self.file_name = 'dump_txt' + '.txt'
        self.fout = open(self.file_name, "w", encoding="utf-8")

        self.url = 'http://xiaohua.zol.com.cn/baoxiao/'
        self.headers = {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "zh-CN,zh;q=0.9,zh-TW;q=0.8,en;q=0.7",
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "Cookie": "gr_user_id=22558fa6-1747-417a-b221-9a3df417655b; ip_ck=1Y+t0YS9v8EuMTIxNDg3LjE1MzkxNDg3MTk%3D; z_pro_city=s_provice%3Dbeijing%26s_city%3Dbeijing; userProvinceId=1; userCityId=478; userCountyId=0; userLocationId=1; lv=1565232767; vn=2; Hm_lvt_ae5edc2bc4fc71370807f6187f0a2dd0=1565232769; _ga=GA1.3.1107658235.1565232769; _gid=GA1.3.1557263593.1565232769; bdshare_firstime=1565232769504; z_day=ixgo20%3D1; questionnaire_pv=1565222403; Hm_lpvt_ae5edc2bc4fc71370807f6187f0a2dd0=1565232835",
            "Host": "xiaohua.zol.com.cn",
            "Pragma": "no-cache",
            "Referer": "http://xiaohua.zol.com.cn/",
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36"
        }
        # Send the headers with the request (the original built them but never used them).
        self.res = requests.get(url=self.url, headers=self.headers, timeout=10)

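        # print(self.res.text)
        # Extract every run of Chinese characters from the page.
        self.pat = re.compile(r'[\u4e00-\u9fa5]+')
        self.result = self.pat.findall(self.res.text)
        print(self.result)
        # Save the matches to the dump file, one per line, then close it.
        self.fout.write('\n'.join(self.result))
        self.fout.close()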



if __name__ == '__main__':

    a = Attain_data()
    a.attain_data()

If you want to filter the scraped content, it is worth studying regular expressions.
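As a rough illustration of filtering with regular expressions, the sketch below works on a made-up HTML fragment. The markup and the `joke` class name are assumptions for the example, not the real structure of the joke site. It shows two approaches: capturing groups that pull out only the wanted fields, and a length-limited variant of the Chinese-character pattern used above.

```python
import re

# Illustrative HTML fragment; markup and class names are assumed, not taken from the real page.
html = '''
<ul>
  <li class="joke"><a href="/detail/1.html">第一个笑话标题</a></li>
  <li class="joke"><a href="/detail/2.html">第二个笑话标题</a></li>
  <li class="nav"><a href="/">首页</a></li>
</ul>
'''

# Capture the link target and link text, but only inside <li class="joke"> items.
pattern = re.compile(r'<li class="joke"><a href="([^"]+)">([^<]+)</a></li>')
for href, title in pattern.findall(html):
    print(href, title)

# Keep only runs of Chinese characters that are at least four characters long.
long_runs = re.findall(r'[\u4e00-\u9fa5]{4,}', html)
print(long_runs)
```

Regular expressions like these break easily when the page layout changes, so the patterns need to be adjusted to the actual markup seen in the browser.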
