Scraping page data with Python

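Getting the request details for the page:

I'll use a joke site as the example:

In Chrome, press F12 to open Developer Tools, refresh the page, and in the Network tab find the request you want. Note its Request URL, request method, headers, and so on: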

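Here is a helper I wrote that turns the copied headers into quoted, dict-style name/value pairs. Feel free to copy it if it is useful: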

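def get_headers(header_raw):
    # Turn header text copied from DevTools into quoted 'name': 'value' lines
    # that can be pasted into a dict literal.
    lines = []
    for line in header_raw.strip().splitlines():
        # Split on the first colon only, so values that contain ':' (URLs) stay intact
        name, _, value = line.partition(':')
        lines.append("%r: %r," % (name.strip(), value.strip()))
    header_raw = "\n".join(lines)

    print(header_raw)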
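
If you would rather skip the copy-and-paste step, the same idea can build the dict directly. This is only a minimal sketch: `headers_to_dict` and the sample text are names I made up for illustration, and it assumes each header sits on its own line in `Name: value` form.

```python
def headers_to_dict(header_raw):
    """Parse 'Name: value' lines (as copied from DevTools) into a dict."""
    headers = {}
    for line in header_raw.strip().splitlines():
        # Split on the first colon only; values such as URLs contain ':' too
        name, sep, value = line.partition(':')
        if not sep or not name.strip():
            continue  # skip lines that are not in 'Name: value' form
        headers[name.strip()] = value.strip()
    return headers


# Hypothetical sample, shortened from the headers used later in this post
raw = """
Host: xiaohua.zol.com.cn
Referer: http://xiaohua.zol.com.cn/
Pragma: no-cache
"""

headers = headers_to_dict(raw)
print(headers)
```

The resulting dict can be passed as the `headers` argument of `requests.get`.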

Import and use the requests module in Python:

#!/usr/bin/python
# -*- coding: UTF-8 -*-

import requests
import re


class Attain_data():

    def attain_data(self):

        self.file_name = 'dump_txt' + '.txt'
        # Open as UTF-8 so Chinese text is written the same way on every platform
        self.fout = open(self.file_name, "w", encoding="utf-8")

        self.url = 'http://xiaohua.zol.com.cn/baoxiao/'
        self.headers = {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "zh-CN,zh;q=0.9,zh-TW;q=0.8,en;q=0.7",
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "Cookie": "gr_user_id=22558fa6-1747-417a-b221-9a3df417655b; ip_ck=1Y+t0YS9v8EuMTIxNDg3LjE1MzkxNDg3MTk%3D; z_pro_city=s_provice%3Dbeijing%26s_city%3Dbeijing; userProvinceId=1; userCityId=478; userCountyId=0; userLocationId=1; lv=1565232767; vn=2; Hm_lvt_ae5edc2bc4fc71370807f6187f0a2dd0=1565232769; _ga=GA1.3.1107658235.1565232769; _gid=GA1.3.1557263593.1565232769; bdshare_firstime=1565232769504; z_day=ixgo20%3D1; questionnaire_pv=1565222403; Hm_lpvt_ae5edc2bc4fc71370807f6187f0a2dd0=1565232835",
            "Host": "xiaohua.zol.com.cn",
            "Pragma": "no-cache",
            "Referer": "http://xiaohua.zol.com.cn/",
            "Upgrade-Insecure-Requests": "1",
            # No leading space in the value: requests rejects header values that start with whitespace
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36"
        }
        # Actually send the headers; the timeout keeps the script from hanging forever
        self.res = requests.get(url=self.url, headers=self.headers, timeout=10)
        # Guess the charset from the body in case the server does not declare one
        self.res.encoding = self.res.apparent_encoding

        #print(self.res.text)
        # Match every run of Chinese characters in the page
        self.pat = re.compile(r'[\u4e00-\u9fa5]+')
        self.result = self.pat.findall(self.res.text)
        print(self.result)

        # Save the matches, one per line, then close the file opened above
        self.fout.write("\n".join(self.result))
        self.fout.close()



if __name__ == '__main__':

    a = Attain_data()
    a.attain_data()
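
The list printed by the script contains every run of Chinese characters on the page, so menu labels and repeated link text come along with the jokes. Below is a rough sketch of one way to tidy it up before saving. `clean_runs`, `save_lines`, and the `min_len` cut-off are my own hypothetical names and values, not anything from requests, and the sample list is invented.

```python
def clean_runs(runs, min_len=4):
    """Drop short fragments and duplicates, keeping first-seen order."""
    seen = set()
    kept = []
    for run in runs:
        # Very short runs are usually navigation labels rather than content
        if len(run) < min_len or run in seen:
            continue
        seen.add(run)
        kept.append(run)
    return kept


def save_lines(lines, path):
    """Write one item per line as UTF-8."""
    with open(path, "w", encoding="utf-8") as fout:
        fout.write("\n".join(lines))


# Invented sample standing in for the list returned by findall()
sample = ["首页", "今天天气真好", "首页", "笑话大全合集", "今天天气真好"]
cleaned = clean_runs(sample)
print(cleaned)
```

Calling `save_lines(cleaned, "dump_txt.txt")` would then write the cleaned list to the same file name the script uses.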

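The pattern above grabs every run of Chinese characters on the page. To narrow down what gets scraped, it is worth studying regular expressions and writing a more specific pattern.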
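
As a small illustration of a more specific pattern, the sketch below uses capture groups to pull a title and a body out of each list item. The HTML snippet, the `title` and `summary-text` class names, and `extract_jokes` are all made up for this example; the real page's markup has to be checked in DevTools first.

```python
import re

# Invented markup; the real site's tags and class names may differ
SAMPLE_HTML = """
<ul>
  <li><span class="title">冷笑话一则</span>
      <div class="summary-text">小明问老师<br/>今天不用交作业吗</div></li>
  <li><span class="title">办公室趣事</span>
      <div class="summary-text">老板说周五开会</div></li>
</ul>
"""

# Non-greedy groups capture the title and the body; re.S lets '.' span newlines
ITEM_PAT = re.compile(
    r'<span class="title">(.*?)</span>\s*<div class="summary-text">(.*?)</div>',
    re.S,
)
# Used to strip any tags (such as <br/>) left inside the body
TAG_PAT = re.compile(r'<[^>]+>')


def extract_jokes(html):
    """Return a list of (title, body) tuples found in the HTML."""
    jokes = []
    for title, body in ITEM_PAT.findall(html):
        jokes.append((title.strip(), TAG_PAT.sub('', body).strip()))
    return jokes


for title, body in extract_jokes(SAMPLE_HTML):
    print(title, body)
```

Regular expressions are fine for a quick filter like this, but they break easily when the markup changes, so treat the pattern as something to adjust per page.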

 
