Web Scraping with Python (1): Simple Data Scraping and re Parsing

1. Simple data scraping

1. Scraping Baidu Translate

Open Baidu Translate and use the browser's built-in developer tools (Network panel) to capture the XHR request fired as you type.

Note the request URL, and that the request method is POST.

The request body is form data of the form kw: d, where d is the English word to look up.

import requests

url = "https://fanyi.baidu.com/sug"
s = input("请输入英文单词:")
dat = {"kw": s}
resp = requests.post(url, data=dat)
print(resp)
print(resp.json())
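
The interesting part is inside resp.json(). A minimal sketch of pulling the suggestions out, assuming the sug endpoint still returns JSON shaped like {"errno": 0, "data": [{"k": ..., "v": ...}, ...]} (check resp.json() yourself if the structure has changed):

import requests

url = "https://fanyi.baidu.com/sug"
resp = requests.post(url, data={"kw": "dog"})

# Assumption: "data" is a list of suggestion dicts with "k" (the word) and "v" (its meaning)
for item in resp.json().get("data", []):
    print(item.get("k"), "->", item.get("v"))
resp.close()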

The form data for Sogou Translate and Youdao Translate can be found the same way; a combined script that lets you pick the service:

import requests

url = {'百度翻译':"https://fanyi.baidu.com/sug",
       '搜狗翻译':"https://fanyi.sogou.com/reventondc/suggV3",
       '有道翻译':"https://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule"
}

service = input("请输入(百度翻译、搜狗翻译、有道翻译):")   # must match one of the keys in url above
s = input("请输入英文单词:")
dat = {
    '百度翻译': {"kw": s},
    '搜狗翻译': {"from": "auto", "to": "zh-CHS", "client": "web", "text": s,
             "uuid": "8166fd5c-2cf0-4f7e-a0c5-5ed5c8fc1011", "pid": "sogou-dict-vr", "addSugg": "on"},
    '有道翻译': {"i": s, "from": "AUTO", "to": "AUTO", "smartresult": "dict", "client": "fanyideskweb",
             "salt": "16408453493548", "sign": "69884ad58e3f6bc3dcbd65cbf80607c2", "lts": "1640845349354",
             "bv": "2632875b568a3baf568a14dddf2c8f7f", "doctype": "json", "version": "2.1",
             "keyfrom": "fanyi.web", "action": "FY_BY_REALTlME"}
}
header = {
    "百度翻译":
    {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'},
    "搜狗翻译":
    {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'},
    "有道翻译":
    {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    'Referer': 'http://fanyi.youdao.com/',
    'Cookie': 'OUTFOX_SEARCH_USER_ID=-1154806696@10.168.8.76; OUTFOX_SEARCH_USER_ID_NCOO=1227534676.2988937; JSESSIONID=aaa7LDLdy4Wbh9ECJb_Vw; ___rl__test__cookies=1563334957868'
}
}

resp = requests.post(url[service], data=dat[service], headers=header[service])
print(resp)
print(resp.json())

2. Scraping the Douban ranking list

1. Pass the query string via the params argument
2. Set the User-Agent in the request headers

import requests
url = "https://movie.douban.com/j/chart/top_list"
param= {
     "type": "24",
     "interval_id": "100:90",
     "action": "",
     "start": 0,
     "limit": 20
}
headers = {
      "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36 Edg/96.0.1054.62"
}
resp = requests.get(url=url, params=param, headers=headers)
print(resp)                                 # response status
print(resp.json())                          # scraped data
print(resp.request.headers['User-Agent'])   # confirm which UA was actually sent
resp.close()                                # release the connection
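
The endpoint returns a JSON list of movies. A quick sketch of reading it, assuming each entry carries "title" and "score" fields (the field names are not documented here, so verify against resp.json() first):

import requests

url = "https://movie.douban.com/j/chart/top_list"
param = {"type": "24", "interval_id": "100:90", "action": "", "start": 0, "limit": 20}
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"}
resp = requests.get(url, params=param, headers=headers)

# Assumption: the response is a JSON list and each movie dict has "title" and "score"
for movie in resp.json():
    print(movie.get("title"), movie.get("score"))
resp.close()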

2. Data parsing

Data parsing means picking the useful pieces out of the raw scraped content.
There are three common approaches:
1. re parsing
2. bs4 parsing
3. xpath parsing

3. re parsing

re parsing uses regular expressions, a syntax for matching patterns against strings.
https://tool.oschina.net/regex
is a site for testing regular expressions.
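
A minimal sketch of the standard-library re calls used later in this post (the sample text is made up for illustration):

import re

text = "Interstellar 2014 9.4  Inception 2010 9.3"

# findall returns every non-overlapping match as a list of strings
print(re.findall(r"\d+\.\d+", text))            # ['9.4', '9.3']

# search returns the first match object (or None if nothing matches)
m = re.search(r"(?P<year>\d{4})", text)
if m:
    print(m.group("year"))                      # 2014

# finditer yields match objects one by one; the Douban example below relies on it
for m in re.finditer(r"(?P<name>[A-Za-z]+) (?P<year>\d{4})", text):
    print(m.groupdict())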

1) Common metacharacters

.  any character except a newline    \w  word character (letter, digit, underscore)    \s  whitespace    \d  digit
^  start of the string    $  end of the string    a|b  a or b    ()  group    [...]  character class    [^...]  negated character class

2) Common quantifiers

*  repeat 0 or more times    +  repeat 1 or more times    ?  repeat 0 or 1 time    {n}  repeat exactly n times    {n,m}  repeat n to m times

3) Other syntax

Greedy match: .*
Lazy match:   .*?

.*? matches any number of repetitions, but uses as few as possible while still letting the overall match succeed (lazy, i.e. non-greedy, matching).
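
A small illustration of the difference on a made-up snippet of HTML:

import re

html = '<span class="title">Movie A</span><span class="title">Movie B</span>'

# Greedy: .* grabs as much as possible, so both titles end up in one match
print(re.findall(r'<span class="title">(.*)</span>', html))
# ['Movie A</span><span class="title">Movie B']

# Lazy: .*? stops at the first possible </span>, giving one match per title
print(re.findall(r'<span class="title">(.*?)</span>', html))
# ['Movie A', 'Movie B']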

4. re parsing: scraping the Douban Top 250


import requests
import re
import csv

url = 'https://movie.douban.com/top250'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 "
    "Safari/537.36 Edg/96.0.1054.62"
    }
resp = requests.get(url, headers=headers)
print(resp)
# grab the page source
pageContent = resp.text
resp.close()  # close the connection

# parse the data
obj = re.compile(r'<li>.*?<div class.*?>.*?<span class="title">'
                 r'(?P<name>.*?)</span>.*?<p class="">.*?<br>.*?'
                 r'(?P<year>\d.*?)&nbsp;/&nbsp.*?<div class="star">.*?'
                 r'<span class="rating_num" property="v:average">'
                 r'(?P<score>.*?)</span>.*?<span>'
                 r'(?P<people>.*?)评价</span>', re.S)   # precompile the regex
# run the matches
res = obj.finditer(pageContent)
# write the results to csv
f = open("data.csv", mode='w', newline='', encoding='utf-8')
csvwriter = csv.writer(f)
for it in res:
    #print(it.group("name","year",'score','people'))
    dic = it.groupdict()
    csvwriter.writerow(dic.values())
f.close()  # note the parentheses: a bare f.close does not close the file
print("Over!")

Improved version: each page only lists 25 movies, so loop start over 0, 25, ..., 225 to scrape all 250:

import requests
import re
import csv

def getMovie(starter):
    url = 'https://movie.douban.com/top250'
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 "
        "Safari/537.36 Edg/96.0.1054.62"
        }
    params = {
        "start": starter,
        "filter": ""
    }
    resp = requests.get(url, params=params, headers=headers)
    print(resp)
    # grab the page source
    pageContent = resp.text
    resp.close()  # close the connection
    return pageContent

# parse the data
obj = re.compile(r'<li>.*?<div class.*?>.*?<span class="title">'
                 r'(?P<name>.*?)</span>.*?<p class="">.*?<br>.*?'
                 r'(?P<year>\d{4}).*?&nbsp;/&nbsp.*?<div class="star">.*?'
                 r'<span class="rating_num" property="v:average">'
                 r'(?P<score>.*?)</span>.*?<span>'
                 r'(?P<people>.*?)评价</span>', re.S)   # precompile the regex
# write the results to csv
f = open("data.csv", mode='w', newline='', encoding='utf-8')
# run the matches
for starter in range(0,250,25):
    res = obj.finditer(getMovie(starter))
    csvwriter = csv.writer(f)
    for it in res:
        #print(it.group("name","year",'score','people'))
        dic = it.groupdict()
        #print(dic.values())
        csvwriter.writerow(dic.values())
    print(starter+25,"Over!")
f.close()

5. Summary

Web scraping is mostly material I had neither seen nor used before, so it will take some time to absorb. Tomorrow there is one more re parsing case study; we'll see whether I can get through it and bs4 in the same day.
