Python Web Scraping

全能大咖9

Last modified 2023-06-04 12:20:31


Tags: python, web scraping, programming languages

First published 2023-02-08 23:37:49

Article link: https://blog.csdn.net/m0_57048774/article/details/128944961


Common libraries

pip install requests
pip install lxml

Parameters

# GET requests take params; POST requests take data
res = requests.get(url=url,params=pas,headers=headers)
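To make the params/data distinction concrete, here is a minimal sketch that uses only the standard library (not requests itself) to show roughly where each payload ends up: GET parameters go into the URL's query string, POST data goes into the request body. The URL and the parameter names are made up for illustration, and nothing is actually sent over the network.

```python
from urllib.parse import urlencode
from urllib.request import Request

url = "https://example.com/search"      # hypothetical endpoint
pas = {"q": "python", "page": 2}        # hypothetical parameters

# GET: the parameters are encoded into the query string of the URL
get_req = Request(url + "?" + urlencode(pas))

# POST: the same parameters are form-encoded into the request body
post_req = Request(url, data=urlencode(pas).encode("utf-8"))

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data)
```

With requests, passing params= or data= should do this encoding for you.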

Response data

.text (decoded string)    .json() (parsed JSON)    .content (raw bytes)
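The three accessors are related, and a small standard-library sketch may help show how. The byte string below is a made-up stand-in for res.content; decoding it approximates res.text, and parsing the decoded text approximates res.json().

```python
import json

# Stand-in for res.content: the raw bytes a server might send (made-up payload)
content = '{"title": "爬虫", "tags": ["python"]}'.encode("utf-8")

text = content.decode("utf-8")  # roughly what res.text gives once the encoding is known
data = json.loads(text)         # roughly what res.json() gives

print(type(content).__name__, type(text).__name__, type(data).__name__)
```

As a rule of thumb, .content suits binary downloads such as images, .text suits HTML, and .json() suits API responses.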

Persistent storage

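import os

# The r prefix stops backslashes in the path from being treated as escape characters
path = os.path.join(r'C:\Users\Administrator\Desktop\爬虫爬取数据', info + '.html')
try:
    # The with block closes the file automatically, so no f.close() is needed
    with open(path, 'w', encoding='utf-8') as f:
        f.write(res2)
except OSError:
    print(tit, "failed to save")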
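The same save step can be wrapped in a small helper so that it is easy to try out. This is only a sketch: save_page is a hypothetical name, and a temporary folder stands in for the real target directory.

```python
import os
import tempfile

def save_page(folder, name, html):
    """Write html to <folder>/<name>.html; return the path, or None on failure."""
    path = os.path.join(folder, name + ".html")
    try:
        with open(path, "w", encoding="utf-8") as f:
            f.write(html)
    except OSError as exc:
        print(name, "failed to save:", exc)
        return None
    return path

folder = tempfile.mkdtemp()  # throwaway folder for the demo
saved = save_page(folder, "demo", "<html><body>你好</body></html>")

# Writing into a folder that does not exist fails and is reported, not raised
missing = save_page(os.path.join(folder, "no_such_dir"), "demo", "x")
```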

Saving as JSON

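import json

# json.dump arguments: the data to store, the open file to write to, and whether to escape non-ASCII characters
with open(path, 'w', encoding='utf-8') as cun:
    # ensure_ascii=False writes Chinese characters as-is; True (the default) escapes them as \uXXXX
    json.dump(res_info, fp=cun, ensure_ascii=False)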
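To see what ensure_ascii changes, the following sketch serializes the same made-up dictionary both ways with json.dumps, which takes the same flag as json.dump but returns a string instead of writing to a file.

```python
import json

res_info = {"title": "爬虫"}  # made-up sample data

escaped = json.dumps(res_info)                       # default: ensure_ascii=True
readable = json.dumps(res_info, ensure_ascii=False)  # keeps the characters as-is

print(escaped)
print(readable)

# Both forms load back to the same data; only the text on disk differs
same = json.loads(escaped) == json.loads(readable)
```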

Data parsing (regex)

import re

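# Match starting from the parent element; .*? lazily skips over text, and (?P<title>.*?) captures the wanted value under the group name title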
obj = re.compile(r'<li><a href="(?P<href>.*?)">(?P<title>.*?)</a></li><li><a href=',re.S)
res_info = obj.finditer(res_text)
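finditer returns match objects, and the named groups are read with group(). The sketch below runs on a made-up HTML snippet and uses a simplified pattern (without the trailing context of the one above), so treat it as an illustration rather than a pattern for any particular site.

```python
import re

# Made-up HTML standing in for res_text
res_text = (
    '<ul>'
    '<li><a href="/a.html">First</a></li>'
    '<li><a href="/b.html">Second</a></li>'
    '<li><a href="/c.html">Third</a></li>'
    '</ul>'
)

# re.S lets . match newlines too, which matters for multi-line pages
obj = re.compile(r'<li><a href="(?P<href>.*?)">(?P<title>.*?)</a></li>', re.S)

links = []
for m in obj.finditer(res_text):
    # Named groups are read by name; m.groupdict() returns them all at once
    links.append((m.group("href"), m.group("title")))

print(links)
```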

Data parsing (XPath)

from lxml import etree

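# Use parse for a local file and HTML for page source held in a string
# tree = etree.HTML(res_text)
# parse defaults to a strict XML parser, so pass HTMLParser to tolerate ordinary HTML
tree = etree.parse('./static/html/06xpath.html', etree.HTMLParser())
r1 = tree.xpath('/html/body/main/p/text()')  # walk down one level at a time
r2 = tree.xpath('/html//p/text()')   # // matches descendants, skipping any number of levels
t1 = tree.xpath('/html//ul/li/a/text()')  # text of every a element under ul/li
t2 = tree.xpath('/html//ul/li[3]/a/text()')  # indexing; XPath indexes start at 1
y1 = tree.xpath('/html//ul/li[3]/a/@href')  # @ selects an attribute
o1 = tree.xpath('/html//div[@class="job"]/text()')  # match by class
o2 = tree.xpath('/html//b[@id="ltp"]/i/text()')  # match by id
u1 = tree.xpath('/html//ol/li')
for item in u1:
    i1 = item.xpath('./a/text()')  # keep searching; ./ refers to the current li element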
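For a version that runs without the local file, the sketch below applies a few of the same expressions to a made-up HTML string through etree.HTML. It assumes lxml is installed; the markup and its values are invented for the example.

```python
from lxml import etree

# Made-up page source standing in for res_text
res_text = """
<html><body>
  <ul>
    <li><a href="/a.html">First</a></li>
    <li><a href="/b.html">Second</a></li>
    <li><a href="/c.html">Third</a></li>
  </ul>
  <div class="job">Engineer</div>
</body></html>
"""

tree = etree.HTML(res_text)

titles = tree.xpath('/html//ul/li/a/text()')          # text of every link
third_href = tree.xpath('/html//ul/li[3]/a/@href')    # XPath indexes start at 1
job = tree.xpath('/html//div[@class="job"]/text()')   # match by class attribute

# Relative queries: ./ starts from the current li element
per_item = [li.xpath('./a/text()') for li in tree.xpath('/html//ul/li')]

print(titles, third_href, job, per_item)
```

Note that xpath() returns a list even when only one node matches, so single values are usually read with [0].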

Creating a folder

import os

if not os.path.exists(r'C:\Users\Administrator\Desktop\爬虫爬取数据\4k美女'):
   os.mkdir(r'C:\Users\Administrator\Desktop\爬虫爬取数据\4k美女')
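An alternative worth knowing is os.makedirs with exist_ok=True, which folds the existence check into one call and also creates missing parent folders. A minimal sketch, using a temporary folder and a hypothetical nested path in place of the desktop directory:

```python
import os
import tempfile

base = tempfile.mkdtemp()                         # throwaway base folder for the demo
target = os.path.join(base, "scraped", "images")  # hypothetical nested folder

os.makedirs(target, exist_ok=True)  # creates both "scraped" and "images"
os.makedirs(target, exist_ok=True)  # calling it again is harmless, not an error

print(os.path.isdir(target))
```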

Fixing garbled text

# Option 1: set the encoding for the whole response
res.encoding = 'utf-8'
# Option 2: re-decode only the garbled string
title = title.encode('iso-8859-1').decode('gbk')
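The second option works because decoding bytes as ISO-8859-1 never loses information, so the wrong decode can be undone. The sketch below reproduces the problem on a made-up title, assuming the page was really GBK-encoded but was decoded as ISO-8859-1, and then reverses it.

```python
original = "中文标题"              # made-up title
raw = original.encode("gbk")       # bytes as the server would send them

# Decoding with the wrong charset produces garbled text
garbled = raw.decode("iso-8859-1")

# Undo the wrong decode to recover the bytes, then decode them correctly
title = garbled.encode("iso-8859-1").decode("gbk")

print(repr(garbled))
print(title)
```

If the real encoding is not GBK, the final decode needs the page's actual charset instead.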
