[Web Scraping] Common scraping and saving operations with requests and selenium

XC___XC

Last modified 2024-07-16 19:56:40

Category: Data Preprocessing. Tags: python, data analysis

First published 2021-05-26 16:27:39

Article link: https://blog.csdn.net/XC___XC/article/details/117295307

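Some simple APIs

These are a few simple APIs that I use often in day-to-day study. This post is a short summary, kept purely as notes so they are easy to look up later.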

Reading survey point data

vertices = []
VERTICES_FILE = "vertices.txt"

with open(VERTICES_FILE, "r", encoding="utf-8") as fr:
    for d in fr:
        d = d.strip()
        if not d:
            continue  # skip blank lines
        # each line d is a string such as "X,Y"; split it into a list of fields
        point = d.split(",")
        # convert the list to a coordinate tuple; survey x,y are swapped for plotting
        point = (float(point[1]), float(point[0]))
        vertices.append(point)

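Set the path of the point data file first. The X and Y of each point are combined into a tuple, and every point is appended to the vertices list. The output format can of course be adapted to your own needs.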
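
The parsing step can also be wrapped in a function so that it can be tried without a file on disk. The following is a minimal sketch under assumptions: `parse_vertices` is a hypothetical helper name, the sample coordinates are made up, and skipping malformed lines is one possible policy, not necessarily what the original data requires.

```python
from typing import Iterable, List, Tuple

def parse_vertices(lines: Iterable[str]) -> List[Tuple[float, float]]:
    """Parse 'X,Y' text lines into (Y, X) tuples, skipping blank or malformed lines."""
    points = []
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) < 2:
            continue  # blank line or missing field
        try:
            # survey X,Y are swapped to get plotting coordinates
            points.append((float(fields[1]), float(fields[0])))
        except ValueError:
            continue  # a field that is not a number
    return points

# made-up sample lines standing in for the contents of vertices.txt
sample = ["3380123.45,512345.67\n", "\n", "bad,line\n", "3380130.00,512350.25\n"]
vertices = parse_vertices(sample)
print(vertices)
```

An open file object can be passed in place of the sample list, since iterating over a file yields its lines.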

Fetching a web page

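import requests

def gethtml(url):
    try:
        header = {'User-Agent': "Mozilla/5.0"}
        html = requests.get(url=url, headers=header, timeout=10)
        # raise an exception for 4xx/5xx status codes
        html.raise_for_status()
        # guess the encoding from the content to avoid garbled text
        html.encoding = html.apparent_encoding
        return html.text
    except requests.RequestException as e:
        print("Request failed:", e)
        return None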

Pass in a URL and the function returns html.text, or None if the request fails.
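
Because the function returns None on failure, the caller should check the result before parsing it. A minimal usage sketch, assuming the requests package is installed; the malformed URL is chosen on purpose so that the failure branch runs without any network access.

```python
import requests

def gethtml(url):
    """Return the page text, or None if the request fails."""
    try:
        header = {"User-Agent": "Mozilla/5.0"}
        html = requests.get(url=url, headers=header, timeout=10)
        html.raise_for_status()
        html.encoding = html.apparent_encoding
        return html.text
    except requests.RequestException as e:
        print("Request failed:", e)
        return None

# "not-a-url" has no scheme, so requests rejects it before sending anything
page = gethtml("not-a-url")
if page is None:
    print("Nothing to parse, skipping this URL")
else:
    print(len(page), "characters downloaded")
```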

Setting up a headless browser with selenium

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def creat_web():
    # a single options object carries every setting passed to the driver
    option = Options()

    # headless browser
    option.add_argument('--headless')
    option.add_argument('--disable-gpu')

    # avoid automation detection
    option.add_experimental_option('excludeSwitches', ['enable-automation'])  # pass the parameter as a key-value pair

    # create the web object
    web = webdriver.Chrome(options=option)
    return web

This only configures selenium to run headless, that is, the browser window is not displayed while the script runs.

Excel data operations

import os
import pandas as pd

# 1. Read every file in a directory, yielding one DataFrame at a time; num is the sheet index
def read_excels(read_path, num=0):
    filenames = os.listdir(read_path)
    for each in filenames:
        # os.path.join works whether or not read_path ends with a separator
        data = pd.read_excel(os.path.join(read_path, each), sheet_name=num)
        yield data

# 2. Read a single file; num is the sheet index
def read_excel(path_excel, num=0):
    data = pd.read_excel(path_excel, sheet_name=num)
    return data


# 3. Extract the given rows and columns from the data
def data_iloc(data, m1, m2, n1, n2):
    result = data.iloc[m1:m2, n1:n2]
    return result

# 4. Add a column to data (values must have one entry per row)
def add_col(data, name, values):
    data[name] = values


# 5. Save all of the data
def to_excel(data, save_path):
    # convert to a DataFrame if it is not one already
    if not isinstance(data, pd.DataFrame):
        data = pd.DataFrame(data)
    data.to_excel(save_path, index=False)
    print('Saved successfully!')

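All of this can be done directly with the built-in pandas API. The wrappers here are just a note of the operations I use most often in everyday data processing.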
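
The slicing and column-adding operations can be tried on an in-memory DataFrame, with no Excel file needed. A minimal sketch, assuming pandas is installed; the column names and values are made up for illustration.

```python
import pandas as pd

# made-up data standing in for a sheet read with pd.read_excel
data = pd.DataFrame({
    "name": ["A", "B", "C"],
    "x": [1.0, 2.0, 3.0],
    "y": [4.0, 5.0, 6.0],
})

# rows 0-1 and columns 1-2 by position; the end of each slice is excluded
subset = data.iloc[0:2, 1:3]

# a new column needs exactly one value per row
data["z"] = [7, 8, 9]

print(subset)
print(data)
```

Note that iloc selects by integer position rather than by label, so the column slice 1:3 picks x and y whatever their names are.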

Batch-saving files scraped from web pages

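import os
import time
import requests

# analysis(baseurl) is the scraper generator (not shown here); it yields (name, image URL) pairs
def putpic(root="G://pic//"):       # root is the directory where the files are saved
    count = 0
    # create the directory once, so the first picture is not skipped
    os.makedirs(root, exist_ok=True)
    for each_alt, each_pic in analysis(baseurl):
        path = root + each_alt + '.jpg'
        try:
            response = requests.get(each_pic)
            response.raise_for_status()
            # the with block closes the file automatically
            with open(path, 'wb') as f:
                f.write(response.content)
            count += 1
            print("Picture %d saved" % count)
            time.sleep(0.5)
        except (requests.RequestException, OSError) as e:
            print("Failed to save %s: %s" % (path, e))
    print('Scraping finished!')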

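The scraper collects picture names and picture links and passes them along through a generator; the pictures are then saved one by one.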
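
The generator-plus-consumer pattern can be tried offline by replacing the download with ready-made bytes. A minimal sketch under assumptions: `fake_analysis` is a hypothetical stand-in for the real scraper, it yields bytes instead of URLs, and the files go to a temporary directory instead of G://pic//.

```python
import os
import tempfile

def fake_analysis():
    """Hypothetical stand-in for the scraper: yields (name, image bytes) pairs."""
    yield "cat", b"\xff\xd8fake-jpeg-1"
    yield "dog", b"\xff\xd8fake-jpeg-2"

def save_all(items, root):
    """Write every (name, content) pair to root as name.jpg and return the paths."""
    os.makedirs(root, exist_ok=True)  # create the directory once, up front
    saved = []
    for name, content in items:
        path = os.path.join(root, name + ".jpg")
        with open(path, "wb") as f:
            f.write(content)
        saved.append(path)
    return saved

root = os.path.join(tempfile.mkdtemp(), "pic")
saved = save_all(fake_analysis(), root)
print(len(saved), "files saved")
```

In a real run the generator would yield URLs, and the bytes would come from requests.get(url).content.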

Saving multiple scraped records to Excel

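import xlwt

# list_data and baseurl come from the scraper (not shown here)
def save(path, page):
    # write the header row
    wb = xlwt.Workbook(encoding='utf-8')
    ws = wb.add_sheet('sheet', cell_overwrite_ok=True)
    num = 1
    col = ['Name', 'Picture', 'Level', 'Type', 'Address', 'Specialty', 'National ranking']
    for i, title in enumerate(col):
        ws.write(0, i, title)

    # write the data rows
    for each in list_data(page, baseurl=baseurl):    # the generator handles pagination
        print('Start scraping...')
        # enumerate gives the column index directly; each.index(element)
        # returns the wrong column when a record contains duplicate values
        for j, element in enumerate(each):
            print(num, j, element)
            ws.write(num, j, element)
        num += 1
        wb.save(path)
        print("Saved %d hospital records" % (num - 1))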

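The approach is the same: create a workbook, write the header row, then write the scraped data. Each value is written one cell at a time with ws.write(row, column, value), where the row number num corresponds to one hospital and the column number is the position of the field within that hospital's record. As before, the data is scraped from the web pages and passed in one record at a time through a generator.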
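
When the column number is looked up with list.index, a record that contains the same value twice writes both values to the same column, because index always returns the first match. A short sketch of the difference, using a made-up record:

```python
# made-up record in which the value "3" appears twice
record = ["Hospital A", "3", "Beijing", "3"]

# list.index returns the position of the FIRST matching value
cols_by_index = [record.index(value) for value in record]

# enumerate returns the true position of every element
cols_by_enumerate = [j for j, _ in enumerate(record)]

print(cols_by_index)      # [0, 1, 2, 1]
print(cols_by_enumerate)  # [0, 1, 2, 3]
```

With index, the last field would overwrite column 1 and column 3 would stay empty, so enumerate is the safer way to get the column number.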

That is all for now; I will add more notes as they come up. There may be some mistakes, and corrections in the comments are welcome.
