Python Data Analysis (2): Data Collection and Manipulation

L是晴子的球迷

Article link: https://blog.csdn.net/qq_40102768/article/details/104688668

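Contents:
Reading and writing local data in common formats
Basic database operations in Python
Multi-table joins
Introduction to web crawlers
Parsing web pages with BeautifulSoup
The Scrapy crawler framework
Hands-on case: collecting air quality index data for cities in China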

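I. Reading and writing local data in common formats

 Common file formats for data analysis: txt, csv, json, xml, xls (Excel), HDF

1. Reading and writing txt files

  A text file is made up of lines of strings; each line ends with an EOL (End Of Line) character, '\n'.
  (1) Open the file, paying attention to the encoding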

file_obj = open(filename, access_mode, encoding='utf-8')
# access_mode: 'r' (read) or 'w' (write)
file_obj.close()

(2) Reading

file_obj.read()         # read the entire file content as one string
file_obj.readline()     # read one line per call
file_obj.readlines()    # return a list in which each element is one line
file_obj.close()

(3) Writing

file_obj.write()         # write a string to the file
file_obj.writelines()    # write a list of strings to the file in order (no '\n' is added automatically)
file_obj.close()

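(4) The with statement: it closes the file automatically, even if an exception is raised, so it is the recommended approach.
It suits any situation where a resource is accessed and a "cleanup" step must run whether or not an exception occurs during use, such as closing a file or automatically acquiring and releasing a thread lock.

filename='........../.txt'
with open(filename,'r',encoding='utf-8') as f_obj:
     print(f_obj.read())  # do the actual work here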
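The read and write calls listed above can be tried together in a small round trip. This is only a sketch: the file name demo.txt and the sample lines are made up, and a temporary directory is used so nothing else on disk is touched.

```python
import os
import tempfile

# Hypothetical file name inside a temporary directory (illustration only)
tmp_dir = tempfile.mkdtemp()
filename = os.path.join(tmp_dir, 'demo.txt')

lines = ['first line\n', 'second line\n', 'third line\n']

# writelines() does not add '\n', so each string carries its own
with open(filename, 'w', encoding='utf-8') as f_obj:
    f_obj.writelines(lines)

# read() returns the whole file as one string
with open(filename, 'r', encoding='utf-8') as f_obj:
    whole_text = f_obj.read()

# readline() returns one line per call; readlines() returns the rest as a list
with open(filename, 'r', encoding='utf-8') as f_obj:
    first = f_obj.readline()
    remaining = f_obj.readlines()

print(repr(first))
print(remaining)
print(f_obj.closed)  # the with block has already closed the file
```

Because every open() sits inside a with block, no explicit close() call is needed.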

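2. Reading and writing CSV (Comma-Separated Values) files (Excel files are handled similarly)

A CSV file stores tabular data as plain text, with commas as separators; the first line usually holds the column names. pandas makes these files quick and convenient to process.
pandas is built on top of NumPy. When a pandas object is printed, the index appears on the left and the values on the right; the index is created automatically by pandas unless one is specified.

pandas data structures: Series, similar to a one-dimensional array object; DataFrame, a tabular data structure in which each column can have a different data type and which can represent two-dimensional or higher-dimensional data.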

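(1) Reading

import pandas as pd
filename = '......../.csv'
df = pd.read_csv(filename,encoding='utf-16')      # returns a DataFrame; the encoding must match the file
print(df.head())    # print the first rows; the first line of the file is normally used as the column names

# How to read one column
country_se = df['国家']             # '国家' ("country") is the column name; the u'' prefix is not needed in Python 3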

(2) Writing

filename='......../.csv'
df.to_csv(filename,index=False,encoding='utf-8')   # index=False leaves the row index out of the file
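A CSV round trip with pandas might look like the sketch below. The DataFrame contents, the column names and the file name demo.csv are all invented for illustration; only pd.DataFrame, to_csv and read_csv come from pandas itself.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data; the column names are made up for this sketch
df = pd.DataFrame({'country': ['China', 'France'], 'population_m': [1400, 68]})

filename = os.path.join(tempfile.mkdtemp(), 'demo.csv')

# index=False leaves the automatically created row index out of the file
df.to_csv(filename, index=False, encoding='utf-8')

# Read it back; the first row of the file is used as the column names
df_back = pd.read_csv(filename, encoding='utf-8')

country_se = df_back['country']   # selecting one column gives a Series
print(df_back.head())
print(type(country_se).__name__)
```

Writing and reading with the same encoding avoids the decoding errors that appear when, for example, a UTF-16 file is read as UTF-8.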

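3. JSON (JavaScript Object Notation) files

Syntax rules: data consists of key-value pairs separated by commas; { } holds an object and [ ] holds an array.
(1) Reading

import json
filename = '...../.json'
with open(filename,'r',encoding='utf-8') as f_obj:
      data = json.load(f_obj)     # returns a dict when the top-level JSON value is an object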

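(2) JSON ----> CSV:

# Extract the keys and values of the JSON data separately: keys into year_lst, values into temp_lst
import pandas as pd
year_se = pd.Series(year_lst,name='year')
temp_se = pd.Series(temp_lst,name='temperature')
result_df = pd.concat([year_se,temp_se],axis=1)     # axis=1 combines the Series side by side as columns
print(result_df.head())
# Save as CSV
result_df.to_csv('....../.csv',index=False)       # index controls whether the row index is written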
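To make the JSON to CSV conversion concrete, here is a self-contained sketch. It assumes the JSON maps years to temperatures; the three values are invented, and the CSV is written to an in-memory buffer rather than to a file.

```python
import io
import json

import pandas as pd

# Hypothetical JSON content: year -> temperature (values are made up)
json_text = '{"2015": 14.8, "2016": 15.0, "2017": 14.9}'
data = json.loads(json_text)

year_lst = list(data.keys())
temp_lst = list(data.values())

year_se = pd.Series(year_lst, name='year')
temp_se = pd.Series(temp_lst, name='temperature')

# axis=1 places the two Series side by side as columns
result_df = pd.concat([year_se, temp_se], axis=1)

# Write to an in-memory buffer instead of a file to keep the sketch self-contained
buffer = io.StringIO()
result_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

The Series names become the CSV column headers, which is why name= is set on both Series.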

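(3) Encoding (writing JSON)

# The object being written is a list of dicts
import json
book_dict=[{},{}]
filename='...../.json'
with open(filename,'w',encoding='utf-8') as f_obj:
     f_obj.write(json.dumps(book_dict,ensure_ascii=False))   # ensure_ascii=False keeps non-ASCII text unescaped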
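The effect of ensure_ascii=False is easiest to see in a round trip. The sketch below uses invented book records and a temporary file named books.json; both are assumptions made purely for illustration.

```python
import json
import os
import tempfile

# Hypothetical records; the titles and prices are made up
book_list = [
    {'title': 'Python数据分析', 'price': 59.0},
    {'title': 'Learning SQL', 'price': 42.5},
]

filename = os.path.join(tempfile.mkdtemp(), 'books.json')

# ensure_ascii=False keeps non-ASCII characters readable in the file
with open(filename, 'w', encoding='utf-8') as f_obj:
    f_obj.write(json.dumps(book_list, ensure_ascii=False))

with open(filename, 'r', encoding='utf-8') as f_obj:
    loaded = json.load(f_obj)

# With the default ensure_ascii=True, non-ASCII characters become \uXXXX escapes
escaped = json.dumps(book_list[0]['title'])
print(loaded == book_list)
print(escaped)
```

Either form loads back to the same Python objects; the flag only changes how the text looks inside the file.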

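II. Basic database operations in Python

1. SQLite

import sqlite3
db_name = '....../.sqlite'
conn = sqlite3.connect(db_name)    # connect to the database; opens db_name if it exists, creates it otherwise
cursor = conn.cursor()             # get a cursor
cursor.execute(sql_str)            # execute one statement
cursor.executemany(sql_str, params_list)   # batch operation: run sql_str once per parameter tuple in params_list
cursor.fetchone()                  # fetch one record
cursor.fetchall()                  # fetch all remaining records
conn.commit()                      # commit the changes
conn.close()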
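The calls above can be exercised end to end against an in-memory database. The table name city, its columns and the three rows are hypothetical and exist only for this sketch.

```python
import sqlite3

# ':memory:' creates a temporary in-memory database, handy for experiments
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()

# Hypothetical table and rows, used only for illustration
cursor.execute('CREATE TABLE city (name TEXT, aqi INTEGER)')
rows = [('Beijing', 120), ('Shanghai', 80), ('Guangzhou', 65)]

# executemany() runs the same statement once per parameter tuple
cursor.executemany('INSERT INTO city (name, aqi) VALUES (?, ?)', rows)
conn.commit()

cursor.execute('SELECT name, aqi FROM city ORDER BY aqi')
first_row = cursor.fetchone()    # next row of the result set
other_rows = cursor.fetchall()   # all rows not yet fetched

print(first_row)
print(other_rows)
conn.close()
```

Using ? placeholders instead of string formatting lets sqlite3 handle quoting of the values.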

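III. Multi-table joins

A join links records from several tables in a query and returns the combined result.

Join types:
    Cross join (cross join): produces the Cartesian product of the two tables; the number of rows returned is the product of the two tables' row counts
    Inner join (inner join): produces the intersection of the two tables; only rows that match in both tables are returned
    Outer join (outer join): divided into left join and right join

Left join: left join(A,B) returns every record of table A; columns from table B hold values where a record matches and Null where none does
Right join: right join(A,B) returns every record of table B (SQLite versions before 3.39.0 do not support right joins)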
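The difference between the join types shows up clearly on a tiny data set. The following sketch assumes two made-up tables, city and reading, where one city has no reading; it runs on an in-memory SQLite database.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()

# Hypothetical tables: three cities, AQI readings for only two of them
cur.execute('CREATE TABLE city (id INTEGER, name TEXT)')
cur.execute('CREATE TABLE reading (city_id INTEGER, aqi INTEGER)')
cur.executemany('INSERT INTO city VALUES (?, ?)',
                [(1, 'Beijing'), (2, 'Shanghai'), (3, 'Lhasa')])
cur.executemany('INSERT INTO reading VALUES (?, ?)', [(1, 120), (2, 80)])

# Cross join: every city paired with every reading (3 * 2 rows)
cross_count = cur.execute(
    'SELECT COUNT(*) FROM city CROSS JOIN reading').fetchone()[0]

# Inner join: only cities that have a matching reading
inner_rows = cur.execute(
    'SELECT name, aqi FROM city INNER JOIN reading '
    'ON city.id = reading.city_id ORDER BY city.id').fetchall()

# Left join: all cities; the unmatched one gets NULL (None in Python)
left_rows = cur.execute(
    'SELECT name, aqi FROM city LEFT JOIN reading '
    'ON city.id = reading.city_id ORDER BY city.id').fetchall()

print(cross_count)
print(inner_rows)
print(left_rows)
conn.close()
```

A right join of (A, B) can be rewritten as a left join of (B, A), which is the usual workaround where right joins are unavailable.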

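IV. Introduction to web crawlers

Basic crawler architecture
URL management module: manages the URLs that are scheduled to be crawled and those already crawled
Page download module: requests and downloads the URLs handed over by the URL management module
Page parsing module: parses the pages fetched by the download module and processes or saves the data; when it finds further URLs to crawl, it passes them back to the URL management module and the cycle continues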

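import urllib.request

test_url = '...........com'
# Download directly by URL; access through a Request object or with cookies is also possible
response = urllib.request.urlopen(test_url)
print(response.getcode())   # 200 means the request succeeded
print(response.read())      # the response body as bytes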
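How the three modules cooperate can be sketched without any network access. In the example below the "web" is a hypothetical dictionary of pages under the made-up host example.test, download() stands in for the download module, and the function names are illustrative rather than part of any library.

```python
import re
from collections import deque

# Hypothetical in-memory "web": URL -> HTML, so the sketch runs offline
FAKE_PAGES = {
    'http://example.test/': '<a href="http://example.test/a">A</a>'
                            '<a href="http://example.test/b">B</a>',
    'http://example.test/a': '<a href="http://example.test/b">B</a>',
    'http://example.test/b': '<p>no links here</p>',
}


def download(url):
    """Downloader stand-in; a real crawler would call urllib.request.urlopen."""
    return FAKE_PAGES.get(url, '')


def parse_links(html):
    """Parser: pull out href values with a simple regular expression."""
    return re.findall(r'href="([^"]+)"', html)


def crawl(start_url):
    to_visit = deque([start_url])   # URL manager: planned URLs
    visited = []                    # URL manager: already crawled URLs
    while to_visit:
        url = to_visit.popleft()
        if url in visited:
            continue
        visited.append(url)
        for link in parse_links(download(url)):
            if link not in visited:
                to_visit.append(link)
    return visited


print(crawl('http://example.test/'))
```

Keeping the visited list separate from the queue is what stops the crawler from fetching the same page twice when several pages link to it.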

Ways to implement the page parsing module:
(1) Regular expressions: fuzzy matching on strings
(2) html.parser
(3) BeautifulSoup: structured page parsing
(4) lxml
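Of the options listed, html.parser ships with the standard library, so it is a convenient one to illustrate. The sketch below collects link targets from a small HTML snippet; the snippet and the LinkCollector class name are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


# Hypothetical HTML snippet, loosely modeled on a list of city links
html = '''
<div class="all"><div class="bottom">
  <a href="monthdata.php?city=Beijing">Beijing</a>
  <a href="monthdata.php?city=Shanghai">Shanghai</a>
</div></div>
'''

collector = LinkCollector()
collector.feed(html)
print(collector.links)
```

BeautifulSoup and lxml offer higher-level queries (CSS selectors, XPath) for the same task, at the cost of an extra dependency.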

V. Parsing web pages with BeautifulSoup (to be edited)

VI. The Scrapy crawler framework (to be edited)

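Scrapy framework project: collecting and analyzing film and TV information
Crawl process: Scrapy generates Requests from start_urls as the initial URLs and uses parse as the default callback; the target URL is parsed inside the parse function
Framework structure: (diagram omitted)

Steps for using Scrapy:
Install: pip install scrapy
1. Create the project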

scrapy startproject air_quality        # the last argument is the name of the project to create

2. Define the Item, which describes the object to be scraped

scrapy.Field()

3. Write the spider, the main body of the crawler

cd air_quality/

scrapy genspider aqi_history_spider https://.........

4. Write and configure the Pipeline, which processes the scraped results
5. Run the spider

scrapy crawl aqi_history_spider

VII. Collecting air quality index data for cities in China

items.py (defines the scraped item)

import scrapy


class AirQualityItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    city_name = scrapy.Field()  # city name
    record_date = scrapy.Field()  # measurement date
    aqi_val = scrapy.Field()  # AQI
    range_val = scrapy.Field()  # range
    quality_level = scrapy.Field()  # quality level
    pm2_5_val = scrapy.Field()  # PM2.5
    pm10_val = scrapy.Field()  # PM10
    so2_val = scrapy.Field()  # SO2
    co_val = scrapy.Field()  # CO
    no2_val = scrapy.Field()  # NO2
    o3_val = scrapy.Field()  # O3
    rank = scrapy.Field()  # rank

aqi_history_spider.py (the spider itself)

# -*- coding: utf-8 -*-
import scrapy
from air_quality.items import AirQualityItem
from urllib import parse

base_url = 'https://www.aqistudy.cn/historydata/'
class AqiHistorySpiderSpider(scrapy.Spider):
    name = 'aqi_history_spider'
    allowed_domains = ['aqistudy.cn']
    start_urls = ['https://www.aqistudy.cn/historydata/']

    def parse(self, response):
        """
        Parse the initial page, whose URL comes from start_urls
        :param response:
        :return:
        """
        # Get the URLs of all cities
        city_url_list = response.xpath('//div[@class="all"]//div[@class="bottom"]//a//@href')
        for city_url in city_url_list:
            # Loop over the city URLs to build each city's month-list URL (one level below start_urls)
            # .extract() returns the data held by the selector
            city_month_url = base_url + city_url.extract()

            # Request the month-list page and parse it in parse_city_month
            request = scrapy.Request(city_month_url,callback=self.parse_city_month)
            yield request

    def parse_city_month(self,response):
        """
        Parse the month list of one city
        :param response:
        :return:
        """
        month_url_list = response.xpath('//table[@class="table table-condensed '
                                        'table-bordered table-striped table-hover '
                                        'table-responsive"]//a//@href')

        for month_url in month_url_list:
            # Loop over the month URLs
            city_day_url = base_url + month_url.extract()
            # Request the page with this city's daily data and parse it in parse_city_day
            request = scrapy.Request(city_day_url, callback=self.parse_city_day)
            yield request

    def parse_city_day(self,response):
        """
        Parse the daily records of one city
        :param response:
        :return:
        """
        # Get the city name from the URL
        url = response.url
        city_url_name = url[url.find('=') + 1:url.find('&')]

        # city_url_name is percent-encoded, so decode it back to Chinese text
        city_name = parse.unquote(city_url_name)

        # Get the daily records
        day_record_list = response.xpath('//table[@class="table table-condensed '
                                         'table-bordered table-striped table-hover '
                                         'table-responsive"]//tr')
        for i, day_record in enumerate(day_record_list):
            if i == 0:
                # Skip the table header
                continue
            td_list = day_record.xpath('.//td')

            # Create a fresh item (the class from items.py) per row so earlier rows are not overwritten
            item = AirQualityItem()
            item['city_name'] = city_name
            item['record_date'] = td_list[0].xpath('text()').extract_first()  # measurement date
            item['aqi_val'] = td_list[1].xpath('text()').extract_first()  # AQI
            item['range_val'] = td_list[2].xpath('text()').extract_first()  # range
            item['quality_level'] = td_list[3].xpath('.//div/text()').extract_first()  # quality level
            item['pm2_5_val'] = td_list[4].xpath('text()').extract_first()  # PM2.5
            item['pm10_val'] = td_list[5].xpath('text()').extract_first()  # PM10
            item['so2_val'] = td_list[6].xpath('text()').extract_first()  # SO2
            item['co_val'] = td_list[7].xpath('text()').extract_first()  # CO
            item['no2_val'] = td_list[8].xpath('text()').extract_first()  # NO2
            item['o3_val'] = td_list[9].xpath('text()').extract_first()  # O3
            item['rank'] = td_list[10].xpath('text()').extract_first()  # rank

            yield item
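The city name extraction in parse_city_day relies on slicing the URL and then decoding it with parse.unquote. The sketch below shows that step in isolation on a hypothetical URL (the daydata.php path and the query layout are assumptions), together with an alternative based on urllib.parse.parse_qs, which does not depend on the parameter order.

```python
from urllib import parse

# Hypothetical day-data URL; the path and query layout are assumed for illustration
url = ('https://www.aqistudy.cn/historydata/daydata.php'
       '?city=%E5%8C%97%E4%BA%AC&month=201401')

# Slicing approach used in the spider: text between the first '=' and the first '&'
city_url_name = url[url.find('=') + 1:url.find('&')]
city_by_slice = parse.unquote(city_url_name)

# Alternative: let urllib split the query string; parse_qs also decodes the values
query = parse.parse_qs(parse.urlsplit(url).query)
city_by_query = query['city'][0]

print(city_by_slice, city_by_query)
```

The slicing version only works when the city is the first query parameter and another parameter follows it; parse_qs looks the value up by name instead.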


pipelines.py (saves the results in CSV format; to save in another format, just create a JSON or other class)

from scrapy.exporters import CsvItemExporter

class AirQualityPipeline(object):
    def open_spider(self, spider):
        self.file = open('air_quality.csv', 'wb')
        self.exporter = CsvItemExporter(self.file)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

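settings.py (find and declare ITEM_PIPELINES; to save in other formats as well, add the other classes inside the {}; 300 is the priority, and lower values run first)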

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'air_quality.pipelines.AirQualityPipeline': 300,
}
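As one possible way to save in another format, a second pipeline could write each item as a line of JSON. The class below is a hypothetical sketch, not part of the original project: the name JsonLinesPipeline, the default output path and the demo at the bottom are all assumptions. It uses only the standard library and is checked offline with a plain dict standing in for a scraped item.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Hypothetical pipeline that writes one JSON object per line."""

    def __init__(self, path='air_quality.jsonl'):
        self.path = path   # output path is an assumption of this sketch
        self.file = None

    def open_spider(self, spider):
        self.file = open(self.path, 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # dict(item) works for scrapy.Item objects and for plain dicts
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item


# Offline check: a plain dict stands in for a scraped item, None for the spider
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.jsonl')
pipeline = JsonLinesPipeline(demo_path)
pipeline.open_spider(None)
pipeline.process_item({'city_name': '北京', 'aqi_val': '120'}, None)
pipeline.close_spider(None)

with open(demo_path, encoding='utf-8') as f_obj:
    saved_lines = f_obj.read().splitlines()
print(saved_lines)
```

If such a class lived in pipelines.py, it would be enabled by adding an entry like 'air_quality.pipelines.JsonLinesPipeline': 400 to ITEM_PIPELINES; the value 400 is only an example and would make it run after the CSV pipeline.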

Finally, run the spider:

scrapy crawl aqi_history_spider
