2-5_Pandas_String_Operations: pandas string operations

五角钱的程序员

Article link: https://blog.csdn.net/youif/article/details/104275463

Contents

5.1 String operations
5.2 Average temperature
5.3 Plotting temperature and the snow season

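pandas string operations
Besides numeric values, much of the data we work with is text, and that part is just as important, so this section gives a brief tour of string handling in pandas.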

# -*- coding: utf-8 -*-
# Author  : XiangLin
# Created : 11/02/2020 20:15
# File    : 2-5_Pandas_String_Operations.py
# IDE     : PyCharm
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import warnings
warnings.filterwarnings('ignore')
plt.style.use("bmh")
plt.rc('font', family='SimHei', size=25)  # SimHei can render Chinese characters in plots
pd.set_option('display.max_columns',1000)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth',1000)

As we saw earlier, pandas handles numeric data with ease; as it turns out, it is just as capable with strings.
Let's load a weather dataset.

weather_2012 = pd.read_csv('weather_2012.csv',parse_dates=True,index_col='Date/Time')
print(weather_2012.head())
print(weather_2012.index)

Output:
                     Temp (C)  Dew Point Temp (C)  Rel Hum (%)  Wind Spd (km/h)  Visibility (km)  Stn Press (kPa)               Weather
Date/Time                                                                                                                              
2012-01-01 00:00:00      -1.8                -3.9           86                4              8.0           101.24                   Fog
2012-01-01 01:00:00      -1.8                -3.7           87                4              8.0           101.24                   Fog
2012-01-01 02:00:00      -1.8                -3.4           89                7              4.0           101.26  Freezing Drizzle,Fog
2012-01-01 03:00:00      -1.5                -3.2           88                6              4.0           101.27  Freezing Drizzle,Fog
2012-01-01 04:00:00      -1.5                -3.3           88                7              4.8           101.23                   Fog
DatetimeIndex(['2012-01-01 00:00:00', '2012-01-01 01:00:00', '2012-01-01 02:00:00', '2012-01-01 03:00:00', '2012-01-01 04:00:00', '2012-01-01 05:00:00', '2012-01-01 06:00:00', '2012-01-01 07:00:00', '2012-01-01 08:00:00', '2012-01-01 09:00:00',
               ...
               '2012-12-31 14:00:00', '2012-12-31 15:00:00', '2012-12-31 16:00:00', '2012-12-31 17:00:00', '2012-12-31 18:00:00', '2012-12-31 19:00:00', '2012-12-31 20:00:00', '2012-12-31 21:00:00', '2012-12-31 22:00:00', '2012-12-31 23:00:00'], dtype='datetime64[ns]', name='Date/Time', length=8784, freq=None)
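
To try the loading step without the data file, here is a minimal sketch that reads a few rows from an in-memory string. The three rows are copied from the head of the table above, and `csv_text` and `sample` are names made up for this illustration. It assumes the real file uses the same `Date/Time` header and quotes descriptions that contain commas.

```python
import io
import pandas as pd

# Hypothetical three-row sample laid out like weather_2012.csv (an assumption)
csv_text = """Date/Time,Temp (C),Weather
2012-01-01 00:00:00,-1.8,Fog
2012-01-01 01:00:00,-1.8,Fog
2012-01-01 02:00:00,-1.8,"Freezing Drizzle,Fog"
"""

# index_col picks the timestamp column; parse_dates=True asks pandas to parse the index as dates
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=True, index_col='Date/Time')

print(type(sample.index).__name__)
print(sample['Weather'].tolist())
```

A parsed datetime index is what makes the later resample() calls possible; with a plain string index they would raise an error.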

5.1 String operations

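In the data above there is a 'Weather' column. Here we treat any row whose description contains "Snow" as a snowy hour.

The .str accessor on a pandas Series offers a range of convenient string methods, such as contains, used here: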

weather_description = weather_2012['Weather']
is_snowing = weather_description.str.contains('Snow')
# contains returns a boolean Series (one True/False per row), not a DataFrame
print(is_snowing.astype(int)[:5])
is_snowing.astype(int).plot(figsize = (20,6))
plt.show()

w = weather_2012.loc[weather_2012['Weather'].str.contains('Snow'),'Weather'].head()
print(w)

Output:
Date/Time
2012-01-01 00:00:00    0
2012-01-01 01:00:00    0
2012-01-01 02:00:00    0
2012-01-01 03:00:00    0
2012-01-01 04:00:00    0
Name: Weather, dtype: int32
Date/Time
2012-01-02 17:00:00    Snow Showers
2012-01-02 20:00:00    Snow Showers
2012-01-02 21:00:00    Snow Showers
2012-01-02 23:00:00    Snow Showers
2012-01-03 00:00:00    Snow Showers
Name: Weather, dtype: object
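
The same filter can be tried on a tiny hand-made Series. This is only a sketch: the descriptions and the names `weather`, `is_snowing_demo` and `has_fog` are invented for the example. It also shows two optional arguments of `str.contains`, `na` and `case`, which the original code does not use but which matter when a column has missing or inconsistently cased values.

```python
import pandas as pd

# Made-up descriptions, including one missing value
weather = pd.Series(['Fog', 'Snow Showers', 'Freezing Drizzle,Fog', 'Moderate Snow', None])

# na=False treats a missing description as "not snowing" instead of propagating a missing value
is_snowing_demo = weather.str.contains('Snow', na=False)

# case=False makes the match case-insensitive
has_fog = weather.str.contains('fog', case=False, na=False)

print(is_snowing_demo.tolist())
print(weather[is_snowing_demo].tolist())
print(has_fog.tolist())
```

Without `na=False`, a missing entry would stay missing in the result, and using that result as a boolean mask would then fail.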

[Figure: hourly snow indicator (0 or 1) over 2012]

5.2 Average temperature

To get the median temperature for each month, there is a very handy method called resample().

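# resample('M', how=...) was removed from pandas; call the aggregation on the resampler instead
# (pandas 2.2+ renames the month-end alias 'M' to 'ME')
weather_med = weather_2012['Temp (C)'].resample('M').median()
print(weather_med)
weather_med.plot(figsize = (20,10),kind = 'bar')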
plt.show()

Output:
Date/Time
2012-01-31    -7.05
2012-02-29    -4.10
2012-03-31     2.60
2012-04-30     6.30
2012-05-31    16.05
2012-06-30    19.60
2012-07-31    22.90
2012-08-31    22.20
2012-09-30    16.10
2012-10-31    11.30
2012-11-30     1.05
2012-12-31    -2.85
Freq: M, Name: Temp (C), dtype: float64

[Figure: bar chart of the monthly median temperature]
As expected, July and August are the warmest months.
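
To see what the monthly resampling does in isolation, here is a small sketch on synthetic daily values (not the real weather data; `temps`, `monthly_median` and `monthly_mean` are illustrative names). It passes a `pd.offsets.MonthEnd()` object rather than a string alias, which is one way to sidestep the 'M' versus 'ME' naming difference between pandas versions, and it compares the median with the mean when one outlier is present.

```python
import pandas as pd

# Synthetic daily temperatures: 31 days of January, then 29 days of February 2012
idx = pd.date_range('2012-01-01', periods=60)
temps = pd.Series([-5.0] * 31 + [1.0] * 28 + [9.0], index=idx)

# A DateOffset object is used here instead of the 'M'/'ME' string alias
monthly = temps.resample(pd.offsets.MonthEnd())

monthly_median = monthly.median()
monthly_mean = monthly.mean()

print(monthly_median)
print(monthly_mean)
```

The single 9.0 reading pulls the February mean upward but leaves the median at 1.0, which is one reason a median can be a reasonable choice for noisy hourly readings.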

Boolean True and False values are awkward to read in numeric summaries, but underneath they are just 1 and 0, so we cast them to float before doing arithmetic.

print(is_snowing.astype(float)[:5])

Date/Time
2012-01-01 00:00:00    0.0
2012-01-01 01:00:00    0.0
2012-01-01 02:00:00    0.0
2012-01-01 03:00:00    0.0
2012-01-01 04:00:00    0.0
Name: Weather, dtype: float64
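
As a quick check of the 0/1 claim, the sketch below uses a made-up four-element Series (`flags` is a name chosen for the illustration). It suggests that pandas already counts True as 1 in `sum` and `mean`, so the explicit cast mainly makes the intent and the printed output clearer.

```python
import pandas as pd

flags = pd.Series([True, False, False, True])

n_true = flags.sum()             # number of True values
share_true = flags.mean()        # fraction of True values
as_float = flags.astype(float)   # explicit 1.0 / 0.0 encoding

print(n_true, share_true)
print(as_float.tolist())
```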

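Then we use resample again to find the share of snowy hours in each month (it feels like a slightly pointless exercise, since we already know which months are snowy...)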

# mean of the 0/1 indicator = share of snowy hours per month
me_snow = is_snowing.astype(float).resample('M').mean()
print(me_snow)
me_snow.plot(figsize = (20,10),kind = 'bar')
plt.show()

Output:
Date/Time
2012-01-31    0.240591
2012-02-29    0.162356
2012-03-31    0.087366
2012-04-30    0.015278
2012-05-31    0.000000
2012-06-30    0.000000
2012-07-31    0.000000
2012-08-31    0.000000
2012-09-30    0.000000
2012-10-31    0.000000
2012-11-30    0.038889
2012-12-31    0.251344
Freq: M, Name: Weather, dtype: float64

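[Figure: bar chart of the monthly share of snowy hours]
So, as you can see, December was the snowiest month in Canada in this data. You can spot a few other patterns too: snow starts abruptly in November and the snow season then drags on; the probability of snow drops gradually but does not reach zero until April or May.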
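
The whole chain, from text column to monthly snow share to snowiest month, can be sketched on invented data. The descriptions below are made up, and `weather_demo`, `snow_flag` and `snow_share` are illustrative names; `idxmax` is used here to pick out the month with the largest share, something the original reads off the chart by eye.

```python
import pandas as pd

# Invented daily descriptions: 31 days of January, 29 days of February 2012
idx = pd.date_range('2012-01-01', periods=60)
descriptions = ['Snow'] * 10 + ['Clear'] * 21 + ['Snow Showers'] * 3 + ['Fog'] * 26
weather_demo = pd.Series(descriptions, index=idx)

# Text -> boolean -> 0.0/1.0, then the monthly mean gives the share of snowy days
snow_flag = weather_demo.str.contains('Snow').astype(float)
snow_share = snow_flag.resample(pd.offsets.MonthEnd()).mean()

print(snow_share)
print(snow_share.idxmax())  # label of the month with the largest share
```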

5.3 Plotting temperature and the snow season

We put the temperature and the snow probability together as two columns of a DataFrame and plot them.

temperature = weather_2012['Temp (C)'].resample('M').median()
is_snowing = weather_2012['Weather'].str.contains('Snow')
snowiness = is_snowing.astype(float).resample('M').mean()
# name each Series; the names become the column names after concat
temperature.name = "Temperature"
snowiness.name = "Snowiness"
print(temperature)
print(snowiness)

Output:
Date/Time
2012-01-31    -7.05
2012-02-29    -4.10
2012-03-31     2.60
2012-04-30     6.30
2012-05-31    16.05
2012-06-30    19.60
2012-07-31    22.90
2012-08-31    22.20
2012-09-30    16.10
2012-10-31    11.30
2012-11-30     1.05
2012-12-31    -2.85
Freq: M, Name: Temperature, dtype: float64
Date/Time
2012-01-31    0.240591
2012-02-29    0.162356
2012-03-31    0.087366
2012-04-30    0.015278
2012-05-31    0.000000
2012-06-30    0.000000
2012-07-31    0.000000
2012-08-31    0.000000
2012-09-30    0.000000
2012-10-31    0.000000
2012-11-30    0.038889
2012-12-31    0.251344
Freq: M, Name: Snowiness, dtype: float64

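We use concat to join the two Series (this is concatenation of data, not of strings).
With axis=1, concat places them side by side as two columns of a new DataFrame.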

stats = pd.concat([temperature,snowiness],axis=1)
print(stats)
stats.plot(figsize = (20,10),kind = "bar")
plt.show()

        Temperature  Snowiness
Date/Time                         
2012-01-31        -7.05   0.240591
2012-02-29        -4.10   0.162356
2012-03-31         2.60   0.087366
2012-04-30         6.30   0.015278
2012-05-31        16.05   0.000000
2012-06-30        19.60   0.000000
2012-07-31        22.90   0.000000
2012-08-31        22.20   0.000000
2012-09-30        16.10   0.000000
2012-10-31        11.30   0.000000
2012-11-30         1.05   0.038889
2012-12-31        -2.85   0.251344
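
The effect of the `axis` argument is easier to see on two rows. The sketch below reuses the January and February values from the output above; `temperature_demo`, `snowiness_demo`, `wide` and `long` are names made up for the illustration.

```python
import pandas as pd

idx = pd.to_datetime(['2012-01-31', '2012-02-29'])
temperature_demo = pd.Series([-7.05, -4.10], index=idx, name='Temperature')
snowiness_demo = pd.Series([0.240591, 0.162356], index=idx, name='Snowiness')

# axis=1 aligns on the shared index and places the Series side by side as columns
wide = pd.concat([temperature_demo, snowiness_demo], axis=1)

# axis=0 (the default) stacks them end to end into one longer Series instead
long = pd.concat([temperature_demo, snowiness_demo])

print(wide.columns.tolist(), wide.shape)
print(long.shape)
```

The Series names become the column names, which is why the original code sets `.name` before concatenating.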

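[Figure: bar chart of Temperature and Snowiness on a shared axis]
And now you notice: what on earth, where did the red snow-probability bars go?!
The two columns have very different magnitudes, so they need to be plotted separately.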

stats.plot(kind='bar', subplots=True, figsize=(15, 10))
plt.show()

[Figure: Temperature and Snowiness as separate bar-chart subplots]
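
The scale mismatch can also be checked numerically. The sketch below takes three rows from the stats table (January, May, July) and compares the range of each column. It then applies min-max scaling, which is offered only as one possible alternative to subplots and is not something the original post does; `stats_demo`, `col_range` and `scaled` are illustrative names.

```python
import pandas as pd

# Three rows copied from the stats table shown earlier
stats_demo = pd.DataFrame(
    {'Temperature': [-7.05, 16.05, 22.90], 'Snowiness': [0.240591, 0.0, 0.0]},
    index=pd.to_datetime(['2012-01-31', '2012-05-31', '2012-07-31']),
)

# Range of each column: the narrow column is barely visible on a shared axis
col_range = stats_demo.max() - stats_demo.min()
print(col_range)

# Min-max scaling maps every column to [0, 1] so both fit on one axis
scaled = (stats_demo - stats_demo.min()) / col_range
print(scaled)
```

Scaling discards the original units, so separate subplots, as in the post, remain the simpler choice when the actual values matter.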
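Data link: https://pan.baidu.com/s/1caOMOZO0y5xOQD1mSjyLiA
Extraction code: rpxn
From the 七月在线 data mining algorithms course
向林 (XiangLin)
12 February 2020, Chengkou, Chongqing
Study hard and improve every day, and it will pay off in the end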
