Machine Learning Notes, Week 2

举栗子.

Last modified: 2024-07-13 21:44:07

Tags: machine learning, notes, artificial intelligence

First published: 2024-07-13 21:40:48

Article link: https://blog.csdn.net/wzb2090886937/article/details/140277929

1. pandas indexing

1.1 Indexing directly by row and column labels (column first, then row)

data[column_label][row_label]

import pandas as pd
import numpy  as np
ind = ["2024-2-13","2024-2-12","2024-2-10","2024-2-09"]
col   = ["low","high","open",'volume',"price"]
data=np.random.randint(10,14,(4,5))
test = pd.DataFrame(data,index=ind,columns=col)
print(test["low"]["2024-2-13"])

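Note:
1. Writing the row label first and the column label second raises an error (a KeyError), because the first bracket looks up column labels.
2. Written this way, with a single label in each bracket, the expression returns only one value.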

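1.2 Indexing with loc or iloc

loc and iloc can select a contiguous range of values.
Note: here the order is row first, then column.

loc uses row and column labels, and a label slice includes both endpoints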

test.loc['2024-2-13':'2024-2-10','open']

iloc uses integer positions for rows and columns

test.iloc[:2,:4]  # half-open: the start position is included, the stop position is excluded

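ix accepted a mix of positions and labels. It was deprecated in pandas 0.20 and removed in pandas 1.0, so use loc or iloc instead.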
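
The difference between label slicing and position slicing is easy to check on a small frame. The sketch below rebuilds a frame shaped like test, but with fixed made-up values (the name demo and the numbers are assumptions for illustration) so the result is reproducible.

```python
import numpy as np
import pandas as pd

ind = ["2024-2-13", "2024-2-12", "2024-2-10", "2024-2-09"]
col = ["low", "high", "open", "volume", "price"]
# Fixed values instead of random ones so the result is reproducible.
demo = pd.DataFrame(np.arange(20).reshape(4, 5), index=ind, columns=col)

# loc slices by label and includes BOTH endpoints: 3 rows here.
by_label = demo.loc["2024-2-13":"2024-2-10", "open"]

# iloc slices by position and excludes the stop position: 2 rows, 4 columns.
by_position = demo.iloc[:2, :4]

print(by_label)
print(by_position.shape)
```

The label slice returns three rows because "2024-2-10" itself is included, while the position slice stops before position 2.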

2. Assignment

Reassigning the contents of a whole column

test.high=1    # way 1: attribute access
print(test)

test["high"]=4  # way 2: bracket indexing
print(test)

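3. Sorting

3.1 Sorting a DataFrame

3.1.1 Using df.sort_values(by=, ascending=)

Sorts by a single key or by several keys
Parameters
- by: the key or keys to sort by
- ascending: ascending order by default
  - ascending=False: descending
  - ascending=True: ascending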

res = test.sort_values(by="high")
print(res)

Note: sorting returns a new DataFrame; the original is left unchanged.

res = test.sort_values(by=["high","open"])
print(res)

When sorting by several columns, the rows are ordered by the first column, and rows with equal values in the first column are then ordered by the second.
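
A minimal sketch of the tie-breaking rule, using a small made-up frame (demo and its values are assumptions, chosen so that the high column contains ties):

```python
import pandas as pd

demo = pd.DataFrame(
    {"high": [12, 10, 12, 10], "open": [3, 9, 1, 5]},
    index=["a", "b", "c", "d"],
)

# Sort by "high" first; rows with equal "high" are ordered by "open".
res = demo.sort_values(by=["high", "open"])

# A list for ascending sets the direction separately for each key.
mixed = demo.sort_values(by=["high", "open"], ascending=[False, True])

print(res)
print(mixed)
print(demo.index.tolist())  # the original frame keeps its order
```

Passing a list to ascending is handy when one key should run from high to low while the tie-breaker runs from low to high.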

3.1.2 Sorting by the row index with df.sort_index

res = test.sort_index()
print(res)

3.2 Sorting a Series

3.2.1 Using Series.sort_values(ascending=)

A Series holds a single column of values, so no by parameter is needed

s = pd.Series(data=[1,3,4,2],index=["2024-2-1","2024-2-13","2024-2-10","2024-3-1"])  # named s so that the DataFrame test stays available for section 4
print(s)

res = s.sort_values()
print(res)

3.2.2 Using Series.sort_index()

res = s.sort_index()
print(res)

4. DataFrame operations

4.1 Arithmetic operations

add()

print(test)
res = test["high"].add(10)
print(res)

sub()

print(test)
res = test["high"].sub(10)
print(res)

4.2 Logical operations

4.2.1 Logical operators

res = test[(test["high"]>=13) & (test["high"]<14)]   # each comparison must be wrapped in parentheses
print(res)

4.2.2 Logical functions

query(expr)
- expr: the query string

res = test.query("high>=12 & high<14")
print(res)

isin(values)

res = test[test["high"].isin([12,13])]
print(res)

4.3 Statistical operations

4.3.1 describe

Summary analysis: describe returns many statistics at once, including count, mean, std, min, and max

test.describe()

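4.3.2 Statistical functions

When a single statistical function is applied, the default is axis=0, which produces one result per column; to get one result per row, pass axis=1.
Note that median is not the mean: it sorts the values in ascending order and takes the middle one (the average of the two middle values when the count is even).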
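
To make the axis argument and the median concrete, here is a small sketch on made-up numbers (demo and its values are assumptions; the 20 in open is there so that the mean and the median differ):

```python
import pandas as pd

demo = pd.DataFrame(
    {"open": [10, 13, 11, 20], "high": [12, 13, 13, 11]},
    index=["d1", "d2", "d3", "d4"],
)

col_max = demo.max()        # axis=0 (default): one value per column
row_max = demo.max(axis=1)  # axis=1: one value per row

# Median of [10, 13, 11, 20]: sorted -> [10, 11, 13, 20],
# the two middle values 11 and 13 average to 12.
open_median = demo["open"].median()
open_mean = demo["open"].mean()

print(col_max)
print(row_max)
print(open_median, open_mean)
```

The outlier pulls the mean up to 13.5 while the median stays at 12, which is why the two should not be confused.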

4.3.3 Cumulative statistical functions

ans = test["open"].cumsum()
print(ans)

The result of a cumulative function can be plotted directly

import  matplotlib.pyplot as plt
ans.plot()
plt.show()

4.4 Custom operations

apply(func, axis=0)
- func: a user-defined function
- axis=0: the default, func is applied to each column; with axis=1 it is applied to each row

res = test[["open","high"]].apply(lambda x : x.max()-x.min(),axis=0)
print(res)

5. pandas plotting

5.1 pandas.DataFrame.plot

DataFrame.plot(kind='line')
- kind : str, the type of chart to draw
  - 'line' : line chart
  - 'bar' : vertical bar chart
  - 'barh' : horizontal bar chart
    - barh flips the x and y axes of the bar chart; see the documentation:
    - http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.barh.html
  - 'hist' : histogram
  - 'pie' : pie chart
  - 'scatter' : scatter plot
Other parameters are described at: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html?highlight=plot#pandas.DataFrame.plot

5.2 pandas.Series.plot

Series.plot is used in the same way as DataFrame.plot
More details: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.plot.html?highlight=plot#pandas.Series.plot
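
As a rough illustration of the kind parameter, the sketch below draws the same made-up frame three ways. The frame demo, the figure size, and the off-screen Agg backend are assumptions for the example; in a notebook the backend line can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

demo = pd.DataFrame(
    {"open": [10, 13, 11, 12], "high": [12, 13, 13, 11]},
    index=["d1", "d2", "d3", "d4"],
)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
demo.plot(kind="line", ax=axes[0], title="line")  # one line per column
demo.plot(kind="bar", ax=axes[1], title="bar")    # one bar per cell
# A scatter plot needs the x and y columns named explicitly.
demo.plot(kind="scatter", x="open", y="high", ax=axes[2], title="scatter")
fig.tight_layout()
```

In an interactive session, plt.show() displays the figure.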

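6. Reading and writing CSV, HDF5, and JSON files with pandas

6.1 CSV

6.1.1 pd.read_csv

pandas.read_csv(filepath_or_buffer, sep=',', usecols=None)
- filepath_or_buffer: file path
- sep: delimiter, "," by default
- usecols: the columns to read, given as a list of column names

There has to be a file before it can be read, so the example first writes the open and high columns of test to stock_day.csv (to_csv is covered in 6.1.2):

test.to_csv("stock_day.csv",columns=['open','high'])  # to_csv returns None when given a path, so there is nothing to print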

6.1.2 DataFrame.to_csv

DataFrame.to_csv(path_or_buf=None, sep=',', columns=None, header=True, index=True, mode='w', encoding=None)
- path_or_buf: file path
- sep: delimiter, "," by default
- columns: the columns to write
- header: boolean or list of strings, default True; whether to write the column names
- index: whether to write the row index
- mode: 'w' overwrites the file, 'a' appends to it

Reading stock_day.csv back with read_csv, keeping only the open column:

data = pd.read_csv("stock_day.csv",usecols=['open'])
print(data)

6.2 HDF5

6.2.1 read_hdf and to_hdf

HDF5 files use the .h5 extension.
Reading and writing an HDF5 file requires a key; the value stored under that key is the DataFrame.

pandas.read_hdf(path_or_buf, key=None, **kwargs)
Reads data from an h5 file
- path_or_buf: file path
- key: the key to read
- returns: the selected object
DataFrame.to_hdf(path_or_buf, key, **kwargs)

6.3 JSON

6.3.1 read_json

pandas.read_json(path_or_buf=None, orient=None, typ='frame', lines=False)
- Converts JSON into a pandas DataFrame by default
- orient : string, the expected format of the JSON string
  - 'split' : dict like {index -> [index], columns -> [columns], data -> [values]}
    - split keeps the index, the column names, and the data as three separate parts
  - 'records' : list like [{column -> value}, ... , {column -> value}]
    - records writes each row as column: value pairs
  - 'index' : dict like {index -> {column -> value}}
    - index writes index: {column: value}
  - 'columns' : dict like {column -> {index -> value}}, the default for a DataFrame
    - columns writes column: {index: value}
  - 'values' : just the values array
    - values writes only the values
- lines : boolean, default False
  - reads the file as one JSON object per line
- typ : default 'frame'; the type of object to build, 'series' or 'frame'

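6.3.2 to_json

DataFrame.to_json(path_or_buf=None, orient=None, lines=False)
- Stores a pandas object as JSON
- path_or_buf=None: file path; with None the JSON is returned as a string
- orient: the JSON layout, one of {'split', 'records', 'index', 'columns', 'values'}
- lines: write one object per line (requires orient='records')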
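
The orient layouts are easier to compare side by side. This sketch serializes a tiny made-up frame (demo and its values are assumptions) to strings instead of files, then reads one of them back through StringIO.

```python
from io import StringIO

import pandas as pd

demo = pd.DataFrame({"open": [10, 13], "high": [12, 14]}, index=["r1", "r2"])

# With no path given, to_json returns the JSON text as a string.
as_records = demo.to_json(orient="records")
as_split = demo.to_json(orient="split")
as_lines = demo.to_json(orient="records", lines=True)

print(as_records)
print(as_split)
print(as_lines)

# Reading back needs the same orient that was used for writing.
back = pd.read_json(StringIO(as_split), orient="split")
print(back)
```

Note that records drops the row index, so it only suits data whose index carries no information.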

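7. Advanced processing

7.1 Handling missing values

First find out how missing values are represented.
If they are marked with something other than np.nan (for example "?"), convert the marker to np.nan before handling them.

7.1.1 Missing values not represented as NaN

df.replace(to_replace=, value=)
- to_replace: the value to be replaced
- value: the replacement value

7.1.2 Checking for missing values (NaN)

pd.isnull(df): True where a value is missing
pd.notnull(df): True where a value is not missing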

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
data =pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data")
data = data.replace(to_replace="?",value=np.nan)  # replace returns a new DataFrame, so assign it back
np.all(pd.notnull(data))  # False means at least one value is missing

np.all is True only when every element is True.

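7.1.3 Missing values are present and are np.nan

7.1.3.1 Dropping missing values

dropna does not modify the original data.

df.dropna(axis=): drops the rows (axis=0, the default) or the columns (axis=1) that contain NaN

x = data.dropna(axis=1)
ans = np.all(pd.notnull(x))  # check the result x, not the untouched data
ans

The result can be assigned to a new variable or back to the original name:

data = data.dropna()

ans = test.dropna(axis=1)
ans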

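7.1.3.2 Replacing missing values

test['www'] = test['www'].fillna(12)  # assumes test has a column named 'www' that contains NaN
test

Calling fillna(..., inplace=True) on a selected column is not guaranteed to update test in recent pandas versions, so the result is assigned back instead.
df[col].fillna(value, inplace=)
- value: the value that replaces each NaN
- inplace: True modifies the object it is called on, False (the default) returns a new one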
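
Putting the steps of 7.1 together on a tiny made-up frame (raw, its "?" marker, and the fill value "5" are assumptions for illustration) might look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical data in which "?" marks a missing value.
raw = pd.DataFrame({"a": [1, 2, 3], "b": ["4", "?", "6"]})

# Step 1: turn the placeholder into np.nan.
clean = raw.replace(to_replace="?", value=np.nan)

# Step 2: check whether anything is missing.
has_missing = pd.isnull(clean).any().any()

# Step 3a: drop the rows that contain NaN (axis=0 is the default).
dropped = clean.dropna()

# Step 3b: or fill NaN with a chosen value, assigning the result back.
filled = clean.copy()
filled["b"] = filled["b"].fillna("5")

print(has_missing)
print(dropped)
print(filled)
```

Neither replace nor dropna touches raw, so the "?" is still there afterwards.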

7.2 Data discretization

7.2.1 pd.qcut(data, num)

Splits the data into num groups so that each group holds roughly the same number of values

data =pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data")
res = data["2.1"]
ans = pd.qcut(res,1)
print(ans)
w   =ans.value_counts()
print(w)

7.2.2 pd.cut(data, bins)

bins sets the interval edges explicitly

bins =[0,1,2,3,4]
ans2 = pd.cut(res,bins)
print(ans2)
ww   =ans2.value_counts()
print(ww)

Both qcut and cut are usually combined with Series.value_counts() to count the values in each bin.
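
The contrast between the two functions shows up clearly when the data contains an outlier. A minimal sketch on made-up values (the Series and the bin edges are assumptions):

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# qcut: 2 quantile-based bins, 5 values each despite the outlier.
by_quantile = pd.qcut(values, 2)
print(by_quantile.value_counts())

# cut: bins given explicitly as edges; each interval is (left, right].
by_edges = pd.cut(values, bins=[0, 5, 10, 100])
print(by_edges.value_counts())
```

qcut balances the group sizes and lets the edges fall where they may; cut fixes the edges and lets the group sizes vary.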

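7.2.3 One-hot encoding

Each category becomes its own boolean column, and for any given sample exactly one of these columns takes the value 1. This is also known as one-hot encoding.
pandas.get_dummies(data, prefix=None)
- prefix: a prefix for the names of the generated columns

dummies = pd.get_dummies(ans2, prefix="rise")  # encode the binned values ans2, not the counts ww
print(dummies)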
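
A self-contained sketch of the same idea on made-up categories (the Series grade and the prefix are assumptions):

```python
import pandas as pd

grade = pd.Series(["low", "high", "mid", "low"])

# One indicator column per category, named <prefix>_<category>.
dummies = pd.get_dummies(grade, prefix="grade")
print(dummies)

# Each row has exactly one indicator set.
print(dummies.sum(axis=1).tolist())
```

The generated columns follow the sorted order of the categories, not the order in which they first appear.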

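7.3 Merging

7.3.1 Concatenating data

pd.concat([data1, data2], axis=1)
- axis=1 places the frames side by side, aligned on the row index; axis=0 (the default) stacks them vertically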

data1 = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
'key2': ['K0', 'K1', 'K0', 'K1'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['0', '1', '2', '3']})
data2 = pd.DataFrame({'key': ['0', '0', '1', '2'],
'2': ['K', 'K1', 'K', 'K1'],
'3': ['A', 'A', 'A', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
ans = pd.concat([data2,data1],axis=1)
print(ans)

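7.3.2 Merging tables

pd.merge(left, right, how='inner', on=None)
- Merges two tables on a shared key, or on separate keys from the left and right tables (left_on, right_on)
- left : a DataFrame
- right : the other DataFrame
- on : the shared key column(s); when omitted, pandas uses the column names the two frames have in common
- how : the join type: 'inner', 'left', 'right', or 'outer'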
This blog post explains it well:
https://blog.csdn.net/qq_43874317/article/details/128128362?ops_request_misc=&request_id=&biz_id=102&utm_term=pd.merge%E5%A4%96%E9%93%BE%E6%8E%A5&utm_medium=distribute.pc_search_result.none-task-blog-2_blogsobaiduweb~default-1-128128362.142^v100control&spm=1018.2226.3001.4450

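7.4 Cross-tabulation and pivot tables

Cross-tabulation: counts how the values of one column are distributed across the groups of another column (a special pivot table for group frequencies)
- pd.crosstab(value1, value2)
Pivot table: uses columns of the original DataFrame as the row index and the column index, then applies an aggregation function to the specified columns
- data.pivot_table()
- DataFrame.pivot_table([], index=[])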
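
A small sketch of both calls on made-up data (the frame demo, with a weekday column and a 0/1 rise flag, is an assumption for illustration):

```python
import pandas as pd

# Hypothetical data: the weekday and whether the price rose (1) or not (0).
demo = pd.DataFrame(
    {
        "weekday": ["Mon", "Mon", "Tue", "Tue", "Tue"],
        "rise": [1, 0, 1, 1, 0],
    }
)

# Cross-tabulation: how many rows fall into each (weekday, rise) pair.
counts = pd.crosstab(demo["weekday"], demo["rise"])
print(counts)

# Pivot table: aggregate "rise" per weekday; the default aggfunc is the mean,
# which here is the share of rising days.
share = demo.pivot_table(values="rise", index="weekday")
print(share)
```

The cross-tabulation gives raw counts, while the pivot table goes straight to the per-group proportion.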

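7.5 Grouping and aggregation

DataFrame.groupby(key, as_index=False)
- key: the column(s) to group by; more than one may be given

Group, then take the mean. Here col is a DataFrame with color and price1 columns; its definition is not shown in these notes.
DataFrame:

ans = col.groupby(['color'])['price1'].mean()
print(ans)

Series:

asn = col['price1'].groupby(col["color"]).mean()
asn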
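
Since the frame col is not defined in the notes, the sketch below builds a hypothetical stand-in (the colors and prices are made up) and shows that the two spellings give the same result:

```python
import pandas as pd

# Hypothetical stand-in for the `col` DataFrame used above.
col = pd.DataFrame(
    {
        "color": ["white", "red", "green", "red", "green"],
        "price1": [5.56, 4.20, 1.30, 0.56, 2.75],
    }
)

# Group the DataFrame, then pick the column to aggregate.
by_frame = col.groupby(["color"])["price1"].mean()

# Pick the Series first, then group it by another Series.
by_series = col["price1"].groupby(col["color"]).mean()

# as_index=False keeps the key as a regular column instead of the index.
flat = col.groupby("color", as_index=False)["price1"].mean()

print(by_frame)
print(flat)
```

as_index=False is convenient when the grouped result is going to be merged or plotted as an ordinary table.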

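8. Seaborn

8.1 Plotting a univariate distribution

seaborn.distplot(a, bins=None, hist=True, kde=True, rug=False, fit=None, color=None)
(1) a: the data to examine; a Series, a one-dimensional array, or a list.
(2) bins: controls the number of bars.
(3) hist: boolean, whether to draw the histogram.
(4) kde: boolean, whether to draw the Gaussian kernel density estimate curve.
(5) rug: boolean, whether to draw a rugplot along the support axis.
Note: distplot has been deprecated since seaborn 0.11 in favor of histplot and displot.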

import seaborn as sns
import numpy  as np

np.random.seed(0)  # fix the random seed so the same numbers are generated on every run
test = np.random.randn(100)  # draw 100 samples from the normal distribution with mean 0 and variance 1
ans = sns.distplot(test,bins=10,rug=True,hist=True,kde=True)

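8.2 Plotting a bivariate distribution

seaborn.jointplot(x, y, data=None,
kind='scatter', stat_func=None, color=None,
ratio=5, space=0.2, dropna=True)
(1) kind: the type of plot to draw.
(2) stat_func: computes a statistic about the relationship and annotates the plot with it.
(3) color: the color of the plot elements.
(4) size: the size of the figure (it is square).
(5) ratio: the ratio of the central plot to the marginal plots; the larger the value, the larger the share of the central plot.
(6) space: the gap between the central plot and the marginal plots.
Note: this is the signature from older seaborn releases; current releases name the size parameter height and no longer accept stat_func.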

8.2.1 Scatter plot

import pandas as pd
x = np.random.randn(500)
y = np.random.randn(500)
Data = pd.DataFrame({"x":x,"y":y})
sns.jointplot(x="x",y="y",data=Data,color='b',height=50,kind="scatter",ratio=10,space=1)  # height is the current name of the older size parameter

8.2.2 Two-dimensional histogram (hexbin)

x = np.random.randn(500)
y = np.random.randn(500)
Data = pd.DataFrame({"x":x,"y":y})
sns.jointplot(x="x",y="y",data=Data,kind="hex")

8.2.3 Kernel density estimate plot

x = np.random.randn(500)
y = np.random.randn(500)
Data = pd.DataFrame({"x":x,"y":y})
sns.jointplot(x="x",y="y",data=Data,kind="kde")

8.3 Plotting pairwise bivariate distributions

dataset = sns.load_dataset("iris")
sns.pairplot(dataset)

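8.4 Plotting categorical data

8.4.1 Categorical scatter plot

seaborn.stripplot(x=None, y=None, hue=None, data=None, order=None, hue_order=None, jitter=False)

(1) x, y, hue: inputs for plotting long-form data.
(2) data: the dataset to plot. If x and y are absent it is treated as wide-form; otherwise as long-form.
(3) jitter: the amount of jitter (along the categorical axis only). When many points overlap, give an amount or set it to True to use the default.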

tips = sns.load_dataset("tips")
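sns.stripplot(x='day',y='total_bill',data=tips)  # tips has no x or y columns, so day and total_bill are used
sns.stripplot(x='day',y='total_bill',data=tips,jitter=True)  # reduces overlap, although some points still overlap
sns.swarmplot(x='day',y='total_bill',data=tips)  # points do not overlap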

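8.4.2 Distribution of data within categories

Box plot:
A box plot (also called a box-and-whisker plot) is a statistical chart that shows how a set of data is spread out. It is named after its box-like shape.
Violin plot:
A violin plot shows the distribution of the data together with its probability density.

8.4.2.1 Drawing a box plot

seaborn.boxplot(x=None, y=None, hue=None, data=None, orient=None, color=None, palette=None, saturation=0.75, width=0.8)
(1) palette: the colors for the different levels of hue, for example palette=["r","g","b","y"]
(2) saturation: the color saturation, given as a decimal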

8.4.2.2 Drawing a violin plot

seaborn.violinplot(x=None, y=None, hue=None, data=None)

8.5 Statistical estimation within categories

barplot(): draws a bar chart.
pointplot(): draws a point plot.
sns.barplot(x='day',y='total_bill',data=tips)  # bar chart

sns.pointplot(x='day',y='total_bill',data=tips)  # point plot

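8.6 Case study: NBA player data analysis

I could not obtain the dataset, so the code below has no output; I typed it out after watching the video and understanding the code.

Correlation analysis of the efficiency value:

data = pd.read_csv(" --")
data_cor = data.select_dtypes("number")  # assumption: correlate all numeric columns; the original column selection is not shown
corr = data_cor.corr()   # pairwise correlation between the columns
sns.heatmap(corr,square=True,linewidths=0.02,annot=False)
# annot: whether to print the values in the heatmap cells
# seaborn's heatmap draws a matrix of numeric values as a grid of colored cells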

Some visualizations of derived variables:

data['avg_point']=data['POINTS']/data['MP']
def age_cnt(df):
    if df.AGE<=24:
        return 'young'
    elif df.AGE>30:
        return "old"
    else:
        return "best"
data['age_cut']=data.apply(lambda x : age_cnt(x),axis=1)
# player salary versus efficiency value
sns.set_style('darkgrid')  # set the seaborn panel style
plt.figure(figsize=(20,8),dpi=100)
plt.title("RMP and SALARY",size=100)
x1 = data.loc[data.age_cut=='old'].RMP
y1 = data.loc[data.age_cut=='old'].SALARY
plt.plot(x1,y1,'^')  # '^' is a marker, so it is passed as the format string rather than as linestyle
# relationships among several player statistics
data2=data.loc[:,['RMP',"SALARY",'TRB','AST','age_cut']]
sns.pairplot(data2,hue='age_cut')  # color the points by age_cut category

Team salary ranking:

data_team=data.groupby(by="TEAM").agg({'SALARY':'mean'})
data_team.sort_values(by="SALARY",ascending=False)
# group by team and age bracket and rank by the number of listed players; ties are broken by efficiency value in descending order
data_team = data.groupby(by=["TEAM",'age_cut']).agg({'SALARY':'mean','RMP':'mean','PLAYER':'size'})
data_team.sort_values(by=['PLAYER','RMP'],ascending=False)  # ascending=True would sort in ascending order

Overall team strength analysis:

data_grp=data.groupby(by='TEAM',as_index=False).agg({'SALARY':'mean','RMP':'mean','AGE':'mean'})
data_grp=data_grp.loc[data_grp.AGE>5]
data_grp.sort_values(by='RMP',ascending=False)

Analyzing several teams with box plots and violin plots:

# box plots
sns.set_style("darkgrid")
plt.figure(figsize=(20,8),dpi=100)
data1 = data[data['TEAM'].isin(['A','b','c'])]  # filter the player-level data so each team has a distribution to draw
plt.subplot(3,1,1)
sns.boxplot(x='TEAM',y="RMP",data=data1)
plt.subplot(3,1,2)
sns.boxplot(x='TEAM',y="SALARY",data=data1)
plt.subplot(3,1,3)
sns.boxplot(x='TEAM',y="AGE",data=data1)
# violin plots
sns.set_style("darkgrid")
plt.figure(figsize=(20,8),dpi=100)
data1 = data[data['TEAM'].isin(['A','b','c'])]
plt.subplot(3,1,1)
sns.violinplot(x='TEAM',y="RMP",data=data1)
plt.subplot(3,1,2)
sns.violinplot(x='TEAM',y="SALARY",data=data1)
plt.subplot(3,1,3)
sns.violinplot(x='TEAM',y="AGE",data=data1)

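8.7 Statistical analysis of Beijing rental data

import numpy  as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
# load the data
data = pd.read_csv('')
# remove duplicates and missing values
data = data.drop_duplicates()  # duplicates
data = data.dropna()  # missing values; dropna returns a new DataFrame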


# data type conversion
data_new=[]
data_area=data['面积'].values
for i in data_area:
    x = i[:-2]  # drop the two-character unit suffix
    data_new.append(x)
# use astype to convert the strings to float
data_new=np.array(data_new).astype(np.float64)
# replace the old column
data['面积']=data_new


# use replace() to change 房间 to 室 in the layout strings
house_data = data["户型"].values
temp_list=[]
for i in house_data:
    new_info = i.replace('房间','室')  # str.replace returns a new string
    temp_list.append(new_info)
data.loc[:,'户型']=temp_list


# number of listings and their distribution by district

group_count=data.groupby(by="区域").count()
# take the district names from the grouped index so that names and counts line up
new_df=pd.DataFrame({'区域':group_count.index,'数量':group_count["小区"].values})
new_df=new_df.sort_values(by="数量",ascending=False)


# count the listings for each layout

def count(x):
    ww = np.unique(x)
    res={}
    for i in ww:
        num=0
        for j in x:
            if i==j:
                num+=1
        res[i]=num
    return res
house_count=count(data["户型"])

# keep the layouts with more than 50 listings, then convert the result to a DataFrame
house_type=dict((key,value) for key,value in house_count.items() if value>50)
df_house=pd.DataFrame({'户型':[x for x in house_type.keys()],"数量":[x for x in house_type.values()]})


house_type=df_house['户型']
house_type_num=df_house['数量']
plt.barh(range(len(house_type_num)),house_type_num,height=0.4)
plt.yticks(range(len(house_type_num)),house_type)
plt.xlim(0,2500)  # set the x-axis range
plt.xlabel("Count")
plt.ylabel('Layout')
plt.title('Number of homes by layout in the Beijing area')
for x,y in enumerate(house_type_num):
    plt.text(y+0.3,x-0.2,'%s'%y)
plt.show()
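
The cleaning loops above can also be written with pandas string methods. The sketch below is one possible alternative, not the course's code; the three rows are made up to mimic the shape of the rental data (an area string ending in a two-character unit, and a layout string).

```python
import pandas as pd

# Hypothetical rows shaped like the rental data described above.
data = pd.DataFrame(
    {
        "面积": ["59.11平米", "56.92平米", "40.57平米"],
        "户型": ["1室1厅", "1房间0厅", "1室1厅"],
    }
)

# Strip the two-character unit suffix and convert to float in one step.
data["面积"] = data["面积"].str[:-2].astype(float)

# Series.str.replace returns a new Series, so assign it back.
data["户型"] = data["户型"].str.replace("房间", "室")

# value_counts does the per-layout counting without an explicit loop.
layout_counts = data["户型"].value_counts()

print(data)
print(layout_counts)
```

value_counts returns the counts already sorted from most to least frequent, which fits the bar chart drawn above.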