Getting Started with Machine Learning (1): The pandas Library

湘萌Matsuko

Category: Machine Learning (1). Tags: pandas

Article link: https://blog.csdn.net/qq_33905679/article/details/94560918

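Note to self: this material left me dizzy and I honestly did not follow all of it. I will come back to it when I have time QAQ (la la la, I'm not listening).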

3. Worked examples with pandas

Related file: titanic_survival.csv
Script: pandas_3.py
Summary: read a CSV file, compute means, sums and similar statistics, and use pandas pivot tables.

import pandas as pd
import numpy as np
titanic_survival = pd.read_csv("titanic_train.csv")
print(titanic_survival.head())

import loads a package.
pandas.read_csv() reads a CSV file into a DataFrame.
titanic_survival.head() shows the first five rows by default; titanic_survival.head(1000) shows the first 1000 rows.

#The Pandas library uses NaN, which stands for "not a number", to indicate a missing value.
#we can use the pandas.isnull() function which takes a pandas series and returns a series of True and False values
age = titanic_survival["Age"]
print(age.loc[0:10])
age_is_null = pd.isnull(age)#True where the value is NaN
age_null_true = age[age_is_null]
age_null_count = len(age_null_true)
print(age_null_count)
print(age_null_true) #exercise: how would you print the non-null values?

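age.loc[] selects by index label; a single number picks one row and a slice picks several. A label slice such as loc[0:10] includes both endpoints, so with the default 0-based index it returns 11 rows.
pandas.isnull() tests each value for NaN and returns a Series of True/False.
Example: age_null_true = age[age_is_null] keeps the entries where age_is_null is True.
len() returns the number of elements.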
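The exercise left in the code comment (how to print the non-null values) can be sketched as follows. The short age Series here is made-up data standing in for titanic_survival["Age"]; the names age_not_null and age_not_null_alt are introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for titanic_survival["Age"]
age = pd.Series([22.0, np.nan, 38.0, np.nan, 35.0])

age_is_null = pd.isnull(age)             # True where the value is NaN
age_not_null = age[~age_is_null]         # "~" inverts the mask, keeping non-null values
age_not_null_alt = age[pd.notnull(age)]  # same selection via pd.notnull()

print(len(age[age_is_null]))  # number of missing ages
print(age_not_null)
```

Inverting the existing mask reuses age_is_null, while pd.notnull() builds the opposite mask directly; both select the same rows.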

#The result of this is that mean_age would be nan. This is because any calculations we do with a null value also result in a null value
mean_age = sum(titanic_survival["Age"]) / len(titanic_survival["Age"])
print(mean_age)

The built-in sum() adds up all the values; a single NaN makes the whole total NaN.

#we have to filter out the missing values before we calculate the mean.
good_ages = titanic_survival["Age"][age_is_null == False]
correct_mean_age = sum(good_ages) / len(good_ages)
print(correct_mean_age)

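This code computes the mean age with the NaN values removed.
Example: titanic_survival["Age"][age_is_null == False] does two things in sequence: it selects the Age column, then keeps only the rows where age_is_null is False.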
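As a small sketch of the same idea, the three ways of computing a mean can be compared side by side on made-up data. The Series below is hypothetical; the point is that Series.mean() skips NaN by default, so the manual filtering is not strictly required.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical ages with one missing value
age = pd.Series([20.0, np.nan, 40.0])

naive_mean = sum(age) / len(age)  # NaN propagates through the built-in sum()

good_ages = age[age.notnull()]    # drop the missing values first
manual_mean = sum(good_ages) / len(good_ages)

pandas_mean = age.mean()          # Series.mean() ignores NaN by default

print(naive_mean, manual_mean, pandas_mean)
```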

#mean fare for each class
passenger_classes = [1,2,3]
fare_by_class = {}
for this_class in passenger_classes:
    #print(this_class)
    pclass_rows = titanic_survival[titanic_survival["Pclass"] == this_class]
    #print(pclass_rows)
    pclass_fares = pclass_rows["Fare"]
    fare_for_class = pclass_fares.mean()#mean fare for this class
    fare_by_class[this_class] = fare_for_class#store it in fare_by_class, which is a dict (not a tuple)
print(fare_by_class)

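This code computes the mean fare for each passenger class (1, 2, 3). The loop is the key part and worth studying closely.
pclass_fares.mean() returns the mean of the Series.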
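For comparison, here is a minimal sketch showing that the loop gives the same numbers as a groupby. The five-row DataFrame is invented for the example and is not the real Titanic data; fare_by_class_grouped is a name used only here.

```python
import pandas as pd

# Hypothetical miniature of the Titanic table
titanic = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3],
    "Fare": [80.0, 60.0, 20.0, 8.0, 10.0],
})

# The loop from the article: filter rows per class, then average the fares
fare_by_class = {}
for this_class in [1, 2, 3]:
    pclass_rows = titanic[titanic["Pclass"] == this_class]
    fare_by_class[this_class] = pclass_rows["Fare"].mean()

# The same computation expressed as a groupby
fare_by_class_grouped = titanic.groupby("Pclass")["Fare"].mean()

print(fare_by_class)
print(fare_by_class_grouped.to_dict())
```

The loop makes each step visible, which is useful while learning; groupby does the filtering and averaging in one call.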

#index tells the method which column to group by
#values is the column that we want to apply the calculation to
#aggfunc specifies the calculation we want to perform
passenger_survival = titanic_survival.pivot_table(index = "Pclass",values = "Survived",aggfunc = np.mean)
#pivot_table builds a pivot table: rows are grouped by Pclass, and np.mean of the 0/1 Survived column gives the survival rate for each class
print(passenger_survival)
print("---------------------------------------------------")

passenger_age = titanic_survival.pivot_table(index="Pclass", values="Age")
#with no aggfunc given, pivot_table computes the mean by default
print(passenger_age)

port_stats = titanic_survival.pivot_table(index="Embarked", values=["Fare","Survived"], aggfunc=np.sum)
print(port_stats)

#specifying axis=1 or axis='columns' will drop any columns that have null values
drop_na_columns = titanic_survival.dropna(axis=1)
new_titanic_survival = titanic_survival.dropna(axis=0,subset=["Age", "Sex"])#axis=0 drops rows; subset limits the check to missing values in Age and Sex
#print(new_titanic_survival)

#locate one data
row_index_83_age = titanic_survival.loc[83,"Age"]#Age value of the row labelled 83
row_index_766_pclass = titanic_survival.loc[766,"Pclass"]#Pclass value of the row labelled 766
print(row_index_83_age)
print(row_index_766_pclass)

Pivot tables are worth going over several times.
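To make the pivot table and dropna calls easier to check by hand, here is a sketch on a six-row table. The data is made up, and aggfunc is passed as the string "mean" here as an equivalent spelling of the mean aggregation.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Titanic table
titanic = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0],
    "Age": [30.0, np.nan, 25.0, 40.0, np.nan, 20.0],
})

# Mean of the 0/1 Survived column per class = survival rate per class
survival_rate = titanic.pivot_table(index="Pclass", values="Survived", aggfunc="mean")
print(survival_rate)

# axis=1 drops every column that contains a missing value (here: Age)
no_null_columns = titanic.dropna(axis=1)
print(no_null_columns.columns.tolist())

# axis=0 with subset drops only the rows where Age is missing
rows_with_age = titanic.dropna(axis=0, subset=["Age"])
print(len(rows_with_age))
```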

new_titanic_survival = titanic_survival.sort_values("Age",ascending=False)#ascending=False sorts from oldest to youngest
print(new_titanic_survival[0:10])
titanic_reindexed = new_titanic_survival.reset_index(drop=True)#reset_index(drop=True) discards the old index and renumbers the rows from 0
print('-----------------------')
print(titanic_reindexed.loc[0:10])

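sort_values() sorts the rows by a column. ascending=True (the default) sorts in ascending order; ascending=False sorts in descending order.
reset_index() replaces the index with a fresh one starting at 0. With drop=True the old index is discarded; without it, the old index is kept as a new column.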
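A short sketch of why the index needs resetting after a sort: the original labels travel with their rows. The three-row table and the variable names are invented for this illustration.

```python
import pandas as pd

# Hypothetical three-passenger table
ages = pd.DataFrame({"Name": ["A", "B", "C"], "Age": [22.0, 71.0, 35.0]})

by_age = ages.sort_values("Age", ascending=False)
print(by_age.index.tolist())        # old labels are still attached to the rows

renumbered = by_age.reset_index(drop=True)  # old labels thrown away
kept = by_age.reset_index()                 # old labels saved in a column named "index"

print(by_age.loc[0, "Name"])        # label 0 is still passenger A
print(renumbered.loc[0, "Name"])    # label 0 is now the oldest passenger
```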

def hundredth_row(column):
    #Extract the hundredth item, i.e. the value at index label 99
    hundredth_item = column.loc[99]
    return hundredth_item
#Return the hundredth item from each column
#(stored under a new name so the function itself is not overwritten)
hundredth_row_values = titanic_survival.apply(hundredth_row)
print(hundredth_row_values)

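This defines a custom function hundredth_row() that pulls the hundredth value out of each column.
One question I had: can the function really be passed to .apply() by name, without parentheses? Yes. apply() receives the function object itself and calls it once per column; writing hundredth_row() would instead call it immediately, with no argument.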
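To illustrate passing a function by name, the sketch below does by hand roughly what apply() does for a column-wise call. The small DataFrame and the helper second_row are hypothetical, and the manual version is only an approximation of the idea, not pandas' actual implementation.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

def second_row(column):
    # Return the value at index label 1 of one column
    return column.loc[1]

# Pass the function object itself; pandas calls it once for each column
result = df.apply(second_row)

# Roughly the same thing written out by hand
manual = pd.Series({name: second_row(df[name]) for name in df.columns})

print(result)
print(manual)
```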

def null_count(column):
    #count how many values in this column are null
    column_null = pd.isnull(column)
    null = column[column_null]
    return len(null)

column_null_count = titanic_survival.apply(null_count)
print(column_null_count)#number of missing values in each column

This counts the missing values column by column; the output lists the number of missing values for every column.
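The same per-column count can also be written without a custom function. This is a sketch on invented data: isnull() marks each missing cell as True, and sum() counts the True values in each column.

```python
import numpy as np
import pandas as pd

# Hypothetical table with a few missing cells
df = pd.DataFrame({
    "Age": [22.0, np.nan, np.nan],
    "Sex": ["m", "f", None],
    "Fare": [7.0, 8.0, 9.0],
})

null_counts = df.isnull().sum()  # True counts as 1, so the sum is the number of nulls
print(null_counts)
```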

def is_minor(row):
    if row["Age"] < 18:
        return True
    else:
        return False

minors = titanic_survival.apply(is_minor, axis=1)
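#print(minors)

This one is straightforward: with axis=1, apply() calls is_minor once per row. A missing Age compares as False, so rows with no age are not flagged as minors.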

def generate_age_label(row):
    age = row["Age"]
    if pd.isnull(age):
        return "unknown"
    elif age < 18:
        return "minor"
    else:
        return "adult"

age_labels = titanic_survival.apply(generate_age_label, axis=1)
print(age_labels)

The output is a Series of row labels paired with "unknown", "minor" or "adult", i.e. a classification by age.
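As a sketch of an alternative, the same labels can be built without a row-by-row apply by using numpy.select on whole columns. The four-row table is made up, and labels_apply and labels_vectorized are names introduced only for this comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including one missing value and the boundary value 18
titanic = pd.DataFrame({"Age": [10.0, np.nan, 30.0, 18.0]})

def generate_age_label(row):
    age = row["Age"]
    if pd.isnull(age):
        return "unknown"
    elif age < 18:
        return "minor"
    else:
        return "adult"

labels_apply = titanic.apply(generate_age_label, axis=1)

# Conditions are checked in order; the first one that is True picks the label
age = titanic["Age"]
conditions = [age.isnull().to_numpy(), (age < 18).to_numpy()]
labels_vectorized = pd.Series(
    np.select(conditions, ["unknown", "minor"], default="adult"),
    index=titanic.index,
)

print(labels_apply.tolist())
print(labels_vectorized.tolist())
```

The apply version reads like ordinary Python, which makes it easier to follow; the column-wise version avoids calling a Python function for every row.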

titanic_survival['age_labels'] = age_labels
age_group_survival = titanic_survival.pivot_table(index="age_labels", values="Survived")
print(age_group_survival)

Another pivot table: the index is the age_labels column, and the result is the mean survival rate for each age group.

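4. The Series structure

Related file: fandango_score_comparison.csv (film ratings)
Script: pandas_4.py
Summary: working with data in a Series

So far, read_csv() has given us a DataFrame. Now we look at how to take that structure apart: a single row or a single column of a DataFrame is a Series.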

import pandas as pd
import numpy as np
from pandas import Series
fandango = pd.read_csv('fandango_score_comparison.csv')
series_film = fandango['FILM']
print(type(series_film))#print:<class 'pandas.core.series.Series'>
print(series_film[0:5])
series_rt = fandango['RottenTomatoes']
print(series_rt[0:5])

Selecting a single column gives series_film as a Series; its type prints as <class 'pandas.core.series.Series'>.
Its contents can be printed with the same slicing used earlier.

film_names = series_film.values#all the values as an array
print(type(film_names))#print:<class 'numpy.ndarray'>, so the values of the Series are held in an ndarray
#print(film_names)
rt_scores = series_rt.values
#print(rt_scores)
series_custom = Series(rt_scores , index = film_names)
series_custom[['Minions (2015)', 'Leviathan (2014)']]
fiveten = series_custom[5:10]#integer slices are positional here; string labels also work, as on the line above
print(fiveten)

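series_film.values (an attribute, not a method call) collects every value of the Series; its type prints as <class 'numpy.ndarray'>, and printing it shows the values one after another.
One of NumPy's most important features is its N-dimensional array object, ndarray: a collection of items of the same type, indexed from 0. An ndarray stores elements of one type in a multidimensional array, and every element occupies the same amount of memory. Quoted from: ndarray explanation, runboom.com
Series(values, index) builds a Series from an array of values and an array of index labels. I still have questions about this.
rt_scores = series_rt.values takes the values of series_rt as an array.
'Minions (2015)' and 'Leviathan (2014)' are not defined anywhere in the code, so where do they come from? They are meant to be film titles from the FILM column, which became the index labels of series_custom.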
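A minimal sketch of how Series(values, index) turns film titles into labels. The three films and scores below are invented placeholders, not rows from fandango_score_comparison.csv.

```python
import pandas as pd

# Hypothetical three-row stand-in for the fandango table
fandango = pd.DataFrame({
    "FILM": ["Film A (2015)", "Film B (2014)", "Film C (2015)"],
    "RottenTomatoes": [74, 85, 54],
})

film_names = fandango["FILM"].values
rt_scores = fandango["RottenTomatoes"].values

# Scores become the values, film titles become the index labels
series_custom = pd.Series(rt_scores, index=film_names)

by_label = series_custom[["Film C (2015)", "Film A (2015)"]]  # select by title
by_position = series_custom.iloc[0:2]                         # select by position

print(by_label)
print(by_position)
```

Any title used as a label has to exist in the FILM column; a title that is not in the index cannot be looked up.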

original_index = series_custom.index.tolist()
#print(original_index)
sorted_index = sorted(original_index)
#reorder the Series by the sorted index labels
sorted_by_index = series_custom.reindex(sorted_index)#reindex returns a new Series in the given label order
#print(sorted_by_index)

reindex() is an important method on pandas objects: it creates a new object that conforms to a new index. Reference: "ReIndex重新索引" (ReIndex: re-indexing), by yungeisme
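The reindex step can be sketched on a tiny Series. The labels and numbers are made up; sort_index() is shown only as a shorter way to get the same ordering, and the last line shows what reindex() does with a label that does not exist.

```python
import pandas as pd

s = pd.Series([74, 85, 54], index=["c", "a", "b"])

# The article's approach: sort the labels, then reindex in that order
sorted_index = sorted(s.index.tolist())
sorted_by_index = s.reindex(sorted_index)

# Same ordering in one call
same = s.sort_index()

# Labels missing from the original index are filled with NaN
with_missing = s.reindex(["a", "z"])

print(sorted_by_index)
print(with_missing)
```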

#The values in a Series object are treated as an ndarray, the core data type in NumPy
# Add the two Series element by element
print(np.add(series_custom, series_custom))
# Apply sine function to each value
np.sin(series_custom)
# Return the highest value (will return a single value not a Series)
np.max(series_custom)

Two Series with the same index labels and numeric values can be added element by element.
numpy.sin() and numpy.max() are functions provided by the NumPy library.

#will actually return a Series object with a boolean value for each film
series_custom > 50
series_greater_than_50 = series_custom[series_custom > 50]

criteria_one = series_custom > 50
criteria_two = series_custom < 75
both_criteria = series_custom[criteria_one & criteria_two]
print(both_criteria)

The two conditions are combined with &, so only the values that satisfy both are kept.
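A small sketch of combining boolean masks, using made-up scores. When conditions are written inline, each comparison needs its own parentheses because & and | bind more tightly than > and <.

```python
import pandas as pd

scores = pd.Series([40, 60, 80], index=["x", "y", "z"])

# & keeps values where both conditions hold
both = scores[(scores > 50) & (scores < 75)]

# | keeps values where at least one condition holds
either = scores[(scores <= 50) | (scores >= 75)]

print(both)
print(either)
```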

#data alignment same index
rt_critics = Series(fandango['RottenTomatoes'].values, index=fandango['FILM'])
rt_users = Series(fandango['RottenTomatoes_User'].values, index=fandango['FILM'])
rt_mean = (rt_critics + rt_users)/2

print(rt_mean)

The two Series share the same index, so they can be added label by label and then averaged.
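To show what "aligned by label" means, here is a sketch in which the two Series are in a different order and one has an extra film. The names and numbers are invented.

```python
import pandas as pd

critics = pd.Series([80, 60], index=["Film A", "Film B"])
users = pd.Series([70, 90, 50], index=["Film B", "Film A", "Film C"])

# Values are matched by label, not by position
mean_score = (critics + users) / 2

print(mean_score)
```

"Film C" has a user score but no critic score, so its result is NaN rather than an error.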

Related file: fandango_score_comparison.csv (film ratings)
Script: pandas_5.py
Summary: more data handling with pandas

import pandas as pd
import numpy as np
#will return a new DataFrame that is indexed by the values in the specified column
#and will drop that column from the DataFrame
#without the FILM column dropped
fandango = pd.read_csv('fandango_score_comparison.csv')
print(type(fandango))
fandango_films = fandango.set_index('FILM', drop=False)#use FILM as the index; drop=False keeps FILM as a regular column too
print(fandango_films.index)

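set_index() makes FILM the index. The drop parameter controls whether the FILM column is removed from the columns: the default drop=True removes it, while drop=False, as used here, keeps it.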
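The effect of drop is easiest to see by comparing the column lists. This is a sketch on a two-row, made-up table.

```python
import pandas as pd

fandango = pd.DataFrame({"FILM": ["Film A", "Film B"], "RottenTomatoes": [74, 85]})

kept = fandango.set_index("FILM", drop=False)  # FILM is both index and column
dropped = fandango.set_index("FILM")           # default drop=True: FILM is only the index

print(kept.columns.tolist())
print(dropped.columns.tolist())
```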

# Slice using either bracket notation or loc[]
fandango_films["Avengers: Age of Ultron (2015)":"Hot Tub Time Machine 2 (2015)"]
fandango_films.loc["Avengers: Age of Ultron (2015)":"Hot Tub Time Machine 2 (2015)"]#numeric labels can be sliced, and so can string labels, much like dictionary keys
# Specific movie
fandango_films.loc['Kumiko, The Treasure Hunter (2015)']

# Selecting list of movies
movies = ['Kumiko, The Treasure Hunter (2015)', 'Do You Believe? (2015)', 'Ant-Man (2015)']
print(fandango_films.loc[movies])#integer positions still work as well, through iloc

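String index labels can be used much like dictionary keys.
Even with string labels as the index, rows can still be selected by integer position, for example with iloc.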
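A short sketch contrasting label-based and position-based selection on a string index. The four-row table is invented; note that a label slice includes its end label while a position slice excludes its end position.

```python
import pandas as pd

films = pd.DataFrame({"score": [1, 2, 3, 4]}, index=["a", "b", "c", "d"])

by_label = films.loc["b":"c"]    # both "b" and "c" are included
by_position = films.iloc[1:3]    # positions 1 and 2; position 3 is excluded
picked = films.loc[["d", "a"]]   # a list of labels, returned in the order given

print(by_label)
print(picked)
```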

#The apply() method in Pandas allows us to specify Python logic
#The apply() method requires you to pass in a vectorized operation
#that can be applied over each Series object.
# returns the data types as a Series
types = fandango_films.dtypes
#print(types)
# filter data types to just floats, index attributes returns just column names
float_columns = types[types.values == 'float64'].index#not a type conversion: keeps only the names of the float64 columns
# use bracket notation to filter columns to just float columns
float_df = fandango_films[float_columns]
#print(float_df)

Despite my first guess, this is not a type conversion: it filters the columns by data type.
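The dtype filter can be tried on a small table. The data is made up; select_dtypes() is shown as another way to pick the float columns, under the assumption that only float64 columns are wanted.

```python
import pandas as pd

df = pd.DataFrame({"FILM": ["A", "B"], "votes": [10, 20], "stars": [4.5, 3.0]})

# The article's approach: compare each column's dtype with 'float64'
types = df.dtypes
float_columns = types[types.values == "float64"].index
float_df = df[float_columns]

# The same selection with select_dtypes
float_df_alt = df.select_dtypes(include="float64")

print(float_columns.tolist())
print(float_df_alt.columns.tolist())
```

The values themselves are untouched in both versions; only the set of columns changes.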

rt_mt_user = float_df[['RT_user_norm', 'Metacritic_user_nom']]#select the two columns
rt_mt_user.apply(lambda x: np.std(x), axis=1)#axis=1 applies the function to each row, giving the standard deviation of the two scores for each film
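To see what axis changes, this sketch applies the same lambda both ways on a two-row table. The scores are invented; only the column names are taken from the code above.

```python
import numpy as np
import pandas as pd

rt_mt_user = pd.DataFrame({
    "RT_user_norm": [4.0, 3.0],
    "Metacritic_user_nom": [2.0, 3.0],
})

# axis=1: the lambda receives one row at a time, so there is one result per film
per_row = rt_mt_user.apply(lambda x: np.std(x), axis=1)

# axis=0 (the default): the lambda receives one column at a time
per_column = rt_mt_user.apply(lambda x: np.std(x), axis=0)

print(per_row)
print(per_column)
```

The row-wise result is indexed by the row labels and the column-wise result by the column names. The second film has identical scores in both columns, so its row-wise standard deviation is 0.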
