Introduction to Machine Learning (1): The pandas Library

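Note to self: this one left me dizzy and I honestly did not follow all of it. I will come back to it when I have time, QAQ (covering my ears, not listening).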

3. pandas in practice

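Related file: titanic_survival.csv (the code below loads it as titanic_train.csv)
File name: pandas_3.py
Summary: read a CSV file, compute means and sums, and use pandas pivot tables.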

import pandas as pd
import numpy as np
titanic_survival = pd.read_csv("titanic_train.csv")
print(titanic_survival.head())
  • import loads a package.
  • pandas.read_csv() reads a CSV file into a DataFrame.
  • titanic_survival.head() shows the first five rows by default; titanic_survival.head(1000) shows the first 1000.
#The Pandas library uses NaN, which stands for "not a number", to indicate a missing value.
#we can use the pandas.isnull() function which takes a pandas series and returns a series of True and False values
age = titanic_survival["Age"]
print(age.loc[0:10])
age_is_null = pd.isnull(age)  # True where the value is NaN
age_null_true = age[age_is_null]
age_null_count = len(age_null_true)
print(age_null_count)
print(age_null_true)  # exercise: how would you print the non-null values?
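  • age.loc[...] selects by index label: pass a single label for one row or a slice for several. A label slice such as loc[0:10] includes both endpoints.
  • pandas.isnull() tests each value for null and returns a Series of True/False.
  • Example: age_null_true = age[age_is_null] keeps only the values where age_is_null is True.
  • len() returns the number of items.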
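As a small sketch of the masking steps above, here is the same idea on a made-up Series (the ages are hypothetical, not taken from the Titanic file). It also shows one possible answer to the "print the non-null values" exercise: invert the mask with ~.

```python
import numpy as np
import pandas as pd

# Hypothetical ages; np.nan marks a missing value.
age = pd.Series([22.0, np.nan, 38.0, np.nan, 35.0])

first_three = age.loc[0:2]               # label slice: rows 0, 1 and 2 (end included)
age_is_null = pd.isnull(age)             # True where the value is NaN
age_null_count = int(age_is_null.sum())  # each True counts as 1
known_ages = age[~age_is_null]           # ~ flips the mask, keeping non-null values

print(age_null_count)
print(known_ages)
```

pd.notnull(age) would give the inverted mask directly.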
#The result of this is that mean_age would be nan. This is because any calculations we do with a null value also result in a null value
mean_age = sum(titanic_survival["Age"]) / len(titanic_survival["Age"])
print(mean_age)
  • sum() adds up all the values. Because the Age column contains NaN, the total is NaN, and so is mean_age.
#we have to filter out the missing values before we calculate the mean.
good_ages = titanic_survival["Age"][age_is_null == False]
correct_mean_age = sum(good_ages) / len(good_ages)
print(correct_mean_age)
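  • This code computes the mean age with the NaN values removed.
  • Example: titanic_survival["Age"][age_is_null == False] indexes twice: ["Age"] selects the column, then the boolean mask keeps only the rows where age_is_null is False.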
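To see the NaN problem in isolation, here is a minimal sketch on three made-up ages. It compares the naive mean, the filtered mean, and Series.mean(), which skips NaN by default.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical ages with one missing value.
age = pd.Series([20.0, np.nan, 40.0])

naive_mean = sum(age) / len(age)               # NaN spreads through the sum
good_ages = age[age.notnull()]                 # keep only the known ages
manual_mean = sum(good_ages) / len(good_ages)  # mean of the two known ages
builtin_mean = age.mean()                      # skips NaN by default

print(naive_mean, manual_mean, builtin_mean)
```

In practice Series.mean() is usually the shorter route; the manual version is useful mainly for seeing what it does.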
#mean fare for each class
passenger_classes = [1,2,3]
fare_by_class = {}
for this_class in passenger_classes:
    #print(this_class)
    pclass_rows = titanic_survival[titanic_survival["Pclass"] == this_class]
    #print(pclass_rows)
    pclass_fares = pclass_rows["Fare"]
    fare_for_class = pclass_fares.mean()  # mean fare for this class
    fare_by_class[this_class] = fare_for_class  # store it in fare_by_class, which is a dict (not a tuple)
print(fare_by_class)
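  • This code computes the mean fare for each passenger class (1, 2, 3). The loop is the key part and is worth working through.
  • pclass_fares.mean() returns the mean of the Series.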
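The loop can be tried on a tiny made-up table (the fares below are hypothetical). The sketch also computes the same numbers with groupby, which is not used in the original notes but is a common shortcut for this kind of per-group mean.

```python
import pandas as pd

# Hypothetical passengers: class and fare only.
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3],
    "Fare": [80.0, 100.0, 20.0, 8.0, 12.0],
})

# The loop from the notes: filter the rows of one class, then average.
fare_by_class = {}
for this_class in [1, 2, 3]:
    pclass_rows = df[df["Pclass"] == this_class]
    fare_by_class[this_class] = pclass_rows["Fare"].mean()

# The same result in one step.
grouped = df.groupby("Pclass")["Fare"].mean()

print(fare_by_class)
print(grouped)
```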
#index tells the method which column to group by
#values is the column that we want to apply the calculation to
#aggfunc specifies the calculation we want to perform
passenger_survival = titanic_survival.pivot_table(index = "Pclass",values = "Survived",aggfunc = np.mean)
#pivot_table builds a pivot table: rows are grouped by Pclass, and np.mean over the 0/1 Survived column gives the survival rate for each class
print(passenger_survival)
print("---------------------------------------------------")

passenger_age = titanic_survival.pivot_table(index="Pclass", values="Age")
#with no aggfunc given, pivot_table computes the mean
print(passenger_age)

port_stats = titanic_survival.pivot_table(index="Embarked", values=["Fare","Survived"], aggfunc=np.sum)
print(port_stats)

#specifying axis=1 or axis='columns' will drop any columns that have null values
drop_na_columns = titanic_survival.dropna(axis=1)
new_titanic_survival = titanic_survival.dropna(axis=0,subset=["Age", "Sex"])  # axis=0 drops rows; subset limits the null check to Age and Sex
#print(new_titanic_survival)
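A minimal sketch of the two dropna calls on a made-up frame (the column values are hypothetical), to make the difference between axis=1 and axis=0 with subset visible:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: Age and Cabin contain missing values, Sex does not.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0],
    "Sex": ["male", "female", "female"],
    "Cabin": [np.nan, "C85", np.nan],
})

no_null_columns = df.dropna(axis=1)                    # drop columns containing NaN
rows_with_age_and_sex = df.dropna(axis=0, subset=["Age", "Sex"])  # drop rows missing Age or Sex

print(no_null_columns.columns.tolist())
print(rows_with_age_and_sex.index.tolist())
```

Note that the second call ignores the NaN values in Cabin, because Cabin is not listed in subset.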

#locate one data
row_index_83_age = titanic_survival.loc[83,"Age"]  # Age value in the row labelled 83
row_index_766_pclass = titanic_survival.loc[766,"Pclass"]
print(row_index_83_age)
print(row_index_766_pclass)
  • Pivot tables are worth reviewing several times.
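Since pivot tables take some getting used to, here is a small sketch on made-up data where the results can be checked by hand. The string names "mean" and "sum" are used for aggfunc in place of np.mean and np.sum; that substitution is my choice, and pandas accepts either form.

```python
import pandas as pd

# Hypothetical passengers, two per class.
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0],
    "Fare": [80.0, 100.0, 20.0, 30.0, 8.0, 12.0],
})

# index: column to group by; values: column(s) to aggregate; aggfunc: the calculation.
survival_rate = df.pivot_table(index="Pclass", values="Survived", aggfunc="mean")
totals = df.pivot_table(index="Pclass", values=["Fare", "Survived"], aggfunc="sum")

print(survival_rate)
print(totals)
```

Because Survived holds only 0 and 1, its mean per class is the share of passengers in that class who survived.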
new_titanic_survival = titanic_survival.sort_values("Age",ascending=False)  # ascending=False sorts from oldest to youngest
print(new_titanic_survival[0:10])
titanic_reindexed = new_titanic_survival.reset_index(drop=True)  # discard the old labels and renumber the rows from 0
print('-----------------------')
print(titanic_reindexed.loc[0:10])
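  • sort_values() sorts the rows by a column. ascending=True (the default) sorts in ascending order; ascending=False sorts in descending order.
  • reset_index() replaces the index with a fresh 0, 1, 2, ... sequence. drop=True discards the old index instead of keeping it as a new column.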
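A short sketch of sorting and re-indexing on made-up rows (names and ages are hypothetical). It shows that the original labels travel with the rows after sort_values, and that rows with a missing Age are placed last by default.

```python
import numpy as np
import pandas as pd

# Hypothetical passengers; one age is missing.
df = pd.DataFrame({
    "Name": ["a", "b", "c", "d"],
    "Age": [30.0, np.nan, 71.0, 4.0],
})

oldest_first = df.sort_values("Age", ascending=False)  # labels stay attached to their rows
reindexed = oldest_first.reset_index(drop=True)        # relabel the rows 0, 1, 2, ...

print(oldest_first)
print(reindexed)
```

After the sort, loc[0] still returns the row originally labelled 0; only after reset_index does loc[0] return the oldest passenger.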
def hundredth_row(column):
    #Extract the hundredth item (the row labelled 99) from the column
    hundredth_item = column.loc[99]
    return hundredth_item
#Return the hundredth item from each column
hundredth_row_values = titanic_survival.apply(hundredth_row)
print(hundredth_row_values)
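  • This defines a function hundredth_row() that pulls the hundredth value out of each column.
  • Open question: can apply() really be given the function name without parentheses? Yes. apply() takes the function object itself and calls it once per column; writing hundredth_row() would call it immediately, with no argument.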
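A minimal sketch of handing a function to apply(), on a made-up two-column frame. The function name second_item is hypothetical and just stands in for hundredth_row on a smaller table.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

def second_item(column):
    # apply() passes in one column at a time, as a Series.
    return column.loc[1]

# No parentheses after second_item: apply() does the calling.
second_row = df.apply(second_item)

print(second_row)
```

The result is a Series with one entry per column, labelled by column name.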
def null_count(column):
    #count how many values in the column are null
    column_null = pd.isnull(column)
    null = column[column_null]
    return len(null)

column_null_count = titanic_survival.apply(null_count)
print(column_null_count)  # number of missing values in each column
  • This counts the missing values column by column; the output is one count per column, covering every column.
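The per-column count can be checked on a made-up frame. As a sketch, the version below also uses isnull().sum(), which is not in the original notes but should give the same counts without a custom function.

```python
import numpy as np
import pandas as pd

# Hypothetical data: Age has two gaps, Fare has one, Pclass has none.
df = pd.DataFrame({
    "Age": [22.0, np.nan, np.nan],
    "Fare": [7.25, 71.3, np.nan],
    "Pclass": [3, 1, 2],
})

def null_count(column):
    # Keep only the null entries of the column and count them.
    return len(column[pd.isnull(column)])

by_apply = df.apply(null_count)
by_sum = df.isnull().sum()  # True counts as 1 when summed

print(by_apply)
print(by_sum)
```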
def is_minor(row):
    if row["Age"] < 18:
        return True
    else:
        return False

minors = titanic_survival.apply(is_minor, axis=1)
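#print(minors)
  • This one is self-explanatory. Note that a missing Age (NaN) is not less than 18, so those rows come out as False.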
def generate_age_label(row):
    age = row["Age"]
    if pd.isnull(age):
        return "unknown"
    elif age < 18:
        return "minor"
    else:
        return "adult"

age_labels = titanic_survival.apply(generate_age_label, axis=1)
print(age_labels)
  • The output is a Series of row labels paired with "unknown", "minor", or "adult": a classification by age.
titanic_survival['age_labels'] = age_labels
age_group_survival = titanic_survival.pivot_table(index="age_labels", values="Survived")
print(age_group_survival)
  • Another pivot table: with age_labels as the index, it gives the mean survival rate for each age group.
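Here is the row-wise labelling and the pivot table together, as a sketch on five made-up passengers, small enough to verify by hand. It also shows why the NaN check has to come first: comparing NaN with 18 is simply False.

```python
import numpy as np
import pandas as pd

# Hypothetical passengers.
df = pd.DataFrame({
    "Age": [10.0, 40.0, np.nan, 16.0, 30.0],
    "Survived": [1, 0, 0, 1, 1],
})

nan_is_minor = bool(np.nan < 18)  # False: NaN fails every comparison

def generate_age_label(row):
    age = row["Age"]
    if pd.isnull(age):
        return "unknown"
    return "minor" if age < 18 else "adult"

# axis=1 passes one row at a time to the function.
df["age_labels"] = df.apply(generate_age_label, axis=1)
age_group_survival = df.pivot_table(index="age_labels", values="Survived")

print(age_group_survival)
```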

4. The Series structure

Related file: fandango_score_comparison.csv (film scores)
File name: pandas_4.py
Summary: working with data in a Series

Earlier, read_csv gave us a DataFrame. Now we look at how that structure breaks down: a single row or a single column of a DataFrame is a Series.

import pandas as pd
import numpy as np
from pandas import Series
fandango = pd.read_csv('fandango_score_comparison.csv')
series_film = fandango['FILM']
print(type(series_film))  # prints: <class 'pandas.core.series.Series'>
print(series_film[0:5])
series_rt = fandango['RottenTomatoes']
print(series_rt[0:5])
  • Selecting a single column, as with series_film, returns a Series; its type prints as <class 'pandas.core.series.Series'>.
  • The values of series_film can be printed with the same slicing used earlier.
film_names = series_film.values  # all the values, as an array
print(type(film_names))  # prints: <class 'numpy.ndarray'>, so the values of a Series are held in an ndarray
#print(film_names)
rt_scores = series_rt.values
#print(rt_scores)
series_custom = Series(rt_scores , index = film_names)
series_custom[['Minions (2015)', 'Leviathan (2014)']]
fiveten = series_custom[5:10]  # an integer slice is positional; string labels also work, as on the previous line
print(fiveten)
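  • series.values (an attribute, not a method) collects every value of the Series; its type prints as <class 'numpy.ndarray'>, and printing it shows the values as one contiguous array.
  • The most important feature of NumPy is its N-dimensional array object, ndarray: a collection of items of the same type, indexed from 0. An ndarray is a multidimensional array of same-typed elements, and each element occupies the same amount of memory. Quoted from: ndarray explanation, runboom.com
  • Series(values, index=labels) builds a Series from an array of values and a matching array of labels. I still have questions about this.
  • rt_scores = series_rt.values stores the values of series_rt in rt_scores.
  • 'Minions (2015)' and 'Leviathan (2014)' do not appear anywhere else in the code, so where do they come from? They are index labels: series_custom is indexed by film_names, so they would have to be titles from the FILM column.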
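To make the Series(values, index=...) question more concrete, here is a sketch with made-up film names and scores (none of them come from the Fandango file). It builds a Series from two arrays and then selects by label and by position.

```python
import pandas as pd

# Hypothetical scores, labelled by hypothetical film names.
scores = pd.Series([74, 85, 80], index=["Film A", "Film B", "Film C"])

values = scores.values                   # attribute, no parentheses: a NumPy array
by_label = scores[["Film A", "Film C"]]  # a list of labels selects those entries
by_position = scores.iloc[0:2]           # iloc selects by integer position

print(type(values))
print(by_label)
print(by_position)
```

The labels passed in the list must exist in the index, which is why film titles taken from the data work.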
original_index = series_custom.index.tolist()
#print(original_index)
sorted_index = sorted(original_index)
#sort the Series by its index labels
sorted_by_index = series_custom.reindex(sorted_index)  # reindex reorders the entries to follow sorted_index
#print(sorted_by_index)
#The values in a Series object are treated as an ndarray, the core data type in NumPy
# Add two Series element by element
print(np.add(series_custom, series_custom))
# Apply sine function to each value
np.sin(series_custom)
# Return the highest value (will return a single value not a Series)
np.max(series_custom)

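  • Two Series with the same length, index, and data type can be added element by element.
  • numpy.sin() and numpy.max() are functions built into NumPy; they accept a Series directly.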
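A small sketch of reindexing and of NumPy functions applied to a Series, using made-up labels and scores:

```python
import numpy as np
import pandas as pd

# Hypothetical scores with labels in no particular order.
scores = pd.Series([54, 71, 99], index=["Film M", "Film A", "Film L"])

sorted_index = sorted(scores.index.tolist())   # labels in alphabetical order
sorted_by_index = scores.reindex(sorted_index)  # values follow their labels

doubled = np.add(scores, scores)  # element-wise, result is still a Series
highest = np.max(scores)          # a single value, not a Series

print(sorted_by_index)
print(doubled)
print(highest)
```

scores.sort_index() would sort by label in one call; the reindex route above mirrors the code in these notes.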
#will actually return a Series object with a boolean value for each film
series_custom > 50
series_greater_than_50 = series_custom[series_custom > 50]

criteria_one = series_custom > 50
criteria_two = series_custom < 75
both_criteria = series_custom[criteria_one & criteria_two]
print(both_criteria)

  • Combining the two conditions with & keeps only the values that satisfy both (greater than 50 and less than 75).
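The same two-condition filter as a sketch on made-up scores. When the conditions are written inline rather than stored in variables, each one needs its own parentheses, because & binds more tightly than > and <.

```python
import pandas as pd

# Hypothetical scores.
scores = pd.Series([40, 60, 74, 90], index=["Film A", "Film B", "Film C", "Film D"])

criteria_one = scores > 50  # boolean Series
criteria_two = scores < 75  # boolean Series
both_criteria = scores[criteria_one & criteria_two]

# Inline form: parentheses around each comparison are required.
inline = scores[(scores > 50) & (scores < 75)]

print(both_criteria)
```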
#data alignment same index
rt_critics = Series(fandango['RottenTomatoes'].values, index=fandango['FILM'])
rt_users = Series(fandango['RottenTomatoes_User'].values, index=fandango['FILM'])
rt_mean = (rt_critics + rt_users)/2

print(rt_mean)
  • Two Series that share an index can be added; pandas lines the values up by label.
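Alignment is easier to see when the two Series are not in the same order. In this sketch (made-up names and scores), the labels are shuffled and one film is missing from the second Series:

```python
import pandas as pd

# Hypothetical critic and user scores.
critics = pd.Series([80, 60, 50], index=["Film A", "Film B", "Film C"])
users = pd.Series([70, 90], index=["Film B", "Film A"])

# Values are matched by label, not by position.
rt_mean = (critics + users) / 2

print(rt_mean)
```

Film C has no user score, so its mean comes out as NaN rather than raising an error.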

Related file: fandango_score_comparison.csv (film scores)
File name: pandas_5.py
Summary: assorted data handling in pandas

import pandas as pd
import numpy as np
#will return a new DataFrame that is indexed by the values in the specified column
#and will drop that column from the DataFrame
#without the FILM column dropped
fandango = pd.read_csv('fandango_score_comparison.csv')
print(type(fandango))
fandango_films = fandango.set_index('FILM', drop=False)  # use FILM as the index
print(fandango_films.index)
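  • set_index() makes FILM the index. drop=False keeps FILM as an ordinary column as well; the default, drop=True, removes it from the columns.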
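A quick sketch of what the drop argument changes, on a made-up two-row frame:

```python
import pandas as pd

# Hypothetical films.
df = pd.DataFrame({"FILM": ["Film A", "Film B"], "Score": [74, 85]})

kept = df.set_index("FILM", drop=False)  # FILM is both the index and a column
dropped = df.set_index("FILM")           # default drop=True: FILM is the index only

print(kept.columns.tolist())
print(dropped.columns.tolist())
```

In both cases df itself is left unchanged; set_index returns a new DataFrame.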
# Slice using either bracket notation or loc[]
fandango_films["Avengers: Age of Ultron (2015)":"Hot Tub Time Machine 2 (2015)"]
fandango_films.loc["Avengers: Age of Ultron (2015)":"Hot Tub Time Machine 2 (2015)"]  # numeric indexes can be sliced, and so can string labels, much like looking up keys in a dict
# Specific movie
fandango_films.loc['Kumiko, The Treasure Hunter (2015)']

# Selecting list of movies
movies = ['Kumiko, The Treasure Hunter (2015)', 'Do You Believe? (2015)', 'Ant-Man (2015)']
print(fandango_films.loc[movies])  # integer positions can still be used too, through iloc
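  • String labels work like dictionary keys: pass one label, a list of labels, or a label slice to loc[]. A label slice includes both endpoints.
  • Even with strings as the index, rows can still be selected by integer position, for example with iloc[].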
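The three kinds of selection, sketched on a made-up frame with string labels:

```python
import pandas as pd

# Hypothetical scores indexed by hypothetical film names.
films = pd.DataFrame(
    {"Score": [74, 85, 80, 18]},
    index=["Film A", "Film B", "Film C", "Film D"],
)

label_slice = films.loc["Film B":"Film C"]  # both endpoints are included
picked = films.loc[["Film D", "Film A"]]    # rows come back in the order requested
positional = films.iloc[0:2]                # integer positions, end excluded

print(label_slice)
print(picked)
print(positional)
```

The difference in endpoint handling between loc (inclusive) and iloc (exclusive) is an easy one to trip over.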
#The apply() method in Pandas allows us to specify Python logic
#The apply() method requires you to pass in a vectorized operation
#that can be applied over each Series object.
# returns the data types as a Series
types = fandango_films.dtypes
#print(types)
# filter data types to just floats, index attributes returns just column names
float_columns = types[types.values == 'float64'].index  # keep the names of the float64 columns; nothing is converted
# use bracket notation to filter columns to just float columns
float_df = fandango_films[float_columns]
#print(float_df)
  • This filters the columns by data type; it is not a type conversion.
rt_mt_user = float_df[['RT_user_norm', 'Metacritic_user_nom']]  # select the two columns
rt_mt_user.apply(lambda x: np.std(x), axis=1)  # axis=1 applies the function to each row, giving the standard deviation of the two scores for each film
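As a sketch of the last two steps, the example below filters a made-up frame down to its float64 columns and then applies np.std along each axis. The column names critic, user and votes are hypothetical. select_dtypes is not used in the original notes; it is shown only as a cross-check on the dtype filter.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one text, two float and one integer column.
df = pd.DataFrame({
    "FILM": ["Film A", "Film B"],
    "critic": [4.0, 3.0],
    "user": [2.0, 3.0],
    "votes": [10, 20],
})

types = df.dtypes
float_columns = types[types.values == "float64"].index  # names of the float64 columns
float_df = df[float_columns]
same_columns = df.select_dtypes(include="float64").columns.tolist()

per_row = float_df.apply(lambda x: np.std(x), axis=1)     # one value per film
per_column = float_df.apply(lambda x: np.std(x), axis=0)  # one value per column

print(per_row)
print(per_column)
```

With axis=1 the result has one entry per row, so it measures how far apart the two scores are for each film.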