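Note to self: this part left me dizzy and I honestly did not follow it. I'll come back to it when I have time QAQ. (Not listening, not listening.)
3. pandas in practice
Related file: titanic_survival.csv
File name: pandas_3.py
Summary: read a CSV file, compute means, sums and so on, and use pandas pivot tables.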
import pandas as pd
import numpy as np
titanic_survival = pd.read_csv("titanic_train.csv")
print(titanic_survival.head())
- import loads a package.
- pandas.read_csv() reads a CSV file into a DataFrame.
- titanic_survival.head() shows the first five rows by default; titanic_survival.head(1000) shows the first 1000 rows.
#The Pandas library uses NaN, which stands for "not a number", to indicate a missing value.
#we can use the pandas.isnull() function which takes a pandas series and returns a series of True and False values
age = titanic_survival["Age"]
print(age.loc[0:10])
age_is_null = pd.isnull(age)  # True where the value is NaN
age_null_true = age[age_is_null]
age_null_count = len(age_null_true)
print(age_null_count)
print(age_null_true)  # exercise: how would you print the non-null values?
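- Series.loc[] selects by label: pass a single label for one row or a slice for several. A label slice such as loc[0:10] includes both endpoints.
- pandas.isnull() tests each value for null and returns True/False.
- Example: age_null_true = age[age_is_null] keeps only the values where age_is_null is True.
- len() returns the number of items (the length).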
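The exercise left open in the code above (printing the non-null values) can be sketched by inverting the boolean mask. This is a minimal sketch on a hypothetical five-value stand-in for the "Age" column, not the real Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the "Age" column (assumption, not the real file).
age = pd.Series([22.0, np.nan, 26.0, np.nan, 35.0])

age_is_null = pd.isnull(age)       # boolean Series: True where the value is NaN
age_null = age[age_is_null]        # only the missing entries
age_not_null = age[~age_is_null]   # "~" inverts the mask; same rows as age[pd.notnull(age)]

print(len(age_null))               # number of missing ages
print(age_not_null)
```

The inverted mask keeps the original row labels, so the result can still be matched back to the DataFrame.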
#The result of this is that mean_age would be nan. This is because any calculations we do with a null value also result in a null value
mean_age = sum(titanic_survival["Age"]) / len(titanic_survival["Age"])
print(mean_age)
- sum() computes the total. It is Python's built-in function, so a single NaN makes the whole total NaN.
#we have to filter out the missing values before we calculate the mean.
good_ages = titanic_survival["Age"][age_is_null == False]
correct_mean_age = sum(good_ages) / len(good_ages)
print(correct_mean_age)
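- This code computes the mean with the NaN values removed.
- Example: titanic_survival["Age"][age_is_null == False] applies two selections in a row: first the "Age" column, then the rows where age_is_null is False.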
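To compare the two approaches side by side, here is a small sketch on a hypothetical three-value Series. It also shows Series.mean(), which skips NaN by default and so gives the filtered result directly.

```python
import math
import numpy as np
import pandas as pd

ages = pd.Series([20.0, np.nan, 40.0])   # hypothetical sample, one missing value

naive_mean = sum(ages) / len(ages)        # built-in sum() propagates NaN
good_ages = ages.dropna()                 # same rows as ages[pd.notnull(ages)]
manual_mean = sum(good_ages) / len(good_ages)
pandas_mean = ages.mean()                 # skipna=True by default

print(naive_mean, manual_mean, pandas_mean)
```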
#mean fare for each class
passenger_classes = [1,2,3]
fare_by_class = {}
for this_class in passenger_classes:
    #print(this_class)
    pclass_rows = titanic_survival[titanic_survival["Pclass"] == this_class]
    #print(pclass_rows)
    pclass_fares = pclass_rows["Fare"]
    fare_for_class = pclass_fares.mean()  # mean fare of this class
    fare_by_class[this_class] = fare_for_class  # store it in fare_by_class, which is a dict (not a tuple)
print(fare_by_class)
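- This code computes the mean fare of each passenger class (1, 2, 3). The loop is the key part and deserves extra study.
- pclass_fares.mean() computes the mean of the Series.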
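The loop can be checked against groupby(), which does the same split-and-average in one call. A minimal sketch, assuming a tiny made-up table with only the "Pclass" and "Fare" columns:

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({"Pclass": [1, 1, 2, 3, 3],
                   "Fare": [80.0, 60.0, 20.0, 8.0, 10.0]})

# Loop version, as in the notes above.
fare_by_class = {}
for this_class in [1, 2, 3]:
    rows = df[df["Pclass"] == this_class]      # rows of one class
    fare_by_class[this_class] = rows["Fare"].mean()

# groupby version: one mean per distinct Pclass value.
grouped = df.groupby("Pclass")["Fare"].mean()

print(fare_by_class)
print(grouped.to_dict())
```

The loop makes each step visible, while groupby() does not need the list of classes to be known in advance.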
#index tells the method which column to group by
#values is the column that we want to apply the calculation to
#aggfunc specifies the calculation we want to perform
passenger_survival = titanic_survival.pivot_table(index = "Pclass",values = "Survived",aggfunc = np.mean)
#pivot_table builds a pivot table: rows are grouped by Pclass, and np.mean of the 0/1 Survived column gives the survival rate of each class
print(passenger_survival)
print("---------------------------------------------------")
passenger_age = titanic_survival.pivot_table(index="Pclass", values="Age")
#with no aggfunc given, the default is the mean
print(passenger_age)
port_stats = titanic_survival.pivot_table(index="Embarked", values=["Fare","Survived"], aggfunc=np.sum)
print(port_stats)
#specifying axis=1 or axis='columns' will drop any columns that have null values
drop_na_columns = titanic_survival.dropna(axis=1)
new_titanic_survival = titanic_survival.dropna(axis=0,subset=["Age", "Sex"])  # axis=0: drop rows that have a missing value in Age or Sex
#print(new_titanic_survival)
#locate one data
row_index_83_age = titanic_survival.loc[83,"Age"]  # Age value of the row labelled 83
row_index_766_pclass = titanic_survival.loc[766,"Pclass"]
print(row_index_83_age)
print(row_index_766_pclass)
- Pivot tables are worth reviewing several times.
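As a review aid, here is a small pivot_table sketch on made-up rows. It uses the string names "mean" and "sum" for aggfunc, and compares the result with the equivalent groupby() call; the column names mirror the Titanic ones but the numbers are invented.

```python
import pandas as pd

# Invented data with the same column names as the Titanic file.
df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3],
    "Survived": [1, 0, 1, 0, 0, 0],
    "Fare":     [80.0, 60.0, 20.0, 30.0, 8.0, 10.0],
})

# index = column to group by, values = column(s) to aggregate, aggfunc = the calculation.
survival_rate = df.pivot_table(index="Pclass", values="Survived", aggfunc="mean")
totals = df.pivot_table(index="Pclass", values=["Fare", "Survived"], aggfunc="sum")

# The same survival rate written with groupby.
same_as_groupby = df.groupby("Pclass")[["Survived"]].mean()

print(survival_rate)
print(totals)
```

Because Survived holds only 0 and 1, its mean per group is the share of survivors in that group.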
new_titanic_survival = titanic_survival.sort_values("Age",ascending=False)  # ascending=False sorts in descending order
print(new_titanic_survival[0:10])
titanic_reindexed = new_titanic_survival.reset_index(drop=True)  # replace the old labels with a fresh index starting at 0
print('-----------------------')
print(titanic_reindexed.loc[0:10])
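- sort_values() sorts the rows. ascending=True (the default) sorts in ascending order; ascending=False sorts in descending order.
- reset_index() rebuilds the index from 0. drop=True discards the old index instead of keeping it as a new column.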
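What happens to the row labels during sorting is easier to see on three made-up rows. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["a", "b", "c"], "Age": [30.0, 70.0, 50.0]})  # invented rows

by_age = df.sort_values("Age", ascending=False)  # rows move, original labels travel with them
reindexed = by_age.reset_index(drop=True)         # labels become 0, 1, 2 again
kept = by_age.reset_index()                       # default drop=False: old labels saved in an "index" column

print(by_age.index.tolist())
print(reindexed.index.tolist())
print(kept.columns.tolist())
```

After sorting, loc[0:10] would follow the old labels, which is why the notes reset the index before slicing by label.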
def hundredth_row(column):
    # Extract the hundredth item (label 99) from the column
    hundredth_item = column.loc[99]
    return hundredth_item
#Return the hundredth item from each column
hundredth_row_values = titanic_survival.apply(hundredth_row)  # separate name so the function is not overwritten
print(hundredth_row_values)
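- A custom function hundredth_row() is defined here to extract the 100th item of every column.
- On the open question: yes, apply() takes the function name without parentheses. You hand over the function object itself, and pandas calls it once per column.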
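The question about parentheses can be tried out on a two-column toy DataFrame. In this sketch second_item is a hypothetical helper standing in for hundredth_row:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})  # toy data

def second_item(column):
    # pandas calls this once per column; "column" is a Series.
    return column.loc[1]

# Pass the function object itself. Writing second_item() here would call it
# immediately with no argument and raise a TypeError.
result = df.apply(second_item)

print(result)
```

The result is a Series with one entry per column, indexed by the column names.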
def null_count(column):
    # Count how many values in the column are null
    column_null = pd.isnull(column)
    null = column[column_null]
    return len(null)
column_null_count = titanic_survival.apply(null_count)
print(column_null_count)  # number of missing values in each column
- This counts the missing values column by column; the output lists the count for every column.
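The same per-column count can be cross-checked with isnull().sum(). A small sketch on an invented three-row table:

```python
import numpy as np
import pandas as pd

# Invented rows; None in a text column also counts as missing.
df = pd.DataFrame({"Age": [22.0, np.nan, np.nan],
                   "Sex": ["m", "f", None],
                   "Fare": [7.0, 8.0, 9.0]})

def null_count(column):
    # Keep only the null entries of the column and count them.
    return len(column[pd.isnull(column)])

via_apply = df.apply(null_count)   # one count per column
via_sum = df.isnull().sum()        # True counts as 1 when summed

print(via_apply)
print(via_sum)
```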
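def is_minor(row):
    if row["Age"] < 18:
        return True
    else:
        return False
minors = titanic_survival.apply(is_minor, axis=1)
#print(minors)
- This one is straightforward. Note that a NaN age compares as False, so passengers with an unknown age are not counted as minors.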
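A quick sketch of the row-wise apply, next to the direct column comparison that gives the same answer. The three ages are made up, with one of them missing:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [10.0, 40.0, np.nan]})  # made-up ages

def is_minor(row):
    # With axis=1 each "row" is a Series holding one passenger's fields.
    return row["Age"] < 18

minors_apply = df.apply(is_minor, axis=1)   # called once per row
minors_vectorized = df["Age"] < 18          # same comparison on the whole column

print(minors_apply.tolist())
print(minors_vectorized.tolist())
```

Any comparison with NaN is False, which is why the missing age ends up as False in both versions.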
def generate_age_label(row):
    age = row["Age"]
    if pd.isnull(age):
        return "unknown"
    elif age < 18:
        return "minor"
    else:
        return "adult"
age_labels = titanic_survival.apply(generate_age_label, axis=1)
print(age_labels)
- The function outputs the row index together with a label, "unknown", "minor" or "adult", assigned according to age.
titanic_survival['age_labels'] = age_labels
age_group_survival = titanic_survival.pivot_table(index="age_labels", values="Survived")
print(age_group_survival)
- Pivot table usage: index is the label column, and the result is the mean survival rate of each age group.
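The whole label-then-pivot step fits in a few lines on invented data. A sketch, assuming only "Age" and "Survived" columns:

```python
import numpy as np
import pandas as pd

# Invented passengers: two minors, two adults, two with unknown age.
df = pd.DataFrame({"Age": [5.0, 30.0, np.nan, 16.0, 50.0, np.nan],
                   "Survived": [1, 0, 0, 1, 1, 1]})

def generate_age_label(row):
    age = row["Age"]
    if pd.isnull(age):          # check for missing first, NaN < 18 would be False
        return "unknown"
    return "minor" if age < 18 else "adult"

df["age_labels"] = df.apply(generate_age_label, axis=1)

# Default aggfunc is the mean, i.e. the survival rate per label.
rates = df.pivot_table(index="age_labels", values="Survived")

print(rates)
```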
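4. The Series structure
Related file: fandango_score_comparison.csv (film ratings)
File name: pandas_4.py
Summary: working with data in a Series
Previously, .read_csv gave us a DataFrame. Now we look at how that structure breaks down: a single row or a single column of it is called a Series.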
import pandas as pd
import numpy as np
from pandas import Series
fandango = pd.read_csv('fandango_score_comparison.csv')
series_film = fandango['FILM']
print(type(series_film))#print:<class 'pandas.core.series.Series'>
print(series_film[0:5])
series_rt = fandango['RottenTomatoes']
print(series_rt[0:5])
- fandango['FILM'] reads one column as a Series; its type prints as <class 'pandas.core.series.Series'>.
- The data in series_film can be printed the same way as before.
film_names = series_film.values  # collect every value of the Series
print(type(film_names))  #print:<class 'numpy.ndarray'>, so a Series stores its data as an ndarray
#print(film_names)
rt_scores = series_rt.values
#print(rt_scores)
series_custom = Series(rt_scores , index = film_names)
series_custom[['Minions (2015)', 'Leviathan (2014)']]
fiveten = series_custom[5:10]  # positional slice; string labels also work, as in the line above
print(fiveten)
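- Series.values (an attribute, so no parentheses) collects every value of the Series. Its type prints as <class 'numpy.ndarray'>, and printing it shows the values one after another in a single array.
- NumPy's most important feature is its N-dimensional array object, ndarray: a collection of same-typed data whose elements are indexed starting from 0. An ndarray is a multidimensional array that stores elements of the same type, and every element occupies the same amount of memory. Quoted from: ndarray explanation, runboom.com
- Series(values, index) builds a Series from an array of values and a matching array of index labels.
- rt_scores = series_rt.values stores the values of series_rt as an ndarray.
- 'Minions (2015)' and 'Leviathan (2014)' are not made up: because film_names was passed as the index, they should be titles from the FILM column, now used as index labels.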
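To see how values and index fit together, here is a sketch that builds a Series by hand. The titles "Film A" to "Film C" and the scores are placeholders, not rows from the Fandango file.

```python
import pandas as pd

# Placeholder titles and scores (assumptions for illustration).
scores = pd.Series([74, 85, 80], index=["Film A", "Film B", "Film C"])

by_label = scores[["Film A", "Film C"]]          # list of labels
by_label_slice = scores.loc["Film A":"Film B"]   # label slice, both ends included
by_position = scores.iloc[0:2]                   # positional slice, end excluded

print(by_label)
print(by_label_slice)
print(by_position)
```

Using loc for labels and iloc for positions keeps the two kinds of access clearly apart.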
original_index = series_custom.index.tolist()
#print(original_index)
sorted_index = sorted(original_index)
#sort the Series by its index
sorted_by_index = series_custom.reindex(sorted_index)#reindex
#print(sorted_by_index)
- reindex() is an important method of pandas objects: it creates a new object that conforms to a new index. Reference: "ReIndex重新索引", by yungeisme
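A short sketch of reindex() on a three-item Series with placeholder labels. It also shows sort_index(), which gives the same ordering in one call, and what reindex() does with a label that does not exist.

```python
import pandas as pd

s = pd.Series([3, 1, 2], index=["c", "a", "b"])  # placeholder labels

sorted_by_reindex = s.reindex(sorted(s.index.tolist()))  # new object in the given label order
sorted_by_sort_index = s.sort_index()                    # same ordering, one call
with_missing = s.reindex(["a", "z"])                     # unknown label "z" is filled with NaN

print(sorted_by_reindex)
print(with_missing)
```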
#The values in a Series object are treated as an ndarray, the core data type in NumPy
# Add two Series element by element
print(np.add(series_custom, series_custom))
# Apply sine function to each value
np.sin(series_custom)
# Return the highest value (will return a single value not a Series)
np.max(series_custom)
- Two Series with the same index, length and data type can be added together.
- numpy.sin() and numpy.max() are functions built into the NumPy library.
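The NumPy calls above can be tried on a tiny Series to see what type comes back. A sketch with placeholder labels:

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])  # placeholder data

doubled = np.add(s, s)   # element-wise, result is still a Series with the same index
sines = np.sin(s)        # element-wise, also a Series
highest = np.max(s)      # reduction, a single value

print(type(doubled), type(sines), highest)
```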
#will actually return a Series object with a boolean value for each film
series_custom > 50
series_greater_than_50 = series_custom[series_custom > 50]
criteria_one = series_custom > 50
criteria_two = series_custom < 75
both_criteria = series_custom[criteria_one & criteria_two]
print(both_criteria)
- The two conditions are combined with &, so only the values that satisfy both are kept.
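A compact sketch of combining conditions, on three placeholder scores. Each condition goes in its own parentheses because & and | bind more tightly than the comparison operators.

```python
import pandas as pd

s = pd.Series([40, 60, 80], index=["a", "b", "c"])  # placeholder scores

both = s[(s > 50) & (s < 75)]      # AND: strictly between 50 and 75
either = s[(s <= 50) | (s >= 75)]  # OR: everything outside that range

print(both)
print(either)
```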
#data alignment same index
rt_critics = Series(fandango['RottenTomatoes'].values, index=fandango['FILM'])
rt_users = Series(fandango['RottenTomatoes_User'].values, index=fandango['FILM'])
rt_mean = (rt_critics + rt_users)/2
print(rt_mean)
- Series that share the same index can be added; values are matched by index label.
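Alignment is easier to see when the two indexes only partly overlap. A sketch with placeholder titles:

```python
import pandas as pd

# Placeholder titles; only "Film B" appears in both Series.
critics = pd.Series([80, 60], index=["Film A", "Film B"])
users = pd.Series([70, 90], index=["Film B", "Film C"])

mean = (critics + users) / 2   # values are paired by label, not by position

print(mean)
```

Labels present on only one side give NaN, so only "Film B" gets a real average here.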
Related file: fandango_score_comparison.csv (film ratings)
File name: pandas_5.py
Summary: some data handling in pandas
import pandas as pd
import numpy as np
#will return a new DataFrame that is indexed by the values in the specified column
#and will drop that column from the DataFrame
#without the FILM column dropped
fandango = pd.read_csv('fandango_score_comparison.csv')
print(type(fandango))
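fandango_films = fandango.set_index('FILM', drop=False)  # use FILM as the index
print(fandango_films.index)
- set_index() makes FILM the index column. drop=False keeps FILM as a regular column as well; the default drop=True removes it from the columns.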
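The effect of drop can be checked on a two-row toy table. A minimal sketch with placeholder titles:

```python
import pandas as pd

df = pd.DataFrame({"FILM": ["Film A", "Film B"], "Score": [74, 85]})  # placeholder rows

kept = df.set_index("FILM", drop=False)  # FILM is the index and still a column
dropped = df.set_index("FILM")           # default drop=True: FILM is only the index

print(kept.columns.tolist())
print(dropped.columns.tolist())
```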
# Slice using either bracket notation or loc[]
fandango_films["Avengers: Age of Ultron (2015)":"Hot Tub Time Machine 2 (2015)"]
fandango_films.loc["Avengers: Age of Ultron (2015)":"Hot Tub Time Machine 2 (2015)"]  # numeric indexes can be sliced, and so can string labels, much like dictionary keys
# Specific movie
fandango_films.loc['Kumiko, The Treasure Hunter (2015)']
# Selecting list of movies
movies = ['Kumiko, The Treasure Hunter (2015)', 'Do You Believe? (2015)', 'Ant-Man (2015)']
print(fandango_films.loc[movies])  # selection by position (numbers) is still available too
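- String values can serve as index labels and are used much like dictionary keys.
- Even with a string index, rows can still be selected by number, through a plain slice or iloc[].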
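Label access and positional access on a string index can be compared directly. A sketch with four placeholder titles:

```python
import pandas as pd

df = pd.DataFrame({"Score": [74, 85, 80, 60]},
                  index=["Film A", "Film B", "Film C", "Film D"])  # placeholder titles

label_slice = df.loc["Film B":"Film C"]   # label slice: both endpoints included
position_slice = df.iloc[1:3]             # positional slice: end position excluded
picked = df.loc[["Film D", "Film A"]]     # list of labels, returned in the order given

print(label_slice)
print(position_slice)
print(picked)
```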
#The apply() method in Pandas allows us to specify Python logic
#The apply() method requires you to pass in a vectorized operation
#that can be applied over each Series object.
# returns the data types as a Series
types = fandango_films.dtypes
#print(types)
# filter data types to just floats, index attributes returns just column names
float_columns = types[types.values == 'float64'].index  # keeps the float64 column names; this is filtering, not type conversion
# use bracket notation to filter columns to just float columns
float_df = fandango_films[float_columns]
#print(float_df)
- This filters the columns by data type; no values are converted.
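The dtype filter can be tried on a toy table with one text, one integer and one float column. The sketch also shows select_dtypes(), which performs the same selection in one call; the column names are placeholders.

```python
import pandas as pd

df = pd.DataFrame({"FILM": ["Film A", "Film B"],   # text column
                   "Votes": [10, 20],              # int64 column
                   "Stars": [4.5, 3.0]})           # float64 column

types = df.dtypes                                 # Series: column name -> dtype
float_columns = types[types == "float64"].index   # names of the float64 columns
float_df = df[float_columns]

same = df.select_dtypes(include="float64")        # one-call equivalent

print(float_columns.tolist())
print(same.columns.tolist())
```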
rt_mt_user = float_df[['RT_user_norm', 'Metacritic_user_nom']]  # select two columns
rt_mt_user.apply(lambda x: np.std(x), axis=1)  # axis=1 applies the function to each row: the standard deviation across the two scores of each film
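The direction of axis is easy to mix up, so here is a sketch on a two-by-two toy table with placeholder names. It calls the pandas std() method with ddof=0 (population standard deviation) explicitly, as an assumption made to keep the numbers simple; pandas' own default is ddof=1.

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 6.0]},
                  index=["Film A", "Film B"])  # placeholder films and scores

# axis=1: the function receives one row at a time, so there is one result per film.
per_row = df.apply(lambda r: r.std(ddof=0), axis=1)

# axis=0 (the default): the function receives one column at a time.
per_column = df.apply(lambda c: c.std(ddof=0), axis=0)

print(per_row)
print(per_column)
```

The index of the result shows which direction was used: film titles for axis=1, column names for axis=0.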