14 Days of Data Analysis and Machine Learning Practice, Day 02: A Summary of the Pandas Data Analysis Library
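1. Introduction to Pandas
Pandas (Python Data Analysis Library) is a tool built on top of NumPy and created for data analysis tasks. It incorporates a number of libraries and some standard data models, and it provides the tools needed to work efficiently with large datasets. Pandas also offers a large collection of functions and methods for processing data quickly and conveniently. You will soon find that it is one of the main reasons Python is a powerful and efficient environment for data analysis. The sections below briefly introduce the commonly used Pandas data structures and the operations on them.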
1. Reading data
import pandas
titanic_train = pandas.read_csv("titanic_train.csv")
print(type(titanic_train))
print(titanic_train)
help(pandas.read_csv)  # help() prints the documentation itself; wrapping it in print() would also print None
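To try read_csv without the titanic_train.csv file at hand, the sketch below reads a small in-memory CSV instead. The four columns and three rows are a hypothetical miniature made up for illustration, not the real dataset, and the names csv_text and sample are assumptions of this example.

```python
import io

import pandas as pd

# Hypothetical miniature of the Titanic file; the values are made up.
csv_text = """PassengerId,Survived,Pclass,Age
1,0,3,22
2,1,1,38
3,1,3,
"""

# read_csv accepts a file path or any file-like object such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))

print(type(sample))   # the DataFrame class
print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # Age is read as float64 because of the empty cell
```

The empty Age cell in the last row is read as NaN, which is why the whole column is stored as floating point.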
2. Displaying rows as a table
titanic_train.head(3)  # show the first 3 rows
titanic_train.tail(3)  # show the last 3 rows
3. Printing the column names
titanic_train.columns  # the column names
#output
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
4. Printing the dimensions
titanic_train.shape  # (number of rows, number of columns)
#output
(891, 12)
5. Printing the row with index label 0
print(titanic_train.loc[0])
#output
PassengerId 1
Survived 0
Pclass 3
Name Braund, Mr. Owen Harris
Sex male
Age 22
SibSp 1
Parch 0
Ticket A/5 21171
Fare 7.25
Cabin NaN
Embarked S
Name: 0, dtype: object
Printing rows 3 through 6 (a loc slice includes both endpoints)
print(titanic_train.loc[3:6])
#output
PassengerId Survived Pclass \
3 4 1 1
4 5 0 3
5 6 0 3
6 7 0 1
Name Sex Age SibSp Parch \
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0
4 Allen, Mr. William Henry male 35.0 0 0
5 Moran, Mr. James male NaN 0 0
6 McCarthy, Mr. Timothy J male 54.0 0 0
Ticket Fare Cabin Embarked
3 113803 53.1000 C123 S
4 373450 8.0500 NaN S
5 330877 8.4583 NaN Q
6 17463 51.8625 E46 S
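The output above contains four rows because loc slices by label and includes both endpoints. A minimal sketch of the difference from position-based iloc, using a hypothetical ten-row frame rather than the Titanic data:

```python
import pandas as pd

# Hypothetical frame with the default integer index 0..9.
df = pd.DataFrame({"value": range(10, 20)})

by_label = df.loc[3:6]      # label-based: labels 3 and 6 are both included
by_position = df.iloc[3:6]  # position-based: the end is excluded, like a Python slice

print(len(by_label))     # 4
print(len(by_position))  # 3
```

With the default integer index the two look similar, so the off-by-one difference is easy to miss.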
6. Printing the Name column
titanic_col=titanic_train["Name"]
print (titanic_col)
#output
0 Braund, Mr. Owen Harris
1 Cumings, Mrs. John Bradley (Florence Briggs Th...
2 Heikkinen, Miss. Laina
3 Futrelle, Mrs. Jacques Heath (Lily May Peel)
4 Allen, Mr. William Henry
5 Moran, Mr. James
6 McCarthy, Mr. Timothy J
7 Palsson, Master. Gosta Leonard
8 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
9 Nasser, Mrs. Nicholas (Adele Achem)
10 Sandstrom, Miss. Marguerite Rut
11 Bonnell, Miss. Elizabeth
12 Saundercock, Mr. William Henry
13 Andersson, Mr. Anders Johan
14 Vestrom, Miss. Hulda Amanda Adolfina
15 Hewlett, Mrs. (Mary D Kingcome)
16 Rice, Master. Eugene
17 Williams, Mr. Charles Eugene
18 Vander Planke, Mrs. Julius (Emelia Maria Vande...
19 Masselmani, Mrs. Fatima
20 Fynney, Mr. Joseph J
21 Beesley, Mr. Lawrence
22 McGowan, Miss. Anna "Annie"
23 Sloper, Mr. William Thompson
24 Palsson, Miss. Torborg Danira
25 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia...
26 Emir, Mr. Farred Chehab
27 Fortune, Mr. Charles Alexander
28 O'Dwyer, Miss. Ellen "Nellie"
29 Todoroff, Mr. Lalio
...
861 Giles, Mr. Frederick Edward
862 Swift, Mrs. Frederick Joel (Margaret Welles Ba...
863 Sage, Miss. Dorothy Edith "Dolly"
864 Gill, Mr. John William
865 Bystrom, Mrs. (Karolina)
866 Duran y More, Miss. Asuncion
867 Roebling, Mr. Washington Augustus II
868 van Melkebeke, Mr. Philemon
869 Johnson, Master. Harold Theodor
870 Balkic, Mr. Cerin
871 Beckwith, Mrs. Richard Leonard (Sallie Monypeny)
872 Carlsson, Mr. Frans Olof
873 Vander Cruyssen, Mr. Victor
874 Abelson, Mrs. Samuel (Hannah Wizosky)
875 Najib, Miss. Adele Kiamie "Jane"
876 Gustafsson, Mr. Alfred Ossian
877 Petroff, Mr. Nedelio
878 Laleff, Mr. Kristo
879 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)
880 Shelley, Mrs. William (Imanita Parrish Hall)
881 Markun, Mr. Johann
882 Dahlberg, Miss. Gerda Ulrika
883 Banfield, Mr. Frederick James
884 Sutehall, Mr. Henry Jr
885 Rice, Mrs. William (Margaret Norton)
886 Montvila, Rev. Juozas
887 Graham, Miss. Margaret Edith
888 Johnston, Miss. Catherine Helen "Carrie"
889 Behr, Mr. Karl Howell
890 Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object
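Indexing with a single column name returns a Series, as the output above shows. Passing a list of names returns a DataFrame instead. The sketch below illustrates this on a small made-up frame; the variable names people, name_col and subset are assumptions of this example.

```python
import pandas as pd

# Made-up data for illustration only.
people = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cid"],
    "Age": [30, 25, 41],
    "Fare": [7.25, 71.28, 8.05],
})

name_col = people["Name"]          # one label: a Series
subset = people[["Name", "Fare"]]  # a list of labels: a DataFrame

print(type(name_col))
print(type(subset))
```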
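Data preprocessing
Sorting: .sort_values("column_name", inplace=True, ascending=False)
ascending=False sorts in descending order; otherwise (the default, ascending=True) the order is ascending.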
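A minimal sketch of sort_values on made-up data follows. It omits inplace=True, so the call returns a new sorted frame and leaves the original untouched; the names fares, sorted_desc and renumbered are assumptions of this example.

```python
import pandas as pd

# Made-up data for illustration only.
fares = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cid"],
    "Fare": [7.25, 71.28, 8.05],
})

# Without inplace=True, sort_values returns a new, sorted DataFrame.
sorted_desc = fares.sort_values("Fare", ascending=False)

# Rows keep their original index labels after sorting;
# reset_index(drop=True) renumbers them from 0.
renumbered = sorted_desc.reset_index(drop=True)

print(sorted_desc)
print(renumbered)
```

Keeping the original labels is useful for tracing rows back to the source, while renumbering is convenient when the sorted order is what matters from then on.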
Data processing
.pivot_table(index="column to group by", values="column to aggregate", aggfunc=np.mean)  # np.mean computes the mean
#index tells the method which column to group by
#values is the column that we want to apply the calculation to
#aggfunc specifies the calculation we want to perform
import numpy as np
# titanic_train is the DataFrame loaded in the "Reading data" section
passenger_survival = titanic_train.pivot_table(index="Pclass", values="Survived", aggfunc=np.mean)
print(passenger_survival)
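The pivot table above can be checked on a tiny made-up sample where the means are easy to verify by hand. The sketch passes the string "mean" as aggfunc, which pandas also accepts, and compares the result with an equivalent groupby. The frame passengers and its values are hypothetical.

```python
import pandas as pd

# Made-up sample: two passengers in class 1, one in class 2, three in class 3.
passengers = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Survived": [1, 0, 1, 0, 0, 1],
})

# Mean of Survived within each Pclass, i.e. the survival rate per class.
survival_by_class = passengers.pivot_table(
    index="Pclass", values="Survived", aggfunc="mean"
)

# The same numbers via groupby; this returns a Series instead of a DataFrame.
same_with_groupby = passengers.groupby("Pclass")["Survived"].mean()

print(survival_by_class)
print(same_with_groupby)
```

Because Survived is coded as 0 or 1, its mean within a group is the fraction of survivors in that group.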
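Dropping rows with missing values
Checking only a subset of columns
# how='any' is evaluated per row: if any of the columns listed in subset is null, the row is dropped
df = df.dropna(axis=0, how='any', subset=['A', 'B'])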
df.shape
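To see what subset changes, the sketch below runs dropna on a hypothetical four-row frame, once restricted to columns A and B and once over all columns. The column C and the result names checked_ab and checked_all are assumptions of this example.

```python
import numpy as np
import pandas as pd

# Made-up frame: each of the first three rows has exactly one missing value.
df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, 4.0],
    "B": [1.0, 2.0, np.nan, 4.0],
    "C": [np.nan, 2.0, 3.0, 4.0],
})

# Only A and B are inspected, so row 0 (missing only in C) is kept.
checked_ab = df.dropna(axis=0, how="any", subset=["A", "B"])

# Without subset every column is inspected, so only row 3 is kept.
checked_all = df.dropna(axis=0, how="any")

print(checked_ab.shape)
print(checked_all.shape)
```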
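Handling special characters
This mainly concerns percent signs: a percent sign makes pandas read the column with dtype object, so the approach is to find the object-dtype columns first and then check whether their values contain a percent sign.
from pandas.api.types import is_object_dtype
# positions of the columns whose dtype is object
object_column_index = [index for index, item in enumerate(df.dtypes) if is_object_dtype(item)]
print("Positions of the object-dtype columns: %s" % object_column_index)
# process column by column
for i in object_column_index:
    col = df.columns[i]
    # sample the first 10 rows (.ix has been removed from pandas; .iloc selects by position)
    data = df.iloc[:10, i].tolist()
    for item in data:
        # str.find returns -1 (truthy) when nothing is found, so test with "in" instead
        if "%" in str(item):
            continue
        else:
            print("Found a row without a percent sign")
            break
    else:
        # every sampled value contains a percent sign: replace with the corresponding decimal
        df[col] = df[col].map(lambda x: float(str(x).replace('%', '')) / 100)
df.head()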
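As an alternative to the explicit loop over sampled rows, the same conversion can be sketched with vectorized string methods. This is a swapped-in variant, not the approach used above: it checks every value rather than a sample, and it assumes a column should be converted only when all of its values end with a percent sign. The frame rates and its contents are made up.

```python
import pandas as pd

# Made-up data: "rate" holds percentages as text, "city" is ordinary text.
rates = pd.DataFrame({
    "city": ["X", "Y"],
    "rate": ["12%", "3.5%"],
})

for col in rates.columns:
    as_text = rates[col].astype(str)
    # Convert only columns in which every value ends with a percent sign.
    if as_text.str.endswith("%").all():
        rates[col] = as_text.str.rstrip("%").astype(float) / 100

print(rates.dtypes)
print(rates)
```

A column that mixes percentages with missing or plain values is left untouched here, which avoids a conversion error at the cost of needing a separate cleanup step for such columns.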
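Filling missing values
Missing values can be filled with the mean, the maximum or the minimum; pandas provides ready-made APIs for each.
# check whether each column contains missing values
df.isnull().any()
# the positions of the columns with missing values can also be collected directly
null_index = [index for index, item in enumerate(df.isnull().any()) if item]
null_index
# fill each of those columns with its mean (assumes they are numeric; .ix has been removed from pandas)
for i in null_index:
    col = df.columns[i]
    df[col] = df[col].fillna(df[col].mean())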
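Mean filling only makes sense for numeric columns. The sketch below restricts it to them with select_dtypes and leaves the text column alone. The frame and its column names (age, fare, name) are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps in two numeric columns and one text column.
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "fare": [10.0, 20.0, np.nan],
    "name": ["a", None, "c"],
})

# Pick the numeric columns; mean() is not defined for the text column.
numeric_cols = df.select_dtypes(include="number").columns

filled = df.copy()
# fillna with a Series of column means fills each column with its own mean.
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].mean())

print(filled)
```

Replacing mean() with max() or min() in the last assignment gives the other two filling strategies mentioned above.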
Article URL: https://blog.csdn.net/qq_45817449/article/details/107347641