Dataquest Data Scientist Path: Study Notes (2)

xiyoudahua

Published 2017-06-15 15:41:09

Article link: https://blog.csdn.net/xiyoudahua/article/details/73290728

Notes on the key points from the Data Scientist path on Dataquest.

Step 2: Data Analysis And Visualization

ndarray: the core data structure in NumPy; the name stands for N-dimensional array.
Reading a CSV file into a NumPy array

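import numpy

# Comma-delimited; every field is read as a Unicode string of up to 75 characters ("U75"); the header row is skipped
nfl = numpy.genfromtxt("nfl.csv", delimiter=",", dtype="U75", skip_header=1)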
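To try the call without the course's nfl.csv, the sketch below feeds genfromtxt an in-memory file instead. The CSV text and its column names are made up for illustration; only the genfromtxt arguments come from the notes.

```python
import io
import numpy

# Hypothetical CSV contents standing in for nfl.csv
csv_text = "year,team,score\n2009,Team A,24\n2010,Team B,17\n"

# genfromtxt accepts a file-like object as well as a file name
data = numpy.genfromtxt(io.StringIO(csv_text), delimiter=",", dtype="U75", skip_header=1)

print(data.shape)   # (number of rows, number of columns)
print(data.dtype)   # a fixed-width Unicode string type
print(data[0, 1])   # first row, second column
```

Because dtype is a string type, numeric columns such as the score are also stored as text and need astype() before any arithmetic.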

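Create an array with numpy.array(); ndarray.shape gives the array's dimensions and ndarray.dtype gives the element type. Every element of a NumPy array has the same type; the basic types include bool, int, float, and string.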

vector = numpy.array([5, 10, 15, 20])
matrix = numpy.array([[5, 10, 15], 
                      [20, 25, 30], 
                      [35, 40, 45]])
print(vector.shape)
print(matrix.shape)
vector.dtype

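NumPy uses nan for Not a Number, the value it substitutes when an entry cannot be converted to a numeric type. The course also uses na (Not Available) for a value that is missing altogether.
Build a matrix from a list of lists and from a NumPy array, then read a single element from each: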
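One practical consequence of nan is that it never compares equal to anything, so missing numeric entries have to be located with numpy.isnan() rather than ==. A minimal sketch with made-up values:

```python
import numpy

values = numpy.array([1.5, numpy.nan, 3.0])

# nan is not equal to anything, including itself
print(numpy.nan == numpy.nan)

# numpy.isnan() marks the nan entries
is_missing = numpy.isnan(values)
print(is_missing)

# Sum only the entries that are real numbers
clean_total = values[~is_missing].sum()
print(clean_total)
```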

>>list_of_lists = [[5, 10, 15],
                   [20, 25, 30]]
>>list_of_lists[1][2]
30
>>matrix = numpy.array([[5, 10, 15],
                        [20, 25, 30]])
>>matrix[1, 2]
30

Slicing array

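print(vector[1:3])  # returns a vector holding the 2nd and 3rd elements (index 3 is excluded)
print(matrix[:, 1])  # returns a vector: the 2nd column
print(matrix[:, 1:3])  # returns a sub-matrix: the 2nd and 3rd columns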

Array Comparisons

>>vector = numpy.array([5, 10, 15, 20])
>>vector == 10
array([False,  True, False, False], dtype=bool)

>>(vector == 10)&(vector == 15)
array([False, False, False, False], dtype=bool)

>>(vector == 10)|(vector == 15)
array([False, True, True, False], dtype=bool)

>>matrix = numpy.array([[5, 10, 15], 
                        [20, 25, 30],
                        [35, 40, 45]])
>>matrix == 25
array([[False, False, False],
       [False,  True, False],
       [False, False, False]], dtype=bool)

Selecting Elements

matrix = numpy.array([[5, 10, 15], 
                      [20, 25, 30],
                      [35, 40, 45]])
second_column_25 = (matrix[:,1] == 25)
print(matrix[second_column_25, :])
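The selection above relies on a boolean mask with one entry per row. The sketch below repeats it on the same matrix and prints the intermediate mask so that both steps are visible:

```python
import numpy

matrix = numpy.array([[5, 10, 15],
                      [20, 25, 30],
                      [35, 40, 45]])

# One True/False entry per row: is the value in the second column 25?
second_column_25 = matrix[:, 1] == 25
print(second_column_25)

# Keep every column of the rows where the mask is True
selected_rows = matrix[second_column_25, :]
print(selected_rows)
```

The result is still a 2-D array, even though only one row matched.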

Replacing Values

>>vector = numpy.array([5, 10, 15, 20])
>>equal_to_ten_or_five = (vector == 10) | (vector == 5)
>>vector[equal_to_ten_or_five] = 50
>>print(vector)
[50 50 15 20]

>>matrix = numpy.array([[5, 10, 15],
                        [20, 25, 30],
                        [35, 40, 45]])
>>second_column_25 = matrix[:,1] == 25
>>matrix[second_column_25, 1] = 10
>>matrix
array([[ 5, 10, 15],
       [20, 10, 30],
       [35, 40, 45]])

Converting element types with astype()

vector = numpy.array(["1", "2", "3"])
vector = vector.astype(float)

Computing With NumPy

>>vector = numpy.array([5, 10, 15, 20])
>>vector.sum()
50
>>matrix = numpy.array([[5, 10, 15], 
                        [20, 25, 30],
                        [35, 40, 45]])
>>matrix.sum(axis=1)  # axis=1 sums across each row; axis=0 sums down each column
array([30, 75, 120])
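The axis argument is easy to mix up, so here is a small side-by-side sketch on the same matrix showing both directions and the total:

```python
import numpy

matrix = numpy.array([[5, 10, 15],
                      [20, 25, 30],
                      [35, 40, 45]])

row_sums = matrix.sum(axis=1)   # one total per row
col_sums = matrix.sum(axis=0)   # one total per column
grand_total = matrix.sum()      # no axis: total of every element

print(row_sums)
print(col_sums)
print(grand_total)
```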

Example
Year | WHO region | Country       | Beverage Types | Display Value
1986 | Western    | Viet Nam      | Wine           | 0
1986 | Americas   | Uruguay       | Other          | 0.5
1985 | Africa     | Côte d’Ivoire | Wine           | 1.62
1986 | Americas   | Colombia      | Beer           | 4.27
1987 | Americas   | Saint Kitts   | Beer           | 1.98

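# world_alcohol: 2-D string array with the columns shown above
# countries: the country names to total, assumed to be defined earlier in the lesson
totals = {}  # maps country -> total consumption
is_year = world_alcohol[:,0] == "1985"  # keep only the rows for 1985
year = world_alcohol[is_year,:]

for country in countries:
    is_country = year[:,2] == country
    country_consumption = year[is_country,:]
    alcohol_column = country_consumption[:,4]
    is_empty = alcohol_column == ''
    alcohol_column[is_empty] = "0"  # treat empty strings as zero before converting
    alcohol_column = alcohol_column.astype(float)
    totals[country] = alcohol_column.sum()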
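The loop above depends on world_alcohol and countries being defined elsewhere. The sketch below is a self-contained variant, using a few of the sample rows listed above plus one hypothetical row with an empty value. Deriving countries with numpy.unique is an assumption, not necessarily how the course builds that list.

```python
import numpy

# A few sample rows; the last row is hypothetical and has an empty value
world_alcohol = numpy.array([
    ["1986", "Western", "Viet Nam", "Wine", "0"],
    ["1986", "Americas", "Uruguay", "Other", "0.5"],
    ["1986", "Americas", "Colombia", "Beer", "4.27"],
    ["1987", "Americas", "Saint Kitts", "Beer", "1.98"],
    ["1986", "Americas", "Uruguay", "Beer", ""],
])

is_year = world_alcohol[:, 0] == "1986"
year = world_alcohol[is_year, :]

# Assumption: total every country that appears in the selected year
countries = numpy.unique(year[:, 2])

totals = {}
for country in countries:
    is_country = year[:, 2] == country
    # Boolean indexing returns a copy, so the edit below leaves world_alcohol untouched
    alcohol_column = year[is_country, 4]
    alcohol_column[alcohol_column == ""] = "0"
    totals[str(country)] = float(alcohol_column.astype(float).sum())

print(totals)
```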

Reading a CSV file with pandas

import pandas

crime_rates = pandas.read_csv("crime_rates.csv")  # returns a DataFrame, the general-purpose pandas data structure

Exploring the data

first_rows = food_info.head()  # the first 5 rows
column_names = food_info.columns  # the column labels
dimensions = food_info.shape  # (number of rows, number of columns)
num_rows = dimensions[0]  # number of rows
num_cols = dimensions[1]  # number of columns

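Selecting rows and columns
pandas takes the column labels from the first row of the file and, by default, labels the rows with their row numbers.

food_info.loc[0]  # the row labelled 0 (the 1st row)
food_info.loc[2:5]  # the rows labelled 2 through 5 (the 3rd to 6th rows); loc includes both endpoints
food_info.loc[[2,4,6]]  # the 3rd, 5th, and 7th rows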

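food_info.iloc[0]  # selects by position: the 1st row, even after the DataFrame has been re-sorted
food_info.iloc[0:4]  # the first 4 rows by position (positions 0 to 3; the end of an iloc slice is excluded)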
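The difference between loc (labels) and iloc (positions) only shows up once the row order changes. The sketch below illustrates it on a tiny stand-in for food_info; the rows and the Shrt_Desc column are hypothetical.

```python
import pandas

# Hypothetical miniature version of food_info
food_info = pandas.DataFrame({
    "Shrt_Desc": ["BUTTER", "CHEESE", "MILK"],
    "Sodium_(mg)": [643, 1146, 43],
})

# Sorting reorders the rows but keeps their original labels (now 1, 0, 2)
sorted_food = food_info.sort_values("Sodium_(mg)", ascending=False)

by_label = sorted_food.loc[0]      # the row whose label is 0
by_position = sorted_food.iloc[0]  # whichever row is now first

print(by_label["Shrt_Desc"])
print(by_position["Shrt_Desc"])

# Slices differ too: loc includes the end label, iloc excludes the end position
print(len(food_info.loc[0:1]), len(food_info.iloc[0:1]))
```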

food_info["NDB_No"]  # the single column labelled "NDB_No"
food_info[["Zinc", "Copper"]]  # the two columns "Zinc" and "Copper"; the result keeps the column order given in the list

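endswith(): selecting the strings that end with a given suffix

col_names = food_info.columns.tolist()
gram_columns = []  # names of the columns measured in grams
for c in col_names:
    if c.endswith("(g)"):
        gram_columns.append(c)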

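Arithmetic on DataFrame columns

max_Iron = food_info["Iron_(mg)"].max()  # largest value in the column
mean_Iron = food_info["Iron_(mg)"].mean()  # mean of the column
iron_grams = food_info["Iron_(mg)"] / 1000  # convert milligrams to grams
food_info["Iron_(g)"] = iron_grams  # add a new column "Iron_(g)"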

Sorting

food_info.sort_values("Sodium_(mg)", inplace=True, ascending=False)  # sort in descending order and update food_info in place

Python uses None to mean "no value", while pandas uses NaN ("not a number") for missing values. Both None and NaN count as null values and can be detected with pandas.isnull().

sex = titanic_survival["sex"]
sex_is_null = pandas.isnull(sex)  # returns a Series of True/False values; also works on a whole DataFrame
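The boolean result of pandas.isnull() is mostly useful as a filter or as something to count. A minimal sketch on a hypothetical three-row stand-in for titanic_survival:

```python
import pandas

# Hypothetical rows; None and NaN are both treated as null
titanic_survival = pandas.DataFrame({
    "sex": ["female", None, "male"],
    "age": [29.0, float("nan"), 30.0],
})

sex_is_null = pandas.isnull(titanic_survival["sex"])

# Use the mask to pull out the rows with a missing value
rows_missing_sex = titanic_survival[sex_is_null]

# Applied to the whole DataFrame, then summed: null count per column
null_counts = pandas.isnull(titanic_survival).sum()

print(rows_missing_sex)
print(null_counts)
```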

passenger_classes = [1, 2, 3]
fares_by_class = {}
for this_class in passenger_classes:
    pclass_rows = titanic_survival[titanic_survival["pclass"] == this_class]
    pclass_fares = pclass_rows["fare"]
    fare_for_class = pclass_fares.mean()
    fares_by_class[this_class] = fare_for_class

Pivot tables

import numpy as np

passenger_class_fares = titanic_survival.pivot_table(index="pclass", values="fare", aggfunc=np.mean)  # same result as the loop above; aggfunc defaults to the mean, so it can be omitted
port_stats = titanic_survival.pivot_table(index="embarked", values=["fare","survived"], aggfunc=np.sum)  # several value columns can be aggregated at once
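To check that the pivot table really matches the manual loop, the sketch below runs both on a small made-up DataFrame. It passes the aggregation as a string ("mean", "sum") instead of np.mean and np.sum; that is a swapped-in spelling, chosen because newer pandas releases may warn about the NumPy callables.

```python
import pandas

# Hypothetical passengers, not the real Titanic data
titanic_survival = pandas.DataFrame({
    "pclass": [1, 1, 2, 3, 3],
    "fare": [100.0, 50.0, 20.0, 8.0, 12.0],
    "survived": [1, 0, 1, 0, 1],
})

# Manual version: filter each class, then average its fares
fares_by_class = {}
for this_class in [1, 2, 3]:
    pclass_rows = titanic_survival[titanic_survival["pclass"] == this_class]
    fares_by_class[this_class] = pclass_rows["fare"].mean()

# Pivot-table version: one row per pclass, one column per value
mean_fares = titanic_survival.pivot_table(index="pclass", values="fare", aggfunc="mean")
class_totals = titanic_survival.pivot_table(index="pclass", values=["fare", "survived"], aggfunc="sum")

print(fares_by_class)
print(mean_fares)
print(class_totals)
```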

Drop Missing Values

drop_na_rows = titanic_survival.dropna(axis=0)  # axis=0 or axis='index' drops every row that contains a null value; axis=1 or axis='columns' drops every column that contains one
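Dropping along the wrong axis can discard far more data than intended, so it helps to compare the variants on a small example. The DataFrame below is hypothetical; the subset parameter, which limits the null check to the named columns, is part of dropna but is not covered in the notes above.

```python
import pandas

# Hypothetical data with nulls in two of the three columns
df = pandas.DataFrame({
    "age": [22.0, None, 35.0],
    "cabin": [None, None, "C85"],
    "fare": [7.25, 8.05, 53.1],
})

drop_rows = df.dropna(axis=0)                     # keeps rows with no nulls at all
drop_cols = df.dropna(axis=1)                     # keeps columns with no nulls at all
drop_subset = df.dropna(axis=0, subset=["age"])   # only a null age removes a row

print(drop_rows)
print(drop_cols)
print(drop_subset)
```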

iloc[] and loc[]

# iloc selects by integer position
first_row_first_column = new_titanic_survival.iloc[0,0]
all_rows_first_three_columns = new_titanic_survival.iloc[:,0:3]
# loc selects by row label and column name
row_index_83_age = new_titanic_survival.loc[83,"age"]
row_index_766_pclass = new_titanic_survival.loc[766,"pclass"]

Resetting the row index

titanic_reindexed = new_titanic_survival.reset_index(drop=True)  # drop=True discards the old row labels instead of keeping them as a column

DataFrame.apply(): define a function and run it on every column (or row)

import pandas as pd

# Count the null values in each column
def null_count(column):
    column_null = pd.isnull(column)
    null = column[column_null]
    return len(null)

column_null_count = titanic_survival.apply(null_count)

# Label each passenger by age
def generate_age_label(row):
    age = row["age"]
    if pd.isnull(age):
        return "unknown"
    elif age < 18:
        return "minor"
    else:
        return "adult"

age_labels = titanic_survival.apply(generate_age_label, axis=1)  # axis=1 runs the function on every row
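The two uses of apply() differ in what the function receives: a whole column by default, a whole row with axis=1. The sketch below runs both on a hypothetical three-row DataFrame so the shape of each result can be inspected.

```python
import pandas as pd

# Hypothetical data with one missing age
titanic_survival = pd.DataFrame({
    "age": [10.0, None, 40.0],
    "fare": [7.25, 8.05, 30.0],
})

def null_count(column):
    # column is a Series holding one whole column
    return int(pd.isnull(column).sum())

def generate_age_label(row):
    # row is a Series holding one whole row
    age = row["age"]
    if pd.isnull(age):
        return "unknown"
    elif age < 18:
        return "minor"
    else:
        return "adult"

column_null_count = titanic_survival.apply(null_count)            # one result per column
age_labels = titanic_survival.apply(generate_age_label, axis=1)   # one result per row

print(column_null_count)
print(age_labels)
```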

Series: the core data structure that pandas uses to represent rows and columns.
A Series is similar to a NumPy vector, except that a Series can also use non-integer labels.

from pandas import Series
series_custom = Series(rt_scores, index = film_names)
fiveten = series_custom[5:11]  # a Series can still be sliced by integer position

Sorting a Series

sorted_by_index = series_custom.reindex(sorted_index)  # reorder series_custom to follow the order of the labels in sorted_index
sc2 = series_custom.sort_index()  # sort by index label
sc3 = series_custom.sort_values()  # sort by value
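The three calls above are easier to tell apart on a Series small enough to read at a glance. In this sketch the film names and scores are placeholders, and sorted_index is assumed to be the alphabetically sorted list of labels.

```python
from pandas import Series

# Placeholder labels and scores, not real ratings
film_names = ["Film C", "Film A", "Film B"]
rt_scores = [54, 80, 85]
series_custom = Series(rt_scores, index=film_names)

# Assumption: sorted_index is the list of labels in alphabetical order
sorted_index = sorted(series_custom.index.tolist())

sorted_by_index = series_custom.reindex(sorted_index)  # follow an explicit label order
sc2 = series_custom.sort_index()                       # same order, computed by pandas
sc3 = series_custom.sort_values()                      # order by score instead

print(sorted_by_index)
print(sc3)
```

reindex() is the more general tool: it accepts any label order, while sort_index() always sorts.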

Computing with Series

np.add(series_custom, series_custom)  # doubles every value
np.sin(series_custom)  # sine of every value
np.max(series_custom)  # largest value

Comparing And Filtering

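series_custom > 50  # returns a Series of True/False values
series_greater_than_50 = series_custom[series_custom > 50]  # combine conditions with | and &, each condition in parentheses; Python's or/and do not work on a Series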
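Combining conditions is where filtering most often goes wrong. The sketch below, on a placeholder Series, shows the element-wise operators & and | and confirms that Python's and raises an error instead of filtering.

```python
from pandas import Series

# Placeholder labels and scores
series_custom = Series([30, 55, 80, 95], index=["Film A", "Film B", "Film C", "Film D"])

# Each condition needs its own parentheses because & and | bind tighter than comparisons
between = series_custom[(series_custom > 50) & (series_custom < 90)]
extremes = series_custom[(series_custom < 40) | (series_custom > 90)]

# Python's "and" asks for a single True/False from a whole Series, which is ambiguous
try:
    series_custom[(series_custom > 50) and (series_custom < 90)]
    and_worked = True
except ValueError:
    and_worked = False

print(between)
print(extremes)
print(and_worked)
```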

Selecting rows
iloc[] accepts any of the following:
An integer, e.g. 5.
A list or array of integers, e.g. [4, 3, 0].
A slice object with ints, e.g. 1:7.
A boolean array.

fandango[0:5]  # the first 5 rows
fandango[140:]  # every row from position 140 onward
fandango.iloc[50]  # the row at position 50
fandango.iloc[[45,90]]  # the rows at positions 45 and 90

set_index()
Uses one of the columns as the row labels.
Parameters: inplace=True modifies the DataFrame directly;
            drop=False keeps the column that became the index as a regular column.

fandango_films = fandango.set_index('FILM', drop = False)
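The effect of drop is easiest to see by comparing both settings. This sketch uses a hypothetical two-row stand-in for fandango; the RATING column is made up for illustration.

```python
import pandas

# Hypothetical miniature version of the fandango DataFrame
fandango = pandas.DataFrame({
    "FILM": ["Film A", "Film B"],
    "RATING": [4.5, 3.0],
})

kept = fandango.set_index("FILM", drop=False)   # FILM is both the index and a column
dropped = fandango.set_index("FILM")            # default drop=True: FILM is only the index

# With film names as the index, rows can be selected by name
print(kept.loc["Film B", "RATING"])
print(list(kept.columns))
print(list(dropped.columns))
```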

pandas.Series.value_counts()

data["Do you celebrate Thanksgiving?"].value_counts()
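value_counts() returns how many times each distinct value occurs, most frequent first. A minimal sketch with made-up survey answers, including the dropna=False option for counting missing answers as well:

```python
import pandas

# Made-up answers; None represents a skipped question
answers = pandas.Series(["Yes", "No", "Yes", "Yes", None])

counts = answers.value_counts()                    # missing values are ignored by default
with_missing = answers.value_counts(dropna=False)  # missing values get their own entry

print(counts)
print(with_missing)
```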
