A list of lists is only suitable for small data sets, because every list is held in memory and memory is limited.
The NumPy library copes better with large data sets.
NumPy
Official documentation: http://www.numpy.org/
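- Convert a list of lists into an ndarray
import csv
import numpy as np
with open("nyc_taxis.csv", "r") as f:
    taxi_list = list(csv.reader(f))
# Drop the header row and convert every value to float (assumes all columns are numeric)
converted_taxi_list = [[float(item) for item in row] for row in taxi_list[1:]]
taxi = np.array(converted_taxi_list)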
- Select from ndarray
ndarray[row_index, column_index]: the first index selects rows, the second selects columns
# Select a row from an ndarray
second_row = taxi[1] # the second row
# Select multiple rows from an ndarray
all_but_first_row = taxi[1:] # every row from the second one onward
# Selecting a specific item from an ndarray
fifth_row_second_column = taxi[4,1] # the element in row 5, column 2
# Selecting a single column
second_column = taxi[:,1] # a bare : before the comma means "all rows"
# Selecting multiple columns
second_third_columns = taxi[:,1:3] # 1:3 is a range that excludes the column at index 3
cols = [1,3,5]
second_fourth_sixth_columns = taxi[:, cols] # taxi[:,[1,3,5]] also works and returns the same 3 columns
# Selecting a 2D slice
twod_slice = taxi[1:4, :3] # rows 2 to 4 and columns 1 to 3 (the end index of each slice is excluded)
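The selection patterns above can be tried without the taxi file. The following is a minimal sketch on a made-up 4x4 array (`demo` and the variable names are hypothetical, not part of the original notes):

```python
import numpy as np

# Hypothetical 4x4 array standing in for the taxi data
demo = np.arange(16).reshape(4, 4)

second_row = demo[1]              # single row -> 1D array
item = demo[2, 1]                 # element at row index 2, column index 1
second_column = demo[:, 1]        # every row, one column
picked_columns = demo[:, [1, 3]]  # columns chosen by a list of indices
block = demo[1:3, :2]             # row indices 1-2, column indices 0-1 (end excluded)

print(second_row, item, second_column, picked_columns, block, sep="\n")
```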
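- Vector math with ndarrays
Prerequisite: vector_a.shape == vector_b.shape (or shapes that NumPy can broadcast)
vector_a + vector_b: Addition
vector_a - vector_b: Subtraction
vector_a * vector_b: Multiplication (this is unrelated to the vector multiplication used in linear algebra).
vector_a / vector_b: Division
vector_a operator value: combining an ndarray with a single value applies the operation to every element of the array (this includes comparison operators, which return a Boolean array)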
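As an illustration of the element-wise operations listed above, here is a small sketch with two made-up vectors of the same shape (the values are arbitrary):

```python
import numpy as np

vector_a = np.array([1.0, 2.0, 3.0])
vector_b = np.array([10.0, 20.0, 30.0])

added = vector_a + vector_b        # element-wise addition
multiplied = vector_a * vector_b   # element-wise product, not a dot product
doubled = vector_a * 2             # the single value is applied to every element
above_one = vector_a > 1           # comparison returns a Boolean array

print(added, multiplied, doubled, above_one, sep="\n")
```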
- CALCULATING STATISTICS FOR 1D NDARRAYS
ndarray.min() to calculate the minimum value
ndarray.max() to calculate the maximum value
ndarray.mean() to calculate the mean average value
ndarray.sum() to calculate the sum of the values
# Max value for an entire 2D Ndarray:
taxi.max()
# Max value for each row in a 2D Ndarray (returns a 1D Ndarray):
taxi.max(axis=1)
# Max value for each column in a 2D Ndarray (returns a 1D Ndarray):
taxi.max(axis=0)
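The effect of the axis argument is easiest to see on a tiny array. A minimal sketch, assuming a made-up 2x3 array `m`:

```python
import numpy as np

# Hypothetical 2x3 array: 2 rows, 3 columns
m = np.array([[1, 5, 3],
              [4, 2, 6]])

overall_max = m.max()         # one value for the whole array
row_max = m.max(axis=1)       # one value per row
column_max = m.max(axis=0)    # one value per column
column_mean = m.mean(axis=0)  # the other statistics accept axis the same way

print(overall_max, row_max, column_max, column_mean, sep="\n")
```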
- Boolean Indexing with NumPy
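NumPy ndarrays can contain only one datatype, so before converting a list of lists with np.array() every element has to be converted to the same datatype. numpy.genfromtxt() lets you skip that step: by default it parses every field as a float, and any value that cannot be converted to a number becomes NaN.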
READING CSV FILES WITH NUMPY
import numpy as np
taxi = np.genfromtxt('nyc_taxis.csv', delimiter=',', skip_header=1)
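To see how genfromtxt() treats text it cannot parse, the sketch below reads a small in-memory CSV instead of a file. The CSV content and the use of io.StringIO as a stand-in for a file are assumptions made for illustration:

```python
import io
import numpy as np

# Made-up CSV text; io.StringIO stands in for a real file on disk
csv_text = "id,name,fare\n1,a,10.5\n2,b,7.0\n"

data = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

print(data.dtype)   # float64: every field is parsed as a float by default
print(data)         # the text column cannot be converted, so it holds nan
```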
BOOLEAN ARRAYS
# Creating a Boolean array from filtering criteria:
np.array([2,4,6,8]) < 5
# Boolean filtering for 1D ndarray:
a = np.array([2,4,6,8])
filter = a < 5
a[filter]
# Boolean filtering for 2D ndarray:
tip_amount = taxi[:,12]
tip_bool = tip_amount > 50
top_tips = taxi[tip_bool, 5:14]
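The 2D case combines a Boolean array for the rows with an ordinary slice for the columns. A minimal sketch on made-up data (the array `trips` and its column layout are hypothetical):

```python
import numpy as np

# Hypothetical columns: trip id, fare, tip
trips = np.array([[1.0, 12.0, 2.0],
                  [2.0, 60.0, 15.0],
                  [3.0, 30.0, 55.0]])

tip_amount = trips[:, 2]
tip_bool = tip_amount > 10       # one True/False per row
big_tips = trips[tip_bool, :2]   # keep matching rows, first two columns only

print(tip_bool)
print(big_tips)
```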
ASSIGNING VALUES
In the Boolean examples below, the definition of the Boolean array is inserted directly into the selection. This "shortcut" is the conventional way to write Boolean indexing: the Boolean filter goes straight inside the square brackets.
# Assigning values in a 2D ndarray using indices:
taxi[28214,5] = 1
taxi[:,0] = 16
taxi[1800:1802,7] = taxi[:,7].mean()
# Assigning values using Boolean arrays:
taxi[taxi[:, 5] == 2, 15] = 1
jfk = taxi[taxi[:,6] == 2]
jfk_count = jfk.shape[0]
laguardia = taxi[taxi[:,6] == 3]
laguardia_count = laguardia.shape[0]
newark = taxi[taxi[:,6] == 5] # rows where the column at index 6 equals 5
newark_count = newark.shape[0] # number of rows
trip_mph = taxi[:,7] / (taxi[:,8] / 3600) # builds a 1D array from the columns at index 7 and 8
cleaned_taxi = taxi[trip_mph < 100]
mean_distance = cleaned_taxi[:, 7].mean()
mean_length = cleaned_taxi[:, 8].mean()
mean_total_amount = cleaned_taxi[:, 13].mean()
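The filter-then-aggregate pattern above can be sketched on a few made-up trips. The column layout (distance in miles, duration in seconds) is an assumption for this example only:

```python
import numpy as np

# Hypothetical columns: distance in miles, duration in seconds
trips = np.array([[1.0, 600.0],
                  [50.0, 900.0],    # 50 miles in 15 minutes: implausible
                  [3.0, 1200.0]])

trip_mph = trips[:, 0] / (trips[:, 1] / 3600)

# Assign through a Boolean array: flag the implausible rows
flagged = np.zeros(len(trips))
flagged[trip_mph >= 100] = 1

# Keep only plausible rows, then aggregate one column
cleaned = trips[trip_mph < 100]
mean_distance = cleaned[:, 0].mean()

print(trip_mph, flagged, mean_distance, sep="\n")
```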
Resource:
Reading a CSV file into NumPy
Indexing and selecting data
Shortcomings of NumPy:
- The lack of support for column names forces us to frame questions as multi-dimensional array operations.
- Support for only one data type per ndarray makes it more difficult to work with data that contains both numeric and string data.
- There are lots of low level methods, but there are many common analysis patterns that don’t have pre-built methods.
Pandas
Reference: official documentation
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
Pandas is not so much a replacement for NumPy as an extension of NumPy. The underlying code for pandas uses the NumPy library extensively, which means the concepts you’ve been learning will come in handy as you begin to learn more about pandas.
The primary data structure in pandas is called a dataframe. Dataframes are the pandas equivalent of a NumPy 2D ndarray, with a few key differences:
Axis values can have string labels, not just numeric ones.
Dataframes can contain columns with multiple data types: including integer, float, and string.
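Both differences can be seen in a tiny dataframe. This is a sketch with invented labels and values (`company_a`, `company_b`, and the column names are hypothetical):

```python
import pandas as pd

# Made-up data: string row labels and columns of different types
df = pd.DataFrame(
    {
        "country": ["USA", "China"],
        "employees": [1200, 3400],
        "margin": [0.12, 0.08],
    },
    index=["company_a", "company_b"],
)

print(df.dtypes)                        # one dtype per column
print(df.loc["company_a", "country"])   # lookup by string labels
```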
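pandas.read_csv()
Reference: the link in the heading above
Purpose: read a CSV file so that row and column values can be extracted
Returns: a dataframe
Pay attention to how the method's parameters are used, for example: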
index_col : int, str, sequence of int / str, or False, default None
Column(s) to use as the row labels of the DataFrame, either given as string name or column index. If a sequence of int / str is given, a MultiIndex is used.
Note: index_col=False can be used to force pandas to not use the first column as the index, e.g. when you have a malformed file with delimiters at the end of each line.
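The effect of index_col can be checked on an in-memory CSV. A minimal sketch, where the CSV text and column names are made up and io.StringIO stands in for a file path:

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file
csv_text = "id,title,points\n101,First post,5\n102,Second post,3\n"

# First column becomes the row labels
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# No index_col: pandas assigns 0, 1, 2, ... as row labels
default_index = pd.read_csv(io.StringIO(csv_text))

print(with_index)
print(default_index)
```

With index_col=0 the id column moves into the index, so the dataframe has one column fewer than the default read.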
Experiment:
Data: z5jw
- Series is the pandas type for one-dimensional objects. Anytime you see a 1D pandas object, it will be a series. Anytime you see a 2D pandas object, it will be a dataframe
import pandas as pd # import convention: alias the package as pd
# pandas.core.frame.DataFrame
eg_data_0 = pd.read_csv("/Personal Source-HJL/Python/test_data/HN_posts_year_to_Sep_26_2016.csv", index_col=0) # use the first column (id) as the row labels
eg_data_1 = pd.read_csv("/Personal Source-HJL/Python/test_data/HN_posts_year_to_Sep_26_2016.csv") # no row labels specified, so pandas assigns row numbers (0, 1, 2, ...) as labels
#dataframe.info(),dataframe.shape, dataframe.head(), dataframe.tail()
eg_data_0.info() # info() prints its report itself and returns None, so it is not wrapped in print()
print(eg_data_0.shape,'\n')
print(eg_data_0.head(2),'\n')
eg_data_1.info() # info() prints its report itself and returns None, so it is not wrapped in print()
print(eg_data_1.shape,'\n')
print(eg_data_1.head(2),'\n')
# pandas.core.series.Series : choose a column
title = eg_data_0["title"]
# type(series), series.shape, series.head(), series.tail()
print(type(title))
print(title.shape,'\n')
print(title.head(2),'\n')
print(title.tail(2),'\n')
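print(title.value_counts(),'\n') # counts of each unique non-null value in the column, most frequent first; originally Series-only (DataFrame.value_counts() was added in pandas 1.1)
print(type(title.value_counts()),'\n') # the result is itself a Series, so you can select just the entries you need from it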
print(title.value_counts()[["Employee benefits at Basecamp","How to build stable systems"]],'\n')
# pandas.core.frame.DataFrame : choose 2 rows (the label slice 1:2 includes both ends, and the result is a dataframe, not a series)
row_1 = eg_data_1.loc[1:2]
print(type(row_1),'\n')
print(row_1.shape,'\n')
print(row_1)
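The same checks can be reproduced without the Hacker News file. The sketch below uses made-up values to show that value_counts() returns a Series that can be indexed by label, and that a label slice with loc returns a dataframe that includes both end labels:

```python
import pandas as pd

# Made-up titles with repeats
titles = pd.Series(["a", "b", "a", "c", "a", "b"])
counts = titles.value_counts()       # a Series indexed by the unique values
selected = counts[["a", "b"]]        # pick specific entries by label

# Made-up dataframe with the default 0, 1, 2 row labels
df = pd.DataFrame({"x": [10, 20, 30]})
rows = df.loc[1:2]                   # label slice: both 1 and 2 are included

print(type(counts), selected.tolist())
print(type(rows), rows.shape)
```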
- Different label selection methods
loc[row_label, column_label]
Select by Label | Explicit Syntax | Shorthand Convention |
---|---|---|
Single column from dataframe | df.loc[:,"col1"] | df["col1"] |
List of columns from dataframe | df.loc[:,["col1","col7"]] | df[["col1","col7"]] |
Slice of columns from dataframe | df.loc[:,"col1":"col4"] | |
Single row from dataframe | df.loc["row4"] | |
List of rows from dataframe | df.loc[["row1", "row8"]] | |
Slice of rows from dataframe | df.loc["row3":"row5"] | df["row3":"row5"] |
Single item from series | s.loc["item8"] | s["item8"] |
List of items from series | s.loc[["item1","item7"]] | s[["item1","item7"]] |
Slice of items from series | s.loc["item2":"item4"] | s["item2":"item4"] |
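A few rows of the table can be verified on a small dataframe. This is a sketch with invented labels (`row1`..`row3`, `col1`..`col3`); note that label slices include the end label:

```python
import pandas as pd

df = pd.DataFrame(
    {"col1": [1, 2, 3], "col2": [4, 5, 6], "col3": [7, 8, 9]},
    index=["row1", "row2", "row3"],
)

explicit = df.loc[:, "col1"]             # explicit syntax
shorthand = df["col1"]                   # shorthand for the same column
column_slice = df.loc[:, "col1":"col2"]  # slice of columns, end label included
single_row = df.loc["row2"]              # one row comes back as a Series
row_slice = df.loc["row1":"row2"]        # slice of rows, end label included

print(explicit.equals(shorthand))
print(list(column_slice.columns), single_row.tolist(), row_slice.shape)
```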
- DATA EXPLORATION METHODS
# Describing a series object:
revs = f500["revenues"]
summary_stats = revs.describe()
# Unique value counts for a column:
country_freqs = f500['country'].value_counts()
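Since the f500 data is not included in these notes, here is a minimal sketch of both exploration methods on made-up series (the values are arbitrary):

```python
import pandas as pd

# Made-up numeric column
revs = pd.Series([10.0, 20.0, 30.0, 40.0])
summary_stats = revs.describe()   # a Series with count, mean, std, min, quartiles, max

# Made-up text column
countries = pd.Series(["USA", "China", "USA"])
country_freqs = countries.value_counts()   # most frequent value first

print(summary_stats)
print(country_freqs)
```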
- ASSIGNMENT WITH PANDAS
#Creating a new column:
top5_rank_revenue["year_founded"] = 0
#Replacing a specific value in a dataframe:
f500.loc["Dow Chemical","ceo"] = "Jim Fitterling"
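Both kinds of assignment can be tried on a small stand-in dataframe. The labels and values below (`co1`, `co2`, the CEO names) are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({"ceo": ["A. Smith", "B. Jones"]}, index=["co1", "co2"])

# Creating a new column: the single value is repeated for every row
df["year_founded"] = 0

# Replacing one specific value, addressed by row label and column label
df.loc["co2", "ceo"] = "C. Lee"

print(df)
```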
- BOOLEAN INDEXING IN PANDAS
# Filtering a dataframe down on a specific value in a column:
kr_bool = f500["country"] == "South Korea"
top_5_kr = f500[kr_bool].head()
# Updating values using Boolean filtering:
import numpy as np
prev_rank_before = f500["previous_rank"].value_counts(dropna=False).head()
f500.loc[f500["previous_rank"] == 0, "previous_rank"] = np.nan
prev_rank_after = f500["previous_rank"].value_counts(dropna=False).head()
f500["rank_change"] = f500["previous_rank"] - f500["rank"]
rank_change_desc = f500["rank_change"].describe()
top_3_countries = f500["country"].value_counts().head(3)
industry_usa = f500.loc[f500["country"] == "USA","industry"].value_counts().head(2)
sector_china = f500.loc[f500["country"] == "China","sector"].value_counts().head(3)
mean_employees_japan = f500.loc[f500["country"] == "Japan","employees"].mean()
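The steps above can be sketched end to end on a made-up dataframe with the same column names. The values are invented, and previous_rank is stored as float here so that NaN fits in the column without a dtype change:

```python
import numpy as np
import pandas as pd

# Made-up stand-in for f500
df = pd.DataFrame({
    "country": ["USA", "Japan", "USA"],
    "rank": [1, 2, 3],
    "previous_rank": [2.0, 0.0, 1.0],   # 0 means "not ranked before"
    "employees": [100, 200, 300],
})

# Replace the placeholder 0 with NaN using a Boolean filter
df.loc[df["previous_rank"] == 0, "previous_rank"] = np.nan

# NaN propagates through arithmetic, so unranked rows get no rank change
df["rank_change"] = df["previous_rank"] - df["rank"]

# Filter rows and pick a column in one loc call, then aggregate
mean_employees_usa = df.loc[df["country"] == "USA", "employees"].mean()
top_country = df["country"].value_counts().head(1)

print(df)
print(mean_employees_usa, top_country.index[0])
```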
Reference: https://pandas.pydata.org/pandas-docs/stable/user_guide/gotchas.html
https://stackoverflow.com/questions/tagged/pandas