Python NumPy & Pandas

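A list of lists is only suitable for small data sets: every element is a separate Python object held in memory, so memory runs out quickly.
The NumPy library handles large data sets better, because an ndarray stores its values compactly and operates on them in bulk.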

NumPy

Official documentation: http://www.numpy.org/

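  • Convert a list of lists into an ndarray
import csv
import numpy as np

with open("nyc_taxis.csv", "r") as f:
    taxi_list = list(csv.reader(f))
# Assumes a header row and all-numeric values: an ndarray holds a single datatype
converted_taxi_list = [[float(item) for item in row] for row in taxi_list[1:]]
taxi = np.array(converted_taxi_list)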
  • Select from ndarray

ndarray[row_index, column_index]: the first index selects rows, the second selects columns

# Select a row from an ndarray
second_row = taxi[1] # the second row

# Select multiple rows from an ndarray
all_but_first_row = taxi[1:] # every row from the second row onwards

# Selecting a specific item from an ndarray
fifth_row_second_column = taxi[4,1] # the element in row 5, column 2

# Selecting a single column
second_column = taxi[:,1] # a bare : before the comma means "all rows"

# Selecting multiple columns

second_third_columns = taxi[:,1:3] # 1:3 is a range that excludes the column at index 3
cols = [1,3,5]
second_fourth_sixth_columns = taxi[:, cols] # taxi[:,[1,3,5]] selects the same three columns directly

# Selecting a 2D slice
twod_slice = taxi[1:4, :3] # rows 2 to 4 and columns 1 to 3 (slice end indexes are excluded)
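
The selection patterns above can be tried on a small made-up array. This is only a sketch: `demo` and its values are illustrative stand-ins, not part of the taxi data.

```python
import numpy as np

# Hypothetical 4x4 array standing in for the taxi data
demo = np.arange(16).reshape(4, 4)

second_row = demo[1]              # 1D array: [4 5 6 7]
item = demo[2, 1]                 # single element: 9
second_column = demo[:, 1]        # 1D array: [1 5 9 13]
picked_columns = demo[:, [0, 3]]  # 2D array holding columns 0 and 3
block = demo[1:3, :2]             # rows 1-2 and columns 0-1

print(block)
```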

  • Vector operations on ndarrays
    Prerequisite: a.shape == b.shape

vector_a + vector_b: Addition
vector_a - vector_b: Subtraction
vector_a * vector_b: Multiplication (this is unrelated to the vector multiplication used in linear algebra).
vector_a / vector_b: Division

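vector_a operator b : # adding, subtracting, multiplying or dividing an ndarray by a single value applies the operation to every element of the array (comparison operations work the same way)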
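
A minimal sketch of these element-wise operations, using two short made-up vectors of the same shape:

```python
import numpy as np

vector_a = np.array([10, 20, 30])
vector_b = np.array([1, 2, 3])

total = vector_a + vector_b     # element-wise: [11 22 33]
product = vector_a * vector_b   # element-wise, not a dot product: [10 40 90]
halved = vector_a / 2           # the single value is applied to every element
is_large = vector_a > 15        # comparisons are element-wise too

print(total, product, halved, is_large)
```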


  • CALCULATING STATISTICS FOR 1D NDARRAYS

ndarray.min() to calculate the minimum value
ndarray.max() to calculate the maximum value
ndarray.mean() to calculate the mean average value
ndarray.sum() to calculate the sum of the values

# Max value for an entire 2D Ndarray:

taxi.max()

# Max value for each row in a 2D Ndarray (returns a 1D Ndarray):

taxi.max(axis=1)

# Max value for each column in a 2D Ndarray (returns a 1D Ndarray):

taxi.max(axis=0)
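
To see what the axis argument does, here is a small sketch on a made-up 2x3 array (`demo` is illustrative only):

```python
import numpy as np

demo = np.array([[1, 8, 3],
                 [7, 2, 9]])

overall_max = demo.max()       # a single value for the whole array
row_max = demo.max(axis=1)     # one value per row: [8 9]
col_max = demo.max(axis=0)     # one value per column: [7 8 9]
col_mean = demo.mean(axis=0)   # the other statistics accept axis as well

print(overall_max, row_max, col_max, col_mean)
```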
  • Boolean Indexing with NumPy

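NumPy ndarrays can contain only one datatype, so before np.array() converts a list of lists into an ndarray, every element has to be converted to the same datatype. numpy.genfromtxt() lets you skip that step: it chooses a data type itself (float by default), and any value that cannot be converted to a number is shown as NaN.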

READING CSV FILES WITH NUMPY

numpy.genfromtxt()

import numpy as np
taxi = np.genfromtxt('nyctaxis.csv', delimiter=',', skip_header=1)
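
The NaN behaviour can be checked without the real file. In this sketch a StringIO object stands in for the CSV file, and the column names and values are made up for illustration:

```python
from io import StringIO
import numpy as np

# Hypothetical CSV content; with a real file, pass the file name instead
csv_text = StringIO("pickup_year,fare,payment\n2016,12.5,cash\n2016,8.0,card\n")

parsed = np.genfromtxt(csv_text, delimiter=",", skip_header=1)

print(parsed)        # the text column cannot be read as a number, so it becomes nan
print(parsed.dtype)
```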

BOOLEAN ARRAYS

# Creating a Boolean array from filtering criteria:

np.array([2,4,6,8]) < 5

# Boolean filtering for 1D ndarray:

a = np.array([2,4,6,8])
filter = a < 5
a[filter]

# Boolean filtering for 2D ndarray:

tip_amount = taxi[:,12] 
tip_bool = tip_amount > 50
top_tips = taxi[tip_bool, 5:14]
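
The same 2D filtering pattern on a tiny made-up array, as a sketch (the column meanings are assumptions chosen for illustration):

```python
import numpy as np

# Hypothetical columns: trip id, distance, tip
trips = np.array([[1, 2.0, 0.0],
                  [2, 9.5, 60.0],
                  [3, 4.1, 5.0],
                  [4, 20.3, 75.0]])

big_tip = trips[:, 2] > 50          # Boolean array with one entry per row
big_tip_rows = trips[big_tip, :2]   # matching rows, first two columns only

print(big_tip)
print(big_tip_rows)
```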

ASSIGNING VALUES

In the Boolean-array assignments below, the definition of the boolean array is inserted directly into the selection. This “shortcut” is the conventional way to write boolean indexing: the boolean filter is written directly inside the square brackets.

# Assigning values in a 2D ndarray using indices:

taxi[28214,5] = 1
taxi[:,0] = 16
taxi[1800:1802,7] = taxi[:,7].mean()

# Assigning values using Boolean arrays:

taxi[taxi[:, 5] == 2, 15] = 1
jfk = taxi[taxi[:,6] == 2]
jfk_count = jfk.shape[0]

laguardia = taxi[taxi[:,6] == 3]
laguardia_count = laguardia.shape[0]

newark = taxi[taxi[:,6] == 5] # rows whose value in the column at index 6 equals 5
newark_count = newark.shape[0] # number of rows

trip_mph = taxi[:,7] / (taxi[:,8] / 3600) # build a 1D array from the columns at index 7 and 8

cleaned_taxi = taxi[trip_mph < 100]
mean_distance = cleaned_taxi[:, 7].mean()

mean_length = cleaned_taxi[:, 8].mean()

mean_total_amount = cleaned_taxi[:, 13].mean()
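
A compact sketch of the same workflow (compute a derived array, assign through a Boolean filter, then keep only plausible rows). The array and its column layout are made up for illustration:

```python
import numpy as np

# Hypothetical columns: distance in miles, duration in seconds, flag
trips = np.array([[1.0, 600.0, 0.0],
                  [5.0, 60.0, 0.0],
                  [3.0, 900.0, 0.0]])

mph = trips[:, 0] / (trips[:, 1] / 3600)   # roughly [6, 300, 12]
trips[mph > 100, 2] = 1                    # flag implausibly fast trips in place
cleaned = trips[mph < 100]                 # keep only the plausible rows

print(cleaned[:, 0].mean())                # mean distance of the remaining trips
```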

Resource:

Reading a CSV file into NumPy
Indexing and selecting data

Shortcomings of NumPy:

  • The lack of support for column names forces us to frame questions as multi-dimensional array operations.
  • Support for only one data type per ndarray makes it more difficult to work with data that contains both numeric and string data.
  • There are lots of low level methods, but there are many common analysis patterns that don’t have pre-built methods.
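
The single-datatype limitation is easy to reproduce. In this sketch (with made-up values), mixing text and numbers makes NumPy store everything as strings:

```python
import numpy as np

# Hypothetical rows mixing a name and a number
mixed = np.array([["alpha", 10], ["beta", 20]])

print(mixed.dtype.kind)   # 'U' means the whole array is unicode text
print(mixed[0, 1])        # the number 10 is now the string '10'
```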

Pandas

Reference: official documentation

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language

Pandas is not so much a replacement for NumPy as an extension of NumPy. The underlying code for pandas uses the NumPy library extensively, which means the concepts you’ve been learning will come in handy as you begin to learn more about pandas

The primary data structure in pandas is called a dataframe. Dataframes are the pandas equivalent of a NumPy 2D ndarray, with a few key differences:

Axis values can have string labels, not just numeric ones.
Dataframes can contain columns with multiple data types: including integer, float, and string.
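
Both differences show up in a small sketch. The labels and values here are invented for illustration:

```python
import pandas as pd

# Hypothetical data: string row labels, and a different datatype in each column
df = pd.DataFrame(
    {"employees": [100, 250], "revenue": [1.5, 3.25], "country": ["USA", "Japan"]},
    index=["alpha", "beta"],
)

print(df.dtypes)                   # one datatype per column
print(df.loc["beta", "country"])   # select by row label and column label
```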


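pandas.read_csv()
Reference: the official documentation for this function
Purpose: read a CSV file and extract row and column values
Returns: a dataframe
Pay attention to how the function's parameters are used, for example: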

index_col : int, str, sequence of int / str, or False, default None
    Column(s) to use as the row labels of the DataFrame, either given as string name or column index. If a sequence of int / str is given, a MultiIndex is used.
    Note: index_col=False can be used to force pandas to not use the first column as the index, e.g. when you have a malformed file with delimiters at the end of each line.
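
A short sketch of what index_col changes. The CSV text below is made up and read from memory via StringIO instead of from a file on disk:

```python
from io import StringIO
import pandas as pd

# Hypothetical CSV text used in place of a real file
csv_text = "id,title,points\n11,First post,5\n12,Second post,9\n"

by_id = pd.read_csv(StringIO(csv_text), index_col=0)   # first column becomes the row labels
default = pd.read_csv(StringIO(csv_text))              # rows are labelled 0, 1, 2, ...

print(by_id.index.tolist())     # [11, 12]
print(default.index.tolist())   # [0, 1]
```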

Experiment:

Data: z5jw


  • Series is the pandas type for one-dimensional objects. Anytime you see a 1D pandas object, it will be a series. Anytime you see a 2D pandas object, it will be a dataframe

Series

DataFrame

import pandas as pd # by convention pandas is imported under the alias pd

# pandas.core.frame.DataFrame
eg_data_0 = pd.read_csv("/Personal Source-HJL/Python/test_data/HN_posts_year_to_Sep_26_2016.csv", index_col=0) # use the first column (id) as the row labels
eg_data_1 = pd.read_csv("/Personal Source-HJL/Python/test_data/HN_posts_year_to_Sep_26_2016.csv") # no row labels given, so rows are numbered automatically (0, 1, 2, ...)

# dataframe.info(), dataframe.shape, dataframe.head(), dataframe.tail()
eg_data_0.info() # info() prints its report itself and returns None, so it is not wrapped in print()
print(eg_data_0.shape,'\n')
print(eg_data_0.head(2),'\n')

eg_data_1.info()
print(eg_data_1.shape,'\n')
print(eg_data_1.head(2),'\n')

# pandas.core.series.Series : choose a column

title = eg_data_0["title"]
# type(series), series.shape, series.head(), series.tail()
print(type(title))
print(title.shape,'\n')
print(title.head(2),'\n')
print(title.tail(2),'\n')

print(title.value_counts(),'\n') # counts of each unique non-null value in the column, most frequent first (DataFrame.value_counts() only exists in pandas 1.1 and later)
print(type(title.value_counts()),'\n') # the result is a Series, so you can select what you need from it
print(title.value_counts()[["Employee benefits at Basecamp","How to build stable systems"]],'\n')

# pandas.core.frame.DataFrame : choose 2 rows (a loc slice includes both end labels)
row_1 = eg_data_1.loc[1:2]

print(type(row_1),'\n')
print(row_1.shape,'\n')
print(row_1)

  • Different label selection methods
    loc[row_label, column_label]

Select by Label                   Explicit Syntax               Shorthand Convention
Single column from dataframe      df.loc[:,"col1"]              df["col1"]
List of columns from dataframe    df.loc[:,["col1","col7"]]     df[["col1","col7"]]
Slice of columns from dataframe   df.loc[:,"col1":"col4"]
Single row from dataframe         df.loc["row4"]
List of rows from dataframe       df.loc[["row1", "row8"]]
Slice of rows from dataframe      df.loc["row3":"row5"]         df["row3":"row5"]
Single item from series           s.loc["item8"]                s["item8"]
List of items from series         s.loc[["item1","item7"]]      s[["item1","item7"]]
Slice of items from series        s.loc["item2":"item4"]        s["item2":"item4"]
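
Several of the label selections listed above can be tried on a small dataframe. This is a sketch only; the row and column labels are invented to mirror the naming used in the list:

```python
import pandas as pd

# Hypothetical dataframe with string row and column labels
df = pd.DataFrame(
    {"col1": [1, 2, 3], "col2": [4, 5, 6], "col3": [7, 8, 9]},
    index=["row1", "row2", "row3"],
)

one_column = df["col1"]                   # Series
two_columns = df[["col1", "col3"]]        # DataFrame
column_slice = df.loc[:, "col1":"col2"]   # label slices include the end label
one_row = df.loc["row2"]                  # Series
row_slice = df["row1":"row2"]             # shorthand row slice, end label included

print(row_slice)
```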
  • DATA EXPLORATION METHODS
# Describing a series object:

revs = f500["revenues"]
summary_stats = revs.describe()

# Unique value counts for a column:

country_freqs = f500['country'].value_counts()
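
Since the f500 data is not included here, the two exploration methods can be sketched on short made-up series instead:

```python
import pandas as pd

# Hypothetical values standing in for two f500 columns
countries = pd.Series(["USA", "China", "USA", "Japan", "USA", "China"])
revenues = pd.Series([10.0, 20.0, 30.0, 40.0])

country_freqs = countries.value_counts()   # most frequent value first
summary_stats = revenues.describe()        # count, mean, std, min, quartiles, max

print(country_freqs)
print(summary_stats)
```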
  • ASSIGNMENT WITH PANDAS
#Creating a new column:

top5_rank_revenue["year_founded"] = 0

#Replacing a specific value in a dataframe:

f500.loc["Dow Chemical","ceo"] = "Jim Fitterling"
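
Both kinds of assignment, sketched on a made-up stand-in for the f500 dataframe (all names and values here are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the f500 dataframe
companies = pd.DataFrame(
    {"ceo": ["A. Smith", "B. Jones"], "rank": [1, 2]},
    index=["alpha", "beta"],
)

companies["year_founded"] = 0               # new column with the same value in every row
companies.loc["beta", "ceo"] = "C. Brown"   # replace one cell by row label and column label

print(companies)
```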
  • BOOLEAN INDEXING IN PANDAS
# Filtering a dataframe down on a specific value in a column:

kr_bool = f500["country"] == "South Korea"
top_5_kr = f500[kr_bool].head()

# Updating values using Boolean filtering:

import numpy as np

prev_rank_before = f500["previous_rank"].value_counts(dropna=False).head()

# Replace every previous_rank of 0 with NaN
f500.loc[f500["previous_rank"] == 0, "previous_rank"] = np.nan

prev_rank_after = f500["previous_rank"].value_counts(dropna=False).head()

f500["rank_change"] = f500["previous_rank"] - f500["rank"]

rank_change_desc = f500["rank_change"].describe()

top_3_countries = f500["country"].value_counts().head(3)
industry_usa = f500.loc[f500["country"] == "USA","industry"].value_counts().head(2)
sector_china = f500.loc[f500["country"] == "China","sector"].value_counts().head(3)
mean_employees_japan = f500.loc[f500["country"] == "Japan","employees"].mean()
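
The filtering and updating steps can be reproduced on a tiny made-up dataframe. This sketch assumes previous_rank is stored as float so that NaN can be assigned without changing the column's datatype:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for f500; previous_rank is float so it can hold NaN
companies = pd.DataFrame({
    "country": ["USA", "Japan", "USA", "China"],
    "previous_rank": [1.0, 0.0, 3.0, 0.0],
    "rank": [2, 1, 3, 4],
})

usa_only = companies[companies["country"] == "USA"]   # filter rows on a column value
companies.loc[companies["previous_rank"] == 0, "previous_rank"] = np.nan
companies["rank_change"] = companies["previous_rank"] - companies["rank"]

print(companies)
```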


Reference: https://pandas.pydata.org/pandas-docs/stable/user_guide/gotchas.html

https://stackoverflow.com/questions/tagged/pandas
