Python Numpy & Pandas

daoxu_hjl

Published 2021-03-27 20:52:10

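A list of lists is only practical for small data sets: every element is a separate Python object held in memory, and memory is limited.
The NumPy library handles large data sets much better, because an ndarray stores its values in one compact, typed block and operates on them with vectorized code.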

Numpy

Official documentation: http://www.numpy.org/

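Convert a list of lists into an ndarray

import csv
import numpy as np

# Read the CSV file into a list of lists (every value is still a string)
with open("nyc_taxis.csv", "r") as f:
    taxi_list = list(csv.reader(f))

# Assumes the first row is a header and all remaining values are numeric
converted_taxi_list = [[float(item) for item in row] for row in taxi_list[1:]]
taxi = np.array(converted_taxi_list)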

Select from ndarray

ndarray[row_index, column_index]: the first index selects rows, the second selects columns

# Select a row from an ndarray
second_row = taxi[1]  # the second row

# Select multiple rows from an ndarray
all_but_first_row = taxi[1:]  # every row from the second row onward

# Select a specific item from an ndarray
fifth_row_second_column = taxi[4, 1]  # the element in row 5, column 2

# Select a single column
second_column = taxi[:, 1]  # a bare ":" before the comma means "all rows"

# Select multiple columns
second_third_columns = taxi[:, 1:3]  # 1:3 is a range that excludes the column at index 3
cols = [1, 3, 5]
second_fourth_sixth_columns = taxi[:, cols]  # equivalent to taxi[:, [1, 3, 5]]

# Select a 2D slice
twod_slice = taxi[1:4, :3]  # rows 2 through 4 (index 4 excluded) and the first three columns (index 3 excluded)
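The selection patterns above can be tried on a small array without the taxi file. This is a minimal sketch; the 4x5 array `arr` is a made-up stand-in for the real data.

```python
import numpy as np

# Hypothetical 4x5 array standing in for the taxi data:
# row 0 is 0..4, row 1 is 5..9, row 2 is 10..14, row 3 is 15..19
arr = np.arange(20).reshape(4, 5)

second_row = arr[1]               # one row as a 1D array
item = arr[2, 3]                  # a single element (row index 2, column index 3)
second_column = arr[:, 1]         # one column as a 1D array
some_columns = arr[:, [0, 2, 4]]  # a list of column indices keeps all rows
block = arr[1:3, :2]              # rows 1-2 and columns 0-1

print(second_row, item, second_column, some_columns.shape, block, sep="\n")
```

Note that selecting a single row or column drops a dimension, while slices and index lists keep the result two-dimensional.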

Vector math with ndarrays
Prerequisite: a.shape == b.shape

vector_a + vector_b: Addition
vector_a - vector_b: Subtraction
vector_a * vector_b: Multiplication (element-wise; this is unrelated to the vector multiplication used in linear algebra).
vector_a / vector_b: Division

vector_a operator b: when b is a single value, the operation (arithmetic or comparison) is applied between that value and every element of the array
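A short sketch of both cases, array-with-array and array-with-scalar. The arrays `a` and `b` are invented for illustration.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])

total = a + b       # element-wise addition of two arrays with the same shape
product = a * b     # element-wise multiplication, not a dot product
scaled = a * 2      # the scalar is applied to every element
is_large = a > 1.5  # comparisons also work element-wise and return a Boolean array

print(total, product, scaled, is_large, sep="\n")
```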


CALCULATING STATISTICS FOR 1D NDARRAYS

ndarray.min() to calculate the minimum value
ndarray.max() to calculate the maximum value
ndarray.mean() to calculate the mean average value
ndarray.sum() to calculate the sum of the values

# Max value for an entire 2D Ndarray:

taxi.max()

# Max value for each row in a 2D Ndarray (returns a 1D Ndarray):

taxi.max(axis=1)

# Max value for each column in a 2D Ndarray (returns a 1D Ndarray):

taxi.max(axis=0)
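The `axis` argument is easy to mix up, so here is a small sketch on a made-up 2x3 array `m`: `axis=1` collapses the columns and gives one result per row, while `axis=0` collapses the rows and gives one result per column.

```python
import numpy as np

# Hypothetical 2x3 array
m = np.array([[1, 9, 3],
              [7, 2, 8]])

overall_max = m.max()      # one value for the whole array
row_max = m.max(axis=1)    # one value per row
col_max = m.max(axis=0)    # one value per column
col_mean = m.mean(axis=0)  # the same axis rule applies to mean(), min(), sum()

print(overall_max, row_max, col_max, col_mean, sep="\n")
```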

Boolean Indexing with NumPy

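NumPy ndarrays can contain only one datatype. Before using np.array() to convert a list of lists into an ndarray, you therefore have to convert all elements to the same datatype. numpy.genfromtxt() lets you skip that step: it converts the values for you (to float by default), and any value that cannot be converted to a number shows up as NaN.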

READING CSV FILES WITH NUMPY

numpy.genfromtxt()

import numpy as np
taxi = np.genfromtxt('nyc_taxis.csv', delimiter=',', skip_header=1)
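To see the NaN behaviour without the real file, the sketch below feeds `np.genfromtxt` a small in-memory CSV. The column names and values are invented for illustration, and `io.StringIO` stands in for the file on disk.

```python
import io
import numpy as np

# Hypothetical CSV content: a header row plus two data rows,
# where the last column holds text that cannot be parsed as a number
csv_text = "pickup_year,fare,payment\n2016,12.5,cash\n2016,8.0,card\n"

data = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

print(data.dtype)  # a single float dtype for the whole array
print(data)        # the text column is filled with nan
```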

BOOLEAN ARRAYS

# Creating a Boolean array from filtering criteria:

np.array([2,4,6,8]) < 5

# Boolean filtering for 1D ndarray:

a = np.array([2,4,6,8])
mask = a < 5  # named mask rather than filter, to avoid shadowing the built-in filter()
a[mask]

# Boolean filtering for 2D ndarray:

tip_amount = taxi[:,12] 
tip_bool = tip_amount > 50
top_tips = taxi[tip_bool, 5:14]
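The 2D case combines a Boolean array for the rows with an ordinary slice for the columns. A minimal sketch on a made-up array `trips` (columns: id, fare, tip):

```python
import numpy as np

# Hypothetical data: columns are id, fare, tip
trips = np.array([[1, 10.0, 2.0],
                  [2, 60.0, 15.0],
                  [3, 25.0, 0.0]])

tip = trips[:, 2]
big_tip = tip > 1.0            # one Boolean per row
selected = trips[big_tip, :2]  # keep rows where big_tip is True, and only the first two columns

print(big_tip)
print(selected)
```

The Boolean array must have the same length as the axis it filters, here the number of rows.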

ASSIGNING VALUES

Below, the definition of the Boolean array is inserted directly into the selection. This "shortcut" is the conventional way to write Boolean indexing: the Boolean filter goes straight inside the square brackets.

# Assigning values in a 2D ndarray using indices:

taxi[28214,5] = 1
taxi[:,0] = 16
taxi[1800:1802,7] = taxi[:,7].mean()

# Assigning values using Boolean arrays:

taxi[taxi[:, 5] == 2, 15] = 1

jfk = taxi[taxi[:,6] == 2]
jfk_count = jfk.shape[0]

laguardia = taxi[taxi[:,6] == 3]
laguardia_count = laguardia.shape[0]

newark = taxi[taxi[:,6] == 5] # keep the rows whose value in the column at index 6 equals 5
newark_count = newark.shape[0] # number of rows

trip_mph = taxi[:,7] / (taxi[:,8] / 3600) # compute a new 1D array from the columns at index 7 and 8

cleaned_taxi = taxi[trip_mph < 100]
mean_distance = cleaned_taxi[:, 7].mean()

mean_length = cleaned_taxi[:, 8].mean()

mean_total_amount = cleaned_taxi[:, 13].mean()
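The same steps (assign through a Boolean filter, count matching rows, derive a speed column, drop implausible rows) can be sketched on a tiny made-up array. The column layout of `trips` below is an assumption for illustration and does not match the real taxi file.

```python
import numpy as np

# Hypothetical columns: 0 = location code, 1 = distance (miles),
# 2 = duration (seconds), 3 = flag
trips = np.array([[2, 10.0, 1800.0, 0.0],
                  [3, 5.0, 600.0, 0.0],
                  [2, 50.0, 60.0, 0.0]])

# Assignment with a Boolean filter: set the flag to 1 where the location code is 2
trips[trips[:, 0] == 2, 3] = 1

# Count the matching rows via shape[0]
code2_count = trips[trips[:, 0] == 2].shape[0]

# Derive miles per hour, then keep only rows with a plausible speed
mph = trips[:, 1] / (trips[:, 2] / 3600)
cleaned = trips[mph < 100]
mean_distance = cleaned[:, 1].mean()

print(code2_count, mph, mean_distance, sep="\n")
```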

Resource:

Reading a CSV file into NumPy
Indexing and selecting data

Shortcomings of NumPy:

The lack of support for column names forces us to frame questions as multi-dimensional array operations.
Support for only one data type per ndarray makes it more difficult to work with data that contains both numeric and string data.
There are lots of low level methods, but there are many common analysis patterns that don’t have pre-built methods.
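The single-datatype limitation is easy to demonstrate. In this small sketch (the values are invented), mixing numbers and text in one array makes NumPy store everything as strings, so the numeric values can no longer be used for arithmetic directly.

```python
import numpy as np

# Mixing numbers and text forces a common dtype: everything becomes a string
mixed = np.array([1, 2.5, "three"])

print(mixed.dtype.kind)  # "U" means a Unicode string dtype
print(mixed)

# Without column names, columns can only be addressed by position
table = np.array([[1.0, 100.0], [2.0, 200.0]])
revenue = table[:, 1]    # the reader has to remember that index 1 means "revenue"
```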

Pandas

Reference: the official pandas documentation

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Pandas is not so much a replacement for NumPy as an extension of NumPy. The underlying code for pandas uses the NumPy library extensively, which means the concepts covered above will come in handy as you begin to learn more about pandas.

The primary data structure in pandas is called a dataframe. Dataframes are the pandas equivalent of a NumPy 2D ndarray, with a few key differences:

Axis values can have string labels, not just numeric ones.
Dataframes can contain columns with multiple data types: including integer, float, and string.
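Both differences show up in a very small dataframe. This is a sketch with invented company names and numbers, not data from the post.

```python
import pandas as pd

# Hypothetical data: string row labels, and columns of different types
df = pd.DataFrame(
    {
        "revenues": [500, 300],       # integers
        "country": ["USA", "China"],  # strings
        "margin": [2.8, 3.0],         # floats
    },
    index=["Company A", "Company B"],
)

print(df.dtypes)                         # each column keeps its own dtype
print(df.loc["Company A", "country"])    # rows and columns are addressed by label
```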


pandas.read_csv

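Reference: the pandas.read_csv documentation
Purpose: read a CSV file and extract its rows and columns
Return value: a dataframe
Pay attention to how the parameters are used, for example: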

index_col : int, str, sequence of int / str, or False, default None
    Column(s) to use as the row labels of the DataFrame, either given as string name or column index. If a sequence of int / str is given, a MultiIndex is used.
    Note: index_col=False can be used to force pandas to not use the first column as the index, e.g. when you have a malformed file with delimiters at the end of each line.
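The effect of `index_col` can be checked on an in-memory CSV. In this sketch the column names and values are invented, and `io.StringIO` stands in for a file path.

```python
import io
import pandas as pd

# Hypothetical CSV content with an id column first
csv_text = "id,title,points\n101,First post,5\n102,Second post,7\n"

# index_col=0: the first column becomes the row labels and is no longer a data column
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Default: pandas generates the labels 0, 1, 2, ... and keeps id as a normal column
default_index = pd.read_csv(io.StringIO(csv_text))

print(with_index)
print(default_index)
```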

Experiment:

Data: z5jw


Series is the pandas type for one-dimensional objects. Anytime you see a 1D pandas object, it will be a series. Anytime you see a 2D pandas object, it will be a dataframe
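A quick way to check which type a selection returns is `type()` or `isinstance()`. A minimal sketch on a made-up dataframe:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

col = df["a"]      # one column: 1D, so a Series
row = df.loc[0]    # one row: also 1D, so a Series
sub = df[["a"]]    # a list of columns, even a single one: 2D, so a DataFrame

print(type(col), type(row), type(sub), sep="\n")
```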

Series

DataFrame

import pandas as pd  # by convention, pandas is imported under the alias pd

# pandas.core.frame.DataFrame
eg_data_0 = pd.read_csv("/Personal Source-HJL/Python/test_data/HN_posts_year_to_Sep_26_2016.csv", index_col=0)  # use the first column (id) as the row labels
eg_data_1 = pd.read_csv("/Personal Source-HJL/Python/test_data/HN_posts_year_to_Sep_26_2016.csv")  # no row labels specified, so pandas assigns 0, 1, 2, ...

# dataframe.info(), dataframe.shape, dataframe.head(), dataframe.tail()
eg_data_0.info()  # info() prints its report itself and returns None, so it is not wrapped in print()
print(eg_data_0.shape,'\n')
print(eg_data_0.head(2),'\n')

eg_data_1.info()
print(eg_data_1.shape,'\n')
print(eg_data_1.head(2),'\n')

# pandas.core.series.Series: select a column

title = eg_data_0["title"]
# type(series), series.shape, series.head(), series.tail()
print(type(title))
print(title.shape,'\n')
print(title.head(2),'\n')
print(title.tail(2),'\n')

print(title.value_counts(),'\n')  # each unique non-null value in the column with its count, most frequent first
print(type(title.value_counts()),'\n')  # the result is itself a Series, so you can select from it
print(title.value_counts()[["Employee benefits at Basecamp","How to build stable systems"]],'\n')

# pandas.core.frame.DataFrame: select two rows (a .loc label slice includes both endpoints and returns a DataFrame)
row_1 = eg_data_1.loc[1:2]

print(type(row_1),'\n')
print(row_1.shape,'\n')
print(row_1)

Label selection methods
loc[row_label, column_label]

Select by Label	Explicit Syntax	Shorthand Convention
Single column from dataframe	df.loc[:,"col1"]	df["col1"]
List of columns from dataframe	df.loc[:,["col1","col7"]]	df[["col1","col7"]]
Slice of columns from dataframe	df.loc[:,"col1":"col4"]
Single row from dataframe	df.loc["row4"]
List of rows from dataframe	df.loc[["row1", "row8"]]
Slice of rows from dataframe	df.loc["row3":"row5"]	df["row3":"row5"]
Single item from series	s.loc["item8"]	s["item8"]
List of items from series	s.loc[["item1","item7"]]	s[["item1","item7"]]
Slice of items from series	s.loc["item2":"item4"]	s["item2":"item4"]
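A few rows of the table, sketched on a made-up dataframe whose labels mirror the "row"/"col" names used above. The point worth noticing is that label slices with `.loc` include the end label, unlike positional slices in NumPy.

```python
import pandas as pd

# Hypothetical dataframe with string labels on both axes
df = pd.DataFrame(
    {"col1": [1, 2, 3], "col2": [4, 5, 6], "col3": [7, 8, 9]},
    index=["row1", "row2", "row3"],
)

single_col = df.loc[:, "col1"]          # same result as the shorthand df["col1"]
col_slice = df.loc[:, "col1":"col2"]    # both col1 and col2 are included
single_row = df.loc["row2"]             # one row as a Series
row_slice = df.loc["row1":"row2"]       # both row1 and row2 are included

print(col_slice)
print(row_slice)
```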

DATA EXPLORATION METHODS

# Describing a series object:

revs = f500["revenues"]
summary_stats = revs.describe()

# Unique value counts for a column:

country_freqs = f500['country'].value_counts()
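Since `f500` itself is not included in these notes, here is a sketch of the two methods on small made-up Series. `value_counts()` skips missing values unless `dropna=False` is passed.

```python
import pandas as pd

# Hypothetical values, including one missing entry
countries = pd.Series(["USA", "China", "USA", "Japan", "USA", None])
revs = pd.Series([10.0, 20.0, 30.0])

freqs = countries.value_counts()                      # sorted by count, missing values excluded
freqs_with_na = countries.value_counts(dropna=False)  # missing values counted too
stats = revs.describe()                               # count, mean, std, min, quartiles, max

print(freqs)
print(stats)
```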

ASSIGNMENT WITH PANDAS

#Creating a new column:

top5_rank_revenue["year_founded"] = 0

#Replacing a specific value in a dataframe:

f500.loc["Dow Chemical","ceo"] = "Jim Fitterling"
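Both kinds of assignment, sketched on a made-up dataframe. The company and CEO names are placeholders.

```python
import pandas as pd

# Hypothetical dataframe indexed by company name
df = pd.DataFrame(
    {"ceo": ["A. Smith", "B. Jones"]},
    index=["Company A", "Company B"],
)

# Assigning to a column name that does not exist yet creates the column
df["year_founded"] = 0

# .loc[row_label, column_label] replaces one specific value
df.loc["Company B", "ceo"] = "C. Lee"

print(df)
```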

BOOLEAN INDEXING IN PANDAS

# Filtering a dataframe down on a specific value in a column:

kr_bool = f500["country"] == "South Korea"
top_5_kr = f500[kr_bool].head()

# Updating values using Boolean filtering:

import numpy as np

prev_rank_before = f500["previous_rank"].value_counts(dropna=False).head()

f500.loc[f500["previous_rank"] == 0, "previous_rank"] = np.nan  # replace every 0 with NaN

prev_rank_after = f500["previous_rank"].value_counts(dropna=False).head()

f500["rank_change"] = f500["previous_rank"] - f500["rank"]

rank_change_desc = f500["rank_change"].describe()

top_3_countries = f500["country"].value_counts().head(3)
industry_usa = f500.loc[f500["country"] == "USA","industry"].value_counts().head(2)
sector_china = f500.loc[f500["country"] == "China","sector"].value_counts().head(3)
mean_employees_japan = f500.loc[f500["country"] == "Japan","employees"].mean()
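The whole pattern (replace a placeholder value with NaN, compute a derived column, then filter and aggregate) can be sketched on a tiny made-up dataframe. The values are invented, and `previous_rank` is stored as floats here as an assumption, so that the column can hold NaN without a dtype change.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for f500
df = pd.DataFrame({
    "country": ["USA", "Japan", "USA", "Japan"],
    "rank": [1, 2, 3, 4],
    "previous_rank": [2.0, 0.0, 1.0, 3.0],
    "employees": [100, 200, 300, 400],
})

# Boolean filter on the rows, column label on the columns
df.loc[df["previous_rank"] == 0, "previous_rank"] = np.nan

# NaN propagates through arithmetic, so the affected row gets NaN here as well
df["rank_change"] = df["previous_rank"] - df["rank"]

# Filter rows and pick one column in a single .loc call, then aggregate
mean_employees_japan = df.loc[df["country"] == "Japan", "employees"].mean()

print(df)
print(mean_employees_japan)
```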

References: https://pandas.pydata.org/pandas-docs/stable/user_guide/gotchas.html

https://stackoverflow.com/questions/tagged/pandas
