【pandas cookbook学习笔记】第一、二章

最新推荐文章于 2021-11-24 17:17:50 发布

萝卜丝皮尔

最新推荐文章于 2021-11-24 17:17:50 发布

阅读量205

点赞数

分类专栏：数据分析Python基础篇

本文链接：https://blog.csdn.net/qq_43448491/article/details/114555248

版权

数据分析Python基础篇专栏收录该内容

23 篇文章 0 订阅

订阅专栏

本文介绍了MySQL中的存储过程和游标的使用。存储过程用于保存和重复执行复杂的SQL操作，可以处理业务规则和智能逻辑。游标则允许在存储过程中逐行处理数据，特别适用于循环和条件判断。同时，文中提到了如何创建和使用游标，并展示了在数据处理中游标的运用。此外，还探讨了数据类型和Pandas库在数据分析中的应用，包括DataFrame和Series的数据操作、数据信息查看以及数据处理方法。

摘要由CSDN通过智能技术生成

2021/3/8

《MySQL必知必会》

【180/254】
存储过程（procedure）被保存在服务器上以供使用。
存储过程(有点像MySQL的函数)的创建、删除、调用。
存储过程并不显示结果，而是把结果返回给你指定的变量。
它不能通过一个参数返回多个行和列；
MySQL中的变量全部都是以@开头，展示变量用select @variable1_name,@variable2_name;
存储过程的价值体现在业务规则和智能处理。
declare total decimal(8,2)定义变量total;

create procedure odertotal(
	in onumber int,
	in taxable boolean, #布尔值
	out ototal decimal(8,2)
) comment 'Obtain order total, optionally adding tax'
#comment值不是必需的，但如果给出，将在show procedure status的结果中显示
begin
	-- declare variable for total
	declare total decimal(8,2); #定义变量，及其数据类型
	--- declare taxrate percentage 
	declare taxrate int default 6; #可选默认值为6
	-- get the order total
	select sum(item_price*quantity)
	from orderitems
	where order_num=onumber
	into total;
	-- is this taxable?
	if taxable then 
		-- yes,so add taxrate to the total
		select total+(total/100*taxrate) into total;
	end if;
	-- and finally, save to our variable
	select total into ototal;
end;

2021/3/9

《MySQL必知必会》

【187/254】

为什么使用游标？游标是什么？

游标只存在于存储过程和函数，当存储过程处理完成后，游标消失。

使用游标
declare语句次序：它在游标或句柄之前，句柄必须在游标之后定义，否则报错。

create procedure processorders()
begin 
	-- declare local variable
	declare done boolean default 0;
	declare o int;
	declare t decimal(8,2);
	-- declare the cursor
	declare odernumbers cursor
	for 
	select oder_num from orders;
	-- declare contiune handler
	declare continue handler for sqlstate '02000' set done=1;
	-- create a table to store the results
	create table if not exist odertotals # 如果它不存在的话，创建一个新表
		(oder_num int , total decimal(8,2));
	-- open the cursor
	open ordernumbers;
	-- loop through all rows
	repeat
		-- get oder number
		fetch odernumbers into o
		-- get the total for this order
		call ordertotal(o,1,t);
		-- insert order and total into ordertotals
		insert into ordertotals(order_num,total)
		values(o,t);
		-- end of looop
		until done end repeat;
	-- close the cursor 
	close odernumbers;
end;
# 此存储过程不返回数据，但是创建并填充了另一个表
# 用select语句查看
select * from odertotals;

【194/254】
触发器（trigger）背景：有些处理需要在某个表发生更改时自动进行；
在这里插入图片描述
其他MySQL语句不支持触发器。

Pandas Cookbook

DataFrame : a two-diemensional dataset;
Series : a one-diemensional dataset;

method chaining

by convention 按照惯例；
alignment 队列；
dataframe取前几行，取后几行

数据类型：

展示数据信息的一些方法：

.dtypes attribute will show the column as an object( or O on the series)
Almost all of pandas data types are built from Numpy.
.info method prints the data type information in addition to the count of not-null values , also lists the amount of memory used by the DataFrame

2021/3/10

a=[1,2,3]
b=['apple','pear','nut']
data={'iid':a,'fruit':b}
import pandas as pd
data=pd.DataFrame(data)
data.info() # .info() used for DataFrame, not list
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   iid     3 non-null      int64 
 1   fruit   3 non-null      object
dtypes: int64(1), object(1)
memory usage: 176.0+ bytes
a.info()
Traceback (most recent call last):
  File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-11-ea9be6f27180>", line 1, in <module>
    a.info()
AttributeError: 'list' object has no attribute 'info'
data.dtypes
Out[12]: 
 # .info() prints information on the screen, but .dtypes attribute returns a Pandas series if you needed to use the data
iid       int64
fruit    object
dtype: object
data_dtypes=data.dtypes
type(data_dtypes)
Out[14]: pandas.core.series.Series
data_info=data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   iid     3 non-null      int64 
 1   fruit   3 non-null      object
dtypes: int64(1), object(1)
memory usage: 176.0+ bytes
type(data_info)
Out[16]: NoneType
#--------------------------------------------

a single column of data is a series.

# how to select one column from one datafarme
data.iid 
# method1 : use attribute access to accomplish the same task
Out[17]: 
0    1
1    2
2    3
Name: iid, dtype: int64
data["iid"] 
# method2 : pass a column name as a string to the indexing operator to select a series of data
Out[18]: 
0    1
1    2
2    3
Name: iid, dtype: int64
# .loc and .iloc attributes to pull out a Series
# .loc pulls out a Series by column name , or say , label-based
data.loc[:,'fruit']
Out[19]: 
0    apple
1     pear
2      nut
Name: fruit, dtype: object
# .iloc pulls out a Series by positon , or say , positional-based
data.iloc[:,1]
Out[20]: 
0    apple
1     pear
2      nut
Name: fruit, dtype: object

monospace font : 等宽字体

data['fruit'].index
Out[21]: RangeIndex(start=0, stop=3, step=1)
data['fruit'].dtype
Out[22]: dtype('O')
data['fruit'].dtypes
Out[23]: dtype('O')
data['fruit'].name
Out[24]: 'fruit'
data['fruit'].size
Out[25]: 3
type(data['fruit'])
Out[26]: pandas.core.series.Series
data['fruit'].apply(type).unique() # chain unique method onto the result
Out[27]: array([<class 'str'>], dtype=object)
data['fruit'].apply(type) # look at a Series that has the type of every member
Out[28]: 
0    <class 'str'>
1    <class 'str'>
2    <class 'str'>
Name: fruit, dtype: object

follow Python naming conventions
the index operator : [

data.head(2)
Out[29]: 
   iid  fruit
0    1  apple
1    2   pear
data.sample(n=2)
Out[30]: 
   iid fruit
1    2  pear
2    3   nut
data.fruit.sample(2)
Out[31]: 
0    apple
1     pear
Name: fruit, dtype: object
data.fruit.value_counts()
# calculate the frequencies
Out[32]: 
pear     1
nut      1
apple    1
Name: fruit, dtype: int64
data.fruit.shape # .shape attributions
Out[33]: (3,)
len(data.fruit) # the built-in len function
Out[34]: 3
data.fruit.unique() # .unique method
Out[35]: array(['apple', 'pear', 'nut'], dtype=object)
data.fruit.count() # not return the count of items, rather the number of non-missing values
Out[36]: 3
# basic summary statistics 
data.iid.min()
Out[37]: 1
data.iid.max()
Out[38]: 3
data.iid.std()
Out[39]: 1.0
data.iid.mean()
Out[40]: 2.0
data.iid.median()
Out[41]: 2.0

be rounded to the nearest thousand : 圆整到千

data.fruit.isna()
# .isna method can be used to determine whether each individual value is missing or not, and its result is a Series
Out[44]: 
0    False
1    False
2    False
Name: fruit, dtype: bool
# to remove the entries in Series element with missing values, use the .dropna method
data.fruit.value_counts(normalize=True)
# setting normalize parameter to True for the relative frequencies
Out[45]: 
pear     0.333333
nut      0.333333
apple    0.333333
Name: fruit, dtype: float64
data.fruit.hasnans
# whether there exists missing values in the Series
Out[46]: False
data.fruit.notna()
Out[47]: 
0    True
1    True
2    True
Name: fruit, dtype: bool

在这里插入图片描述
作用在Series上的operator，做向量化的运算，而不是操作循环。

相比于符号算子，调用方法的好处：
dunder

# chaining Series methods
data.fruit.values_count().head(2).T
# .T propetry means transpose
# count the number of missing values 
data.fruit.isna().sum()
# pandas treats False and True as 0 and 1,and .sum() returns the number of missing values

对含缺失值的列做数据类型转换

2021/3/12

[64/627]

# insert one new column into a specific location in a DataFrame with the .insert method
# it takes the integer position of the new column as its first argument, the name of the new column as its second, the values as the third 
# .get_loc index method to find the integer location of the column name
# .insert method modifies the calling DataFrame in-place, so there won't be an assignment statement, and it also returns None.
#Goal: the profit = gross-budget, and insert it after gross column
profit_index=movies.columns.get_loc("gross")+1
movies.insert(loc=profit_index,column="profit",value=movies["gross"]-movies["budget"])

select multiple DataFrame columns

# pass in a list with a single item, then we get back a DataFrame
# pass in a string with the column name, then we get back a Series
type(data["cone2"])
Out[22]: pandas.core.series.Series
type(data[["cone2"]])
Out[23]: pandas.core.frame.DataFrame
#---
# use .loc to put out a column by name
data
Out[26]: 
   id cone
0   1    a
1   2    b
2   3    c
3   4    d
type(data.loc[:,["cone"]])
Out[27]: pandas.core.frame.DataFrame
type(data.loc[:,"cone"])
Out[28]: pandas.core.series.Series

在这里插入图片描述

you’d better save all your column names to a list variable first, and then choose them .
selecting columns with methods

# use the .select_dtypes method to select only the integer columns
movies.select_dtypes(include="int").head() # take out the first 5 rows
movies.select_dtypes(include="number").head() # select all the numeric columns
movies.select_dtypes(include=["int","object"]) # select the integer and string columns
movies.select_dtypes(exclude="float") # exclude only floating-point columns
# an alternative method to select columns is with .filter method, and search column names or index labels based on which parameter is used 
movies.filter(like="fb") # something like ambiguous search

在这里插入图片描述

.select_dtypes() 中数据类型的详细解释
给字段排序的一些规矩

[75/627]

c1=["cone"]
c2=["ctwo"]
c0=["id"]
c=(c0+c2+c1)
c
Out[35]: ['id', 'ctwo', 'cone']
type(c)
Out[36]: list
data[c] # movies[['movie_title','director_name']] creates a new DataFrame
Out[37]: 
   id    ctwo cone #new order
0   1   apple    a
1   2  banana    b
2   3    pear    c
3   4   peach    d
data
Out[38]: 
   id cone    ctwo
0   1    a   apple
1   2    b  banana
2   3    c    pear
3   4    d   peach

Summarizing a DataFrame

some details about .shape etc as follows:

在这里插入图片描述
关于聚集函数中的 skipna参数

Chaining DataFrame methods
One of the keys to method chaining is to know the exact object being returned each step of the chain.In pandas, this will nearly always be a DataFrame, Series, or scalar value.
使用.isnull() .sum() any()判断一个数据框里面有没有空值，有几个空值.

parenthese n.括号
对于某些聚集方法无法对含缺失值的列做聚集，解决策略是把缺失值补上（如果是字符串，补成“ ”），再call method.
DataFrame operation
when an arithmetic or comparison operator is used with a DataFrame, each value of each column gets the operations applied to it .
if the DataFrame does not contain homogeneous data(e.g. contains both numeric and object data types), then the operation is likely to fail.
关于数据圆整：

检测两个数据框是否相等：

the .equals method determines if all the elements and indexes between two DataFrames are exactly the same and returns a Boolean.
Comparing missing values

missing values do not compare equally with one another.
thus, instead of using == to find missing numbers, use the .isna method
比较两个数据框的正确姿势：
Transposing the direction of a DataFrame operation
some details about axis parameter:

在这里插入图片描述

# use .loc to select rows based on the index label 
# if there were still missing values, we could have run the .fillna(0) method to fill all the remaining values with 0

萝卜丝皮尔

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
【pandas cookbook学习笔记】第一、二章

2021/3/8《MySQL必知必会》【180/254】存储过程（procedure）被保存在服务器上以供使用。存储过程(有点像MySQL的函数)的创建、删除、调用。存储过程并不显示结果，而是把结果返回给你指定的变量。它不能通过一个参数返回多个行和列；MySQL中的变量全部都是以@开头，展示变量用select @variable1_name,@variable2_name;存储过程的价值体现在业务规则和智能处理。declare total decimal(8,2)定义变量total;cr
复制链接

扫一扫

专栏目录