1 ---- Series and reading external data (pandas)

qq_44647559

Published on 2021-05-10 09:32:24

Category: # 3 Python's three major libraries (completed)

Article link: https://blog.csdn.net/qq_44647559/article/details/116586368

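【Background】Why learn pandas?
Because:
(1) numpy helps us process data and, combined with matplotlib, solve data-analysis problems. So what does pandas add?
(2) numpy mainly handles numeric data.
(3) Besides numbers, real data also includes strings, time series, and so on.
(4) For example: data collected by a web crawler and stored in a database.
(5) For example: the earlier YouTube example, which besides numbers also contained country information, video category (tag) information, titles, and so on.
So:
(1) numpy handles numeric data, while pandas (built on numpy) handles numeric data and other data types as well.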

Note:
(1) Series     --- one-dimensional data, a labeled array
(2) DataFrame  --- two-dimensional data, a container of Series
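
To make the two notes above concrete, here is a minimal sketch showing that each column of a DataFrame is a Series and that all columns share the same index. The column names, values, and index labels are made up for illustration and do not come from the original post.

```python
import pandas as pd

# Hypothetical sample data, invented for this illustration
df = pd.DataFrame(
    {"name": ["xiaohong", "xiaoming"], "age": [30, 25]},
    index=["a", "b"],
)

# Selecting one column returns a Series: a labeled one-dimensional array
ages = df["age"]
print(type(ages))
print(ages.index.tolist())

# Every column carries the same row labels as the DataFrame itself
names = df["name"]
print(names.index.equals(df.index))
```

In other words, the DataFrame can be read as a dict-like container whose values are Series aligned on one shared index.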

【Question 1】Creating a pandas Series

import pandas as pd

# (1) Create a Series from a list; the default index is 0..n-1
t1 = pd.Series([1,2,31,12,3,4])
print(t1)
print(type(t1))
print('**(1)**'*10)






# (2) Specify the index explicitly
t2 = pd.Series([1,2,31,12,3,4],index=list('abcdef'))      # list('abcdef') splits the string into a list of characters
print(t2)
print(t2.dtype)           # t2 holds integers ---- dtype: int64
print(t2.astype('float'))    # return a copy converted to float
print('**(2)**'*10)






# (3) From a dict: keys ---> index, values ---> values
temp_dict = {'name':'xiaohong','age':30,'tel':10086}
t3 = pd.Series(temp_dict)
print(t3)
print(t3.dtype)           # t3 mixes strings and integers ---- dtype: object
print('**(3)**'*10)





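# (4) Use uppercase letters as the index
import string
import numpy as np        # needed for np.arange below

t1 = string.ascii_uppercase[:10]
print(t1)
t2 = list(string.ascii_uppercase[:10])
print(t2)
t3 = pd.Series(np.arange(10),index=list(string.ascii_uppercase[:10]))
print(t3)

print(type(t3))
print(t3.dtype)           # platform-dependent: int32 or int64, following numpy's default integer type

print('**(4)**'*10)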





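# (5) Re-indexing a Series built from a dict
'''
(1) When a Series is created from a dict, the dict keys become the index
(2) If a different index is specified, labels that match a key take that key's value; labels with no match get NaN
(3) In numpy, nan is a float, so pandas automatically changes the Series dtype to match the data (here float64)
'''
import string 

t = {string.ascii_uppercase[i]:i for i in range(10)}  # build the dict {'A': 0, ..., 'J': 9}
print(t)

t1 = pd.Series(t)
print(t1)

t2 = pd.Series(t,index=list(string.ascii_uppercase[5:15]))  # index labels F..O, i.e. letters [5, 15)
print(t2)

print('**(5)**'*10)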

0     1
1     2
2    31
3    12
4     3
5     4
dtype: int64
<class 'pandas.core.series.Series'>
**(1)****(1)****(1)****(1)****(1)****(1)****(1)****(1)****(1)****(1)**
a     1
b     2
c    31
d    12
e     3
f     4
dtype: int64
int64
a     1.0
b     2.0
c    31.0
d    12.0
e     3.0
f     4.0
dtype: float64
**(2)****(2)****(2)****(2)****(2)****(2)****(2)****(2)****(2)****(2)**
name    xiaohong
age           30
tel        10086
dtype: object
object
**(3)****(3)****(3)****(3)****(3)****(3)****(3)****(3)****(3)****(3)**
ABCDEFGHIJ
['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
A    0
B    1
C    2
D    3
E    4
F    5
G    6
H    7
I    8
J    9
dtype: int32
<class 'pandas.core.series.Series'>
int32
**(4)****(4)****(4)****(4)****(4)****(4)****(4)****(4)****(4)****(4)**
{'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6, 'H': 7, 'I': 8, 'J': 9}
A    0
B    1
C    2
D    3
E    4
F    5
G    6
H    7
I    8
J    9
dtype: int64
F    5.0
G    6.0
H    7.0
I    8.0
J    9.0
K    NaN
L    NaN
M    NaN
N    NaN
O    NaN
dtype: float64
**(5)****(5)****(5)****(5)****(5)****(5)****(5)****(5)****(5)****(5)**
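
The float64 result in case (5) follows from NaN being a float. As a hedged aside, the sketch below repeats that re-indexing and then converts the result to pandas' nullable integer dtype `"Int64"`, which keeps integer values alongside missing entries. The `"Int64"` step is an addition for illustration and is not part of the original post.

```python
import string
import pandas as pd

t = {string.ascii_uppercase[i]: i for i in range(10)}   # {'A': 0, ..., 'J': 9}
new_index = list(string.ascii_uppercase[5:15])          # labels F..O

# Labels K..O have no matching key, so they become NaN and the dtype becomes float64
s_float = pd.Series(t, index=new_index)
print(s_float.dtype)

# Nullable integer dtype: missing entries become <NA>, the rest stay integers
s_int = s_float.astype("Int64")
print(s_int.dtype)
print(s_int.isna().sum())   # number of missing entries
```

Whether the nullable dtype is worth using depends on the downstream code; plain float64 with NaN is still the default behavior shown in the output above.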

【Question 2】Slicing and indexing a pandas Series

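'''
(1) Slicing: pass start, end, and/or step directly
(2) Indexing: for a single element, pass its position or its index label; for several elements, pass a list of positions or index labels
'''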

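# (1) Access by label and by position
import pandas as pd

t1 = {'age':30,'name':'xiaohong','tel':10086}
print(t1)
t2 = pd.Series(t1)
print(t2)
print(t2['age'])
print(t2['tel'])
print('*'*30)

# Positional access: .iloc is explicit; plain t2[0] on a string-labeled Series is deprecated in recent pandas
print(t2.iloc[0])
print(t2.iloc[1])
print(t2.iloc[2])
print('*'*30)

print(t2.iloc[:2])         # slice, contiguous: the first 2 rows, positions [0,2)
print(t2.iloc[[1,2]])      # list of positions, need not be contiguous: rows 1 and 2
print('**（1）**'*10)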





# (2) Boolean indexing
t = pd.Series([1,2,31,12,3,4])
print(t)

t1 = t[t>10]              # keep only the elements greater than 10
print(t1)
print('**（2）**'*10)







# (3) Index and values of a pandas Series
'''
(1) Since Python 3.7, dicts preserve insertion order (they are not sorted by key), and the Series index follows that order
(2) list() converts the index to a list
(3) A Series object is essentially made of two arrays:
     one array holds the keys (index), the other holds the values (values); key -> value
'''
t = {'age':30,'name':'xiaohong','tel':10086}
t1 = pd.Series(t)
print(t1)

print(t1.index)                 # t1.index       ------- Index(['age', 'name', 'tel'], dtype='object')
print(type(t1.index))           # type(t1.index) ------- pandas.core.indexes.base.Index
print(len(t1.index))
print(list(t1.index))           # convert to a list with list()
print(list(t1.index)[:2])       # slice the list: list(t1.index)[:2]
for i in t1.index:
    print(i)
print('*'*30)

print(t1.values)                # t1.values       ------- [30 'xiaohong' 10086]
print(type(t1.values))          # type(t1.values) ------- numpy.ndarray ------- numpy's array type

print('**（3）**'*10)






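# (4) See the official documentation, e.g. search for: pandas series where
'''
(1) Many ndarray methods also work on a Series, e.g. argmax, clip
(2) Series also has a where method, but its result differs from the ndarray version
'''
t1 = pd.Series(range(5))
print(t1)

t2 = t1.where(t1>0)        # where t1>0, keep the value; where t1<=0, replace with NaN
print(t2)

t3 = t1.where(t1>1,10)     # where t1>1, keep the value; where t1<=1, replace with 10
print(t3)
print('**（4）**'*10)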









# (5) Combined examples
import string

t = pd.Series(range(10),index=list(string.ascii_uppercase[:10]))
print(t)

print(t.iloc[2:10:2])      # positions [2,10), step 2
print(t.iloc[1])           # element at position 1; prints 1
print(t.iloc[[2,3,6]])     # rows at positions 2, 3, and 6
print(t[t>4])         # boolean indexing
print(t['F'])
print(t[['A','F','G']])

print('**（5）**'*10)

{'age': 30, 'name': 'xiaohong', 'tel': 10086}
age           30
name    xiaohong
tel        10086
dtype: object
30
10086
******************************
30
xiaohong
10086
******************************
age           30
name    xiaohong
dtype: object
name    xiaohong
tel        10086
dtype: object
**（1）****（1）****（1）****（1）****（1）****（1）****（1）****（1）****（1）****（1）**
0     1
1     2
2    31
3    12
4     3
5     4
dtype: int64
2    31
3    12
dtype: int64
**（2）****（2）****（2）****（2）****（2）****（2）****（2）****（2）****（2）****（2）**
age           30
name    xiaohong
tel        10086
dtype: object
Index(['age', 'name', 'tel'], dtype='object')
<class 'pandas.core.indexes.base.Index'>
3
['age', 'name', 'tel']
['age', 'name']
age
name
tel
******************************
[30 'xiaohong' 10086]
<class 'numpy.ndarray'>
**（3）****（3）****（3）****（3）****（3）****（3）****（3）****（3）****（3）****（3）**
0    0
1    1
2    2
3    3
4    4
dtype: int64
0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64
0    10
1    10
2     2
3     3
4     4
dtype: int64
**（4）****（4）****（4）****（4）****（4）****（4）****（4）****（4）****（4）****（4）**
A    0
B    1
C    2
D    3
E    4
F    5
G    6
H    7
I    8
J    9
dtype: int64
C    2
E    4
G    6
I    8
dtype: int64
1
C    2
D    3
G    6
dtype: int64
F    5
G    6
H    7
I    8
J    9
dtype: int64
5
A    0
F    5
G    6
dtype: int64
**（5）****（5）****（5）****（5）****（5）****（5）****（5）****（5）****（5）****（5）**
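
The examples in this section mix access by position and access by label. A minimal sketch of the two explicit accessors, `.iloc` (position) and `.loc` (label), is given below on the same letter-indexed Series as case (5). It is meant as an illustration of the difference, in particular that a label slice includes its end point while a positional slice does not.

```python
import string
import pandas as pd

t = pd.Series(range(10), index=list(string.ascii_uppercase[:10]))

by_pos = t.iloc[1]             # second element, by position
by_label = t.loc["F"]          # element whose label is 'F'

pos_slice = t.iloc[2:10:2]     # positions 2, 4, 6, 8 (end excluded)
label_slice = t.loc["C":"E"]   # labels C, D, E (end included)

several = t.iloc[[2, 3, 6]]    # a list of positions

print(by_pos, by_label)
print(pos_slice.tolist(), label_slice.tolist())
print(several.index.tolist())
```

Using `.iloc` and `.loc` avoids ambiguity when the index itself contains integers, where a bare `t[1]` would be read as a label rather than a position.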

【Question 3】Reading external data with pandas
(1) Data stored in a CSV file ------- pd.read_csv()
       The result is a DataFrame; selecting a single column from it gives a Series

(2) Data in a database such as MySQL  ----- pd.read_sql(sql_sentence, connection)

(3) Excel data ------ pd.read_excel()

(4) JSON data ----- pd.read_json()

(5) HTML data ---- pd.read_html()

(6) What about data in a MongoDB database?
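
Item (1) above can be tried without any file on disk. The sketch below feeds `pd.read_csv` a small in-memory CSV through `io.StringIO`; the three rows are copied from the dogNames2 output shown later in this post, and the in-memory approach itself is an assumption made only to keep the example self-contained.

```python
import io
import pandas as pd

# Three rows taken from the dogNames2 output in this post, held in memory instead of a file
csv_text = "Row_Labels,Count_AnimalName\nRENNY,1\nDEEDEE,2\nGLADIATOR,1\n"

df = pd.read_csv(io.StringIO(csv_text))   # read_csv accepts file paths and file-like objects
counts = df["Count_AnimalName"]           # a single column is a Series

print(type(df))
print(type(counts))
print(df.shape)
```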

# (1) Data stored in a CSV file is read into a DataFrame
'''
(1) pd.read_csv() ----- .csv data

(2) pd.read_excel() ----- Excel data (.xlsx)
(3) pd.read_json() --- .json data (crawler)
(4) pd.read_html() --- .html data (crawler)

(5) pd.read_sql(sql_sentence, connection) ----- MySQL database data ---- to be studied
(6) MongoDB database data ---- to be studied
'''
import pandas as pd

df = pd.read_csv('./code2/dogNames2.csv')          # pandas reads the CSV file ------ the result is a DataFrame
print(df)










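# (2) MongoDB data   ----- install pymongo first
'''
(1) The data lives in his MongoDB database; here we connect to that database and read from it
(2) Django can also be used to connect to the database
(3) Learning MongoDB itself is left for self-study
'''
# case1 ---- the result is [{}, {}, {}]  ---- a list of dicts, each dict holding the details of one movie

from pymongo import MongoClient

client = MongoClient()                    # no arguments: connects to localhost:27017
collection = client['douban']['tv1']      # database 'douban', collection 'tv1'
data = list(collection.find())            # fetch every document as a dict

print(data)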




# case2 ----- read the details of the first movie
from pymongo import MongoClient
import pandas as pd

client = MongoClient()
collection = client['douban']['tv1']
data = list(collection.find())

t1 = data[0]    # take the first element of the list ---------- the first movie's data
t1 = pd.Series(t1)
print(t1)

     Row_Labels  Count_AnimalName
0         RENNY                 1
1        DEEDEE                 2
2     GLADIATOR                 1
3        NESTLE                 1
4          NYKE                 1
...         ...               ...
4159    ALEXXEE                 1
4160  HOLLYWOOD                 1
4161      JANGO                 2
4162  SUSHI MAE                 1
4163      GHOST                 3

[4164 rows x 2 columns]
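
The MongoDB cases above need a running database. As a rough stand-in, the sketch below uses a hard-coded list of dicts in place of `list(collection.find())` and shows how one dict becomes a Series and the whole list becomes a DataFrame. The field names and values are hypothetical; the real documents in the `douban` database are not shown in this post.

```python
import pandas as pd

# Hypothetical stand-in for list(collection.find()); fields are invented for illustration
data = [
    {"title": "Show A", "rating": 8.5, "country": "CN"},
    {"title": "Show B", "rating": 7.9, "country": "US"},
]

first = pd.Series(data[0])   # one document: dict keys become the index
df = pd.DataFrame(data)      # all documents: one row per dict, one column per key

print(first)
print(df)
```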




