Data Analysis Engineer, Lecture 02: Pandas Tutorial (Part 1)

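This tutorial shows data analysis engineers how to use Pandas and focuses on the Series data structure. It covers constructing and initializing a Series, boolean indexing for conditional selection, assigning to a Series, and handling missing data. It also mentions DataFrame, a two-dimensional data structure similar to an Excel spreadsheet.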

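pandas is a Python library built specifically for data analysis.

Introduction to Pandas

  • A Python package for data analysis and processing
  • Built on NumPy (scientific computing on "matrices")
  • Feels like working with Excel/SQL from Python

Contents

  • Series
  • DataFrame
  • Index
  • Reading and writing CSV files

The Series data structure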

import numpy as np
import pandas as pd
# json.loads() decodes a JSON string into Python objects
import json

jsonStr = '{"name":"aspiring", "age": 17, "hobby": ["money","power", "read"],"parames":{"a":1,"b":2}}'

jsonData = json.loads(jsonStr)
print(jsonData)

print(type(jsonData))
print(jsonData['hobby'])


{'name': 'aspiring', 'age': 17, 'hobby': ['money', 'power', 'read'], 'parames': {'a': 1, 'b': 2}}
<class 'dict'>
['money', 'power', 'read']
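# Reading JSON from a file
# json.load() parses a whole file as one JSON document; this file holds one
# JSON object per line, so read just the first line to inspect it

path1 = 'data/example.json'
with open(path1, 'r', encoding='utf-8') as f:
    first_line = f.readline()
first_line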
'{ "a": "Mozilla\\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\\/535.11 (KHTML, like Gecko) Chrome\\/17.0.963.78 Safari\\/535.11", "c": "US", "nk": 1, "tz": "America\\/New_York", "gr": "MA", "g": "A6qOVH", "h": "wfLQtf", "l": "orofrog", "al": "en-US,en;q=0.8", "hh": "1.usa.gov", "r": "http:\\/\\/www.facebook.com\\/l\\/7AQEFzjSi\\/1.usa.gov\\/wfLQtf", "u": "http:\\/\\/www.ncbi.nlm.nih.gov\\/pubmed\\/22415991", "t": 1331923247, "hc": 1331822918, "cy": "Danvers", "ll": [ 42.576698, -70.954903 ] }\n'
# Example from a Python data analysis book
import json

path2 = 'data/example.json'

# Parse one JSON object per line; the with-block closes the file afterwards
with open(path2, 'r', encoding='utf-8') as f:
    records = [json.loads(line) for line in f]

records[0]

records[0]['tz']
'America/New_York'
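
The line-by-line parsing above can be tried without the data file. The sketch below uses an in-memory `io.StringIO` holding two made-up records as a stand-in for `data/example.json`; the sample values are assumptions for illustration, not the contents of the real file.

```python
import io
import json

# Two hypothetical records in the same one-object-per-line layout
sample = io.StringIO(
    '{"tz": "America/New_York", "c": "US"}\n'
    '{"tz": "Asia/Shanghai", "c": "CN"}\n'
)

# Parse each non-blank line as its own JSON document
records = [json.loads(line) for line in sample if line.strip()]

print(len(records))      # 2
print(records[0]["tz"])  # America/New_York
```

Skipping blank lines with `line.strip()` avoids a `JSONDecodeError` if the file ends with an empty line.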
Constructing and initializing a Series
s = pd.Series([7, 'Beijing', 3.14, -12345, 'HanXiaoyang'])
s
0              7
1        Beijing
2           3.14
3         -12345
4    HanXiaoyang
dtype: object
s.values
array([7, 'Beijing', 3.14, -12345, 'HanXiaoyang'], dtype=object)
s.index
RangeIndex(start=0, stop=5, step=1)
s[1]
'Beijing'
s
0              7
1        Beijing
2           3.14
3         -12345
4    HanXiaoyang
dtype: object

By default pandas uses the integers 0 to n-1 as the index of a Series, but we can also specify the index ourselves. The index is analogous to the keys of a dict.

s = pd.Series([7, 'Beijing', 3.14, -12345, 'HanXiaoyang'], index=['A', 'B', 'C', 'D', 'E'])
s
A              7
B        Beijing
C           3.14
D         -12345
E    HanXiaoyang
dtype: object
s['A']
7
s[ ['A','D','B'] ]
A          7
D     -12345
B    Beijing
dtype: object
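
To make the dict analogy concrete, here is a small sketch using a shortened version of the Series above. It shows what a Series shares with a dict (lookup by key, iteration over pairs) and where it differs (selecting several labels at once).

```python
import pandas as pd

s = pd.Series([7, 'Beijing', 3.14], index=['A', 'B', 'C'])

# Like a dict: look up by key and iterate over (label, value) pairs
print(s['B'])
for label, value in s.items():
    print(label, value)

# Unlike a dict: a list of labels selects several items, in the order given
picked = s.loc[['C', 'A']].tolist()
print(picked)

# A Series converts back to a plain dict when needed
as_dict = s.to_dict()
print(as_dict)
```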

We can build a Series from a list and specify its index at the same time. We can also initialize a Series from a dict, because a Series is itself a key-value structure.

cities = {
   'Beijing':55000, 'ShangHai':60000, 'Shenzhen':50000, 'Hangzhou':30000, 'Guangzhou':40000, 'Suzhou':None}
cities
{'Beijing': 55000,
 'Guangzhou': 40000,
 'Hangzhou': 30000,
 'ShangHai': 60000,
 'Shenzhen': 50000,
 'Suzhou': None}
apt = pd.Series(cities, name='income')
apt
Beijing      55000.0
Guangzhou    40000.0
Hangzhou     30000.0
ShangHai     60000.0
Shenzhen     50000.0
Suzhou           NaN
Name: income, dtype: float64
# Indexing
apt['Guangzhou']
40000.0
apt.iloc[1]
40000.0
apt[1:]
Guangzhou    40000.0
Hangzhou     30000.0
ShangHai     60000.0
Shenzhen     50000.0
Suzhou           NaN
Name: income, dtype: float64
apt[:-1]
Beijing      55000.0
Guangzhou    40000.0
Hangzhou     30000.0
ShangHai     60000.0
Shenzhen     50000.0
Name: income, dtype: float64
apt.iloc[[3,4,1]]
ShangHai     60000.0
Shenzhen     50000.0
Guangzhou    40000.0
Name: income, dtype: float64
apt[ ['ShangHai', 'Shenzhen', 'Guangzhou'] ]
ShangHai     60000.0
Shenzhen     50000.0
Guangzhou    40000.0
Name: income, dtype: float64
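
Square brackets on a Series can mean either a label or a position, which is easy to misread. The sketch below spells out the intent with `.loc` (labels) and `.iloc` (positions). Two caveats, stated as assumptions about your environment: the outputs above list the cities alphabetically, which is how older pandas versions ordered a Series built from a dict, while recent versions keep the dict's insertion order; and recent versions warn when an integer inside plain `[]` is treated as a position on a label index. The `sort_index()` call here is added only to reproduce the alphabetical order.

```python
import pandas as pd

cities = {'Beijing': 55000, 'ShangHai': 60000, 'Shenzhen': 50000,
          'Hangzhou': 30000, 'Guangzhou': 40000, 'Suzhou': None}
# sort_index() reproduces the alphabetical order shown in the outputs above
apt = pd.Series(cities, name='income').sort_index()

by_label = apt.loc['Guangzhou']      # label-based lookup
by_position = apt.iloc[1]            # position-based lookup
print(by_label, by_position)         # 40000.0 40000.0

# Several positions at once, in the order requested
print(apt.iloc[[3, 4, 1]].index.tolist())

# Label slices include both endpoints, unlike positional slices
label_slice = apt.loc['Guangzhou':'ShangHai']
print(label_slice.index.tolist())
```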
# Simple arithmetic
# Broadcasting: the scalar is applied to every element
3*apt
Beijing      165000.0
Guangzhou    120000.0
Hangzhou      90000.0
ShangHai     180000.0
Shenzhen     150000.0
Suzhou            NaN
Name: income, dtype: float64
apt/2.5
Beijing      22000.0
Guangzhou    16000.0
Hangzhou     12000.0
ShangHai     24000.0
Shenzhen     20000.0
Suzhou           NaN
Name: income, dtype: float64
# A plain Python list does not support element-wise arithmetic
my_list = [2,4,6,8,10]
my_list/2
---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-29-39aba40a404f> in <module>()
----> 1 my_list/2


TypeError: unsupported operand type(s) for /: 'list' and 'int'
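
For comparison, here is a short sketch of two ways to get the element-wise result that the list division above could not produce: a list comprehension in plain Python, and a NumPy array, which broadcasts the scalar the same way a Series does.

```python
import numpy as np

my_list = [2, 4, 6, 8, 10]

# Plain Python: divide each element explicitly
halved_list = [x / 2 for x in my_list]

# NumPy: the array broadcasts the scalar across all elements
halved_array = np.array(my_list) / 2

print(halved_list)            # [1.0, 2.0, 3.0, 4.0, 5.0]
print(halved_array.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]
```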
apt[1:]
Guangzhou    40000.0
Hangzhou     30000.0
ShangHai     60000.0
Shenzhen     50000.0
Suzhou           NaN
Name: income, dtype: float64
apt[:-1]
Beijing      55000.0
Guangzhou    40000.0
Hangzhou     30000.0
ShangHai     60000.0
Shenzhen     50000.0
Name: income, dtype: float64
# Arithmetic is aligned on the index: values are matched by label, not by position
apt[1:] + apt[:-1]
Beijing           NaN
Guangzhou     80000.0
Hangzhou      60000.0
ShangHai     120000.0
Shenzhen     100000.0
Suzhou            NaN
Name: income, dtype: float64
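
The NaN entries for Beijing and Suzhou appear because each label exists in only one of the two operands. A minimal sketch of this behavior, using a made-up three-city Series so the numbers are easy to check, and showing `Series.add` with `fill_value` as one way to treat a missing side as 0:

```python
import pandas as pd

a = pd.Series({'Beijing': 1.0, 'Guangzhou': 2.0, 'Hangzhou': 3.0})
left = a.iloc[1:]     # Guangzhou, Hangzhou
right = a.iloc[:-1]   # Beijing, Guangzhou

# The + operator: labels present on only one side become NaN
plain = left + right

# add() with fill_value: a label missing on one side counts as 0
filled = left.add(right, fill_value=0)

print(plain)
print(filled)
```

Whether 0 is a sensible substitute depends on the data; for incomes it would understate the sum, so NaN is often the more honest result.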
# `in` tests whether a label exists in the index
'Hangzhou' in apt
True
'Chongqing' in apt
False
# apt['Chongqing'] would raise a KeyError; get() returns None instead
print(apt.get('Chongqing'))
None
print(apt.get('Guangzhou'))
40000.0
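
The difference between bracket lookup and `get()` can be sketched as follows, with a two-city Series for brevity. `get()` also accepts a default value to return when the label is absent, just like `dict.get`.

```python
import pandas as pd

apt = pd.Series({'Beijing': 55000.0, 'Guangzhou': 40000.0})

# Bracket lookup raises KeyError for a label that is not in the index
try:
    apt['Chongqing']
    raised = False
except KeyError:
    raised = True

# get() returns None, or the supplied default, instead of raising
missing = apt.get('Chongqing')
fallback = apt.get('Chongqing', 0.0)

print(raised, missing, fallback)  # True None 0.0
```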
Boolean indexing (conditional selection)
apt>=40000
Beijing       True
Guangzhou     True
Hangzhou     False
ShangHai      True
Shenzhen      True
Suzhou       False
Name: income, dtype: bool
# Select the entries where the condition is True
apt[apt>=40000]
Beijing      55000.0
Guangzhou    40000.0
ShangHai     60000.0
Shenzhen     50000.0
Name: income, dtype: float64
# Summary statistics
apt.mean()
47000.0
apt.median()
50000.0
apt.max()
60000.0
apt.min()
30000.0
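
Note that the mean above is 47000.0 even though Suzhou is NaN: these statistics skip missing values by default. The sketch below reproduces this with the same six income values (index omitted) and shows the `skipna=False` option, which lets the NaN propagate instead.

```python
import numpy as np
import pandas as pd

income = pd.Series([55000, 40000, 30000, 60000, 50000, np.nan])

mean_skip = income.mean()              # NaN skipped: 235000 / 5
mean_keep = income.mean(skipna=False)  # NaN propagates into the result
n_valid = income.count()               # counts non-missing values only

print(mean_skip)             # 47000.0
print(mean_keep)             # nan
print(n_valid, len(income))  # 5 6
```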
Assigning to a Series
apt
Beijing      55000.0
Guangzhou    40000.0
Hangzhou     30000.0
ShangHai     60000.0
Shenzhen     50000.0
Suzhou           NaN
Name: income, dtype: float64
apt['Shenzhen'] = 70000
apt
Beijing      55000.0
Guangzhou    40000.0
Hangzhou     30000.0
ShangHai     60000.0
Shenzhen     70000.0
Suzhou           NaN
Name: income, dtype: float64
# Conditional assignment
apt[apt<=40000] = 45000
apt
Beijing      55000.0
Guangzhou    45000.0
Hangzhou     45000.0
ShangHai     60000.0
Shenzhen     70000.0
Suzhou           NaN
Name: income, dtype: float64
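
The conditional assignment above modifies `apt` in place. When the original values should be kept, `Series.mask` is one alternative: it returns a new Series with the values replaced where the condition is True. A sketch with a four-city subset of the data:

```python
import numpy as np
import pandas as pd

apt = pd.Series({'Beijing': 55000.0, 'Guangzhou': 40000.0,
                 'Hangzhou': 30000.0, 'Suzhou': np.nan}, name='income')

# mask() replaces values where the condition holds and leaves apt untouched
raised = apt.mask(apt <= 40000, 45000)

print(raised)
print(apt['Guangzhou'])  # still 40000.0
```

Comparisons against NaN evaluate to False, so Suzhou is not replaced; this matches the in-place version, where Suzhou also stays NaN.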
type(apt)
pandas.core.series.Series
# More advanced math: NumPy functions apply element-wise to a Series
np.log(apt)
Beijing      10.915088
Guangzhou    10.714418
Hangzhou     10.714418
ShangHai     11.002100
Shenzhen     11.156251
Suzhou             NaN
Name: income, dtype: float64
cars = pd.Series({
    'Beijing': 350000, 'ShangHai': 400000, 'Shenzhen': 300000,
    'Tianjin': 200000, 'Guangzhou': 250000, 'Chongqing': 150000
})
cars
Beijing      350000
Chongqing    150000
Guangzhou    250000
ShangHai     400000
Shenzhen     300000
Tianjin      200000
dtype: int64
expense = cars + 10*apt
expense
Beijing       900000.0
Chongqing          NaN
Guangzhou     700000.0
Hangzhou           NaN
ShangHai     1000000.0
Shenzhen     1000000.0
Suzhou             NaN
Tianjin            NaN
dtype: float64
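
Half of the rows in `expense` are NaN because those cities appear in only one Series, or have no income value. If only the cities with both values are of interest, one option is to restrict both operands to their shared labels first. The sketch below rebuilds the two Series with the values shown above; `common` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

cars = pd.Series({'Beijing': 350000, 'ShangHai': 400000, 'Shenzhen': 300000,
                  'Tianjin': 200000, 'Guangzhou': 250000, 'Chongqing': 150000})
apt = pd.Series({'Beijing': 55000, 'Guangzhou': 45000, 'Hangzhou': 45000,
                 'ShangHai': 60000, 'Shenzhen': 70000, 'Suzhou': np.nan},
                name='income')

# Labels present in both indexes
common = cars.index.intersection(apt.index)
expense = cars.loc[common] + 10 * apt.loc[common]
print(expense)

# Equivalent here: compute on everything, then drop the NaN rows
same = (cars + 10 * apt).dropna()
print(sorted(same.index) == sorted(expense.index))  # True
```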
Missing data
'Hangzhou' in apt
True
'Hangzhou' in cars
False
apt
Beijing      55000.0
Guangzhou    45000.0
Hangzhou     45000.0
ShangHai     60000.0
Shenzhen     70000.0
Suzhou           NaN
Name: income, dtype: float64
# Returns a boolean Series: True where a value is present
apt.notnull()
Beij
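
A boolean result like the one from `notnull()` is typically used as a mask. The sketch below, on a three-city subset of `apt`, shows a few common ways to handle the missing entry: filtering it out, dropping it, or filling it. Filling with the mean is only one possible choice and is an assumption of this example, not a recommendation from the tutorial.

```python
import numpy as np
import pandas as pd

apt = pd.Series({'Beijing': 55000.0, 'Hangzhou': 45000.0, 'Suzhou': np.nan},
                name='income')

present = apt[apt.notnull()]      # keep rows that have a value
dropped = apt.dropna()            # same result via dropna()
filled = apt.fillna(apt.mean())   # replace NaN with the mean of the rest

print(apt.isnull().sum())         # 1
print(present.index.tolist())     # ['Beijing', 'Hangzhou']
print(filled['Suzhou'])           # 50000.0
```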