Work Environment Setup and Data Analysis Basics

Latest recommended article published on 2023-10-18 06:15:00

张金玉


Category column: Advanced Python for Artificial Intelligence

Article link: https://blog.csdn.net/xsjzdrxsjzdr/article/details/81003440


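Scientific Computing Library NumPy

Data Science

Operations involved in data science, according to David Donoho:

1. Data exploration and preparation
• Data manipulation, cleaning, and so on
2. Data representation and transformation
• Working with data in different formats: tabular, image, text, and so on
3. Computing with data
• Computing on and analyzing data with a programming language (Python or R)
4. Data modeling
• Machine learning models for prediction, clustering, and so on
5. Data visualization and presentation
• Plots, interactive graphics, animation, and so on
6. Subject knowledge involved in data science

Data analysis:

The process of analyzing large amounts of collected data with appropriate statistical methods, extracting useful information, forming conclusions, and studying and summarizing the data in detail.

The purpose of data analysis: discover patterns in the data, test hypotheses, and make predictions.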

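Work Environment Setup

Installing Anaconda

• Anaconda is a scientific-computing distribution of Python. It bundles hundreds of commonly used Python libraries, including machine learning and data mining libraries such as Scikit-learn, NumPy, SciPy, and Pandas; some of these may also be dependencies of TensorFlow.
• Anaconda provides a precompiled environment that can be installed directly.
• Anaconda automatically integrates the latest version of MKL (Math Kernel Library), which accelerates matrix and linear algebra operations.
• Anaconda: https://www.continuum.io/downloads
• Download the 64-bit Python 3.x version for your operating system. Version 3.5 or later is recommended; do not use Python 2.7 (it is no longer officially updated and performs poorly).

Encoding problems do not come up in version 3.5.

The Anaconda installer also installs Python.

When installing Anaconda, select the option "Add Anaconda to the system PATH environment variable".

If Anaconda is missing a package, install it with the commands below.

Python package management

• Install: pip install xxx, conda install xxx
• Uninstall: pip uninstall xxx, conda uninstall xxx
• Upgrade: pip install --upgrade xxx, conda update xxx
• Detailed usage: https://pip.pypa.io/en/stable/reference/

Python virtual environments

• Virtualenv: https://virtualenv.pypa.io/en/stable/userguide/
• conda virtual environments: https://conda.io/docs/using/envs.html

Managing multiple Python versions

• Managing with conda: https://conda.io/docs/py2or3.html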

MacBook-Pro:~$ python

Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 12:04:33)

[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin

Type "help", "copyright", "credits" or "license" for more information.
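Besides reading the interpreter banner, the active installation can also be checked from inside Python. The following is only a small sketch using the standard library; the variable names are illustrative and not part of the original material.

```python
import platform
import sys

# Report which interpreter is running and where it is installed.
version = platform.python_version()
print('Python version:', version)
print('Interpreter path:', sys.executable)

# The material in this post assumes Python 3.
is_python3 = sys.version_info.major == 3
print('Running Python 3:', is_python3)
```

If the interpreter path does not point into the Anaconda directory, the PATH setting chosen during installation is a likely place to look.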

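IDE

• Jupyter notebook
1. Bundled with Anaconda; no separate installation needed
2. Records your thought process and shows results as the code runs
3. A web-based editor (running locally)
4. Notebooks can be shared as .ipynb files
5. Interactive
6. Keeps a record of past run results
7. Supports Markdown and LaTeX

• IPython
1. Bundled with Anaconda; no separate installation needed
2. An interactive command-line shell for Python

• IDE: there is no best one, only the one that suits you best (pick any one of the following)

PyCharm Community Edition: partly free; sufficient for development that does not involve the web; suitable for most developers
https://www.jetbrains.com/pycharm/download/

Eclipse + PyDev: completely free; suits developers familiar with Eclipse or Java
1. Eclipse, https://eclipse.org/downloads/
2. PyDev plugin, https://marketplace.eclipse.org/content/pydev-python-ide-eclipse

Spyder: completely free; suits developers familiar with Matlab
https://github.com/spyder-ide/spyder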

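PyCharm configuration

• Download: https://www.jetbrains.com/pycharm/

When creating a new project in PyCharm, choose the interpreter located at anaconda/bin/python.

Python 3.6 and Anaconda do not need to be installed separately, because Anaconda already includes Python 3.6. If Python 2.7 is installed, uninstall it.

File variable settings

Viewing the code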

zhangjinyudeMacBook-Pro:~ zhangjinyu$ pwd

/Users/zhangjinyu

zhangjinyudeMacBook-Pro:~ zhangjinyu$ cd /Users/zhangjinyu/Desktop/人工智能python/工作环境准备及数据分析基础/lect01_codes/lect01_eg

zhangjinyudeMacBook-Pro:lect01_eg zhangjinyu$ jupyter notebook

[I 18:24:46.792 NotebookApp] JupyterLab beta preview extension loaded from /anaconda3/lib/python3.6/site-packages/jupyterlab

[I 18:24:46.792 NotebookApp] JupyterLab application directory is /anaconda3/share/jupyter/lab

[I 18:24:46.797 NotebookApp] Serving notebooks from local directory: /Users/zhangjinyu/Desktop/人工智能python/工作环境准备及数据分析基础/lect01_codes/lect01_eg

[I 18:24:46.797 NotebookApp] 0 active kernels

[I 18:24:46.798 NotebookApp] The Jupyter Notebook is running at:

[I 18:24:46.798 NotebookApp] http://localhost:8888/?token=9d8cb41b88dbc2bc83810cbdf8d96fbc6060dc8fea0a96c4

[I 18:24:46.798 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).

[C 18:24:46.800 NotebookApp]

Copy/paste this URL into your browser when you connect for the first time,

to login with a token:

http://localhost:8888/?token=9d8cb41b88dbc2bc83810cbdf8d96fbc6060dc8fea0a96c4

[I 18:24:47.097 NotebookApp] Accepting one-time-token-authenticated connection from ::1

The notebook page then opens in the browser:

http://localhost:8888/tree

List: ----> [1, 'a', 2, 'b']

Set: ----> my_set = {1, 2, 3}

Dictionary: ---->

d = {'小象学院': 'http://www.chinahadoop.cn/',
     '百度': 'https://www.baidu.com/',
     '阿里巴巴': 'https://www.alibaba.com/',
     '腾讯': 'https://www.tencent.com/'}

Tuple: ----> t = (1, 'a', 2, 'b')

1. Conditional expressions

import math

def get_log(x):
    # Ordinary if/else version
    if x > 0:
        y = math.log(x)
    else:
        y = float('nan')
    return y

Improved version, using a conditional expression:

y = math.log(x) if x > 0 else float('nan')

Calling both versions:

x = 5
log_val1 = get_log(x)
# Using a conditional expression
log_val2 = math.log(x) if x > 0 else float('nan')
print(log_val1)
print(log_val2)

2. List comprehensions

print('Even numbers below 1000 (for loop):')
l1 = []
for i in range(1000):
    if i % 2 == 0:
        l1.append(i)
print(l1)

# Improved version:
print('Even numbers below 1000 (list comprehension):')
l2 = [i for i in range(1000) if i % 2 == 0]
print(l2)

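3. Common Python container types

# List
l = [1, 'a', 2, 'b']
print(type(l))
print('Before modification:', l)

# Modify the contents of the list
l[0] = 3
print('After modification:', l)

# Append an element at the end
l.append(4)
print('After appending:', l)

# Iterate over the list
print('Iterating over the list (for loop):')
for item in l:
    print(item)

# Iterate over the list by index
print('Iterating over the list (while loop):')
i = 0
while i != len(l):
    print(l[i])
    i += 1

# Concatenate lists
print('List concatenation (+):', [1, 2] + [3, 4])

# Repeat a list
print('List repetition (*):', [1, 2] * 5)

# Test whether an element is in a list
print('Membership test (in):', 1 in [1, 2])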

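Tuple

t = (1, 'a', 2, 'b')
print(type(t))

# The contents of a tuple cannot be modified; trying to do so raises an error
# t[0] = 3

# Iterate over the tuple
print('Iterating over the tuple (for loop):')
for item in t:
    print(item)

# Iterate over the tuple by index
print('Iterating over the tuple (while loop):')
i = 0
while i != len(t):
    print(t[i])
    i += 1

# Unpacking
a, b, _, _ = t
print('unpack: ', a, b)

# Make sure the number of receiving variables equals the length of the tuple; otherwise an error is raised
# This often comes up when assigning the return values of a function
# a, b, c = t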

Dictionary

d = {'小象学院': 'http://www.chinahadoop.cn/',
     '百度': 'https://www.baidu.com/',
     '阿里巴巴': 'https://www.alibaba.com/',
     '腾讯': 'https://www.tencent.com/'}

print('Get a value by key: ', d['小象学院'])

# Iterate over the keys
print('Iterating over keys: ')
for key in d.keys():  # differs from Python 2
    print(key)

# Iterate over the values
print('Iterating over values: ')
for value in d.values():  # differs from Python 2
    print(value)

# Iterate over the items
print('Iterating over items: ')  # returns key and value together
for key, value in d.items():
    print(key + ': ' + value)

# Output with format
print('Output with format:')
for key, value in d.items():
    print('The URL of {} is {}'.format(key, value))

# Set
print('Creating a set:')
my_set = {1, 2, 3}
print(my_set)
my_set = set([1, 2, 3, 2])
print(my_set)

print('Adding a single element:')
my_set.add(3)
print('After adding 3', my_set)
my_set.add(4)
print('After adding 4', my_set)

print('Adding several elements:')
my_set.update([4, 5, 6])
print(my_set)

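4. Counter

• Similar to a multiset in mathematics: the elements of an ordinary set are unique, while a multiset may contain repeated elements
• import collections
• update() updates the contents; note that it "adds" to the counts rather than "replacing" them
• Access a count with [key]
• Note the difference from dict: if the key does not exist in a Counter, 0 is returned, whereas a dict raises an error
• elements() method: returns all the elements
• most_common() method: returns the n most common elements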

import collections

c1 = collections.Counter(['a', 'b', 'c', 'a', 'b', 'b'])
c2 = collections.Counter({'a':2, 'b':3, 'c':1})
c3 = collections.Counter(a=2, b=3, c=1)

print(c1)
print(c2)
print(c3)

Output:

Counter({'b': 3, 'a': 2, 'c': 1})
Counter({'b': 3, 'a': 2, 'c': 1})
Counter({'b': 3, 'a': 2, 'c': 1})

# Update the contents of the Counter
# Note that this "adds" to the counts; it does not "replace" them
c1.update({'a': 4, 'c': -2, 'd': 4})
print(c1)

Output:

Counter({'a': 6, 'd': 4, 'b': 3, 'c': -1})

# Access the contents
print('a=', c1['a'])
print('b=', c1['b'])

# Compare with dict: a missing key returns 0 instead of raising an error
print('e=', c1['e'])

Output:

a= 6
b= 3
e= 0

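# elements() method: visit every element
# (elements whose count is less than 1, such as 'c' here, are skipped)
for element in c1.elements():
    print(element)

Output (the order may differ between Python versions):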

d
d
d
d
b
b
b
a
a
a
a
a
a

Another use of Counter is counting word frequencies, that is, finding which keywords occur most often.

most_common() method

c1.most_common(3)

[('a', 6), ('d', 4), ('b', 3)]
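The word-frequency use mentioned above can be sketched as follows. The sample sentence is made up purely for illustration, and splitting on whitespace is an assumption about how the text is tokenized; real text would usually need punctuation and case handling first.

```python
import collections

# Hypothetical sample text, used only for illustration.
text = 'the quick brown fox jumps over the lazy dog the fox'

# Split on whitespace and count how often each word appears.
word_counts = collections.Counter(text.split())

# The two most frequent words together with their counts.
top_two = word_counts.most_common(2)
print(top_two)  # [('the', 3), ('fox', 2)]
```

Because a missing key yields 0, word_counts['cat'] can be read safely even though 'cat' never appears in the text.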

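5. defaultdict

• In Python, accessing a key that does not exist in a dictionary raises a KeyError. Sometimes it is convenient for every key in a dictionary to have a default value.
• defaultdict is a subclass of Python's built-in dict class. Its first argument supplies the initial value of the default_factory attribute, which defaults to None. It overrides one method and adds one writable instance variable. Its other behavior is the same as dict, but it supplies a default value for a missing key, which avoids the KeyError.
• Python map() function
• map(function, sequence)
• Useful for data cleaning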

import collections

# Count how many times each letter occurs
s = 'chinadoop'

# Using Counter
print(collections.Counter(s))

# Using dict
counter = {}
for c in s:
    if c not in counter:
        counter[c] = 1
    else:
        counter[c] += 1
print(counter.items())

# Using defaultdict
counter2 = collections.defaultdict(int)
for c in s:
    counter2[c] += 1
print(counter2.items())

# Collect the values that share the same key into a list
colors = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]
d = collections.defaultdict(list)
for k, v in colors:
    d[k].append(v)
print(d.items())

6. The map() function

import math

print('Example 1: the smaller value at each position of two lists:')
l1 = [1, 3, 5, 7, 9]
l2 = [2, 4, 6, 8, 10]
mins = map(min, l1, l2)
print(mins)

# map() is lazy: nothing is computed until the data is accessed
for item in mins:
    print(item)

print('Example 2: the square root of each element of a list:')
sqrt_vals = map(math.sqrt, l2)
print(sqrt_vals)
print(list(sqrt_vals))

7. Anonymous functions: lambda

• For simple function operations
• A lambda expression evaluates to a function object
• Can be combined with map() for data cleaning

my_func = lambda a, b, c: a * b
# print(my_func)
# print(my_func(1, 2, 3))

# Combined with map
print('lambda combined with map')
l1 = [1, 3, 5, 7, 9]
l2 = [2, 4, 6, 8, 10]
result = map(lambda x, y: x * 2 + y, l1, l2)
print(list(result))

8. Working with CSV data files in Python

import csv

with open('grades.csv') as csvfile:
    grades_data = list(csv.DictReader(csvfile))

print('Number of records:', len(grades_data))
print('First 2 records:', grades_data[:2])
print('Column names:', list(grades_data[0].keys()))

avg_assign1 = sum([float(row['assignment1_grade']) for row in grades_data]) / len(grades_data)
print('Average assignment1 grade:', avg_assign1)

assign1_sub_month = set(row['assignment1_submission'][:7] for row in grades_data)
print(assign1_sub_month)
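The code above needs a grades.csv file on disk. To try the same steps without that file, an in-memory stand-in can be used. The sketch below assumes a file with the two column names used above plus a hypothetical student_id column; every value in it is made up for illustration.

```python
import csv
import io

# Hypothetical stand-in for grades.csv; the rows are invented.
sample = io.StringIO(
    'student_id,assignment1_grade,assignment1_submission\n'
    'S1,90.0,2015-11-02 06:55:34\n'
    'S2,80.0,2015-11-29 14:57:44\n'
    'S3,70.0,2015-12-06 17:41:18\n'
)

rows = list(csv.DictReader(sample))

# DictReader yields every value as a string, so convert before averaging.
avg_grade = sum(float(row['assignment1_grade']) for row in rows) / len(rows)

# The first 7 characters of the timestamp are the year and month (YYYY-MM).
months = sorted({row['assignment1_submission'][:7] for row in rows})

print(avg_grade)  # 80.0
print(months)     # ['2015-11', '2015-12']
```

Sorting the set of months gives a stable order for printing, since the iteration order of a set is not guaranteed.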

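Scientific Computing Library NumPy

NumPy = Numerical Python

• The foundational package for high-performance scientific computing and data analysis; provides a multidimensional array object
• ndarray: a multidimensional array (matrix) with vectorized operations; fast and memory-efficient
• Matrix operations without loops; supports vectorized operations like those in Matlab
• Linear algebra, random number generation
• import numpy as np

SciPy

• Adds many commonly used mathematical, scientific, and engineering library functions on top of NumPy
• Linear algebra, solving ordinary differential equations, signal processing, image processing, sparse matrices, and so on
• import scipy as sp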

ndarray, the N-dimensional array object (matrix)

• ndim attribute: number of dimensions
• shape attribute: size of each dimension
• dtype attribute: data type
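The three attributes can be inspected on a small array. This is a minimal sketch; the array contents are arbitrary.

```python
import numpy as np

# A 2 x 3 array built from a nested list of floats.
arr = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

print(arr.ndim)   # number of dimensions: 2
print(arr.shape)  # size of each dimension: (2, 3)
print(arr.dtype)  # element data type: float64
```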

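Creating an ndarray

• np.array(collection), where collection is a sequence object (list) or a nested sequence (list of lists)
• np.zeros, np.ones, np.empty: create an array of the given size (all zeros for np.zeros, all ones for np.ones)
• Note: the first argument is a tuple that specifies the size, e.g. (3, 4)
• empty does not always return all zeros; sometimes it returns uninitialized, arbitrary values
• np.arange() is similar to range()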
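The creation functions listed above might be used as in the following sketch. The sizes and values are arbitrary choices for illustration.

```python
import numpy as np

from_list = np.array([1, 2, 3])            # from a list
from_nested = np.array([[1, 2], [3, 4]])   # from a list of lists

zeros = np.zeros((3, 4))   # the size is passed as a tuple
ones = np.ones((2, 3))
empty = np.empty((2, 2))   # uninitialized contents; do not rely on the values

steps = np.arange(0, 10, 2)  # like range(0, 10, 2), but returns an ndarray

print(from_nested.shape)  # (2, 2)
print(zeros.shape)        # (3, 4)
print(steps)              # [0 2 4 6 8]
```

Since np.empty skips initialization, it is only worth using when every element will be overwritten afterwards.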

ndarray data types

• dtype: type name + number of bits, e.g. float64, int32
• Converting the array type
• astype
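A short sketch of converting between data types with astype. The values are arbitrary; note that astype returns a new array and leaves the original unchanged.

```python
import numpy as np

float_arr = np.array([1.7, 2.2, 3.9])
print(float_arr.dtype)  # float64

# Converting float to int drops the fractional part (truncation toward zero).
int_arr = float_arr.astype(np.int32)
print(int_arr)        # [1 2 3]
print(int_arr.dtype)  # int32
```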

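Indexing and slicing

• Indexing a one-dimensional array works much like indexing a Python list
• Indexing a multidimensional array
• arr[r1:r2, c1:c2]
• arr[1, 1] is equivalent to arr[1][1]
• [:] selects all the data along a dimension
• Conditional indexing
• A boolean multidimensional array: arr[condition], where condition can combine several conditions
• Note: combine multiple conditions with & and |, not with and / or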
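The indexing forms above can be tried on a small 3 x 4 array. This is an illustrative sketch; the particular conditions are arbitrary.

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)  # rows: [0..3], [4..7], [8..11]

block = arr[0:2, 1:3]            # rows 0-1, columns 1-2
same = arr[1, 1] == arr[1][1]    # both forms select the same element
first_col = arr[:, 0]            # ':' takes everything along that dimension

# Conditional indexing: combine conditions with & and |,
# and wrap each condition in parentheses.
selected = arr[(arr > 2) & (arr % 2 == 0)]

print(block)      # [[1 2] [5 6]]
print(first_col)  # [0 4 8]
print(selected)   # [ 4  6  8 10]
```

The parentheses matter because & and | bind more tightly than comparison operators in Python.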

Transpose

• arr.transpose() or arr.T

Stacking data

• vstack(), hstack()
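A minimal sketch of transposing and stacking, using two arbitrary 2 x 3 arrays.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])       # shape (2, 3)
b = np.array([[7, 8, 9], [10, 11, 12]])    # shape (2, 3)

transposed = a.T                    # same result as a.transpose()

stacked_rows = np.vstack((a, b))    # stack vertically: more rows
stacked_cols = np.hstack((a, b))    # stack horizontally: more columns

print(transposed.shape)    # (3, 2)
print(stacked_rows.shape)  # (4, 3)
print(stacked_cols.shape)  # (2, 6)
```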

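Common statistical methods

• arr.mean(), arr.sum()
• arr.max(), arr.min()
• arr.std(), arr.var()
• arr.argmax(), arr.argmin()
• arr.cumsum(), arr.cumprod()
• Note: for multidimensional arrays, specify the axis to compute along; otherwise the statistic is computed over all dimensions by default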
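The effect of the axis argument is easiest to see on a small two-dimensional array. The following sketch uses arbitrary values.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

total = arr.sum()              # no axis: all elements, 21
col_sums = arr.sum(axis=0)     # down each column: [5 7 9]
row_means = arr.mean(axis=1)   # across each row: [2. 5.]

max_pos = arr.argmax()         # position in the flattened array: 5
running = arr.cumsum()         # flattened running total

print(total, col_sums, row_means)
print(max_pos, running)
```

Without an axis, argmax and cumsum operate on the flattened array, which is why max_pos is a single integer rather than a (row, column) pair.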

Copying arrays

• arr1 = arr2
• Changes to the data in arr1 also affect arr2, because both names refer to the same array
• Using arr1 = arr2.copy() is recommended
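The difference between plain assignment and copy() can be demonstrated as follows. The variable names alias and independent are illustrative only.

```python
import numpy as np

arr2 = np.array([1, 2, 3])

# Plain assignment creates a second name for the same array.
alias = arr2
alias[0] = 100
print(arr2)  # [100   2   3]

# copy() creates an independent array with its own data.
independent = arr2.copy()
independent[1] = -1
print(arr2)  # [100   2   3]

shares = np.shares_memory(alias, arr2)
print(shares)  # True
```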

arr.all() and arr.any()

• all(): every element satisfies the condition
• any(): at least one element satisfies the condition
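Both methods are typically called on the boolean array produced by a condition. This is a small sketch with arbitrary values.

```python
import numpy as np

arr = np.array([1, 3, 5, 8])

all_positive = (arr > 0).all()     # every element is greater than 0
any_even = (arr % 2 == 0).any()    # at least one element is even
all_odd = (arr % 2 == 1).all()     # fails because of the 8

print(all_positive, any_even, all_odd)  # True True False
```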

np.unique(arr)

• Finds the unique values and returns them sorted
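Unique values are obtained through the module-level function np.unique. The sketch below also uses its return_counts option, which goes slightly beyond the text above, to count how often each value occurs.

```python
import numpy as np

arr = np.array([3, 1, 2, 3, 1])

uniq = np.unique(arr)
print(uniq)  # [1 2 3]

# Optionally also return how many times each unique value occurs.
values, counts = np.unique(arr, return_counts=True)
print(values, counts)  # [1 2 3] [2 1 2]
```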

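Vectorization

• A strategy for getting faster, more compact code
• Basic idea: operate on a complex object, or apply a function to it, "all at once", instead of looping over the individual elements of the object
• At the Python level, the functional programming tools map, filter, and reduce provide a means of vectorization
• At the NumPy level, loops over ndarray objects are handled by highly optimized code, most of it written in C, which is far faster than pure Python
• Array-to-array operations: operations between arrays of the same size are applied element-wise
• Array-and-scalar operations: "broadcasting", where the scalar is "broadcast" to every element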
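The points above can be compared side by side in a short sketch: the same square-root computation at the Python level with map and at the NumPy level, followed by an element-wise and a broadcast operation. The values are arbitrary.

```python
import math

import numpy as np

values = [1.0, 4.0, 9.0, 16.0]

# Python level: map applies the function to one element at a time.
roots_py = list(map(math.sqrt, values))

# NumPy level: a single call operates on the whole array.
arr = np.array(values)
roots_np = np.sqrt(arr)

# Arrays of the same size: the operation is applied element by element.
summed = arr + arr

# Array and scalar: the scalars 2 and 1 are broadcast to every element.
scaled = arr * 2 + 1

print(roots_py)  # [1.0, 2.0, 3.0, 4.0]
print(roots_np)  # [1. 2. 3. 4.]
print(summed)    # [ 2.  8. 18. 32.]
print(scaled)    # [ 3.  9. 19. 33.]
```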
