Introduction to Data Science

Overview

Data science has many powerful libraries to draw on, such as NumPy, Matplotlib, and NetworkX. All of them are practical, everyday tools.


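NumPy

Creating and inspecting NumPy arrays

NumPy is an extension library for Python. We mainly use it for operations on multidimensional arrays and matrices.

First, import it into Python. To keep later calls short, we import it under the alias np: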

import numpy as np

The most useful thing in NumPy is, naturally, the NumPy array. For example:

numpy_array=np.array([[1,3,5],[2,4,6]])
print(numpy_array)
print("The shape of this array is:", numpy_array.shape)
first_element = numpy_array[0, 0]
first_column = numpy_array[:, 0]
second_row = numpy_array[1, :]
print("The first element is", first_element)
print("The first column is", first_column)
print("The second row is", second_row)

The output is:

[[1 3 5]
 [2 4 6]]
The shape of this array is: (2, 3)
The first element is 1
The first column is [1 2]
The second row is [2 4 6]

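A few points worth noting:

1. To create an array object, simply call np.array()

2. The shape of an array tells you its dimensions (rows × columns)

3. Arrays are written with [], unlike C, which uses {}

4. To access an element, use array[i, j]

5. To access a row, use array[i, :]

6. To access a column, use array[:, j]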
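The indexing rules above can be tried on a small array. This is only a minimal sketch: the variable names are made up for illustration, and the negative index is an extra that the notes above do not cover.

```python
import numpy as np

arr = np.array([[1, 3, 5], [2, 4, 6]])

rows, cols = arr.shape      # (2, 3): 2 rows, 3 columns
element = arr[1, 2]         # row 1, column 2 -> 6
first_row = arr[0, :]       # -> [1 3 5]
last_column = arr[:, -1]    # negative indices count from the end -> [5 6]

print(rows, cols)
print(element)
print(first_row)
print(last_column)
```

Indices start at 0, so arr[1, 2] is the third element of the second row.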


NumPy array operations

Operations between arrays are element-wise (each element is combined with the element at the same position). For example:

zeros = np.zeros(5)
ones = np.ones(5)
print(zeros + ones)
print(np.exp(zeros) + ones)

The output is:

[ 1.  1.  1.  1.  1.]
[ 2.  2.  2.  2.  2.]

Note: exp is the exponential function with base e.
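To make the element-wise behavior easier to see, here is a small sketch with arrays whose values differ. The arrays a and b are illustrative only. It also contrasts the element-wise product with the dot product, since the two are easy to confuse.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])

total = a + b                    # element-wise sum: [11. 22. 33.]
product = a * b                  # element-wise product: [10. 40. 90.]
dot = a.dot(b)                   # dot product, a single number: 140.0
exp_zero = np.exp(np.zeros(3))   # e**0 is 1 for every element

print(total)
print(product)
print(dot)
print(exp_zero)
```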


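Generating random NumPy arrays

You can generate a random integer array by specifying the low and high bounds of the elements (high is exclusive) and the size. For example: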

A = np.random.randint(low=0, high=10, size=(5, 5))
b = np.random.randint(low=0, high=10, size=(5, 1))
print("A =\n", A)
print("b =\n", b)
print("The vectorial product Ab is\n", A.dot(b))
print("The solution to Ax = b is x =\n", np.linalg.solve(A, b))

One possible output is:

A =
 [[6 6 6 9 6]
 [9 3 3 2 8]
 [2 4 5 1 3]
 [0 3 6 6 5]
 [4 6 7 7 1]]
b =
 [[4]
 [3]
 [0]
 [8]
 [8]]
The vectorial product Ab is
 [[162]
 [125]
 [ 52]
 [ 97]
 [ 98]]
The solution to Ax = b is x =
 [[ 1.36801132]
 [-4.63906582]
 [ 3.57891012]
 [ 0.90021231]
 [-0.99150743]]

Note:

1. The product Ab is written A.dot(b)

2. To solve the equation Ax = b, use np.linalg.solve(A, b) (linalg = linear algebra)
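One way to sanity-check a solution is to multiply it back and compare with b. The sketch below uses a small fixed system (illustrative values, not from the example above) so the result can be checked by hand. It also shows that np.linalg.solve raises LinAlgError for a singular matrix, which can happen with randomly generated integer matrices.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([[3.0],
              [5.0]])

x = np.linalg.solve(A, b)                 # x = [[0.8], [1.4]]
residual_ok = np.allclose(A.dot(x), b)    # multiplying back recovers b

# A singular matrix (second row is twice the first) has no unique solution
singular = np.array([[1.0, 2.0],
                     [2.0, 4.0]])
try:
    np.linalg.solve(singular, b)
    singular_detected = False
except np.linalg.LinAlgError:
    singular_detected = True

print(x)
print(residual_ok, singular_detected)
```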


Computing the mean and variance of a NumPy array

Use mean() to compute the mean and var() to compute the variance:

print("Mean:", np.mean(b))
print("Variance:", np.var(b))

The output is:

Mean: 4.6
Variance: 9.44

Both functions also accept a number of optional parameters. See the NumPy documentation for details.
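Two of those optional parameters come up often: axis, which selects the dimension to reduce over, and ddof, which for var() switches between the population variance and the sample variance. A minimal sketch with made-up data:

```python
import numpy as np

data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])

overall_mean = np.mean(data)            # mean of all six values: 3.5
column_means = np.mean(data, axis=0)    # one mean per column: [2.5 3.5 4.5]
row_means = np.mean(data, axis=1)       # one mean per row: [2. 5.]

population_var = np.var(data[0])        # divides by N: 2/3
sample_var = np.var(data[0], ddof=1)    # divides by N - 1: 1.0

print(overall_mean, column_means, row_means)
print(population_var, sample_var)
```

By default var() divides by N, which is why the variance printed in the example above is the population variance.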


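NumPy random seeds

As we know, the numbers generated from a random seed are pseudorandom: every time the seed is reset to the same value, the generator produces the same sequence.

We can use this to generate the same random numbers again, as follows: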

np.random.seed(0)
print(np.random.permutation(10))
# Reset the seed
np.random.seed(0)
print(np.random.permutation(10))
# The two above permutations are the same
# The sequence of the random generator is now initialized
print(np.random.permutation(10))
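The claim in the comments can be checked in code rather than by eye. The sketch below does that with np.array_equal. It also shows np.random.default_rng, a generator object available in NumPy 1.17 and later; this is an alternative to the global seed and is not what the example above uses.

```python
import numpy as np

np.random.seed(0)
first = np.random.permutation(10)
np.random.seed(0)                        # reset the seed
second = np.random.permutation(10)
third = np.random.permutation(10)        # no reset: continues the sequence

same_after_reset = np.array_equal(first, second)

# Alternative: independent generator objects seeded with the same value
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)
same_generators = np.array_equal(rng_a.permutation(10), rng_b.permutation(10))

print(same_after_reset, same_generators)
```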


Matplotlib

Matplotlib is a plotting library that works with NumPy arrays and is used to visualize data.

Again, start by importing it:

import matplotlib.pyplot as plt

Then call plot to add points to the figure. For the parameters, see http://matplotlib.org/api/pyplot_api.html?highlight=plot#matplotlib.pyplot.plot

Other statements are needed to draw the axes, set sizes, and so on, such as:

  • title
  • x and y labels
  • legend
  • caption

Example:

x = np.linspace(start=0.01, stop=10, num=800)
# Generate some data
sin = np.sin(x)
lin = -2 + x
log = np.log(x)
# Create a new figure
plt.figure(figsize=(10, 8))
# Plot the data
plt.plot(x, sin, 'r')
plt.plot(x, lin, 'g')
plt.plot(x, log, 'b+')
# Customize the plot
plt.title("Some functions", fontsize=18)
plt.xlabel("x", fontsize=20)
plt.ylabel("f(x)", fontsize=20)
plt.legend(["sinusoid", "linear", "logarithm"], loc='lower right', fontsize=14)
plt.grid()
# Display the result
plt.show()

That is all it takes to draw the figure!
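The same kind of plot can also be written with Matplotlib's object-oriented interface, where the title, labels, and legend are set on an Axes object. The sketch below is one possible variant, not the approach used above. It assumes a non-interactive Agg backend and writes the image to an in-memory buffer, so it runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: no display available, render off-screen
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0.01, 10, 200)

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x, np.sin(x), "r", label="sinusoid")
ax.plot(x, np.log(x), "b", label="logarithm")
ax.set_title("Some functions")
ax.set_xlabel("x")
ax.set_ylabel("f(x)")
ax.legend(loc="lower right")
ax.grid(True)

# Save as PNG into memory; pass a file name instead to write to disk
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```

Giving each curve a label when it is plotted keeps the legend in sync with the lines, instead of relying on the order of a separate list.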


NetworkX

As the name suggests, it is mainly used to create, manipulate, and study complex networks.

First, import it:

import networkx as nx

Then we can load a graph from a local .nx file (for example, one stored as an adjacency list):

G = nx.read_adjlist('data/adjacency_list.nx')
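For readers without such a file, the adjacency list format can be tried with nx.parse_adjlist, which reads the same format from a list of strings. Each line names a node followed by its neighbors. The data and the name example_graph below are made up for illustration and are not the contents of the file above.

```python
import networkx as nx

# Each line: a source node, then the nodes it is connected to
lines = [
    "a b c",   # a is connected to b and c
    "b d",
    "c d",
    "d e",
]
example_graph = nx.parse_adjlist(lines)

print(sorted(example_graph.nodes()))     # ['a', 'b', 'c', 'd', 'e']
print(example_graph.number_of_edges())   # 5
```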

The graph can then be drawn directly:

nx.draw_networkx(G)
plt.axis('off')  # hide the axes

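Next come the most important functions for modifying and querying the graph:

Add a node: G.add_node('g')

Add an edge: G.add_edge('g', 'a')

Return the diameter of the graph: nx.diameter(G)

Find the shortest path between two given nodes: nx.shortest_path(G, 'g', 'd')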
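These calls can be tried on a small graph built by hand. The sketch below uses a simple path graph as a stand-in, not the graph loaded from the file. It also shows that nx.diameter raises an error while the graph is disconnected, which is the case right after a node is added without any edge.

```python
import networkx as nx

G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("c", "d")])  # path a-b-c-d

G.add_node("g")                # 'g' is isolated for now
try:
    nx.diameter(G)
    disconnected_error = False
except nx.NetworkXError:       # diameter is undefined for a disconnected graph
    disconnected_error = True

G.add_edge("g", "a")           # now the path is g-a-b-c-d

diameter = nx.diameter(G)               # longest shortest path: 4
path = nx.shortest_path(G, "g", "d")    # ['g', 'a', 'b', 'c', 'd']

print(disconnected_error, diameter, path)
```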

More examples: https://networkx.readthedocs.io/en/stable/examples/index.html


A later post will walk through some examples of data analysis with NumPy :)

