GNN python packages

Author: 天狼啸月1990

Last modified: 2023-04-09 18:40:07

First published: 2022-10-25 11:33:34

Article link: https://blog.csdn.net/qq_33419476/article/details/127510370

python fundamental functions

enumerate() function

It turns an iterable (such as a list) into an indexed sequence, yielding each item together with its index.
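A minimal sketch of enumerate() in use. The list fruits is hypothetical sample data, not something from the original post.

```python
fruits = ["apple", "banana", "cherry"]  # hypothetical sample data

# enumerate yields (index, item) pairs; counting starts at 0 by default
pairs = list(enumerate(fruits))

# the start argument changes the first index
pairs_from_one = list(enumerate(fruits, start=1))

# typical use: unpack index and item directly in a for loop
for index, fruit in enumerate(fruits):
    print(index, fruit)
```

This avoids keeping a separate counter variable next to the loop.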

os package

import os

# Get the current working directory
os.getcwd()

# Get the parent directory
os.path.abspath(os.path.dirname(os.getcwd()))

# Get the grandparent directory
os.path.abspath(os.path.join(os.getcwd(), "../.."))

numpy package

Creating ndarray arrays with numpy

numpy.zeros() function

zeros(shape, dtype=float, order='C')

Returns an array of the given shape and dtype filled with zeros.

np.zeros(5)
array([ 0.,  0.,  0.,  0.,  0.])

np.zeros((2, 1))
array([[ 0.],
       [ 0.]])

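numpy.empty() function

numpy.empty(shape, dtype=float, order='C')

np.empty() returns an ndarray of the given shape and dtype without initializing its elements, so the values are whatever happened to be in memory.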

import numpy as np
np.empty([2, 2])

[[9.90263869e+067, 8.01304531e+262],
[2.60799828e-310, 0.00000000e+000]]

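numpy.eye() function

Returns a 2-D array with ones on a diagonal and zeros elsewhere (an identity matrix when N == M and k == 0).

numpy.eye(N, M=None, k=0, dtype=float, order='C')
Parameters:
N: int, the number of rows in the output
M: int, optional, the number of columns in the output; defaults to N
k: int, optional, index of the diagonal; 0 (the default) is the main diagonal, a negative value is a lower diagonal and a positive value is an upper diagonal.

The result is a 2-D ndarray of shape (N, M) with ones on the chosen diagonal and zeros elsewhere.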

ar3 = np.eye(3,4,k=2)
ar4 = np.eye(3)
print('ar3:')
print(ar3)
print('ar4')
print(ar4)

np.random.permutation(x) function

Randomly permute a sequence, or return a permuted range.

x: int or array_like

Given an integer n, it returns a random permutation of range(n); given an array, it returns a shuffled copy. A multi-dimensional array is shuffled only along its first axis: the rows change order, but the contents of each row stay the same.

>>> np.random.permutation([1, 4, 9, 12, 15])
array([15,  1,  9,  4, 12])

>>> arr = np.arange(9).reshape((3, 3))
>>> arr
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
>>> np.random.permutation(arr)
array([[6, 7, 8],
       [0, 1, 2],
       [3, 4, 5]])

>>> permutation = list(np.random.permutation(10))
>>> permutation
[5, 1, 7, 6, 8, 9, 4, 0, 2, 3]
>>> Y = np.array([[1,1,1,1,0,0,0,0,0,0]])
>>> Y_new = Y[:, permutation]
>>> Y_new
array([[0, 1, 0, 0, 0, 0, 0, 1, 1, 1]])

numpy.matrix() function

x = np.matrix([[1, 3, 5], [2, 4, 6]])

numpy.arange() function

b = np.arange(12).reshape([4, 3])

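numpy array calculation

Sum: numpy.sum() function

np.sum(a, axis=0) sums down the rows, so the result has a single row and the same number of columns as a.

a.sum(1) (that is, axis=1) sums across each row and returns one value per row; pass keepdims=True to keep the result as a column.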
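The axis argument is easy to mix up, so here is a small sketch on a hypothetical 2x3 array showing what each axis collapses.

```python
import numpy as np

X = np.array([[1, 2, 3],
              [4, 5, 6]])  # hypothetical 2x3 example

col_sums = np.sum(X, axis=0)                    # collapses the rows: one sum per column
row_sums = np.sum(X, axis=1)                    # collapses the columns: one sum per row
row_sums_2d = np.sum(X, axis=1, keepdims=True)  # same values, kept as a (2, 1) column

print(col_sums)           # [5 7 9]
print(row_sums)           # [ 6 15]
print(row_sums_2d.shape)  # (2, 1)
```

A quick way to remember it: the axis you pass is the axis that disappears from the shape.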

Multiply: numpy.matmul() function

numpy.matmul(a, b, out=None)

Computes the matrix product of two numpy arrays.

import numpy as np
a = [[1, 0], [0, 1]]
b = [[4, 1], [2, 2]]
np.matmul(a, b)

array([[4, 1],
[2, 2]])

Multiply: numpy.dot() function

If a and b are both 1-D arrays, it returns the inner product of the two vectors.

import numpy as np
 
c = np.arange(0,9)
d = c[::-1]
 
np.dot(c,d) 
Out[35]: 84

If a and b are both 2-D arrays, it returns their matrix product.

import numpy as np
 
a = [[1, 0], [0, 1]]
b = [[4, 1], [2, 2]]
np.dot(a, b)
Out[1]: 
array([[4, 1],
       [2, 2]])

Divide: numpy.divide() function

Divides two arrays element by element.

Mean: numpy.mean(data, axis=0) function

axis=0 averages over the rows (one mean per column); axis=1 averages over the columns (one mean per row).

import numpy as np
X = np.array([[1, 2], [4, 5], [7, 8]])
print(np.mean(X, axis=0, keepdims=True))
print(np.mean(X, axis=1, keepdims=True))

---------------------------------------
[[4. 5.]]
[[1.5]
 [4.5]
 [7.5]]

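n-th power: numpy.power(x1, n) function

Raises each element of x1 to the power n. n can be a single number or an array whose shape can be broadcast against x1.

>>> x1 = np.arange(6)
>>> x1
array([0, 1, 2, 3, 4, 5])
>>> np.power(x1, 3)
array([  0,   1,   8,  27,  64, 125])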

numpy.logical_not(x) function

Returns the element-wise logical NOT of x as booleans.

numpy.random.choice(a) function

Draws random samples from a (1-D data) and returns an array of the requested size.

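numpy.argmax(a, axis=None, out=None) function

Returns the index of the maximum value.

With axis=1 the comparison runs along each row, giving the index of the maximum in every row.
With axis=0 it gives the index of the maximum in every column.
When axis is omitted, the array is flattened and the index of the overall maximum is returned.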
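The three cases can be checked on a small array. This is an illustrative sketch; the array a is made up, and np.unravel_index is added here only to show how a flattened index maps back to (row, column).

```python
import numpy as np

a = np.array([[1, 9, 3],
              [7, 2, 8]])  # hypothetical 2x3 example

flat_index = np.argmax(a)          # index into the flattened array
per_column = np.argmax(a, axis=0)  # row index of the maximum in each column
per_row = np.argmax(a, axis=1)     # column index of the maximum in each row

# convert the flattened index back into a (row, column) pair
row, col = np.unravel_index(flat_index, a.shape)

print(flat_index, per_column, per_row, (row, col))
```

In classification code, argmax(axis=1) on a matrix of per-class scores is the usual way to get predicted labels.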

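numpy.isinf() function

np.isinf(n) tests whether a value is positive or negative infinity. It works element by element, returning a single boolean for a scalar and a boolean array for an array.

It is the numpy counterpart of math.isinf(), which the example below uses on scalars.

    Input:
    import math

    a = 10
    b = float('inf')
 
    # function call
    print(math.isinf(a))
    print(math.isinf(b))
 
    Output:
    False
    True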

python calculation

pow(x, y) returns x raised to the power y.

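numpy array operations

numpy.squeeze() function

Use case: in machine learning and deep learning, an algorithm often returns a vector stored as a nested array (two or more pairs of brackets, like [[ ]]). Plotting such an array directly may give an empty figure. squeeze() converts it to a rank-1 array, which matplotlib then plots as expected.

numpy.squeeze(a, axis=None)

Removes the length-1 dimensions from the shape of an array, that is, it drops every 1 from shape.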

a  = np.arange(10).reshape(1,10)
a.shape

'''
array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
(1, 10)
'''
---------------------------------------
b = np.squeeze(a)
b.shape
'''
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
(10,)
'''

numpy.where() function

np.where(condition, x, y): where condition holds, take the element from x; otherwise take it from y.

>>> aa = np.arange(10)
>>> np.where(aa, 1, -1)
array([-1,  1,  1,  1,  1,  1,  1,  1,  1,  1])  # 0 is False, so the first output is -1
>>> np.where(aa > 5, 1, -1)
array([-1, -1, -1, -1, -1, -1,  1,  1,  1,  1])

>>> np.where([[True, False], [True, True]],    # example from the official documentation
...          [[1, 2], [3, 4]],
...          [[9, 8], [7, 6]])
array([[1, 8],
       [3, 4]])

np.where(condition): with only a condition, it returns the indices of the elements that satisfy it (same as numpy.nonzero). The indices are returned as a tuple.

>>> a = np.array([2,4,6,8,10])

>>> np.where(a > 5)				# returns the indices as a tuple
'''(array([2, 3, 4]),) '''
>>> np.where(a > 5)[0]				# [0] extracts the index array
'''array([2, 3, 4])'''


>>> a[np.where(a > 5)]  			# equivalent to a[a>5]
array([ 6,  8, 10])

>>> np.where([[0, 1], [1, 0]])
(array([0, 1]), array([1, 0]))

numpy.stack() function

Joins a sequence of arrays along a new axis.

import numpy as np

a = [[0,-1,1],[2,1,3]]
print(a)
c = np.stack(a, axis=0)
print(c)

'''
[[0, -1, 1], [2, 1, 3]]
[[ 0 -1  1]
 [ 2  1  3]]
'''

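numpy.newaxis

np.newaxis is not a function but an index object (an alias for None) that inserts a new axis of length 1.

x[:, np.newaxis] puts the new axis last, turning x into a column
x[np.newaxis, :] puts the new axis first, turning x into a row; it can be shortened to x[np.newaxis]

x = np.array([1, 2, 3, 4])
print(x.shape)  # (4,)

x_row = x[np.newaxis]  # adds a leading axis, shape (1, 4)
x_add = x[:, np.newaxis]  # adds a trailing axis, shape (4, 1)
print(x_add.shape)
print(x_add)
>>>
(4,)
(4, 1)
[[1]
 [2]
 [3]
 [4]]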

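np.array() function, np.asarray() function

np.array copies the object by default and allocates new memory.
np.asarray does not copy when the input is already an ndarray of the requested dtype; it returns a reference to the same memory.

np.asarray converts structured data to an ndarray, for example a list into an np.ndarray (a list always has to be converted, so a new array is created in that case).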

import numpy as np
a = np.array([0,1,2,-1,-2])
'''
array([ 0,  1,  2, -1, -2])
'''
np.array(a>0)
'''
array([False,  True,  True, False, False])
'''
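The copy versus no-copy difference only shows up when the input is already an ndarray. A minimal sketch, with made-up values:

```python
import numpy as np

a = np.array([0, 1, 2])

copied = np.array(a)     # np.array copies by default
aliased = np.asarray(a)  # already an ndarray of the right dtype: no copy is made

a[0] = 100               # modify the original in place

print(copied[0])                    # the copy is unaffected
print(aliased[0])                   # the alias sees the change
print(np.shares_memory(a, aliased))
```

So np.asarray is the cheaper choice for validating inputs, as long as the caller does not rely on getting an independent copy.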

np.array() converts a list into an array; only then does multi-dimensional slicing such as [:, 0] work.

anchor_positives = [(0, 2), (0, 3), (2, 3)]
anchor_positives[:,0]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [139], in <cell line: 2>()
      1 anchor_positives = [(0, 2), (0, 3), (2, 3)]
----> 2 anchor_positives[:,0]

TypeError: list indices must be integers or slices, not tuple

The list has to be converted to an np.array() first

anchor_positives = np.array([(0, 2), (0, 3), (2, 3)])
anchor_positives[:,0]
---------------------------------
array([0, 0, 2])

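ndarray.flatten() method

flatten is a method of numpy.ndarray (and numpy.matrix) objects that returns a flattened copy; a plain list has no such method.

a.flatten(): when a is an array, a.flatten() reduces it to one dimension, row by row by default.
a.flatten().A: when a is a matrix, the flattened result is still a matrix; .A (same as .getA()) turns the matrix into an array.

>>> from numpy import mat, shape
>>> a = [[1,3],[2,4],[3,5]]
>>> a = mat(a)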
>>> y = a.flatten()
>>> y
matrix([[1, 3, 2, 4, 3, 5]])
>>> y = a.flatten().A
>>> y
array([[1, 3, 2, 4, 3, 5]])
>>> shape(y)
(1, 6)
>>> shape(y[0])
(6,)
>>> y = a.flatten().A[0]
>>> y
array([1, 3, 2, 4, 3, 5])

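numpy.ix_ function

Takes two 1-D arrays and builds an index for their Cartesian product (in effect a grid of 2-D coordinates, which is exactly what is needed to extract a sub-adjacency matrix from a graph).

arr = np.arange(32).reshape((8,4))
arr[np.ix_([1,5,7,2],[0,3,1,2])]

'''
np.ix_ forms the Cartesian product of [1,5,7,2] and [0,3,1,2],
which gives
(1,0), (1,3), (1,1), (1,2);
(5,0), (5,3), (5,1), (5,2);
(7,0), (7,3), (7,1), (7,2);
(2,0), (2,3), (2,1), (2,2);
Coordinates (1,0), (1,3), (1,1), (1,2) select the elements 4, 7, 5, 6 of arr,
coordinates (5,0), (5,3), (5,1), (5,2) select the elements 20, 23, 21, 22
'''

--> Extension: what the Cartesian product means here

The adjacency matrix A holds the structural information for the nodes of every type in the graph.
index -> the index of each node; within one node type the order stands for 0, 1, 2, 3, ...
Cartesian product -> generates 2-D coordinates from a list x and a list y. What it extracts is the adjacency matrix between those two node types, a sub-matrix of the full adjacency matrix A.
The indices in the Cartesian product need not be contiguous, but what matters is their order: within a single node type the order is effectively contiguous.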
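The sub-adjacency idea can be sketched on a toy heterogeneous graph. The node layout (three "tweet" nodes followed by two "user" nodes) and the matrix A are assumptions made up for illustration.

```python
import numpy as np

# Hypothetical 5-node graph: nodes 0-2 are tweets, nodes 3-4 are users.
A = np.array([
    [0, 1, 0, 1, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 1, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 1, 0, 0],
])

tweet_idx = [0, 1, 2]
user_idx = [3, 4]

# rows = tweets, columns = users: the tweet-user block of A
A_tweet_user = A[np.ix_(tweet_idx, user_idx)]

# the index lists do not have to be contiguous
picked = A[np.ix_([0, 2], [1, 4])]

print(A_tweet_user)
print(picked)
```

Plain fancy indexing A[tweet_idx, user_idx] would pair the lists element by element instead of taking the full grid, which is why np.ix_ is needed here.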

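numpy files

npy files

An npy file is a numpy file. Read it with numpy.load() and convert it to a pandas.DataFrame() to inspect it.

np.load(path, allow_pickle=True) function and np.save() function

np.load reads data stored in NumPy's own binary format.

np.save writes an array to a binary file in ".npy" format.
allow_pickle: optional boolean that allows object arrays to be saved and loaded using Python pickles. pickle serializes objects before they are written to disk and deserializes them when they are read back.

# save an np.array()
np.save(save_path_i + '/all_vali_nmi.npy', np.asarray(all_vali_nmi))

# load an np.array()
df_np_part1 = np.load(p_part1, allow_pickle=True)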
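A self-contained round trip, as a sketch: the file names and sample values are hypothetical, and a temporary directory stands in for the author's save_path_i. It also shows when allow_pickle=True is actually required.

```python
import os
import tempfile

import numpy as np

scores = np.asarray([0.61, 0.64, 0.70])  # hypothetical validation scores
records = np.array([{"epoch": 1, "nmi": 0.61},
                    {"epoch": 2, "nmi": 0.64}], dtype=object)  # object array of dicts

with tempfile.TemporaryDirectory() as tmp_dir:
    # numeric arrays need no pickling
    scores_path = os.path.join(tmp_dir, "scores.npy")
    np.save(scores_path, scores)
    loaded_scores = np.load(scores_path)

    # object arrays are pickled on save and need allow_pickle=True on load
    records_path = os.path.join(tmp_dir, "records.npy")
    np.save(records_path, records)
    loaded_records = np.load(records_path, allow_pickle=True)

print(loaded_scores)
print(loaded_records[1]["epoch"])
```

Only pass allow_pickle=True for files you trust, since unpickling can execute arbitrary code.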

pandas package

pandas.DataFrame() function

Creating from a list

import numpy as np
import pandas as pd

li = [
    [1, 2, 3, 4],
    [2, 3, 4, 5]
]

# A DataFrame has two indexes: the row index (axis 0, axis=0) and the column index (axis 1, axis=1)
d1 = pd.DataFrame(data=li, index=['A', 'B'], columns=['views', 'loves', 'comments', 'transfers'])
print(d1)

Creating from a numpy array

narr = np.arange(8).reshape(2, 4)
# A DataFrame has two indexes: the row index (axis 0, axis=0) and the column index (axis 1, axis=1)
d2 = pd.DataFrame(data=narr, index=['A', 'B'], columns=['views', 'loves', 'comments', 'transfers'])
print(d2)

Creating from a dict

# named data rather than dict, to avoid shadowing the built-in
data = {
    'views': [1, 2, ],
    'loves': [2, 3, ],
    'comments': [3, 4, ]

}
d3 = pd.DataFrame(data=data, index=['粉条', "粉丝"])
print(d3)

dataframe.iterrows() function

Yields a tuple (index, row) for each row.
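A short sketch of iterrows() on a hypothetical two-row DataFrame: each iteration yields the row label and the row itself as a Series.

```python
import pandas as pd

# hypothetical data in the same spirit as the examples above
df = pd.DataFrame({"views": [1, 2], "loves": [2, 3]}, index=["A", "B"])

totals = {}
for index, row in df.iterrows():
    # index is the row label, row is a Series keyed by column name
    totals[index] = int(row["views"] + row["loves"])

print(totals)
```

iterrows() is convenient for small tables; for large ones, column-wise operations such as df["views"] + df["loves"] are much faster.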

sklearn package

sklearn.model_selection packages

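sklearn.model_selection.train_test_split() function

train_test_split() randomly splits arrays or matrices into a training subset and a test subset.

from sklearn.model_selection import train_test_split

# Split into training and test sets at a ratio of 8:2
# (df_model_cb and df_seeds are DataFrames defined elsewhere; '日剂量' is the daily-dose target column)
x = df_model_cb.drop(['日剂量'],axis=1)
y = df_model_cb['日剂量']

seed_index=df_seeds.loc[0,'seed']
tran_x, test_x, tran_y, test_y = train_test_split(x, y, test_size=0.2, random_state=seed_index)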
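Because the snippet above depends on DataFrames that are not shown, here is a self-contained sketch of the same 8:2 split on synthetic data. The arrays X and y and the seed 0 are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 hypothetical samples with 2 features
y = np.arange(10)                 # one target value per sample

# test_size=0.2 keeps 20% of the samples for testing; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

Passing the same random_state reproduces the same split, which is what the seed lookup in the original snippet is for.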

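sklearn metrics functions

sklearn.metrics.roc_curve(y_true, y_score, pos_label=None) function

Main parameters:

y_true: the true sample labels, {0, 1} or {-1, 1} by default. For any other label set, pos_label must be given explicitly. For example, with labels {1, 2} where 2 is the positive class, set pos_label=2.
y_score: the predicted score for each sample.
pos_label: the label of the positive class, given as a number or a string.

Return values

roc_curve() returns three values:

fpr: False positive rate.
tpr: True positive rate.
thresholds: the decreasing score thresholds at which fpr and tpr were computed.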

>>> import numpy as np
>>> from sklearn import metrics
>>> y = np.array([1, 1, 2, 2])
>>> scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> fpr, tpr, thresholds = metrics.roc_curve(y, scores, pos_label=2)
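>>> # Output from an older scikit-learn release; newer releases prepend an extra starting point (fpr=0, tpr=0)
>>> fpr
array([0, 0.5, 0.5, 1])
>>> tpr
array([0.5, 0.5, 1, 1])
>>> thresholds
array([0.8, 0.4, 0.35, 0.1])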

sklearn.metrics.f1_score(y_true, y_pred) function

sklearn.metrics.f1_score(y_true, y_pred, labels=None, 
                pos_label=1, average='binary', sample_weight=None,
                zero_division='warn')
'''
F1 = 2 * (precision * recall) / (precision + recall)
'''

>>> from sklearn.metrics import f1_score
>>> y_true = [0, 1, 2, 0, 1, 2]
>>> y_pred = [0, 2, 1, 0, 0, 1]
>>> f1_score(y_true, y_pred, average='macro')
0.26...
>>> f1_score(y_true, y_pred, average='micro')
0.33...
>>> f1_score(y_true, y_pred, average='weighted')
0.26...
>>> f1_score(y_true, y_pred, average=None)
array([0.8, 0. , 0. ])
>>> y_true = [0, 0, 0, 0, 0, 0]
>>> y_pred = [0, 0, 0, 0, 0, 0]
>>> f1_score(y_true, y_pred, zero_division=1)
1.0...

sklearn.metrics.normalized_mutual_info_score(labels_true, labels_pred)

Computes the normalized mutual information between two clusterings.

Mutual information measures the mutual dependence between two sets of events, that is, the part they share.

>>> from sklearn.metrics.cluster import normalized_mutual_info_score
>>> normalized_mutual_info_score([0, 0, 1, 1], [0, 0, 1, 1])
... 
1.0
>>> normalized_mutual_info_score([0, 0, 1, 1], [1, 1, 0, 0])
... 
1.0

sklearn.metrics.adjusted_mutual_info_score(labels_true, labels_pred)

Computes the adjusted mutual information between two clusterings.

>>> from sklearn.metrics.cluster import adjusted_mutual_info_score
>>> adjusted_mutual_info_score([0, 0, 1, 1], [0, 0, 1, 1])
... 
1.0
>>> adjusted_mutual_info_score([0, 0, 1, 1], [1, 1, 0, 0])
... 
1.0

sklearn.metrics.adjusted_rand_score(labels_true, labels_pred)

Computes the adjusted Rand index between two clusterings.

from sklearn import metrics
labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]
metrics.adjusted_rand_score(labels_true, labels_pred)
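The three clustering scores can be compared side by side on the labels used above. This is a sketch; the list renamed is an added illustration showing that the scores depend only on the grouping, not on the label names.

```python
from sklearn import metrics

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

ari = metrics.adjusted_rand_score(labels_true, labels_pred)
nmi = metrics.normalized_mutual_info_score(labels_true, labels_pred)
ami = metrics.adjusted_mutual_info_score(labels_true, labels_pred)

# same partition as labels_pred, with the cluster ids renamed
renamed = [1, 1, 2, 2, 0, 0]
ari_renamed = metrics.adjusted_rand_score(labels_true, renamed)

print(f"ARI={ari:.4f} NMI={nmi:.4f} AMI={ami:.4f}")
print(f"ARI after renaming clusters={ari_renamed:.4f}")
```

ARI and AMI are corrected for chance, so a random labelling scores close to 0, while NMI is not.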

sklearn.cluster packages

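sklearn.cluster.KMeans function

Reference: 3. sklearn的K-Means的使用 (using K-Means in sklearn) - hyc339408769 - 博客园

class sklearn.cluster.KMeans(n_clusters=8, *, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='deprecated', verbose=0, random_state=None, copy_x=True, n_jobs='deprecated', algorithm='auto')

K-Means parameters

n_clusters: the number of clusters k, default: 8.
init: the initialization method, default: k-means++
n_init: the number of times k-means is run with different initial centroids; the best run is kept. Default: 10
max_iter: the maximum number of iterations, default: 300
tol: the convergence threshold, default: 1e-4
n_jobs: number of threads, default=None; None means one thread and -1 uses every available thread. The signature above already marks it 'deprecated', and later scikit-learn releases remove it.
algorithm: one of "auto", "full" or "elkan". "full" is the classic K-Means (Lloyd) algorithm and "elkan" is Elkan's K-Means variant. The default "auto" picks between the two depending on whether the data is sparse: "elkan" for dense data, "full" otherwise. The default "auto" is generally the recommended choice. Later scikit-learn releases rename "full" to "lloyd" and deprecate "auto".

from sklearn.cluster import KMeans
import numpy as np
X = np.array([[1, 2], [1, 4], [1, 0],[4, 2], [4, 4], [4, 0]])
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
kmeans.labels_  # cluster label assigned to each training sample
>>> array([0, 0, 0, 1, 1, 1], dtype=int32)
kmeans.predict([[0, 0], [4, 4]])  # assign new points to the fitted clusters
>>> array([0, 1], dtype=int32)
kmeans.cluster_centers_  # coordinates of the two centroids
>>> array([[1., 2.],[4., 2.]])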

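sklearn.cluster.DBSCAN function

class sklearn.cluster.DBSCAN(eps=0.5, *, min_samples=5, metric='euclidean', metric_params=None, algorithm='auto', leaf_size=30, p=None, n_jobs=None)

DBSCAN parameters:

eps: float, default=0.5. The maximum distance between two samples for one to be considered in the neighborhood of the other.
min_samples: int, default=5. The number of samples (or total weight) in a neighborhood for a point to be considered a core point. This includes the point itself.
metric: string or callable, default='euclidean'. The metric used when computing distances between instances in the feature array.
algorithm: {'auto', 'ball_tree', 'kd_tree', 'brute'}, default='auto'. The algorithm used by the NearestNeighbors module to compute pointwise distances and find nearest neighbors.
p: float, default=None. The power of the Minkowski metric used to compute distances between points.
n_jobs: int, default=None. The number of parallel jobs to run.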

>>> from sklearn.cluster import DBSCAN
>>> import numpy as np
>>> X = np.array([[1, 2], [2, 2], [2, 3],
...               [8, 7], [8, 8], [25, 80]])
>>> clustering = DBSCAN(eps=3, min_samples=2).fit(X)
>>> clustering.labels_
array([ 0,  0,  0,  1,  1, -1])
>>> clustering
DBSCAN(eps=3, min_samples=2)

sklearn.neighbors packages

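sklearn.neighbors.KNeighborsClassifier() function

The KNN classification algorithm. The core idea: if most of the k most similar samples (the nearest neighbors in feature space) of a sample belong to one class, the sample is assigned to that class as well.

class sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform',
                                           algorithm='auto', leaf_size=30,
                                           p=2, metric='minkowski',
                                           metric_params=None,
                                           n_jobs=None, **kwargs)

Parameters:

n_neighbors: the number of neighbors to look for, 5 by default. This is the value of K.

weights: the weight function used in prediction. Possible values: 'uniform': uniform weights, all points in each neighborhood are weighted equally. 'distance': points are weighted by the inverse of their distance, so closer neighbors of a query point have more influence than neighbors further away. [callable]: a user-defined function that accepts an array of distances and returns an array of the same shape containing the weights.

algorithm: the algorithm used to compute the nearest neighbors: 'ball_tree' uses BallTree, 'kd_tree' uses KDTree, 'brute' uses a brute-force search. 'auto' tries to pick the most appropriate algorithm based on the values passed to fit. Note: fitting on sparse input overrides this setting and uses brute force.

leaf_size: the leaf size passed to BallTree or KDTree. It affects the speed of construction and queries as well as the memory needed to store the tree. The best value depends on the nature of the problem. Default 30.

p: the power parameter of the Minkowski metric. p = 1 is equivalent to manhattan_distance (l1) and p = 2 to euclidean_distance (l2). For arbitrary p, minkowski_distance (l_p) is used. Default 2.
metric: the distance metric used by the tree. The default is minkowski, which with p = 2 equals the standard Euclidean metric.

metric_params: additional keyword arguments for the metric function.
n_jobs: the number of parallel jobs.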

import numpy as np
import matplotlib.pyplot as plt
# Import the KNN classifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Load the iris dataset
# iris is a dict-like object containing data (the iris features) and target (the class labels)
iris = datasets.load_iris()

# Separate the samples from the labels
x = iris['data']
y = iris['target']
print(x.shape, y.shape)  # (150, 4) (150,)

# Split the dataset
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2)  # 8:2
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

# (120, 4) (30, 4) (120,) (30,)

# Train the model with KNeighborsClassifier; here k (n_neighbors) = 5 with Euclidean distance (metric=minkowski & p=2):

clf = KNeighborsClassifier(n_neighbors=5, p=2, metric="minkowski")
clf.fit(x_train, y_train)  # fit can be thought of as simply storing the training table

# KNeighborsClassifier()

y_predict = clf.predict(x_test)
y_predict.shape  # (30,)

acc = sum(y_predict == y_test) / y_test.shape[0]
acc
# e.g. 0.933 (the value varies with the random split)

sklearn.linear_model packages

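sklearn.linear_model.LinearRegression() function

Linear regression based on ordinary least squares.

lr = sklearn.linear_model.LinearRegression(fit_intercept=True, normalize=False, copy_X=True, n_jobs=1)

Parameters:

fit_intercept: boolean, default True. Whether to center the training data, that is, whether an intercept b is fitted; with False no intercept is used.
normalize: boolean, default False. Whether to normalize the data. This parameter belongs to older scikit-learn releases and has since been removed, so omit it on current versions.
copy_X: boolean, default True. Whether to copy X. With False the original data is overwritten (the centered and normalized data replaces the original); with True X is copied.
n_jobs: integer. The number of jobs used for the computation; -1 means use all CPUs. It only speeds things up for problems with more than one target (n_targets > 1) that are large enough.

Fitted attributes:

coef_: array of shape (n_features,) or (n_targets, n_features). The estimated coefficient of each feature, that is, the weight vector. For a multi-target problem it is a 2-D array (n_targets, n_features); for a single-target problem it is a 1-D array (n_features,).
intercept_: array. The independent term of the linear model, that is, b.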

# Imports
import numpy as np
from sklearn import datasets , linear_model
from sklearn.metrics import mean_squared_error , r2_score
from sklearn.model_selection import train_test_split
import matplotlib
import matplotlib.pyplot as plt

# Load the diabetes dataset
diabetes = datasets.load_diabetes()
# Use only feature 2 and keep X two-dimensional;
# equivalent to diabetes.data[:, 2].reshape(-1, 1)
X = diabetes.data[:,np.newaxis ,2]
y = diabetes.target
X_train , X_test , y_train ,y_test = train_test_split(X,y,test_size=0.2,random_state=42)
# Create the model with default parameters
LR = linear_model.LinearRegression()
# Fit the model
LR.fit(X_train,y_train)
# LR.predict(X_test) returns continuous predictions (this is regression, not classification)
# Print the intercept
print('intercept_:%.3f' % LR.intercept_)
# Print the coefficient (coef_ is an array with one entry per feature)
print('coef_:%.3f' % LR.coef_[0])
# Print the mean squared error; same as ((y_test - LR.predict(X_test)) ** 2).mean()
print('Mean squared error: %.3f' % mean_squared_error(y_test,LR.predict(X_test)))
# Print R squared; same as
# 1 - ((y_test - LR.predict(X_test)) ** 2).sum() / ((y_test - y_test.mean()) ** 2).sum()
print('Variance score: %.3f' % r2_score(y_test,LR.predict(X_test)))
# score() of a regressor also returns R squared (not accuracy)
print('score: %.3f' % LR.score(X_test,y_test))
plt.scatter(X_test , y_test ,color ='green')
plt.plot(X_test ,LR.predict(X_test) ,color='red',linewidth =3)
plt.show()

sklearn.manifold packages

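Manifold learning is an approach to non-linear dimensionality reduction.

Motivation: high-dimensional data is hard to visualize, while 2-D or 3-D data can easily be plotted to show its internal structure. To plot high-dimensional data, its dimension therefore has to be reduced in some way, for example with linear methods such as PCA, Independent Component Analysis or Linear Discriminant Analysis.

Manifold learning can be seen as an attempt to generalize linear frameworks like PCA so that they become sensitive to non-linear structure in the data.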

argparse package

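argparse is python's standard module for parsing command-line arguments and options; it replaces the outdated optparse module.

argparse makes it easier to store and change hyperparameters. Using it comes down to four steps:

1) import argparse # import the argparse module

2) parser = argparse.ArgumentParser() # create an ArgumentParser object

3) parser.add_argument() # add arguments and options to the parser; each add_argument call adds exactly one argument or option.

4) parser.parse_args() # finally call parse_args() to parse; on success the values are available as attributes of the returned namespace.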

demo: parsing arguments with argparse

# demo.py
import argparse

# Create the ArgumentParser object
parser = argparse.ArgumentParser()

# Add arguments
parser.add_argument('-o', '--output', action='store_true', 
    help="shows output")
# action='store_true' sets output to True when the flag is given
# type specifies the type the value is converted to
# default specifies the default value
parser.add_argument('--lr', type=float, default=3e-5, help='select the learning rate, default=3e-5') 

parser.add_argument('--batch_size', type=int, required=True, help='input batch size')  
# Parse the command line with parse_args()
args = parser.parse_args()

if args.output:
    print("This is some output")
    print(f"learning rate:{args.lr} ")

FinEvent: parsing arguments with argparse

import argparse

def args_register():
    parser = argparse.ArgumentParser()
    parser.add_argument('--n_epochs', default=50, type=int, help='Number of initial-training/maintenance-training epochs.')
    parser.add_argument('--window_size', default=3, type=int, help='Maintain the model after predicting window_size blocks.')
    parser.add_argument('--patience', default=5, type=int, help='Early stop if performance did not improve in the last patience epochs.')
    parser.add_argument('--margin', default=3, type=float, help='Margin for computing triplet losses')
    parser.add_argument('--lr', default=1e-3, type=float, help='Learning rate')
    
    parser.add_argument('--batch_size', default=100, type=int, help='Batch size (number of nodes sampled to compute triplet loss in each batch)')
    parser.add_argument('--hidden_dim', default=128, type=int, help='Hidden dimension')
    parser.add_argument('--out_dim', default=64, type=int, help='Output dimension of tweet representation')
    parser.add_argument('--heads', default=4, type=int, help='Number of heads used in GAT')
    parser.add_argument('--validation_percent', default=0.2, type=float, help='Percentage of validation nodes(tweets)')
    parser.add_argument('--use_hardest_neg', dest='use_hardest_neg', default=False, action='store_true', 
                        help='If true, use hardest negative messages to form triplets. Otherwise use random ones')
    parser.add_argument('--is_shared', default=False)
    parser.add_argument('--inter_opt', default='cat_w_avg')
    parser.add_argument('--is_initial', default=False)
    parser.add_argument('--sampler', default='RL_sampler')
    parser.add_argument('--cluster_type', default='kmeans', help='Types of clustering algorithms') # DBSCAN
    
    # RL-0
    parser.add_argument('--threshold_start0', default=[[0.2],[0.2],[0.2]], type=float, 
                        help='The initial value of the filter threshold for state1 or state3')
    parser.add_argument('--RL_step0', default=0.02, type=float, help='The step size of RL for state1 or state3')
    parser.add_argument('--RL-start0', default=0, type=int, help='The starting epoch of RL for state1 or state3')
    
    # RL-1
    parser.add_argument('--eps_start', default=0.001, type=float, help='The initial value of the eps for state2')
    parser.add_argument('--eps_step', default=0.02, type=float, help='The step size of eps for state2')
    parser.add_argument('--min_Pts_start', default=2, type=int, help='The initial value of the min_Pts for state2')
    parser.add_argument('--min_Pts_step', default=1, type=int, help='The step size of min_Pts for state2')
    
    # other arguments
    parser.add_argument('--use_cuda', dest='use_cuda', default=True, action='store_true', help='Use cuda')
    parser.add_argument('--data_path', default='./incremental_0502/', type=str, help='Path of features, labels and edges')
    # format: './incremental_0808/incremental_graphs_0808/embeddings_XXXX'
    parser.add_argument('--mask_path', default=None, type=str, help='File path that contains the training, validation and test masks')
    # format: './incremental_0808/incremental_graphs_0808/embeddings_XXXX'
    parser.add_argument('--log_interval', default=10, type=int, help='Log interval')
    
    args = parser.parse_args()  # parse the arguments
    
    return args

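argparse error (1): SystemExit: 2

The error comes from the call args = parser.parse_args(). It typically appears in Jupyter/IPython, where sys.argv holds the kernel's own arguments, which the parser does not recognize.

Rewrite it as: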

args = parser.parse_args(args=[])
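Passing an explicit list to parse_args() bypasses sys.argv entirely, which is why the fix works. The sketch below uses a hypothetical two-option parser (build_parser is not from FinEvent) and also shows parse_known_args(), an alternative that collects unrecognized arguments instead of exiting.

```python
import argparse


def build_parser():
    # hypothetical parser with two options, for illustration only
    parser = argparse.ArgumentParser()
    parser.add_argument('--lr', type=float, default=1e-3, help='Learning rate')
    parser.add_argument('--batch_size', type=int, default=100, help='Batch size')
    return parser


parser = build_parser()

# an empty list means "no command-line arguments": every option takes its default
defaults = parser.parse_args(args=[])

# options can still be overridden by passing them as a list of strings
overridden = parser.parse_args(args=['--lr', '0.01', '--batch_size', '32'])

# parse_known_args returns the unrecognized arguments instead of raising SystemExit
known, unknown = parser.parse_known_args(args=['--lr', '0.5', '-f', 'kernel.json'])

print(defaults.lr, overridden.batch_size, unknown)
```

Note that parse_args(args=[]) only works when no argument is marked required=True, since required arguments would then be missing.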

scipy packages

scipy.io packages

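scipy.io.loadmat('filepath') function

Reads the .mat file at 'filepath'; the return value is a dict.

scipy.io.loadmat("FilePath")

scipy.io.savemat() function

io.savemat('SavedData.mat',{'key1':data1, 'key2':data2})

Saves the ndarrays data1 and data2 in SavedData.mat under the variable names key1 and key2.

Because the path is relative, SavedData.mat is written to the current working directory (which is the folder of the .py file when the script is run from there).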
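A round trip through savemat() and loadmat() can be sketched as follows. The arrays and the temporary directory are placeholders; note how loadmat returns extra metadata keys and reads a 1-D array back as a 2-D row.

```python
import os
import tempfile

import numpy as np
from scipy import io

data1 = np.arange(6).reshape(2, 3)  # hypothetical 2-D array
data2 = np.array([1.5, 2.5])        # hypothetical 1-D array

with tempfile.TemporaryDirectory() as tmp_dir:
    mat_path = os.path.join(tmp_dir, 'SavedData.mat')
    io.savemat(mat_path, {'key1': data1, 'key2': data2})
    loaded = io.loadmat(mat_path)  # a dict: variable name -> array

print(sorted(loaded.keys()))  # includes metadata such as '__header__'
print(loaded['key1'])
print(loaded['key2'].shape)   # MATLAB has no 1-D arrays, so this comes back 2-D
```

If the original 1-D shape matters, call np.squeeze() or ravel() on the loaded array.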

scipy.sparse package

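There is nothing mysterious about sparse matrices: they speed up computation and save memory. For example, an ordinary 4000*4000 matrix means working through 16 million entries, which can leave a computer grinding away for a long time without finishing, while the sparse version may only need around ten thousand operations.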

sparse.coo_matrix() function

coo_matrix is the simplest sparse storage format: it stores the non-zero elements of the matrix as (row, col, data) triplets.

>>> import numpy as np
>>> from scipy.sparse import coo_matrix

>>> _row  = np.array([0, 3, 1, 0])
>>> _col  = np.array([0, 3, 1, 2])
>>> _data = np.array([4, 5, 7, 9])
>>> coo = coo_matrix((_data, (_row, _col)), shape=(4, 4), dtype=int)  # np.int was removed in NumPy 1.24; use int
>>> coo.todense()  # todense() converts to a dense matrix (numpy.matrix)
>>> coo.toarray()  # toarray() converts to a dense array (numpy.ndarray)
array([[4, 0, 9, 0],
       [0, 7, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 5]])

row and col attributes of a coo matrix

row- coo format row index array of the matrix

col- coo format column index array of the matrix

.tocoo() function

Converts this matrix to a coo sparse matrix.

import scipy.sparse as sp
import numpy as np
b = np.array([[0, 1, 2], [4, 2, 0], [0, 1, 0]])
# Once the matrix is stored in sparse form, it can still be used directly in later operations
a = sp.csr_matrix(b, dtype=np.float32)     
c = a.tocoo().astype(np.float32)

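sparse.csr_matrix() function

csr_matrix requires the matrix elements to be stored row by row; within a row, the elements may be stored in any order.

CSR stores two-dimensional matrices only (N-dimensional tensors are not supported). Compared with the coo format it uses storage more efficiently and supports faster arithmetic. There is no CUDA support.

Compressed sparse row layout:

data: the values of the non-zero elements, stored row by row
indices: the column index of each value in data
indptr: the position in data where each row starts; a single pointer per row is enough, since row i occupies data[indptr[i]:indptr[i+1]]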

csr_matrix construction (1): from a 2-D array or matrix (conversion from another matrix)

import numpy as np
from scipy.sparse import csr_matrix

# Create a sparse matrix from a 2-D matrix or array
a = np.zeros((3, 4))
a[1, 2] = 12
a[2, 2] = 22

csr_matrix(a).toarray()

'--------------------'
array([[ 0.,  0.,  0.,  0.],
       [ 0.,  0., 12.,  0.],
       [ 0.,  0., 22.,  0.]])

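csr_matrix construction (2): from the three CSR arrays (data, indices, indptr)

import numpy as np
from scipy.sparse import csr_matrix

# Unlike coo_matrix's (data, (row, col)) triplets, this form takes the CSR arrays directly:
# row i holds data[indptr[i]:indptr[i+1]] in the columns indices[indptr[i]:indptr[i+1]]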
indptr = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])

csr_matrix((data, indices, indptr), shape=(3, 3)).toarray()

'------------------------------------'
array([[1, 0, 2],
       [0, 0, 3],
       [4, 5, 6]])

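csr_matrix construction (3): empty csr_matrix + assignment

x = csr_matrix((4, 3))
# Assign values to positions of the sparse matrix
# (this works, but changing the sparsity structure of a csr_matrix is slow; lil_matrix is better suited to it)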
x[1, 2] = 12
x[3, 1] = 23

x.toarray()

array([[ 0.,  0.,  0.],
       [ 0.,  0., 12.],
       [ 0.,  0.,  0.],
       [ 0., 23.,  0.]])
'------------------------'
x.todense()
matrix([[ 0.,  0.,  0.],
        [ 0.,  0., 12.],
        [ 0.,  0.,  0.],
        [ 0., 23.,  0.]])

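sparse.lil_matrix() function

lil_matrix stands for List of Lists format, a row-based linked-list sparse matrix. It stores the matrix in two nested lists:

data: the values of the non-zero elements in each row
rows: the column indices of the non-zero elements in each row

This format is well suited to adding elements one at a time and gives fast access to row data.

lil_matrix can be initialized in the same three ways as coo_matrix: from a dense matrix, by converting another sparse matrix, or as an empty matrix.

Advantages: row slicing is efficient, and conversion to the other sparse formats is efficient via tobsr(), tocsr(), tocsc(), todia(), todok() and tolil().

Disadvantages: addition is slow, column slicing is slow, and matrix multiplication is slow.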
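A common pattern that follows from these trade-offs is to fill a lil_matrix element by element and then convert it to CSR for arithmetic. A sketch on a hypothetical 4-node path graph:

```python
import numpy as np
from scipy.sparse import lil_matrix

# hypothetical undirected path graph 0 - 1 - 2 - 3
edges = [(0, 1), (1, 2), (2, 3)]

adj = lil_matrix((4, 4), dtype=np.float32)
for u, v in edges:
    # element-wise assignment is cheap in LIL format
    adj[u, v] = 1.0
    adj[v, u] = 1.0

# convert once the structure is complete; CSR is better for arithmetic
adj_csr = adj.tocsr()
degrees = np.asarray(adj_csr.sum(axis=1)).ravel()

print(adj.rows[1])  # column indices of the non-zeros in row 1
print(degrees)      # node degrees computed from the CSR matrix
```

Doing the same assignments directly on a csr_matrix works but triggers a SparseEfficiencyWarning, because each insertion has to rebuild the compressed arrays.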

scipy.sparse operation

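todense() method of a scipy.sparse matrix -> numpy matrix

A dense numpy matrix can be converted to a sparse matrix with scipy.sparse.csr_matrix or a similar class to save memory and speed up computation.
todense() converts a csr or other sparse matrix back to dense numpy form, for example to store it in an .npz file.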

scipy.sparse.vstack() function

Stacks matrices vertically, row-wise (the number of rows grows); the number of columns must match.

>>> from scipy.sparse import coo_matrix, vstack
>>> A = coo_matrix([[1, 2], [3, 4]])
>>> B = coo_matrix([[5, 6]])
>>> vstack([A, B]).toarray()
array([[1, 2],
       [3, 4],
       [5, 6]])

scipy.sparse.isspmatrix_coo(x) function

scipy.sparse.isspmatrix_coo(x) checks whether x is a sparse matrix in coo format.

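scipy.sparse.diags() function

Constructs a sparse matrix from diagonals.

from scipy.sparse import diags

>>> diagonals = [[1, 2, 3, 4], [1, 2, 3], [1, 2]]  
# The second argument of diags gives the offset of each diagonal:
# 0: no offset, i.e. the main diagonal (0,0) (1,1) (2,2) (3,3) ...
# k (positive): the diagonal k positions above the main diagonal.
# -k (negative): the diagonal k positions below the main diagonal.

With that in mind, look at the output below.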

>>> diags(diagonals, [0, -1, 2]).toarray()
array([[1, 0, 1, 0],
       [1, 2, 0, 2],
       [0, 2, 3, 0],
       [0, 0, 3, 4]])

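The first element of the first argument is [1,2,3,4] and the matching offset in the second argument is 0, so 1, 2, 3, 4 go to positions (0,0) (1,1) (2,2) (3,3) of the result.
The second element of the first argument is [1,2,3] and the matching offset is -1, so 1, 2, 3 go to positions (1,0) (2,1) (3,2).
The third element of the first argument is [1,2] and the matching offset is 2, so 1, 2 go to positions (0,2) (1,3); everything else is filled with 0. When no shape is given, diags builds the smallest square matrix that fits all the diagonals, here a 4*4 matrix.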

Broadcasting of scalars is supported (but shape needs to be
specified):

>>> diags([1, -2, 1], [-1, 0, 1], shape=(4, 4)).toarray()
array([[-2.,  1.,  0.,  0.],
       [ 1., -2.,  1.,  0.],
       [ 0.,  1., -2.,  1.],
       [ 0.,  0.,  1., -2.]])

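In this case the first argument is a 1-D list of scalars and the second argument is also a list, so each element of the first argument is paired with the matching element of the second, as follows.
The first element of the first argument is 1 and the matching offset is -1. The shape is fixed to 4*4 and the diagonal starts at (1,0), so every element on that diagonal is 1: positions (1,0) (2,1) (3,2).
The second element of the first argument is -2 and the matching offset is 0, so -2 goes to positions (0,0) (1,1) (2,2) (3,3).
The third element of the first argument is 1 and the matching offset is 1, so 1 goes to positions (0,1) (1,2) (2,3); everything else is filled with 0.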

If only one diagonal is wanted (as in `numpy.diag`), the following
works as well:

>>> diags([1, 2, 3], 1).toarray()
array([[ 0.,  1.,  0.,  0.],
       [ 0.,  0.,  2.,  0.],
       [ 0.,  0.,  0.,  3.],
       [ 0.,  0.,  0.,  0.]])

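In this case the first argument is a 1-D list and the second argument is a single number, so the whole list is treated as one diagonal at that offset.

The first argument [1,2,3] is paired with the offset 1, so 1, 2, 3 go to positions (0,1) (1,2) (2,3); everything else is filled with 0.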

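scipy.sparse.linalg.eigsh(A, k) function

scipy.sparse.linalg.eigsh(A, k) computes eigenvalues and eigenvectors of the matrix A.

Here A is a real symmetric or complex Hermitian matrix, and k is the number of eigenvalues and eigenvectors wanted.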
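A minimal sketch of eigsh() on a diagonal matrix, chosen because its eigenvalues are simply the diagonal entries and the result is easy to check. The matrix and the choice which='LA' (largest algebraic eigenvalues) are illustrative assumptions.

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import eigsh

# symmetric 6x6 sparse matrix whose eigenvalues are 1, 2, ..., 6
A = diags([1.0, 2.0, 3.0, 4.0, 5.0, 6.0]).tocsr()

# ask for the k=2 largest (algebraic) eigenvalues and their eigenvectors
vals, vecs = eigsh(A, k=2, which='LA')

print(np.sort(vals))  # the two largest eigenvalues
print(vecs.shape)     # one column per eigenvector

# each column v satisfies A @ v = lambda * v
residual = A @ vecs - vecs * vals
print(np.abs(residual).max())
```

k must be smaller than the size of the matrix; for all eigenvalues of a small dense matrix, numpy.linalg.eigh is the simpler tool.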

scipy.sparse.save_npz() and load_npz(): saving and loading .npz files

Save and load a scipy sparse matrix in .npz format.

For dense arrays, numpy.save() writes a .npy file instead.
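A round trip through save_npz() and load_npz(), sketched with a small made-up matrix and a temporary directory in place of a real path:

```python
import os
import tempfile

import numpy as np
from scipy import sparse

mat = sparse.csr_matrix(np.array([[0, 1, 0],
                                  [2, 0, 3]]))  # hypothetical sparse matrix

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'adj.npz')
    sparse.save_npz(path, mat)        # stores the sparse structure, not a dense copy
    restored = sparse.load_npz(path)  # comes back in the same sparse format

print(restored.format)
print(restored.toarray())
```

Saving the sparse matrix directly avoids the memory cost of calling todense() first.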

Scipy.sparse errors

scipy.sparse error(1): AttributeError: module 'scipy.sparse' has no attribute 'coo_array' #4378

Reference: AttributeError: module 'scipy.sparse' has no attribute 'coo_array' · Issue #4378 · pyg-team/pytorch_geometric · GitHub

Cause: there is a version conflict between networkx and scipy.

Either set the scipy and networkx versions by hand,
or upgrade both automatically to matching versions

pip install --upgrade scipy networkx

Spacy package

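spaCy is an industrial-strength natural language processing library for python. Its features include part-of-speech tagging, syntactic parsing, named entity recognition, word vectors and lemmatization.

For English, spaCy provides four pretrained language models for predicting linguistic features: en_core_web_sm, en_core_web_md, en_core_web_lg and en_core_web_trf.

Text processing in spaCy is modular: when nlp is called on a text, spaCy tokenizes it and runs it through the pipeline to produce a Doc object.

Components included in en_core_web_lg: tok2vec, tagger, parser, senter, attribute_ruler, lemmatizer, ner
Installing en_core_web_lg

python -m spacy download en_core_web_lg

Using en_core_web_lg: tokenizer, pos, dependency, NER, lemmatizer, sentence segmentation.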

networkx package

networkx is a python package for graph theory and complex network modeling, with many common graph and complex network analysis algorithms built in.

import networkx as nx

networkx graph creation (1): empty graph

import networkx as nx
from scipy import sparse

# Create an undirected graph
G = nx.Graph()

# Create a directed graph
G = nx.DiGraph()

Adding nodes in networkx

# Add a single node
G.add_node('t_123')
G.add_node('t_456')

# Add nodes from a list
G.add_nodes_from(['t_123', 't_456', 't_789'])

Adding edges in networkx

# Add a single edge
G.add_edge('t_123', 't_456')

# Add edges from a list
G.add_edges_from([['t_123', 't_456'], ['t_123', 't_789'], ['t_456', 't_789']])

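Viewing and setting node attributes in networkx

# View all nodes
G.nodes

# View an attribute of a single node i
G.nodes[i]['weight']

# Set an attribute of a single node i
G.nodes[i]['weight'] = True
G.nodes[i]['weight'] = 0.1

Viewing and setting edge attributes in networkx

# View all edges
G.edges

# View an attribute of a single edge (u, v)
G.edges[u, v]['weight']

# Set an attribute of a single edge (u, v)
G.edges[u, v]['weight'] = True
G.edges[u, v]['weight'] = 0.1

Drawing a graph in networkx

# Draw the graph with draw
nx.draw(G, with_labels=True)  # with_labels shows the node labels

# Draw the graph with draw_networkx
nx.draw_networkx(G, with_labels=True)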

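networkx.to_numpy_matrix(G) function: returns the adjacency matrix A of the graph

Returns the graph adjacency matrix as a numpy matrix. to_numpy_matrix was removed in NetworkX 3.0; on current versions nx.to_numpy_array(G) returns the same values as an ndarray.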

G = nx.Graph()
G.add_nodes_from(['t_123', 't_456', 't_789'])
# G.add_node('t_456')
G.add_edges_from([['t_123', 't_456'],['t_123', 't_789'], ['t_456', 't_789']])
G.nodes['t_123']['tweet_id'] = True
'------------------------------------------'
nx.draw_networkx(G, with_labels=True)
'-------------------------------------------'
nx.to_numpy_matrix(G)

matrix([[0., 1., 1.],
        [1., 0., 1.],
        [1., 1., 0.]])

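networkx graph creation (2): conversion from a dict

The preferred way to build a networkx graph from data in another format is to pass it to a graph constructor.

The constructor calls to_networkx_graph(), which builds a networkx graph from a known data structure.

# Create a graph with a single edge from a dict
d = {0: {1: 1}}  # dict-of-dicts single edge (0,1)
G = nx.Graph(d)

Dict data:

to_dict_of_dicts(G[, nodelist, edge_data]): returns the adjacency representation of the graph as a dict of dicts
from_dict_of_dicts(d[, create_using]): returns a graph built from a dict of dicts.

List data:

networkx.to_dict_of_lists(G[, nodelist]): returns the adjacency representation of the graph as a dict of lists.
networkx.from_dict_of_lists(d[, create_using]): returns a graph built from a dict of lists.
to_edgelist(G[, nodelist]): returns the list of edges in the graph.
from_edgelist(edgelist[, create_using]): returns a graph built from a list of edges.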

networkx.adjacency_matrix()函数

返回Graph的邻接矩阵

error(1): nx.adjacency_matrix()计算邻接矩阵与真实结果不一致

nx.adjacency_matrix计算邻接矩阵与真实结果不一致：解决办法记录_小猪上吊ing的博客-CSDN博客_nx.adjacency_matrix

问题描述：根据edgelist计算的邻接矩阵，与networkx.adjacency(g)返回的邻接矩阵不一致

原因分析：新加边add_edges会自动加新节点。

解决方案：在g.add_edges_from(edgelist)操作之前，先把edgelist中的节点抽取出来按顺序排好，用操作g.add_nodes_from()把节点统一添加进图g中。

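import itertools

import networkx as nx
import scipy.sparse as sp

edgelist = [
    (0, 1),
    (1, 3),
    (2, 4),
    (1, 5),
    (1, 3),
    (5, 5),
    (1, 3),
]
# nx.MultiGraph() keeps parallel edges and the adjacency matrix sums them,
# so (1, 3), which appears three times, gets the weight 3
g = nx.MultiGraph()  # undirected multigraph

# Add the node ids in sorted order first, otherwise the adjacency matrix comes out permuted
nodeset = sorted(set(itertools.chain(*edgelist)))
g.add_nodes_from(nodeset)

g.add_edges_from(edgelist)
print(g.nodes())
adj = sp.lil_matrix(nx.adjacency_matrix(g))
print(adj.todense())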
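The effect of node order can also be seen without networkx. The following is a minimal pure-Python sketch, not the networkx implementation: the helper adjacency() and the three-edge list are made up for illustration. It builds the same matrix twice, once with nodes in order of first appearance (what add_edges_from alone would give) and once in sorted order.

```python
def adjacency(nodes, edges):
    """Dense adjacency matrix whose rows and columns follow the order of `nodes`."""
    index = {n: k for k, n in enumerate(nodes)}
    size = len(nodes)
    mat = [[0] * size for _ in range(size)]
    for u, v in edges:
        mat[index[u]][index[v]] += 1
        if u != v:  # undirected: mirror the entry, but count a self-loop once
            mat[index[v]][index[u]] += 1
    return mat

edges = [(0, 1), (1, 3), (2, 4)]  # hypothetical edge list

# Order of first appearance in the edge list: 0, 1, 3, 2, 4
first_seen = list(dict.fromkeys(n for e in edges for n in e))
# Sorted order, which is what the fix enforces: 0, 1, 2, 3, 4
sorted_nodes = sorted(first_seen)

a_first = adjacency(first_seen, edges)
a_sorted = adjacency(sorted_nodes, edges)

print(first_seen)
print(a_first == a_sorted)  # the two matrices describe the same graph but differ entry by entry
```

Both matrices are valid; they only disagree on which row belongs to which node, which is exactly the mismatch described above.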

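networkx.get_node_attributes(G, 'tweet_id') function

Returns a dictionary that maps every node carrying the 'tweet_id' attribute to the value of that attribute.

Call .keys() on the result to get the list of tweet_id nodes.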

G = nx.Graph()
tweet_list = ['t_123', 't_456', 't_789']
G.add_nodes_from(tweet_list)
for i in tweet_list:
    G.nodes[i]['tweet_id'] = True
user_list = ['Sydney', 'Beijing', 'Melbourne']
entity_list = ['me', 'bing', 'zhen']
G.add_nodes_from(user_list)
G.add_nodes_from(entity_list)

for i in entity_list:
    G.nodes[i]['entity_id'] = True
for i in user_list:
    G.nodes[i]['user_id'] = True
G.add_edges_from([['t_123', 't_456'],['t_123', 't_789'], ['t_123', 'Sydney'], ['t_456', 'Melbourne'], ['t_456', 'Beijing']])
G.add_edges_from([['t_123', 'me'], ['t_123', 'bing'], ['t_789', 'zhen']])
G.nodes['t_123']['tweet_id'] = True

nx.get_node_attributes(G, 'tweet_id')
'''
{'t_123': True, 't_456': True, 't_789': True}
'''
nx.get_node_attributes(G, 'tweet_id').keys()
'''
dict_keys(['t_123', 't_456', 't_789'])
'''

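dgl package

dgl builds graphs, which can also be thought of as the basis of graph neural networks.

DGL (Deep Graph Library) is a Python package that provides tools for deep learning on graphs. Graph neural networks written directly in TensorFlow, PyTorch or MXNet tend to train slowly and run out of memory, and DGL aims to build them more efficiently.

Why do TensorFlow, PyTorch, MXNet and Keras run into these problems when building graph neural network models?

The first reason is that, unlike ordinary deep neural networks, graph neural networks compute by message passing.

Message passing works as follows:

First, during training every node sends its current message to its neighbors and, at the same time, receives the messages sent by its neighbors.
Then, once a node has collected the messages from its neighbors, it aggregates them to compute its new representation.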
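The two steps above can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions (a toy three-node graph, one scalar feature per node, mean aggregation); it is not DGL's API, and the names neighbors, features and message_passing_step are made up for the example.

```python
# Undirected toy graph as an adjacency list (hypothetical data, no isolated nodes)
neighbors = {0: [1, 2], 1: [0], 2: [0]}
# One scalar feature per node (hypothetical values)
features = {0: 1.0, 1: 2.0, 2: 6.0}

def message_passing_step(neighbors, features):
    """One round: every node receives its neighbors' features and averages them."""
    new_features = {}
    for node, nbrs in neighbors.items():
        messages = [features[n] for n in nbrs]              # step 1: receive messages
        new_features[node] = sum(messages) / len(messages)  # step 2: aggregate (mean)
    return new_features

updated = message_passing_step(neighbors, features)
print(updated)  # {0: 4.0, 1: 1.0, 2: 1.0}
```

Each node has a different number of incoming messages, which is the irregularity that the next paragraph refers to.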

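The second reason is that graph computation does not fit the execution model of deep learning frameworks. These frameworks are built on tensor computation, but a graph neural network cannot be expressed directly as one complete tensor and has to be zero-padded by hand. As a result, deep learning frameworks tend to slow down and run out of memory when computing graph neural networks.

Graph data accepted by dgl: graphs built with networkx, scipy sparse matrices, and DGLGraph.

-> These are converted into a dgl graph, on which the graph neural network is built.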

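import dgl
import networkx as nx

g_nx = nx.petersen_graph()  # create a Petersen graph
g_dgl = dgl.DGLGraph(g_nx)  # convert it to a DGLGraph (older DGL API; newer releases provide dgl.from_networkx)
g_dgl = dgl.DGLGraph()  # create an empty graph directly

# Add nodes
g_dgl.add_nodes(10)  # add 10 nodes

# Add a single edge
g_dgl.add_edge(0, 1)

# Add several edges
g_dgl.add_edges([1, 2, 3], [3, 4, 5])  # source and destination node lists, three edges: 1->3, 2->4, 3->5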

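pickle package

Python's built-in serialization package.

In machine learning we often need to store a trained model so that it can be loaded directly at prediction time instead of being retrained, which saves a lot of time.

Python's pickle module handles this well: it serializes an object, saves it to disk, and reads it back when needed. Most Python objects can be pickled; exceptions include lambdas and open file handles.

1) pickle can store data of almost any Python type, and is efficient when state is saved and restored frequently.

2) Plain file I/O can only read and write strings (or raw bytes), so it suits small cases where reads are infrequent and the data format is simple.

3) -> open() returns a file object, and it is this object that the other functions (file methods, pickle functions) write to or read from.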

import pickle
import pandas as pd

data = pd.DataFrame()
# Write the data
with open('D:/raw_data', 'wb') as pkl_file:
    pickle.dump(data, pkl_file, pickle.HIGHEST_PROTOCOL)

# Read the data back
with open('D:/raw_data', 'rb') as pkl_file_rb:
    new_data = pickle.load(pkl_file_rb)

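pickle.dump() function

pickle.dump(obj, file[, protocol])

Purpose: serializes obj and writes it to the already opened file.

Parameters:

obj: the object to serialize.
file: a file object opened for binary writing ('wb'), not a file name.
protocol: the serialization protocol. If omitted, pickle.DEFAULT_PROTOCOL is used (0 was the default only in Python 2). A negative value or HIGHEST_PROTOCOL selects the highest available protocol version.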
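A small sketch of how the protocol argument behaves, assuming Python 3. The helper dump_to_bytes is hypothetical and only wraps pickle.dump() around an in-memory io.BytesIO buffer so that no file on disk is needed.

```python
import io
import pickle

obj = {'a': [1, 2, 3]}

def dump_to_bytes(obj, protocol=None):
    """Hypothetical helper: pickle.dump() into an in-memory file and return the bytes."""
    buf = io.BytesIO()  # stands in for a file opened with 'wb'
    pickle.dump(obj, buf, protocol)
    return buf.getvalue()

default = dump_to_bytes(obj)
highest = dump_to_bytes(obj, pickle.HIGHEST_PROTOCOL)
negative = dump_to_bytes(obj, -1)
proto2 = dump_to_bytes(obj, 2)

# A negative protocol is shorthand for the highest protocol available
print(highest == negative)
# From protocol 2 on, the stream starts with byte 0x80 followed by the protocol number
print(proto2[0] == 0x80, proto2[1])
# Omitting the argument uses pickle.DEFAULT_PROTOCOL
print(default[1] == pickle.DEFAULT_PROTOCOL)
# load() detects the protocol by itself, so no protocol argument is needed when reading
print(pickle.load(io.BytesIO(proto2)) == obj)
```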

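pickle.dumps() function

pickle.dumps(obj[, protocol])

Purpose: serializes obj into a bytes object instead of writing it to a file.

Reference: "pickle: Python object serialization", Python 3.11.0 documentation

Parameters:

obj: the object to serialize.
protocol: if omitted, pickle.DEFAULT_PROTOCOL is used (0 was the default only in Python 2).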

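pickle.load() function

pickle.load(file)

Purpose: reads the pickled data from file and deserializes it back into an object.

Parameters:

file: a file object opened for binary reading ('rb'), not a file name.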

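pickle.loads() function

pickle.loads(data)

Purpose: reconstructs the original obj from a bytes-like object.

Parameters:

data: the bytes-like object produced by dumps(), not a file name.

Notice: compared with dumps() and loads(), dump() and load() have one more capability: dump() can serialize several objects one after another into the same file, and subsequent calls to load() deserialize them in the same order.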

#coding:utf-8
__author__ = 'MsLili'
# Examples of the main functions of the pickle module
import pickle
dataList = [[1, 1, 'yes'],
            [1, 1, 'yes'],
            [1, 0, 'no'],
            [0, 1, 'no'],
            [0, 1, 'no']]
dataDic = { 0: [1, 2, 3, 4],
            1: ('a', 'b'),
            2: {'c':'yes','d':'no'}}
 
# Use dump() to serialize the data to a file
fw = open('dataFile.txt','wb')
# Pickle the list using the highest protocol available.
pickle.dump(dataList, fw, -1)
# Pickle the dictionary using the default protocol (protocol 0 only in Python 2).
pickle.dump(dataDic, fw)
fw.close()
 
# Use load() to deserialize the data from the file
fr = open('dataFile.txt','rb')
data1 = pickle.load(fr)
print(data1)
data2 = pickle.load(fr)
print(data2)
fr.close()
 
# Examples of dumps() and loads()
p = pickle.dumps(dataList)
print( pickle.loads(p) )
p = pickle.dumps(dataDic)
print( pickle.loads(p) )

typing package

typing.Any

A special type that is compatible with every type. Any parameter or return value without a type annotation is implicitly treated as Any.

def add(a):
  return a+1

'---------------'

from typing import Any

def add(a: Any) -> Any:
  return a+1

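typing.Tuple

typing.Tuple is the generic version of the built-in tuple. It is followed by square brackets that declare, in order, the types of the tuple's elements.

from typing import Tuple

person: Tuple[str, int, float] = ('Mike', 22, 1.75)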

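typing.Dict

typing.Dict is the generic version of the built-in dict. Dict is recommended for annotating return types; for arguments, an abstract type such as Mapping is preferred.

from typing import Dict, Mapping

def size(rect: Mapping[str, int]) -> Dict[str, int]:
    return {'width': rect['width'] + 100, 'height': rect['height'] + 100}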

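itertools package

itertools.combinations(iterable, r) function

Returns the subsequences of length r built from the elements of the input iterable. Elements are treated as distinct by position, not by value, so repeated values produce repeated combinations.

from itertools import combinations
test_data = ['a', 'a', 'a', 'b']
for i in combinations(test_data, 2):
    print(i)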


'''
('a', 'a')
('a', 'a')
('a', 'b')
('a', 'a')
('a', 'b')
('a', 'b')
'''

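json package

JSON (JavaScript Object Notation) is a lightweight data-interchange format that is easy for humans to read and write.

json.dumps() function

json.dumps() converts Python data such as a dict into a JSON-formatted str. This is needed because file.write() only accepts strings, so writing a dict to a JSON file directly raises an error.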

import json
 
name_emb = {'a':'1111','b':'2222','c':'3333','d':'4444'} 
 
jsObj = json.dumps(name_emb)    
 
print(name_emb)
print(jsObj)
 
print(type(name_emb))
print(type(jsObj))

------------------------------
{'a': '1111', 'b': '2222', 'c': '3333', 'd': '4444'}
{"a": "1111", "b": "2222", "c": "3333", "d": "4444"}
<class 'dict'>
<class 'str'>

json.loads() function

json.loads() converts a JSON-formatted str back into Python data such as a dict.

import json
 
name_emb = {'a':'1111','b':'2222','c':'3333','d':'4444'} 
 
jsDumps = json.dumps(name_emb)    
 
jsLoads = json.loads(jsDumps) 
 
print(name_emb)
print(jsDumps)
print(jsLoads)
 
print(type(name_emb))
print(type(jsDumps))
print(type(jsLoads))     
-------------------------------------
{'a': '1111', 'b': '2222', 'c': '3333', 'd': '4444'}
{"a": "1111", "b": "2222", "c": "3333", "d": "4444"}
{'a': '1111', 'b': '2222', 'c': '3333', 'd': '4444'}
<class 'dict'>
<class 'str'>
<class 'dict'>

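json.dump(dict, file, indent=2) function

json.dump() converts a dict into a JSON string and writes it to a JSON file.

dict: the data to write.
file: a file object opened for writing, not a file path.
indent: the indentation width, used to pretty-print the output.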

import json

name_emb = {'a': '1111', 'b': '2222', 'c': '3333', 'd': '4444'}

emb_filename = '/home/cqh/faceData/emb_json.json'

# solution 1: serialize to a string, then write it
jsObj = json.dumps(name_emb)
with open(emb_filename, "w") as f:  # the with block closes the file automatically
    f.write(jsObj)

# solution 2: let json.dump() write to the file object
with open(emb_filename, "w") as f:
    json.dump(name_emb, f)
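To show what the indent parameter does, here is a short sketch that writes to an in-memory io.StringIO buffer instead of a real file (the buffer is an assumption made only so the example runs anywhere; any file object opened with "w" behaves the same way).

```python
import io
import json

name_emb = {'a': '1111', 'b': '2222'}

buf = io.StringIO()  # in-memory stand-in for an open text file
json.dump(name_emb, buf, indent=2)
text = buf.getvalue()
print(text)
# {
#   "a": "1111",
#   "b": "2222"
# }
```

Without indent, the same data is written on a single line, which is more compact but harder to read.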

json.load() function

json.load() reads data from a JSON file.

import json  
 
emb_filename = '/home/cqh/faceData/emb_json.json'

with open(emb_filename) as f:  # the with block closes the file after reading
    jsObj = json.load(f)
 
print(jsObj)
print(type(jsObj))
 
for key in jsObj.keys():
    print('key: %s   value: %s' % (key,jsObj.get(key)))
--------------------------------------------
{'a': '1111', 'b': '2222', 'c': '3333', 'd': '4444'}
<class 'dict'>
key: a   value: 1111
key: b   value: 2222
key: c   value: 3333
key: d   value: 4444

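random package

random.randint, random, uniform, choice, randrange

random.randint(a, b) returns a random integer N with a <= N <= b; both endpoints are included.

The signature randint(low, high=None, size=None) with the half-open range [low, high) belongs to numpy.random.randint, not to the standard-library random module.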

import random

print( random.randint(1,10) )        # a random integer from 1 to 10, both included
print( random.random() )             # a random float in [0, 1)
print( random.uniform(1.1,5.4) )     # a random float between 1.1 and 5.4; the bounds need not be integers
print( random.choice('tomorrow') )   # one random element from a sequence
print( random.randrange(1,100,2) )   # a random odd integer from 1 to 99 (start 1, stop 100 excluded, step 2)

a=[1,3,5,6,7]
random.shuffle(a)                    # shuffle the elements of a in place
print(a)
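The difference between the inclusive upper bound of randint and the exclusive upper bound of randrange is easy to check empirically. A minimal sketch; the seed and the sample size of 1000 are arbitrary choices made for the illustration.

```python
import random

random.seed(0)  # fixed seed only so that repeated runs draw the same numbers

randint_values = {random.randint(1, 3) for _ in range(1000)}
randrange_values = {random.randrange(1, 3) for _ in range(1000)}

print(sorted(randint_values))    # 3 can appear: the upper bound is included
print(sorted(randrange_values))  # 3 never appears: the upper bound is excluded
```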

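random.sample(population, k) function

Draws k distinct random elements from a list. The list itself is not modified and keeps its order. The result differs from run to run; [2, 4] below is one possible output.

import random

items = [0, 1, 2, 3, 4]
rs = random.sample(items, 2)
print(rs)
-----------------------------------
[2, 4]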

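time package

time.time() function

Returns the current time as a timestamp: the number of seconds elapsed since the epoch (1970-01-01 00:00:00 UTC), as a float.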
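A typical use is measuring how long a piece of code takes by subtracting two timestamps. A minimal sketch; the summation is just a stand-in for real work.

```python
import time

start = time.time()
total = sum(range(100_000))  # stand-in for the work being timed
elapsed = time.time() - start

print(f"elapsed: {elapsed:.6f} s")
```

For precise benchmarks time.perf_counter() is usually the better choice, since time.time() follows the system clock and can jump if the clock is adjusted.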

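datetime package

datetime.fromisocalendar() function

Returns the date that corresponds to the given ISO calendar year, week and day; it is the inverse of isocalendar(). Available since Python 3.8 on both date and datetime.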
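A short sketch of the round trip between an ISO (year, week, day) triple and a calendar date, assuming Python 3.8 or later. The date used is only an example.

```python
from datetime import date

# ISO year 2022, ISO week 43, day 2 (Monday is 1, so 2 is Tuesday)
d = date.fromisocalendar(2022, 43, 2)
print(d)  # 2022-10-25

# isocalendar() goes the other way
year, week, weekday = d.isocalendar()
print(year, week, weekday)  # 2022 43 2
```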

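copy package

Reference assignment: b = a copies only the reference, so both names point to the same object.
Shallow copy: copy.copy() copies only the outer object; the objects nested inside it are still shared.
Deep copy: copy.deepcopy() copies the object and, recursively, all of its nested objects.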
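The three cases can be told apart by modifying the original afterwards and looking at each copy. A minimal sketch with a nested list:

```python
import copy

a = [1, [2, 3]]

b = a                 # reference assignment: same object
c = copy.copy(a)      # shallow copy: new outer list, shared inner list
d = copy.deepcopy(a)  # deep copy: nothing shared

a[0] = 100            # change the outer list
a[1].append(4)        # change the nested list

print(b)  # [100, [2, 3, 4]]
print(c)  # [1, [2, 3, 4]]
print(d)  # [1, [2, 3]]
```

Only the deep copy is fully isolated from later changes to a.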

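python gc package

Python's garbage collection is exposed through the gc (garbage collector) module. CPython reclaims most objects by reference counting; the gc module adds a collector for reference cycles, which reference counting alone cannot free.

gc.collect()

Runs a full garbage collection immediately. Avoid calling gc.collect() manually as far as possible.

The exception is when you have created a large object and want its memory reclaimed as soon as you are done with it.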
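The following sketch shows the case gc.collect() exists for: two objects that reference each other survive `del` because their reference counts never drop to zero, and are freed only by the cycle collector. It assumes CPython; the Node class is a hypothetical helper, and automatic collection is disabled briefly only to make the demonstration deterministic.

```python
import gc
import weakref

class Node:
    """Tiny hypothetical class used only to build a reference cycle."""
    def __init__(self):
        self.other = None

gc.disable()                # keep the automatic collector out of the way for the demo
a, b = Node(), Node()
a.other, b.other = b, a     # a and b now reference each other
probe = weakref.ref(a)      # lets us check whether the object is still alive

del a, b                    # reference counts stay above zero because of the cycle
alive_before = probe() is not None

unreachable = gc.collect()  # the cycle collector finds and frees the pair
alive_after = probe() is not None
gc.enable()

print(alive_before, alive_after)
```

gc.collect() returns the number of unreachable objects it found, which is why `unreachable` is greater than zero here.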
