python

weixin_43591355

Last modified: 2022-01-20 17:24:13

Tags: python

First published: 2019-12-16 13:49:47

Article link: https://blog.csdn.net/weixin_43591355/article/details/95500270

1. Working with Excel files:
https://www.cnblogs.com/hedeyong/p/7646125.html
2. Finding the installation directory on Windows:
https://www.cnblogs.com/miaoqianling/p/10629129.html

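Data

1. Data types: nominal data, ordinal data (ordered categories), numerical data.
2. Data properties: granularity, scope, temporality, faithfulness.
3. Storage structure: tabular data, commonly stored in Comma-Separated Values (CSV) files. Each record is one row, and fields are separated by commas.
4. Visualization:
Qualitative (categorical) data: nominal + ordinal. Use a bar chart.
Quantitative (numerical) data: usually a histogram or a scatter plot.

5. Statistical inference methods: hypothesis tests and confidence intervals. Both can involve resampling.
- Hypothesis test: before testing, the null hypothesis is assumed true. If the observed test statistic would be unlikely under the null (small p-value), reject the null in favor of the alternative; otherwise the null is not rejected.
- Permutation test: the p-value is P(T ≥ T_obs), the share of label-shuffled samples whose statistic T is at least as extreme as the observed T_obs.
https://blog.csdn.net/u011467621/article/details/47971917
- Bootstrapping: resample the data with replacement to create new random samples, then use the simulated distribution of a parameter estimate to build a confidence interval.
6. Classification: categorical prediction, i.e. using the known attributes of an observation to predict its unknown category. Training set → classifier (algorithm).
- Binary classification
- Multiclass classification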
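
The two resampling ideas in item 5 can be sketched with the standard library alone. This is a minimal illustration, not a reference implementation: the function names `permutation_p_value` and `bootstrap_ci`, the sample data, and the choice of the difference in means as the test statistic are all assumptions made for the example.

```python
import random
from statistics import mean


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Estimate P(T >= T_obs) where T is the absolute difference in group means."""
    rng = random.Random(seed)
    t_obs = abs(mean(x) - mean(y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # shuffling breaks any link between value and group label
        t = abs(mean(pooled[:len(x)]) - mean(pooled[len(x):]))
        if t >= t_obs:
            hits += 1
    return hits / n_perm


def bootstrap_ci(sample, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    # Each bootstrap sample is drawn with replacement and has the original size.
    means = sorted(
        mean(rng.choices(sample, k=len(sample))) for _ in range(n_boot)
    )
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi


group_a = [1, 2, 3, 4, 5]        # hypothetical measurements
group_b = [11, 12, 13, 14, 15]   # hypothetical measurements
p_value = permutation_p_value(group_a, group_b)
ci_low, ci_high = bootstrap_ci(group_a)
```

Because the two hypothetical groups do not overlap at all, the estimated p-value comes out small, and the bootstrap interval always lies within the range of the sample.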

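Strings

1. str.index("c") returns the index of the first occurrence (raises ValueError if absent). str.count("c") returns the number of occurrences.
2. Slicing: str[start:stop:step]
Negative index -x: the x-th element from the end
Reverse: str[::-1]
3. Case: str.upper(), str.lower()
4. Boolean checks: str.startswith(), str.endswith()
5. Splitting: str.split("ccc") splits on "ccc" and returns a list.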
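
A short sketch of the string operations listed above; the sample strings are made up for illustration.

```python
s = "banana"

first_a = s.index("a")          # index of the first "a"
count_a = s.count("a")          # how many times "a" occurs
every_other = s[1:5:2]          # characters at indices 1 and 3
second_last = s[-2]             # second character from the end
backwards = s[::-1]             # reversed copy
shout = s.upper()
starts = s.startswith("ba")
ends = s.endswith("na")
parts = "a,b,c".split(",")      # split() returns a list of substrings
```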

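2. in tests membership in a container.
3. Comparison: a is b checks whether both names refer to the same object; a==b checks whether the values are equal.
4. Boolean expression: not exp
5. Creating an object: obj=ClassName()
6. Importing modules:
Set PYTHONPATH=... or call sys.path.append(path) before the import statement
dir(module): lists the names (functions, classes, variables) defined in a module
7. __init__.py
https://blog.csdn.net/mangobar/article/details/81869854

8. Formatted output: print('%s is %d' % (name, age))
%g: floating-point number (uses %e or %f depending on the magnitude of the value)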
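
The difference between `is` and `==`, membership with `in`, and `%`-formatting can be seen in a few lines. The variable names and values below are illustrative only.

```python
a = [1, 2]
b = [1, 2]
alias = a

same_value = a == b        # equal contents
same_object = a is b       # two distinct list objects
alias_is_a = alias is a    # both names refer to one object
has_two = 2 in a           # membership test on a container

line = '%s is %d' % ('Tom', 3)
small = '%g' % 0.00001234   # small magnitude: %g switches to exponent form
plain = '%g' % 1234.5       # moderate magnitude: %g uses fixed-point form
```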

Dictionaries

1. Initialization:

dic={}
dic[k1]=v1
dic[k2]=v2

Or:

dic={
	k1:v1,
	k2:v2
}

2. Iteration: for k,v in dic.items()
Deletion: del dic[k]
3、

d={'a':1,'b':2}
print(d.items())
→dict_items([('a', 1), ('b', 2)])

for key,value in d.items():
    print (key,value)
→a 1
 b 2

## Numpy array
```python
a1=[..]
a2=[..]
import numpy as np
arr1=np.array(a1)
arr2=np.array(a2)
# element-wise arithmetic
num=arr1/arr2
# select elements with a boolean mask
num[num>20]
```
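
A concrete version of the snippet above, with made-up input values standing in for the elided lists, might look like this:

```python
import numpy as np

arr1 = np.array([100, 60, 9])   # hypothetical data
arr2 = np.array([2, 4, 3])      # hypothetical data

ratio = arr1 / arr2             # element-wise division
mask = ratio > 20               # boolean array, True where the condition holds
large = ratio[mask]             # keeps only the elements where mask is True
```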

numpy

np.arange(start,stop,step): returns an array of evenly spaced values in [start, stop). Defaults: start 0, step 1.
np.count_nonzero(array): number of non-zero elements
https://blog.csdn.net/zfhsfdhdfajhsr/article/details/109813613

Difference between list and array: https://blog.csdn.net/yeziand01/article/details/81487202

np.random.choice(a,size): a must be a 1-D array or an int
https://blog.csdn.net/ImwaterP/article/details/96282230

Value at the q-th percentile of an array: np.percentile(array,q)
https://blog.csdn.net/weixin_40845358/article/details/84638449

np.concatenate((a,b,..),axis): default axis=0 appends rows (stacks vertically); axis=1 appends columns (stacks horizontally).
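
The functions above can be tried on small arrays. The inputs here are invented for the example; note that `b` has to be transposed before it can be appended as a column.

```python
import numpy as np

steps = np.arange(0, 10, 2)                           # stop value 10 is excluded
nonzero = np.count_nonzero(np.array([0, 1, 2, 0, 3]))
median = np.percentile(np.array([1, 2, 3, 4, 5]), 50)

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])
rows = np.concatenate((a, b), axis=0)     # b becomes a third row
cols = np.concatenate((a, b.T), axis=1)   # b.T becomes a third column

picks = np.random.choice(steps, size=3)   # random draws from a 1-D array
```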

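Pandas

Data structure: DataFrame, a tabular data structure.

dic={k1:[v1],k2:[v2],k3:[v3]}
import pandas as pd
tb=pd.DataFrame(dic)
# Default index: integers 0 to n-1. To change the index:
tb.index=[xx,xx,xx]
# Summary statistics for the numeric columns (count, mean, std, min, quartiles, max)
tb.describe()
# Build a table from a csv file:
tb1=pd.read_csv('xx.csv',index_col=0)
# Check for missing values; returns a DataFrame of booleans:
tb.isnull()

# DataFrame[] with square brackets selects data by column name:
# One column: single or double brackets
tb['col1'] # return type: pandas.core.series.Series
tb[['col1']] # return type: pandas.core.frame.DataFrame
# Several columns: double brackets are required
tb[['col1','col2',...]] # return type: pandas.core.frame.DataFrame
# Add a new column:
tb['col4']=[xx,xx,xx]

# Row slicing:
tb[0:5]
# Locate rows by position (iloc) or by label (loc):
tb.iloc[0]
tb.loc['label1']
tb.loc[['label2','label3']]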
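
A runnable version of the DataFrame basics above, using a tiny table. The column names, row labels, and values are placeholders chosen for the example.

```python
import pandas as pd

tb = pd.DataFrame({"name": ["a", "b", "c"], "score": [90, 75, 60]})
tb.index = ["r1", "r2", "r3"]          # replace the default 0..n-1 index

col_series = tb["score"]               # single brackets: Series
col_frame = tb[["score"]]              # double brackets: DataFrame
tb["passed"] = tb["score"] >= 70       # add a new column

first_row = tb.iloc[0]                 # by integer position
by_label = tb.loc["r2"]                # by index label
top_two = tb[0:2]                      # row slice
```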

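1. Series in pandas: a linear data structure, a one-dimensional array.
df.apply(): applies a function along an axis; typically returns a Series (or a DataFrame, depending on the function)
df.count(): number of non-null values in each column
df['..'].unique(): array of the unique values in a column (a Series method)
df.nunique(): number of unique values per column
df.sample(frac=...): randomly samples the given fraction of rows. Returns a DataFrame that keeps the original index and values.
df.reset_index(drop=True): resets the index

2. Select a column: df['..'] (df.columns is the Index of column names)
Drop a column: df.drop('..',axis=1,inplace=True) // modifies the original data. // use axis=0 to drop rows
Median: df.median()
Number of rows/columns: df.shape[0], df.shape[1]
Assign to one cell: df.loc[row_index,column_name]=xxx. The chained form df[column_name][row_index]=xxx may act on a copy and is best avoided.
loc takes two arguments: the first selects rows, the second selects columns.
df.loc[boolean_mask,:]
Difference between iloc and loc: iloc uses integer positions, loc uses labels.

3. DataFrame column to list: df['..'].values.tolist() // lists are convenient for plotting

4. Sort by a column: df.sort_values(column_name, ascending=True/False) // ascending is a boolean, not a string; returns a DataFrame
Grouping: df.groupby([col1,col2,...]) // groups along axis=0 by default. Accepts a single column name or a list of names. Returns a DataFrameGroupBy object
Aggregation: aggregates the grouped data. agg() accepts a function, a function name, a list of these, or a dict mapping columns to them.
df.groupby(...).agg(..)
df.groupby(...).count() // returns a DataFrame
df.any(): checks for truthy/non-zero values; True if at least one is true, False if all are false. Default axis=0 checks each column; any(axis=1) checks each row.

5. Adding a column / computing on columns:
https://www.cnblogs.com/wuzhiblog/p/python_new_row_or_col.html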
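
Several of the operations from items 1 to 4 are sketched below on a small, invented table; the column names `team` and `pts` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"team": ["x", "x", "y", "y"], "pts": [3, 1, 4, 2]})

sorted_df = df.sort_values("pts", ascending=False)        # ascending is a bool
totals = df.groupby("team")["pts"].agg(["sum", "mean"])   # one row per team
teams = df["team"].unique().tolist()                      # unique() is a Series method

df.loc[0, "pts"] = 10                 # loc[row label, column name] assigns one cell
values = df["pts"].values.tolist()    # column as a plain Python list
big_rows = df.loc[df["pts"] > 3, :]   # boolean mask selects rows
any_big = (df[["pts"]] > 5).any()     # per-column check, axis=0 by default
```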

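Generators

Values are computed lazily, one per loop iteration.
g=(x for x in range(10))
1. Get the next value: next(g)
2. Execution: the body of a generator function runs each time next() is called, returns at a yield statement, and on the next call resumes from the yield where it last stopped.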
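
The pause-and-resume behaviour of `yield` is easiest to see with a small generator function. `countdown` is a made-up example name.

```python
def countdown(n):
    """Yield n, n-1, ..., 1, pausing at each yield until next() is called again."""
    while n > 0:
        yield n
        n -= 1  # runs only when the generator is resumed


g = countdown(3)
first = next(g)     # runs the body up to the first yield
rest = list(g)      # resumes after that yield and drains the remaining values

squares = (x * x for x in range(4))   # generator expression, also lazy
square_list = list(squares)
```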

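Iterators

zip(): packs the corresponding elements of several iterables into tuples. In Python 3 it returns an iterator of tuples; wrap it in list() to get a list.
Note: if the iterables have different lengths, the result is as long as the shortest one.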
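
A brief illustration of `zip()` with inputs of unequal length; the values are arbitrary.

```python
numbers = [1, 2, 3]
letters = "ab"

pairs = list(zip(numbers, letters))     # stops at the shorter input
lookup = dict(zip(letters, numbers))    # a common way to build a dict from two sequences
unzipped_numbers, unzipped_letters = zip(*pairs)   # zip(*...) reverses the pairing
```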

List comprehension

sent='Fine Ok fine'
ws=sent.split()
lens=[len(w) for w in ws]
→[4, 2, 4]

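Counter()

from collections import Counter
c=Counter(words) # words: any iterable of hashable items
# The n most common elements, in descending order of count
c.most_common(n)
# All elements, each repeated as many times as its count, as an iterator
c.elements()
#OR:
c=Counter()
c.update(new_items) # an iterable of elements or a mapping of counts

Example result: c.most_common(5)
[('the', 88),
('a', 77),
('to', 66),
('in', 51),
('of', 49)]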
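
A small, self-contained run of the same calls; the sentence is invented for the example.

```python
from collections import Counter

words = "the cat and the dog and the bird".split()
c = Counter(words)

top_two = c.most_common(2)          # highest counts first
total = len(list(c.elements()))     # elements() repeats each item by its count

c.update(["cat", "cat"])            # update() adds counts from another iterable
cat_count = c["cat"]
missing = c["fish"]                 # unseen keys count as 0 instead of raising
```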

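9. Variable-length function arguments:
*args: collects extra positional arguments into a tuple
**kwargs: collects extra keyword arguments into a dict

def func(arg1,arg2,**arg3):
	if arg3.get("age") == "3":
		return len(arg3)
	....

func(1,2,name='a',age='3')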
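
To see what ends up in `*args` and `**kwargs`, a function can simply return them. `summarize` is a hypothetical name used only for this sketch.

```python
def summarize(first, *args, **kwargs):
    """Return the arguments grouped by how they were collected."""
    return {"first": first, "extra_positional": args, "keywords": kwargs}


result = summarize(1, 2, 3, name="a", age="3")

# The same operators unpack a sequence or a dict at the call site.
unpacked = summarize(*[1, 2], **{"k": 0})
```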

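Regular expressions
re.sub(pattern,repl,string): replaces every match of pattern in string with repl
\W: non-word character
\S: non-whitespace character
^: start of the string
|: or
.*?: non-greedy (matches as little as possible). By default . does not match \n
pattern=re.compile('…')
re.search(pattern,string)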
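
The pieces above combine as follows; the patterns and input strings are made up for illustration.

```python
import re

cleaned = re.sub(r"\W+", " ", "a,b;;c")           # runs of non-word characters become one space

lazy = re.search(r"<.*?>", "<a><b>").group()      # non-greedy: stops at the first ">"
greedy = re.search(r"<.*>", "<a><b>").group()     # greedy: runs to the last ">"

pattern = re.compile(r"^(cat|dog)\S*")            # "cat" or "dog" at the start, then non-spaces
match = pattern.search("dogs bark")
no_match = pattern.search("hotdog")               # "dog" is not at the start, so no match
```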

Sets

1.

a=set(list1)
b=set(list2)
# Elements present in both sets
print(a.intersection(b))
# Elements present in exactly one of the two sets
print(a.symmetric_difference(b))
# Elements only in a
print(a.difference(b))
# Union of both sets
print(a.union(b))
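
With concrete inputs (chosen arbitrarily here), the four set operations give:

```python
a = set([1, 2, 3, 3])   # duplicates collapse when a list becomes a set
b = set([3, 4])

common = a.intersection(b)             # same as a & b
either_not_both = a.symmetric_difference(b)   # same as a ^ b
only_a = a.difference(b)               # same as a - b
combined = a.union(b)                  # same as a | b
```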

Tuples do not support item assignment (they are immutable).

File matching: the glob library.
glob(): returns a list of the file or directory names that match the given pattern.

import glob
glob.glob('*.json')

Sorting: sorted(iterable,key,reverse)
key=lambda item:item[field_index]: sorts by the given field.

print(sorted([4, 5, -10, -1],key=lambda x:-x)) # descending
→[5, 4, -1, -10]
print(sorted([4, 5, -10, -1],reverse=True)) # same result
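
Sorting by a field, as described above, looks like this on a list of tuples. The records are invented for the example.

```python
records = [("bob", 25), ("alice", 30), ("carol", 22)]   # (name, age)

by_age = sorted(records, key=lambda item: item[1])                      # field index 1
by_name = sorted(records, key=lambda item: item[0])                     # field index 0
oldest_first = sorted(records, key=lambda item: item[1], reverse=True)
```

`sorted()` returns a new list and leaves `records` unchanged.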

https://www.cnblogs.com/baxianhua/p/8874892.html
https://blog.csdn.net/Super_Tiger_Lee/article/details/78158059
https://www.jianshu.com/p/0c2cd801712b

https://www.jianshu.com/p/9d232e4a3c28
https://www.jianshu.com/p/661af704198c

https://www.cnblogs.com/weiyinfu/p/10693445.html

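File-like objects

→ Reading a file: open() returns a file-like object. File-like objects include file streams, in-memory byte streams, network streams, custom streams, etc. They do not need to inherit from a particular class; they only need a read() method.

f=open('xxxxxpath','r',encoding='gbk',errors='ignore') # text mode 'r'; binary mode 'rb' takes no encoding/errors
f.read()

Writing a file:

f=open('xxxxxpath','w',encoding='gbk',errors='ignore') # 'w' overwrites, 'a' appends; binary mode 'wb' takes no encoding/errors
f.write('...')
f.close() # without this, buffered data may never reach the file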

with…as

It performs the necessary cleanup and releases the resource even if an exception occurs, so it can replace try…finally.

with open(...) as f:
	f.write(...)

https://blog.csdn.net/msspark/article/details/86745391
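
A minimal sketch of writing, appending, and reading with `with`, plus an in-memory stream to show that any object with a `read()` method works as a file-like object. The temporary path and the helper `count_chars` are assumptions made for the example.

```python
import io
import os
import tempfile


def count_chars(stream):
    """Works with any file-like object, since it only calls read()."""
    return len(stream.read())


path = os.path.join(tempfile.mkdtemp(), "demo.txt")   # throwaway location

with open(path, "w", encoding="utf-8") as f:   # 'w' overwrites
    f.write("hello\n")
with open(path, "a", encoding="utf-8") as f:   # 'a' appends
    f.write("world\n")
with open(path, "r", encoding="utf-8") as f:
    text = f.read()

closed_after_with = f.closed                   # the with block closed the file
in_memory = count_chars(io.StringIO("abc"))    # StringIO is file-like but not a file
```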

accumulate(s): an iterator over the running sums of the sequence s (from itertools)
e.g. s=[1,2,3] → for i in accumulate(s): print(i) prints 1, 3, 6
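
The same example as runnable code, with an optional second argument shown as well; using `operator.mul` to get running products is an addition for illustration.

```python
from itertools import accumulate
import operator

s = [1, 2, 3]
running_sums = list(accumulate(s))                    # default operation is addition
running_products = list(accumulate(s, operator.mul))  # any two-argument function can be passed
```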