Python Miscellany


瓦全


Category: Python. Tags: encoding, data, labels, Python

Article link: https://blog.csdn.net/qq_31150399/article/details/72921151


 
1. The difference between ".fit_transform" and ".transform"

from sklearn import preprocessing
le = preprocessing.LabelEncoder()

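# Learn the label-encoding rule from data_y, then return data_y encoded with that rule and assign it back to data_y
data_y = le.fit_transform(data_y)

# Only learn the label-encoding rule from data_y; nothing is encoded yet
le.fit(data_y)

# Encode data_y with the previously learned rule and assign the encoded labels back to data_y
data_y = le.transform(data_y)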
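The typical reason to separate the two calls is to learn the encoding on one data set and reuse it on another. The following is a minimal sketch under that assumption; the label lists `train_labels` and `test_labels` are made up for illustration and do not come from the original post.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical label data, invented for this illustration.
train_labels = ["cat", "dog", "cat", "bird"]
test_labels = ["dog", "bird"]

le = LabelEncoder()

# fit_transform: learn the mapping from train_labels and encode them in one step.
encoded_train = le.fit_transform(train_labels)

# transform: reuse the mapping learned above; nothing is re-learned here.
encoded_test = le.transform(test_labels)

print(list(le.classes_))    # classes are stored in sorted order
print(list(encoded_train))
print(list(encoded_test))
```

Calling `transform` with a label that was not seen during fitting raises an error, which is why the encoder is normally fitted on the full set of expected labels.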

 
2. Printing a cross-tabulation

import pandas as pd
print(pd.crosstab(data['FAULT_TYPE_3'], data['ORG_NO_5'], margins=True))
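To make the effect of `margins=True` concrete, here is a small self-contained sketch. The column names match the ones used above, but the values in this `data` frame are invented for illustration, since the original data set is not shown.

```python
import pandas as pd

# Hypothetical stand-in for the author's data set.
data = pd.DataFrame({
    "FAULT_TYPE_3": ["A", "A", "B", "B", "B"],
    "ORG_NO_5": ["x", "y", "x", "x", "y"],
})

# Count how often each (FAULT_TYPE_3, ORG_NO_5) pair occurs.
# margins=True appends an "All" row and an "All" column holding the totals.
table = pd.crosstab(data["FAULT_TYPE_3"], data["ORG_NO_5"], margins=True)
print(table)
```

The bottom-right cell of the result is the total number of rows in the input.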

 
3. Generating a random matrix

①

>>> from numpy import random
>>> data = random.random(size=(5, 4))
>>> data
array([[ 8.83326804e-01, 4.62247133e-01, 7.00437565e-04, 6.06600334e-02],
       [ 9.76011953e-01, 9.28506787e-01, 6.00816917e-01, 3.81064458e-01],
       [ 9.46751253e-01, 4.25659552e-01, 3.25210318e-01, 7.47624195e-01],
       [ 6.71764806e-01, 2.65358764e-01, 1.84557967e-01, 4.33813712e-01],
       [ 6.02910969e-01, 3.82080865e-01, 6.20733312e-01, 8.27651438e-01]])

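The random function takes the shape of the matrix to generate, given as a tuple, as its only parameter. The code above returns a random matrix with five rows and four columns whose values lie in the half-open interval [0, 1); the result is a numpy.ndarray. Besides random, there is also randint, which generates a random matrix of integers.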
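The shape and value ranges described above can be checked directly. This sketch uses a seeded `RandomState` so that reruns give the same numbers; the seed value 0 is an arbitrary choice made for this illustration.

```python
import numpy as np

# Seeded generator so the example is repeatable; the seed is arbitrary.
rng = np.random.RandomState(0)

# Floats drawn uniformly from [0, 1), arranged as 5 rows by 4 columns.
floats = rng.random_sample(size=(5, 4))

# Integers from 1 up to, but not including, 100, with the same shape.
ints = rng.randint(1, 100, size=(5, 4))

print(type(floats).__name__, floats.shape)
print(floats.min() >= 0.0, floats.max() < 1.0)
print(ints.min() >= 1, ints.max() < 100)
```

Note that the upper bound passed to `randint` is exclusive, so `randint(1, 100, ...)` never returns 100.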

②

 
>>> data = random.randint(1, 100, size=(5, 4))
>>> data
array([[95, 53, 98, 55],
       [94, 93, 44, 62],
       [52, 47, 42, 13],
       [97, 74, 50, 34],
       [53,  4, 25, 11]])

 
4. Converting a matrix to a DataFrame

>>> from pandas import DataFrame
>>> df = DataFrame(data, index=['one', 'two', 'three', 'four', 'five'],
...                columns=['year', 'state', 'pop', 'debt'])
>>> df
       year  state  pop  debt
one      95     53   98    55
two      94     93   44    62
three    52     47   42    13
four     97     74   50    34
five     53      4   25    11

 
5. Indexing and slicing

The index of a pandas object is not limited to integers.

Series

>>> df['year']
one      95
two      94
three    52
four     97
five     53
Name: year, dtype: int32

① Slicing with integers: positions start at 0 and the right boundary is excluded

>>> df['year'][2:4]
three    52
four     97
Name: year, dtype: int32

② Slicing with non-integer labels: the end label is included

>>> df['year']['two':'four']
two      94
three    52
four     97
Name: year, dtype: int32

 
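DataFrame

The standard slicing syntax for a DataFrame object is .ix[::, ::]. The ix indexer accepts two slices, one along the rows (axis=0) and one along the columns (axis=1). Note that .ix was deprecated in pandas 0.20 and removed in pandas 1.0; on current versions the call below is written df.iloc[2:4, 0:3].

>>> df.ix[2:4, 0:3]
       year  state  pop
three    52     47   42
four     97     74   50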

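Without ix, indexing the DataFrame directly is a special case: a single label selects a column, while a slice selects rows.

Indexing

>>> df['year']
one      95
two      94
three    52
four     97
five     53
Name: year, dtype: int32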

 
Slicing

>>> df[2:4]
       year  state  pop  debt
three    52     47   42    13
four     97     74   50    34

>>> df['two':'four']
       year  state  pop  debt
two      94     93   44    62
three    52     47   42    13
four     97     74   50    34
 
6. One-hot encoding categorical attributes with pandas get_dummies

The source is as follows (the lines marked with # are my own translated notes):

 
def get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False,
                columns=None, sparse=False):
    """

 
    Convert categorical variable into dummy/indicator variables

    Parameters
    ----------
    data : array-like, Series, or DataFrame
        # The data set.
    prefix : string, list of strings, or dict of strings, default None
        # Prefix added to the encoded columns, default None. You can define the
        # prefix yourself, either one prefix for everything (prefix='col') or
        # one per original column (prefix=['colA', 'colB']).
        String to append DataFrame column names
        Pass a list with length equal to the number of columns
        when calling get_dummies on a DataFrame. Alternatively, `prefix`
        can be a dictionary mapping column names to prefixes.
    prefix_sep : string, default '_'
        # Separator between the prefix and the category value in the encoded
        # column name, default '_'; it can be changed to something else.
        If appending prefix, separator/delimiter to use. Or pass a
        list or dictionary as with `prefix.`
    dummy_na : bool, default False
        # Boolean: whether to add a column that marks NaN values, default False.
        Add a column to indicate NaNs, if False NaNs are ignored.
    columns : list-like, default None
        # One-hot encode only the specified columns, default None. In my view
        # this is similar to prefix, but with prefix all categorical variables
        # are encoded by default, whereas columns lets you pick a subset.
        Column names in the DataFrame to be encoded.
        If `columns` is None then all the columns with
        `object` or `category` dtype will be converted.
    sparse : bool, default False
        # Boolean: whether the returned DataFrame uses sparse storage, default False.
        Whether the returned DataFrame should be sparse or not.

        .. versionadded:: 0.16.1

    Returns
    -------
    dummies : DataFrame

 
  
    Examples
    --------
    >>> import pandas as pd
    >>> s = pd.Series(list('abca'))

    >>> get_dummies(s)
       a  b  c
    0  1  0  0
    1  0  1  0
    2  0  0  1
    3  1  0  0

    >>> s1 = ['a', 'b', np.nan]

    >>> get_dummies(s1)
       a  b
    0  1  0
    1  0  1
    2  0  0

    >>> get_dummies(s1, dummy_na=True)
       a  b  NaN
    0  1  0    0
    1  0  1    0
    2  0  0    1

    >>> df = DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
    ...                 'C': [1, 2, 3]})

    >>> get_dummies(df, prefix=['col1', 'col2'])
       C  col1_a  col1_b  col2_a  col2_b  col2_c
    0  1       1       0       0       1       0
    1  2       0       1       1       0       0
    2  3       1       0       0       0       1

    See also ``Series.str.get_dummies``.
    """

PS

  # Note: newer pandas versions reject a bare string here and require a list such as columns=['A'].
  >>> pd.get_dummies(df, prefix_sep='.', columns='A')
     B  C  A.a  A.b
  0  b  1    1    0
  1  a  2    0    1
  2  c  3    1    0

  >>> pd.get_dummies(df, prefix_sep='.', columns='A', 'B')
  SyntaxError: non-keyword arg after keyword arg

  >>> pd.get_dummies(df, prefix_sep='.', columns=['A', 'B'])
     C  A.a  A.b  B.a  B.b  B.c
  0  1    1    0    0    1    0
  1  2    0    1    1    0    0
  2  3    1    0    0    0    1
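The calls above can be reproduced as a standalone script. This is a minimal sketch that always passes `columns` as a list; the variable names `encoded_a`, `encoded_ab` and `with_nan` are chosen for this illustration. The dtype of the indicator columns differs between pandas versions (integer in older releases, boolean in newer ones), so the sketch only prints the column names and converts values to int before showing them.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'], 'C': [1, 2, 3]})

# Encode only column A; B and C are passed through unchanged.
encoded_a = pd.get_dummies(df, prefix_sep='.', columns=['A'])

# Encode columns A and B; the list form avoids the SyntaxError shown above.
encoded_ab = pd.get_dummies(df, prefix_sep='.', columns=['A', 'B'])

# dummy_na=True adds one extra indicator column for missing values.
with_nan = pd.get_dummies(pd.Series(['a', 'b', np.nan]), dummy_na=True)

print(list(encoded_a.columns))
print(list(encoded_ab.columns))
print(encoded_ab['A.a'].astype(int).tolist())
print(with_nan.shape)
```

The untouched columns come first in the result, followed by the indicator columns in the order the encoded columns were listed.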
