Python Data Analysis Function Cheat Sheet

Latest recommended article published on 2024-01-17 11:30:00

花纵酒

Category: python之机器学习 | Tags: python

Article link: https://blog.csdn.net/lm19770429/article/details/107443974

np.argwhere and pandas Series.unique()

Sample data:

import numpy as np
import pandas as pd

adults = pd.read_csv('data/adults.txt')
adults.head()

	age	workclass	final_weight	education	education_num	marital_status	occupation	relationship	race	sex	capital_gain	hours_per_week	native_country	salary
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	<=50K
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	<=50K
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	<=50K
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	<=50K
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	<=50K

X=adults[['age','education','occupation','hours_per_week']].copy()
Y=adults['salary'].copy()

X.education.unique() # returns the distinct values, in order of first appearance

array(['Bachelors', 'HS-grad', '11th', 'Masters', '9th', 'Some-college',
       'Assoc-acdm', 'Assoc-voc', '7th-8th', 'Doctorate', 'Prof-school',
       '5th-6th', '10th', '1st-4th', 'Preschool', '12th'], dtype=object)

np.argwhere(X.education.unique()=='Masters') # returns the position of the value within the array of unique values

array([[3]])    # note: the result is a 2-D array; read the value with [0][0] or [0, 0]

This approach is typically used to turn string-valued data into integer codes, which is adequate for simple analysis.
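As a minimal sketch of that encoding, the snippet below converts a whole column to integer codes. The `education` Series is a hypothetical stand-in for `X.education`, and the names `levels`, `level_to_code`, and `education_codes` are illustrative only. It also shows `pd.factorize`, which appears to give the same codes in one call.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the education column of the adults data.
education = pd.Series(['Bachelors', 'HS-grad', '11th', 'Masters', 'HS-grad', 'Bachelors'])

# unique() keeps the values in order of first appearance.
levels = np.asarray(education.unique())

# Position of one value inside the unique array (not inside the Series).
masters_code = int(np.argwhere(levels == 'Masters')[0, 0])

# Encode the whole column by mapping each level to its position.
level_to_code = {level: code for code, level in enumerate(levels)}
education_codes = education.map(level_to_code)

# pd.factorize assigns codes in order of first appearance as well.
factorized_codes, factorized_levels = pd.factorize(education)

print(masters_code)
print(education_codes.tolist())
```

Keep in mind that integer codes impose an ordering (here `11th` > `HS-grad`) that the categories themselves may not have.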

A better approach is one-hot encoding, as follows:

pd.get_dummies

get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None) -> 'DataFrame'
    Convert categorical variable into dummy/indicator variables.

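Converts a categorical variable into indicator (one-hot) columns. The example output below comes from an older pandas release; since pandas 2.0 the dummy columns default to dtype bool and print as True/False unless dtype=int is passed.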

Examples
    --------
    >>> s = pd.Series(list('abca'))
    
    >>> pd.get_dummies(s)
       a  b  c
    0  1  0  0
    1  0  1  0
    2  0  0  1
    3  1  0  0
    
    >>> s1 = ['a', 'b', np.nan]
    
    >>> pd.get_dummies(s1)
       a  b
    0  1  0
    1  0  1
    2  0  0
    
    >>> pd.get_dummies(s1, dummy_na=True)
       a  b  NaN
    0  1  0    0
    1  0  1    0
    2  0  0    1
    
    >>> df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
    ...                    'C': [1, 2, 3]})
    
    >>> pd.get_dummies(df, prefix=['col1', 'col2'])
       C  col1_a  col1_b  col2_a  col2_b  col2_c
    0  1       1       0       0       1       0
    1  2       0       1       1       0       0
    2  3       1       0       0       0       1
    
    >>> pd.get_dummies(pd.Series(list('abcaa')))
       a  b  c
    0  1  0  0
    1  0  1  0
    2  0  0  1
    3  1  0  0
    4  1  0  0
    
    >>> pd.get_dummies(pd.Series(list('abcaa')), drop_first=True)
       b  c
    0  0  0
    1  1  0
    2  0  1
    3  0  0
    4  0  0
    
    >>> pd.get_dummies(pd.Series(list('abc')), dtype=float)
         a    b    c
    0  1.0  0.0  0.0
    1  0.0  1.0  0.0
    2  0.0  0.0  1.0

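# For a Series, prefix must be a single string; a list is only valid when encoding DataFrame columns
edu_dummies=pd.get_dummies(X.education,prefix='edu')

X_edu_dummies=pd.concat([X,edu_dummies],axis=1)

# drop() returns a new DataFrame, so assign the result
X_edu_dummies=X_edu_dummies.drop('education',axis=1)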
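The three steps above (encode, concatenate, drop the original column) can be sketched end to end on a small frame. `X_small` is a hypothetical miniature of `X`, and `dtype=int` is an added assumption to get 0/1 output regardless of the pandas version.

```python
import pandas as pd

# Hypothetical miniature of X from the adults data.
X_small = pd.DataFrame({
    'age': [39, 50, 38, 53],
    'education': ['Bachelors', 'Bachelors', 'HS-grad', '11th'],
    'hours_per_week': [40, 13, 40, 40],
})

# One indicator column per education level; dtype=int forces 0/1 values.
edu_dummies = pd.get_dummies(X_small['education'], prefix='edu', dtype=int)

# Append the indicator columns, then drop the original string column.
X_encoded = pd.concat([X_small, edu_dummies], axis=1).drop('education', axis=1)

print(X_encoded.columns.tolist())
print(X_encoded)
```

Each row has exactly one indicator set to 1, so the encoding carries no artificial ordering between levels.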

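pandas map()

Like Python's built-in map(), the pandas Series.map() method takes a function, a dictionary, or any other object that accepts a single input value, applies it to every element of a single column, and returns the results element by element.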

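# data is assumed to be a DataFrame with year, name, gender and count columns
# Define a dictionary mapping F -> female, M -> male
gender2xb = {'F': 'female', 'M': 'male'}
# Use map() to get the mapped version of the gender column
data.gender.map(gender2xb)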

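A lambda function also works:

# The gender column is known to contain only F and M, so a simple lambda is enough
# (compare strings with ==, not is: is tests object identity)
data.gender.map(lambda x:'female' if x == 'F' else 'male')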

You can also pass a named function:

def gender_to_xb(x):
    return 'female' if x == 'F' else 'male'

data.gender.map(gender_to_xb)

A string's format method can be passed as well:

data.gender.map("This kid's gender is {}".format)
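The map() variants above can be compared side by side in a small sketch. `data_small` is a hypothetical sample, and the extra value `'U'` is added on purpose to show how the dictionary and function forms differ on unexpected input.

```python
import pandas as pd

# Hypothetical sample with the gender column the snippets above assume.
data_small = pd.DataFrame({'gender': ['F', 'M', 'F', 'U']})

gender2xb = {'F': 'female', 'M': 'male'}

# Dictionary lookup: values without a key ('U' here) become NaN.
mapped_by_dict = data_small['gender'].map(gender2xb)

# Function: the else branch labels every non-'F' value as 'male'.
mapped_by_func = data_small['gender'].map(lambda x: 'female' if x == 'F' else 'male')

# Bound str.format method: each element is substituted into the template.
mapped_by_format = data_small['gender'].map("This kid's gender is {}".format)

print(mapped_by_dict.tolist())
print(mapped_by_func.tolist())
print(mapped_by_format.tolist())
```

The dictionary form surfaces unmapped values as NaN, which is often easier to audit than a catch-all else branch.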

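apply()

apply() is arguably one of the most useful methods in pandas. It is used much like map(): the main argument is a function that takes an input and returns an output.

Whereas map() works on a single Series, one apply() call can operate on a single column or on several columns at once, which covers many more use cases.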

data.gender.apply(lambda x:'female' if x == 'F' else 'male')

When the function needs values from several columns, pass axis=1 to apply() so that it is called once per row:

def generate_descriptive_statement(year, name, gender, count):
    year, count = str(year), str(count)
    gender = 'female' if gender == 'F' else 'male'

    return 'In {}, {} {} newborns were named {}.'.format(year, count, gender, name)

data.apply(lambda row:generate_descriptive_statement(row['year'],
                                                      row['name'],
                                                      row['gender'],
                                                      row['count']),
           axis = 1)
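A self-contained sketch of the row-wise call is shown below. `data_small` and its values are made up for illustration, and `describe_row` is a hypothetical helper that receives each row as a Series.

```python
import pandas as pd

# Made-up rows with the four columns the example above assumes.
data_small = pd.DataFrame({
    'year': [1880, 1880],
    'name': ['Mary', 'John'],
    'gender': ['F', 'M'],
    'count': [7065, 9655],
})

def describe_row(row):
    # With axis=1, row is a Series indexed by column name.
    gender = 'female' if row['gender'] == 'F' else 'male'
    return 'In {}, {} {} newborns were named {}.'.format(
        row['year'], row['count'], gender, row['name'])

# axis=1 calls describe_row once per row; the result is a Series of strings.
statements = data_small.apply(describe_row, axis=1)

print(statements.tolist())
```

Row-wise apply() runs a Python function for every row, so it can be slow on large frames; for simple cases a vectorized expression or map() on one column is usually faster.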