Some usage scenarios for the pandas DataFrame

SCHLAU_tono

Tags: python, data mining, data analysis

Article link: https://blog.csdn.net/qq_40899248/article/details/121782874

I/O

Input

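Read a local CSV file with pd.read_csv(...)
Official documentation: pandas.read_csv
Commonly used parameters:

Parameter	Meaning	Usage
sep	Delimiter that separates the columns. Defaults to `,`. The `delimiter` parameter is an alias with the same behavior.
header	Row number to use as the column names. The default is `'infer'`: if `names` is not passed, this behaves like `header=0`; if `names` is passed, it behaves like `header=None`.
index_col	Column(s) to use as the index. Accepts int, str, sequence of int/str, or False. `index_col=False` forces pandas not to use the first column as the index.
quoting	Controls how quote characters are handled. Accepts an int or one of QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).	With `quoting=3` (QUOTE_NONE), quote characters get no special treatment and are read in as part of the data.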
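
As a minimal sketch of how these parameters interact, the example below parses a small in-memory string instead of a file on disk. The sample data and the names `raw`, `df_default` and `df_raw` are made up for illustration.

```python
import csv
import io

import pandas as pd

# Hypothetical sample data standing in for a local CSV file.
raw = 'id;word;note\n1;old;"quoted"\n2;new;plain\n'

# sep=';' sets the delimiter, header=0 takes the first line as the column
# names, and index_col='id' uses that column as the index.
df_default = pd.read_csv(io.StringIO(raw), sep=';', header=0, index_col='id')

# quoting=csv.QUOTE_NONE (the same as quoting=3) keeps the quote characters
# as part of the field instead of stripping them.
df_raw = pd.read_csv(io.StringIO(raw), sep=';', header=0, index_col='id',
                     quoting=csv.QUOTE_NONE)

print(df_default.loc[1, 'note'])  # quotes are stripped
print(df_raw.loc[1, 'note'])      # quotes are kept
```

Passing the constant from the standard `csv` module rather than the bare integer makes the intent easier to read.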

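Building a table from lists

Requirement

Given a list A and a list B whose elements correspond one to one, build a two-dimensional table from them.

Implementation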
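
The original post leaves this part empty. One possible sketch, assuming each list should become one column, passes a dict of lists to `pd.DataFrame`. The list contents and the column names `'A'` and `'B'` are illustrative.

```python
import pandas as pd

# Hypothetical example lists; A[i] corresponds to B[i].
A = ['old', 'smart', 'hard']
B = [1.58, 9.20, 8.77]

# Fail early with a clear message if the lists do not line up.
if len(A) != len(B):
    raise ValueError('A and B must have the same length')

# Each dict key becomes a column name and each list becomes that column.
table = pd.DataFrame({'A': A, 'B': B})
print(table)
```

If the pairs should be rows of an existing table instead, `pd.DataFrame(list(zip(A, B)), columns=['A', 'B'])` builds the same result from row tuples.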

1 Query

1.1 Scenario

Find the value in column B for the row whose value in column A is alpha.

Code

# hidden DataFrame initialization code
>>> df
   Tags     I      want   to      race
0    VB  0.00  0.009300  0.00  0.00012
1    TO  0.00  0.000000  0.99  0.00000
2    NN  0.00  0.000054  0.00  0.00057
3  PPSS  0.37  0.000000  0.00  0.00000

# Find the 'want' value in the row where 'Tags' is 'NN'
>>> df.loc[df['Tags'] == 'NN', 'want']
Out:
2    0.000054
Name: want, dtype: float64

Another way to write it

# Initialization
>>> display(data)
      word1        word2  SimLex999  w2v_sim
0       old          new       1.58      0.0
1     smart  intelligent       9.20      0.0
2      hard    difficult       8.77      0.0
3     happy     cheerful       9.55      0.0
4      hard         easy       0.95      0.0

>>> w1_happy = data.word1 == 'happy'
>>> display(w1_happy)
0      False
1      False
2      False
3       True
4      False

>>> data.loc[w1_happy & (data.word2 == 'cheerful')].iloc[0]  # combine both masks, then take the first matching row
word1           happy
word2        cheerful
SimLex999        9.55
w2v_sim             0

1.2 Conditional query

Select the columns in which at least one cell satisfies condition C.

Code

>>> df
    A   B   C   D   E
0  11  34  78   5  11
1  12  31  98   7  34
2  13  16  11  11  56
3  89  41  12  12  78

# Select the columns that contain any value between 30 and 40 (inclusive)
>>> mask = ((df >= 30) & (df <= 40)).any()  # one bool per column
>>> sub_df = df.loc[:, mask]
>>> print(sub_df)

    B   E
0  34  11
1  31  34
2  16  56
3  41  78

((df>=30) & (df<=40)).any(): the expression (df>=30) & (df<=40) returns a DataFrame of the same size containing only bool values. A cell is True if the corresponding cell in the original DataFrame lies between 30 and 40 inclusive, and False otherwise. Calling any() on this Boolean DataFrame reduces each column to a single value and returns a bool Series indexed by column name. A True entry means that column contains at least one value in the range.
Passing that bool Series as the column argument of loc[] then selects exactly those columns.

Select the rows in which the cell of a given column satisfies condition C.

Code
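
The original post gives no code for this case. A minimal sketch, assuming condition C is "the value in column B lies between 30 and 40", builds a Boolean mask from that single column and passes it as the row argument of `loc[]`. The sample frame below is illustrative.

```python
import pandas as pd

# Illustrative data; column B is the one being tested.
df = pd.DataFrame({
    'A': [11, 12, 13, 89],
    'B': [34, 31, 16, 41],
    'E': [11, 34, 56, 78],
})

# One bool per row: True where B is between 30 and 40 (both ends included).
row_mask = df['B'].between(30, 40)

# A Boolean Series in the row position of loc[] keeps only the True rows.
sub_df = df.loc[row_mask]
print(sub_df)
```

This mirrors the column selection above: there the mask sits in the column position of `loc[]`, here it sits in the row position.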

2 Assignment

Scenario

Assign a value to the cell at row b of column A.

Code

Statement

df.loc[index, column] = value

Usage

# hidden DataFrame initialization code
>>> df
   Tags     I      want   to      race
0    VB  0.00  0.009300  0.00  0.00012
1    TO  0.00  0.000000  0.99  0.00000
2    NN  0.00  0.000054  0.00  0.00057
3  PPSS  0.37  0.000000  0.00  0.00000

>>> df.loc[1,'I']=0.0002
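
For reference, a self-contained version of the assignment might look like the sketch below. The frame is rebuilt from a subset of the columns shown above, and the mask-based assignment with the value 0.0001 is an added illustration, not part of the original post.

```python
import pandas as pd

# Rebuild part of the table shown above (subset of columns for brevity).
df = pd.DataFrame({
    'Tags': ['VB', 'TO', 'NN', 'PPSS'],
    'I': [0.00, 0.00, 0.00, 0.37],
    'want': [0.0093, 0.0, 0.000054, 0.0],
})

# Label-based assignment: row label 1, column 'I'.
df.loc[1, 'I'] = 0.0002

# The row selector can also be a Boolean mask (hypothetical new value).
df.loc[df['Tags'] == 'NN', 'want'] = 0.0001

print(df)
```

Assigning through a single `loc[]` call, rather than chaining `df[...][...] = value`, writes to the original frame instead of a possible temporary copy.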

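3 Query

3.1 Scenario

Count the rows in which a column's value satisfies a condition.

3.2 Code

Statement

Solution 1
dataset['stance'].value_counts(sort=True, ascending=False, normalize=False, bins=None, dropna=True)
Solution 2
dataset.groupby('stance').count()['title']

Parameters:
sort=True: whether to sort the result by count; sorted by default.
ascending=False: sort order; descending by default.
normalize=False: if True, return relative frequencies (proportions) instead of counts; defaults to False.
bins=None: group the values into bins instead of counting distinct values; only works with numeric data; defaults to None (no binning).
dropna=True: whether to exclude NaN from the counts; excluded by default.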
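
The two solutions and the main parameters can be tried on a small frame. This is a sketch under assumed data: the column names `stance` and `title` follow Solution 2, while the values are invented.

```python
import pandas as pd

# Hypothetical data; one stance is missing on purpose.
dataset = pd.DataFrame({
    'title': ['t1', 't2', 't3', 't4', 't5'],
    'stance': ['agree', 'disagree', 'agree', None, 'agree'],
})

# Solution 1: count rows per distinct value (missing values dropped by default).
counts = dataset['stance'].value_counts()

# normalize=True returns proportions of the non-missing rows instead of counts.
shares = dataset['stance'].value_counts(normalize=True)

# dropna=False adds a separate entry for the missing values.
with_nan = dataset['stance'].value_counts(dropna=False)

# Solution 2: group by stance, then count the non-null titles in each group.
grouped = dataset.groupby('stance').count()['title']

# Counting the rows for a single condition: sum a Boolean mask.
n_agree = (dataset['stance'] == 'agree').sum()

print(counts, shares, with_nan, grouped, n_agree, sep='\n\n')
```

Solution 2 counts non-null values of `title` within each group, so it can differ from Solution 1 when `title` itself has missing entries.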