python 数据处理 - Pandas模块

最新推荐文章于 2024-08-29 10:35:36 发布

o(>-<)o

最新推荐文章于 2024-08-29 10:35:36 发布

阅读量832

点赞数

分类专栏： python库文章标签： python 数据分析

本文链接：https://blog.csdn.net/weixin_41535493/article/details/109102348

版权

python库专栏收录该内容

6 篇文章 0 订阅

订阅专栏

这篇博客介绍了Pandas库中DataFrame的创建、索引操作，包括reset_index、set_index、reindex函数的使用，以及replace方法进行数据替换，pd.to_datetime转换日期格式，pd.read_csv读取CSV文件。还分享了处理缺失值、删除行和列的技巧，以及数据拷贝等实用操作。

摘要由CSDN通过智能技术生成

python 数据处理 - Pandas模块

1.一些函数

1.1 pd.DataFrame

class pandas.``DataFrame(data=None, index=None**,** columns=None**,** dtype=None**,** copy=False**)**

Parameters

data: ndarray (structured or homogeneous), Iterable, dict, or DataFrame
index: Index or array-like
columns: Index or array-like
dtype: dtype, default None
copy: bool, default False

Copy data from inputs. Only affects DataFrame / 2d ndarray input

# 示例1
voltage = pd.DataFrame(data=np.array(voltage),
			index=pd.MultiIndex.from_product([id, ['real', 'pred']], names=['vin', 'datatype']))

1.2 reset_index, set_index, reindex

1.2.1. reset_index:

DataFrame.``reset_index(level=None, drop=False**,** inplace=False**,** col_level=0**,** col_fill=’')

Definition:

Reset the index, or a level of it. Reset the index of the DataFrame, and use the default one instead. If the DataFrame has a MultiIndex, this method can remove one or more levels.

Parameters

level: int, str, tuple, or list, default None

Only remove the given levels from the index. Removes all levels by default.
drop: bool, default False

Do not try to insert index into dataframe columns. This resets the index to the default integer index.
inplace: bool, default False

Modify the DataFrame in place (do not create a new object).
col_level: int or str, default 0

If the columns have multiple levels, determines which level the labels are inserted into. By default it is inserted into the first level.
col_fill: object, default ‘’

If the columns have multiple levels, determines how the other levels are named. If None then the index name is repeated.

# 示例1
df.reset_index(inplace=True, drop=True)

1.2.2. set_index:

DataFrame.``set_index(keys, drop=True**,** append=False**,** inplace=False**,** verify_integrity=False**)**

Definition:

Set the DataFrame index using existing columns. Set the DataFrame index (row labels) using one or more existing columns or arrays (of the correct length). The index can replace the existing index or expand on it.

Parameters

keys: label or array-like or list of labels/arrays

This parameter can be either a single column key, a single array of the same length as the calling DataFrame, or a list containing an arbitrary combination of column keys and arrays. Here, “array” encompasses Series, Index, np.ndarray, and instances of Iterator.
drop: bool, default True

Delete columns to be used as the new index.
append: bool, default False

Whether to append columns to existing index.
inplace: bool, default False

Modify the DataFrame in place (do not create a new object).
verify_integrity: bool, default False

Check the new index for duplicates. Otherwise defer the check until necessary. Setting to False will improve the performance of this method.

# 示例1
df.set_index('vin', inplace=True)
# 示例2, multiindex
df.set_index(['a', 'b'], inplace=True)
# 示例3, series
df.set_index(pd.Series(range(6)))

1.2.3 reindex

Definition:

Conform Series/DataFrame to new index with optional filling logic. Places NA/NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one and copy=False.

Parameters

keywords for axes: array-like, optional

New labels / index to conform to, should be specified using keywords. Preferably an Index object to avoid duplicating data.
method: {None, ‘backfill’/’bfill’, ‘pad’/’ffill’, ‘nearest’}

Method to use for filling holes in reindexed DataFrame. Please note: this is only applicable to DataFrames/Series with a monotonically increasing/decreasing index. None (default): don’t fill gaps pad / ffill: Propagate last valid observation forward to next valid. backfill / bfill: Use next valid observation to fill gap. nearest: Use nearest valid observations to fill gap.
copy: bool, default True

Return a new object, even if the passed indexes are the same.
level: int or name

Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value: scalar, default np.NaN

Value to use for missing values. Defaults to NaN, but can be any “compatible” value.
limit: int, default None

Maximum number of consecutive elements to forward or backward fill.
tolerance: optional

Maximum distance between original and new labels for inexact matches. The values of the index at the matching locations most satisfy the equation abs(index[indexer] - target) <= tolerance. Tolerance may be a scalar value, which applies the same tolerance to all values, or list-like, which applies variable tolerance per element. List-like includes list, tuple, array, Series, and must be the same size as the index and its dtype must exactly match the index’s type.

# 示例1
df = df.reindex(index=car_df.index[::-1])
# 示例2
df = df.reindex(columns=['a', 'b'])
# 示例3 - 使用df.columns重新命名dataframe

1.3. replace

DataFrame.``replace(to_replace=None, value=None**,** inplace=False**,** limit=None**,** regex=False**,** method='pad’)

Parameters

to_replace: str, regex, list, dict, Series, int, float, or None
value: scalar, dict, list, str, regex, default None

Value to replace any values matching to_replace with.

# 示例1
df['TermIndex'].replace([1,2],['一','二'], inplace=True) #替换“TermIndex”的值，将数字转为中文

1.4. pd.to_datetime

pandas.to_datetime（arg，errors =‘raise’，utc = None，format = None，unit = None ）
pandas中的to_datetime( )有和datetime( )类似的功能。获取指定的时间和日期，将Str和Unicode转化为datetimedatetime(str,format)。

# 示例1
df['date_formatted']=pd.to_datetime(df['date'],format='%Y-%m-%d')

1.5. pd.read_csv

Parameters

filepath_or_buffer: str, path object or file-like object
header**: int, list of int, default ‘infer’**
names****array-like, optional

List of column names to use. If the file contains a header row, then you should explicitly pass header=0 to override the column names. Duplicates in this list are not allowed.
index_col****int, str, sequence of int / str, or False, default None
usecols****list-like or callable, optional
dtype****Type name or dict of column -> type, optional
nrows****int, optional
encodingstr, optional

Encoding to use for UTF when reading/writing (ex. ‘utf-8’). List of Python standard encodings
chunksizeint, optional

Return TextFileReader object for iteration. See the IO Tools docs for more information on iterator and chunksize.
parse_datesbool or list of int or names or list of lists or dict, default False

The behavior is as follows:
- boolean. If True -> try parsing the index.
- list of int or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.
- list of lists. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.
- dict, e.g. {‘foo’ : [1, 3]} -> parse columns 1, 3 as date and call result ‘foo’
If a column or index cannot be represented as an array of datetimes, say because of an unparseable value or a mixture of timezones, the column or index will be returned unaltered as an object data type. For non-standard datetime parsing, use pd.to_datetime after pd.read_csv. To parse an index or column with a mixture of timezones, specify date_parser to be a partially-applied pandas.to_datetime() with utc=True. See Parsing a CSV with mixed timezones for more.

Note: A fast-path exists for iso8601-formatted dates.
date_parserfunction, optional

Function to use for converting a sequence of string columns to an array of datetime instances. The default uses dateutil.parser.parser to do the conversion. Pandas will try to call date_parser in three different ways, advancing to the next if an exception occurs: 1) Pass one or more arrays (as defined by parse_dates) as arguments; 2) concatenate (row-wise) the string values from the columns defined by parse_dates into a single array and pass that; and 3) call date_parser once for each row using one or more strings (corresponding to the columns defined by parse_dates) as arguments.

# 示例1
data = pd.read_csv(r"D:\ThinkEnergy_Data\车辆网样例数据\15.csv",usecols=['created','voltage','total_current','soc_display','year','month','day'])
# 示例2 - 输入设置index，如下将第一列作为索引
data = pd.read_csv(filepath, index_col=0)
# 示例3 - 输出没有index
data.to_csv(filepath, index=False)

2.一些tricks

2.1. 取满足条件的行 & 取满足条件行的相应索引

# 示例1
data = data[((data.index) % 50 == 0)]
# 示例2
index = data[(data['status'] == 2) & (data['charging'] == 1)].index.tolist()
# 示例3 - 找到包含nan值的行
temp_indices = data[data.isnull().T.any()].index.tolist()

2.2. 处理缺失值

# 示例1
data.fillna(0, inplace=True)
# 示例2
data.dropna(subset=['volt'], axis=0, how='any', inplace=True)
# 示例3
data.replace(nan, 0)

2.3. 丢弃 drop

# 示例1
data.drop(temp_indices, axis=1,inplace=True)
# 示例2
data.drop_duplicates(inplace=True)

2.4. 拷贝

# 示例1
df1 = df.copy(deep=True)

2.5. 其他

df1 = df.describe()

a=pd.cut(df['压差'],[0,0.005,0.01,0.015,0.02,0.025,0.03,0.035,0.04,0.045,
0.05,0.055,0.06,0.065,0.07,0.075,0.08,0.085,0.09,0.095,0.1,1000])
df_V = pd.Series(df['压差']).groupby(a).agg(['count']).reset_index()#groupby提取汇总统计值
print(df_V)