python 数据处理 - Pandas模块
1.一些函数
1.1 pd.DataFrame
class pandas.``DataFrame
(data=None, index=None**,** columns=None**,** dtype=None**,** copy=False**)**
Parameters
-
data: ndarray (structured or homogeneous), Iterable, dict, or DataFrame
-
index: Index or array-like
-
columns: Index or array-like
-
dtype: dtype, default None
-
copy: bool, default False
Copy data from inputs. Only affects DataFrame / 2d ndarray input
# 示例1
voltage = pd.DataFrame(data=np.array(voltage),
index=pd.MultiIndex.from_product([id, ['real', 'pred']], names=['vin', 'datatype']))
1.2 reset_index, set_index, reindex
1.2.1. reset_index:
DataFrame.``reset_index
(level=None, drop=False**,** inplace=False**,** col_level=0**,** col_fill=’')
Definition:
Reset the index, or a level of it. Reset the index of the DataFrame, and use the default one instead. If the DataFrame has a MultiIndex, this method can remove one or more levels.
Parameters
-
level: int, str, tuple, or list, default None
Only remove the given levels from the index. Removes all levels by default.
-
drop: bool, default False
Do not try to insert index into dataframe columns. This resets the index to the default integer index.
-
inplace: bool, default False
Modify the DataFrame in place (do not create a new object).
-
col_level: int or str, default 0
If the columns have multiple levels, determines which level the labels are inserted into. By default it is inserted into the first level.
-
col_fill: object, default ‘’
If the columns have multiple levels, determines how the other levels are named. If None then the index name is repeated.
# 示例1
df.reset_index(inplace=True, drop=True)
1.2.2. set_index:
DataFrame.``set_index
(keys, drop=True**,** append=False**,** inplace=False**,** verify_integrity=False**)**
Definition:
Set the DataFrame index using existing columns. Set the DataFrame index (row labels) using one or more existing columns or arrays (of the correct length). The index can replace the existing index or expand on it.
Parameters
-
keys: label or array-like or list of labels/arrays
This parameter can be either a single column key, a single array of the same length as the calling DataFrame, or a list containing an arbitrary combination of column keys and arrays. Here, “array” encompasses
Series
,Index
,np.ndarray
, and instances ofIterator
. -
drop: bool, default True
Delete columns to be used as the new index.
-
append: bool, default False
Whether to append columns to existing index.
-
inplace: bool, default False
Modify the DataFrame in place (do not create a new object).
-
verify_integrity: bool, default False
Check the new index for duplicates. Otherwise defer the check until necessary. Setting to False will improve the performance of this method.
# 示例1
df.set_index('vin', inplace=True)
# 示例2, multiindex
df.set_index(['a', 'b'], inplace=True)
# 示例3, series
df.set_index(pd.Series(range(6)))
1.2.3 reindex
Definition:
Conform Series/DataFrame to new index with optional filling logic. Places NA/NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one and copy=False
.
Parameters
-
keywords for axes: array-like, optional
New labels / index to conform to, should be specified using keywords. Preferably an Index object to avoid duplicating data.
-
method: {None, ‘backfill’/’bfill’, ‘pad’/’ffill’, ‘nearest’}
Method to use for filling holes in reindexed DataFrame. Please note: this is only applicable to DataFrames/Series with a monotonically increasing/decreasing index. None (default): don’t fill gaps pad / ffill: Propagate last valid observation forward to next valid. backfill / bfill: Use next valid observation to fill gap. nearest: Use nearest valid observations to fill gap.
-
copy: bool, default True
Return a new object, even if the passed indexes are the same.
-
level: int or name
Broadcast across a level, matching Index values on the passed MultiIndex level.
-
fill_value: scalar, default np.NaN
Value to use for missing values. Defaults to NaN, but can be any “compatible” value.
-
limit: int, default None
Maximum number of consecutive elements to forward or backward fill.
-
tolerance: optional
Maximum distance between original and new labels for inexact matches. The values of the index at the matching locations most satisfy the equation
abs(index[indexer] - target) <= tolerance
. Tolerance may be a scalar value, which applies the same tolerance to all values, or list-like, which applies variable tolerance per element. List-like includes list, tuple, array, Series, and must be the same size as the index and its dtype must exactly match the index’s type.
# 示例1
df = df.reindex(index=car_df.index[::-1])
# 示例2
df = df.reindex(columns=['a', 'b'])
# 示例3 - 使用df.columns重新命名dataframe
1.3. replace
DataFrame.``replace
(to_replace=None, value=None**,** inplace=False**,** limit=None**,** regex=False**,** method='pad’)
Parameters
-
to_replace: str, regex, list, dict, Series, int, float, or None
-
value: scalar, dict, list, str, regex, default None
Value to replace any values matching to_replace with.
# 示例1
df['TermIndex'].replace([1,2],['一','二'], inplace=True) #替换“TermIndex”的值,将数字转为中文
1.4. pd.to_datetime
pandas.to_datetime(arg,errors =‘raise’,utc = None,format = None,unit = None )
pandas中的to_datetime( )有和datetime( )类似的功能。获取指定的时间和日期,将Str和Unicode转化为datetimedatetime(str,format)。
# 示例1
df['date_formatted']=pd.to_datetime(df['date'],format='%Y-%m-%d')
1.5. pd.read_csv
Parameters
-
filepath_or_buffer: str, path object or file-like object
-
header**: int, list of int, default ‘infer’**
-
names****array-like, optional
List of column names to use. If the file contains a header row, then you should explicitly pass
header=0
to override the column names. Duplicates in this list are not allowed. -
index_col****int, str, sequence of int / str, or False, default
None
-
usecols****list-like or callable, optional
-
dtype****Type name or dict of column -> type, optional
-
nrows****int, optional
-
encodingstr, optional
Encoding to use for UTF when reading/writing (ex. ‘utf-8’). List of Python standard encodings
-
chunksizeint, optional
Return TextFileReader object for iteration. See the IO Tools docs for more information on
iterator
andchunksize
. -
parse_datesbool or list of int or names or list of lists or dict, default False
The behavior is as follows:
- boolean. If True -> try parsing the index.
- list of int or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.
- list of lists. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.
- dict, e.g. {‘foo’ : [1, 3]} -> parse columns 1, 3 as date and call result ‘foo’
If a column or index cannot be represented as an array of datetimes, say because of an unparseable value or a mixture of timezones, the column or index will be returned unaltered as an object data type. For non-standard datetime parsing, use
pd.to_datetime
afterpd.read_csv
. To parse an index or column with a mixture of timezones, specifydate_parser
to be a partially-appliedpandas.to_datetime()
withutc=True
. See Parsing a CSV with mixed timezones for more.Note: A fast-path exists for iso8601-formatted dates.
-
date_parserfunction, optional
Function to use for converting a sequence of string columns to an array of datetime instances. The default uses
dateutil.parser.parser
to do the conversion. Pandas will try to call date_parser in three different ways, advancing to the next if an exception occurs: 1) Pass one or more arrays (as defined by parse_dates) as arguments; 2) concatenate (row-wise) the string values from the columns defined by parse_dates into a single array and pass that; and 3) call date_parser once for each row using one or more strings (corresponding to the columns defined by parse_dates) as arguments.
# 示例1
data = pd.read_csv(r"D:\ThinkEnergy_Data\车辆网样例数据\15.csv",usecols=['created','voltage','total_current','soc_display','year','month','day'])
# 示例2 - 输入设置index,如下将第一列作为索引
data = pd.read_csv(filepath, index_col=0)
# 示例3 - 输出没有index
data.to_csv(filepath, index=False)
2.一些tricks
2.1. 取满足条件的行 & 取满足条件行的相应索引
# 示例1
data = data[((data.index) % 50 == 0)]
# 示例2
index = data[(data['status'] == 2) & (data['charging'] == 1)].index.tolist()
# 示例3 - 找到包含nan值的行
temp_indices = data[data.isnull().T.any()].index.tolist()
2.2. 处理缺失值
# 示例1
data.fillna(0, inplace=True)
# 示例2
data.dropna(subset=['volt'], axis=0, how='any', inplace=True)
# 示例3
data.replace(nan, 0)
2.3. 丢弃 drop
# 示例1
data.drop(temp_indices, axis=1,inplace=True)
# 示例2
data.drop_duplicates(inplace=True)
2.4. 拷贝
# 示例1
df1 = df.copy(deep=True)
2.5. 其他
df1 = df.describe()
a=pd.cut(df['压差'],[0,0.005,0.01,0.015,0.02,0.025,0.03,0.035,0.04,0.045,
0.05,0.055,0.06,0.065,0.07,0.075,0.08,0.085,0.09,0.095,0.1,1000])
df_V = pd.Series(df['压差']).groupby(a).agg(['count']).reset_index()#groupby提取汇总统计值
print(df_V)