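1. Reading CSV files with pandas.
Controlling which rows are read:
skiprows: skip the given number of rows at the start of the file (a list of row numbers to skip is also accepted)
nrows: read only the given number of rows
skipfooter: skip a fixed number of rows at the end of the file
skip_blank_lines: skip blank lines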
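As a minimal sketch of the row-selection parameters listed above: the data below is made up for illustration, and io.StringIO stands in for a real file. Note that skipfooter is only supported by the Python parser engine, which is why engine="python" is passed in that call.

```python
import io
import pandas as pd

# Hypothetical data: two junk lines, a header, a blank line, and one footer line.
raw = (
    "# exported by some tool\n"
    "# second junk line\n"
    "id,value\n"
    "1,10\n"
    "\n"
    "2,20\n"
    "3,30\n"
    "total,60\n"
)

# Skip the two leading lines and the last line; blank lines are dropped
# because skip_blank_lines defaults to True.
df_all = pd.read_csv(io.StringIO(raw), skiprows=2, skipfooter=1, engine="python")

# nrows limits the number of data rows read (the header row is not counted).
df_head = pd.read_csv(io.StringIO(raw), skiprows=2, nrows=2)

print(df_all)
print(df_head)
```

Reading only the first few rows with nrows is a cheap way to inspect a large file before loading all of it.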
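Parsing the content:
sep/delimiter: the field separator, which matters a lot. Common choices are the comma, the space, and the tab ('\t'); a regular expression can also be given
na_values: additional values that should be treated as missing (NaN)
thousands: the thousands separator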
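The content-parsing parameters above can be sketched as follows. The two small data sets are hypothetical; in the first, "-" is assumed to mark a missing value and "," to group thousands.

```python
import io
import pandas as pd

# Hypothetical tab-separated data.
raw = "city\tpopulation\tarea\nA\t1,234,567\t-\nB\t89,000\t12.5\n"

df = pd.read_csv(
    io.StringIO(raw),
    sep="\t",          # tab as the field separator
    na_values=["-"],   # treat "-" as NaN in addition to the defaults
    thousands=",",     # strip "," inside numbers before converting them
)

# A regular expression as separator: one or more whitespace characters.
raw_ws = "x   y\n1   2\n3     4\n"
df_ws = pd.read_csv(io.StringIO(raw_ws), sep=r"\s+")

print(df)
print(df_ws)
```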
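Index handling:
index_col: use an actual column of the file (given by position or by name) as the index
header: row number to use for the column names
names: the list of column names to use
squeeze: when only one column is read, return a pandas.Series instead of a pandas.DataFrame (deprecated in pandas 1.4 and removed in 2.0; call .squeeze("columns") on the result instead)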
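A small sketch of the index-related parameters above, assuming a hypothetical file without a header row. Because the squeeze keyword is gone in recent pandas versions, the single-column case calls DataFrame.squeeze("columns") on the result instead.

```python
import io
import pandas as pd

# Hypothetical file content with no header row.
raw = "101,1,3\n102,2,1\n103,1,2\n"
cols = ["user_id", "sex", "education"]

# header=None: the first line is data; names supplies the column names;
# index_col turns the user_id column into the index.
df = pd.read_csv(io.StringIO(raw), header=None, names=cols, index_col="user_id")

# Keep a single data column and turn the one-column DataFrame into a Series.
s = pd.read_csv(
    io.StringIO(raw),
    header=None,
    names=cols,
    usecols=["user_id", "sex"],
    index_col="user_id",
).squeeze("columns")

print(df)
print(s)
```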
2. Saving data to CSV with to_csv.
import pandas as pd
import numpy as np
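# The file has no header row, so column names are supplied through names.
# The value 4 is treated as missing; the first 3 rows are skipped and 100 rows are read.
df = pd.read_csv('E:/BaiduNetdiskDownload/数据分析与Pandas/07_数据/data/user_info_train.txt',
                 delimiter=',', encoding='gb18030', header=None,
                 names=['user_id', 'sex', 'occupation', 'education', 'marriage_status', 'account_type'],
                 index_col='user_id', na_values=4, skiprows=3, nrows=100)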
print(df.head())
out:
         sex  occupation  education  marriage_status  account_type
user_id
6360     NaN         2.0        NaN              3.0           2.0
2583     2.0         2.0        2.0              1.0           1.0
34764    1.0         2.0        3.0              3.0           1.0
9554     1.0         2.0        NaN              2.0           2.0
6720     1.0         2.0        3.0              3.0           2.0
print(df.info())
out:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 6360 to 1025
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   sex              99 non-null     float64
 1   occupation       88 non-null     float64
 2   education        55 non-null     float64
 3   marriage_status  98 non-null     float64
 4   account_type     80 non-null     float64
dtypes: float64(5)
memory usage: 4.7 KB
None
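# Default settings: the header row and the index column are both written.
df.to_csv('E:/BaiduNetdiskDownload/数据分析与Pandas/07_数据/data/liuyan.csv')
# No header and no index; NaN is written as 0 and floats with two decimals.
df.to_csv('E:/BaiduNetdiskDownload/数据分析与Pandas/07_数据/data/liuyan3.csv',
          na_rep='0', header=False, index=False, float_format='%.2f')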
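To see what these to_csv options do without touching the disk, here is a minimal sketch that writes a small, made-up frame to an in-memory buffer. The column names mirror the ones above, but the values are illustrative only.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical two-row frame with one missing value.
df_demo = pd.DataFrame(
    {"sex": [1.0, np.nan], "education": [3.0, 2.5]},
    index=pd.Index([6360, 2583], name="user_id"),
)

# Default settings: to_csv returns a string when no path is given.
default_text = df_demo.to_csv()

# No header, no index, NaN written as "0", floats with two decimals.
buf = io.StringIO()
df_demo.to_csv(buf, na_rep="0", header=False, index=False, float_format="%.2f")
custom_lines = buf.getvalue().splitlines()

print(default_text)
print(custom_lines)
```

With index=False the user_id values are not written at all, so keep the index if the file has to be matched back to users later.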
3. Reading and writing Excel files.
import pandas as pd
import numpy as np
# from scipy import stats
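file = pd.ExcelFile('c:/Users/liuyan/Desktop/DEM.xls')
# parse() takes the name of the worksheet to load.
df = file.parse('Sheet1')
# Select the first 8 columns and the first 2 columns by position.
df1 = df.iloc[:, :8]
df2 = df.iloc[:, :2]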
print(df1)
out:
Index Compounds ... NA-5 NA-6
0 MEDN0003 Glycine ... 4692.4 3061.7
1 MEDN0005 L-Threonine ... 898160.0 855820.0
2 MEDN0006 L-Tyrosine ... 490300.0 508090.0
3 MEDN0007 L-Arginine ... 31803000.0 29369000.0
4 MEDN0009 L-Aspartic Acid ... 14992000.0 15800000.0
.. ... ... ... ... ...
615 MEDN1295 15-methyl palmitic acid ... 121740.0 128540.0
616 MEDP0308 Dodecanedioic Aicd ... 38549.0 33471.0
617 MEDP0429 Punicic Acid ... 986520.0 858280.0
618 MEDP0585 Stearidonic Acid ... 483200.0 492600.0
619 MEDP1458 Docosaenoic acid ... 71967.0 123320.0
[620 rows x 8 columns]
print(df2)
out:
Index Compounds
0 MEDN0003 Glycine
1 MEDN0005 L-Threonine
2 MEDN0006 L-Tyrosine
3 MEDN0007 L-Arginine
4 MEDN0009 L-Aspartic Acid
.. ... ...
615 MEDN1295 15-methyl palmitic acid
616 MEDP0308 Dodecanedioic Aicd
617 MEDP0429 Punicic Acid
618 MEDP0585 Stearidonic Acid
619 MEDP1458 Docosaenoic acid
[620 rows x 2 columns]
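# Writing .xls relies on the xlwt engine, which was removed in pandas 2.0; on newer versions write to an .xlsx path instead.
df1.to_excel('c:/Users/liuyan/Desktop/test.xls')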
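The positional column slicing used in this section can be tried on a small frame that needs no Excel file. The sketch below reuses a few values from the output shown earlier; the frame itself is built by hand for illustration.

```python
import pandas as pd

# Hand-built frame with four columns, using values from the printed output.
demo = pd.DataFrame(
    {
        "Index": ["MEDN0003", "MEDN0005"],
        "Compounds": ["Glycine", "L-Threonine"],
        "NA-5": [4692.4, 898160.0],
        "NA-6": [3061.7, 855820.0],
    }
)

# iloc[:, :2] keeps all rows and the first two columns.
first_two = demo.iloc[:, :2]

# A slice that runs past the last column does not raise; it returns what exists.
first_eight = demo.iloc[:, :8]

print(first_two)
print(first_eight.shape)
```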