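1. Reading and writing
Read a txt file with a given format (comma-separated, no header row):
train = pd.read_table('/home/hadoop/jzzz/train/subsidy_train.txt',sep=',',header=None) # subsidy data; header=None means the file has no header row (header=-1 is rejected by current pandas)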
train.columns = ['id','money']
Read and write csv: college.to_csv('/home/hadoop/jzzz/input/college1.csv',index=True)
college = pd.read_csv('/home/hadoop/jzzz/input/college1.csv')
college.columns = ['college','num']
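The read/write steps above can be sketched without touching the disk. This is a minimal sketch only: the in-memory buffers and the sample rows are hypothetical stand-ins for the real files, and the reading of index=True (keeping the college names that live in the index) is an assumption about why the notes use it.

```python
import io
import pandas as pd

# Hypothetical stand-in for subsidy_train.txt: comma-separated, no header row.
raw = io.StringIO("1,0\n2,1000\n3,1500\n")

# header=None tells pandas that the first line is data, not column names.
train = pd.read_csv(raw, sep=",", header=None)
train.columns = ["id", "money"]

# A table whose labels live in the index (as after a groupby); sample values are made up.
counts = pd.DataFrame({"num": [2, 1]}, index=["math", "physics"])

# index=True writes the index as the first CSV column, so it survives the round trip.
buf = io.StringIO()
counts.to_csv(buf, index=True)
buf.seek(0)

# On reading back, the former index is an unnamed first column; rename both columns.
college = pd.read_csv(buf)
college.columns = ["college", "num"]
```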
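2. Computations over rows that share the same value in a column:
Maximum: college = pd.DataFrame(score_train_test.groupby(['college'])['score'].max()) # for rows with the same college, take the largest score
Count occurrences: college = pd.DataFrame(score_train_test['college'].value_counts())
Count each student's total number of purchases: card = pd.DataFrame(card_train_test.groupby(['id'])['consume'].count())
Mean: ~.mean() (where ~ stands for the same groupby expression as above)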
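The per-group computations above can be tried on a small table. This is a sketch under assumptions: the scores frame and its values are invented for illustration, while the column names follow the notes.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the notes.
scores = pd.DataFrame({
    "college": ["A", "A", "B", "B", "B"],
    "score": [80, 95, 70, 60, 88],
})

# Largest score within each college.
max_score = scores.groupby("college")["score"].max()

# Mean score within each college.
mean_score = scores.groupby("college")["score"].mean()

# Number of non-null scores within each college.
n_scores = scores.groupby("college")["score"].count()

# How often each college value occurs, most frequent first.
occurrences = scores["college"].value_counts()
```

Note that count() skips null values in the counted column, whereas value_counts() tallies the values of the column itself, so the two can differ when the counted column has gaps.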
3. Merging tables: score_train_test = pd.merge(score_train_test, college, how='left',on='college') # merge score_train_test with college using a left outer join; the join key is the college column
Append rows at the end: card_train_test = pd.concat([card_train,card_test])
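A small sketch of the two operations above, assuming made-up sample frames. The ignore_index=True argument is an addition not present in the notes; it is used here so that the stacked rows get a fresh 0..n-1 index.

```python
import pandas as pd

# Hypothetical sample tables.
scores = pd.DataFrame({"id": [1, 2, 3], "college": ["A", "B", "C"]})
college = pd.DataFrame({"college": ["A", "B"], "num": [120, 80]})

# Left outer join: every row of scores is kept; college "C" has no match, so num is NaN.
merged = pd.merge(scores, college, how="left", on="college")

card_train = pd.DataFrame({"id": [1, 2], "consume": [3.5, 8.0]})
card_test = pd.DataFrame({"id": [3], "consume": [5.0]})

# Stack the rows of card_test under card_train.
# Without ignore_index=True the original row labels (0, 1, 0) would be kept.
stacked = pd.concat([card_train, card_test], ignore_index=True)
```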
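4. Select the rows that meet a condition: train_shitang=card_train.loc[card_train.how == '食堂'] # '食堂' is the data value for "canteen"
Extract a column's values as an array: ids = test['id'].values
Keep rows where a field is not null: train = train_test[train_test['money'].notnull()]  For null rows: ~.isnull()
Replace NaN with -1: train = train.fillna(-1)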
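The selection and fill steps above can be sketched on one small frame. The sample rows are hypothetical; the column names how, id and money and the value '食堂' (canteen) follow the notes.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; '食堂' means canteen, '超市' means supermarket.
card = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "how": ["食堂", "超市", "食堂", "超市"],
    "money": [1000.0, np.nan, 1500.0, np.nan],
})

# Boolean mask inside .loc keeps only the canteen rows.
canteen = card.loc[card["how"] == "食堂"]

# .values returns the column as a NumPy array.
ids = card["id"].values

# Split rows by whether money is present.
labelled = card[card["money"].notnull()]
unlabelled = card[card["money"].isnull()]

# fillna returns a new frame with every NaN replaced by -1.
filled = card.fillna(-1)
```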
5. Build the list of feature columns: predictors = [x for x in train.columns if x not in [target]] # collect every column name x of train that is not the target column into a new list, predictors
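A short sketch of how such a list is typically used, assuming target holds the name of the label column. The frame, the consume_count column and the names X and y are hypothetical.

```python
import pandas as pd

# Hypothetical training table; 'money' plays the role of the label.
train = pd.DataFrame({
    "id": [1, 2],
    "consume_count": [10, 4],
    "money": [1000, 0],
})
target = "money"

# Every column name except the target.
predictors = [x for x in train.columns if x not in [target]]

# Feature matrix and label vector for a model.
X = train[predictors]
y = train[target]
```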
6. Remove duplicate rows: train_shitang = train_shitang.drop_duplicates()
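By default drop_duplicates compares whole rows and keeps the first occurrence. The sketch below uses invented data and also shows the subset parameter, which the notes do not use, for deduplicating on selected columns only.

```python
import pandas as pd

# Hypothetical sample data: row 1 repeats row 0 exactly.
df = pd.DataFrame({"id": [1, 1, 2, 2], "how": ["a", "a", "a", "b"]})

# Drop rows that are identical in every column.
unique_rows = df.drop_duplicates()

# Drop rows that repeat an earlier id, whatever the other columns hold.
unique_ids = df.drop_duplicates(subset=["id"])
```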
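7. Using range in a Python loop: for i in range(...):
Three forms (in Python 3, range is lazy; wrap it in list() to see the values):
1: range(10) yields 0,1,2,3,4,5,6,7,8,9
2: range(1,9) yields 1,2,3,4,5,6,7,8
3: range(1,9,2) yields 1,3,5,7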
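The three forms can be checked directly. This is a minimal sketch for Python 3; the variable names are only for illustration.

```python
# range(stop): starts at 0 and stops before stop.
first = list(range(10))

# range(start, stop): starts at start and stops before stop.
second = list(range(1, 9))

# range(start, stop, step): advances by step each time.
third = list(range(1, 9, 2))

# Typical loop use: sum the values produced by the third form.
total = 0
for i in range(1, 9, 2):
    total += i
```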