Kaggle Official Tutorials: Function Cheat Sheet

Kaggle Study Notes

Data Processing for Machine Learning

Missing Values in Numerical Data

# 1. Drop rows whose target (SalePrice) is missing, then separate the target from the features
X_full.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X_full.SalePrice
X_full.drop(['SalePrice'], axis=1, inplace=True)

# 2. Keep only numerical columns
X = X_full.select_dtypes(exclude=['object'])
X_test = X_test_full.select_dtypes(exclude=['object'])

# Count the missing values in each column
missing_val_count_by_column = (X_train.isnull().sum())
print(missing_val_count_by_column[missing_val_count_by_column > 0])

# Drop columns that contain missing values
cols_with_missing = [col for col in X_train.columns if X_train[col].isnull().any()]
reduced_X_train = X_train.drop(cols_with_missing, axis = 1)

# Impute missing values (SimpleImputer fills with the column mean by default)
from sklearn.impute import SimpleImputer
imput = SimpleImputer()
imputed_X_train = pd.DataFrame(imput.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(imput.transform(X_valid))
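
One detail worth noting about the imputation step: wrapping the imputer output in a new DataFrame discards the original column names and index. The following is a minimal sketch on made-up data (the column names and values are assumptions for illustration only) showing one way to keep them.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frames standing in for X_train / X_valid
X_train = pd.DataFrame({"LotArea": [8450.0, np.nan, 11250.0],
                        "YearBuilt": [2003.0, 1976.0, np.nan]})
X_valid = pd.DataFrame({"LotArea": [np.nan], "YearBuilt": [2000.0]})

imputer = SimpleImputer()  # default strategy="mean"

# Fit on the training data only, then reuse the learned means on validation data.
# Passing columns= and index= restores the labels that fit_transform drops.
imputed_X_train = pd.DataFrame(imputer.fit_transform(X_train),
                               columns=X_train.columns, index=X_train.index)
imputed_X_valid = pd.DataFrame(imputer.transform(X_valid),
                               columns=X_valid.columns, index=X_valid.index)

print(imputed_X_train)
print(imputed_X_valid)
```

The validation row is filled with the mean learned from the training set, not from the validation set itself.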

Categorical Data

object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]
# Columns whose categories are identical in the training and validation sets, so they can be label-encoded safely
good_label_cols = [col for col in object_cols if set(X_train[col]) == set(X_valid[col])]
bad_label_cols = list(set(object_cols)-set(good_label_cols))

# Show the number of unique values in each categorical column
object_nunique = list(map(lambda col: X_train[col].nunique(), object_cols))
d= dict(zip(object_cols, object_nunique))
sorted(d.items(), key = lambda x:x[1])

# Columns that will be one-hot encoded
low_cardinality_cols = [col for col in object_cols if X_train[col].nunique() < 10]

# Columns that will be dropped from the dataset
high_cardinality_cols = list(set(object_cols)-set(low_cardinality_cols))


# Select categorical columns with relatively low cardinality
categorical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and X_train_full[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]
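
The low-cardinality columns selected above are the ones meant for one-hot encoding. As a rough sketch of what that encoding produces, here is a small example on made-up data (the `Street` values and the `categorical_candidates` list are assumptions, not part of the original notes). It also shows why `handle_unknown="ignore"` matters when the validation set contains a category never seen in training.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; "Dirt" appears only in the validation set
X_train = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"],
                        "LotArea": [8450, 9600, 11250]})
X_valid = pd.DataFrame({"Street": ["Pave", "Dirt"],
                        "LotArea": [9550, 14260]})

# Assumed list of categorical columns for this toy example
categorical_candidates = ["Street"]
low_cardinality_cols = [c for c in categorical_candidates
                        if X_train[c].nunique() < 10]

# handle_unknown="ignore" encodes unseen categories as an all-zero row
# instead of raising an error at transform time
encoder = OneHotEncoder(handle_unknown="ignore")
oh_train = encoder.fit_transform(X_train[low_cardinality_cols]).toarray()
oh_valid = encoder.transform(X_valid[low_cardinality_cols]).toarray()

print(encoder.categories_)  # categories learned from the training set
print(oh_valid)             # the "Dirt" row contains only zeros
```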

Python Pipeline

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# Define model
model = RandomForestRegressor(n_estimators=100, random_state=0)

# Bundle preprocessing and modeling code in a pipeline
clf = Pipeline(steps=[('preprocessor',preprocessor),('model', model)])

# Preprocessing of training data, fit model 
clf.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = clf.predict(X_valid)
print('MAE:', mean_absolute_error(y_valid, preds))

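# Cross-validation (clf is the pipeline defined above; scikit-learn returns negative MAE, hence the -1)
from sklearn.model_selection import cross_val_score
scores = -1 * cross_val_score(clf, X, y, cv=5, scoring='neg_mean_absolute_error')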
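
To make the cross-validation call concrete, here is a self-contained sketch on synthetic data. The data, the column names, and the reduced pipeline (imputer plus random forest only) are assumptions chosen to keep the example short; they are not the housing data used in the notes.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic regression data (hypothetical), with a few missing values in column "c"
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
y = 2.0 * X["a"] - X["b"] + rng.normal(scale=0.1, size=100)
X.iloc[::10, 2] = np.nan

pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer()),
    ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
])

# Each fold refits the whole pipeline, so the imputer never sees the held-out fold.
# Multiply by -1 because the scorer returns negative MAE.
scores = -1 * cross_val_score(pipeline, X, y, cv=5,
                              scoring="neg_mean_absolute_error")

print("MAE per fold:", scores)
print("Average MAE:", scores.mean())
```

Passing the pipeline rather than a bare model is what keeps the preprocessing inside each fold.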

Feature Engineering

Timestamp Data

# With timestamped data, select the training data in chronological order
clicks = click_data.copy()
clicks['day'] = clicks['click_time'].dt.day.astype('uint8')
# Extract the remaining time components
clicks['hour'] = clicks['click_time'].dt.hour.astype('uint8')
clicks['minute'] = clicks['click_time'].dt.minute.astype('uint8')
clicks['second'] = clicks['click_time'].dt.second.astype('uint8')
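
The note about respecting time order can be sketched as a chronological split: sort by timestamp, train on the earliest rows, and hold out the latest rows for validation and testing. The helper name `chronological_split`, its parameters, and the toy data below are assumptions for illustration.

```python
import pandas as pd

def chronological_split(df, time_col="click_time", valid_fraction=0.1):
    """Split df into train/valid/test by time (hypothetical helper).

    The last `valid_fraction` of rows becomes the test set, the
    `valid_fraction` before it the validation set, and the rest training.
    Assumes df has enough rows for the fraction to yield at least one row.
    """
    df = df.sort_values(time_col)
    n_valid = int(len(df) * valid_fraction)
    train = df.iloc[:-2 * n_valid]
    valid = df.iloc[-2 * n_valid:-n_valid]
    test = df.iloc[-n_valid:]
    return train, valid, test

# Toy click log with deliberately unsorted timestamps
start = pd.Timestamp("2017-11-06")
clicks = pd.DataFrame({
    "click_time": start + pd.to_timedelta(list(range(19, -1, -1)), unit="h"),
    "is_attributed": [0, 1] * 10,
})

train, valid, test = chronological_split(clicks)
print(len(train), len(valid), len(test))
```

Every training row is earlier than every validation row, which in turn is earlier than every test row, so the model is never evaluated on data that precedes what it was trained on.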

Label (Categorical) Data

from sklearn import preprocessing

cat_features = ['ip', 'app', 'device', 'os', 'channel']
label_encoder = preprocessing.LabelEncoder()
# Create new columns in clicks using preprocessing.LabelEncoder()
for feature in cat_features:
    encoded= label_encoder.fit_transform(clicks[feature])
    clicks[feature + "_labels"]  = encoded
   
# Categorical encoding API: category_encoders
import category_encoders as ce
cat_features = ['category', 'currency', 'country']

# Create the encoder
count_enc = ce.CountEncoder()

# Transform the features, rename the columns with the _count suffix, and join to dataframe
count_encoded = count_enc.fit_transform(ks[cat_features])
data = data.join(count_encoded.add_suffix("_count"))

# Create the encoder
target_enc = ce.TargetEncoder(cols=cat_features)
target_enc.fit(train[cat_features], train['outcome'])

# Transform the features, rename the columns with _target suffix, and join to dataframe
train_TE = train.join(target_enc.transform(train[cat_features]).add_suffix('_target'))
valid_TE = valid.join(target_enc.transform(valid[cat_features]).add_suffix('_target'))
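
To show what count encoding and target encoding compute, here is a pandas-only sketch on made-up data. It is a simplified illustration, not the `category_encoders` implementation: `ce.TargetEncoder` additionally smooths the per-category mean toward the global mean, which this sketch omits. The fallback to the global mean for unseen categories is an assumption made here.

```python
import pandas as pd

# Hypothetical toy data
train = pd.DataFrame({
    "country": ["US", "US", "GB", "US", "GB", "CA"],
    "outcome": [1, 0, 1, 1, 0, 1],
})
valid = pd.DataFrame({"country": ["US", "GB", "FR"]})  # "FR" is unseen

# Count encoding: replace each category with how often it occurs in training
counts = train["country"].value_counts()
train["country_count"] = train["country"].map(counts)

# Target encoding (unsmoothed): replace each category with the mean target
# observed for it in the training set only, to avoid leaking validation labels
means = train.groupby("country")["outcome"].mean()
global_mean = train["outcome"].mean()
valid["country_target"] = valid["country"].map(means).fillna(global_mean)

print(train)
print(valid)
```

Learning the statistics from the training split alone mirrors the `fit` on `train` followed by `transform` on both splits used above.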

CatBoost encoding: performs better than LightGBM

# remove IP from the encoded features
cat_features = ['app', 'device', 'os', 'channel']

train, valid, test = get_data_splits(clicks)

# Create the CatBoost encoder
cb_enc = ce.CatBoostEncoder(cols = cat_features)

# Learn encoding from the training set
cb_enc.fit(train[cat_features], train['is_attributed'])

# Apply encoding to the train and validation sets as new columns
# Make sure to add `_cb` as a suffix to the new columns
train_encoded = train.join(cb_enc.transform(train[cat_features]).add_suffix("_cb"))
valid_encoded = valid.join(cb_enc.transform(valid[cat_features]).add_suffix("_cb"))

Data Augmentation

# For categorical features that are strongly correlated

import itertools

cat_features = ['ip', 'app', 'device', 'os', 'channel']
interactions = pd.DataFrame(index=clicks.index)
for col1, col2 in itertools.combinations(cat_features, 2):
    col_new_name = '_'.join([col1, col2])
    new_value = clicks[col1].map(str) + "_" + clicks[col2].map(str)
    encoder = preprocessing.LabelEncoder()
    interactions[col_new_name] = encoder.fit_transform(new_value)


def count_past_events(series, time_window='6H'):
    series = pd.Series(series.index, index = series)
    past_events = series.rolling(time_window).count()-1
    return past_events
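
A small usage sketch may help clarify what `count_past_events` returns. The timestamps are made up, and the window is written as `'360min'` instead of `'6H'` on the assumption that this spelling is accepted across pandas versions (newer releases deprecate the uppercase `H` alias). The input must be sorted by time, because time-based rolling windows require a monotonic index.

```python
import pandas as pd

def count_past_events(series, time_window="360min"):
    # Re-index by timestamp so that a time-based rolling window can be used
    by_time = pd.Series(series.index, index=series)
    # Each window includes the current event, so subtract 1 to count only earlier ones
    return by_time.rolling(time_window).count() - 1

# Hypothetical click timestamps, already sorted
click_times = pd.Series(pd.to_datetime([
    "2017-11-06 00:00",
    "2017-11-06 01:00",
    "2017-11-06 05:30",
    "2017-11-06 08:00",
    "2017-11-06 20:00",
]))

past = count_past_events(click_times)
print(past)
```

For the 08:00 click, only the 05:30 click falls inside the preceding six hours, so its count is 1; the 20:00 click has no predecessors in its window.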

# Open question (unverified): seconds since the previous click from the same IP
def time_diff(series):
    """Return a series with the time since the previous timestamp, in seconds."""
    return series.diff().dt.total_seconds()

timedeltas = clicks.groupby('ip')['click_time'].transform(time_diff)
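
Since the snippet above is marked as uncertain, a quick check on toy data can help confirm what it computes. This is a sketch with made-up IPs and timestamps; it assumes the rows are sorted by `click_time` before grouping.

```python
import pandas as pd

def time_diff(series):
    """Return a series with the time since the previous timestamp, in seconds."""
    return series.diff().dt.total_seconds()

# Hypothetical click log for two IPs
clicks = pd.DataFrame({
    "ip": [1, 2, 1, 2, 1],
    "click_time": pd.to_datetime([
        "2017-11-06 00:00:00",
        "2017-11-06 00:00:05",
        "2017-11-06 00:00:10",
        "2017-11-06 00:00:20",
        "2017-11-06 00:01:00",
    ]),
}).sort_values("click_time")

# transform applies time_diff within each IP group and keeps the original row order
timedeltas = clicks.groupby("ip")["click_time"].transform(time_diff)
print(timedeltas)
```

The first click of each IP has no predecessor, so its value is NaN; the remaining values are the gaps in seconds to that IP's previous click.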