Practice Project: Click-Through Rate Prediction with Logistic Regression

System environment:

  • OS: Windows 8.1 64-bit
  • CPU: Intel i7-4200HQ 3.60GHz
  • RAM: 8GB
  • GPU: GeForce GTX 970M (CUDA 10.1)


Library environment:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib
import sklearn


print('pandas version:',pd.__version__)
print('matplotlib version:',matplotlib.__version__)
print('seaborn version:',sns.__version__)
print('sklearn version:',sklearn.__version__)
pandas version: 0.23.4
matplotlib version: 2.2.3
seaborn version: 0.9.0
sklearn version: 0.22.2.post1

Reading the Data

  • Data download: kaggle_Click-Through Rate Prediction
  • The full dataset is too large to load and process on a laptop with limited memory, so dump_data.py is used to subsample it: roughly 10,000 rows for training and 10,000 for testing (the sampled files used below contain 8,450 training rows and 155 test rows). A sketch of such a sampling script is given right after this list.
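dump_data.py itself is not shown in the post; below is a minimal sketch of chunked random sampling with pandas, assuming the raw Kaggle file train.csv sits in the working directory (file name, fraction, and seed are illustrative assumptions):

import pandas as pd

# read the full Kaggle CSV in chunks and keep a small random subsample,
# so the whole file never has to fit in memory at once
def sample_csv(src, dst, frac=0.00025, chunksize=1_000_000, seed=42):
    parts = [chunk.sample(frac=frac, random_state=seed)
             for chunk in pd.read_csv(src, chunksize=chunksize)]
    pd.concat(parts).to_csv(dst, index=False)

sample_csv('train.csv', './dataset/ctr/train_sample_ctr_8450.csv')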

File descriptions:

  • train - Training set. 10 days of click-through data, ordered chronologically. Non-clicks and clicks are subsampled according to different strategies.
  • test - Testing set. 1 day of ads for testing your model predictions.

Data fields:

  • id: ad identifier
  • click: 0/1 for non-click/click
  • hour: format is YYMMDDHH, so 14091123 means 23:00 on Sept. 11, 2014 UTC (see the decoding example after this list)
  • C1 – anonymized categorical variable
  • banner_pos
  • site_id
  • site_domain
  • site_category
  • app_id
  • app_domain
  • app_category
  • device_id
  • device_ip
  • device_model
  • device_type
  • device_conn_type
  • C14-C21 – anonymized categorical variables
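The packed hour field can be decoded with pandas if needed; a small illustrative example (not used later in the post):

import pandas as pd

# decode YYMMDDHH, e.g. 14091123 -> 2014-09-11 23:00 UTC
ts = pd.to_datetime(pd.Series(['14091123']), format='%y%m%d%H')
print(ts[0])           # 2014-09-11 23:00:00
print(14091123 % 100)  # hour of day: 23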
data_train = pd.read_csv(r'./dataset/ctr/train_sample_ctr_8450.csv')
data_test = pd.read_csv(r'./dataset/ctr/test_sample_ctr_155.csv')
data_train.info()
data_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8450 entries, 0 to 8449
Data columns (total 24 columns):
id                  8450 non-null uint64
click               8450 non-null int64
hour                8450 non-null int64
C1                  8450 non-null int64
banner_pos          8450 non-null int64
site_id             8450 non-null object
site_domain         8450 non-null object
site_category       8450 non-null object
app_id              8450 non-null object
app_domain          8450 non-null object
app_category        8450 non-null object
device_id           8450 non-null object
device_ip           8450 non-null object
device_model        8450 non-null object
device_type         8450 non-null int64
device_conn_type    8450 non-null int64
C14                 8450 non-null int64
C15                 8450 non-null int64
C16                 8450 non-null int64
C17                 8450 non-null int64
C18                 8450 non-null int64
C19                 8450 non-null int64
C20                 8450 non-null int64
C21                 8450 non-null int64
dtypes: int64(14), object(9), uint64(1)
memory usage: 1.5+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 155 entries, 0 to 154
Data columns (total 23 columns):
id                  155 non-null uint64
hour                155 non-null int64
C1                  155 non-null int64
banner_pos          155 non-null int64
site_id             155 non-null object
site_domain         155 non-null object
site_category       155 non-null object
app_id              155 non-null object
app_domain          155 non-null object
app_category        155 non-null object
device_id           155 non-null object
device_ip           155 non-null object
device_model        155 non-null object
device_type         155 non-null int64
device_conn_type    155 non-null int64
C14                 155 non-null int64
C15                 155 non-null int64
C16                 155 non-null int64
C17                 155 non-null int64
C18                 155 non-null int64
C19                 155 non-null int64
C20                 155 non-null int64
C21                 155 non-null int64
dtypes: int64(13), object(9), uint64(1)
memory usage: 27.9+ KB
data_train.describe()
(transposed for readability)

| column | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| id | 8450 | 9.179895e+18 | 5.358954e+18 | 1.275883e+15 | 4.536772e+18 | 9.077211e+18 | 1.377866e+19 | 1.844664e+19 |
| click | 8450 | 0.177751 | 0.382326 | 0 | 0 | 0 | 0 | 1 |
| hour | 8450 | 1.410255e+07 | 2.981426e+02 | 1.410210e+07 | 1.410230e+07 | 1.410252e+07 | 1.410281e+07 | 1.410302e+07 |
| C1 | 8450 | 1004.959763 | 1.085978 | 1001 | 1005 | 1005 | 1005 | 1012 |
| banner_pos | 8450 | 0.287219 | 0.510966 | 0 | 0 | 0 | 1 | 7 |
| device_type | 8450 | 1.010178 | 0.514568 | 0 | 1 | 1 | 1 | 5 |
| device_conn_type | 8450 | 0.346982 | 0.869565 | 0 | 0 | 0 | 0 | 5 |
| C14 | 8450 | 18833.676805 | 5012.525716 | 375 | 16920 | 20352 | 21894 | 24041 |
| C15 | 8450 | 319.027219 | 23.672810 | 216 | 320 | 320 | 320 | 768 |
| C16 | 8450 | 61.371124 | 52.690929 | 36 | 50 | 50 | 50 | 1024 |
| C17 | 8450 | 2110.164734 | 615.540273 | 112 | 1863 | 2325 | 2526 | 2756 |
| C18 | 8450 | 1.446154 | 1.327442 | 0 | 0 | 2 | 3 | 3 |
| C19 | 8450 | 222.525680 | 343.077724 | 33 | 35 | 39 | 171 | 1839 |
| C20 | 8450 | 52899.882604 | 49979.530350 | -1 | -1 | 100035.5 | 100103 | 100248 |
| C21 | 8450 | 83.929231 | 70.765039 | 13 | 23 | 61 | 110 | 253 |
data_test.describe()
(transposed for readability)

| column | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| id | 155 | 8.928224e+18 | 5.552391e+18 | 7.968314e+16 | 4.190929e+18 | 9.154648e+18 | 1.394021e+19 | 1.835151e+19 |
| hour | 155 | 1.410311e+07 | 5.682679e+00 | 1.410310e+07 | 1.410311e+07 | 1.410311e+07 | 1.410312e+07 | 1.410312e+07 |
| C1 | 155 | 1005.051613 | 0.938315 | 1002 | 1005 | 1005 | 1005 | 1010 |
| banner_pos | 155 | 0.187097 | 0.391253 | 0 | 0 | 0 | 0 | 1 |
| device_type | 155 | 1.051613 | 0.507010 | 0 | 1 | 1 | 1 | 4 |
| device_conn_type | 155 | 0.380645 | 0.913613 | 0 | 0 | 0 | 0 | 3 |
| C14 | 155 | 21478.096774 | 4764.198700 | 4687 | 22104 | 23141 | 24095.5 | 24320 |
| C15 | 155 | 318.941935 | 8.769219 | 216 | 320 | 320 | 320 | 320 |
| C16 | 155 | 53.780645 | 27.678645 | 36 | 50 | 50 | 50 | 250 |
| C17 | 155 | 2437.064516 | 607.895654 | 423 | 2545 | 2664 | 2761 | 2790 |
| C18 | 155 | 1.309677 | 1.331768 | 0 | 0 | 1 | 3 | 3 |
| C19 | 155 | 153.864516 | 215.958632 | 33 | 35 | 39 | 175 | 1327 |
| C20 | 155 | 56847.980645 | 49765.040950 | -1 | -1 | 100077 | 100148 | 100233 |
| C21 | 155 | 92.683871 | 87.115519 | 13 | 23 | 51 | 221 | 251 |

From the output above, the data has no missing values, and every column is either int or object; the object columns need to be vectorized/one-hot encoded before training. The id column has an enormous variance (it is essentially a unique identifier), so based on experience it can simply be dropped.

Data Preprocessing

  • Drop the id feature
data_train = data_train.drop('id',axis=1)
data_test = data_test.drop('id',axis=1)
  • Inspect the object-type features and vectorize/discretize them
df_object = data_train.filter(regex='site_id|site_domain|site_category|app_id|app_domain|app_category|device_id|device_ip|device_model')
for column in df_object.columns:
    # distinct values and cardinality of each object column
    category = df_object[column].unique()
    num = len(category)
    print('{}: {} \n num: {}'.format(column, category, num))
(output: each object column's distinct values followed by its count; the cardinalities are large, consistent with the 11,919 one-hot columns produced below)
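As an aside, pandas offers a one-liner for the same cardinality check (not used in the original post):

# number of distinct values in each object column
print(df_object.nunique())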

Feature Discretization/Factorization

site_id = pd.get_dummies(data_train['site_id'], prefix= 'site_id')
site_domain = pd.get_dummies(data_train['site_domain'], prefix= 'site_domain')
site_category = pd.get_dummies(data_train['site_category'], prefix= 'site_category')
app_id = pd.get_dummies(data_train['app_id'], prefix= 'app_id')
app_domain = pd.get_dummies(data_train['app_domain'], prefix= 'app_domain')
app_category = pd.get_dummies(data_train['app_category'], prefix= 'app_category')
device_id = pd.get_dummies(data_train['device_id'], prefix= 'device_id')
device_ip = pd.get_dummies(data_train['device_ip'], prefix= 'device_ip')
device_model = pd.get_dummies(data_train['device_model'], prefix= 'device_model')

df = pd.concat([site_id, site_domain, site_category, 
                app_id, app_domain, app_category, 
                device_id, device_ip, device_model], axis=1)
df.columns
Index(['site_id_021cd138', 'site_id_02296256', 'site_id_023f3644',
       'site_id_0273c5ad', 'site_id_02d5151c', 'site_id_030440fe',
       'site_id_05c65e53', 'site_id_06a0ac14', 'site_id_070ca277',
       'site_id_079325ff',
       ...
       'device_model_feacaaee', 'device_model_feb70d53',
       'device_model_ff065cf0', 'device_model_ff0f1aca',
       'device_model_ff16d623', 'device_model_ff2a3543',
       'device_model_ff607a1a', 'device_model_ff91ea03',
       'device_model_ffc70ef9', 'device_model_ffe69079'],
      dtype='object', length=11919)
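One caveat the post does not address: pd.get_dummies fitted on the training set alone will not necessarily produce the same columns for data_test. A minimal sketch of aligning the two, shown here for site_id only (reindex is standard pandas; the alignment step itself is an addition, not part of the original post):

# one-hot encode a test column, then align it to the training dummy columns;
# categories unseen in training are dropped, missing ones are filled with 0
site_id_test = pd.get_dummies(data_test['site_id'], prefix='site_id')
site_id_test = site_id_test.reindex(columns=site_id.columns, fill_value=0)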

Training

Training on a Small Feature Subset

  • Use only the int-type columns for training
def train(train_np):
    from sklearn import linear_model
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    import time
    # y is the click label (first column)
    y = train_np[:, 0]
    # X holds the feature values
    X = train_np[:, 1:]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

    clf = linear_model.LogisticRegression(C=1, penalty="l2")
    t1 = time.time()
    clf.fit(X_train, y_train)
    t2 = time.time()
    y_pred = clf.predict(X_test)
#     print('accuracy:{:.3f}%, time:{:.3f}s'.format(accuracy_score(y_test, y_pred)*100, (t2-t1)))
    return accuracy_score(y_test, y_pred), (t2-t1)
def train_cv(train_np):
    import numpy as np
    from sklearn import linear_model
    from sklearn.model_selection import cross_val_score
    import time
    # y is the click label (first column)
    y = train_np[:, 0]
    # X holds the feature values
    X = train_np[:, 1:]

    clf = linear_model.LogisticRegression(C=1, penalty="l2")
    t1 = time.time()
    scores = cross_val_score(clf, X, y, cv=5)
    t2 = time.time()
    # mean CV accuracy and average training time per fold
    return np.mean(scores), (t2-t1)/5
import time
import matplotlib.pyplot as plt 
import seaborn as sns
plt.figure(figsize=(16,9))
sns.heatmap(data_train[["click","hour","banner_pos","C1","C14","C15","C16",
                        "C17","C18","C19","C20","C21","device_type","device_conn_type"]].corr(),
            annot=True, fmt = ".2f", cmap = "coolwarm") 
plt.show()

(figure: correlation heatmap of the numeric features and the click label)

The heatmap shows that no single numeric feature has a strong linear relationship with the label. Next, check how training on these features performs.

few_feature_df = data_train.filter(regex='click|hour|C.*|banner_pos|device_type|device_conn_type')
print('num of features:', len(few_feature_df.columns))
few_feature_np = few_feature_df.values
acc, t = train_cv(few_feature_np)
print('accuracy:{:.3f}%, time:{:.3f}s'.format(acc*100, t))
num of features: 14
accuracy:82.225%, time:0.035s

Training on All Features

  • Use the numeric features together with the one-hot encoded features (11,932 features plus the click label, 11,933 columns in total)
all_feature_df = pd.concat([few_feature_df, df], axis=1)
print('num of features:', len(all_feature_df.columns))
all_feature_np = all_feature_df.values
acc, t = train_cv(all_feature_np)
print('accuracy:{:.3f}%, time:{:.3f}s'.format(acc*100, t))
num of features: 11933
accuracy:82.225%, time:7.074s

Training on the small feature subset and on all features yields exactly the same accuracy, while the runtime differs by a factor of roughly 200 (0.035 s vs 7.074 s per fold). Note also that 82.225% is precisely the share of non-clicks (1 - 0.177751), so by accuracy the model does no better than always predicting "no click". Feature selection is therefore worth trying.
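A quick sanity check (not in the original post) confirms where that accuracy comes from:

# the majority-class baseline equals the accuracies reported above
print('non-click rate: {:.3f}%'.format((1 - data_train['click'].mean()) * 100))
# -> 82.225%, i.e. always predicting "no click" already scores 82.225%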

Selecting a Subset of Features

  • site_domain, site_category
df_1 = pd.concat([data_train, site_domain, site_category], axis=1)
df_1.drop(['site_domain', 'site_category'], axis=1, inplace=True)
train_df = df_1.filter(regex='click|site_domain_.*|hour|site_category_.*|C.*|banner_pos')
train_np = train_df.values
acc, t = train_cv(train_np)
print('accuracy:{:.3f}%, time:{:.3f}s'.format(acc*100, t))
accuracy:82.225%, time:1.448s

The result still barely changes. Since the earlier inspection showed large scale differences between some features, the next step is to standardize the numeric features and check the effect.

Standardization

  • StandardScaler() → zero mean, unit variance
  • MinMaxScaler() → [0, 1]
  • MaxAbsScaler() → [-1, 1]
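To make the three ranges concrete, here is a quick illustrative check (not part of the original post):

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler

x = np.array([[-4.0], [0.0], [2.0], [10.0]])
print(MinMaxScaler().fit_transform(x).ravel())    # approx [0, 0.286, 0.429, 1] -> in [0, 1]
print(StandardScaler().fit_transform(x).ravel())  # zero mean, unit variance
print(MaxAbsScaler().fit_transform(x).ravel())    # [-0.4, 0, 0.2, 1] -> in [-1, 1]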
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler
import warnings
warnings.filterwarnings('ignore')

scalers = [MinMaxScaler(),
           StandardScaler(),
           MaxAbsScaler()]
names = ['MinMaxScaler', 'StandardScaler', 'MaxAbsScaler']
for name, scaler in zip(names, scalers):
    print(name, ':')
    temp = scaler.fit_transform(df_1.filter(regex='hour|C.*|banner_pos|device_type|device_conn_type'))
    # column names listed in the same order as the filtered columns
    temp = pd.DataFrame(temp,
                        columns=['hour_s','C1_s','banner_pos_s','device_type_s','device_conn_type_s',
                                 'C14_s','C15_s','C16_s','C17_s','C18_s','C19_s','C20_s','C21_s'])
    df_1_s = pd.concat([data_train['click'], temp, site_domain, site_category], axis=1)
    acc, t = train_cv(df_1_s.values)
    print('accuracy:{:.3f}%, time:{:.3f}s'.format(acc*100, t))
MinMaxScaler :
accuracy:82.178%, time:2.935s
StandardScaler :
accuracy:82.142%, time:2.540s
MaxAbsScaler :
accuracy:82.166%, time:2.866s

Judging from these results, MinMaxScaler() works best for this problem, presumably because the remaining (one-hot) features already live in [0, 1], so scaling the numeric features to the same range unifies the scales.

Next, try adding more features:
Since device_ip values are mostly distinct, IP-like features would hurt the model when used directly, so they are excluded and the remaining features are used.

scaler = MinMaxScaler()
temp = scaler.fit_transform(df_1.filter(regex='hour|C.*|banner_pos|device_type|device_conn_type'))
train_data_scaled = pd.DataFrame(temp,
                    columns=['hour_s','C1_s','banner_pos_s','device_type_s','device_conn_type_s',
                             'C14_s','C15_s','C16_s','C17_s','C18_s','C19_s','C20_s','C21_s'])
selected_feature_df = pd.concat([data_train['click'], train_data_scaled, site_domain, site_category,
                                 app_domain, app_category, device_model], axis=1)
acc, t = train_cv(selected_feature_df.values)
print('accuracy:{:.3f}%, time:{:.3f}s'.format(acc*100, t))
accuracy:82.154%, time:7.976s
  • Next, try removing features: drop the domain-type features
scaler = MinMaxScaler()
temp = scaler.fit_transform(df_1.filter(regex='hour|C.*|banner_pos|device_type|device_conn_type'))
train_data_scaled = pd.DataFrame(temp,
                    columns=['hour_s','C1_s','banner_pos_s','device_type_s','device_conn_type_s',
                             'C14_s','C15_s','C16_s','C17_s','C18_s','C19_s','C20_s','C21_s'])
selected_feature_df = pd.concat([data_train['click'], train_data_scaled, site_category,
                                 app_category, device_model], axis=1)
acc, t = train_cv(selected_feature_df.values)
print('accuracy:{:.3f}%, time:{:.3f}s'.format(acc*100, t))
accuracy:82.059%, time:1.255s
  • Use the site-type features plus the other features, excluding the app and device features
scaler = MinMaxScaler()
temp = scaler.fit_transform(df_1.filter(regex='hour|C.*|banner_pos'))
train_data_scaled = pd.DataFrame(temp,
                    columns=['hour_s','C1_s','banner_pos_s','C14_s','C15_s','C16_s',
                             'C17_s','C18_s','C19_s','C20_s','C21_s'])
selected_feature_df = pd.concat([data_train['click'], train_data_scaled, site_id, site_domain, site_category], axis=1)
acc, t = train_cv(selected_feature_df.values)
print('accuracy:{:.3f}%, time:{:.3f}s'.format(acc*100, t))
accuracy:82.130%, time:0.973s
  • Use the app-type features plus the other features, excluding the site and device features
scaler = MinMaxScaler()
temp = scaler.fit_transform(df_1.filter(regex='hour|C.*|banner_pos'))
train_data_scaled = pd.DataFrame(temp,
                    columns=['hour_s','C1_s','banner_pos_s','C14_s','C15_s','C16_s',
                             'C17_s','C18_s','C19_s','C20_s','C21_s'])
selected_feature_df = pd.concat([data_train['click'], train_data_scaled, app_id, app_domain, app_category], axis=1)
acc, t = train_cv(selected_feature_df.values)
print('accuracy:{:.3f}%, time:{:.3f}s'.format(acc*100, t))
accuracy:82.178%, time:0.546s
  • Use the device-type features plus the other features, excluding the app and site features
scaler = MinMaxScaler()
temp = scaler.fit_transform(df_1.filter(regex='hour|C.*|banner_pos|device_type|device_conn_type'))
train_data_scaled = pd.DataFrame(temp,
                    columns=['hour_s','C1_s','banner_pos_s','device_type_s','device_conn_type_s',
                             'C14_s','C15_s','C16_s','C17_s','C18_s','C19_s','C20_s','C21_s'])
selected_feature_df = pd.concat([data_train['click'], train_data_scaled, device_id, device_ip, device_model], axis=1)
acc, t = train_cv(selected_feature_df.values)
print('accuracy:{:.3f}%, time:{:.3f}s'.format(acc*100, t))
accuracy:82.012%, time:8.948s

Feature selection brings no meaningful gain in accuracy, so the next step is to tune the logistic regression hyperparameters.

Hyperparameter Tuning

from sklearn.model_selection import GridSearchCV
from sklearn import linear_model

# set data
scaler = MinMaxScaler()
temp = scaler.fit_transform(df_1.filter(regex='hour|C.*|banner_pos|device_type|device_conn_type'))
train_data_scaled = pd.DataFrame(temp,
                    columns=['hour_s','C1_s','banner_pos_s','device_type_s','device_conn_type_s',
                             'C14_s','C15_s','C16_s','C17_s','C18_s','C19_s','C20_s','C21_s'])
selected_feature_df = pd.concat([data_train['click'], train_data_scaled, site_domain, site_category,
                                 app_domain, app_category, device_model], axis=1)
X = selected_feature_df.values[:, 1:]
y = selected_feature_df.values[:, 0]
# set up the L1 grid search (only liblinear and saga support the l1 penalty)
clf = linear_model.LogisticRegression()
param_LR = {'C': [0.01, 0.1, 0.5, 1, 5, 20, 50],
            'penalty': ['l1'],
            'solver': ['liblinear', 'saga'],
            'class_weight': ['balanced', None]}
clf_gscv_l1 = GridSearchCV(clf, param_grid=param_LR, cv=5, scoring="accuracy", n_jobs=8, verbose=1)
clf_gscv_l1.fit(X, y)
Fitting 5 folds for each of 28 candidates, totalling 140 fits
[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:   43.2s
[Parallel(n_jobs=8)]: Done 140 out of 140 | elapsed:  6.2min finished
print(clf_gscv_l1.best_estimator_)
print(clf_gscv_l1.best_score_)
LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l1',
                   random_state=None, solver='saga', tol=0.0001, verbose=0,
                   warm_start=False)
0.8223668639053254
clf = linear_model.LogisticRegression()
param_LR = {'C': [0.01, 0.1, 0.5, 1, 5, 20, 50],
            'penalty': ['l2'],
            'solver': ['newton-cg', 'lbfgs', 'liblinear'],
            'class_weight': ['balanced', None]}
clf_gscv_l2 = GridSearchCV(clf, param_grid=param_LR, cv=5, scoring="accuracy", n_jobs=8, verbose=1)
clf_gscv_l2.fit(X, y)
Fitting 5 folds for each of 42 candidates, totalling 210 fits
[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:   27.4s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:  4.8min
[Parallel(n_jobs=8)]: Done 210 out of 210 | elapsed:  5.9min finished
print(clf_gscv_l2.best_estimator_)
print(clf_gscv_l2.best_score_)
LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='newton-cg', tol=0.0001, verbose=0,
                   warm_start=False)
0.8222485207100592

Since the L1 and L2 penalties are each supported only by certain solvers, they are tuned in two separate grid searches. Because the labels are imbalanced, class_weight='balanced' was included in the search; enabling it lowers the cross-validated accuracy, which is why the best parameters end up with class_weight=None. Note, however, that the best scores (about 0.8224 and 0.8222) essentially equal the majority-class rate, so by accuracy alone the tuned model is still no better than always predicting "no click". To keep the model from collapsing onto the majority class, class_weight='balanced' is still worth keeping, along with a probability-based metric; see the sketch below.
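A minimal sketch of evaluating with probability/ranking metrics instead of accuracy, reusing the X and y defined above (the C, penalty, and solver values come from the L2 search; the metric choice and class_weight='balanced' are assumptions added here):

from sklearn.model_selection import cross_val_score
from sklearn import linear_model

# score the tuned L2 model with ROC AUC and log loss rather than accuracy,
# which are informative even when the classes are imbalanced
clf = linear_model.LogisticRegression(C=0.01, penalty='l2', solver='newton-cg',
                                      class_weight='balanced')
auc = cross_val_score(clf, X, y, cv=5, scoring='roc_auc')
logloss = cross_val_score(clf, X, y, cv=5, scoring='neg_log_loss')
print('ROC AUC: {:.3f}, log loss: {:.3f}'.format(auc.mean(), -logloss.mean()))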
