Kaggle: Home Depot Product Search Relevance Prediction

# Home Depot Product Relevance Prediction

Kaggle competition: https://www.kaggle.com/c/home-depot-product-search-relevance

Home Depot is a US home-improvement retailer. Users type keywords into the site's search box and get back related products and services; for example, searching "floor" returns flooring in different materials, floor-cleaning products, floor-installation services, and so on. The goal of the competition is to design a model that better matches users' search terms and returns more relevant products and services.

## Imports
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor
df_train = pd.read_csv('train.csv',encoding='ISO-8859-1')
df_test = pd.read_csv('test.csv',encoding='ISO-8859-1')
# Besides train and test, there is also a product-description file
df_desc = pd.read_csv('product_descriptions.csv')
# Take a look at each dataset
df_train.head(3)
| | id | product_uid | product_title | search_term | relevance |
|---|---|---|---|---|---|
| 0 | 2 | 100001 | Simpson Strong-Tie 12-Gauge Angle | angle bracket | 3.0 |
| 1 | 3 | 100001 | Simpson Strong-Tie 12-Gauge Angle | l bracket | 2.5 |
| 2 | 9 | 100002 | BEHR Premium Textured DeckOver 1-gal. #SC-141 … | deck over | 3.0 |
df_test.head(3)
| | id | product_uid | product_title | search_term |
|---|---|---|---|---|
| 0 | 1 | 100001 | Simpson Strong-Tie 12-Gauge Angle | 90 degree bracket |
| 1 | 4 | 100001 | Simpson Strong-Tie 12-Gauge Angle | metal l brackets |
| 2 | 5 | 100001 | Simpson Strong-Tie 12-Gauge Angle | simpson sku able |
df_desc.head(3)
| | product_uid | product_description |
|---|---|---|
| 0 | 100001 | Not only do angles make joints stronger, they … |
| 1 | 100002 | BEHR Premium Textured DECKOVER is an innovativ… |
| 2 | 100003 | Classic architecture meets contemporary design… |

`relevance` in train is the target we must predict on test. It ranges from 1 to 3, with 3 the most relevant and 1 the least. `search_term` is the query, so each row gives the relevance of one product for one query. `product_descriptions` holds the description text for each product id.

We concatenate train and test so they can be processed together; the description data shares `product_uid` with both, so it can be merged in as well.

df_all = pd.concat((df_train, df_test), axis=0, ignore_index=True)
# Neither table's index carries meaning, so we ignore them; axis=0 stacks the rows.
# (pandas warns about sorting the non-concatenation axis here because test lacks
# the relevance column; the warning can be silenced with the sort argument.)
df_all.head(3)
| | id | product_title | product_uid | relevance | search_term |
|---|---|---|---|---|---|
| 0 | 2 | Simpson Strong-Tie 12-Gauge Angle | 100001 | 3.0 | angle bracket |
| 1 | 3 | Simpson Strong-Tie 12-Gauge Angle | 100001 | 2.5 | l bracket |
| 2 | 9 | BEHR Premium Textured DeckOver 1-gal. #SC-141 … | 100002 | 3.0 | deck over |
df_all.shape
(240760, 5)
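As a quick sanity check (a minimal sketch; 74,067 train rows plus 166,693 test rows should give exactly the 240,760 rows shown above):

# The concatenation should preserve every row from both frames
assert len(df_all) == len(df_train) + len(df_test)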
df_all = df_all.merge(df_desc,on='product_uid',how='left')
df_all.head(3)
| | id | product_title | product_uid | relevance | search_term | product_description |
|---|---|---|---|---|---|---|
| 0 | 2 | Simpson Strong-Tie 12-Gauge Angle | 100001 | 3.0 | angle bracket | Not only do angles make joints stronger, they … |
| 1 | 3 | Simpson Strong-Tie 12-Gauge Angle | 100001 | 2.5 | l bracket | Not only do angles make joints stronger, they … |
| 2 | 9 | BEHR Premium Textured DeckOver 1-gal. #SC-141 … | 100002 | 3.0 | deck over | BEHR Premium Textured DECKOVER is an innovativ… |
## Text Preprocessing
from nltk.stem.snowball import SnowballStemmer
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
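Note: the NLTK resources used below (the stop-word list and the Punkt tokenizer) are separate downloads; if they are missing, fetch them first. This is standard NLTK usage, not anything specific to this notebook:

import nltk

# One-time downloads of the NLTK data used in this section
nltk.download('stopwords')  # stop-word lists for nltk.corpus.stopwords
nltk.download('punkt')      # Punkt models for sentence/word tokenization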
### Stemming

Because Home Depot's task is search matching, textual consistency matters: we stem every text feature so each word has a single surface form, ensuring a search term matches the documents regardless of inflection.
# Stop-word list to filter against
stop = stopwords.words('english')

# Detect tokens that contain digits
import re
def hasnumber(input_str):
    return bool(re.search(r'\d', input_str))

# Put the two filters together: keep a token only if it passes both
def check(string):
    if string in stop:
        return False
    elif hasnumber(string):
        return False
    else:
        return True
# Clean up the text content
stemmer = SnowballStemmer('english')
# Stem each token that survives the filter
def text_stemmer(s):
    return ' '.join([stemmer.stem(word) for word in s.lower().split() if check(word)])
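A quick illustrative call (the input string is made up; exact stems depend on the Snowball rules, but the digit token and the stop words are dropped and the rest are stemmed):

text_stemmer('Installing 2 Wooden Floors quickly')
# -> roughly 'instal wooden floor quick' (stems shown are illustrative)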
# Apply the cleaning to every text column
df_all['search_term'] = df_all['search_term'].map(text_stemmer)

df_all['product_title'] = df_all['product_title'].map(text_stemmer)

df_all['product_description'] = df_all['product_description'].map(text_stemmer)
df_all.head()
| | id | product_title | product_uid | relevance | search_term | product_description |
|---|---|---|---|---|---|---|
| 0 | 2 | simpson strong-ti angl | 100001 | 3.00 | angl bracket | angl make joint stronger, also provid consiste… |
| 1 | 3 | simpson strong-ti angl | 100001 | 2.50 | l bracket | angl make joint stronger, also provid consiste… |
| 2 | 9 | behr premium textur deckov tugboat wood concre… | 100002 | 3.00 | deck | behr premium textur deckov innov solid color c… |
| 3 | 16 | delta vero shower faucet trim kit chrome (valv… | 100005 | 2.33 | rain shower head | updat bathroom delta vero single-handl shower … |
| 4 | 17 | delta vero shower faucet trim kit chrome (valv… | 100005 | 2.67 | shower faucet | updat bathroom delta vero single-handl shower … |
### Processing the training data

# Split df_all back apart; the split point is the number of *train* rows,
# since train came first in the concat. .copy() avoids pandas'
# SettingWithCopyWarning when we add columns below.
train = df_all[:df_train.shape[0]].copy()
test = df_all[df_train.shape[0]:].copy()
# Build the full word collection: title plus description in one text field
train['all_text'] = train['product_title'] + ' . ' + train['product_description'] + ' . '
test['all_text'] = test['product_title'] + ' . ' + test['product_description'] + ' . '
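Because a wrong split point here would silently corrupt everything downstream, a cheap sanity check (a sketch using the frames defined above) is worthwhile:

# train keeps every labeled row; test keeps every unlabeled one
assert train['relevance'].notna().all()
assert test['relevance'].isna().all()
assert len(train) == len(df_train) and len(test) == len(df_test)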
train['all_text'][0:5]
0    simpson strong-ti angl . angl make joint stron…
1    simpson strong-ti angl . angl make joint stron…
2    behr premium textur deckov tugboat wood concre…
3    delta vero shower faucet trim kit chrome (valv…
4    delta vero shower faucet trim kit chrome (valv…
Name: all_text, dtype: object

### Generating the corpus

We generate a corpus from `all_text` in train: first, `tokenize` splits each document into individual words; then `gensim.corpora.Dictionary` associates a unique ID with every word in the corpus, defining the complete vocabulary we will work with.
from gensim.utils import tokenize
# We use gensim's tokenize here, so later processing of test must match
from gensim.corpora.dictionary import Dictionary
dictionary = Dictionary(list(tokenize(x, errors='ignore')) for x in train['all_text'].values)
print(dictionary)
Dictionary(136703 unique tokens: ['alonehelp', 'also', 'angl', 'bent', 'coat']…)

We get a training vocabulary of 136,703 unique tokens. Next we convert every document into word counts. Since the corpus is large, materializing it as a plain list would be memory-hungry, so we write a class whose iterator yields one document's counts at a time.
class corpus:
    def __iter__(self):
        for x in train['all_text'].values:
            yield dictionary.doc2bow(list(tokenize(x, errors='ignore')))
# dictionary.doc2bow converts a document into its bag-of-words vector
train_corpus = corpus()
count=0
for c in train_corpus:
    print(c)
    count+=1
    if count >2:
        break
[(0, 1), (1, 1), (2, 4), (3, 1), (4, 1), (5, 1), (6, 2), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 2), (13, 1), (14, 1), (15, 1), (16, 1), (17, 2), (18, 1), (19, 1), (20, 1), (21, 4), (22, 1), (23, 2), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 2), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 3), (38, 1), (39, 2), (40, 1), (41, 1), (42, 1), (43, 2), (44, 1), (45, 2), (46, 1), (47, 1), (48, 1), (49, 2), (50, 3), (51, 1), (52, 1), (53, 1), (54, 1), (55, 2), (56, 1), (57, 1), (58, 2), (59, 1), (60, 1), (61, 3), (62, 1), (63, 1), (64, 1)]
(the second document is identical to the first, since rows 0 and 1 share the same product; the third, longer document is omitted here)

As shown above, each document becomes a list of tuples: the first element of each tuple is the word's ID in the dictionary, and the second is how many times that word appears in the document.

### The TF-IDF model

Simply put, the TF-IDF model maps the bag-of-words vectors into another vector space, one in which word counts are weighted by each word's relative rarity in the corpus:

- TF (term frequency): TF(x) = (occurrences of x in the document) / (total words in the document)
- IDF (inverse document frequency): IDF(x) = log(N / N(x)), where N is the number of documents in the corpus and N(x) the number of documents containing x
- TF-IDF(x) = TF(x) × IDF(x)
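A tiny worked example of these formulas (a from-scratch sketch with made-up documents; gensim's TfidfModel applies slightly different smoothing and normalization, so its numbers will not match exactly):

import math

# Toy documents, made up for illustration
docs = [['angl', 'bracket', 'angl'],
        ['deck', 'paint'],
        ['deck', 'bracket']]

def tf_idf(word, doc, docs):
    tf = doc.count(word) / len(doc)              # term frequency in this document
    n_containing = sum(word in d for d in docs)  # N(x): documents containing the word
    idf = math.log(len(docs) / n_containing)     # log(N / N(x))
    return tf * idf

print(tf_idf('angl', docs[0], docs))  # frequent here, rare overall -> high weight
print(tf_idf('deck', docs[2], docs))  # appears in 2 of 3 documents -> lower weight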
from gensim.models.tfidfmodel import TfidfModel
tfidf_g = TfidfModel(train_corpus)
# Save the model
tfidf_g.save("./gensim_tfidf.tfidf")
Once trained, we can score an ordinary sentence:
tfidf_g[dictionary.doc2bow(list(tokenize('morning yellow flower', errors='ignore')))]
[(1056, 0.44640344500226231), (1332, 0.40452528743632266), (34490, 0.79817495332456578)]

In each returned tuple, the first element is the word ID and the second its TF-IDF weight.
# Wrap this up as a helper
def to_tfidf(text):
    res = tfidf_g[dictionary.doc2bow(list(tokenize(text, errors='ignore')))]
    return res
### Cosine similarity

The closer the cosine is to 1, the closer the angle is to 0 degrees, and the more similar the two vectors are.
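Concretely, for two vectors $a$ and $b$ this is the standard definition:

$$\cos\theta = \frac{a \cdot b}{\lVert a\rVert\,\lVert b\rVert} = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\,\sqrt{\sum_i b_i^2}}$$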
from gensim.similarities import MatrixSimilarity
def cos_sim(text1,text2):
    tf1 = to_tfidf(text1)
    tf2 = to_tfidf(text2)
    index = MatrixSimilarity([tf1],num_features=len(dictionary))
    sim = index[tf2]
    return float(sim[0])
We convert both texts into TF-IDF vectors with the trained model, build a `MatrixSimilarity` index from one of them (expanded to the full dictionary size via `num_features`), and query it with the other to obtain their cosine similarity.
train['tfidf_cos_sim_in_title'] = train.apply(lambda x: cos_sim(x['search_term'], x['product_title']), axis=1)
test['tfidf_cos_sim_in_title'] = test.apply(lambda x: cos_sim(x['search_term'], x['product_title']), axis=1)
train['tfidf_cos_sim_in_desc'] = train.apply(lambda x: cos_sim(x['search_term'], x['product_description']), axis=1)
test['tfidf_cos_sim_in_desc'] = test.apply(lambda x: cos_sim(x['search_term'], x['product_description']), axis=1)
train.head(2)
| | id | product_title | product_uid | relevance | search_term | product_description | all_text | tfidf_cos_sim_in_title | tfidf_cos_sim_in_desc |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | simpson strong-ti angl | 100001 | 3.0 | angl bracket | angl make joint stronger, also provid consiste… | simpson strong-ti angl . angl make joint stron… | 0.287958 | 0.188301 |
| 1 | 3 | simpson strong-ti angl | 100001 | 2.5 | l bracket | angl make joint stronger, also provid consiste… | simpson strong-ti angl . angl make joint stron… | 0.000000 | 0.000000 |
test.head(2)
| | id | product_title | product_uid | relevance | search_term | product_description | all_text | tfidf_cos_sim_in_title | tfidf_cos_sim_in_desc |
|---|---|---|---|---|---|---|---|---|---|
| 166693 | 138080 | winix freshom model true hepa air cleaner plas… | 149579 | NaN | winix air purifi | winix freshom true hepa air cleaner plasmawav … | winix freshom model true hepa air cleaner plas… | 0.413249 | 0.146256 |
| 166694 | 138082 | ge rv outlet box amp volt ring type meter amp … | 149580 | NaN | gcfi outlet | ring-typ meter surfac mount factory-assembl fa… | ge rv outlet box amp volt ring type meter amp … | 0.530571 | 0.022509 |

We have now added two features: via the TF-IDF model we turn the texts into vectors and use the cosine between vectors as a similarity measure, scoring the search term against the product title and against the product description.

### Word2Vec model

#### Sentence tokenization
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
# Split a passage into individual sentences
tokenizer.tokenize(train['all_text'].values[0])
['simpson strong-ti angl .', 'angl make joint stronger, also provid consistent, straight corners.', 'simpson strong-ti offer wide varieti angl various size thick handl light-duti job project structur connect needed.', 'bent (skewed) match project.', 'outdoor project moistur present, use zmax zinc-coat connectors, provid extra resist corros (look "z" end model number).versatil connector various connect home repair projectsstrong angl nail screw fasten alonehelp ensur joint consist straight strongdimensions: in.', 'x in.', 'x in.mad steelgalvan extra corros resistanceinstal common nail x in.', 'strong-driv sd screw .']
# Apply to every train document
sentences = [tokenizer.tokenize(x) for x in train['all_text'].values]
type(sentences)  # a list of per-document sentence lists
list
sentences[:2]
[['simpson strong-ti angl .', 'angl make joint stronger, also provid consistent, straight corners.', …], ['simpson strong-ti angl .', 'angl make joint stronger, also provid consistent, straight corners.', …]]
(the two inner lists are identical because rows 0 and 1 describe the same product; output truncated)

After splitting, we still have one list of sentences per document, but what we want is a single flat list of all sentences, i.e. the nested list needs to be flattened. A Stack Overflow recipe for this:
sentences = [y for x in sentences for y in x]
This is equivalent to:

flattened = []
for sub in sentences:
    for val in sub:
        flattened.append(val)

but the comprehension runs faster and avoids the explicit append calls.
len(sentences)
606641
# Split each sentence into words
words = [word_tokenize(x) for x in sentences]
#### Training Word2Vec
from gensim.models.word2vec import Word2Vec

w2c = Word2Vec(words, size=128, window=5, min_count=5, workers=4)
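Once trained, the model can be inspected interactively, e.g. by querying nearest neighbors in the embedding space (a hypothetical query; which neighbors come back depends on the training run):

# Nearest neighbors of 'door' in the learned 128-dimensional space
print(w2c.wv.most_similar('door', topn=3))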
At this point every word in the corpus has a Word2Vec vector representation:
w2c.wv['door'].shape
(128,)

Now that each word has a vector, we can represent a whole sentence by the average of its words' vectors.
vocab = w2c.wv.vocab
# Note: from gensim 1.0.0 onward, the vocabulary dict lives at model.wv.vocab;
# older versions exposed it as model.vocab.
type(vocab)
dict
# Get the vector for an arbitrary text
def get_vector(text):
    # Start from a zero vector of size 128, then average the word vectors
    res = np.zeros([128])
    count = 0
    for word in word_tokenize(text):
        if word in vocab:
            res += w2c.wv[word]
            count += 1
    # If no token is in the vocabulary, return the zero vector rather than divide by 0
    if count == 0:
        return res
    return res / count
print(get_vector('this is a door'))
[-0.00261087 -0.56179226  0.60765644 -0.64292271 -0.56054996  0.08848376 … -0.16318383  0.17944744 -0.9727306 ]
(128 values; output truncated)

### Computing similarity
from scipy import spatial
def w2c_cos_sim(text1, text2):
    try:
        w1 = get_vector(text1)
        w2 = get_vector(text2)
        sim = 1 - spatial.distance.cosine(w1, w2)
        # An all-zero vector makes the cosine undefined (NaN); treat it as 0 similarity
        if not np.isfinite(sim):
            return 0.0
        return float(sim)
    except Exception:
        return float(0)
# spatial.distance.cosine returns the cosine *distance*, defined as 1 - cos,
# so similarity = 1 - distance
w2c_cos_sim('hello world','hello from the other side')
0.07032644504070107
train['w2v_cos_sim_in_title'] = train.apply(lambda x: w2c_cos_sim(x['search_term'], x['product_title']), axis=1)
train['w2v_cos_sim_in_desc'] = train.apply(lambda x: w2c_cos_sim(x['search_term'], x['product_description']), axis=1)

test['w2v_cos_sim_in_title'] = test.apply(lambda x: w2c_cos_sim(x['search_term'], x['product_title']), axis=1)
test['w2v_cos_sim_in_desc'] = test.apply(lambda x: w2c_cos_sim(x['search_term'], x['product_description']), axis=1)
train.head(2)
| | id | product_title | product_uid | relevance | search_term | product_description | all_text | tfidf_cos_sim_in_title | tfidf_cos_sim_in_desc | w2v_cos_sim_in_title | w2v_cos_sim_in_desc |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | simpson strong-ti angl | 100001 | 3.0 | angl bracket | angl make joint stronger, also provid consiste… | simpson strong-ti angl . angl make joint stron… | 0.287958 | 0.188301 | 0.531925 | 0.530175 |
| 1 | 3 | simpson strong-ti angl | 100001 | 2.5 | l bracket | angl make joint stronger, also provid consiste… | simpson strong-ti angl . angl make joint stron… | 0.000000 | 0.000000 | 0.279708 | 0.303249 |
test.head(2)
| | id | product_title | product_uid | relevance | search_term | product_description | all_text | tfidf_cos_sim_in_title | tfidf_cos_sim_in_desc | w2v_cos_sim_in_title | w2v_cos_sim_in_desc |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 166693 | 138080 | winix freshom model true hepa air cleaner plas… | 149579 | NaN | winix air purifi | winix freshom true hepa air cleaner plasmawav … | winix freshom model true hepa air cleaner plas… | 0.413249 | 0.146256 | 0.694583 | 0.591168 |
| 166694 | 138082 | ge rv outlet box amp volt ring type meter amp … | 149580 | NaN | gcfi outlet | ring-typ meter surfac mount factory-assembl fa… | ge rv outlet box amp volt ring type meter amp … | 0.530571 | 0.022509 | 0.676237 | 0.369881 |
# Drop the text columns
train = train.drop(['search_term','product_title','product_description','all_text'],axis=1)
test = test.drop(['search_term','product_title','product_description','all_text'],axis=1)
# Keep the test ids and extract the training target
ids = test['id']
y_train = train['relevance'].values

X_train = train.drop(['id','relevance'],axis=1).values
X_test = test.drop(['id','relevance'],axis=1).values

## Model

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
params = [1,3,5,6,7,8,9,10]
test_scores = []
for param in params:
    clf = RandomForestRegressor(n_estimators=30, max_depth=param)
    test_score = np.sqrt(-cross_val_score(clf, X_train, y_train, cv=10, scoring='neg_mean_squared_error'))
    test_scores.append(np.mean(test_score))
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(params, test_scores)
plt.title("Param vs CV Error")
(Figure: cross-validated RMSE vs. max_depth; the error bottoms out around max_depth = 6.)
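As an aside, the same sweep can be written with scikit-learn's built-in `GridSearchCV` (a sketch equivalent to the manual loop above, using the same estimator and scoring):

from sklearn.model_selection import GridSearchCV

# Sweep max_depth with 10-fold CV, same as the manual loop
grid = GridSearchCV(RandomForestRegressor(n_estimators=30),
                    param_grid={'max_depth': [1, 3, 5, 6, 7, 8, 9, 10]},
                    cv=10, scoring='neg_mean_squared_error')
grid.fit(X_train, y_train)
print(grid.best_params_, np.sqrt(-grid.best_score_))  # best depth and its CV RMSE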

As the plot shows, max_depth = 6 gives the best result, with an RMSE of about 0.49. So far we have added four features and used a random forest as the model. As next steps, one could engineer new features (for example, a simple flag for whether the search term is contained in the title), try other models such as LR, and then ensemble them.
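The walkthrough stops at cross-validation; for completeness, here is a minimal sketch of training the chosen configuration and writing a submission file (the `id`/`relevance` column names follow the competition's sample submission; clipping predictions to [1, 3] is an extra precaution of mine, since relevance is bounded):

# Fit the best configuration on all training data and predict the test set
rf = RandomForestRegressor(n_estimators=30, max_depth=6)
rf.fit(X_train, y_train)
y_pred = np.clip(rf.predict(X_test), 1, 3)

pd.DataFrame({'id': ids, 'relevance': y_pred}).to_csv('submission.csv', index=False)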


I'm still a beginner here; corrections and advice are welcome!
