Kaggle: Home Depot Product Search Relevance Prediction

# Home Depot Product Relevance Prediction

Kaggle competition: https://www.kaggle.com/c/home-depot-product-search-relevance

Home Depot is a US home-improvement retailer. Users type keywords into the site's search box and get back related products and services; for example, searching "floor" returns flooring in different materials, floor-cleaning products, floor-installation services, and so on. The goal of the competition is to design a model that better matches users' search terms and returns more relevant products and services.

## Imports
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor
df_train = pd.read_csv('train.csv',encoding='ISO-8859-1')
df_test = pd.read_csv('test.csv',encoding='ISO-8859-1')
# Besides train and test, there is also a product-description file
df_desc = pd.read_csv('product_descriptions.csv')
# Take a look at each dataset
df_train.head(3)
| | id | product_uid | product_title | search_term | relevance |
|---|---|---|---|---|---|
| 0 | 2 | 100001 | Simpson Strong-Tie 12-Gauge Angle | angle bracket | 3.0 |
| 1 | 3 | 100001 | Simpson Strong-Tie 12-Gauge Angle | l bracket | 2.5 |
| 2 | 9 | 100002 | BEHR Premium Textured DeckOver 1-gal. #SC-141 … | deck over | 3.0 |
df_test.head(3)
| | id | product_uid | product_title | search_term |
|---|---|---|---|---|
| 0 | 1 | 100001 | Simpson Strong-Tie 12-Gauge Angle | 90 degree bracket |
| 1 | 4 | 100001 | Simpson Strong-Tie 12-Gauge Angle | metal l brackets |
| 2 | 5 | 100001 | Simpson Strong-Tie 12-Gauge Angle | simpson sku able |
df_desc.head(3)
| | product_uid | product_description |
|---|---|---|
| 0 | 100001 | Not only do angles make joints stronger, they … |
| 1 | 100002 | BEHR Premium Textured DECKOVER is an innovativ… |
| 2 | 100003 | Classic architecture meets contemporary design… |

`relevance` in train is the target we must predict on test. It ranges from 1 to 3, with 3 the most relevant and 1 the least. `search_term` is the query, so each row gives the relevance of one product for one query. `product_descriptions` holds the description text for each product id.

We concatenate train and test so they can be processed together; the description data shares `product_uid` with both, so it can be merged in as well.

df_all = pd.concat((df_train, df_test), axis=0, ignore_index=True)
# Neither table's index carries meaning, so we ignore them; axis=0 stacks the rows.
# (pandas warns about sorting the non-concatenation axis here because test lacks
# the relevance column; the warning can be silenced with the sort argument.)
df_all.head(3)
| | id | product_title | product_uid | relevance | search_term |
|---|---|---|---|---|---|
| 0 | 2 | Simpson Strong-Tie 12-Gauge Angle | 100001 | 3.0 | angle bracket |
| 1 | 3 | Simpson Strong-Tie 12-Gauge Angle | 100001 | 2.5 | l bracket |
| 2 | 9 | BEHR Premium Textured DeckOver 1-gal. #SC-141 … | 100002 | 3.0 | deck over |
df_all.shape
(240760, 5)
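As a quick sanity check (a minimal sketch; 74,067 train rows plus 166,693 test rows should give exactly the 240,760 rows shown above):

# The concatenation should preserve every row from both frames
assert len(df_all) == len(df_train) + len(df_test)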
df_all = df_all.merge(df_desc,on='product_uid',how='left')
df_all.head(3)
| | id | product_title | product_uid | relevance | search_term | product_description |
|---|---|---|---|---|---|---|
| 0 | 2 | Simpson Strong-Tie 12-Gauge Angle | 100001 | 3.0 | angle bracket | Not only do angles make joints stronger, they … |
| 1 | 3 | Simpson Strong-Tie 12-Gauge Angle | 100001 | 2.5 | l bracket | Not only do angles make joints stronger, they … |
| 2 | 9 | BEHR Premium Textured DeckOver 1-gal. #SC-141 … | 100002 | 3.0 | deck over | BEHR Premium Textured DECKOVER is an innovativ… |
## Text Preprocessing
from nltk.stem.snowball import SnowballStemmer
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
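Note: the NLTK resources used below (the stop-word list and the Punkt tokenizer) are separate downloads; if they are missing, fetch them first. This is standard NLTK usage, not anything specific to this notebook:

import nltk

# One-time downloads of the NLTK data used in this section
nltk.download('stopwords')  # stop-word lists for nltk.corpus.stopwords
nltk.download('punkt')      # Punkt models for sentence/word tokenization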
### Stemming

Because Home Depot's task is search matching, textual consistency matters: we stem every text feature so each word has a single surface form, ensuring a search term matches the documents regardless of inflection.
# Stop-word list to filter against
stop = stopwords.words('english')

# Detect tokens that contain digits
import re
def hasnumber(input_str):
    return bool(re.search(r'\d', input_str))

# Put the two filters together: keep a token only if it passes both
def check(string):
    if string in stop:
        return False
    elif hasnumber(string):
        return False
    else:
        return True
# Clean up the text content
stemmer = SnowballStemmer('english')
# Stem each token that survives the filter
def text_stemmer(s):
    return ' '.join([stemmer.stem(word) for word in s.lower().split() if check(word)])
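A quick illustrative call (the input string is made up; exact stems depend on the Snowball rules, but the digit token and the stop words are dropped and the rest are stemmed):

text_stemmer('Installing 2 Wooden Floors quickly')
# -> roughly 'instal wooden floor quick' (stems shown are illustrative)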
# Apply the cleaning to every text column
df_all['search_term'] = df_all['search_term'].map(text_stemmer)

df_all['product_title'] = df_all['product_title'].map(text_stemmer)

df_all['product_description'] = df_all['product_description'].map(text_stemmer)
df_all.head()
| | id | product_title | product_uid | relevance | search_term | product_description |
|---|---|---|---|---|---|---|
| 0 | 2 | simpson strong-ti angl | 100001 | 3.00 | angl bracket | angl make joint stronger, also provid consiste… |
| 1 | 3 | simpson strong-ti angl | 100001 | 2.50 | l bracket | angl make joint stronger, also provid consiste… |
| 2 | 9 | behr premium textur deckov tugboat wood concre… | 100002 | 3.00 | deck | behr premium textur deckov innov solid color c… |
| 3 | 16 | delta vero shower faucet trim kit chrome (valv… | 100005 | 2.33 | rain shower head | updat bathroom delta vero single-handl shower … |
| 4 | 17 | delta vero shower faucet trim kit chrome (valv… | 100005 | 2.67 | shower faucet | updat bathroom delta vero single-handl shower … |
### Processing the training data

# Split df_all back apart; the split point is the number of *train* rows,
# since train came first in the concat. .copy() avoids pandas'
# SettingWithCopyWarning when we add columns below.
train = df_all[:df_train.shape[0]].copy()
test = df_all[df_train.shape[0]:].copy()
# Build the full word collection: title plus description in one text field
train['all_text'] = train['product_title'] + ' . ' + train['product_description'] + ' . '
test['all_text'] = test['product_title'] + ' . ' + test['product_description'] + ' . '
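Because a wrong split point here would silently corrupt everything downstream, a cheap sanity check (a sketch using the frames defined above) is worthwhile:

# train keeps every labeled row; test keeps every unlabeled one
assert train['relevance'].notna().all()
assert test['relevance'].isna().all()
assert len(train) == len(df_train) and len(test) == len(df_test)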
train['all_text'][0:5]
0    simpson strong-ti angl . angl make joint stron…
1    simpson strong-ti angl . angl make joint stron…
2    behr premium textur deckov tugboat wood concre…
3    delta vero shower faucet trim kit chrome (valv…
4    delta vero shower faucet trim kit chrome (valv…
Name: all_text, dtype: object

### Generating the corpus

We generate a corpus from `all_text` in train: first, `tokenize` splits each document into individual words; then `gensim.corpora.Dictionary` associates a unique ID with every word in the corpus, defining the complete vocabulary we will work with.
from gensim.utils import tokenize
# We use gensim's tokenize here, so later processing of test must match
from gensim.corpora.dictionary import Dictionary
dictionary = Dictionary(list(tokenize(x, errors='ignore')) for x in train['all_text'].values)
print(dictionary)
Dictionary(136703 unique tokens: ['alonehelp', 'also', 'angl', 'bent', 'coat']…)

We get a training vocabulary of 136,703 unique tokens. Next we convert every document into word counts. Since the corpus is large, materializing it as a plain list would be memory-hungry, so we write a class whose iterator yields one document's counts at a time.
class corpus:
    def __iter__(self):
        for x in train['all_text'].values:
            yield dictionary.doc2bow(list(tokenize(x, errors='ignore')))
# dictionary.doc2bow converts a document into its bag-of-words vector
train_corpus = corpus()
count=0
for c in train_corpus:
    print(c)
    count+=1
    if count >2:
        break
[(0, 1), (1, 1), (2, 4), (3, 1), (4, 1), (5, 1), (6, 2), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 2), (13, 1), (14, 1), (15, 1), (16, 1), (17, 2), (18, 1), (19, 1), (20, 1), (21, 4), (22, 1), (23, 2), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 2), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 3), (38, 1), (39, 2), (40, 1), (41, 1), (42, 1), (43, 2), (44, 1), (45, 2), (46, 1), (47, 1), (48, 1), (49, 2), (50, 3), (51, 1), (52, 1), (53, 1), (54, 1), (55, 2), (56, 1), (57, 1), (58, 2), (59, 1), (60, 1), (61, 3), (62, 1), (63, 1), (64, 1)]
(the second document is identical to the first, since rows 0 and 1 share the same product; the third, longer document is omitted here)

As shown above, each document becomes a list of tuples: the first element of each tuple is the word's ID in the dictionary, and the second is how many times that word appears in the document.

### The TF-IDF model

Simply put, the TF-IDF model maps the bag-of-words vectors into another vector space, one in which word counts are weighted by each word's relative rarity in the corpus:

- TF (term frequency): TF(x) = (occurrences of x in the document) / (total words in the document)
- IDF (inverse document frequency): IDF(x) = log(N / N(x)), where N is the number of documents in the corpus and N(x) the number of documents containing x
- TF-IDF(x) = TF(x) × IDF(x)
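A tiny worked example of these formulas (a from-scratch sketch with made-up documents; gensim's TfidfModel applies slightly different smoothing and normalization, so its numbers will not match exactly):

import math

# Toy documents, made up for illustration
docs = [['angl', 'bracket', 'angl'],
        ['deck', 'paint'],
        ['deck', 'bracket']]

def tf_idf(word, doc, docs):
    tf = doc.count(word) / len(doc)              # term frequency in this document
    n_containing = sum(word in d for d in docs)  # N(x): documents containing the word
    idf = math.log(len(docs) / n_containing)     # log(N / N(x))
    return tf * idf

print(tf_idf('angl', docs[0], docs))  # frequent here, rare overall -> high weight
print(tf_idf('deck', docs[2], docs))  # appears in 2 of 3 documents -> lower weight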
from gensim.models.tfidfmodel import TfidfModel
tfidf_g = TfidfModel(train_corpus)
# Save the model
tfidf_g.save("./gensim_tfidf.tfidf")
Once trained, we can score an ordinary sentence:
tfidf_g[dictionary.doc2bow(list(tokenize('morning yellow flower', errors='ignore')))]
[(1056, 0.44640344500226231), (1332, 0.40452528743632266), (34490, 0.79817495332456578)]

In each returned tuple, the first element is the word ID and the second its TF-IDF weight.
# Wrap this up as a helper
def to_tfidf(text):
    res = tfidf_g[dictionary.doc2bow(list(tokenize(text, errors='ignore')))]
    return res
### Cosine similarity

The closer the cosine is to 1, the closer the angle is to 0 degrees, and the more similar the two vectors are.
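Concretely, for two vectors $a$ and $b$ this is the standard definition:

$$\cos\theta = \frac{a \cdot b}{\lVert a\rVert\,\lVert b\rVert} = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\,\sqrt{\sum_i b_i^2}}$$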
from gensim.similarities import MatrixSimilarity
def cos_sim(text1,text2):
    tf1 = to_tfidf(text1)
    tf2 = to_tfidf(text2)
    index = MatrixSimilarity([tf1],num_features=len(dictionary))
    sim = index[tf2]
    return float(sim[0])
We convert both texts into TF-IDF vectors with the trained model, build a `MatrixSimilarity` index from one of them (expanded to the full dictionary size via `num_features`), and query it with the other to obtain their cosine similarity.
train['tfidf_cos_sim_in_title'] = train.apply(lambda x: cos_sim(x['search_term'], x['product_title']), axis=1)
test['tfidf_cos_sim_in_title'] = test.apply(lambda x: cos_sim(x['search_term'], x['product_title']), axis=1)
train['tfidf_cos_sim_in_desc'] = train.apply(lambda x: cos_sim(x['search_term'], x['product_description']), axis=1)
test['tfidf_cos_sim_in_desc'] = test.apply(lambda x: cos_sim(x['search_term'], x['product_description']), axis=1)
train.head(2)
| | id | product_title | product_uid | relevance | search_term | product_description | all_text | tfidf_cos_sim_in_title | tfidf_cos_sim_in_desc |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | simpson strong-ti angl | 100001 | 3.0 | angl bracket | angl make joint stronger, also provid consiste… | simpson strong-ti angl . angl make joint stron… | 0.287958 | 0.188301 |
| 1 | 3 | simpson strong-ti angl | 100001 | 2.5 | l bracket | angl make joint stronger, also provid consiste… | simpson strong-ti angl . angl make joint stron… | 0.000000 | 0.000000 |
test.head(2)
| | id | product_title | product_uid | relevance | search_term | product_description | all_text | tfidf_cos_sim_in_title | tfidf_cos_sim_in_desc |
|---|---|---|---|---|---|---|---|---|---|
| 166693 | 138080 | winix freshom model true hepa air cleaner plas… | 149579 | NaN | winix air purifi | winix freshom true hepa air cleaner plasmawav … | winix freshom model true hepa air cleaner plas… | 0.413249 | 0.146256 |
| 166694 | 138082 | ge rv outlet box amp volt ring type meter amp … | 149580 | NaN | gcfi outlet | ring-typ meter surfac mount factory-assembl fa… | ge rv outlet box amp volt ring type meter amp … | 0.530571 | 0.022509 |

We have now added two features: via the TF-IDF model we turn the texts into vectors and use the cosine between vectors as a similarity measure, scoring the search term against the product title and against the product description.

### Word2Vec model

#### Sentence tokenization
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
# Split a passage into individual sentences
tokenizer.tokenize(train['all_text'].values[0])
['simpson strong-ti angl .', 'angl make joint stronger, also provid consistent, straight corners.', 'simpson strong-ti offer wide varieti angl various size thick handl light-duti job project structur connect needed.', 'bent (skewed) match project.', 'outdoor project moistur present, use zmax zinc-coat connectors, provid extra resist corros (look "z" end model number).versatil connector various connect home repair projectsstrong angl nail screw fasten alonehelp ensur joint consist straight strongdimensions: in.', 'x in.', 'x in.mad steelgalvan extra corros resistanceinstal common nail x in.', 'strong-driv sd screw .']
# Apply to every train document
sentences = [tokenizer.tokenize(x) for x in train['all_text'].values]
type(sentences)  # a list of per-document sentence lists
list
sentences[:2]
[['simpson strong-ti angl .', 'angl make joint stronger, also provid consistent, straight corners.', …], ['simpson strong-ti angl .', 'angl make joint stronger, also provid consistent, straight corners.', …]]
(the two inner lists are identical because rows 0 and 1 describe the same product; output truncated)

After splitting, we still have one list of sentences per document, but what we want is a single flat list of all sentences, i.e. the nested list needs to be flattened. A Stack Overflow recipe for this:
sentences = [y for x in sentences for y in x]
This is equivalent to:

flattened = []
for sub in sentences:
    for val in sub:
        flattened.append(val)

but the comprehension runs faster and avoids the explicit append calls.
len(sentences)
606641
# Split each sentence into words
words = [word_tokenize(x) for x in sentences]
#### Training Word2Vec
from gensim.models.word2vec import Word2Vec

w2c = Word2Vec(words, size=128, window=5, min_count=5, workers=4)
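Once trained, the model can be inspected interactively, e.g. by querying nearest neighbors in the embedding space (a hypothetical query; which neighbors come back depends on the training run):

# Nearest neighbors of 'door' in the learned 128-dimensional space
print(w2c.wv.most_similar('door', topn=3))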
At this point every word in the corpus has a Word2Vec vector representation:
w2c.wv['door'].shape
(128,)

Now that each word has a vector, we can represent a whole sentence by the average of its words' vectors.
vocab = w2c.wv.vocab
# Note: from gensim 1.0.0 onward, the vocabulary dict lives at model.wv.vocab;
# older versions exposed it as model.vocab.
type(vocab)
dict
# Get the vector for an arbitrary text
def get_vector(text):
    # Start from a zero vector of size 128, then average the word vectors
    res = np.zeros([128])
    count = 0
    for word in word_tokenize(text):
        if word in vocab:
            res += w2c.wv[word]
            count += 1
    # If no token is in the vocabulary, return the zero vector rather than divide by 0
    if count == 0:
        return res
    return res / count
print(get_vector('this is a door'))
[-0.00261087 -0.56179226  0.60765644 -0.64292271 -0.56054996  0.08848376 … -0.16318383  0.17944744 -0.9727306 ]
(128 values; output truncated)

### Computing similarity
from scipy import spatial
def w2c_cos_sim(text1, text2):
    try:
        w1 = get_vector(text1)
        w2 = get_vector(text2)
        sim = 1 - spatial.distance.cosine(w1, w2)
        # An all-zero vector makes the cosine undefined (NaN); treat it as 0 similarity
        if not np.isfinite(sim):
            return 0.0
        return float(sim)
    except Exception:
        return float(0)
# spatial.distance.cosine returns the cosine *distance*, defined as 1 - cos,
# so similarity = 1 - distance
w2c_cos_sim('hello world','hello from the other side')
0.07032644504070107
train['w2v_cos_sim_in_title'] = train.apply(lambda x: w2c_cos_sim(x['search_term'], x['product_title']), axis=1)
train['w2v_cos_sim_in_desc'] = train.apply(lambda x: w2c_cos_sim(x['search_term'], x['product_description']), axis=1)

test['w2v_cos_sim_in_title'] = test.apply(lambda x: w2c_cos_sim(x['search_term'], x['product_title']), axis=1)
test['w2v_cos_sim_in_desc'] = test.apply(lambda x: w2c_cos_sim(x['search_term'], x['product_description']), axis=1)
train.head(2)
| | id | product_title | product_uid | relevance | search_term | product_description | all_text | tfidf_cos_sim_in_title | tfidf_cos_sim_in_desc | w2v_cos_sim_in_title | w2v_cos_sim_in_desc |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | simpson strong-ti angl | 100001 | 3.0 | angl bracket | angl make joint stronger, also provid consiste… | simpson strong-ti angl . angl make joint stron… | 0.287958 | 0.188301 | 0.531925 | 0.530175 |
| 1 | 3 | simpson strong-ti angl | 100001 | 2.5 | l bracket | angl make joint stronger, also provid consiste… | simpson strong-ti angl . angl make joint stron… | 0.000000 | 0.000000 | 0.279708 | 0.303249 |
test.head(2)
| | id | product_title | product_uid | relevance | search_term | product_description | all_text | tfidf_cos_sim_in_title | tfidf_cos_sim_in_desc | w2v_cos_sim_in_title | w2v_cos_sim_in_desc |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 166693 | 138080 | winix freshom model true hepa air cleaner plas… | 149579 | NaN | winix air purifi | winix freshom true hepa air cleaner plasmawav … | winix freshom model true hepa air cleaner plas… | 0.413249 | 0.146256 | 0.694583 | 0.591168 |
| 166694 | 138082 | ge rv outlet box amp volt ring type meter amp … | 149580 | NaN | gcfi outlet | ring-typ meter surfac mount factory-assembl fa… | ge rv outlet box amp volt ring type meter amp … | 0.530571 | 0.022509 | 0.676237 | 0.369881 |
# Drop the text columns
train = train.drop(['search_term','product_title','product_description','all_text'],axis=1)
test = test.drop(['search_term','product_title','product_description','all_text'],axis=1)
# Keep the test ids and extract the training target
ids = test['id']
y_train = train['relevance'].values

X_train = train.drop(['id','relevance'],axis=1).values
X_test = test.drop(['id','relevance'],axis=1).values

## Model

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
params = [1,3,5,6,7,8,9,10]
test_scores = []
for param in params:
    clf = RandomForestRegressor(n_estimators=30, max_depth=param)
    test_score = np.sqrt(-cross_val_score(clf, X_train, y_train, cv=10, scoring='neg_mean_squared_error'))
    test_scores.append(np.mean(test_score))
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(params, test_scores)
plt.title("Param vs CV Error")
(Figure: cross-validated RMSE vs. max_depth; the error bottoms out around max_depth = 6.)
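As an aside, the same sweep can be written with scikit-learn's built-in `GridSearchCV` (a sketch equivalent to the manual loop above, using the same estimator and scoring):

from sklearn.model_selection import GridSearchCV

# Sweep max_depth with 10-fold CV, same as the manual loop
grid = GridSearchCV(RandomForestRegressor(n_estimators=30),
                    param_grid={'max_depth': [1, 3, 5, 6, 7, 8, 9, 10]},
                    cv=10, scoring='neg_mean_squared_error')
grid.fit(X_train, y_train)
print(grid.best_params_, np.sqrt(-grid.best_score_))  # best depth and its CV RMSE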

As the plot shows, max_depth = 6 gives the best result, with an RMSE of about 0.49. So far we have added four features and used a random forest as the model. As next steps, one could engineer new features (for example, a simple flag for whether the search term is contained in the title), try other models such as LR, and then ensemble them.
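The walkthrough stops at cross-validation; for completeness, here is a minimal sketch of training the chosen configuration and writing a submission file (the `id`/`relevance` column names follow the competition's sample submission; clipping predictions to [1, 3] is an extra precaution of mine, since relevance is bounded):

# Fit the best configuration on all training data and predict the test set
rf = RandomForestRegressor(n_estimators=30, max_depth=6)
rf.fit(X_train, y_train)
y_pred = np.clip(rf.predict(X_test), 1, 3)

pd.DataFrame({'id': ids, 'relevance': y_pred}).to_csv('submission.csv', index=False)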


I'm still a beginner here; corrections and advice are welcome!
