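1. Scrape baby formula product listings from JD.com (京东), in the format shown in 1.1:
1.1
1.2 The main fields (with their column numbers) are:
Product name (商品名称), 1
Product SKU (商品sku), 2
Product link (商品链接), 3
Cover image link (封面图链接), 4
Price (价格), 5
Number of reviews (评价人数), 6
Review link (评论链接), 7
Shop name (商家店名), 8
Shop link (店铺链接), 9
Tags (标签), 10
Is advertisement (是否广告), 11
Page number (页码), 12
Current time (当前时间), 13
Page URL (页面网址), 14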
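2. Data processing
2.1 Required fields:
Brand name, formula stage, weight, shop name, shop URL, product URL, product price, and number of reviews. The first three can be extracted from the product title. One caveat: extracting the brand name with named entity recognition would require training on a corpus of formula brand names. Here we simply collect a list of brand names in advance and treat a product as belonging to a brand if that brand name appears in its title. The stage and the weight (including the number of cans) can be extracted with simple rules; the remaining fields come directly from the scraped data.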
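The title rules described above can be sketched as follows. This is a minimal illustration, not the program used in this post: `KNOWN_BRANDS`, `parse_title`, and the three regular expressions are hypothetical names, the brand set is only a small stand-in for a hand-collected list, and weights given in kg are not handled.

```python
import re

# Hypothetical brand list; in practice this would be a hand-collected set of brand names.
KNOWN_BRANDS = {"a2", "飞鹤", "启赋", "爱他美"}

STAGE_RE = re.compile(r"(\d)\s*段")                       # one digit followed by 段
WEIGHT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(?:g|克)")      # digits followed by g or 克
COUNT_RE = re.compile(r"\*\s*(\d{1,2})")                  # '*' followed by 1-2 digits


def parse_title(title, brands=KNOWN_BRANDS):
    """Return (brand, stage, weight) from a product title; None for any miss."""
    text = title.lower()
    # Try longer brand names first so a short name cannot shadow a longer one.
    brand = next(
        (b for b in sorted(brands, key=len, reverse=True) if b.lower() in text),
        None,
    )
    stage_match = STAGE_RE.search(text)
    stage = stage_match.group(1) + "段" if stage_match else None
    weight = None
    weight_match = WEIGHT_RE.search(text)
    if weight_match:
        weight = weight_match.group(1) + "克"
        count_match = COUNT_RE.search(text)
        if count_match:
            # Append the can count, e.g. '900克*2'.
            weight += "*" + count_match.group(1)
    return brand, stage, weight


if __name__ == "__main__":
    print(parse_title("a2 婴儿配方奶粉 1段 900g*2罐"))
```

Requiring digits directly before `g` keeps a stray letter g inside an English brand name from being mistaken for a weight unit.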
2.2 The processed result is in JSON format, one object per line:
{"product_weight": "900克", "product_price": 230.0, "product_name": "a2", "product_url": "https://item.jd.com/1950756.html", "shop_name": "a2海外自营官方旗舰店", "shope_url": "https://mall.jd.com/index-1000015026.html?from=pc", "product_stage": "1段", "product_comment_num": "32万+"}
{"product_weight": "300克", "product_price": 131.0, "product_name": "飞鹤", "product_url": "https://item.jd.com/3977355.html", "shop_name": "飞鹤京东自营旗舰店", "shope_url": "https://mall.jd.com/index-1000003568.html?from=pc", "product_stage": "1段", "product_comment_num": "14万+"}
{"product_weight": "900克", "product_price": 389.0, "product_name": "启赋", "product_url": "https://item.jd.com/100006355540.html", "shop_name": "惠氏(Wyeth)京东自营官方旗舰店", "shope_url": "https://mall.jd.com/index-1000002520.html?from=pc", "product_stage": "1段", "product_comment_num": "291万+"}
{"product_weight": "380克", "product_price": 170.0, "product_name": "爱他美", "product_url": "https://item.jd.com/6374127.html", "shop_name": "爱他美京东自营官方旗舰店", "shope_url": "https://mall.jd.com/index-1000002668.html?from=pc", "product_stage": "1段", "product_comment_num": "119万+"}
{"product_weight": "900克", "product_price": 215.0, "product_name": "a2", "product_url": "https://item.jd.com/1950749.html", "shop_name": "a2海外自营官方旗舰店", "shope_url": "https://mall.jd.com/index-1000015026.html?from=pc", "product_stage": "3段", "product_comment_num": "81万+"}
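Because each line is a standalone JSON object, the file can be read back line by line. The sketch below parses two of the sample lines above; `load_records` and `comment_count_lower_bound` are hypothetical helpers, and turning a display string such as "32万+" into a number is an extra step that the processing in this post does not perform (万 means ten thousand, and the trailing "+" means the figure is only a lower bound).

```python
import json

# Two lines copied from the sample output above.
SAMPLE_LINES = [
    '{"product_weight": "900克", "product_price": 230.0, "product_name": "a2", '
    '"product_url": "https://item.jd.com/1950756.html", '
    '"shop_name": "a2海外自营官方旗舰店", '
    '"shope_url": "https://mall.jd.com/index-1000015026.html?from=pc", '
    '"product_stage": "1段", "product_comment_num": "32万+"}',
    '{"product_weight": "300克", "product_price": 131.0, "product_name": "飞鹤", '
    '"product_url": "https://item.jd.com/3977355.html", '
    '"shop_name": "飞鹤京东自营旗舰店", '
    '"shope_url": "https://mall.jd.com/index-1000003568.html?from=pc", '
    '"product_stage": "1段", "product_comment_num": "14万+"}',
]


def load_records(lines):
    """Parse an iterable of JSON Lines strings into dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def comment_count_lower_bound(text):
    """Convert a display string such as '32万+' or '500+' to an integer lower bound."""
    cleaned = text.strip().rstrip("+")
    if cleaned.endswith("万"):
        # 万 is a multiplier of 10,000; round to avoid float truncation.
        return int(round(float(cleaned[:-1]) * 10_000))
    return int(cleaned)


if __name__ == "__main__":
    for record in load_records(SAMPLE_LINES):
        print(record["product_name"], record["product_stage"],
              comment_count_lower_bound(record["product_comment_num"]))
```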
2.3 An example processing program:
# Read the scraped spreadsheet and write the extracted fields as one JSON object per line
import json
import re

import pandas as pd

STAGE_RE = re.compile(r'(\d)段')             # stage, e.g. '1段'
WEIGHT_RE = re.compile(r'(\d+)\s*(?:g|克)')  # weight, e.g. '900g' or '900克'
NUM_RE = re.compile(r'\*(\d{1,2})')          # number of cans, e.g. '*2'


def read_babymilk_info(file_path, baby_milk_name_set, json_to_write_path):
    df = pd.read_excel(file_path)
    df = df.drop_duplicates(['商品SKU'], keep='first')  # deduplicate by SKU
    df = df[df['是否广告'] != '广告']  # drop advertisement rows
    # Keep product name, product link, price, number of reviews, shop name, shop link
    df = df.iloc[:, [1, 3, 5, 6, 8, 9]]
    write_lines = 0
    with open(json_to_write_path, 'w', encoding='utf-8') as f_write:
        for row in df.itertuples(index=False, name=None):
            # Reset per row so values never carry over from the previous product
            product_name = None    # brand name
            product_stage = None   # stage
            product_weight = None  # weight, with the can count appended if present
            # Normalize the title: drop tabs, turn newlines into spaces, lowercase
            product_title = str(row[0]).replace('\t', '').replace('\n', ' ').strip().lower()
            # Brand: the first known brand name that appears in the title
            for milk_name in baby_milk_name_set:
                if milk_name.lower() in product_title:
                    product_name = milk_name
                    break
            # Stage: a single digit directly before '段'
            stage_match = STAGE_RE.search(product_title)
            if stage_match:
                product_stage = stage_match.group(1) + '段'
            # Weight: digits directly before 'g' or '克'
            weight_match = WEIGHT_RE.search(product_title)
            if weight_match:
                product_weight = weight_match.group(1) + '克'
                # Number of cans: one or two digits directly after '*'
                num_match = NUM_RE.search(product_title)
                if num_match:
                    product_weight += '*' + num_match.group(1)
            product_dict = {
                'product_name': product_name,
                'product_stage': product_stage,
                'product_weight': product_weight,
                'product_url': row[1],
                'product_price': row[2],
                'product_comment_num': row[3],
                'shop_name': row[4],
                'shope_url': row[5].strip(),
            }
            json_str = json.dumps(product_dict, ensure_ascii=False)
            f_write.write(json_str + '\n')
            write_lines += 1
            print('write line:{}'.format(write_lines))
    print('write done!')
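3. Building the graph with neo4j and py2neo
3.1 Build the nodes and relationships directly from the JSON data in 2.2:
Four node types: shop (店铺名), brand (品牌名), stage (段位), and product URL (商品url)
Five relationships: (shop, 拥有奶粉品牌 [carries brand], brand), (shop, 拥有几段奶粉 [carries stage], stage), (shop, 奶粉链接 [product link], product URL), (brand, 拥有几段奶粉 [carries stage], stage), (product URL, 几段奶粉 [is stage], stage)
The product URL node has the properties price (价格), number of reviews (评论数), and weight (重量).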
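To make the schema concrete, the sketch below expands one record from 2.2 into the five (start, relationship, end) triples, without touching a database. `record_to_triples` is a hypothetical helper for illustration; each node is represented as a plain (label, name) pair rather than a py2neo object.

```python
def record_to_triples(record):
    """Expand one processed record into the five (start, relationship, end) triples."""
    shop = ("店铺名", record["shop_name"])        # shop node
    brand = ("品牌名", record["product_name"])    # brand node
    stage = ("段位", record["product_stage"])     # stage node
    item = ("商品url", record["product_url"])     # product URL node
    return [
        (shop, "拥有奶粉品牌", brand),
        (shop, "拥有几段奶粉", stage),
        (shop, "奶粉链接", item),
        (brand, "拥有几段奶粉", stage),
        (item, "几段奶粉", stage),
    ]


if __name__ == "__main__":
    sample = {
        "product_weight": "900克",
        "product_price": 230.0,
        "product_name": "a2",
        "product_url": "https://item.jd.com/1950756.html",
        "shop_name": "a2海外自营官方旗舰店",
        "shope_url": "https://mall.jd.com/index-1000015026.html?from=pc",
        "product_stage": "1段",
        "product_comment_num": "32万+",
    }
    for start, rel, end in record_to_triples(sample):
        print(start, rel, end)
```

Since the shop, brand, and stage nodes repeat across many records, whatever loads these triples needs to reuse existing nodes instead of creating a new one per record.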
3.2 The graph with 2 brands and 3 shops looks like this:
3.3 The graph with 15 brands and 18 shops looks like this:
3.4 An example graph-building program:
import json

from py2neo import Node, NodeMatcher, Relationship, RelationshipMatcher, Subgraph


def build_graph(graph, json_data_path):
    node_matcher = NodeMatcher(graph)
    rel_matcher = RelationshipMatcher(graph)
    line_count = 0
    with open(json_data_path, 'r', encoding='utf-8') as f_read:
        for line in f_read:
            product_dict = json.loads(line)
            product_name = product_dict['product_name']
            product_stage = product_dict['product_stage']
            product_weight = product_dict['product_weight']
            product_url = product_dict['product_url']
            product_price = product_dict['product_price']
            product_comment_num = product_dict['product_comment_num']
            shop_name = product_dict['shop_name']
            shop_url = product_dict['shope_url']
            # Nodes: (label, name used for lookup, candidate node)
            candidates = [
                ('店铺名', shop_name, Node('店铺名', name=shop_name, url=shop_url)),
                ('段位', product_stage, Node('段位', name=product_stage)),
                ('品牌名', product_name, Node('品牌名', name=product_name)),
                ('商品url', product_url, Node('商品url', name=product_url, 价格=product_price,
                                             重量=product_weight, 评论数=product_comment_num)),
            ]
            # Create only the nodes that are not in the graph yet
            new_nodes = [node for label, name, node in candidates
                         if not node_matcher.match(label, name=name).first()]
            if new_nodes:
                graph.create(Subgraph(new_nodes))
            # Re-fetch the stored nodes so every relationship uses bound endpoints
            shop_name_node = node_matcher.match('店铺名', name=shop_name).first()
            product_stage_node = node_matcher.match('段位', name=product_stage).first()
            product_name_node = node_matcher.match('品牌名', name=product_name).first()
            product_url_node = node_matcher.match('商品url', name=product_url).first()
            # Relationships
            relations = [
                (shop_name_node, '拥有奶粉品牌', product_name_node),
                (shop_name_node, '拥有几段奶粉', product_stage_node),
                (shop_name_node, '奶粉链接', product_url_node),
                (product_name_node, '拥有几段奶粉', product_stage_node),
                (product_url_node, '几段奶粉', product_stage_node),
            ]
            for start, rel_type, end in relations:
                # Skip relationships that already exist to avoid duplicate edges
                if not rel_matcher.match((start, end), r_type=rel_type).first():
                    graph.create(Relationship(start, rel_type, end))
            line_count += 1
            print('line_count:{} completed!'.format(line_count))
    print('build graph done!')
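An alternative to matching and creating node by node is to let the database do the existence checks with a single parameterized Cypher MERGE statement per record. The sketch below only builds the query text and its parameters, so it runs without a database; `MERGE_QUERY` and `merge_params` are hypothetical names, and skipping records that lack a brand or stage is an assumption made here because Cypher MERGE rejects null property values.

```python
# One statement per record: MERGE reuses a node or relationship if it exists,
# and creates it otherwise. Labels and property keys are backtick-quoted.
MERGE_QUERY = """
MERGE (shop:`店铺名` {name: $shop_name})
  ON CREATE SET shop.url = $shop_url
MERGE (brand:`品牌名` {name: $product_name})
MERGE (stage:`段位` {name: $product_stage})
MERGE (item:`商品url` {name: $product_url})
  ON CREATE SET item.`价格` = $product_price,
                item.`重量` = $product_weight,
                item.`评论数` = $product_comment_num
MERGE (shop)-[:`拥有奶粉品牌`]->(brand)
MERGE (shop)-[:`拥有几段奶粉`]->(stage)
MERGE (shop)-[:`奶粉链接`]->(item)
MERGE (brand)-[:`拥有几段奶粉`]->(stage)
MERGE (item)-[:`几段奶粉`]->(stage)
"""

# Keys that become node names; MERGE cannot match on a null value.
REQUIRED_KEYS = ("shop_name", "product_name", "product_stage", "product_url")


def merge_params(record):
    """Build the parameter dict for MERGE_QUERY, or None if a node name is missing."""
    if any(record.get(key) is None for key in REQUIRED_KEYS):
        return None
    return {
        "shop_name": record["shop_name"],
        "shop_url": record.get("shope_url"),  # key spelling follows the JSON in 2.2
        "product_name": record["product_name"],
        "product_stage": record["product_stage"],
        "product_url": record["product_url"],
        "product_price": record.get("product_price"),
        "product_weight": record.get("product_weight"),
        "product_comment_num": record.get("product_comment_num"),
    }
```

With py2neo, the pair would typically be passed to `graph.run(MERGE_QUERY, params)` for each record whose `merge_params` result is not `None`.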