A Simple Python Scraper for Taobao Product Listings

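Project goals
Analyze product titles as text and visualize them as a word cloud
Compute sales statistics for each keyword (word) found in the titles
Analyze the distribution of product prices
Analyze the distribution of product sales
Compare average sales across price ranges
Analyze how price affects sales
Analyze how price affects revenue
Compare average product sales across provinces
Project steps
Data collection: scrape product data from Taobao with Python
Clean and preprocess the data
Text analysis: jieba word segmentation and wordcloud visualization
Horizontal bar charts (barh)
Histograms (hist)
Scatter plots (scatter)
Regression plots (regplot)
Modules
requests, retrying, missingno, jieba, matplotlib, wordcloud, imread, seaborn, and others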

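1. Scraping the data
Taobao actively blocks scrapers. Even with multithreading and custom request headers, there is no guarantee that every page is fetched on each pass, so the script loops: each round re-requests only the pages that failed, and it stops once every page has been fetched.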
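The retry-until-complete idea can be tried without touching the network. In the minimal sketch below, `flaky_fetch` and the `failures_left` table are hypothetical stand-ins for the real request (each offset fails a fixed number of times); only the loop structure mirrors the approach described above.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a real HTTP request: each offset fails
# a fixed number of times before it succeeds (no network involved).
failures_left = {0: 0, 44: 1, 88: 2, 132: 0}

def flaky_fetch(offset):
    """Return (offset, payload) on success or (offset, None) on failure."""
    if failures_left[offset] > 0:
        failures_left[offset] -= 1
        return offset, None
    return offset, "page starting at item %d" % offset

pending = sorted(failures_left)   # offsets still to fetch
fetched = {}                      # offset -> payload
rounds = 0

while pending:
    rounds += 1
    with ThreadPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(flaky_fetch, pending))
    for offset, payload in results:
        if payload is not None:
            fetched[offset] = payload
    # Keep only the offsets that failed in this round
    pending = [o for o in pending if o not in fetched]

print(rounds, sorted(fetched))
```

Against a real site it is probably wise to cap the number of rounds and sleep between them, since a page that never succeeds would otherwise keep the loop running forever.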

The product data is embedded in the search result page as JSON, so it is extracted with a regular expression.
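As a rough illustration of that extraction step, the sketch below runs the same kind of non-greedy pattern over a made-up page fragment. The fragment, the `page_data` variable name, and the field values are invented for the example; only the `"auctions"` and `"recommendAuctions"` keys come from the script in this article.

```python
import json
import re

# Made-up page fragment that imitates the structure described above;
# the field values are invented for illustration only.
page_text = (
    '<script>page_data = {"mods":{"itemlist":{"data":{'
    '"auctions":[{"raw_title":"sofa A","view_price":"1299.00"},'
    '{"raw_title":"sofa B","view_price":"899.00"}],'
    '"recommendAuctions":[]}}}};</script>'
)

# Non-greedy capture of everything between the two known keys
match = re.search(r'"auctions":(.*?),"recommendAuctions"', page_text)
items = json.loads(match.group(1)) if match else []

titles = [item["raw_title"] for item in items]
prices = [float(item["view_price"]) for item in items]
print(titles, prices)
```

The pattern depends on the two keys appearing in this order, so it is fragile: if the page layout changes, `match` is `None` and the page is simply treated as not fetched.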

# -*- coding: utf-8 -*-

import re
import xlwt
import time
import requests
import pandas as pd
from retrying import retry
from concurrent.futures import ThreadPoolExecutor
import matplotlib
%matplotlib inline

# Start the timer
start = time.perf_counter()

# plist holds the URL offsets (the s parameter) for pages 1-100, 44 items per page
plist = []
for i in range(1, 101):
    j = 44 * (i - 1)
    plist.append(j)

listno = plist
datatmsp = pd.DataFrame(columns=[])

while True:
    # Retry each request up to 8 times
    @retry(stop_max_attempt_number=8)
    def network_programming(num):
        url = 'https://s.taobao.com/search?initiative_id=tbindexz_20170306&ie=utf8&spm=a21bo.2017.201856-taobao-item.2&sourceId=tb.index&search_type=item&ssid=s5-e&commend=all&imgfile=&q=%E6%B2%99%E5%8F%91&suggest=history_1&_input_charset=utf-8&wq=shafa&suggest_query=shafa&source=suggest&bcoffset=4&ntoffset=4&p4ppushleft=1%2C48&s=' + str(num)
        web = requests.get(url, headers=headers)
        web.encoding = 'utf-8'
        return web

    # Fetch pages concurrently with a thread pool
    def multithreading():
        # Request only the pages that have not been fetched successfully yet
        number = listno
        event = []

        with ThreadPoolExecutor(max_workers=10) as executor:
            for result in executor.map(network_programming, number, chunksize=10):
                event.append(result)

        return event

    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.167 Safari/537.36'
    }

    listpg = []
    event = multithreading()
    for i in event:
        json = re.findall(
            '"auctions":(.*?),"recommendAuctions"', i.text)
        if len(json):
            table = pd.read_json(json[0])
            datatmsp = pd.concat([datatmsp, table],
                                 axis=0, ignore_index=True)

            # Record the page number of every page that parsed successfully
            pg = re.findall(
                '"pageNum":(.*?),"p4pbottom_up"', i.text)[0]
            listpg.append(pg)

    # Convert the successful page numbers back to URL offsets
    lists = []
    for a in listpg:
        b = 44 * (int(a) - 1)
        lists.append(b)

    listn = listno

    # Keep only the offsets that still need to be fetched
    listno = []
    for p in listn:
        if p not in lists:
            listno.append(p)

    if len(listno) == 0:
        break

datatmsp.to_excel('./data/datastmsp.xls', index=False)

end = time.perf_counter()
print("Scraping finished, elapsed:", end - start, 's')

Scraping finished, elapsed: 9.088245 s
2. Data cleaning and preprocessing
datatmsp = pd.read_excel('./data/datastmsp.xls')
datatmsp.shape
(4392, 25)
import missingno as msno
msno.bar(datatmsp.sample(len(datatmsp)), figsize=(10, 4))

# Drop columns in which more than half of the values are missing
half_count = len(datatmsp) / 2
datatmsp = datatmsp.dropna(thresh=half_count, axis=1)

# Drop duplicate rows
datatmsp = datatmsp.drop_duplicates()
datatmsp.head()
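To see what `dropna(thresh=..., axis=1)` followed by `drop_duplicates()` does, here is a small sketch on a toy frame. The column names and values are invented; `thresh` is the minimum number of non-missing values a column needs in order to be kept.

```python
import numpy as np
import pandas as pd

# Toy frame: column "b" is mostly missing, and the last two rows
# become identical once "b" is gone.
df = pd.DataFrame({
    "a": [1, 2, 3, 3],
    "b": [np.nan, np.nan, np.nan, 7.0],
    "c": ["x", "y", "z", "z"],
})

# Keep columns that have at least len(df) // 2 non-missing values
kept = df.dropna(thresh=len(df) // 2, axis=1)

# Then remove rows that are exact duplicates of an earlier row
cleaned = kept.drop_duplicates()
print(cleaned)
```

Note the order: duplicates are detected on the remaining columns only, so dropping sparse columns first can turn nearly identical rows into exact duplicates.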
For this case study only four columns are needed: item_loc (location), raw_title (title), view_price (price), and view_sales (sales).

# Take an explicit copy so the new columns below are added to this frame, not to a slice of datatmsp
data = datatmsp[
    ['item_loc', 'raw_title', 'view_price', 'view_sales']
].copy()
data.head()

# Split the location into province and city
data['province'] = data.item_loc.apply(lambda x: x.split()[0])
data['city'] = data.item_loc.apply(lambda x: x.split()[0] if len(x) < 4 else x.split()[1])

# Extract the sales count: the part of view_sales before '人'
data['sales'] = data.view_sales.apply(lambda x: x.split('人')[0])

data.dtypes

item_loc       object
raw_title      object
view_price    float64
view_sales     object
province       object
city           object
sales          object
dtype: object

# Convert sales to integers and the location columns to categories
data['sales'] = data.sales.astype('int')
list_col = ['province', 'city']
for i in list_col:
    data[i] = data[i].astype('category')

data = data.drop(['item_loc', 'view_sales'], axis=1)

data.head()
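The same parsing can also be written with pandas string methods. The sketch below is an alternative under stated assumptions, not the article's method: the sample rows are invented, and the `人付款` suffix in `view_sales` is an assumption about how the raw field looks.

```python
import pandas as pd

# Invented sample rows; the "人付款" suffix is an assumption about the raw field
raw = pd.DataFrame({
    "item_loc": ["广东 佛山", "上海", "浙江 杭州"],
    "view_sales": ["1200人付款", "35人付款", "0人付款"],
})

# Split on whitespace: first token is the province, last token is the city.
# Municipalities have a single token, so province and city coincide.
parts = raw["item_loc"].str.split()
raw["province"] = parts.str[0]
raw["city"] = parts.str[-1]

# Pull out the leading digits and convert them to integers
raw["sales"] = raw["view_sales"].str.extract(r"(\d+)", expand=False).astype(int)
print(raw[["province", "city", "sales"]])
```

Taking the last token avoids the length check on the location string, and the regular expression keeps working if the text after the number changes.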

Data mining and analysis
Word segmentation uses jieba.

# Text analysis of the title column
title = data.raw_title.values.tolist()
import jieba
title_s = []
for line in title:
    title_cut = jieba.lcut(line)
    title_s.append(title_cut)

# Remove unwanted words using a stop-word list
stopwords = pd.read_excel('./data/stopwords.xlsx')
stopwords = stopwords.stopword.values.tolist()

title_clean = []
for line in title_s:
    line_clean = []
    for word in line:
        if word not in stopwords:
            line_clean.append(word)
    title_clean.append(line_clean)

# Count each word once per title: de-duplicate the words within every title first
title_clean_dist = []
for line in title_clean:
    line_dist = []
    for word in line:
        if word not in line_dist:
            line_dist.append(word)
    title_clean_dist.append(line_dist)

# Flatten all words into a single list
allwords_clean_dist = []
for line in title_clean_dist:
    for word in line:
        allwords_clean_dist.append(word)

# Put the words into a DataFrame and count them
df_allwords_clean_dist = pd.DataFrame({
    'allwords': allwords_clean_dist
})

word_count = df_allwords_clean_dist.allwords.value_counts().reset_index()
word_count.columns = ['word', 'count']
word_count.head()

Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.644 seconds.
Prefix dict has been built succesfully.
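The counting step above (each word counted at most once per title, stop words excluded) can be condensed with `collections.Counter`. This is a minimal sketch on hypothetical, already-tokenized titles; in the article the tokens come from `jieba.lcut` and the stop words from an Excel file.

```python
from collections import Counter

# Hypothetical pre-tokenized titles and stop words
tokenized_titles = [
    ["布艺", "沙发", "简约", "沙发"],
    ["真皮", "沙发", "客厅"],
    ["简约", "客厅", "组合"],
]
stopwords = {"组合"}

doc_freq = Counter()
for tokens in tokenized_titles:
    # set() removes repeats within one title, so each title counts a word once
    doc_freq.update(set(tokens) - stopwords)

print(doc_freq["沙发"], doc_freq["简约"], doc_freq["布艺"])
```

Keeping the stop words in a `set` also makes each membership test constant time, which matters once the stop-word list and the number of titles grow.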

Word cloud visualization
This step needs the wordcloud package.

from wordcloud import WordCloud
import matplotlib.pyplot as plt
# imageio.imread replaces scipy.misc.imread, which was deprecated in SciPy 1.0.0 and removed in 1.2.0
from imageio import imread
plt.figure(figsize=(20, 10))

# The image is used as a mask that gives the word cloud its shape
pic = imread("./images/shafa.jpg")
w_c = WordCloud(font_path="./data/simhei.ttf",
                background_color='white',
                mask=pic,
                max_font_size=60,
                margin=1)
# Use the 100 most frequent words and their counts
wc = w_c.fit_words({
    x[0]: x[1] for x in word_count.head(100).values
})

plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()

Conclusions:
1. Living room (客厅), sectional (组合), sofa (沙发), and fully assembled (整装) appear in a large share of the titles.
2. By material, fabric sofas (布艺沙发) account for a large share.
3. By style, minimalist (简约) is the most common.
4. By home size, small apartments (小户型).
Sales statistics by keyword
For example, for the word 简约 (minimalist), sum the sales of every product whose title contains it.

import numpy as np

# Sales in positional order, aligned with title_clean_dist
# (the DataFrame index has gaps after drop_duplicates, so label lookups can miss rows)
sales_values = data.sales.tolist()

w_s_sum = []
for w in word_count.word:
    s_list = []
    for i, t in enumerate(title_clean_dist):
        if w in t:
            s_list.append(sales_values[i])
    w_s_sum.append(sum(s_list))

df_w_s_sum = pd.DataFrame({'w_s_sum': w_s_sum})

df_w_s_sum.head()

word_count.head()

df_word_sum = pd.concat([word_count, df_w_s_sum],
                        axis=1,
                        ignore_index=True)
df_word_sum.columns = ['word', 'count', 'w_s_sum']

df_word_sum.head(30)
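The per-keyword sum can also be accumulated in a single pass over the products instead of scanning every title once per word. The sketch below assumes hypothetical `(tokens, sales)` pairs; it illustrates the idea and is not the article's code.

```python
from collections import defaultdict

# Hypothetical (de-duplicated tokens, sales) pairs, one per product
products = [
    ({"简约", "沙发"}, 120),
    ({"真皮", "沙发"}, 30),
    ({"简约", "客厅"}, 50),
]

sales_by_word = defaultdict(int)
for words, sales in products:
    # Every word in the title gets credited with this product's sales
    for word in words:
        sales_by_word[word] += sales

print(sales_by_word["简约"], sales_by_word["沙发"])
```

The work here grows with the total number of tokens, whereas the nested loop grows with the number of distinct words times the number of titles.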

# Visualize the data: the 30 keywords with the highest total sales
df_word_sum.sort_values('w_s_sum',
                        inplace=True,
                        ascending=True)
df_w_s = df_word_sum.tail(30)
import matplotlib
from matplotlib import pyplot as plt
# A font with Chinese glyphs, so the labels render correctly
font = matplotlib.font_manager.FontProperties(fname='./data/simhei.ttf')

index = np.arange(df_w_s.word.size)

plt.figure(figsize=(6, 12))
plt.barh(index,
         df_w_s.w_s_sum,
         color='blue',
         align='center',
         alpha=0.8)

plt.yticks(index, list(df_w_s.word), fontproperties=font)

# Write the total next to each bar
for y, x in zip(index, df_w_s.w_s_sum):
    plt.text(x, y, "%.0f" % x, ha='left', va='center')

plt.show()
Price distribution
Only products priced below 20000 are analyzed.

data.head()

data_p = data[data['view_price'] < 20000]

plt.figure(figsize=(7, 5))
plt.hist(data_p['view_price'], bins=15, color='purple')
plt.xlabel('价格', fontproperties=font)  # price
plt.ylabel('商品数量', fontproperties=font)  # number of products
plt.title('不同价格对应的商品数据分布', fontproperties=font)  # product count by price
plt.show()

The chart shows that the number of products falls off in steps as the price rises: the higher the price, the fewer products are on sale.
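Sales distribution
As before, to keep the chart readable, only products with sales above 100 (and below 2000) are plotted.

data_s = data[(data['sales'] > 100) & (data['sales'] < 2000)]
print('Share of products with sales between 100 and 2000: %0.3f' % (len(data_s) / len(data)))

plt.figure(figsize=(7, 5))
plt.hist(data_s['sales'], bins=20, color='blue')
plt.xlabel('销量', fontproperties=font)  # sales
plt.ylabel('商品数量', fontproperties=font)  # number of products
plt.title('不同销量对应的商品数据分布', fontproperties=font)  # product count by sales
plt.show()

Share of products with sales between 100 and 2000: 0.271
The chart shows:
1. Products in this range make up about 27.1% of all products (the ratio printed above). Among them, the 100-200 bucket is the largest, followed by 200-300.
2. Between 100 and 500, the number of products drops steeply as sales rise, so low-selling products dominate.
3. Very few products sell more than 750.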
Average sales by price range
# Split the prices into 12 equal-frequency bins
data['price'] = data.view_price.astype('int')
data['group'] = pd.qcut(data.price, 12)
df_group = data.group.value_counts().reset_index()

# Mean sales within each price bin
df_s_g = data[
    ['sales', 'group']
].groupby('group').mean().reset_index()

df_s_g

index = np.arange(df_s_g.group.size)
plt.figure(figsize=(8, 4))
plt.bar(index, df_s_g.sales, color='blue')
plt.xticks(index, df_s_g.group, fontproperties=font, rotation=30)
plt.xlabel('Group')
plt.ylabel('mean_sales')
plt.title('不同价格商品的平均销量', fontproperties=font)  # average sales by price range
plt.show()

The chart shows:
1. Products priced between 40 and 490 have the highest average sales, followed by those between 888 and 1198. Products priced above 7290 have the lowest.
2. Overall, average sales decrease as the price rises.
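For readers unfamiliar with `pd.qcut`, the sketch below shows the binning on a toy frame with invented prices and sales. Unlike `pd.cut`, which uses equal-width intervals, `qcut` picks the bin edges from quantiles so that each bin holds roughly the same number of rows.

```python
import pandas as pd

# Invented prices and sales, for illustration only
toy = pd.DataFrame({
    "price": [100, 200, 300, 400, 500, 600, 700, 800],
    "sales": [90, 70, 60, 40, 30, 20, 10, 0],
})

# Four equal-frequency bins: each one holds two of the eight products
toy["group"] = pd.qcut(toy["price"], 4)

# observed=True restricts the result to bins that actually contain rows
mean_sales = toy.groupby("group", observed=True)["sales"].mean()
print(mean_sales)
```

Equal-frequency bins are a reasonable choice for skewed prices, because equal-width bins would put almost every product into the first bin and leave the rest nearly empty.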
How price affects sales
data_p

fig, ax = plt.subplots()
ax.scatter(data_p['view_price'], data_p['sales'], color='red')
ax.set_xlabel('价格', fontproperties=font)  # price
ax.set_ylabel('销量', fontproperties=font)  # sales
ax.set_title('商品价格对销量的影响', fontproperties=font)  # effect of price on sales
plt.show()
How price affects revenue
# Revenue (GMV) per product = price * sales
data['GMV'] = data['price'] * data['sales']
import seaborn as sns
sns.regplot(x='price', y='GMV', data=data, color='purple')
The chart shows:

Overall trend: the fitted linear regression line indicates that revenue rises as the price increases.

Most products have both a low price and low revenue.
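By default `regplot` draws a scatter plot together with a fitted straight line. To make the slope of such a line concrete, the sketch below fits an ordinary least-squares line with `np.polyfit` on invented numbers; it is a comparable calculation for illustration, not a reproduction of the chart above.

```python
import numpy as np

# Invented prices and sales, for illustration only
price = np.array([100.0, 200.0, 300.0, 400.0])
sales = np.array([10.0, 8.0, 7.0, 6.0])
gmv = price * sales   # revenue per product: 1000, 1600, 2100, 2400

# Degree-1 polynomial fit: gmv is approximated by slope * price + intercept
slope, intercept = np.polyfit(price, gmv, 1)
print(round(slope, 3), round(intercept, 3))
```

A positive slope only says that revenue tends to be higher at higher prices in the sample; with most products clustered at low prices, a few expensive items can pull the line noticeably.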

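Number of products by province
plt.figure(figsize=(8, 4))
data.province.value_counts().plot(kind='bar', color='blue')
plt.xticks(rotation=0, fontproperties=font)
plt.xlabel('省份', fontproperties=font)  # province
plt.ylabel('数量', fontproperties=font)  # count
plt.title('不同省份的商品数量分布', fontproperties=font)  # product count by province
plt.show()
The chart shows:

Guangdong has the most listings, followed by Zhejiang and then Jiangsu. Guangdong's count is far higher than that of Jiangsu, Zhejiang, Shanghai, and other regions, which suggests that Guangdong shops dominate the sofa subcategory.

The counts for Jiangsu, Zhejiang, and Shanghai differ little and are roughly on par.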
————————————————
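Copyright notice: this is an original article by CSDN blogger 「秦景坤」, released under the CC 4.0 BY-SA license. Reposts must include a link to the original source and this notice.
Original link: https://blog.csdn.net/qjk19940101/article/details/79593381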
