A Simple Python Scraper for Taobao Products

Latest recommended article published on 2024-08-05 17:04:33

Prime-夕枫

Tags: python

Original link: https://blog.csdn.net/weixin_44263976/article/details/106537135


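Project Goals
Text analysis of product titles, visualized as a word cloud
Statistical analysis of sales by keyword
Analysis of the product price distribution
Analysis of the product sales distribution
Average sales across price ranges
Effect of price on sales
Effect of price on sales revenue
Average sales distribution across provinces
Project Steps
Data collection: scrape Taobao product data with Python
Clean and process the data
Text analysis: jieba word segmentation, wordcloud visualization
Bar chart visualization with barh
Histogram visualization with hist
Scatter plot visualization with scatter
Regression visualization with regplot
Modules
requests, retrying, missingno, jieba, matplotlib, wordcloud, imread, seaborn, etc.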

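1. Scraping the Data
Taobao uses anti-scraping measures. Even with multithreading and a custom headers parameter, there is no guarantee that every page is fetched successfully on each run. The script therefore scrapes in a loop: each round requests only the pages that have not yet been scraped successfully, and the loop stops once every page has succeeded.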
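The "retry only what is still missing" loop can be sketched on its own, without touching the network. In this sketch `fetch_page` is a hypothetical stand-in that fails on purpose for odd-numbered pages during the first round; the failure pattern, the function names, and `max_rounds` are all assumptions made for illustration, not part of the scraper in this post.

```python
from concurrent.futures import ThreadPoolExecutor

PAGE_SIZE = 44  # the search URL advances by 44 items per page


def fetch_page(offset, attempt):
    """Hypothetical fetch: pretend odd-numbered pages fail on the first round."""
    page = offset // PAGE_SIZE + 1
    if attempt == 0 and page % 2 == 1:
        return None  # simulated failure (for example, a blocked request)
    return {"offset": offset, "page": page}


def crawl_all(offsets, max_rounds=10):
    """Keep re-requesting only the offsets that have not succeeded yet."""
    pending = list(offsets)
    results = {}
    for attempt in range(max_rounds):
        if not pending:
            break
        with ThreadPoolExecutor(max_workers=10) as executor:
            fetched = list(executor.map(lambda o: fetch_page(o, attempt), pending))
        for offset, item in zip(pending, fetched):
            if item is not None:
                results[offset] = item
        # Whatever is not in results yet goes into the next round
        pending = [o for o in pending if o not in results]
    return results, pending


offsets = [PAGE_SIZE * (i - 1) for i in range(1, 11)]  # pages 1..10
results, still_missing = crawl_all(offsets)
print(len(results), still_missing)
```

Capping the number of rounds is a design choice of the sketch: a plain `while True` loop never ends if one page keeps failing.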

Taobao's search result pages embed the product data as JSON, which is extracted with regular expressions.
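As a rough illustration of that extraction step, the sketch below pulls the array that sits between `"auctions":` and `,"recommendAuctions"` out of a page string and parses it with the standard `json` module. The `sample_html` fragment is made up for the example; real pages are much larger and their layout may differ.

```python
import json
import re

# A made-up fragment shaped like the embedded page data (assumption for illustration).
sample_html = (
    'g_page_config = {"mods":{"itemlist":{"data":{"auctions":'
    '[{"raw_title":"sofa A","view_price":"1299.00"},'
    '{"raw_title":"sofa B","view_price":"899.00"}]'
    ',"recommendAuctions":[]}}}};'
)

# Non-greedy match: capture everything up to the first ,"recommendAuctions"
AUCTIONS_RE = re.compile(r'"auctions":(.*?),"recommendAuctions"', re.S)


def extract_auctions(html):
    """Return the list of product dicts, or an empty list if the block is absent."""
    match = AUCTIONS_RE.search(html)
    if match is None:
        return []
    return json.loads(match.group(1))


items = extract_auctions(sample_html)
print([item["raw_title"] for item in items])
```

Returning an empty list for a page without the block makes it easy to treat that page as "not scraped yet" and request it again.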

# -*- coding: utf-8 -*-

import re
import time
from io import StringIO

import pandas as pd
import requests
import xlwt  # used by older pandas versions to write .xls files
from retrying import retry
from concurrent.futures import ThreadPoolExecutor
import matplotlib
%matplotlib inline

# Start timing (time.clock() was removed in Python 3.8)
start = time.perf_counter()

# plist holds the URL offsets for pages 1-100 (44 items per page)
plist = [44 * (i - 1) for i in range(1, 101)]

listno = plist  # offsets that still have to be scraped
datatmsp = pd.DataFrame(columns=[])

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.167 Safari/537.36'
}

# Retry each request at most 8 times
@retry(stop_max_attempt_number=8)
def network_programming(num):
    url = 'https://s.taobao.com/search?initiative_id=tbindexz_20170306&ie=utf8&spm=a21bo.2017.201856-taobao-item.2&sourceId=tb.index&search_type=item&ssid=s5-e&commend=all&imgfile=&q=%E6%B2%99%E5%8F%91&suggest=history_1&_input_charset=utf-8&wq=shafa&suggest_query=shafa&source=suggest&bcoffset=4&ntoffset=4&p4ppushleft=1%2C48&s=' + str(num)
    web = requests.get(url, headers=headers)
    web.encoding = 'utf-8'
    return web

# Multithreading
def multithreading():
    # Each round only requests the pages that have not been scraped yet
    with ThreadPoolExecutor(max_workers=10) as executor:
        return list(executor.map(network_programming, listno))

while True:
    listpg = []
    for response in multithreading():
        found = re.findall('"auctions":(.*?),"recommendAuctions"', response.text)
        if found:
            # Newer pandas versions expect a file-like object in read_json
            table = pd.read_json(StringIO(found[0]))
            datatmsp = pd.concat([datatmsp, table], axis=0, ignore_index=True)

            pg = re.findall('"pageNum":(.*?),"p4pbottom_up"', response.text)[0]
            listpg.append(pg)

    # Offsets of the pages scraped successfully in this round
    lists = [44 * (int(a) - 1) for a in listpg]

    # Keep only the offsets that are still missing
    listno = [p for p in listno if p not in lists]

    if len(listno) == 0:
        break

# Writing .xls relies on xlwt; if your pandas version no longer supports it,
# save as .xlsx instead
datatmsp.to_excel('./data/datastmsp.xls', index=False)

end = time.perf_counter()
print("Scraping finished, elapsed: ", end - start, 's')

Scraping finished, elapsed: 9.088245 s
2. Data Cleaning and Processing
datatmsp = pd.read_excel('./data/datastmsp.xls')
datatmsp.shape

(4392, 25)

import missingno as msno
# Bar chart of the non-null count in every column
msno.bar(datatmsp.sample(len(datatmsp)), figsize=(10, 4))

# Drop columns in which more than half of the values are missing
half_count = len(datatmsp) // 2
datatmsp = datatmsp.dropna(thresh=half_count, axis=1)

# Drop duplicate rows
datatmsp = datatmsp.drop_duplicates()

(Figure: missingno bar chart of non-null counts per column)

datatmsp.head()
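To make the `thresh` argument concrete, here is a small sketch on a toy DataFrame. The column names and values are invented for the example. `thresh` is the minimum number of non-null values a column must have to be kept.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "price": [10.0, np.nan, 30.0, 40.0],            # 3 of 4 values present
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],  # 1 of 4 values present
})

# A column survives only if it has at least this many non-null values.
min_non_null = len(df) // 2
cleaned = df.dropna(thresh=min_non_null, axis=1).drop_duplicates()
print(list(cleaned.columns))
```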
For this case study only four columns are needed: item_loc (location), raw_title (title), view_price (price), and view_sales (sales).

# .copy() returns an independent DataFrame; without it pandas raises a
# SettingWithCopyWarning each time a new column is assigned below
data = datatmsp[
    ['item_loc', 'raw_title', 'view_price', 'view_sales']
].copy()
data.head()

# Split the location into province and city
# (entries shorter than 4 characters have no separate city part)
data['province'] = data.item_loc.apply(lambda x: x.split()[0])
data['city'] = data.item_loc.apply(lambda x: x.split()[0] if len(x) < 4 else x.split()[1])

# Extract the sales figure: the part of view_sales before '人'
data['sales'] = data.view_sales.apply(lambda x: x.split('人')[0])

data.dtypes

item_loc       object
raw_title      object
view_price    float64
view_sales     object
province       object
city           object
sales          object
dtype: object
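The two string operations can also be written as small plain functions, which makes them easy to try on a few values. This is a sketch and not the post's exact logic: it decides by the number of whitespace-separated parts rather than by string length, and the sample strings (including the `付款` suffix) are assumed inputs.

```python
def split_location(item_loc):
    """Split 'province city' into (province, city); a lone name is used for both."""
    parts = item_loc.split()
    province = parts[0]
    city = parts[1] if len(parts) > 1 else parts[0]
    return province, city


def parse_sales(view_sales):
    """Take the number before '人' and convert it to int."""
    return int(view_sales.split("人")[0])


print(split_location("广东 佛山"))
print(split_location("上海"))
print(parse_sales("1234人付款"))
```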

# Convert sales to integers, and province and city to categorical columns
data['sales'] = data.sales.astype('int')
list_col = ['province', 'city']
for i in list_col:
    data[i] = data[i].astype('category')

# The raw columns are no longer needed
data = data.drop(['item_loc', 'view_sales'], axis=1)
data.head()

Data Mining and Analysis
Using jieba

# Text analysis of the title column
import jieba

title = data.raw_title.values.tolist()
title_s = []
for line in title:
    title_cut = jieba.lcut(line)
    title_s.append(title_cut)

# Remove unwanted words with a stop-word list
stopwords = pd.read_excel('./data/stopwords.xlsx')
stopwords = stopwords.stopword.values.tolist()

title_clean = []
for line in title_s:
    line_clean = []
    for word in line:
        if word not in stopwords:
            line_clean.append(word)
    title_clean.append(line_clean)

# Deduplicate the words within each title before counting,
# so that a word is counted at most once per product
title_clean_dist = []
for line in title_clean:
    line_dist = []
    for word in line:
        if word not in line_dist:
            line_dist.append(word)
    title_clean_dist.append(line_dist)

# Flatten all words into a single list
allwords_clean_dist = []
for line in title_clean_dist:
    for word in line:
        allwords_clean_dist.append(word)

# Turn the word list into a DataFrame and count each word
df_allwords_clean_dist = pd.DataFrame({
    'allwords': allwords_clean_dist
})

word_count = df_allwords_clean_dist.allwords.value_counts().reset_index()
word_count.columns = ['word', 'count']
word_count.head()

Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.644 seconds.
Prefix dict has been built succesfully.
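The filter, deduplicate, and count steps above can be condensed with `collections.Counter`. The sketch below uses hand-tokenized titles in place of real `jieba.lcut` output, and a one-word stop list that is purely hypothetical, so it runs without jieba or the stop-word file.

```python
from collections import Counter

# Hand-tokenized titles standing in for jieba.lcut output (illustrative data).
tokenized_titles = [
    ["布艺", "沙发", "小户型", "沙发"],
    ["简约", "布艺", "沙发"],
    ["真皮", "沙发", "组合"],
]
stopwords = {"组合"}  # hypothetical stop list

word_counts = Counter()
for tokens in tokenized_titles:
    # The set removes repeats within a title, so each title counts a word at most once
    word_counts.update({w for w in tokens if w not in stopwords})

print(word_counts.most_common(2))
```

Keeping the stop words in a set rather than a list also makes each membership test cheaper, which matters once there are thousands of titles.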

Word cloud visualization
Requires wordcloud

import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud

plt.figure(figsize=(20, 10))

# scipy.misc.imread was deprecated in SciPy 1.0.0 and scheduled for removal in 1.2.0,
# so the mask image is loaded with Pillow here instead
pic = np.array(Image.open("./images/shafa.jpg"))
w_c = WordCloud(font_path="./data/simhei.ttf",
                background_color='white',
                mask=pic,
                max_font_size=60,
                margin=1)
# Use the 100 most frequent words, as a word -> count mapping
wc = w_c.fit_words({
    x[0]: x[1] for x in word_count.head(100).values
})

plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()

(Figure: word cloud of the title keywords)

Conclusions:
1. 客厅 (living room), 组合 (combination), 沙发 (sofa), and 整装 (fully assembled) account for a large share of the listings.
2. By material, 布艺 (fabric) sofas have a high share.
3. By style, 简约 (minimalist) is the most common.
4. By home size, 小户型 (small apartments) is the category that shows up.
Sales statistics by keyword
For example, for the word 简约 (minimalist), sum the sales of all products whose title contains 简约.

import numpy as np

# Pair each title with its sales value by position. Looking up data.sales[i]
# by label fails for the gaps that drop_duplicates() leaves in the index.
sales_values = data.sales.tolist()

w_s_sum = []
for w in word_count.word:
    s_list = []
    for t, s in zip(title_clean_dist, sales_values):
        if w in t:
            s_list.append(s)
    w_s_sum.append(sum(s_list))

df_w_s_sum = pd.DataFrame({'w_s_sum': w_s_sum})

df_w_s_sum.head()

word_count.head()

df_word_sum = pd.concat([word_count, df_w_s_sum],
                        axis=1,
                        ignore_index=True)
df_word_sum.columns = ['word', 'count', 'w_s_sum']

df_word_sum.head(30)
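The same per-keyword total can be accumulated in a single pass over the products instead of scanning every title once per word. The rows below are invented sample data, and the dictionary-based approach is an alternative sketch rather than the code used in this post.

```python
from collections import defaultdict

# Illustrative rows: (deduplicated title tokens, sales count)
rows = [
    ({"简约", "布艺", "沙发"}, 120),
    ({"真皮", "沙发"}, 30),
    ({"简约", "小户型"}, 50),
]

sales_by_word = defaultdict(int)
for tokens, sales in rows:
    for word in tokens:
        # Every word in the title is credited with that product's sales
        sales_by_word[word] += sales

print(sales_by_word["简约"], sales_by_word["沙发"])
```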

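# Visualize the data
df_word_sum.sort_values('w_s_sum',
                        inplace=True,
                        ascending=True)
# After the ascending sort, the last 30 rows are the 30 largest totals
df_w_s = df_word_sum.tail(30)

from matplotlib import font_manager
from matplotlib import pyplot as plt
# Font with Chinese glyphs, needed to render the words on the axis
font = font_manager.FontProperties(fname='./data/simhei.ttf')

index = np.arange(df_w_s.word.size)

plt.figure(figsize=(6, 12))
plt.barh(index,
         df_w_s.w_s_sum,
         color='blue',
         align='center',
         alpha=0.8)

plt.yticks(index, list(df_w_s.word), fontproperties=font)

# Label each bar with its value
for y, x in zip(index, df_w_s.w_s_sum):
    plt.text(x, y, "%.0f" % x, ha='left', va='center')

plt.show()

(Figure: horizontal bar chart of total sales for the top 30 keywords)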
Price distribution of the products
Only products priced below 20000 are analyzed.

data.head()

data_p = data[data['view_price'] < 20000]

plt.figure(figsize=(7, 5))
plt.hist(data_p['view_price'], bins=15, color='purple')
plt.xlabel('Price', fontproperties=font)
plt.ylabel('Number of products', fontproperties=font)
plt.title('Distribution of products by price', fontproperties=font)
plt.show()

(Figure: histogram of product prices)
From the chart: the number of products falls in steps as the price rises; the higher the price, the fewer products are on sale.
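To see the numbers behind such a histogram, `numpy.histogram` returns the bin counts and bin edges directly. The prices below are invented; only the 20000 cutoff comes from the post.

```python
import numpy as np

# Invented prices, for illustration only.
prices = np.array([99, 150, 480, 520, 1200, 2600, 3900, 7500, 15000, 26000])
under_20k = prices[prices < 20000]  # same cutoff as in the post

# Three equal-width bins between the smallest and largest remaining price
counts, edges = np.histogram(under_20k, bins=3)
print(counts, edges)
```

In this toy data most values land in the lowest bin, the same kind of step-down shape described for the real chart.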
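Sales distribution of the products
Likewise, to make the chart easier to read, only products with sales above 100 (and below 2000) are plotted.

data_s = data[(data['sales'] > 100) & (data['sales'] < 2000)]
print('Share of products with sales between 100 and 2000: %0.3f' % (len(data_s) / len(data)))

plt.figure(figsize=(7, 5))
plt.hist(data_s['sales'], bins=20, color='blue')
plt.xlabel('Sales', fontproperties=font)
plt.ylabel('Number of products', fontproperties=font)
plt.title('Distribution of products by sales', fontproperties=font)
plt.show()

Share of products with sales between 100 and 2000: 0.271

(Figure: histogram of product sales)
From the chart:
1. Products with sales above 100 make up about 27.1% of all products (the 0.271 printed above); among them the 100-200 range is the largest, followed by 200-300.
2. Between 100 and 500, the number of products drops steeply as sales increase; low-sales products are the majority.
3. Very few products have sales above 750.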
Average sales by price range
data['price'] = data.view_price.astype('int')
# Split the prices into 12 quantile-based bins of roughly equal size
data['group'] = pd.qcut(data.price, 12)
df_group = data['group'].value_counts().reset_index()

# Mean sales within each price bin
df_s_g = data[
    ['sales', 'group']
].groupby('group').mean().reset_index()

df_s_g

index = np.arange(df_s_g['group'].size)
plt.figure(figsize=(8, 4))
plt.bar(index, df_s_g.sales, color='blue')
plt.xticks(index, df_s_g['group'].astype(str), fontproperties=font, rotation=30)
plt.xlabel('Group')
plt.ylabel('mean_sales')
plt.title('Average sales by price range', fontproperties=font)
plt.show()

(Figure: bar chart of average sales per price range)
From the chart:
1. Products priced between 40 and 490 have the highest average sales, followed by those between 888 and 1198; products priced above 7290 yuan have the lowest.
2. Overall, average sales trend downward as the price rises.
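A toy run shows what the quantile binning does. The eight rows below are invented. With four bins, `pd.qcut` puts two rows in each, and the mean is then taken per bin.

```python
import pandas as pd

# Invented prices and sales, for illustration only.
df = pd.DataFrame({
    "price": [50, 80, 120, 300, 450, 900, 1500, 8000],
    "sales": [400, 300, 350, 200, 150, 90, 40, 5],
})

# qcut cuts on quantiles, so every bin holds the same number of rows here.
df["group"] = pd.qcut(df["price"], 4)
mean_sales = df.groupby("group", observed=True)["sales"].mean()
print(mean_sales)
```

Equal-count bins are why the interval widths in the post are so uneven: cheap products are dense, so their bins are narrow, while the last bin spans everything expensive.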
Effect of price on sales
data_p

fig, ax = plt.subplots()
ax.scatter(data_p['view_price'], data_p['sales'], color='red')
ax.set_xlabel('Price')
ax.set_ylabel('Sales')
ax.set_title('Effect of product price on sales', fontproperties=font)
plt.show()
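Effect of price on sales revenue
# GMV (sales revenue) = price x units sold
data['GMV'] = data['price'] * data['sales']
import seaborn as sns
sns.regplot(x='price', y='GMV', data=data, color='purple')
plt.show()

From the chart:

Overall trend: the fitted linear regression line shows that sales revenue rises as the price increases.

Most products have a low price and, correspondingly, low sales revenue.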
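By default regplot draws the scatter points together with a fitted straight line. A comparable line can be computed by hand with a degree-1 least-squares fit in NumPy, which gives a slope that can be read off directly. The numbers below are invented and are not taken from the scraped data.

```python
import numpy as np

# Invented numbers, for illustration only.
price = np.array([100.0, 200.0, 300.0, 400.0, 500.0])
sales = np.array([50.0, 40.0, 35.0, 20.0, 18.0])
gmv = price * sales  # sales revenue = price x units sold

# Degree-1 least-squares fit: gmv is approximated by slope * price + intercept
slope, intercept = np.polyfit(price, gmv, 1)
print(round(slope, 2), round(intercept, 2))
```

A positive slope corresponds to the upward trend described above, even though sales themselves fall as the price rises in this toy data.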

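Number of products by province
plt.figure(figsize=(8, 4))
data.province.value_counts().plot(kind='bar', color='blue')
plt.xticks(rotation=0, fontproperties=font)
plt.xlabel('Province', fontproperties=font)
plt.ylabel('Count', fontproperties=font)
plt.title('Number of products by province', fontproperties=font)
plt.show()
From the chart:

Guangdong has the most products, followed by Zhejiang and then Jiangsu. Guangdong's count far exceeds that of Jiangsu, Zhejiang, Shanghai, and other regions, which suggests that Guangdong shops dominate the sofa subcategory.

The counts for Jiangsu, Zhejiang, and Shanghai differ little and are roughly on par.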
————————————————
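Copyright notice: this is an original article by CSDN blogger 「秦景坤」, published under the CC 4.0 BY-SA license. When reposting, please include the original source link and this notice.
Original link: https://blog.csdn.net/qjk19940101/article/details/79593381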
