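Sample: daily prices for 150,000 sampled products each from JD.com, Taobao, and Tmall
Goal: analyze price fluctuation trends and their correlations
Tools: CentOS 7 (two machines) + Windows 7 + Python + PyMySQL + MySQL + PyCharm + Navicat for MySQL
Environment setup
- Quick MySQL install on CentOS 7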
wget http://dev.mysql.com/get/mysql-community-release-el7-5.noarch.rpm
rpm -ivh mysql-community-release-el7-5.noarch.rpm
yum install mysql-community-server
•After a successful install, restart the MySQL service
service mysqld restart
•On a fresh install the MySQL root account has no password
•To set one:
mysql -uroot
mysql> set password for 'root'@'localhost' = password('mypasswd');
•Grant remote access to MySQL
GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY 'mypassword' WITH GRANT OPTION;
FLUSH PRIVILEGES;
mysql> exit
CentOS 7 ships with Python 2.x by default; install the gcc build packages before upgrading to 3.x
- Installing Python 3.5 on CentOS 7
•Install the dependencies Python 3.5 may need
yum install openssl-devel bzip2-devel expat-devel gdbm-devel readline-devel sqlite-devel
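•Find the download link on the official Python site and fetch it with wget
wget https://www.python.org/ftp/python/3.5.1/Python-3.5.1.tgz
•Unpack the tgz archive
tar -zxvf Python-3.5.1.tgz
•Move the source tree under /usr/local
mv Python-3.5.1 /usr/local
•Enter the source directory
cd /usr/local/Python-3.5.1/
•Configure
./configure
•Compile
make
•Install
make install
•Remove the old symlink and create a new one pointing to the new Python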
rm -rf /usr/bin/python
ln -s /usr/local/bin/python3.5 /usr/bin/python
python -V
Package names differ between Python 2.x and Python 3.x
- Install the Python packages needed for scraping
pip install requests
pip install PyMySQL
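If pip starts failing after the upgrade, download the two packages below from GitHub and install each one manually with python setup.py install (or, depending on the error, fix the default Python path that yum uses)
requests package https://github.com/kennethreitz/requests
PyMySQL package https://github.com/PyMySQL/PyMySQL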
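JD.com data
•Product IDs with 6 or 7 digits are JD first-party (self-operated) non-book products.
•Product IDs with 8 digits are JD first-party book products.
•Product IDs with 10 digits are third-party seller products.
Product data analysis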
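The ID rules above can be sketched as a small helper. The function name classify_jd_sku and the returned labels are made up for illustration, and the rules are only the ones listed here, so any other length is reported as unknown.

```python
def classify_jd_sku(pid):
    """Classify a JD product ID by its digit count, per the rules listed above."""
    digits = str(pid)
    if not digits.isdigit():
        raise ValueError("product ID must be numeric: {!r}".format(pid))
    n = len(digits)
    if n in (6, 7):
        return "first_party_non_book"
    if n == 8:
        return "first_party_book"
    if n == 10:
        return "third_party"
    # Lengths not covered by the rules above
    return "unknown"


if __name__ == "__main__":
    for pid in (500000, 1234567, 11223344, 1234567890, 42):
        print(pid, classify_jd_sku(pid))
```

A helper like this makes it easy to split the sample into first-party and third-party groups later on.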
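import re
import urllib.request

import requests

# iscz checks whether the product page exists (pid is the product ID being checked)
pid = str(pid)
url = r'http://item.jd.com/{}.html'.format(pid)
req = requests.get(url)
req.encoding = 'gb2312'
iscz = re.search('class="dt">(.*?)</div>', req.text).group(1).replace('\n', '')
# nprice checks whether the product has been delisted (delisted products are priced at -1.00)
url = 'http://p.3.cn/prices/get?skuid=J_' + str(pid)
html = urllib.request.urlopen(url).read().decode('utf-8')
nprice = re.search(r'"p":"(.*?)"', html).group(1).replace('\n', '')
if nprice != '-1.00':
    pass  # still on sale: the product is written to the index table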
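The price check above pulls the "p" field out with a regex. As an alternative, here is a minimal sketch that parses the response with the json module instead. It assumes the endpoint returns a JSON array of objects carrying a string "p" field, which is only inferred from the regex; the helper names parse_price and is_listed and the sample payloads are hypothetical.

```python
import json

DELISTED_PRICE = "-1.00"  # price the article reports for delisted products


def parse_price(payload):
    """Return the "p" field of the first entry in the response, or None."""
    try:
        items = json.loads(payload)
    except ValueError:  # malformed or empty response
        return None
    if isinstance(items, dict):
        items = [items]
    if not isinstance(items, list) or not items:
        return None
    first = items[0]
    if not isinstance(first, dict):
        return None
    return first.get("p")


def is_listed(payload):
    """True when a price is present and is not the delisted marker."""
    price = parse_price(payload)
    return price is not None and price != DELISTED_PRICE


if __name__ == "__main__":
    # Hypothetical payloads shaped after the regex used in the article
    print(is_listed('[{"id": "J_500001", "p": "99.00"}]'))
    print(is_listed('[{"id": "J_500002", "p": "-1.00"}]'))
```

Returning None for an unparseable response keeps "no answer" separate from "delisted", so a temporary network error does not knock a product out of the index.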
# Build the index of products that are still on sale
import pymysql.cursors
config = {
    'host': 'localhost',
    'port': 3306,
    'user': 'root',
    'password': '*********',
    'db': 'database',
    'charset': 'utf8',
    'cursorclass': pymysql.cursors.DictCursor,
}
connection = pymysql.connect(**config)
for i in range(500000, 1000000):
    try:
        # getjd checks whether the product is on sale and writes it to the table
        rt = getjd(i)
    except Exception:
        # Skip IDs whose page or price cannot be fetched or parsed
        continue
connection.close()
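- Create the product index table and the product price detail table
# Generate index data
try:
    with connection.cursor() as cursor:
        # Insert one index row
        sql = "INSERT INTO `data_index_01` (`gid`,`flag`) VALUES (%s,%s)"
        cursor.execute(sql, (pid, '1'))
    connection.commit()
finally:
    # Note: returning from finally also discards any exception raised by the insert
    return pid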
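The life cycle of an index row (insert with flag '1', set to '0' once the product disappears, select the active rows) can be tried without a MySQL server. The sketch below uses the standard-library sqlite3 module as a stand-in; the column types, the primary key on gid, and the helper names are assumptions, since the article does not show the table definition. Note that sqlite3 uses ? placeholders where PyMySQL uses %s.

```python
import sqlite3


def create_index_table(conn):
    # Assumed schema: the article only names the gid and flag columns
    conn.execute(
        "CREATE TABLE IF NOT EXISTS data_index_01 ("
        " gid TEXT PRIMARY KEY,"
        " flag TEXT NOT NULL DEFAULT '1')"
    )


def add_to_index(conn, pid):
    """Insert pid with flag '1'; an already indexed pid is left untouched."""
    conn.execute(
        "INSERT OR IGNORE INTO data_index_01 (gid, flag) VALUES (?, ?)",
        (str(pid), "1"),
    )
    conn.commit()


def mark_inactive(conn, pid):
    """Set flag to '0' for a product that can no longer be scraped."""
    conn.execute("UPDATE data_index_01 SET flag = '0' WHERE gid = ?", (str(pid),))
    conn.commit()


def active_ids(conn):
    """Return the gids still flagged '1', sorted for stable output."""
    rows = conn.execute(
        "SELECT gid FROM data_index_01 WHERE flag = '1' ORDER BY gid"
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_index_table(conn)
    for pid in (500001, 500002, 500001):
        add_to_index(conn, pid)
    mark_inactive(conn, 500001)
    print(active_ids(conn))
    conn.close()
```

Passing values as query parameters rather than concatenating them into the SQL string avoids quoting mistakes in the UPDATE statement.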
Sample 150,000 records from the index table (first-party and third-party products)
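One way to draw that sample is sketched below, assuming the active gids have already been read into a list. The function name sample_ids and the seed parameter are illustrative; the article does not say how the sample was actually drawn.

```python
import random


def sample_ids(ids, k=150000, seed=None):
    """Return k IDs drawn without replacement, or all IDs if fewer than k exist."""
    pool = list(ids)
    if k >= len(pool):
        return pool
    # A dedicated Random instance keeps the draw reproducible for a given seed
    rng = random.Random(seed)
    return rng.sample(pool, k)


if __name__ == "__main__":
    picked = sample_ids(range(500000, 1000000), k=150000, seed=2016)
    print(len(picked), len(set(picked)))
```

Fixing the seed means the same products are tracked every day, which is what a daily price series needs.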
# Timer that runs the scraping script once a day at 10:10:10
import time

while True:
    current_time = time.localtime(time.time())
    if current_time.tm_hour == 10 and current_time.tm_min == 10 and current_time.tm_sec == 10:
        try:
            connection = pymysql.connect(**config)
            with connection.cursor() as cursor:
                # Read every gid that is still flagged as on sale
                sql = "SELECT `gid` FROM `data_index_01` WHERE `flag`='1'"
                cursor.execute(sql)
                result = cursor.fetchall()
            connection.close()
            connection = pymysql.connect(**config)
            for element in result:
                try:
                    print(element['gid'])
                    getjd(element['gid'])
                except Exception:
                    # Flag products that can no longer be scraped as inactive
                    sql = "UPDATE `data_index_01` SET `flag`='0' WHERE `gid`=%s"
                    with connection.cursor() as cursor:
                        cursor.execute(sql, (element['gid'],))
                    connection.commit()
        finally:
            connection.close()
    time.sleep(1)
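Polling once per second works, but the 10:10:10 match can be missed if an iteration happens to run long. A possible alternative is to compute how long to sleep until the next run. The helper name seconds_until is hypothetical and this is only a sketch of the idea, not the script used here.

```python
from datetime import datetime, timedelta


def seconds_until(hour, minute, second, now=None):
    """Seconds from now until the next occurrence of hour:minute:second."""
    now = now or datetime.now()
    target = now.replace(hour=hour, minute=minute, second=second, microsecond=0)
    if target <= now:
        # Today's slot has already passed (or is right now): use tomorrow's
        target += timedelta(days=1)
    return (target - now).total_seconds()


if __name__ == "__main__":
    print(seconds_until(10, 10, 10))
```

The main loop could then call time.sleep(seconds_until(10, 10, 10)) and run the scrape once after each sleep.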
Start the crawler
Data scraped for the sample drawn from the index table
Discussion welcome (to be continued)