Python Web Scraping Basics, Part 06: Crawler Examples (Taobao Products + Stock Data)

张小北哈哈

Category: Python web scraping

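1. Example: a targeted price-comparison crawler for Taobao products

Goal: fetch Taobao search result pages and extract the product names and prices from them.

Key points to understand: Taobao's search interface

How pagination is handled

Technical approach: requests + re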

Search for 书包 (schoolbag), first page: https://s.taobao.com/search?q=%E4%B9%A6%E5%8C%85&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306

Second page: https://s.taobao.com/search?q=%E4%B9%A6%E5%8C%85&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306&bcoffset=3&ntoffset=3&p4ppushleft=1%2C48&s=44 (s=44, where 44 is the index of the first product on the second page)

Third page: https://s.taobao.com/search?q=%E4%B9%A6%E5%8C%85&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306&bcoffset=0&ntoffset=6&p4ppushleft=1%2C48&s=88 (s=88)
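
The pagination pattern in these URLs (44 products per page, with the starting index passed in `s`) can be sketched as a small URL builder. This is a minimal sketch: the helper name `build_search_url` and the constant `ITEMS_PER_PAGE` are illustrative, and it assumes that `q` and `s` are the only parameters that matter, which the URLs above suggest but do not prove.

```python
from urllib.parse import urlencode

# Taken from the URLs above: page 2 starts at s=44 and page 3 at s=88.
ITEMS_PER_PAGE = 44


def build_search_url(keyword, page):
    """Return the search URL for a zero-based page number (hypothetical helper)."""
    params = {"q": keyword, "s": ITEMS_PER_PAGE * page}
    # urlencode percent-encodes the UTF-8 bytes of the keyword.
    return "https://s.taobao.com/search?" + urlencode(params)


for page in range(3):
    print(build_search_url("书包", page))
```

Letting `urlencode` handle the keyword avoids putting raw non-ASCII characters into the URL by hand.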

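Program structure

Step 1: submit the product search request and fetch the result pages in a loop

Step 2: for each page, extract the product names and prices

Step 3: print the results to the screen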

Writing the example. The two regular expressions at the core of the parser are:

plt=re.findall(r'\"view_price\"\:\"[\d\.]*\"',html)

tlt=re.findall(r'\"raw_title\"\:\".*?\"',html)
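
To see what these two patterns pick up, here is a minimal sketch run against a made-up fragment that imitates the `"raw_title"` and `"view_price"` fields embedded in a result page. The sample string and the helper name `extract_items` are assumptions for illustration. This variant also uses capture groups, so the values come back without the surrounding quotes and no `eval` is needed; it is a variation on the patterns above, not the exact code of the article.

```python
import re

# Made-up fragment imitating the data embedded in a search result page.
sample = (
    '{"raw_title":"Canvas school bag","view_price":"129.00"},'
    '{"raw_title":"Kids backpack","view_price":"59.90"}'
)


def extract_items(html):
    """Return (price, title) pairs found in the page text (hypothetical helper)."""
    prices = re.findall(r'"view_price":"([\d.]*)"', html)
    titles = re.findall(r'"raw_title":"(.*?)"', html)
    # zip stops at the shorter list, so a missing title cannot raise IndexError.
    return list(zip(prices, titles))


print(extract_items(sample))
```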

import re

import requests

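def getHTMLText(url):
    # Fetch a page and return its text; return an empty string if the request fails.
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        # Guess the encoding from the content instead of trusting the response headers.
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""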

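def parsePage(ilt, html):
    # Find every price and title in the page text and append [price, title] pairs to ilt.
    plt = re.findall(r'"view_price":"[\d.]*"', html)
    tlt = re.findall(r'"raw_title":".*?"', html)
    # zip stops at the shorter list, so mismatched counts cannot raise IndexError.
    for p, t in zip(plt, tlt):
        # Each match looks like "view_price":"129.00": keep the part after the
        # first colon and strip the quotes (safer than eval on scraped text).
        price = p.split(':', 1)[1].strip('"')
        title = t.split(':', 1)[1].strip('"')
        ilt.append([price, title])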

def printGoodList(ilt):
    # Print the collected products as a table: index, price, product name.
    tplt = "{:4}\t{:8}\t{:16}"
    print(tplt.format("No.", "Price", "Product name"))
    for count, g in enumerate(ilt, start=1):
        print(tplt.format(count, g[0], g[1]))

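def main():
    goods = "书包"  # search keyword ("schoolbag")
    depth = 2  # number of result pages to crawl
    start_url = 'https://s.taobao.com/search?q=' + goods
    infoList = []
    for i in range(depth):
        try:
            # Each page holds 44 products, so page i starts at index 44*i.
            url = start_url + '&s=' + str(44 * i)
            # Note: the site may require a logged-in session; in that case the
            # returned page contains no matching items and the list stays empty.
            html = getHTMLText(url)
            parsePage(infoList, html)
        except Exception:
            continue
    printGoodList(infoList)

main()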

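2. Example: a targeted stock-data crawler

Goal: collect the names and trading information of all stocks listed on the Shanghai and Shenzhen stock exchanges

Output: saved to a file

Technical approach: requests + bs4 + re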

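Choosing a candidate data source

Sina Stocks: https://finance.sina.com.cn/stock/

Baidu Stocks: https://gupiao.baidu.com/stock/

Selection criteria: the stock information is present statically in the HTML page, is not generated by JavaScript, and is not restricted by the Robots protocol (robots.txt)

How to check: the browser's F12 developer tools, viewing the page source, and so on

Mindset: do not get stuck on a single site; try several sources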
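
The two selection criteria above can also be checked from code. The sketch below uses the standard library's `urllib.robotparser` on robots.txt lines that are already in hand, plus a simple substring test on the raw HTML. The helper names `is_allowed` and `has_static_data`, the example.com URLs, and the sample rules are all made up for illustration; a real check would download the site's own robots.txt first.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, url, user_agent="*"):
    """Check a URL against robots.txt rules given as a list of lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


def has_static_data(html, marker):
    """Return True if the marker text is already in the raw HTML.

    If a value visible in the browser is missing from the raw HTML,
    it is probably filled in later by JavaScript.
    """
    return marker in html


# Made-up rules and pages, for illustration only.
example_robots = ["User-agent: *", "Disallow: /private/"]
print(is_allowed(example_robots, "https://example.com/stock/sh600000.html"))
print(is_allowed(example_robots, "https://example.com/private/data.html"))
print(has_static_data("<dt>Open</dt><dd>10.50</dd>", "10.50"))
```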

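Program structure

Step 1: get the list of stocks from East Money (东方财富网)

Step 2: for each stock in the list, fetch its details from Baidu Stocks

Step 3: store the results in a file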

import requests

from bs4 import BeautifulSoup

import traceback

import re

# Fetch the page at the given URL; return an empty string if the request fails
def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""

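# Build the list of stock codes
# lst: list that receives the stock codes found on the page
# stockURL: URL of the page that lists all stocks
def getStockList(lst, stockURL):
    html = getHTMLText(stockURL)
    soup = BeautifulSoup(html, 'html.parser')
    a = soup.find_all('a')
    for i in a:
        try:
            href = i.attrs['href']
            # A stock code is "sh" or "sz" followed by six digits.
            lst.append(re.findall(r"s[hz]\d{6}", href)[0])
        except (KeyError, IndexError):
            # Skip links that have no href or no stock code in it.
            continue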
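
The code pattern used in getStockList can be tried on its own. A minimal sketch, assuming a few made-up link targets; the sample hrefs and the helper name `extract_codes` are illustrative and not taken from the real listing page.

```python
import re

# "sh" (Shanghai) or "sz" (Shenzhen) followed by exactly six digits.
STOCK_CODE = re.compile(r"s[hz]\d{6}")


def extract_codes(hrefs):
    """Return the first stock code found in each href, skipping hrefs without one."""
    codes = []
    for href in hrefs:
        match = STOCK_CODE.search(href)
        if match:
            codes.append(match.group(0))
    return codes


# Made-up link targets, for illustration only.
sample_hrefs = [
    "http://quote.eastmoney.com/sh600000.html",
    "http://quote.eastmoney.com/sz000001.html",
    "http://quote.eastmoney.com/center/",
]
print(extract_codes(sample_hrefs))
```

Using `search` and testing the result avoids relying on an exception to skip links that contain no code.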

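# Fetch the details of each stock and append them to a file
# Parameters: 1. lst, the list of stock codes; 2. stockURL, the base URL of the stock detail pages; 3. fpath, the path of the output file
def getStockInfo(lst, stockURL, fpath):
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html == "":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div', attrs={'class': 'stock-bets'})
            name = stockInfo.find_all(attrs={'class': 'bets-name'})[0]
            # The key '股票名称' means "stock name".
            infoDict.update({'股票名称': name.text.split()[0]})
            # Field names are in <dt> tags and their values in the matching <dd> tags.
            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val
            # Append one dictionary per line so earlier results are kept.
            with open(fpath, 'a', encoding='utf-8') as f:
                f.write(str(infoDict) + '\n')
        except Exception:
            # Print the error and move on to the next stock.
            traceback.print_exc()
            continue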

def main():
    # These are the URLs used in the article; they may no longer be available.
    stock_list_url = 'http://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'
    output_file = 'D://BaiduStockInfo.txt'
    slist = []
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)

main()
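
Because each stock is written to the file as the `str()` of a dictionary, one per line, the output can be read back with `ast.literal_eval`. A minimal sketch, assuming the file has exactly that format; the helper name `load_stock_records` and the demo record are made up for illustration.

```python
import ast
import os
import tempfile


def load_stock_records(path):
    """Read a file with one str(dict) per line back into a list of dictionaries."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                # literal_eval only accepts Python literals, so it is safe on file content.
                records.append(ast.literal_eval(line))
    return records


# Round trip with a made-up record, written the same way as in getStockInfo.
demo = {"股票名称": "示例股票", "今开": "10.50"}
path = os.path.join(tempfile.mkdtemp(), "demo_stock_info.txt")
with open(path, "a", encoding="utf-8") as f:
    f.write(str(demo) + "\n")

print(load_stock_records(path))
```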
