Scraping JD's Huawei P50 detail pages: pagination and saving the data to a file
1. Get the page URL
As usual, start by fetching the page source:
import requests

url = 'https://search.jd.com/Search?keyword=华为p50'
headers = {
    # A desktop User-Agent so JD serves the normal search page
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36"
}
resp = requests.get(url, headers=headers)
print(resp.text)
Once we have the source, we need to sift through it and extract the data we want. Each product block sits inside an <li class="gl-item"> element, so we can use BeautifulSoup to pull out the fields we need:
1. Install the library: pip install beautifulsoup4
2. Import it: from bs4 import BeautifulSoup
3. Parse the HTML and collect every <li class="gl-item"> block
4. Run it; the output contains the data we need.
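The parse step can be sketched on a tiny hand-made mock of JD's markup (the snippet below is illustrative only; the class names match what F12 shows, but a real page is far larger):

```python
from bs4 import BeautifulSoup

# Minimal mock of one search-result item; prices and SKU are made up.
html = '''
<ul>
  <li class="gl-item" data-sku="100009077475">
    <div class="p-price"><strong><em>¥</em><i>4488.00</i></strong></div>
    <div class="p-name p-name-type-2">
      <a href="//item.jd.com/100009077475.html"><em>HUAWEI P50</em></a>
    </div>
    <div class="p-shop"><a>华为京东自营官方旗舰店</a></div>
  </li>
</ul>
'''

soup = BeautifulSoup(html, 'html.parser')       # 'lxml' also works if installed
goods_list = soup.find_all('li', class_='gl-item')
print(len(goods_list))                           # → 1
print(goods_list[0].find(class_='p-price').find('i').get_text())  # → 4488.00
```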
The text still needs narrowing down. To pull out the price, press F12 and inspect the page: the price sits in an element with class "p-price":
price = li.find(class_="p-price").get_text()
Running it prints the price text for each item. In the same way, the product ID, name, URL, and shop can be extracted. The code:
for li in goods_list:
    price = li.find(class_="p-price").find('i').get_text()                  # price
    name = li.find(class_="p-name p-name-type-2").find('em').get_text()     # product name
    detail_addr = li.find(class_='p-name p-name-type-2').find('a')['href']  # product URL
    shop = li.find(class_='p-shop').find('a').get_text()                    # shop name
    item = (price, name, detail_addr, shop)  # avoid shadowing the built-in `tuple`
    print(item)
Output:
Pagination: in the previous post we saw how JD pages its search results. Wrap the fetch in a function that appends the paging parameters:
def get_html(url, currentPage, pageSize):
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36"
    }
    print("Fetching page " + str(currentPage))
    if currentPage != 1:
        # From page 2 on, append JD's paging parameters
        url = url + '&page=' + str(currentPage) + '&s=' + str(pageSize) + '&click=0'
    resp = requests.get(url, headers=headers)
    if resp.status_code == 200:
        return resp.text
    else:
        print("Request failed!")
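The URL-building branch can be checked in isolation without hitting the network; build_page_url is a hypothetical helper that mirrors the logic inside get_html:

```python
def build_page_url(base_url, current_page, page_size):
    # Page 1 uses the bare search URL; later pages append JD's
    # &page= and &s= (offset) parameters, as in get_html above.
    if current_page == 1:
        return base_url
    return base_url + '&page=' + str(current_page) + '&s=' + str(page_size) + '&click=0'

base = 'https://search.jd.com/Search?keyword=华为p50&enc=utf-8'
print(build_page_url(base, 1, 0))   # page 1: unchanged URL
print(build_page_url(base, 2, 60))  # page 2: ...&page=2&s=60&click=0
```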
_current_row = 0  # row counter
keyword = '华为p50'
url = 'https://search.jd.com/Search?keyword=' + keyword + '&enc=utf-8'
total = input('Number of pages to scrape: ')
page = {
    'total': 0,        # total pages
    'currentPage': 1,  # current page
    'pageSize': 0      # items per page
}
The main program:
page['total'] = int(total)  # int() is safer than eval() on user input
for i in range(page['total']):
    html = get_html(url, page['currentPage'], page['currentPage'] * page['pageSize'])
Full source code:
import requests
from bs4 import BeautifulSoup


def get_html(url, currentPage, pageSize):
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36"
    }
    print("Fetching page " + str(currentPage))
    if currentPage != 1:
        # From page 2 on, append JD's paging parameters
        url = url + '&page=' + str(currentPage) + '&s=' + str(pageSize) + '&click=0'
    resp = requests.get(url, headers=headers)
    if resp.status_code == 200:
        return resp.text
    else:
        print("Request failed!")


if __name__ == '__main__':
    _current_row = 0  # row counter
    keyword = '华为p50'
    url = 'https://search.jd.com/Search?keyword=' + keyword + '&enc=utf-8'
    total = input('Number of pages to scrape: ')
    page = {
        'total': 0,        # total pages
        'currentPage': 1,  # current page
        'pageSize': 0      # items per page
    }
    page['total'] = int(total)  # int() is safer than eval() on user input
    for i in range(page['total']):
        html = get_html(url, page['currentPage'], page['currentPage'] * page['pageSize'])
        soup = BeautifulSoup(html, 'lxml')
        goods_list = soup.find_all('li', class_='gl-item')
        print("Page " + str(page['currentPage']) + " has " + str(len(goods_list)) + " products")
        for li in goods_list:
            _current_row += 1  # row counter
            no = li['data-sku']  # product ID
            price = li.find(class_="p-price").find('i').get_text()                  # price
            name = li.find(class_="p-name p-name-type-2").find('em').get_text()     # product name
            detail_addr = li.find(class_='p-name p-name-type-2').find('a')['href']  # product URL
            shop = li.find(class_='p-shop').find('a').get_text()                    # shop name
            item = (price, name, detail_addr, shop)  # avoid shadowing the built-in `tuple`
            print(item)
        page['currentPage'] = page['currentPage'] + 1
        page['pageSize'] = len(goods_list) * page['currentPage']
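The title promises saving the data to a file, but the loop above only prints. A minimal sketch using the standard csv module (the file name jd_p50.csv and the sample row are illustrative; in the real script you would collect each `item` tuple inside the loop and write them all at the end):

```python
import csv

# Hypothetical rows, in the same (price, name, url, shop) order the loop prints
rows = [
    ('4488.00', 'HUAWEI P50', '//item.jd.com/100009077475.html', '华为京东自营官方旗舰店'),
]

# utf-8-sig adds a BOM so Excel displays the Chinese shop names correctly
with open('jd_p50.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    writer.writerow(['price', 'name', 'url', 'shop'])  # header row
    writer.writerows(rows)
```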