Scraping Dianping (大众点评) review data with Python, requests, and BeautifulSoup

住街对面的查理


Category: Python crawlers. Tags: crawler, python, requests, soup

Article link: https://blog.csdn.net/macrun28/article/details/79223551


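Special note: this post was written in February 2018, and Dianping has since changed its page logic. Please look for a more recently written scraping article. Thanks for your support.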

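A few words first. About four months have passed since my last post, and I have been busy with other things. Besides a new project going live at work and learning new technologies, your blogger fell in love, the heading-for-marriage kind, and has been promoted to the top of the programmer contempt chain. Emmmmm, all I want to say is: come on, hit me!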

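All right, this is a technical post. My company recently needed the reviews of several convenience store chains scraped from Dianping. It is a one-off requirement, so there is no need to build it as an API-style service. I went back to something I had learned before: Python's requests to fetch the pages and BeautifulSoup to process the fetched content.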

Convenience store chains: 7tt, today今天, and others in Wuhan.

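First, take a look:
[image]
https://www.dianping.com/search/keyword/16/0_7tt
https://www.dianping.com/search/keyword/16/0_today今天
These are the listing URLs for the two chains. Each is a fixed prefix followed by the chain's name.
The first step is to get the id of each shop and build that shop's detail link from it, for example http://www.dianping.com/shop/22711693
Clicking "更多点评" (more reviews) at the very bottom of the detail page leads to the page with all the reviews.
[image]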
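The article does not show how the shop ids were collected, so the following is only a minimal sketch of one way to do it. It assumes the search page contains links in the http://www.dianping.com/shop/<id> form quoted above; the function name extract_shop_ids and the sample markup are hypothetical.

```python
import re

# Matches the numeric id in links such as http://www.dianping.com/shop/22711693
SHOP_LINK = re.compile(r"dianping\.com/shop/(\d+)")


def extract_shop_ids(html):
    """Return the shop ids found in the HTML, in first-seen order, without duplicates."""
    return list(dict.fromkeys(SHOP_LINK.findall(html)))


# Hypothetical fragment of a search result page (not real Dianping markup)
sample = """
<a href="http://www.dianping.com/shop/22711693">branch A</a>
<a href="http://www.dianping.com/shop/22711693/review_all">reviews</a>
<a href="http://www.dianping.com/shop/24759450">branch B</a>
"""

print(extract_shop_ids(sample))
```

Removing duplicates matters here because the same shop is usually linked several times on one listing page.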

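So the final review page link is http://www.dianping.com/shop/22711693/review_all
Next, clicking a page number at the bottom changes the link: /p2 is appended to indicate the page.
http://www.dianping.com/shop/22711693/review_all/p2
So all the reviews can be traversed once the page numbers at the bottom are known.
How do we get the page numbers, then?
Open the browser developer tools: F12 on Windows, Option+Command+J on Mac.
[image]
There are 9 elements with class=PageLink in total, so adding 1 gives the value to loop over. The code is as follows: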

# i is the shop id; headers, cookies and proxies are defined in the full script at the end
url = "https://www.dianping.com/shop/%s/review_all" % i
r = requests.get(url, headers=headers, cookies=cookies, proxies=proxies)
soup = BeautifulSoup(r.text, 'lxml')
# Count the pagination links and add 1
lenth = len(soup.find_all(class_='PageLink')) + 1

The lenth obtained here is the number of pages to loop over for this shop.
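The URL scheme described above can be sketched as a small helper. This is an illustration only: review_page_urls and last_page are hypothetical names, and it assumes that pages are numbered from p1 up to the last page shown at the bottom and that /p1 serves the same content as the bare review_all link.

```python
def review_page_urls(shop_id, last_page):
    """Build the review page URLs p1..p<last_page> for one shop."""
    base = "http://www.dianping.com/shop/%s/review_all" % shop_id
    return ["%s/p%d" % (base, page) for page in range(1, last_page + 1)]


for url in review_page_urls(22711693, 3):
    print(url)
```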

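Next, how to get each review's user name, star rating, and review text from a page.
As the screenshot shows, the reviews sit in multiple li elements, so fetch the li elements first and then read the content under each one.
[image]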

# All <li> elements under the review list container
coment = soupIn.select('.reviews-items li')

Then iterate over the li elements:

for one in coment:
    # Skip <li class="item"> entries; an <li> without a class raises KeyError and is kept
    try:
        if one['class'][0] == 'item':
            continue
    except KeyError:
        pass
    # User name
    name = one.select_one('.main-review .dper-info .name')
    name = name.get_text().strip()
    # Star rating: the second class looks like sml-str40; the character at index 7 is the star count
    star = one.select_one('.main-review .review-rank span')
    star = star['class'][1][7:8]
    # Review text: overwrite the class attribute so that Hide is dropped
    pl = one.select_one('.main-review .review-words')
    pl['class'] = ['review-words']
    words = pl.get_text().strip()
    returnList.append([title, name, star, words])

The selector returns every li under class="reviews-items". Stepping through with breakpoints showed that excluding the ones with class="item" is enough, hence the check.

The user name is easy to get. The star rating is expressed through the class of a span: class="sml-str40" means 4 stars, so the class attribute has to be read and sliced.
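As an alternative to slicing by position, the class name can be parsed with a regular expression. This is a minimal sketch, not the article's method: parse_star is a hypothetical helper, and it assumes the numeric suffix is ten times the star value, which is what the sml-str40 = 4 stars example suggests.

```python
import re


def parse_star(class_name):
    """Turn a class name such as 'sml-str40' into a star value such as 4.0.

    Returns None when the class name does not follow the sml-str<digits> pattern.
    """
    match = re.fullmatch(r"sml-str(\d+)", class_name)
    if match is None:
        return None
    return int(match.group(1)) / 10


print(parse_star("sml-str40"))
```

Unlike the fixed slice, this returns None for an unexpected class name rather than an arbitrary character.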

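The most important part, the review text, sits behind an "expand review" button that toggles class="Hide". So the Hide class is dropped from the review div by assigning pl['class'] so that it contains only review-words.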
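The three extraction steps can be tried offline against a small HTML fragment. The sketch below is built on assumptions: the markup in SAMPLE is hypothetical and only mirrors the selectors used in this post, and parse_reviews is an illustrative name. BeautifulSoup does not evaluate CSS, so get_text() returns the text of the Hide div either way; the sketch therefore leaves the class untouched.

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped after the selectors in this post (not real Dianping HTML)
SAMPLE = """
<div class="reviews-items">
  <ul>
    <li>
      <div class="main-review">
        <div class="dper-info"><a class="name"> user_a </a></div>
        <div class="review-rank"><span class="sml-rank-stars sml-str40"></span></div>
        <div class="review-truncated-words">Nice shop...</div>
        <div class="review-words Hide">Nice shop, open all night.</div>
        <ul><li class="item">photo</li></ul>
      </div>
    </li>
  </ul>
</div>
"""


def parse_reviews(html):
    """Return (name, star, words) tuples for each review <li> in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for li in soup.select(".reviews-items li"):
        # Nested <li class="item"> entries are not reviews
        if "item" in li.get("class", []):
            continue
        name = li.select_one(".main-review .dper-info .name")
        star = li.select_one(".main-review .review-rank span")
        words = li.select_one(".main-review .review-words")
        if name is None or star is None or words is None:
            continue
        # star["class"] is a list such as ['sml-rank-stars', 'sml-str40']
        rows.append((name.get_text().strip(), star["class"][1][7:8], words.get_text().strip()))
    return rows


print(parse_reviews(SAMPLE))
```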

That is basically it. Store the rows in a list, then write them to a file or a database.
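For the file option, CSV is one convenient format because review text can contain commas and line breaks. The script in this post writes one field per line instead; the following is a minimal sketch of the CSV variant, with write_reviews and the column names as hypothetical choices.

```python
import csv
import io


def write_reviews(rows, fileobj):
    """Write [title, name, star, words] rows as CSV, with a header row."""
    writer = csv.writer(fileobj)
    writer.writerow(["title", "name", "star", "words"])
    writer.writerows(rows)


# An in-memory buffer stands in for a real file in this example
buffer = io.StringIO()
write_reviews([["shop", "user_a", "4", "Nice shop,\nopen all night."]], buffer)
print(buffer.getvalue())
```

With a real file, open it with open(path, "w", newline="", encoding="utf-8") so that the csv module controls the line endings.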

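Requests must carry the headers and cookies to get through. The cookies identify the visitor; some of their values have to be parsed out, and they carry timestamps, so they expire after a while. The Referer in headers tells the site which page you came from. Without a Referer, the site restricts further access after a few requests on suspicion of crawling.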
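The script in this post sends one fixed Referer for every shop. A small variation is to point the Referer at the detail page of the shop being scraped. This is a sketch under assumptions: build_headers is a hypothetical helper, and whether the site actually checks that the Referer matches the shop is not established here.

```python
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/535.19"
)


def build_headers(shop_id):
    """Return request headers whose Referer is the detail page of the given shop."""
    return {
        "Host": "www.dianping.com",
        "Referer": "http://www.dianping.com/shop/%s" % shop_id,
        "User-Agent": USER_AGENT,
        "Accept-Encoding": "gzip",
    }


print(build_headers(22711693)["Referer"])
```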

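In addition, too many requests from the same IP will get that IP banned, which is where proxies come in. In Python this is simple: pass the proxies parameter with the request, r = requests.get(url, headers=headers, cookies=cookies,proxies = proxies). For proxy IPs, I recommend the site http://www.data5u.com/, which lists 20 free ones at the bottom, generally enough for a small crawler. Using proxies brings its own problems, such as whether the proxy connection works at all, so add the following code to the program to configure retries and connection handling: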

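from requests.adapters import HTTPAdapter
s = requests.Session()
# Retry failed connections up to 5 times; requests must go through s (e.g. s.get) for this to apply
s.mount('http://', HTTPAdapter(max_retries=5))
s.mount('https://', HTTPAdapter(max_retries=5))
s.headers['Connection'] = 'close'  # ask the server not to keep the connection alive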
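Free proxies fail often, so an explicit retry loop around each request can help as well. The sketch below is a generic wrapper, not part of the original script: fetch_with_retries, delay, and retry_on are hypothetical names. It retries on OSError, which as far as I know also covers the exceptions raised by requests, since they derive from IOError.

```python
import time


def fetch_with_retries(fetch, url, retries=5, delay=1.0, retry_on=(OSError,)):
    """Call fetch(url), retrying up to `retries` times on the given exceptions."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(retries):
        try:
            return fetch(url)
        except retry_on as exc:
            last_error = exc
            # Wait before the next attempt, but not after the last one
            if attempt < retries - 1:
                time.sleep(delay)
    raise last_error


# Stand-in fetch function that fails twice before succeeding
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("proxy not reachable")
    return "ok"


print(fetch_with_retries(flaky_fetch, "http://www.dianping.com/shop/22711693/review_all", delay=0))
```

With requests, fetch could be lambda u: requests.get(u, headers=headers, cookies=cookies, proxies=proxies, timeout=10). The timeout argument keeps a dead proxy from hanging the crawl.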

The final result looks like this:
[image]

That is roughly it. The code is attached below.

Feel free to follow me on Weibo @住街对面的查理. My life is quite interesting, want to come take a look?

# coding=utf-8
import requests
from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter

# Dianping shop ids of the convenience stores to scrape
shop_ids = [22711693,24759450,69761921,69761921,22743334,66125712,22743270,57496584,75153221,57641884,66061653,70669333,57279088,24740739,66126129,
        75100027,92667587,92452007,72345827,90004047,90485109,90546031,83527455,91070982,83527745,94273474,80246564,83497073,69027373,96191554,
        96683472,90500524,92454863,92272204,70443082,96076068,91656438,75633029,96571687,97659144,69253863,98279207,90435377,70669359,96403354,
        83618952,81265224,77365611,74592526,90479676,56540304,37924067,27496773,56540319,32571869,43611843,58612870,22743340,67293664,67292945,
        57641749,75157068,58934198,75156610,59081304,75156647,75156702,67293838,]
returnList = []
proxies = {
    # "https": "http://14.215.177.73:80",
    "http": "http://202.108.2.42:80",
}
headers = {
    'Host': 'www.dianping.com',
    'Referer': 'http://www.dianping.com/shop/22711693',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/535.19',
    'Accept-Encoding': 'gzip'
}
cookies = {
    '_lxsdk_cuid': '16146a366a7c8-08cd0a57dad51b-32637402-fa000-16146a366a7c8',
    'lxsdk': '16146a366a7c8-08cd0a57dad51b-32637402-fa000-16146a366a7c8',
    '_hc.v': 'ec20d90c-0104-0677-bf24-391bdf00e2d4.1517308569',
    's_ViewType': '10',
    'cy': '16',
    'cye': 'wuhan',
    '_lx_utm': 'utm_source%3DBaidu%26utm_medium%3Dorganic',
    '_lxsdk_s': '1614abc132e-f84-b9c-2bc%7C%7C34'
}

# One session for all requests, retrying failed connections up to 5 times
s = requests.Session()
s.mount('http://', HTTPAdapter(max_retries=5))
s.mount('https://', HTTPAdapter(max_retries=5))
s.headers['Connection'] = 'close'  # ask the server not to keep the connection alive

for i in shop_ids:
    url = "https://www.dianping.com/shop/%s/review_all" % i
    r = s.get(url, headers=headers, cookies=cookies, proxies=proxies)
    soup = BeautifulSoup(r.text, 'lxml')

    # Count the pagination links and add 1
    lenth = len(soup.find_all(class_='PageLink')) + 1
    for j in range(lenth):
        urlIn = "http://www.dianping.com/shop/%s/review_all/p%s" % (i, j)
        rIn = s.get(urlIn, headers=headers, cookies=cookies, proxies=proxies)
        soupIn = BeautifulSoup(rIn.text, 'lxml')
        # The first 15 characters of the page title identify the shop
        title = soupIn.title.string[0:15]
        coment = soupIn.select('.reviews-items li')

        for one in coment:
            # Skip <li class="item"> entries; an <li> without a class raises KeyError and is kept
            try:
                if one['class'][0] == 'item':
                    continue
            except KeyError:
                pass
            name = one.select_one('.main-review .dper-info .name')
            name = name.get_text().strip()
            # The second class looks like sml-str40; the character at index 7 is the star count
            star = one.select_one('.main-review .review-rank span')
            star = star['class'][1][7:8]
            # Overwrite the class attribute so that Hide is dropped
            pl = one.select_one('.main-review .review-words')
            pl['class'] = ['review-words']
            words = pl.get_text().strip()
            returnList.append([title, name, star, words])

# Write one field per line: shop title, user name, star rating, review text
with open("/Users/huojian/Desktop/store_shop.sql", "w", encoding="utf-8") as f:
    for one in returnList:
        for field in one:
            f.write("\n")
            f.write(str(field))
        f.write("\n")