Tutorial: Crawling Mafengwo (蚂蜂窝) with Python


  I recently had to pick up some Python web-scraping skills for a project, and in this post I share the problems I hit and the lessons I learned along the way.
  Let's look at the final program output first:

  {
 "website": "<a href="http://www.somboonseafood.com/" target="_blank" rel="nofollow">http://www.somboonseafood.com/</a>", 
 "comment": [
  "进去里面已经人满为患,服务生来往都是急匆匆的。我们前面还有一桌外国人在等位子。好在等待的时间不长,很快我们被带到了二楼。

菜单上有中英文的翻译。我们除了必点的咖喱蟹,还点了腰果鸡肉,酸辣鱿鱼,芒果糯米饭和冬阴功汤。建兴比较好的是菜品都有小份的,适合2人吃的。

这顿饭具体花了多少泰铢不记得了,反正折合人民币二百多吧。他家不能拉卡,只能付现金哦~", 
  "http://b3-q.mafengwo.net/s8/M00/4B/D5/wKgBpVXxM4aAdrXbACreEebl8Ug36.jpeg?imageMogr2%2Fthumbnail%2F%21200x150r%2Fgravity%2FCenter%2Fcrop%2F%21200x150%2Fquality%2F90", 
  "http://a1-q.mafengwo.net/s8/M00/4B/E8/wKgBpVXxM5GAD-uAAAuCjK25BIo42.jpeg?imageMogr2%2Fthumbnail%2F%21200x150r%2Fgravity%2FCenter%2Fcrop%2F%21200x150%2Fquality%2F90", 
  "http://n3-q.mafengwo.net/s8/M00/4B/EC/wKgBpVXxM5KAMPIxAAz_DjXUweA78.jpeg?imageMogr2%2Fthumbnail%2F%21200x150r%2Fgravity%2FCenter%2Fcrop%2F%21200x150%2Fquality%2F90", 
  "我们四个人点了红油咖喱蟹,粉丝闷虾,炒含羞草,还有芒果汁,柠檬汁。咖喱蟹很好吃,炒的很香很入味,如果将那红油用来拌饭,味道肯定很赞;粉丝闷虾也不错,四个人吃刚刚好;含羞草就有点老了,除此之外还有个酱油蒸石斑鱼,按斤卖的,一条快一千多了,不过肉质很劲道,吃多来还能塞牙缝呢,真的很新鲜", 
  "http://a1-q.mafengwo.net/s8/M00/FD/32/wKgBpVXsL3eAb2oLAAs12tssU2Y97.jpeg?imageMogr2%2Fthumbnail%2F%21200x150r%2Fgravity%2FCenter%2Fcrop%2F%21200x150%2Fquality%2F90", 
  "http://c3-q.mafengwo.net/s8/M00/FD/3C/wKgBpVXsL4OAOf0zAAjok0qVt-406.jpeg?imageMogr2%2Fthumbnail%2F%21200x150r%2Fgravity%2FCenter%2Fcrop%2F%21200x150%2Fquality%2F90", 
  "http://c1-q.mafengwo.net/s8/M00/FD/46/wKgBpVXsL5CAT3kGAAn4-5VHSAg78.jpeg?imageMogr2%2Fthumbnail%2F%21200x150r%2Fgravity%2FCenter%2Fcrop%2F%21200x150%2Fquality%2F90", 
  "咖喱螃蟹不错,就是螃蟹少了鸡蛋多了哈哈哈,感觉最好吃的是我们随便点的虾子,炸得超级脆然后上面裹的粉好好吃。三个菜加一瓶矿泉水1000多株,感觉有点小贵,因为感觉没有传说中的那么那么好吃哈哈哈", 
  "http://a2-q.mafengwo.net/s8/M00/78/4D/wKgBpVXYk4yAV9M9ABim-ixW7lg98.jpeg?imageMogr2%2Fthumbnail%2F%21200x150r%2Fgravity%2FCenter%2Fcrop%2F%21200x150%2Fquality%2F90", 
  "http://a2-q.mafengwo.net/s8/M00/78/52/wKgBpVXYk5CAbjcKABwM1hcyCZU62.jpeg?imageMogr2%2Fthumbnail%2F%21200x150r%2Fgravity%2FCenter%2Fcrop%2F%21200x150%2Fquality%2F90", 
  "http://b2-q.mafengwo.net/s8/M00/78/57/wKgBpVXYk5SAcFR0ABsEFh2YADQ90.jpeg?imageMogr2%2Fthumbnail%2F%21200x150r%2Fgravity%2FCenter%2Fcrop%2F%21200x150%2Fquality%2F90", 
  "他们这边的咖喱跟我们平时吃的不一样,偏甜一点!", 
  "这一顿才化了1000B多点,这里是不能刷卡的,所以记得带好现金再去!", 
  "http://b1-q.mafengwo.net/s8/M00/FE/BC/wKgBpVXdJmiAE16jAAW6XZkem8k36.jpeg?imageMogr2%2Fthumbnail%2F%21200x150r%2Fgravity%2FCenter%2Fcrop%2F%21200x150%2Fquality%2F90", 
  "http://n3-q.mafengwo.net/s8/M00/FE/97/wKgBpVXdJlWAVXipAAGLzCC7YP400.jpeg?imageMogr2%2Fthumbnail%2F%21200x150r%2Fgravity%2FCenter%2Fcrop%2F%21200x150%2Fquality%2F90", 
  "http://n1-q.mafengwo.net/s8/M00/FE/DD/wKgBpVXdJoCAaODiAAbf-o77ojA68.jpeg?imageMogr2%2Fthumbnail%2F%21200x150r%2Fgravity%2FCenter%2Fcrop%2F%21200x150%2Fquality%2F90", 
  "这顿饭是在曼谷吃的最贵的一餐,总共705铢。这家餐馆的味道也没有想象中多惊艳啦,发现其实泰国随便一家路边的拍档做的泰国菜味道都可以的。", 
  "http://n2-q.mafengwo.net/s8/M00/14/52/wKgBpVXVzzKAS4rrAA2AHp_Mk3w39.jpeg?imageMogr2%2Fthumbnail%2F%21200x150r%2Fgravity%2FCenter%2Fcrop%2F%21200x150%2Fquality%2F90", 
  "http://a2-q.mafengwo.net/s8/M00/14/55/wKgBpVXVzzaAUAa8AAr8LPWGCSA46.jpeg?imageMogr2%2Fthumbnail%2F%21200x150r%2Fgravity%2FCenter%2Fcrop%2F%21200x150%2Fquality%2F90", 
  "http://c3-q.mafengwo.net/s8/M00/14/5A/wKgBpVXVzzmAd_D9AAuDW5yUmOI43.jpeg?imageMogr2%2Fthumbnail%2F%21200x150r%2Fgravity%2FCenter%2Fcrop%2F%21200x150%2Fquality%2F90", 
  "在各大攻略了声名显赫的他果然火爆,下午3点了还是排长龙!建议大家一定事先打电话预约哦!招牌菜咖喱蟹还行吧,总体上比其他泰餐还是强,但价格也确实不便宜。", 
  "不过个人觉得是又贵又没有特色,连姐妹说的好吃到炸的咖哩蟹我个人觉得也没有米特拉的好吃,还不如一株粥(接下来会提到)。反正不建议去,当然也可以尝试一下被宰的感觉。", 
  "http://n3-q.mafengwo.net/s8/M00/D0/5D/wKgBpVXKGSmAFS9sAAPmXxP4odQ66.jpeg?imageMogr2%2Fthumbnail%2F%21200x150r%2Fgravity%2FCenter%2Fcrop%2F%21200x150%2Fquality%2F90", 
  "建兴酒家的菜也还行,在泰国物价里感觉应该不算便宜,尤其都是海鲜对比在普吉岛吃过的东西,一个天上一个地下。

   泰国特色咖喱蟹,口味跟日本咖喱不同,椰浆味道比较重,两人吃少要点就行,配泰国香米。", 
  "http://a2-q.mafengwo.net/s8/M00/7E/32/wKgBpVXJz5qAZmKMAAJIskX2Kdw52.jpeg?imageMogr2%2Fthumbnail%2F%21200x150r%2Fgravity%2FCenter%2Fcrop%2F%21200x150%2Fquality%2F90", 
  "出发之前就知道建兴酒家很出名,可是一直以为离我们很远,不方便去吃。偶然发现原来SIAM站也有,但是位置真的很不好找,谷人希都找不到,已经被Siam Paragon,SiamCenter,Siam Discovery搞混乱了,当时已经饿的不行了,皇天不负有心人,终于还是找到它了,晚上六点多一点,门外已经两排凳子,排排坐了。

记住是SIAM SQUARE ONE,大家去之前请做好功课,在SIAM CENTER的对面。而且最后就提前预约一下,我们等了快一个小时就才有位置,也许是享受美食,很久才会走一台。在等待的时候就已经把餐盘翻穿了,一坐下,不用等待,立刻点餐。完全忘记我们只有两个人在作战!

除了海鲜拼盘不好吃,其他都一级棒!海鲜拼盘的那个蘸料太奇怪了,又酸又辣还是绿色的。", 
  "http://n3-q.mafengwo.net/s8/M00/A8/40/wKgBpVXJ_ZmAeOxBAAjwGDzYEXg22.jpeg?imageMogr2%2Fthumbnail%2F%21200x150r%2Fgravity%2FCenter%2Fcrop%2F%21200x150%2Fquality%2F90", 
  "http://b3-q.mafengwo.net/s8/M00/A8/5A/wKgBpVXJ_a-ACPTQAAu-qwezLm881.jpeg?imageMogr2%2Fthumbnail%2F%21200x150r%2Fgravity%2FCenter%2Fcrop%2F%21200x150%2Fquality%2F90", 
  "http://b1-q.mafengwo.net/s8/M00/A8/6A/wKgBpVXJ_byAUD0UAArQl-EOgLw63.jpeg?imageMogr2%2Fthumbnail%2F%21200x150r%2Fgravity%2FCenter%2Fcrop%2F%21200x150%2Fquality%2F90", 
  "主打菜是咖喱蟹,确实很不错。味道偏甜,多吃会腻。", 
  "建兴酒家泰国菜比较正宗,海鲜很新鲜。点着那泰式咖喱蟹,大虾冬阴功,不知名的某鱼还有什么泰式的蔬菜等,一边吃的欢,一边感慨:跟着攻略走,果然美味不会错!等到结账买单时,服务员上来账单一看,5800多铢,傻眼了!", 
  "建兴酒家(CENTRAL AMBASSY店)咖喱蟹真是太棒了,咖喱蟹加了蛋黄,非常好吃。", 
  "http://n3-q.mafengwo.net/s8/M00/ED/38/wKgBpVXFfluASZmjAAIP8zhLkns30.jpeg?imageMogr2%2Fthumbnail%2F%21200x150r%2Fgravity%2FCenter%2Fcrop%2F%21200x150%2Fquality%2F90", 
  "http://n3-q.mafengwo.net/s8/M00/ED/6B/wKgBpVXFfoaARyp6AAG0hzU8HJY32.jpeg?imageMogr2%2Fthumbnail%2F%21200x150r%2Fgravity%2FCenter%2Fcrop%2F%21200x150%2Fquality%2F90", 
  "http://n2-q.mafengwo.net/s8/M00/E6/A1/wKgBpVXB97WASTNwAAsINW94q5M85.jpeg?imageMogr2%2Fthumbnail%2F%21200x150r%2Fgravity%2FCenter%2Fcrop%2F%21200x150%2Fquality%2F90", 
  "建兴酒家的咖喱蟹确实好吃,也不贵。", 
  "传说中的咖喱蟹出名的酒家。游记提到说不要轻易打的告诉司机去建兴酒家,因为可能会带你去山寨店,然后狠狠的砍。所以,请提前上建兴酒家的官网查询好具体地址,然后查找好附近的BTS,自行解决吧!

价格略贵,不过味道很好!咖喱蟹诚心推荐。"
 ], 
 "opentime": "openTime", 
 "description": "由华人创立的建兴酒家是曼谷一家老字号的海鲜餐厅,烹饪融合粤菜和泰国菜技法,国人比较容易接受。咖喱蟹是这里的招牌菜,炒含羞草、粉丝虾煲、蒜蓉虾也是这里的推荐菜。建兴酒家在曼谷有七家店,其中Samyan、Central Embassy、Siam Square One店是中午营业的,其他店的营业时间都是16:00-23:30。", 
 "travel": [
  "/i/1008978.html", 
  "/i/832751.html", 
  "/i/1096198.html", 
  "/i/811283.html", 
  "/i/881575.html", 
  "/i/885891.html", 
  "/i/850595.html", 
  "/i/962279.html", 
  "/i/1058436.html", 
  "/i/1250020.html", 
  "/i/1290693.html", 
  "/i/1285795.html", 
  "/i/1161749.html", 
  "/i/1078420.html", 
  "/i/1136733.html", 
  "/i/1008978.html", 
  "/i/832751.html", 
  "/i/1096198.html"
 ], 
 "telephone": "(66-02)2333104", 
 "rate": "4.1", 
 "location": "169, 169/7-12 Surawong Rd., Suriyawong, Bangrak, Bangkok 10500", 
 "ticket": "ticket", 
 "enname": "Somboon Seafood", 
 "name": "建兴酒家(Surawong店) "
}

Choosing and Installing a Python IDE

For the IDE I chose PyCharm, which is fast and convenient. Download it here:
[http://www.jetbrains.com/pycharm/download/](http://www.jetbrains.com/pycharm/download/)
Ways to activate PyCharm:
    1. Buying a license is recommended.
    2. You can also use the free 30-day trial.
    3. Or look for an activation code online:
(the code below comes from the internet and is for learning and exchange purposes only)
    user name: EMBRACE
    key:
        14203-12042010
        0000107Iq75C621P7X1SFnpJDivKnX
        6zcwYOYaGK3euO3ehd1MiTT"2!Jny8
        bff9VcTSJk7sRDLqKRVz1XGKbMqw3G




Regular Expressions

Before tackling the crawler you also need some regular-expression basics. Here is the usual reference:
(image: table of basic regex symbols and their meanings)
  The tokens you will use most often are \d (digit), \w (word character), \W (non-word character), ., *, ?, and +. A short demo follows.
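
To make these concrete, here is a minimal, self-contained demo (the sample string is invented for illustration):

    # -*- coding: utf-8 -*-
    import re

    sample = "Tel: (66-02)2333104, rate 4.1"
    print(re.findall(r'\d+', sample))    # \d+ matches runs of digits: ['66', '02', '2333104', '4', '1']
    print(re.findall(r'\w+', sample))    # \w matches letters, digits and underscore
    print(re.search(r'rate (\d\.\d)', sample).group(1))    # capture group with an escaped dot: '4.1'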



Required Libraries

  • re
  • urllib2
  • BeautifulSoup
  • json

    urllib2 is used to fetch a page's HTML; on top of that you can extract data with re regular expressions or with BeautifulSoup, and the extracted data is finally saved as JSON. A minimal sketch of this pipeline follows.
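
Here is a sketch of that pipeline (Python 2, since the post relies on urllib2; the homepage is just a stand-in target, and the site may require the User-Agent header that the full code below sets):

    # -*- coding: utf-8 -*-
    import re
    import json
    import urllib2
    from bs4 import BeautifulSoup

    html = urllib2.urlopen('http://www.mafengwo.cn').read()    # 1. fetch the raw HTML
    title = BeautifulSoup(html, 'html.parser').title.string    # 2a. structured parsing
    numbers = re.findall(r'\d+', html)                         # 2b. or plain regex matching
    print(json.dumps({'title': title, 'numbers': numbers[:3]}, indent=1))  # 3. serialize as JSON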



Crawler Code Analysis

The first page we need to crawl is the search-results list page:
(screenshot: the food search-results list page)
Its URL is www.mafengwo.cn/group/s.php?q=曼谷&p=1&t=cate&kt=1. The main parameters are q, p and t: q is the city name, p is the page number, and t is the category (cate means food); kt does not affect the results. The snippet below shows how such a URL can be assembled.
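
A sketch of building that URL in code (for illustration only; a Chinese city name is normally URL-encoded when it goes into a query string, though the post's own code passes the raw UTF-8 bytes, which also worked here):

    # -*- coding: utf-8 -*-
    import urllib

    city = '曼谷'   # Bangkok
    pageid = 1
    url = 'http://www.mafengwo.cn/group/s.php?q=%s&p=%d&t=cate&kt=1' % (urllib.quote(city), pageid)
    print(url)   # ...s.php?q=%E6%9B%BC%E8%B0%B7&p=1&t=cate&kt=1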

We fetch the page with the function below; detailURL is the part of the URL after the domain, so this function can retrieve any page under the site's domain.

    #fetch a page under the site's domain
    def getDetailPage(detailURL):
        try:
            url = "http://www.mafengwo.cn" + detailURL
            request = urllib2.Request(url)
            response = urllib2.urlopen(request)
            #build a Request object with urllib2 and open it with urlopen
            page = response.read()
            #read() returns the page content; print page would output the raw HTML
            pageCode = re.sub(r'<br[ ]?/?>', '\n', page)
            #replace <br> tags with newlines to clean up the HTML
            return pageCode
        except urllib2.URLError, e:
            if hasattr(e, "reason"):
                print e.reason
                return None

Next we collect the link of every restaurant: inspect the page elements first to see where the links live, then extract them.

    #get the shop links on one food-list page
    def getFoodHref(self,pageid):
        url = "/group/s.php?q="+self.city+"&p=" +str(pageid)+ "&t=cate&kt=1"
        page = getDetailPage(url)
        #call getDetailPage to fetch the list page
        soup = BeautifulSoup(page,'html.parser')
        #parse the page with BeautifulSoup
        FoodHref = []
        FoodLists =  soup.find(name="div",attrs={'data-category':'poi'}).ul
        FoodHrefList = FoodLists.find_all("h3")
        #grab every <h3> under <div class="_j_search_section" data-category="poi">; these hold the shop-list HTML
        for FoodHrefs in FoodHrefList:
            FoodWebsite = FoodHrefs.a['href']
            #for each entry take the href of its <a> tag, i.e. the shop's URL
            FoodHrefShort = str(FoodWebsite).replace('http://www.mafengwo.cn','')
            #strip the domain so the path can later be passed to getDetailPage to fetch the shop page
            FoodHref.append(FoodHrefShort)
        return FoodHref

Calling getDetailPage() again and passing in each FoodHref gives us the shop page, from which BeautifulSoup can pull the information. But I ran into a problem while scraping.
(screenshot: a shop page with complete information)


This shop has every field filled in, but some shops have no website or no transport info, like this one:
(screenshot: a shop page with incomplete information)
Inspecting the elements shows the tags are exactly the same, so there is no distinctive attribute or class value to target, and traversing the child and sibling nodes of <div class="bd"> does not work either. I eventually came up with the following approach.

First write a matcher function, hasAttr, whose argument is the full Chinese name of an information field. Inside getShopInfo, loop over the list of field names and match each one against the scraped content of the <div class="bd"> tag: a True return means the field exists on this page, otherwise move on and try the next name. Taking the screenshot above as an example: matching 简介 (brief) fails, matching 英文名称 (English name) fails too, until 地址 (address) succeeds, at which point we save the content of the tag that follows it. Repeat until every available field has been collected.

    #check whether an information field exists on the page
    def hasAttr(self, page, infoName):
        soup = BeautifulSoup(page, 'html.parser')
        col = soup.find("div", class_="col-main").find("div", class_="bd")
        str_col = str(col)
        #the field exists if its Chinese name appears inside <div class="bd">
        if infoName in str_col:
            return True
        else:
            return False

    #scrape the shop information
    def getShopInfo(self,page):
        shopInfoList = ['brief','localName','location', 'telephone', 'website', 'ticket', 'openTime','shopName','shopScore']
        infoItem = ['简介', '英文名称', '地址', '电话', '网址', '门票', '开放时间','名字','星评']
        soup = BeautifulSoup(page, 'html.parser')
        shopName = soup.find("div", class_="wrapper").h1.string
        shopScore = soup.find("div", class_="col-main").span.em.string

        for i in range(0, 7):
        #loop over the seven optional fields, 简介 through 开放时间
            if self.hasAttr(page, infoItem[i]):
                pattern_shopinfo = re.compile(
                    '<div class="col-main.*?<div class="bd">.*?'+ infoItem[i] +'</h3>.*?>(.*?)</p>', re.S)
                shopInfos = re.findall(pattern_shopinfo, page)
                #the field exists, so extract its tag content with a regex
                for shopInfo in shopInfos:
                    shopInfoList[i] = shopInfo
            else:
                #field missing, try the next one
                continue

        shopInfoList[7] = shopName
        shopInfoList[8] = shopScore
        return shopInfoList

Finally the data is added to a dict. When one key needs to hold multiple values, e.g. dict = {a: []}, call setdefault(keyname, []).append(value), like so (a toy example follows):
dict.setdefault('comment', []).append(comment)
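
A toy illustration of that pattern:

    d = {}
    for c in ['tasty', 'pricey', 'fresh']:
        d.setdefault('comment', []).append(c)   # creates the list on first access
    print(d)   # {'comment': ['tasty', 'pricey', 'fresh']}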

Then output it with json.dumps(dict, indent=1).decode("unicode_escape"). The indent argument pretty-prints the data as an indented JSON tree; when the content contains Chinese you need the decode("unicode_escape"), otherwise the result is full of "\u" Unicode escapes, as the example below shows.
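
For instance (Python 2; passing ensure_ascii=False to json.dumps would be an alternative way to keep the Chinese readable):

    # -*- coding: utf-8 -*-
    import json

    d = {'name': '建兴酒家'}
    print(json.dumps(d, indent=1))                            # "name": "\u5efa\u5174\u9152\u5bb6"
    print(json.dumps(d, indent=1).decode("unicode_escape"))   # "name": "建兴酒家"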



Here is the complete code. Change the argument of the MFW() instance at the bottom to switch cities, and call saveFood() or saveIntertainment() to fetch the city's food or entertainment information respectively.

    #coding:utf-8
    import re
    import urllib2
    from bs4 import BeautifulSoup
    import json
    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')
    #the default-encoding hack lets Python 2 write Chinese content to files and stdout


    class MFW:

        def __init__(self, city):
            self.siteURL = 'http://www.mafengwo.cn'
            self.city = city
            #map from city name to the id of its baike page
            self.cityDict = {'曼谷': '11045_518', '清迈': '15284_179', '普吉岛': '11047_858', '苏梅': '14210_686', '芭堤雅': '11046_940'}
            self.id = self.cityDict[self.city]
            self.user_agent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36"
            self.headers = {'User-Agent': self.user_agent}

        #get the shop links on one food-list page
        def getFoodHref(self, pageid):
            url = "/group/s.php?q=" + self.city + "&p=" + str(pageid) + "&t=cate&kt=1"
            page = self.getDetailPage(url)
            soup = BeautifulSoup(page, 'html.parser')
            FoodHref = []
            FoodLists = soup.find(name="div", attrs={'data-category': 'poi'}).ul
            FoodHrefList = FoodLists.find_all("h3")
            for FoodHrefs in FoodHrefList:
                FoodWebsite = FoodHrefs.a['href']
                FoodHrefShort = str(FoodWebsite).replace('http://www.mafengwo.cn', '')
                FoodHref.append(FoodHrefShort)
            return FoodHref

        #get hotel links
        def getHotelHref(self, pageid):
            url = "/group/s.php?q=" + self.city + "&p=" + str(pageid) + "&t=hotel&kt=1"
            page = self.getDetailPage(url)
            soup = BeautifulSoup(page, 'html.parser')
            hotelHref = []
            hotelHrefLists = soup.find_all("div", class_="hot-about clearfix _j_hotel")
            for hotelHrefList in hotelHrefLists:
                hotelWebsite = hotelHrefList.a['href']
                hotelHrefShort = str(hotelWebsite).replace('http://www.mafengwo.cn', '')
                hotelHref.append(hotelHrefShort)
            return hotelHref

        #fetch the HTML of the city's baike page
        def getPage(self):
            try:
                url = self.siteURL + "/baike/" + str(self.id) + ".html"
                request = urllib2.Request(url, headers=self.headers)
                response = urllib2.urlopen(request)
                page = response.read()
                pageCode = re.sub(r'<br[ ]?/?>', '\n', page)
                return pageCode
            except urllib2.URLError, e:
                if hasattr(e, "reason"):
                    print e.reason
                    return None

        #fetch the HTML of a page under the site's domain
        def getDetailPage(self, detailURL):
            try:
                shopURL = self.siteURL + detailURL
                response = urllib2.urlopen(shopURL)
                detailPage = response.read()
                detailPageCode = re.sub(r'<br[ ]?/?>', '\n', detailPage)
                return detailPageCode
            except urllib2.URLError, e:
                if hasattr(e, "reason"):
                    print e.reason
                    return None

        #get the list of project (category) names
        def getProject(self):
            page = self.getPage()
            soup = BeautifulSoup(page, 'html.parser')
            projectName = []
            projectId = {}
            projects = soup.find("div", class_="anchor-nav").stripped_strings
            for project in projects:
                projectName.append(project)
            for i in range(len(projectName)):
                projectId[i] = projectName[i]
            return projectId

        #get the list of shop links
        def getShopHref(self):
            page = self.getPage()
            soup = BeautifulSoup(page, 'html.parser')
            cardList = soup.find_all("div", class_="poi-card clearfix")
            shopHref = []
            for items in cardList:
                shopitem = items.find_all("div", class_="item")
                for item in shopitem:
                    shopHref.append(item.a['href'])
            return shopHref

        #scrape the comments (text and image URLs)
        def getComment(self, page):
            soup = BeautifulSoup(page, 'html.parser')
            commentArea = soup.find("div", class_="_j_commentlist")
            commentList = commentArea.find_all("div", class_="comment-item")
            commentContent = []
            for item in commentList:
                commentContent.append(item.find('p').string)
                commentImas = item.find_all(name='img', attrs={'height': re.compile('.*?')})
                for commentIma in commentImas:
                    commentContent.append(commentIma.get('src'))
            return commentContent

        #scrape the travel-note links
        def getTravel(self, page):
            soup = BeautifulSoup(page, 'html.parser')
            items = soup.find_all("li", class_="post-item clearfix")
            travelHref = []
            for item in items:
                travelHref.append(item.find('a').get('href'))
            return travelHref

        #check whether an information field exists on the page
        def hasAttr(self, page, infoName):
            soup = BeautifulSoup(page, 'html.parser')
            col = soup.find("div", class_="col-main").find("div", class_="bd")
            str_col = str(col)
            if infoName in str_col:
                return True
            else:
                return False

        #scrape the shop information
        def getShopInfo(self, page):
            shopInfoList = ['brief', 'localName', 'location', 'telephone', 'website', 'ticket', 'openTime', 'shopName', 'shopScore']
            infoItem = ['简介', '英文名称', '地址', '电话', '网址', '门票', '开放时间', '名字', '星评']
            soup = BeautifulSoup(page, 'html.parser')
            shopName = soup.find("div", class_="wrapper").h1.string
            shopScore = soup.find("div", class_="col-main").span.em.string

            #indices 0-6 are the optional fields, 简介 through 开放时间
            for i in range(0, 7):
                if self.hasAttr(page, infoItem[i]):
                    pattern_shopinfo = re.compile(
                        '<div class="col-main.*?<div class="bd">.*?' + infoItem[i] + '</h3>.*?>(.*?)</p>', re.S)
                    shopInfos = re.findall(pattern_shopinfo, page)
                    for shopInfo in shopInfos:
                        shopInfoList[i] = shopInfo
                else:
                    continue

            shopInfoList[7] = shopName
            shopInfoList[8] = shopScore
            return shopInfoList

        #scrape and save the restaurant data
        def saveFood(self):
            f = open(r'****.txt', 'w')
            a = 0
            for i in range(51):
                try:
                    foodHrefList = self.getFoodHref(i)
                    for foodHref in foodHrefList:
                        page = self.getDetailPage(foodHref)
                        dict = {}.fromkeys(('description', 'enname', 'location', 'telephone', 'website', 'ticket', 'opentime', 'name', 'rate', 'comment', 'travel'))
                        shopInfos = self.getShopInfo(page)
                        dict['description'] = shopInfos[0]
                        dict['enname'] = shopInfos[1]
                        dict['location'] = shopInfos[2]
                        dict['telephone'] = shopInfos[3]
                        dict['website'] = shopInfos[4]
                        dict['ticket'] = shopInfos[5]
                        dict['opentime'] = shopInfos[6]
                        dict['name'] = shopInfos[7]
                        dict['rate'] = shopInfos[8]
                        comments = self.getComment(page)
                        dict['comment'] = comments
                        travels = self.getTravel(page)
                        dict['travel'] = travels
                        #write the record to the file and echo it to stdout
                        output = json.dumps(dict, indent=1).decode("unicode_escape")
                        f.write(output + '\n')
                        print output
                        print ("=================================================================================" + "\n")
                        a += 1
                except AttributeError, e:
                    continue
            f.close()
            print "Done: " + str(a) + " records in total"

        #output the entertainment information
        def saveIntertainment(self):
            f = open(r'****.txt', 'a')
            f.write('\nCity: ' + self.city + '\n\n\n')
            shopProjects = self.getProject()
            for i in shopProjects.keys():
                f.write(str(i) + str(shopProjects[i]) + '\n')
            shopHrefList = self.getShopHref()
            for shopHref in shopHrefList:
                try:
                    page = self.getDetailPage(shopHref)
                    shopInfos = self.getShopInfo(page)
                    for shopInfo in shopInfos:
                        f.write(str(shopInfo) + '\n')
                    comments = self.getComment(page)
                    for comment in comments:
                        f.write(str(comment) + '\n')
                    travels = self.getTravel(page)
                    for travel in travels:
                        f.write(str(travel) + '\n')
                    f.write("======================================================================================================================" + '\n')
                except AttributeError, e:
                    continue
            f.close()
            print "Done"


    mfw = MFW('曼谷')
    mfw.saveFood()