360 App Market Crawler

Latest recommended article published on 2023-01-07 12:24:30

qq_33161357

Category: Python. Tags: python-crawler

Article link: https://blog.csdn.net/qq_33161357/article/details/53407121

The crawler scrapes information about whichever app matches the search keyword. For example, searching for “微信” (WeChat) gives:
http://zhushou.360.cn/search/index/?kw=%E5%BE%AE%E4%BF%A1
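
The kw value in that URL is the keyword percent-encoded as UTF-8. A minimal sketch of building the search URL, written for Python 3 (where the quote function lives in urllib.parse) rather than the Python 2 used in the rest of this post; the helper name build_search_url is made up for illustration:

```python
from urllib.parse import quote

# URL template taken from the search example above
SEARCH_URL = "http://zhushou.360.cn/search/index/?kw=%s"


def build_search_url(keyword):
    """Percent-encode the keyword as UTF-8 and place it in the kw parameter."""
    return SEARCH_URL % quote(keyword, safe="")


print(build_search_url("微信"))
# http://zhushou.360.cn/search/index/?kw=%E5%BE%AE%E4%BF%A1
```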

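The page is requested with GET, so the approach below works as is.
If the page used POST instead, the form data would have to be sent in the request body, for example by passing a data argument to urllib2.urlopen.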
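
To make the GET/POST distinction concrete, here is a small sketch in Python 3 (urllib.request is the Python 3 successor of urllib2). It only builds the request objects and never sends them. The example.com endpoint and the form field are placeholders, not something the 360 site is known to accept:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and form field, used only to show the mechanics
url = "http://example.com/search"
form = urlencode({"kw": "微信"}).encode("ascii")

get_request = Request(url)              # no body, so it would be sent as GET
post_request = Request(url, data=form)  # a body is present, so it becomes POST

print(get_request.get_method())   # GET
print(post_request.get_method())  # POST
```

Passing either object to urllib.request.urlopen would perform the actual request.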

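For the first-level page (the search results), define a class spider(). The code in this post is Python 2 and needs urllib, urllib2, re, and json from the standard library, plus BeautifulSoup from the bs4 package.

__init__(self)
getsource(self, url): fetches the page source (url to html)
getinfo(self, html): extracts information from the html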

urllib2.urlopen(url).read() fetches the page source:

    def getsource(self, url):
        response = urllib2.urlopen(url)
        html = response.read()
        return html
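
For readers on Python 3, a rough equivalent of getsource might look like the sketch below. The function name get_source, the timeout, the User-Agent header, and the UTF-8 fallback are all choices made for this example, not part of the original code:

```python
import urllib.request


def get_source(url, timeout=10):
    """Fetch a URL with GET and return the body decoded to text."""
    # Some sites reject the default Python User-Agent, so send a browser-like one
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Use the charset from the Content-Type header, falling back to UTF-8
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling get_source on the search URL shown earlier would return the search results page as a string.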

The information is extracted with BeautifulSoup:

    def getinfo(self, html):
        # Parse the HTML into a soup; name the parser explicitly so the
        # result does not depend on which parsers happen to be installed
        soup = BeautifulSoup(html, "html.parser")
        # Root node 1: the <dl> elements
        jd1 = soup.find_all('dl')
        # Root node 2: the <div class="sdlft"> elements
        jd2 = soup.find_all('div', attrs={"class": "sdlft"})
        # Child tags under root node 1
        appName = jd1[0].find_all('span', attrs={"class": "red"})
        description = jd1[0].find_all('p')
        # Values that are awkward to select with soup are matched with regular expressions
        pattern = re.compile(r'\d\.\d')
        score = re.findall(pattern, str(jd2))[0]
        link = re.findall(r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')", str(jd1))[0]
        # Child tags under root node 2
        downNum = jd2[0].find_all('p', attrs={"class": "downNum"})
        # Extract the text of each tag
        name = appName[0].string
        des = description[0].string
        down = downNum[0].string
        pagelink = "http://zhushou.360.cn" + link
        # '分' means "points"
        return name, des, score + '分', down, pagelink
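
The two regular expressions in getinfo can be tried on their own. The sketch below runs them against a made-up HTML fragment (the markup and the /detail/demo path are invented for the example and are not copied from the real page), using Python 3 and only the standard library:

```python
import re

# Invented fragment that only mimics the pieces the patterns look for
SAMPLE = (
    '<dl><dd><a href="/detail/demo">Demo</a></dd></dl>'
    '<div class="sdlft"><span>8.5</span></div>'
)

# One digit, a literal dot, one digit, e.g. "8.5"
SCORE_RE = re.compile(r"\d\.\d")
# Text between href="..." or href='...'; the lookarounds leave the quotes out
HREF_RE = re.compile(r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')")

score = SCORE_RE.findall(SAMPLE)[0]
link = HREF_RE.findall(SAMPLE)[0]

print(score)  # 8.5
print(link)   # /detail/demo
```

Note that the score pattern matches the first digit-dot-digit sequence anywhere in the text, so it relies on the fragment containing nothing else of that shape before the score.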

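For the second-level page, define a class nextPage:

__init__(self)
getsource(self, url)
getinfo(self, html)

These are implemented the same way as above, extracting whatever fields you need. The one thing to note is how the second-level URL is obtained: on the first-level page crawled here, the href attribute already contains the address of the second-level page, so the href taken from the first-level page can be used directly as the input for the second-level page.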
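
The original code builds the second-level URL by string concatenation. As an alternative sketch (Python 3, not the author's method), urljoin from the standard library resolves an href against the site root and also copes with hrefs that are already absolute. The /detail/demo path is a placeholder:

```python
from urllib.parse import urljoin

BASE = "http://zhushou.360.cn"


def detail_url(href):
    """Resolve an href from the first-level page against the site root."""
    return urljoin(BASE, href)


print(detail_url("/detail/demo"))
# http://zhushou.360.cn/detail/demo
```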

To store the scraped information, define a class savefun():

__init__(self)
save(self)

    def save(self):
        # Parse each page once instead of calling getinfo() for every field
        name, des, score, down, href = myspider.getinfo(html)
        detail = mypage.getinfo(html2)
        # Collect the fields in a dictionary (named record to avoid shadowing the built-in dict)
        record = {}
        record['appName'] = name.encode("utf-8")
        record['description'] = urllib.quote(des.encode("utf-8"))
        record['score'] = score
        record['downLoad'] = down
        record['href'] = href
        record['tag'] = detail[0]
        record['language'] = detail[1]
        print record

        # Write the record to a file as JSON; the with block closes the file
        with open("02.txt", 'w') as fp:
            fp.write(json.dumps(record))
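
In Python 3 the same idea can be written without manual encoding. The sketch below is an illustration, not the original implementation: the helper names are invented, and passing ensure_ascii=False is a choice made here so that Chinese text stays readable in the file:

```python
import json


def save_record(record, path):
    """Write one scraped record to a UTF-8 JSON file."""
    with open(path, "w", encoding="utf-8") as fp:
        # ensure_ascii=False keeps non-ASCII text as is instead of \uXXXX escapes
        json.dump(record, fp, ensure_ascii=False, indent=2)


def load_record(path):
    """Read the record back, mainly to check what was written."""
    with open(path, encoding="utf-8") as fp:
        return json.load(fp)
```

A call such as save_record({"appName": "微信"}, "02.txt") would produce a file that load_record("02.txt") reads back into an equal dictionary.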

Main program:

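# The file needs "# -*- coding: utf-8 -*-" at the top because of the non-ASCII literal
if __name__ == '__main__':
    # Unicode literal, so that .encode("utf-8") works in Python 2
    keyword = u"微信"
    base_url = "http://zhushou.360.cn/search/index/?kw=%s" % urllib.quote(keyword.encode("utf-8"))
    # First-level page: the search results
    myspider = spider()
    html = myspider.getsource(base_url)
    # Element 4 of the returned tuple is the second-level page URL
    base_url2 = myspider.getinfo(html)[4]
    # Second-level page
    mypage = nextPage()
    html2 = mypage.getsource(base_url2)

    # save() reads the module-level names myspider, html, mypage, and html2
    mysave = savefun()
    mysave.save()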

Tips: