Scraping Travelogues and Photos from Mafengwo with Python

Latest recommended article published on 2024-09-16 13:54:48

Lei_baobao

Tags: python

Article link: https://blog.csdn.net/lei_baobao/article/details/105198738

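For a particular reason, I needed to scrape some travelogues and photos from Mafengwo as data for later analysis. After reading a few similar crawler articles online, I gave it a try myself. This time the target is every travelogue about Suzhou on Mafengwo, including both the text and the photos.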

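We open a travelogue, inspect its HTML, and locate the tags that hold the photos and the travelogue text.
[Figure: the tag containing the photos]
[Figure: the tag containing the travelogue text]
Once we know where the photos and the text sit in the markup, we can use XPath to locate the content we want and scrape it.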
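
As a small illustration of path-based lookup, the sketch below runs it on a hand-written fragment. The fragment and its URLs are made up for illustration. It also uses the standard library's xml.etree.ElementTree as a stand-in for a full HTML parser; that module supports only a subset of XPath and needs well-formed markup, so treat this as a sketch of the idea and not the scraper itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment standing in for a travelogue page
SAMPLE = """
<div class="post">
  <ul><li>2020.03.30</li><li>other</li></ul>
  <div><p>First paragraph.</p><p>Second paragraph.</p></div>
  <img data-src="http://example.com/a.jpg"/>
  <img data-src="http://example.com/b.jpg"/>
</div>
"""

root = ET.fromstring(SAMPLE)

# Photo URLs: the data-src attribute of every <img> element
photo_urls = [img.get("data-src") for img in root.findall(".//img")]

# Body text: the text of every <p> that is a direct child of a <div>
paragraphs = [p.text for p in root.findall(".//div/p")]

print(photo_urls)
print(paragraphs)
```

With a full XPath engine the attribute can be selected directly in the expression (for example `//img/@data-src`), which ElementTree's subset does not support.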

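import urllib.request

# Save the photos
def savePhoto(soup, path):
    # XPath for the photo nodes; change it to match the page's actual markup
    img = soup.xpath('//img/@data-src')
    total_img = 0
    for myimg in img:
        total_img += 1
        print(myimg)
        # path is used as a prefix, so it should end with a path separator
        urllib.request.urlretrieve(myimg, path + '%s.jpg' % total_img)
    print("Saved " + str(total_img) + " photos in total")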

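import re

# Save the text
def saveText(soup, path):
    # Write the travelogue to a txt file ('游记内容' means "travelogue content")
    path = path + '游记内容.txt'
    # The with block closes the file when it exits
    with open(path, 'w', encoding='utf-8') as file:
        time = soup.xpath('//ul/li[1]/text()')  # the time is extracted here
        for t in time:
            t = re.sub('\\n', '', t)   # drop line breaks
            t = re.sub('\\.', '', t)   # drop literal periods
            t = t.strip()
            file.write(t + '\n')

        text = soup.xpath('//div/p/text()')  # the body paragraphs
        for mytext in text:
            mytext = re.sub('\\r\n', '', mytext)
            mytext = re.sub('\\.', '', mytext)
            mytext = mytext.strip()
            file.write(mytext + '\n')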
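
To see what the cleanup in saveText does to an individual string, here is a minimal sketch that isolates it. The helper name `clean_text` and the sample strings are hypothetical, and the two line-break patterns are merged into one for brevity.

```python
import re

def clean_text(raw):
    """Mirror the cleanup applied to each text node before it is written."""
    cleaned = re.sub(r'\r?\n', '', raw)    # drop line breaks
    cleaned = re.sub(r'\.', '', cleaned)   # drop every literal period
    return cleaned.strip()                 # trim surrounding whitespace

date_line = clean_text('  2020.03.30\n')
body_line = clean_text('\r\n  A quiet morning.  ')

print(date_line)
print(body_line)
```

Note that the period removal also affects dates: `2020.03.30` comes out as `20200330`. If the dotted form matters for later analysis, that step could be skipped for the time line.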

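Since we want every travelogue about Suzhou, we have to iterate over all of them. Look at a travelogue URL, for example http://www.mafengwo.cn/i/18949141.html: the number identifies the travelogue, but the numbers appear to be assigned with no discernible pattern, so we cannot enumerate travelogues by changing it. Instead, we can go back up one level and look at the listing page's address, http://www.mafengwo.cn/yj/10207/2-0-1.html. A closer look shows 😎 that the trailing 1 is the page number, so we can change it to walk through every page and then collect the travelogue links found on each page.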
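
The URL scheme described here can be sketched on its own, without any network access. The helper names `listing_urls` and `absolute_url` are hypothetical; the URL patterns are the ones quoted above.

```python
from urllib.parse import urljoin

BASE = "http://www.mafengwo.cn"

def listing_urls(first_page, last_page):
    """Return the listing-page URLs for first_page..last_page, inclusive."""
    return [f"{BASE}/yj/10207/2-0-{n}.html"
            for n in range(first_page, last_page + 1)]

def absolute_url(href):
    """Turn a relative link such as /i/18970390.html into a full URL."""
    return urljoin(BASE, href)

print(listing_urls(1, 3))
print(absolute_url("/i/18970390.html"))
```

Making the upper bound inclusive is a design choice in this sketch: `range()` itself excludes its end value, which is easy to overlook when setting page limits.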

import gzip
import urllib.request
from io import BytesIO
from lxml import etree

# Page range; range() excludes the upper bound, so 1 and 2 cover page 1 only
min_page = 1
max_page = 2

# Fetch a URL and parse it; the response is assumed to be gzip-compressed
def fetchPage(url):
    request = urllib.request.Request(url, data=None, headers=headerS)
    response = urllib.request.urlopen(request).read()
    buff = BytesIO(response)
    response = gzip.GzipFile(fileobj=buff)
    return etree.HTML(response.read().decode('utf-8'))

# Iterate over all the travelogues
# (headerS, path and start() are not shown here and are assumed to be defined elsewhere in the script)
for i in range(min_page, max_page):
    url = "http://www.mafengwo.cn/yj/10207/2-0-" + str(i) + ".html"
    soup = fetchPage(url)
    hrefs = soup.xpath('//li/h2/a[@class="title-link"]/@href')
    for href in hrefs:
        # print(href)  # /i/18970390.html
        url = "http://www.mafengwo.cn" + str(href)
        soup = fetchPage(url)
        print("Connected!")
        start(soup, path)
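
The loop above passes every response through gzip.GzipFile, which would raise an error if the server ever replied with an uncompressed body. One possible safeguard, sketched here under the assumption that the caller has the response's Content-Encoding header value at hand, is to decompress only when that header says gzip. The helper name `decode_body` is hypothetical, and the responses are simulated so the sketch runs without touching the network.

```python
import gzip

def decode_body(raw, content_encoding=None, charset="utf-8"):
    """Decompress raw bytes only if they were gzip-encoded, then decode to text."""
    if content_encoding and "gzip" in content_encoding.lower():
        raw = gzip.decompress(raw)
    return raw.decode(charset)

# Simulated responses: one compressed, one plain
page = "<html><body>苏州</body></html>"
compressed = gzip.compress(page.encode("utf-8"))

from_gzip = decode_body(compressed, "gzip")
from_plain = decode_body(page.encode("utf-8"), None)

print(from_gzip == page)
print(from_plain == page)
```

With urllib, the header value can be read from the response object via `response.headers.get("Content-Encoding")` before calling `response.read()`.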