Scraping Mafengwo Travel-Companion Posts with a Python Crawler

TangToming

Reposting is welcome; please credit the source and include a link.

Article link: https://blog.csdn.net/TangToming/article/details/88814291

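Recently I have been planning a trip and looking for travel-companion posts on Mafengwo. Finding the relevant posts on the Mafengwo website is hard: filtering is poorly supported, there are no buttons for the conditions I care about, and I had to click through the pages one by one. On top of that, searches for companions in China's northwest are flooded with posts and replies from tour convoys, which makes the relevant posts even harder to find. To find them quickly, I decided to write a crawler for Mafengwo's companion pages.
Open the companion home page, http://www.mafengwo.cn/together/, and filter by conditions such as "拉萨" (Lhasa) and "一个月内" (within one month). When you page through the filtered results, the URL in the address bar does not change, which tells us the page is loaded dynamically. Open the page in Chrome, press F12 to open the developer tools, and choose "Network" -> XHR (sometimes JS). Then click "2" in the pagination bar to jump to the second page. A new request appears in the request list on the left of the developer tools (the bottom row, highlighted in blue, in the left part of the screenshot below). Click it to show its headers on the right. In Headers, Request URL is the URL the page actually requests, Request Method shows that this is a POST request, and Request Headers lists the headers the request needs. Because it is a POST request, we also need to see what data it submits, so scroll further down in the Headers panel.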
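[Screenshot: Chrome developer tools, Network panel, showing the request's Headers and Query String Parameters]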

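From the Query String Parameters in the screenshot above, the request carries a few key values: flag, offset, mddid, timeFlag, and timestart. Because they are listed as query string parameters, they travel in the URL itself.
flag: selects one of the tabs "热门结伴" (popular), "最新发表" (latest), and "即将出发" (departing soon); flag=3 is popular, which is the default. timeFlag appears to hold the time-range filter, such as "within one month";
offset: the page offset; paging is done by changing this parameter;
mddid: the destination code; Lhasa is 10442;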
So, from one of the requested URLs, http://www.mafengwo.cn/together/travel/more?flag=3&offset=1&mddid=10442&timeFlag=3&timestart=, we can infer that a result-page URL is http://www.mafengwo.cn/together/travel/more? followed by the parameters flag=3&offset=1&mddid=10442&timeFlag=3&timestart=
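The URL structure inferred above can be sketched with the standard library's `urlencode`, which avoids hand-concatenating `&` and `=`. The helper name `build_list_url` and its default values are assumptions made for illustration; only the parameter names and the example values come from the request observed above.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.mafengwo.cn/together/travel/more"


def build_list_url(offset, mddid=10442, flag=3, time_flag=3, timestart=""):
    """Return the result-list URL for one page (hypothetical helper)."""
    params = {
        "flag": flag,            # tab selector; 3 is the default tab in the article
        "offset": offset,        # page offset used for paging
        "mddid": mddid,          # destination code; 10442 is Lhasa
        "timeFlag": time_flag,
        "timestart": timestart,  # left empty in the observed request
    }
    # urlencode keeps the dict's insertion order and escapes values safely
    return BASE_URL + "?" + urlencode(params)


print(build_list_url(1))
```

Calling `build_list_url(1)` reproduces the URL quoted above, and changing `offset` gives the other pages.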

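# Work out how many result pages there are
def get_page(url):
    raw = urllib.request.urlopen(url).read()
    # Decode the JSON as UTF-8 directly; re-encoding it to GBK can fail or garble text
    response = json.loads(raw.decode("utf-8"))
    total = int(response['data']['total'])

    # 12 posts per page; round up without overcounting exact multiples of 12
    page = (total + 11) // 12
    print("page:", page)
    return page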
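The page count is a ceiling division of the total number of posts by the page size of 12. Here is a minimal offline sketch of that arithmetic. The sample payload is made up, and only its shape (`data.total`, `data.html`) follows the fields the crawler reads; `count_pages` is a hypothetical helper name.

```python
import json
import math

PAGE_SIZE = 12  # posts per page, taken from the divisor used in get_page


def count_pages(payload_text, page_size=PAGE_SIZE):
    """Parse a JSON payload and return the number of result pages."""
    payload = json.loads(payload_text)
    total = int(payload["data"]["total"])
    return math.ceil(total / page_size)


# Made-up payload for illustration; the real response may carry more fields
sample = json.dumps({"data": {"total": 24, "html": "<li>post</li>"}})

print(count_pages(sample))  # ceiling division: 24 posts fill exactly 2 pages
print(int(24 / 12) + 1)     # truncate-and-add-one gives 3, one page too many
```

The two results differ only when the total is an exact multiple of 12. In that case the extra request would presumably just return an empty page, but it is still a wasted request.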

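Once the page count is known, build the URL for each page and fetch the details of every post: the poster's basic information and the trip description, followed by the list of users who signed up, the list of users who follow the post, and the commenters with their comments. Finally, save the results to an Excel file.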

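The detail pages are parsed with BeautifulSoup and regular expressions, and the follower, sign-up, and comment lists are flattened into a form that an Excel cell can store.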
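To illustrate the extraction and flattening step without any third-party dependency, the sketch below uses only `re` on a made-up HTML fragment. The fragment and the helper `first_match` are assumptions, and the real Mafengwo markup may differ. The field labels and the `'-'` separator mirror those used in the crawler. Unlike indexing `findall(...)[0]`, the helper returns an empty string when a field is missing, so a post without that field does not abort the run.

```python
import re

# Made-up fragment for illustration; the real page markup may differ
SAMPLE = (
    '<span>出发时间：2019-04-20</span>'
    '<span>出发地：成都</span>'
    '<ul><li><a class="name" href="/u/1001.html">Ann</a></li>'
    '<li><a class="name" href="/u/1002.html">Bob</a></li></ul>'
)


def first_match(pattern, text, default=""):
    """Return the first captured group, or default when nothing matches."""
    match = re.search(pattern, text)
    return match.group(1) if match else default


gooff = first_match(r"出发时间：(.*?)</span>", SAMPLE)  # departure date
fro = first_match(r"出发地：(.*?)</span>", SAMPLE)      # departure city
days = first_match(r"大约：(.*?)</span>", SAMPLE)       # absent in SAMPLE

# Flatten the list of profile links into one string that fits a single cell
hrefs = re.findall(r'<a class="name" href="(.*?)"', SAMPLE)
joined = "-".join(hrefs)

print(gooff, fro, repr(days), joined)
```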
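One more issue: during testing I used no proxy and did not throttle the crawler, so Mafengwo blocked me halfway through the crawl. I plan to add proxy IPs in a later revision.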
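One possible shape for that optimization is sketched here under stated assumptions: a fixed pause before each request, a growing pause after each failure, and an optional proxy. The names `make_opener` and `polite_fetch`, the retry count, and the doubling back-off are illustrative choices rather than part of the original crawler. The 8-second default matches the `time.sleep(8)` used in the full code.

```python
import time
import urllib.request


def make_opener(proxy_url=None):
    """Build a urllib opener; route traffic through proxy_url when given."""
    handlers = []
    if proxy_url:
        handlers.append(
            urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
        )
    return urllib.request.build_opener(*handlers)


def polite_fetch(url, fetch, delay=8.0, retries=3, sleep=time.sleep):
    """Call fetch(url), pausing before every attempt and doubling the pause
    after each failure. fetch and sleep are injectable to ease testing."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(retries):
        sleep(delay * (2 ** attempt))
        try:
            return fetch(url)
        except OSError as err:  # urllib.error.URLError is a subclass of OSError
            last_error = err
    raise last_error


# Usage sketch (makes no request until polite_fetch is actually called):
opener = make_opener()  # or make_opener("http://host:port") with a real proxy


def fetch(url):
    return opener.open(url, timeout=10).read()

# body = polite_fetch("http://www.mafengwo.cn/together/", fetch)
```

Passing `fetch` and `sleep` in as arguments keeps the retry logic testable without touching the network or actually waiting.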
The complete crawler code follows:



#######################################################
import json
import re
import time  # needed by get_list for time.sleep
import urllib.request

import xlsxwriter  # writes the Excel file
from bs4 import BeautifulSoup  # uses the 'lxml' parser, so lxml must be installed

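# Work out how many result pages there are
def get_page(url):
    raw = urllib.request.urlopen(url).read()
    # Decode the JSON as UTF-8 directly; re-encoding it to GBK can fail or garble text
    response = json.loads(raw.decode("utf-8"))
    total = int(response['data']['total'])

    # 12 posts per page; round up without overcounting exact multiples of 12
    page = (total + 11) // 12
    print("page:", page)
    return page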

	
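# Crawl every result page in turn
def get_list(page):
    for i in range(0, page):
        # flag and mddid are the module-level strings set in the entry point
        url = 'http://www.mafengwo.cn/together/travel/more?' + flag + '&offset=' + str(i) + '&' + mddid + '&timeFlag=3&timestart='
        print("#" * 20)
        print("i,url:", i, url)
        get_matehtml(url)
        time.sleep(8)  # pause between pages to reduce the risk of being blocked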

		

# Extract the detail-page links from one result page
def get_matehtml(url):
    response = json.loads(urllib.request.urlopen(url).read().decode("utf-8"))
    html = response['data']['html']
    pat1 = r'a href="(.*?)" target="_blank">'
    for link in re.findall(pat1, html):
        detail_url = 'http://www.mafengwo.cn' + link
        print("get_matehtml:", detail_url)
        get_detail(detail_url)

	
# Scrape the details of one companion post
def get_detail(html):
    global togetherlist
    response = urllib.request.urlopen(html).read().decode("utf-8")
    # Parse the text already downloaded instead of requesting the page a second time
    soup = BeautifulSoup(response, 'lxml')
    # Title
    title = soup.title.get_text()
    print("title:", title)
    # The first three "<span>N</span>人" counters: views, sign-ups, followers
    counts = re.findall('<span>(.*?)</span>人', response)
    see, sign, follow = counts[0], counts[1], counts[2]
    gooff = re.findall('出发时间：(.*?)</span>', response)[0]   # departure date
    days = re.findall('大约：(.*?)</span>', response)[0]        # approximate duration
    des = re.findall('目的地：(.*?)</span>', response)[0]       # destination
    fro = re.findall('出发地：(.*?)</span>', response)[0]       # departure city
    num = re.findall('希望人数：(.*?)</span>', response)[0]     # desired group size
    enrollment = re.findall('<span><em>(.*?)</em>', response)[0]
    female = re.findall('<span>MM(.*?) <i', response)[0]
    male = re.findall('<span>GG(.*?) <i', response)[0]
    description = soup.select('div[class="desc _j_description"]')[0].get_text()

    joinlist_tmp = soup.select('.mod-joinlist div ul li .name')
    attention_tmp = soup.select('.mod-attentionUser div ul li a')
    comment_tmp = soup.select('.mod-comment ul div .comm_con')
    # Users who signed up
    joinlist = get_joinlist(joinlist_tmp)
    # Users who follow the post
    attentionlist = get_attentionlist(attention_tmp)
    # Comments
    commentlist = get_commentlist(comment_tmp)
    togetherlist.append([html, title, see, sign, follow, gooff, days, des, fro, num, enrollment, female, male, description, joinlist, attentionlist, commentlist])

	
# Flatten the sign-up list into one string for a single Excel cell
def get_joinlist(html):
    joinlist = []
    for tmp in html:
        joinlist.append(tmp.attrs['href'])  # profile link of each user who signed up
    return '-'.join(joinlist)

	
# Flatten the follower list into one string for a single Excel cell
def get_attentionlist(html):
    attentionlist = []
    for tmp in html:
        attentionlist.append(tmp.attrs['href'])  # profile link of each follower
    return '-'.join(attentionlist)

	
# Flatten the comments into one string for a single Excel cell
def get_commentlist(html):
    commentlist = []
    for tmp in html:
        # Only the comment text is stored; commenter id, name, grade and time are not kept
        comm_word = tmp.select('.comm_word')[0].get_text()
        commentlist.append(comm_word)
    return '@@'.join(commentlist)
	  

# Write the scraped rows to an Excel file
def data_write(file_path, datas):
    print("begin to write:", len(datas))

    workbook = xlsxwriter.Workbook(file_path)  # create the workbook at the path passed in
    worksheet = workbook.add_worksheet('sheet1')  # sheet named "sheet1"
    # Write each record to row i, one field per column j
    for i, data in enumerate(datas):
        for j, value in enumerate(data):
            worksheet.write(i, j, value)

    workbook.close()

	
	
# Entry point: scrape Mafengwo companion posts
if __name__ == '__main__':
    flag = 'flag=3'
    offset = 'offset=0'
    mddid = 'mddid=10442'

    base_url = 'http://www.mafengwo.cn/together/travel/more?' + flag + '&' + offset + '&' + mddid + '&timeFlag=3&timestart='
    print("base_url:", base_url)

    page = get_page(base_url)
    togetherlist = []  # module-level list that get_detail appends to
    get_list(page)
    print("len(togetherlist):", len(togetherlist))

    # Create the Excel file and write the data
    data_write("D:\\python_data\\data_operation\\test2.xlsx", togetherlist)



###########################################

References: