I recently helped a classmate write crawlers that scrape scenic-spot comments from Weibo and from Ctrip. All the examples below use the Fuzimiao (Confucius Temple) scenic spot.
Weibo crawler
Search for the keyword 夫子庙 (Fuzimiao) on Weibo and open the results page. Inspecting it with the browser's developer tools shows that the content is loaded via ajax, so we can capture the request headers and look at the request parameters; the only one that changes from request to request is a single page variable.
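That request can be reconstructed offline by URL-encoding just those two parameters (the containerid value is the one captured from the page; page is the only part that varies):

```python
from urllib.parse import urlencode

base = 'https://m.weibo.cn/api/container/getIndex?'
params = {
    'containerid': '100101B2094655D764AAFC4799_-_weibofeed',  # identifies the Fuzimiao feed
    'page': 2,
}
# urlencode turns the dict into a query string, e.g. '...?containerid=...&page=2'
url = base + urlencode(params)
```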
Now we can scrape the weibos posted at the Fuzimiao location. Note that the first page is structured a little differently, so I start scraping from page 2. Also, Weibo has an anti-crawling limit: you can fetch at most about 150 pages.
The output is a CSV file, which can then be converted to an Excel file by hand. The steps: right-click the CSV file and open it with Notepad, save it as a .txt file, create a new Excel workbook, drag the .txt file in, tidy up the formatting (word wrap and so on), and finally save it as an Excel file.
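A simpler alternative to the manual steps above (not what I originally did): write the CSV with the utf-8-sig encoding, which prepends a byte-order mark so that Excel can open the file directly without garbling the Chinese text. A minimal sketch, with made-up demo data:

```python
import csv

# encoding='utf-8-sig' prepends a UTF-8 BOM so Excel auto-detects the encoding
with open('fuzimiao_demo.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['id', 'created_at', 'text'])
    writer.writeheader()
    writer.writerow({'id': '1', 'created_at': '05-01', 'text': '夫子庙人很多'})
```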
import csv
import time
from urllib.parse import urlencode

import requests
from pyquery import PyQuery as pq

baseurl = 'https://m.weibo.cn/api/container/getIndex?'
headers = {
    'Host': 'm.weibo.cn',
    'Referer': 'https://m.weibo.cn/p/100101B2094655D764AAFC4799',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}

def get_page(page):
    """Fetch one page of the location feed and return the parsed JSON."""
    params = {
        'containerid': '100101B2094655D764AAFC4799_-_weibofeed',
        'page': page
    }
    url = baseurl + urlencode(params)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.json()
    except requests.ConnectionError as e:
        print('ERROR', e.args)

def parse_page(data, page):
    """Extract id, timestamp and text from each weibo and append them to the CSV."""
    if not data or page == 1:
        # page 1 of the feed has a different layout, so it is skipped
        return
    items = data.get('data').get('cards')[0].get('card_group')
    for item in items:
        mblog = item.get('mblog')
        weibo_id = mblog.get('id')
        created_at = mblog.get('created_at')
        text = pq(mblog.get('text')).text()  # strip the HTML tags from the body
        with open('fuzimiao.csv', 'a', encoding='utf-8', newline='') as csvfile:
            fieldnames = ['id', 'created_at', 'text']
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writerow({'id': weibo_id, 'created_at': created_at, 'text': text})

max_page = 150  # Weibo stops serving results after roughly 150 pages

if __name__ == '__main__':
    # start from page 2, since page 1 is structured differently
    for page in range(2, max_page + 1):
        print(page)
        data = get_page(page)
        parse_page(data, page)
        time.sleep(2)  # throttle requests to avoid being blocked
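The extraction path through the JSON (data → cards[0] → card_group → mblog) can be sanity-checked offline against a hand-made stub of the response. The field names below mirror the ones used in the code above, but the values are invented:

```python
# A minimal stub with the same nesting as the getIndex response (values made up)
stub = {
    'data': {
        'cards': [
            {'card_group': [
                {'mblog': {'id': '123', 'created_at': '05-01',
                           'text': '灯会真好看'}},
            ]}
        ]
    }
}

# Same navigation as parse_page, without the network or CSV parts
items = stub.get('data').get('cards')[0].get('card_group')
rows = []
for item in items:
    mblog = item.get('mblog')
    rows.append((mblog.get('id'), mblog.get('created_at'), mblog.get('text')))
```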
Ctrip crawler
I adapted the Ctrip crawler from code someone else posted online and made a few changes. The poiID parameter in the URL identifies the scenic spot; you have to get it yourself by inspecting the element on the corresponding spot's page. The 502 is just an arbitrary maximum page count and can be changed as needed.
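Since those IDs change per scenic spot, the URL construction can be pulled into a small helper so that only the IDs found by inspecting the page need to be swapped out. The parameter names match the query string in the code below; the numeric values are the ones from the borrowed code, not ones I verified:

```python
def build_comment_url(poi_id, district_id, district_ename, resource_id, pagenow):
    # Same AsynCommentView query string as in the loop below,
    # with the per-spot IDs exposed as arguments
    return ("http://you.ctrip.com/destinationsite/TTDSecond/SharedView/AsynCommentView"
            "?poiID=%d&districtId=%d&districtEName=%s&pagenow=%d"
            "&order=3.0&star=0.0&tourist=0.0&resourceId=%d&resourcetype=2"
            % (poi_id, district_id, district_ename, pagenow, resource_id))

url = build_comment_url(75702, 702, 'Yangshuo', 22079, 1)
```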
import csv

import requests
from bs4 import BeautifulSoup

headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
    "Connection": "keep-alive",
    "Host": "you.ctrip.com",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"
}

for i in range(1, 502):  # 502 is an arbitrary upper bound on the page count
    try:
        print(i)
        url = "http://you.ctrip.com/destinationsite/TTDSecond/SharedView/AsynCommentView?poiID=75702&districtId=702&districtEName=Yangshuo&pagenow=%d&order=3.0&star=0.0&tourist=0.0&resourceId=22079&resourcetype=2" % i
        html = requests.get(url, headers=headers)
        html.encoding = "utf-8"
        soup = BeautifulSoup(html.content, "html.parser")  # name the parser explicitly
        for block in soup.find_all(class_="comment_single"):
            text = block.find(class_="heightbox").text
            comment_time = block.find(class_="time_line").text
            with open("fuzimiao_xiecheng.csv", "a", encoding="utf-8", newline="") as csvfile:
                writer = csv.DictWriter(csvfile, fieldnames=["time", "text"])
                writer.writerow({"time": comment_time, "text": text})
    except Exception as e:
        # a page may be missing or malformed; report it and move on
        print("page %d failed: %s" % (i, e))