Pumpkin House (南瓜屋)
- Install selenium, and install PhantomJS
- Once the install checks out, skim Selenium's basic usage first
- With that understood, we can start on the first demo
Start by studying the Pumpkin House pages, and read a few stories while you're at it, just to relax... and then the whole day is gone. Just kidding, of course not.
Day Two
Ahem. Today we analyze the page in earnest, starting from the home page. A crawler usually needs the URL pattern first, so we scroll down, keep scrolling, and scroll some more, but a link to a second page never appears. That strongly suggests the content is loaded in the background via Ajax or JS. Sure enough, under the Network panel's XHR tab the URL pattern shows up:
https://story.hao.360.cn/api/recommend/storyList?user_id=8a6b83d87bd37e3fff9d8c5480e1c191&session_id=afb30b4dbfad9cb60a5a25bb834a80b9&action=2&page=1&per_page=10
https://story.hao.360.cn/api/recommend/storyList?user_id=8a6b83d87bd37e3fff9d8c5480e1c191&session_id=afb30b4dbfad9cb60a5a25bb834a80b9&action=1&page=2&per_page=10
https://story.hao.360.cn/api/recommend/storyList?user_id=8a6b83d87bd37e3fff9d8c5480e1c191&session_id=afb30b4dbfad9cb60a5a25bb834a80b9&action=1&page=3&per_page=10
https://story.hao.360.cn/api/recommend/storyList?user_id=8a6b83d87bd37e3fff9d8c5480e1c191&session_id=afb30b4dbfad9cb60a5a25bb834a80b9&action=1&page=4&per_page=10
Spot the pattern?
Apart from the first request's action=2, the only thing that changes from URL to URL is page. As a test, change the first request's action to 1 and fetch it again to see whether it still works. It does.
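The pattern above can be sketched as a small URL builder. The user_id and session_id below are the ones captured from the browser session in the URLs above; substitute your own when reproducing this.

```python
# Build the paginated list-API URLs observed in the Network panel.
# user_id / session_id come from the captured request; yours will differ.
BASE = ('https://story.hao.360.cn/api/recommend/storyList'
        '?user_id=8a6b83d87bd37e3fff9d8c5480e1c191'
        '&session_id=afb30b4dbfad9cb60a5a25bb834a80b9'
        '&action=1&page={page}&per_page=10')

def list_urls(pages):
    """Return the list-API URLs for pages 1..pages (action=1 works for all)."""
    return [BASE.format(page=p) for p in range(1, pages + 1)]

print(list_urls(3)[0])
```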
But to capture the full story text we still have to visit each detail page. Its URL looks like:
https://story.hao.360.cn/story/MtTcQkq5NHOBPD
Simple enough: the only moving part is the string of letters at the end. Inspecting the list API's response shows it returns a data field containing each story's id, so the detail URLs can be assembled directly.
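To make the shape concrete, here is a minimal sketch that pulls the ids out of a list-API response and turns them into detail URLs. The sample payload is a hand-made stand-in; only its data → data → id nesting mirrors what the crawler code below reads.

```python
import json

# Hand-made sample mirroring the real response's data -> data -> id nesting.
sample = json.dumps({
    "data": {"data": [{"id": "MtTcQkq5NHOBPD"}, {"id": "AbCdEfGh123456"}]}
})

def detail_urls(raw_json):
    """Extract story ids and turn them into detail-page URLs."""
    stories = json.loads(raw_json)['data']['data']
    return ['https://story.hao.360.cn/story/' + s['id'] for s in stories]

print(detail_urls(sample))
# → ['https://story.hao.360.cn/story/MtTcQkq5NHOBPD',
#    'https://story.hao.360.cn/story/AbCdEfGh123456']
```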
Good, the whole URL scheme is now worked out, so let's write the code. The code below crawls only one page; extend the range yourself if you want more.
Other tools would work just as well here, e.g. XPath, BeautifulSoup, or re; treat this version as a Selenium primer.
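As a taste of the re route just mentioned, here is a hedged sketch that extracts the title and author with regular expressions instead of Selenium. The HTML fragment is invented; only the class names (title, username) mirror the selectors the Selenium code below targets, and the real page markup may differ.

```python
import re

# Invented HTML fragment mimicking the selectors (.title, .username)
# that the Selenium version targets; the real markup may differ.
html = '<h1 class="title">A Sample Story</h1><span class="username">reader42</span>'

title = re.search(r'class="title">([^<]+)<', html).group(1)
username = re.search(r'class="username">([^<]+)<', html).group(1)
print(title, username)
```

Regexes are brittle against markup changes, which is exactly why the main crawler uses CSS selectors on a rendered page instead.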
Pumpkin_House.py
-----------------------------------------------------------
import json
import time
import random

import requests
from selenium import webdriver


class PumpKinHouse():
    def __init__(self):
        # Pool of user agents: presenting a different browser on each run
        # makes the crawler harder to fingerprint and block.
        user_agents = [
            'Mozilla/5.0 (Windows NT 6.1; rv:50.0) Gecko/20100101 Firefox/50.0',
            'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0',
            'Mozilla/5.0 (X11; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36',
            'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; Trident/5.0)',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/602.2.14 (KHTML, like Gecko) Version/10.0.1 Safari/602.2.14',
            'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
            'Mozilla/5.0 (iPad; CPU OS 10_1_1 like Mac OS X) AppleWebKit/602.2.14 (KHTML, like Gecko) Version/10.0 Mobile/14B100 Safari/602.1',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:49.0) Gecko/20100101 Firefox/49.0',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0',
            'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0',
            'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1',
        ]
        self.headers = {
            'User-Agent': random.choice(user_agents),     # random disguise
            'Connection': 'keep-alive',                   # reuse the connection
            'Host': 'story.hao.360.cn',
            'Referer': 'https://story.hao.360.cn/plaza',  # extra disguise
        }
        self.fp = open('story.txt', 'a+', encoding='utf-8')

    def req_url(self, url):
        # Fetch a list-API page; on connection failure, wait and retry.
        try:
            return requests.get(url=url, headers=self.headers)
        except (ConnectionError, ConnectionRefusedError):
            time.sleep(2)
            return self.req_url(url=url)

    def get_id(self, response):
        # The list API nests the stories under data -> data; collect their ids.
        data = json.loads(response.text)['data']['data']
        return [item['id'] for item in data]

    def get_detail(self, url):
        # PhantomJS renders the JS-driven detail page.
        browser = webdriver.PhantomJS()
        browser.implicitly_wait(2)
        try:
            browser.get(url)
            title = browser.find_element_by_css_selector('.title').text
            username = browser.find_element_by_class_name('username').text
            times = browser.find_element_by_css_selector('.time.fr').text
            content = browser.find_element_by_css_selector('.content.clearfix').text
            print(title)
            self.fp.write('\n'.join([title, username, times, content]))
            self.fp.write('\n' + '=' * 50 + '\n')
        except (ConnectionError, ConnectionRefusedError):
            time.sleep(2)
            self.get_detail(url=url)
        finally:
            browser.quit()

    def main(self):
        # Crawl only page 1; widen the range for more pages (action=1
        # works for every page, as tested above).
        urls = ['https://story.hao.360.cn/api/recommend/storyList?user_id=8a6b83d87bd37e3fff9d8c5480e1c191&session_id=afb30b4dbfad9cb60a5a25bb834a80b9&action=1&page={}&per_page=10'.format(i)
                for i in range(1, 2)]
        for url in urls:
            response = self.req_url(url=url)
            for story_id in self.get_id(response=response):
                self.get_detail('https://story.hao.360.cn/story/' + story_id)
        self.fp.close()


if __name__ == '__main__':
    PKH = PumpKinHouse()
    PKH.main()
The resulting txt file looks like this, and it reads quite well.