Before scraping news with Python, think through what you need to prepare. Jiemian News (界面新闻) does not require login, so cookie handling can be skipped. We need the following:
1. Install PhantomJS
2. Get your browser's User-Agent string
3. Analyze the markup of the Jiemian News site
For now we only scrape part of the page, as shown in the figure below:
The main work is analyzing the page elements, which I won't walk through in detail. bs4 can be installed with pip (see "how to install pip"). Once pip is installed, run
python3 -m pip install beautifulsoup4
python3 -m pip install selenium
to install bs4 and selenium.
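Before wiring up the scraper, it can help to confirm that both packages import cleanly. A small check sketch (the helper name check_modules is my own, not part of any library):

```python
def check_modules(names):
    """Map each module name to True if it can be imported, else False."""
    status = {}
    for name in names:
        try:
            __import__(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status

# After a successful install, both entries should be True
print(check_modules(["bs4", "selenium"]))
```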
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# Browser User-Agent string
UA = 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:31.0) Gecko/20100101 Firefox/31.0'
ua = dict(DesiredCapabilities.PHANTOMJS)
ua["phantomjs.page.settings.userAgent"] = UA
driver = webdriver.PhantomJS(executable_path='/home/zhonglushu/software/phantomjs-2.1.1/bin/phantomjs', desired_capabilities=ua)
driver.get('http://www.jiemian.com/')
soup = BeautifulSoup(driver.page_source, 'html.parser')
ul = soup.find(class_='bxslider')
liSet = set()
# Iterate over the <li> tags only; ul.children would also yield text nodes
for li in ul.find_all('li'):
    img = li.find('img')
    if img is None:  # bs4 find() returns None, not -1, when a tag is missing
        continue
    # Image
    if 'src' in img.attrs:
        imgUrl = img.attrs['src']
        if imgUrl in liSet:
            continue
        print(imgUrl)
        liSet.add(imgUrl)
    div = li.find(class_='slider-header')
    if div is None:
        continue
    aTag = div.find('a')
    if aTag is None:
        continue
    # Title
    title = aTag.string
    print(title)
    if 'href' in aTag.attrs:
        href = aTag.attrs['href']
        print(href)
    # Comment count
    comment = li.find(class_='comment')
    if comment is None:
        continue
    emTag = comment.find('em')
    if emTag is None:
        continue
    print(emTag.string)
    # Read/collect count
    collect = li.find(class_='collect')
    if collect is None:
        continue
    emTag = collect.find('em')
    if emTag is None:
        continue
    print(emTag.string)
    print('\n')
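The extraction logic above can be exercised without a browser at all. Here is a stdlib-only sketch using html.parser on a hand-written snippet that mirrors the slider markup (the HTML and URLs below are made up for illustration):

```python
from html.parser import HTMLParser

# Minimal stand-in for the bs4 logic: collect img src and link href
# attributes from a static HTML snippet shaped like the slider markup.
class SliderParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.img_urls = []
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        d = dict(attrs)
        if tag == "img" and "src" in d:
            self.img_urls.append(d["src"])
        elif tag == "a" and "href" in d:
            self.hrefs.append(d["href"])

html = ('<ul class="bxslider"><li>'
        '<a href="http://www.jiemian.com/article/1.html">'
        '<img src="http://img.jiemian.com/1.jpg"></a></li></ul>')
p = SliderParser()
p.feed(html)
print(p.img_urls)  # ['http://img.jiemian.com/1.jpg']
print(p.hrefs)     # ['http://www.jiemian.com/article/1.html']
```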
The ul class in the Jiemian News HTML above has since changed; I will keep watching to see whether it changes dynamically. Below is a version with the updated class, run on a Mac; the other platforms are essentially the same:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# Helper: bs4 find() returns None for a missing tag, while str.find()
# on a text node returns -1, so treat both as invalid
def invalidFc(tag):
    return tag == -1 or tag is None

# Browser User-Agent string
UA = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36'
ua = dict(DesiredCapabilities.PHANTOMJS)
ua["phantomjs.page.settings.userAgent"] = UA
driver = webdriver.PhantomJS(executable_path='/Users/rambo.huang/bin/phantomjs-2.1.1-macosx/bin/phantomjs', desired_capabilities=ua)
driver.get('http://www.jiemian.com/')
soup = BeautifulSoup(driver.page_source, 'html.parser')
ul = soup.find(class_='slider-body')
if invalidFc(ul):
    print('page source has no slider-body')
else:
    liSet = set()
    liArray = ul.find_all("li")
    for li in liArray:
        img = li.find('img')
        if invalidFc(img):
            continue
        # Image
        if 'src' in img.attrs:
            imgUrl = img.attrs['src']
            if imgUrl in liSet:
                continue
            print(imgUrl)
            liSet.add(imgUrl)
        div = li.find(class_='slider-header')
        if not invalidFc(div):
            aTag = div.find('a')
            if not invalidFc(aTag):  # original checked div again by mistake
                # Title
                title = aTag.string
                print(title)
                if 'href' in aTag.attrs:
                    href = aTag.attrs['href']
                    print(href)
        commentFooter = li.find(class_='slider-footer')
        if invalidFc(commentFooter):
            continue
        # Comment count
        comment = commentFooter.find(class_='comment')
        if comment is not None:
            emTag = comment.find('em')
            if emTag is not None:
                print(emTag.string)
        # Read/collect count
        collect = li.find(class_='collect')
        if collect is not None:
            emTag = collect.find('em')
            if emTag is not None:
                print(emTag.string)
        print('\n')
After scraping the data, you can refer to "Python database access with PyMySQL" to store it in a database.
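As a sketch of that storage step, the snippet below saves one scraped item using the stdlib sqlite3 module so it runs anywhere; swap the connection for pymysql.connect(...) to target a real MySQL instance. The table and column names here are hypothetical.

```python
import sqlite3

# In-memory database as a stand-in for MySQL; the schema mirrors the
# fields printed by the scraper (image URL, title, link, counts).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE news (
    img_url TEXT, title TEXT, href TEXT, comments TEXT, reads TEXT)""")
item = ("http://img.jiemian.com/1.jpg", "Sample title",
        "http://www.jiemian.com/article/1.html", "12", "345")
conn.execute("INSERT INTO news VALUES (?, ?, ?, ?, ?)", item)
conn.commit()
print(conn.execute("SELECT title FROM news").fetchone()[0])  # Sample title
```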