There are plenty of Python crawler tutorials out there;
this article uses scraping blog posts as its example.
1. Beautiful Soup
Beautiful Soup is a Python library whose main purpose is extracting data from web pages.
To save space, installation is left to the reader (typically pip install beautifulsoup4).
Parsers:
The table below lists the main parsers together with their advantages and disadvantages:
Parser | Usage | Advantages | Disadvantages |
---|---|---|---|
Python standard library | BeautifulSoup(markup, "html.parser") | Built in, decent speed, reasonably lenient (in Python 2.7.3+ / 3.2.2+) | Not very lenient in versions before Python 2.7.3 or 3.2.2 |
lxml HTML parser | BeautifulSoup(markup, "lxml") | Very fast, lenient | External C dependency (lxml must be installed) |
lxml XML parser | BeautifulSoup(markup, ["lxml","xml"]) or BeautifulSoup(markup, "xml") | Very fast, the only XML parser currently supported | External C dependency (lxml must be installed) |
html5lib | BeautifulSoup(markup, "html5lib") | Extremely lenient, parses pages the way a browser does, produces valid HTML5 | Very slow, external Python dependency |
lxml is the recommended parser because it is more efficient. In Python 2 versions before 2.7.3 and Python 3 versions before 3.2.2, you must install lxml or html5lib, because the HTML parser built into the standard library of those versions is not stable enough.
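As a small sketch of how the parser is selected (assuming lxml is installed; the markup string here is just an illustration), the parser name is passed as the second argument to the BeautifulSoup constructor:

from bs4 import BeautifulSoup

html = "<html><body><p>Hello, parser</p></body></html>"  # toy markup for illustration
# "lxml" requires the lxml package; "html.parser" needs nothing extra
soup = BeautifulSoup(html, "lxml")
print(soup.p.string)
# Hello, parser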
A quick look at basic usage:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("index.html"))
print(soup.prettify())
soup.title
# <title>The Dormouse's story</title>
soup.title.name
# u'title'
soup.title.string
# u'The Dormouse's story'
soup.title.parent.name
# u'head'
soup.p
# <p class="title"><b>The Dormouse's story</b></p>
soup.p['class']
# u'title'
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
Find the URLs of all the <a> tags in the document:
for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
Extract all of the text content from the document:
print(soup.get_text())
# The Dormouse's story
#
Note: get_text() is particularly handy. I needed to pull out the article body, and I had originally considered using regular expressions (or nltk) to strip out the HTML tag markup, before realizing this one method does it directly.
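As a tiny illustration (not from the original post; the class name is just an example), get_text() flattens nested tags into plain text with no regular expressions involved:

from bs4 import BeautifulSoup

snippet = '<div class="BlogContent"><p>First <b>paragraph</b>.</p><p>Second paragraph.</p></div>'
soup = BeautifulSoup(snippet, "html.parser")
# get_text() concatenates the text of every descendant node
print(soup.get_text())
# First paragraph.Second paragraph.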
2. Enough introduction; below is my project's source code
#!/usr/bin/env python
#coding=utf-8
#
# Copyright 2017 liuxinxing
#
from bs4 import BeautifulSoup
import urllib2
import datetime
import time
import PyRSS2Gen
import re
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
class RssSpider():
    def __init__(self):
        self.myrss = PyRSS2Gen.RSS2(title='OSChina',
                                    link='http://my.oschina.net',
                                    description=str(datetime.date.today()),
                                    pubDate=datetime.datetime.now(),
                                    lastBuildDate=datetime.datetime.now(),
                                    items=[]
                                    )
        self.xmlpath = r'./oschina.xml'
        self.baseurl = "http://www.oschina.net/blog"
        # if os.path.isfile(self.xmlpath):
        #     os.remove(self.xmlpath)

    def useragent(self, url):
        i_headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36",
                     "Referer": 'http://baidu.com/'}
        req = urllib2.Request(url, headers=i_headers)
        html = urllib2.urlopen(req).read()
        return html

    def enterpage(self, url):
        pattern = re.compile(r'\d{4}\S\d{2}\S\d{2}\s\d{2}\S\d{2}')
        rsp = self.useragent(url)
        # print rsp
        soup = BeautifulSoup(rsp, "html.parser")
        # print soup
        timespan = soup.find('div', {'class': 'blog-content'})
        # print timespan
        timespan = str(timespan).strip().replace('\n', '').decode('utf-8')
        # match = re.search(r'\d{4}\S\d{2}\S\d{2}\s\d{2}\S\d{2}', timespan)
        # timestr = str(datetime.date.today())
        # if match:
        #     timestr = match.group()
        # print timestr
        ititle = soup.title.string
        print ititle
        div = soup.find('div', {'class': 'BlogContent'})
        # print type(div)
        doc = div.get_text()
        # print type(doc)
        return ititle, doc

    def getcontent(self):
        rsp = self.useragent(self.baseurl)
        # print rsp
        soup = BeautifulSoup(rsp, "html.parser")
        # print soup
        ul = soup.find('div', {'id': 'topsOfRecommend'})
        # print ul
        for div in ul.findAll('div', {'class': 'box-aw'}):
            # div = li.find('div')
            # print div
            if div is not None:
                alink = div.find('a')
                if alink is not None:
                    link = alink.get('href')
                    print link
                    if self.isbloglink(link):
                        title, doc = self.enterpage(link)
                        self.savefile(title, doc)

    def isbloglink(self, link):
        express = r".*/blog/.*"
        mo = re.search(express, link)
        if mo:
            return True
        else:
            return False

    def savefile(self, title, doc):
        doc = doc.decode('utf-8')
        with open("./data/" + title + ".txt", 'w') as f:
            f.write(doc)


if __name__ == '__main__':
    rssSpider = RssSpider()
    rssSpider.getcontent()
    # rssSpider.enterpage("https://my.oschina.net/diluga/blog/1501203")
The file also contains some RSS-generation code, which can be ignored: the original plan was to generate an RSS feed from the scraped articles, but I later decided to push them straight to my Kindle, so that part went unused.
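For reference, a minimal sketch of how that unused RSS part could have been finished with PyRSS2Gen (the item values below are placeholders, not data from the real spider):

import datetime
import PyRSS2Gen

rss = PyRSS2Gen.RSS2(title='OSChina',
                     link='http://my.oschina.net',
                     description=str(datetime.date.today()),
                     lastBuildDate=datetime.datetime.now(),
                     items=[])
# one RSSItem per scraped article; title/link/text would come from enterpage()
rss.items.append(PyRSS2Gen.RSSItem(
    title='placeholder title',
    link='http://example.com/blog/1',
    description='placeholder article text',
    pubDate=datetime.datetime.now()))
with open('./oschina.xml', 'w') as f:
    rss.write_xml(f, encoding='utf-8')  # serialize the feed to XML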
Reference:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html