There are plenty of Python crawler tutorials out there;
this article uses scraping blog posts as its example.
1. Beautiful Soup
Beautiful Soup is a Python library whose main purpose is extracting data from web pages.
To save space, installation is left to the reader (typically pip install beautifulsoup4).
Parsers:
The table below lists the main parsers together with their advantages and disadvantages:
Parser | Usage | Advantages | Disadvantages |
---|---|---|---|
Python standard library | BeautifulSoup(markup, "html.parser") | Built in, decent speed, reasonably lenient (in Python 2.7.3+ / 3.2.2+) | Not very lenient in versions before Python 2.7.3 or 3.2.2 |
lxml HTML parser | BeautifulSoup(markup, "lxml") | Very fast, lenient | External C dependency (lxml must be installed) |
lxml XML parser | BeautifulSoup(markup, ["lxml","xml"]) or BeautifulSoup(markup, "xml") | Very fast, the only XML parser currently supported | External C dependency (lxml must be installed) |
html5lib | BeautifulSoup(markup, "html5lib") | Extremely lenient, parses pages the way a browser does, produces valid HTML5 | Very slow, external Python dependency |
lxml is the recommended parser because it is more efficient. In Python 2 versions before 2.7.3 and Python 3 versions before 3.2.2, you must install lxml or html5lib, because the HTML parser built into the standard library of those versions is not stable enough.
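As a small sketch of how the parser is selected (assuming lxml is installed; the markup string here is just an illustration), the parser name is passed as the second argument to the BeautifulSoup constructor:

from bs4 import BeautifulSoup

html = "<html><body><p>Hello, parser</p></body></html>"  # toy markup for illustration
# "lxml" requires the lxml package; "html.parser" needs nothing extra
soup = BeautifulSoup(html, "lxml")
print(soup.p.string)
# Hello, parser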
A quick look at basic usage:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("index.html"))
print(soup.prettify())
soup.title
# <title>The Dormouse's story</title>
soup.title.name
# u'title'
soup.title.string
# u'The Dormouse's story'
soup.title.parent.name
# u'head'
soup.p
# <p class="title"><b>The Dormouse's story</b></p>
soup.p['class']
# u'title'
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
Find the URLs of all the <a> tags in the document:
for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
Extract all of the text content from the document:
print(soup.get_text())
# The Dormouse's story
#
Note: get_text() is particularly handy. I needed to pull out the article body, and I had originally considered using regular expressions (or nltk) to strip out the HTML tag markup, before realizing this one method does it directly.
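As a tiny illustration (not from the original post; the class name is just an example), get_text() flattens nested tags into plain text with no regular expressions involved:

from bs4 import BeautifulSoup

snippet = '<div class="BlogContent"><p>First <b>paragraph</b>.</p><p>Second paragraph.</p></div>'
soup = BeautifulSoup(snippet, "html.parser")
# get_text() concatenates the text of every descendant node
print(soup.get_text())
# First paragraph.Second paragraph.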
2. Enough introduction; below is my project's source code
#!/usr/bin/env python
#coding=utf-8
#
# Copyright 2017 liuxinxing
#
from bs4 import BeautifulSoup
import urllib2
import datetime
import time
import PyRSS2Gen
import re
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
class RssSpider():
    def __init__(self):
        self.myrss = PyRSS2Gen.RSS2(title='OSChina',
                                    link='http://my.oschina.net',
                                    description=str(datetime.date.today()),
                                    pubDate=datetime.datetime.now(),
                                    lastBuildDate=datetime.datetime.now(),
                                    items=[]
                                    )
        self.xmlpath = r'./oschina.xml'
        self.baseurl = "http://www.oschina.net/blog"
        # if os.path.isfile(self.xmlpath):
        #     os.remove(self.xmlpath)

    def useragent(self, url):
        i_headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36",
                     "Referer": 'http://baidu.com/'}
        req = urllib2.Request(url, headers=i_headers)
        html = urllib2.urlopen(req).read()
        return html

    def enterpage(self, url):
        pattern = re.compile(r'\d{4}\S\d{2}\S\d{2}\s\d{2}\S\d{2}')
        rsp = self.useragent(url)
        # print rsp
        soup = BeautifulSoup(rsp, "html.parser")
        # print soup
        timespan = soup.find('div', {'class': 'blog-content'})
        # print timespan
        timespan = str(timespan).strip().replace('\n', '').decode('utf-8')
        # match = re.search(r'\d{4}\S\d{2}\S\d{2}\s\d{2}\S\d{2}', timespan)
        # timestr = str(datetime.date.today())
        # if match:
        #     timestr = match.group()
        # print timestr
        ititle = soup.title.string
        print ititle
        div = soup.find('div', {'class': 'BlogContent'})
        # print type(div)
        doc = div.get_text()
        # print type(doc)
        return ititle, doc

    def getcontent(self):
        rsp = self.useragent(self.baseurl)
        # print rsp
        soup = BeautifulSoup(rsp, "html.parser")
        # print soup
        ul = soup.find('div', {'id': 'topsOfRecommend'})
        # print ul
        for div in ul.findAll('div', {'class': 'box-aw'}):
            # div = li.find('div')
            # print div
            if div is not None:
                alink = div.find('a')
                if alink is not None:
                    link = alink.get('href')
                    print link
                    if self.isbloglink(link):
                        title, doc = self.enterpage(link)
                        self.savefile(title, doc)

    def isbloglink(self, link):
        express = r".*/blog/.*"
        mo = re.search(express, link)
        if mo:
            return True
        else:
            return False

    def savefile(self, title, doc):
        doc = doc.decode('utf-8')
        with open("./data/" + title + ".txt", 'w') as f:
            f.write(doc)


if __name__ == '__main__':
    rssSpider = RssSpider()
    rssSpider.getcontent()
    # rssSpider.enterpage("https://my.oschina.net/diluga/blog/1501203")
The file also contains some RSS-generation code, which can be ignored: the original plan was to generate an RSS feed from the scraped articles, but I later decided to push them straight to my Kindle, so that part went unused.
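For reference, a minimal sketch of how that unused RSS part could have been finished with PyRSS2Gen (the item values below are placeholders, not data from the real spider):

import datetime
import PyRSS2Gen

rss = PyRSS2Gen.RSS2(title='OSChina',
                     link='http://my.oschina.net',
                     description=str(datetime.date.today()),
                     lastBuildDate=datetime.datetime.now(),
                     items=[])
# one RSSItem per scraped article; title/link/text would come from enterpage()
rss.items.append(PyRSS2Gen.RSSItem(
    title='placeholder title',
    link='http://example.com/blog/1',
    description='placeholder article text',
    pubDate=datetime.datetime.now()))
with open('./oschina.xml', 'w') as f:
    rss.write_xml(f, encoding='utf-8')  # serialize the feed to XML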
Reference:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html