Python Crawler Notes

dongge0519

Tags: python, crawler

Original link: http://www.cnblogs.com/Mr-Rice/p/3838677.html

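I'm new to Python crawlers, and just reading about them wasn't teaching me much, so I hacked together one of my own that scrapes a novel from Baidu Tieba.

I don't think it's very practical, but I'm writing it down anyway for future reference and as a keepsake.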

#!/usr/bin/env python2.7
# coding:gbk

import urllib2
import re

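def findurl(i):
    # Pull the thread path (/p/ plus a 10-digit id) out of a matched <a> tag
    # and build the thread URL; see_lz=1 shows only the original poster's posts.
    pattern=re.compile(r'/p/\d{10}')
    Match=re.search(pattern,i).group()
    url='http://tieba.baidu.com'+Match+'?see_lz=1'
    return url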

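def findtitle(i):
    # Match a chapter title; \xb5\xda, \xd5\xc2 and \xa1\xbf are the GBK bytes
    # of the characters 第, 章 and 】, since the page is searched as raw bytes.
    pattern=re.compile('\xb5\xda.+\xd5\xc2.+\xa1\xbf')
    title=re.search(pattern,i).group()
    return title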

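def main():
    name=raw_input('Enter the tieba name: ')
    fo=open(name+'.txt','w+')
    name=urllib2.quote(name)
    url='http://tieba.baidu.com/f/good?kw='+name+'&cid=0&pn='
    # Compile both patterns once. \xb5\xda and \xd5\xc2 are the GBK bytes of 第 and 章.
    link_pattern=re.compile('<a href="/p/\d{10}" title="\xb5\xda.+\xd5\xc2.+" t')
    content_pattern=re.compile('id="post_content.*?>(.*?)</div>',re.S)
    # Walk the listing from the last page (pn=550, hard-coded) back to the first,
    # 50 threads per page.
    for index in xrange(550,-1,-50):
        page=urllib2.urlopen(url+str(index)).read()
        links=re.findall(link_pattern,page)
        # Reverse each page so the chapters are written in reading order.
        for each in reversed(links):
            #fo.writelines(findtitle(each)+'\n')
            article=urllib2.urlopen(findurl(each)).read()
            match=re.search(content_pattern,article)
            if match is None:
                continue  # no post body found, skip this thread
            # group(1) is the text between the opening tag and </div>.
            # The old rstrip('</div>')/lstrip(...) calls strip character sets,
            # not substrings, so they could eat real text.
            text=match.group(1).replace('<br>','\n').strip()
            fo.write(text+'\n')
    fo.close()
    print 'Done!'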

if __name__=='__main__':
    main()
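
For anyone reading this on Python 3, here is a rough sketch of the same three steps (build the listing URL, find the chapter links, pull out the post body) with no network calls, so it can be tried on sample HTML. The names `list_url`, `find_threads` and `extract_content` are made up for this sketch and are not in the script above. It also assumes the page has already been decoded to text and that the tieba name should be percent-encoded as GBK, which is only a guess based on the `# coding:gbk` line.

```python
import re
from urllib.parse import quote

BASE = 'http://tieba.baidu.com'

# Same shapes as the patterns in the script, written against decoded text
# instead of raw GBK bytes (an assumption of this sketch).
THREAD_RE = re.compile(r'<a href="(/p/\d{10})" title="(第.+?章.+?)" t')
CONTENT_RE = re.compile(r'id="post_content.*?>(.*?)</div>', re.S)


def list_url(name, offset):
    """Build the listing URL for one page; GBK quoting is an assumption."""
    return BASE + '/f/good?kw=' + quote(name, encoding='gbk') + '&cid=0&pn=' + str(offset)


def find_threads(listing_html):
    """Return (thread_url, title) pairs for every chapter link on a listing page."""
    return [(BASE + path + '?see_lz=1', title)
            for path, title in THREAD_RE.findall(listing_html)]


def extract_content(thread_html):
    """Return the first post body as plain text, or None if nothing matches."""
    match = CONTENT_RE.search(thread_html)
    if match is None:
        return None
    # group(1) is only the captured body, so no tag stripping is needed.
    return match.group(1).replace('<br>', '\n').strip()


if __name__ == '__main__':
    listing = '<a href="/p/1234567890" title="第1章 开始" target="_blank">'
    thread = '<div id="post_content_1" class="d_post_content">  line one<br>line two</div>'
    print(find_threads(listing))
    print(extract_content(thread))
```

Using the capture group avoids `rstrip('</div>')`, which removes any trailing run of the characters `<`, `/`, `d`, `i`, `v`, `>` rather than the literal suffix, so a post ending in a word like "divided" would lose its last letter.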

Reposted from: https://www.cnblogs.com/Mr-Rice/p/3838677.html