A Small CSDN Crawler with Python 3.x + BeautifulSoup

Latest recommended article published on 2020-01-14 10:41:46

chengyunlin7120

Original link: https://my.oschina.net/u/2402525/blog/1554297

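Preface

For a long time I had wanted to look into web crawlers. When crawlers come up, Python is what most people recommend, but Python was not my first language. After a few practice projects in Java, I found parsing page content rather tedious, so in the spirit of embracing change I finally worked through the Python basics. While the knowledge is still fresh, I picked CSDN blogs as a target to practice on. I am writing this down to push myself and hold myself accountable, and to share with others who are working hard too.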

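I. Goal

Given a username, fetch the details of that user's CSDN blog articles, including article id, title, body, tags, view count, and whether the article is original, and save the data to a database.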

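II. Analysis

To do a good job, one must first sharpen one's tools. The first step in scraping is to analyze the structure of the pages to be scraped, following the path: entry page -> linked pages -> target page.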

III. Getting the target links

1. Article list page

[Image: article list page]

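2. Getting the links on a list page

[Image]
Inspecting the source of the article list page shows that all articles on the current page sit under a div with id='article_list', so we first get this div and then collect all the article links inside it.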

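import logging
import re
from urllib.request import urlopen

from bs4 import BeautifulSoup

# Assumption: the blog host. The original post does not show how __rootUrl is defined.
__rootUrl = 'http://blog.csdn.net'


def getOnePageLinks(user, no=1):
    """Return the de-duplicated article URLs found on list page `no` of `user`."""
    pageLinks = []
    url = __rootUrl + '/' + user + '/article/list/' + str(no)
    try:
        html = urlopen(url)
        # Name the parser explicitly so bs4 does not have to guess one
        bsObj = BeautifulSoup(html, 'html.parser')
        articleListObj = bsObj.find('div', {'id': 'article_list'})
        # Article links are the anchors whose href ends in a digit
        titleLinkLists = articleListObj.find_all('a', href=re.compile('[0-9]$'))
        for link in titleLinkLists:
            articleUrl = __rootUrl + link.attrs['href']
            if articleUrl not in pageLinks:
                pageLinks.append(articleUrl)
    except Exception as e:
        logging.error('get article link error: %s', e)

    return pageLinks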
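
The link extraction can also be tried offline against a small HTML snippet. The sketch below is illustrative only: the sample markup, the `extract_article_links` name, and the `http://blog.csdn.net` root are assumptions made for the example, not the real CSDN page structure.

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure described above
SAMPLE_HTML = """
<div id="article_list">
  <a href="/someuser/article/details/1001">First post</a>
  <a href="/someuser/article/details/1001">First post (duplicate link)</a>
  <a href="/someuser/article/details/1002">Second post</a>
  <a href="/someuser/article/category">Not an article</a>
</div>
"""


def extract_article_links(html, root_url):
    """Collect unique absolute article URLs from the article_list div."""
    soup = BeautifulSoup(html, 'html.parser')
    container = soup.find('div', id='article_list')
    if container is None:
        # No list container on this page: nothing to collect
        return []
    links = []
    for a in container.find_all('a', href=re.compile(r'[0-9]$')):
        url = urljoin(root_url, a['href'])
        if url not in links:
            links.append(url)
    return links


links = extract_article_links(SAMPLE_HTML, 'http://blog.csdn.net')
print(links)
```

Returning an empty list when the container is missing lets the caller treat "no such div" and "no articles" the same way, without relying on an exception.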

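3. Getting the links from all list pages

Analysis shows that CSDN blog list pages have addresses of the form ${host}/<username>/article/list/<index>, so incrementing the index yields every article link.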

def getAllPageLinks(user):
    """Walk list pages 1, 2, 3, ... until a page yields no article links."""
    pageLinks = []
    index = 1
    while True:
        print('index=' + str(index))
        tempPageLinks = getOnePageLinks(user, index)
        if not tempPageLinks:
            # An empty page means we are past the last list page
            break
        pageLinks += tempPageLinks
        index += 1
    return pageLinks
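
The loop above stops only when a page comes back empty. A variant with a page cap and a pluggable fetch function can be exercised without any network access. This is a minimal sketch: `collect_all_links`, `max_pages`, and the fake page data are hypothetical names introduced for illustration.

```python
def collect_all_links(fetch_page, max_pages=1000):
    """Call fetch_page(1), fetch_page(2), ... until a page is empty.

    fetch_page is any callable that returns a list of links for a page
    number; max_pages is a safety cap so the loop always terminates.
    """
    all_links = []
    for page_no in range(1, max_pages + 1):
        page_links = fetch_page(page_no)
        if not page_links:
            break
        for link in page_links:
            # Skip links already seen on an earlier page
            if link not in all_links:
                all_links.append(link)
    return all_links


# Fake three-page blog used in place of real network requests
FAKE_PAGES = {1: ['a', 'b'], 2: ['c'], 3: ['d', 'a']}
result = collect_all_links(lambda n: FAKE_PAGES.get(n, []))
print(result)
```

With the crawler in this post, the fetch function could be `lambda n: getOnePageLinks(user, n)`.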

IV. Extracting data from the target page

Analyze the HTML of the target page and extract the specific fields.

def getTargetData(targetUrl):
    """Fetch one article page and return its fields as a tuple."""
    html = urlopen(targetUrl)
    bsObj = BeautifulSoup(html, 'html.parser')
    bsInfoObj = bsObj.find('div', {'class': 'container clearfix'})
    title = bsInfoObj.find('h1', {'class': 'csdn_top'}).get_text()
    # Class name spelled 'artical_tag' as in the original code
    tagBar = bsInfoObj.find('div', {'class': 'artical_tag'})
    original = tagBar.find('span', {'class': 'original'}).get_text()
    publishDate = tagBar.find('span', {'class': 'time'}).get_text()
    view = bsInfoObj.find('ul', {'class': 'right_bar'}).find('button').get_text()
    tagsObj = bsInfoObj.find('ul', {'class': 'article_tags clearfix csdn-tracking-statistics'}).find_all('a')
    # Join the tag names into one comma-separated string
    tagsStr = ','.join(value.get_text() for value in tagsObj)

    content = bsInfoObj.find('div', {'id': 'article_content'}).get_text()
    return title, original, publishDate, view, tagsStr, content
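
Everything `get_text()` returns is a string, so a numeric field such as the view count may arrive with surrounding text or whitespace. The post does not show the exact format, so the sketch below simply assumes the count appears as a run of digits somewhere in the text; `parse_count` is a hypothetical helper, not part of the original code.

```python
import re


def parse_count(text):
    """Return the first run of digits in text as an int, or None if absent."""
    match = re.search(r'\d+', text)
    return int(match.group()) if match else None


print(parse_count(' 46 '))
print(parse_count('no digits here'))
```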

V. Storing the data

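# `connection` is a DB-API connection created elsewhere; the post does not show it.
def save(title, original, publishDate, view, tagsStr, content):
    """Insert one article row; errors are logged rather than raised."""
    cursor = connection.cursor()
    try:
        sql = 'INSERT INTO csdnblog (title,copyright,date,view,tags,content) VALUES (%s, %s, %s, %s, %s, %s)'
        cursor.execute(sql, (title, original, publishDate, view, tagsStr, content))
        connection.commit()
    except Exception as e:
        # Use a format placeholder; a bare extra argument breaks log formatting
        logging.error('execute sql error: %s', e)
    finally:
        cursor.close()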
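
The post shows neither the table definition nor how the connection is created. To try the storage step without a database server, the sketch below swaps in Python's built-in `sqlite3` module with an in-memory database. Note the differences: sqlite3 uses `?` placeholders rather than `%s`, the column types are assumptions, and `save_row` is a hypothetical name. Only the table and column names come from the INSERT statement above.

```python
import sqlite3

# In-memory database; column names follow the INSERT statement above,
# column types are assumptions.
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE csdnblog ('
    'id INTEGER PRIMARY KEY AUTOINCREMENT, '
    'title TEXT, copyright TEXT, "date" TEXT, '
    '"view" TEXT, tags TEXT, content TEXT)'
)


def save_row(conn, title, original, publish_date, view, tags_str, content):
    """Insert one article row using a parameterized query."""
    sql = ('INSERT INTO csdnblog (title, copyright, "date", "view", tags, content) '
           'VALUES (?, ?, ?, ?, ?, ?)')
    with conn:  # commits on success, rolls back if the insert raises
        conn.execute(sql, (title, original, publish_date, view, tags_str, content))


save_row(conn, 'Hello', 'original', '2017-10-20', '46', 'python,crawler', 'body text')
rows = conn.execute('SELECT title, tags FROM csdnblog').fetchall()
print(rows)
```

Passing the values as parameters, as the original `save` also does, keeps article text containing quotes from breaking the SQL statement.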

The result:
[Image]

Get the source code

Reposted from: https://my.oschina.net/u/2402525/blog/1554297
