A Simple Little Crawler Using BeautifulSoup

Latest recommended article published on 2024-10-18 17:17:14

frostmourne_lk

Article link: https://blog.csdn.net/Frostmourne_LK/article/details/78422974

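I recently picked up some Python basics by working through the beginner Python tutorial on runoob. I read online that BeautifulSoup makes it easy to try writing a crawler, so I wrote one for Baidu Tieba as practice. It counts as a small crawler, more or less.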

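Installing BeautifulSoup

Download it from the official site, unpack it, and install it with Python.
Official site: https://www.crummy.com/software/BeautifulSoup/#Download
For the details, just search online; there are plenty of guides.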

Crawling module

Tieba URLs are fairly easy to build by string concatenation, which is why a lot of people use Tieba for practice.

def start(self):
    # The pn offset advances in steps of 50; // keeps the page count an integer.
    for i in range(self.topic_limit // 50):
        self.spide_listpage(i * 50)

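Since I wanted to page through the thread list, the page offset is appended to the URL in this format. The loop calls the method once per page, stepping the offset by 50 each time.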
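
The paging logic above can be tried on its own without touching the network. The following is a minimal sketch, assuming the offset is passed as `&pn=` in steps of 50 as in the code above; the function name `page_urls` and the base URL are made up for illustration and are not taken from the original crawler.

```python
def page_urls(base_url, topic_limit, page_size=50):
    """Return one list-page URL per full block of `page_size` threads."""
    pages = topic_limit // page_size  # integer division: 120 threads -> 2 full pages
    return [base_url + "&pn=" + str(i * page_size) for i in range(pages)]


# Hypothetical base URL, used here only for illustration.
urls = page_urls("https://tieba.baidu.com/f?kw=python", 150)
for url in urls:
    print(url)
```

Note that a limit smaller than the page size yields no pages at all, so a caller that wants at least the first page would need to round up instead.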

def spide_listpage(self, num):
    url = self.baseUrl + "&pn=" + str(num)
    # urllib2 is Python 2 only; in Python 3 use urllib.request.urlopen.
    html = urllib2.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    # Match on the single class name so stray whitespace in the attribute does not matter.
    topic_list = soup.find_all('a', class_='j_th_tit')
    for topic in topic_list:
        if self.keyword in topic['title']:
            self.theUrl = (self.domain + topic['href']).strip()
            print(topic['title'] + ' ' + self.theUrl)
            # Stop at the first matching thread.
            break

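Here url is the concatenated address and html is the page content fetched from it, which BeautifulSoup then parses. The code finds every a tag carrying the j_th_tit class and prints the title and link of the matching thread.
The idea is simply to look for the right CSS class in the HTML: items of the same kind share the same markup, so one class name picks them all out.

The loop then checks each title for the keyword, prints the first thread that contains it, saves that URL in self.theUrl, and stops.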
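
The class lookup and keyword filter can be tested offline against a small HTML string. This is only a sketch under assumptions: the sample HTML is invented and much simpler than a real Tieba page, the helper name `find_topics` is hypothetical, and unlike the crawler above it collects every match instead of stopping at the first one.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a Tieba list page.
SAMPLE_HTML = """
<ul>
  <li><a class="j_th_tit" href="/p/111" title="python crawler help">python crawler help</a></li>
  <li><a class="j_th_tit" href="/p/222" title="weekend chat">weekend chat</a></li>
  <li><a class="other" href="/p/333" title="python ad">python ad</a></li>
</ul>
"""


def find_topics(html, keyword, domain):
    """Return (title, absolute_url) for thread links whose title contains keyword."""
    soup = BeautifulSoup(html, "html.parser")
    matches = []
    # class_ matches any <a> whose class list includes j_th_tit.
    for link in soup.find_all("a", class_="j_th_tit"):
        title = link.get("title", "")  # tolerate links without a title attribute
        if keyword in title:
            matches.append((title, domain + link["href"].strip()))
    return matches


print(find_topics(SAMPLE_HTML, "python", "https://tieba.baidu.com"))
```

The third link contains the keyword but is skipped because it lacks the j_th_tit class, which is the point of filtering by class first.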

File-writing module

Once the posts are crawled, I simply write them to a txt file.

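import io
import urllib2  # Python 2 only; in Python 3 use urllib.request
from bs4 import BeautifulSoup


class writeInFile:

    def __init__(self, url):
        self.url = url

    def getTheWeb(self):
        html = urllib2.urlopen(self.url).read()
        soup = BeautifulSoup(html, 'html.parser')
        # Each post body sits in a div carrying the d_post_content class.
        context_list = soup.find_all('div', class_='d_post_content')
        for context in context_list:
            self.writeFile(context.text)

    def writeFile(self, text):
        # io.open with an explicit encoding writes Chinese text safely on Python 2 and 3.
        with io.open('spider.txt', 'a', encoding='utf-8') as f:
            f.write(text)
            f.write(u'\n')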
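
The append-to-file step can be checked separately from the crawling. Below is a minimal sketch, assuming the posts are already available as a list of strings; the helper name `append_lines` and the sample texts are made up, and the file goes to a temporary directory rather than the working folder.

```python
import io
import os
import tempfile


def append_lines(path, texts):
    """Append each text as one UTF-8 encoded line to the file at `path`."""
    with io.open(path, "a", encoding="utf-8") as f:
        for text in texts:
            f.write(text + u"\n")


# Use a temporary directory so the example does not touch the working directory.
demo_path = os.path.join(tempfile.mkdtemp(), "spider.txt")
append_lines(demo_path, [u"first post", u"第二楼"])
append_lines(demo_path, [u"third post"])  # a second call adds to the file instead of overwriting

with io.open(demo_path, encoding="utf-8") as f:
    print(f.read())
```

Opening in "a" mode is what lets repeated runs accumulate posts in the same file; opening in "w" mode would keep only the last write.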