Python Targeted Crawler: Campus Forum Post Information

Introduction

I wrote this small crawler mainly to collect internship postings from the campus forum. It is built mainly on the Requests library, with lxml for parsing and MySQLdb for storage.

Source Code

URLs.py

Its job is to take an initial URL that contains a page parameter and return the list of URLs from the current page number up to pageNum.

import re

def getURLs(url, attr, pageNum=1):
    # build the list of URLs from the current page number up to pageNum
    all_links = []
    try:
        # read the current page number out of the query string, e.g. "page=1" -> 1
        now_page_number = int(re.search(attr + r'=(\d+)', url).group(1))
        for i in range(now_page_number, pageNum + 1):
            # substitute each page index in turn for the original page number
            new_url = re.sub(attr + r'=\d+', attr + '=%s' % i, url)
            all_links.append(new_url)
        return all_links
    except TypeError:
        print "TypeError: attr should be a string."
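For example, with the board URL used later in the crawler, getURLs simply steps the page parameter through the requested range:

# sample usage of getURLs with the board URL from the main crawler below
links = getURLs('http://www.cc98.org/list.asp?boardid=459&page=1&action=', 'page', 3)
for link in links:
    print link
# http://www.cc98.org/list.asp?boardid=459&page=1&action=
# http://www.cc98.org/list.asp?boardid=459&page=2&action=
# http://www.cc98.org/list.asp?boardid=459&page=3&action=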
uni_2_native.py

The Chinese text in the pages fetched from the forum comes back as HTML numeric character references of the form &#XXXX;, so the content has to be converted back to native characters after it is downloaded.

import sys
import re

# Python 2: let unicode strings be handled as utf-8 without explicit encoding
reload(sys)
sys.setdefaultencoding('utf-8')

def get_native(raw):
    # replace every numeric character reference (&#XXXX;) with the character it encodes
    tostring = raw
    while True:
        obj = re.search('&#(.*?);', tostring, flags=re.S)
        if obj is None:
            break
        # group(0) is the full reference, group(1) the decimal code point
        entity, code = obj.group(0), obj.group(1)
        tostring = tostring.replace(entity, unichr(int(code)))
    return tostring
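For reference, the standard library can decode these references too; a minimal sketch, assuming Python 2 (on Python 3 the equivalent would be html.unescape):

# alternative using the standard library's HTMLParser (Python 2);
# unescape() resolves numeric references such as &#20013; as well as named entities
from HTMLParser import HTMLParser

def get_native_stdlib(raw):
    return HTMLParser().unescape(raw)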
Saving to the database: saveInfo.py

Despite the class name saveSqlite, the records are actually written to MySQL through MySQLdb.
# -*- coding: utf-8 -*-

import MySQLdb


class saveSqlite():
    def __init__(self):
        # each post is cached as a dict in this list until toMySQL() flushes it
        self.infoList = []

    def saveSingle(self, author=None, title=None, date=None, url=None, reply=0, view=0):
        # cache one post; all four text fields are required
        if author is None or title is None or date is None or url is None:
            print "No info saved!"
        else:
            singleDict = {}
            singleDict['author'] = author
            singleDict['title'] = title
            singleDict['date'] = date
            singleDict['url'] = url
            singleDict['reply'] = reply
            singleDict['view'] = view
            self.infoList.append(singleDict)

    def toMySQL(self):
        # despite the class name, the data goes to MySQL via MySQLdb
        conn = MySQLdb.connect(host='localhost', user='root', passwd='', port=3306, db='db_name', charset='utf8')
        cursor = conn.cursor()

        # clear the table first so repeated runs do not pile up duplicate rows
        sql = "delete from info"
        cursor.execute(sql)
        conn.commit()

        # batch-insert every cached post in a single executemany call
        sql = "insert into info(title,author,url,date,reply,view) values (%s,%s,%s,%s,%s,%s)"
        params = []
        for each in self.infoList:
            params.append((each['title'], each['author'], each['url'], each['date'], each['reply'], each['view']))
        cursor.executemany(sql, params)

        conn.commit()
        cursor.close()
        conn.close()


    def show(self):
        for each in self.infoList:
            print "author: "+each['author']
            print "title: "+each['title']
            print "date: "+each['date']
            print "url: "+each['url']
            print "reply: "+str(each['reply'])
            print "view: "+str(each['view'])
            print '\n'

if __name__ == '__main__':
    # quick manual test: cache one fake record and write it to the database
    save = saveSqlite()
    save.saveSingle('网', 'aaa', '2008-10-10 10:10:10', 'www.baidu.com', 1, 1)
    # save.show()
    save.toMySQL()
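The insert statement above assumes an info table already exists in db_name; a minimal sketch of a matching schema follows (the column types are an assumption, adjust them to your own data):

# create a table matching the columns used by toMySQL();
# the column types below are an assumption, not taken from the article
import MySQLdb

conn = MySQLdb.connect(host='localhost', user='root', passwd='', port=3306, db='db_name', charset='utf8')
cursor = conn.cursor()
cursor.execute("""
    create table if not exists info (
        `title`  varchar(255),
        `author` varchar(64),
        `url`    varchar(255),
        `date`   varchar(32),
        `reply`  int,
        `view`   int
    )
""")
conn.commit()
cursor.close()
conn.close()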
Main crawler code
import requests
from lxml import etree
from cc98 import uni_2_native, URLs, saveInfo

# forge a request header for the target site; copy these values from your browser's developer tools
headers = {
    'Accept': '',
    'Accept-Encoding': '',
    'Accept-Language': '',
    'Connection': '',
    'Cookie': '',
    'Host': '',
    'Referer': '',
    'Upgrade-Insecure-Requests': '',
    'User-Agent': ''
}
url = 'http://www.cc98.org/list.asp?boardid=459&page=1&action='
cc98 = 'http://www.cc98.org/'

print "get infomation from cc98..."

urls = URLs.getURLs(url, "page", 50)
savetools = saveInfo.saveSqlite()

for url in urls:
    r = requests.get(url, headers=headers)
    html = uni_2_native.get_native(r.text)

    selector = etree.HTML(html)
    content_tr_list = selector.xpath('//form/table[@class="tableborder1 list-topic-table"]/tbody/tr')

    for each in content_tr_list:
        href = each.xpath('./td[2]/a/@href')
        if len(href) == 0:
            continue
        else:
            # each xpath() call returns a list; these single-pass loops just
            # pull out its only element
            for each_href in href:
                link = cc98 + each_href
            title_author_time = each.xpath('./td[2]/a/@title')

            # the title attribute packs title, author and post time on separate lines
            for info in title_author_time:
                info_split = info.split('\n')
                title = info_split[0][1:len(info_split[0])-1]
                author = info_split[1][3:]
                date = info_split[2][3:]

            # the fourth column holds the "reply/view" counts
            hot = each.xpath('./td[4]/text()')
            for hot_num in hot:
                reply_view = hot_num.strip().split('/')
                reply, view = reply_view[0], reply_view[1]
            savetools.saveSingle(author=author, title=title, date=date, url=link, reply=reply, view=view)

print "All got! Now saving to Database..."
# savetools.show()
savetools.toMySQL()
print "ALL CLEAR! Have Fun!"

