Scraping data with Python 3 and saving it to MySQL

I. Results preview:
1. Output: (screenshot omitted)
2. Scraped result: (screenshot omitted)
3. Table structure: (screenshot omitted)
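
The table-structure screenshot isn't preserved, so here is a plausible schema inferred from the INSERT statement used below. Column names keep the spellings used in the code (artical_link, lastest_reply); the types and lengths are assumptions.

# Hypothetical DDL; column types/lengths are guesses, adjust to your data.
import pymysql

ddl = """
CREATE TABLE IF NOT EXISTS hupu_datas (
    id            INT AUTO_INCREMENT PRIMARY KEY,
    title         VARCHAR(255),
    artical_link  VARCHAR(255),
    author        VARCHAR(64),
    author_link   VARCHAR(255),
    create_time   VARCHAR(32),
    lastest_reply VARCHAR(64)
) DEFAULT CHARSET = utf8
"""

conn = pymysql.connect(host='localhost', port=3306, user='root',
                       password='123456', database='hx_users', charset='utf8')
try:
    with conn.cursor() as cur:
        cur.execute(ddl)
    conn.commit()
finally:
    conn.close()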

II. Core code:
1. Imports:

import requests                # HTTP requests
from bs4 import BeautifulSoup  # HTML parsing
import time                    # pause between page requests
import pymysql                 # MySQL client
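
Everything except time is third-party and installs with: pip install requests beautifulsoup4 pymysql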

2. The scraping function:

# Scrape one page of post listings from the Hupu BXJ forum
def get_information(page=0):
    url = 'https://bbs.hupu.com/bxj-postdate-' + str(page + 1)
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36",
        "Referer": "https://bbs.hupu.com/bxj"
    }
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content.decode("utf-8"), "html.parser")
    out = soup.find("ul", attrs={"class": "for-list"})
    datas_list = []
    if out is None:  # layout changed or request was blocked
        return datas_list
    datas = out.find_all('li')
    for data in datas:
        # try/except sits inside the loop so one malformed <li>
        # doesn't silently drop all the items after it
        try:
            title = data.find('a', attrs={"class": "truetit"}).text.split()[0]
            artical_link = "https://bbs.hupu.com" + data.find('a', attrs={"class": "truetit"}).attrs['href']
            author = data.find('a', class_="aulink").text
            author_link = data.find('a', class_="aulink").attrs['href']
            create_time = data.find('a', style="color:#808080;cursor: initial; ").text
            lastest_reply = data.find('span', class_='endauthor').text
        except (AttributeError, IndexError):
            continue  # skip list items missing one of the expected elements
        datas_list.append({"title": title, "artical_link": artical_link,
                           "author": author, "author_link": author_link,
                           "create_time": create_time, "lastest_reply": lastest_reply})
    return datas_list
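
Before wiring in the database, the function can be sanity-checked on its own. A minimal run (the exact output depends on the live page; the selectors assume the page layout at the time of writing):

rows = get_information(0)  # page 1
print(len(rows), "posts")
print(rows[:2])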

3. Saving to MySQL (the core):

if __name__ == "__main__":
    config = {
        'host': 'localhost',
        'port': 3306,
        'user': 'root',        # MySQL user name (assumed 'root'; adjust to your setup)
        'password': '123456',
        'charset': 'utf8',
        'database': 'hx_users',
    }
    connection = pymysql.connect(**config)  # open the connection

    try:
        with connection.cursor() as cur:  # cursor is closed automatically
            for page in range(2):
                print("Scraping page %s" % (page + 1))
                datas = get_information(page)
                for data in datas:
                    cur.execute(
                        "INSERT INTO hupu_datas (title, artical_link, author, author_link, create_time, lastest_reply) VALUES (%s, %s, %s, %s, %s, %s)",
                        (data['title'], data['artical_link'], data['author'],
                         data['author_link'], data['create_time'], data['lastest_reply']))
                time.sleep(1)  # be polite: pause between pages
        connection.commit()    # commit once everything succeeded
    except Exception:
        connection.rollback()  # roll back on any error
    finally:
        connection.close()     # always close the connection
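
As a design note, since each page's rows are collected before inserting, the inner loop can be batched into a single call with pymysql's cursor.executemany; a sketch of the drop-in replacement:

cur.executemany(
    "INSERT INTO hupu_datas (title, artical_link, author, author_link, create_time, lastest_reply) VALUES (%s, %s, %s, %s, %s, %s)",
    [(d['title'], d['artical_link'], d['author'],
      d['author_link'], d['create_time'], d['lastest_reply']) for d in datas])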