Python crawler: connecting to MySQL

This post shows how to fetch the hottest news headlines from Baidu with Python and store them in a locally running MySQL database. It first creates the database and table, then parses the HTML page with PyQuery to extract each item's rank, title, and heat value, and finally connects to the database through the pymysql module to run the inserts. The whole flow is short and direct, and suits readers with some database background.

Preparation

Make sure the local MySQL server is running, then log in from the command line

mysql -u root -p

Install pymysql

pip install pymysql
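A quick way to confirm the install can actually reach the server (this uses the example credentials from this post, root/123456; substitute your own):

import pymysql

# Smoke test: connect and ask the server for its version.
db = pymysql.connect(host='localhost', port=3306, user='root', passwd='123456')
cursor = db.cursor()
cursor.execute('SELECT VERSION();')
print(cursor.fetchone())  # e.g. ('8.0.36',)
db.close()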

Create the database and table

CREATE DATABASE crawls;
-- SHOW DATABASES;
USE crawls;
CREATE TABLE IF NOT EXISTS baiduNews(
    id INT PRIMARY KEY NOT NULL AUTO_INCREMENT,
    ranking VARCHAR(30),
    title VARCHAR(60),
    datetime TIMESTAMP,
    hot VARCHAR(30)
);
-- SHOW TABLES;

Connect to the database with pymysql

db = pymysql.connect(host='localhost', port=3306, user='root', passwd='123456',
                     db='crawls', charset='utf8')
cursor = db.cursor()
cursor.execute(sql_query)
db.commit()

Driving MySQL from Python is fairly simple; with a little database background you can get started right away. Just remember to call commit() at the end: until the transaction is committed, the writes are only buffered and never reach the database. A minimal sketch of the pattern follows.
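The sketch assumes the crawls database and baiduNews table created above, plus the same example credentials; the inserted values are placeholders.

import datetime
import pymysql

db = pymysql.connect(host='localhost', port=3306, user='root',
                     passwd='123456', db='crawls', charset='utf8')
try:
    cursor = db.cursor()
    now = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    # Placeholder row purely for illustration.
    cursor.execute(
        'INSERT INTO baiduNews(ranking, title, datetime, hot) '
        'VALUES (%s, %s, %s, %s)',
        ('1', 'demo title', now, '1000'))
    db.commit()    # without this the row stays in the uncommitted transaction
except pymysql.MySQLError as e:
    db.rollback()  # discard the partial write on error
    print(e.args)
finally:
    db.close()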

Complete example

Crawl the hottest few news headlines from Baidu and store them in the database (make sure the local MySQL server is running first).

'''
Get the hottest news titles on the Baidu page,
then save the data into MySQL.
'''
import datetime

import pymysql
import requests
from pyquery import PyQuery as pq
from requests.exceptions import ConnectionError

URL = 'https://www.baidu.com/s?wd=%E7%83%AD%E7%82%B9'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
    'Upgrade-Insecure-Requests': '1'
}


def get_html(url):
    # Fetch the page, returning None on a non-200 response or connection error.
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except ConnectionError as e:
        print(e.args)
        return None


def parse_html(html):
    # Each table row holds the rank and title in the first cell
    # and the heat value in the second.
    doc = pq(html)
    trs = doc('.FYB_RD table.c-table tr').items()
    for tr in trs:
        index = tr('td:nth-child(1) span.c-index').text()
        title = tr('td:nth-child(1) span a').text()
        hot = tr('td:nth-child(2)').text().strip('"')
        yield {
            'index': index,
            'title': title,
            'hot': hot
        }


def save_to_mysql(items):
    try:
        db = pymysql.connect(host='localhost', port=3306, user='root',
                             passwd='123456', db='crawls', charset='utf8')
        cursor = db.cursor()
        cursor.execute('CREATE TABLE IF NOT EXISTS baiduNews('
                       'id INT PRIMARY KEY NOT NULL AUTO_INCREMENT,'
                       'ranking VARCHAR(30),'
                       'title VARCHAR(60),'
                       'datetime TIMESTAMP,'
                       'hot VARCHAR(30));')
        try:
            for item in items:
                print(item)
                now = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
                # Parameterized query: the driver quotes and escapes the
                # values, so titles containing quotes cannot break the SQL.
                cursor.execute(
                    'INSERT INTO baiduNews(ranking, title, datetime, hot) '
                    'VALUES (%s, %s, %s, %s)',
                    (item['index'], item['title'], now, item['hot']))
                print('Save into mysql')
            db.commit()
        except pymysql.MySQLError as e:
            db.rollback()
            print(e.args)
    except pymysql.MySQLError as e:
        print(e.args)


def check_mysql():
    try:
        db = pymysql.connect(host='localhost', port=3306, user='root',
                             passwd='123456', db='crawls', charset='utf8')
        cursor = db.cursor()
        cursor.execute('SELECT * FROM baiduNews')
        # fetchall() returns the rows; execute() alone only returns a count.
        for row in cursor.fetchall():
            print(row)
    except pymysql.MySQLError as e:
        print(e.args)


def main():
    html = get_html(URL)
    items = parse_html(html)
    save_to_mysql(items)
    # check_mysql()


if __name__ == '__main__':
    main()
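If you would rather insert all rows in a single round trip, pymysql's executemany accepts a sequence of parameter tuples. A sketch under the same assumptions as above (local server, crawls database, baiduNews table); save_batch is a hypothetical helper, not part of the script itself:

import datetime
import pymysql

def save_batch(items):
    # Batch variant of save_to_mysql: build all rows first, then insert
    # them with one executemany call and a single commit.
    db = pymysql.connect(host='localhost', port=3306, user='root',
                         passwd='123456', db='crawls', charset='utf8')
    now = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    rows = [(item['index'], item['title'], now, item['hot']) for item in items]
    try:
        with db.cursor() as cursor:
            cursor.executemany(
                'INSERT INTO baiduNews(ranking, title, datetime, hot) '
                'VALUES (%s, %s, %s, %s)', rows)
        db.commit()
    except pymysql.MySQLError as e:
        db.rollback()
        print(e.args)
    finally:
        db.close()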
