Scraping cnblogs Articles with Python and Storing Them in a Database

Can Li

Article link: https://blog.csdn.net/weixin_36087895/article/details/113505297

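After learning Python, I wanted to write a crawler that scrapes articles from cnblogs (博客园).

The idea behind a crawler is simple: act like a browser when requesting a page, get the page's HTML, then extract the content you need based on the page structure.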
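As a minimal sketch of the "act like a browser" step, the snippet below builds a request that carries a browser-like User-Agent header. The names `BROWSER_UA`, `build_request`, and `fetch_html` are illustrative and not part of the original listing, and the header value is only an example; `fetch_html` also assumes the site serves UTF-8.

```python
from urllib import request

# Example header value for illustration; any realistic browser User-Agent works.
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/67.0 Safari/537.36"
)


def build_request(url, user_agent=BROWSER_UA):
    """Return a Request that identifies itself with a browser-like User-Agent."""
    req = request.Request(url)
    req.add_header("User-Agent", user_agent)
    return req


def fetch_html(url, timeout=10):
    """Download a page and decode it as UTF-8 (assumed encoding)."""
    with request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Building the request does not touch the network; only fetch_html does.
req = build_request("https://www.cnblogs.com/")
# urllib stores header names capitalized, hence "User-agent" here.
print(req.get_header("User-agent"))
```

Keeping request construction separate from the download makes the header logic easy to check without any network access.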

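The code in this article has three main parts:

1. Read the article links from the cnblogs home page.

https://www.cnblogs.com/ is the cnblogs home page, which lists articles. Parse the page and read each article's link.

This depends on the page structure. Open the page in a browser, view its source with the developer tools, and use the element picker: selecting an element shows which part of the page it corresponds to, so you can find the markup for the block you need.

2. For each article page, parse the page content.

This again depends on the page structure.

3. After parsing a page, insert the needed data into the database.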
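Steps 1 and 2 both come down to matching elements by their position in the page. The sketch below illustrates this on a small hand-written HTML fragment, using the standard library's `html.parser` instead of the BeautifulSoup calls used later in this article, so that it runs without extra packages. The sample markup and its URLs are made up for illustration and only imitate a list of posts whose title links sit inside `<h3>` elements; `TitleLinkParser` and `extract_links` are hypothetical names.

```python
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Collect the href of every <a> that appears inside an <h3>."""

    def __init__(self):
        super().__init__()
        self.in_h3 = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.in_h3 = True
        elif tag == "a" and self.in_h3:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_h3 = False


# Hand-written sample; the real home page markup may differ.
SAMPLE_HTML = """
<div id="post_list">
  <div class="post_item"><div class="post_item_body">
    <h3><a href="https://www.cnblogs.com/example-a/p/1.html">First post</a></h3>
    <p><a href="https://www.cnblogs.com/example-a/">author</a></p>
  </div></div>
  <div class="post_item"><div class="post_item_body">
    <h3><a href="https://www.cnblogs.com/example-b/p/2.html">Second post</a></h3>
  </div></div>
</div>
"""


def extract_links(html):
    """Return the title links found in the given HTML, in document order."""
    parser = TitleLinkParser()
    parser.feed(html)
    return parser.links


print(extract_links(SAMPLE_HTML))
```

Only the links inside `<h3>` are collected; the author link in the `<p>` is ignored, which is the same kind of structural filtering the BeautifulSoup code performs with `find` and `find_all`.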

Database design:

-- Blog system: crawler module

-- 1. Create the database
drop database if exists blog_service_spider; -- drops the database without prompting
create database blog_service_spider; -- create the database
use blog_service_spider; -- select the database

-- table structure for table `spider_page`
drop table if exists `spider_page`;
create table `spider_page` (
  `id` varchar(60) not null comment 'primary key',
  `create_time` datetime default current_timestamp comment 'creation time',
  `creator` varchar(60) not null comment 'creator id',
  `modified_time` datetime default null on update current_timestamp comment 'modification time',
  `modifier` varchar(60) default null comment 'modifier id',
  `title` varchar(100) default null comment 'article title',
  `title_url` varchar(100) default null comment 'article URL',
  `content` text default null comment 'article content',
  `post_time` datetime default null comment 'article publication time',
  `author` varchar(100) default null comment 'author',
  `author_page` varchar(100) default null comment 'author home page',
  primary key (`id`)
) engine=innodb default charset=utf8 comment='scraped articles';
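To try the insert logic without a MySQL server, here is a minimal sketch that uses the standard library's `sqlite3` with an in-memory database as a stand-in. It is not the MySQL setup used in this article: the table keeps only a few of the columns above, SQLite uses `?` placeholders where pymysql uses `%s`, and the `'YYYY-MM-DD HH:MM'` format for the scraped publication time is an assumption. `open_demo_db` and `save_page` are hypothetical helper names, and the URL is a made-up example.

```python
import sqlite3
import uuid
from datetime import datetime


def open_demo_db():
    """Create an in-memory database with a reduced spider_page table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """create table spider_page (
               id text primary key,
               creator text not null,
               title text,
               title_url text,
               post_time text
           )"""
    )
    return conn


def save_page(conn, title, title_url, post_time_text, creator="admin"):
    """Insert one scraped page using a parameterized query; return its id."""
    # Assumed text format of the scraped time: 'YYYY-MM-DD HH:MM'.
    post_time = datetime.strptime(post_time_text.strip(), "%Y-%m-%d %H:%M")
    page_id = str(uuid.uuid1())
    with conn:  # commits on success, rolls back if the insert raises
        conn.execute(
            "insert into spider_page (id, creator, title, title_url, post_time) "
            "values (?, ?, ?, ?, ?)",
            (page_id, creator, title, title_url, post_time.isoformat(sep=" ")),
        )
    return page_id


conn = open_demo_db()
new_id = save_page(
    conn, "Hello", "https://www.cnblogs.com/example/p/1.html", "2018-07-27 14:36"
)
row = conn.execute(
    "select title, post_time from spider_page where id = ?", (new_id,)
).fetchone()
print(row)
```

Passing values as query parameters rather than formatting them into the SQL string lets the driver handle quoting, which matters here because scraped titles and content can contain quotes.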

The Python code is as follows:

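'''
File Name: webspider
Author: tim
Date: 2018/7/27 14:36
Description: Web crawler. Scrapes the articles on the cnblogs home page and stores them in a database.
Stored fields: title, author, publication time, article content, article URL, author home page
'''
from urllib import request
import ssl
import uuid

from bs4 import BeautifulSoup
import pymysql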

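# Fetch the given url and convert the returned page into a BeautifulSoup object
def html_parser(url):
    # Disable HTTPS certificate verification (insecure; acceptable only for experiments)
    ssl._create_default_https_context = ssl._create_unverified_context
    req = request.Request(url)  # build the request
    # Send a browser-like User-Agent header so the request is less likely to be blocked
    req.add_header('User-Agent',
                   'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36')
    resp = request.urlopen(req)  # send the request
    html = resp.read().decode('utf-8')  # decode the response bytes
    bf = BeautifulSoup(html, "html.parser")  # convert the page into a BeautifulSoup object
    return bf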

# Parse a single article page
def read_page(url):
    bf = html_parser(url)  # get the BeautifulSoup object
    # Content analysis
    post_info = bf.find('div', class_='post')  # the useful region; the lookups below search inside it
    title = post_info.find(id='cb_post_title_url').get_text()  # article title
    title_url = post_info.find(id='cb_post_title_url')['href']  # article URL
    content = post_info.find(id='cnblogs_post_body')  # article content
    postdate = post_info.find(id='post-date').get_text()  # article publication time
    author = post_info.find('div', class_='postDesc').find('a').get_text()  # author
    author_page = post_info.find('div', class_='postDesc').find('a')['href']  # author home page
    # Uncomment to inspect the extracted values:
    # print(title)
    # print(title_url)
    # print(content)
    # print(postdate)
    # print(author)
    # print(author_page)
    # After parsing each page, insert its content into the database
    operate_db(title, title_url, content, postdate, author, author_page)

# Parse the article list on the cnblogs home page
def read_post_list():
    bf = html_parser('https://www.cnblogs.com/')
    post_list = bf.find(id='post_list').find_all('div', class_="post_item")
    for post in post_list:
        page_url = post.find('div', class_='post_item_body').h3.a['href']
        # Read each article's url and parse that page
        read_page(page_url)

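# Database operations
def operate_db(title, title_url, content, postdate, author, author_page):
    # Open the database connection (keyword arguments; recent PyMySQL versions reject positional ones)
    conn = pymysql.connect(host='localhost', user='root', password='root',
                           database='blog_service_spider', charset='utf8mb4')
    # Get a cursor with cursor()
    cursor = conn.cursor()
    # The SQL to execute
    insert_sql = "insert into spider_page (id,creator,title,title_url,content,post_time,author,author_page) values(%s,%s,%s,%s,%s,%s,%s,%s)"
    # Generated ID (named page_id to avoid shadowing the built-in id)
    page_id = str(uuid.uuid1())
    # Article content as an HTML string
    str_content = str(content)
    # Creator
    creator = 'admin'
    try:
        cursor.execute(insert_sql,
                       (page_id, creator, title, title_url, str_content, postdate, author, author_page))  # execute the SQL statement
        conn.commit()  # commit the transaction
    except Exception as e:
        # Roll back if executing the SQL statement fails
        conn.rollback()
        print(e)
    finally:
        # Runs whether or not the try block raised an exception:
        # close the cursor and the database connection
        cursor.close()
        conn.close()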

# start
if __name__ == '__main__':
    read_post_list()
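One caveat about the listing above: BeautifulSoup's `find` returns `None` when an element is missing, so a page with a different layout raises an `AttributeError` in `read_page` and stops the whole run. A possible way to keep going is sketched below. This is not part of the original code: `crawl_all` and `demo_handler` are hypothetical names, the pause between requests is an added courtesy rather than something the article describes, and the demo uses placeholder strings instead of real URLs.

```python
import time


def crawl_all(urls, handle_page, delay_seconds=1.0):
    """Call handle_page(url) for each URL; return (url, error) pairs for failures."""
    failures = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # pause between requests
        try:
            handle_page(url)
        except Exception as exc:
            # Record the failure and continue with the next page
            failures.append((url, exc))
    return failures


def demo_handler(url):
    """Stand-in for read_page: fails on one placeholder URL."""
    if url.endswith("broken"):
        raise AttributeError("expected element not found")


failed = crawl_all(["page-1", "page-broken", "page-3"], demo_handler, delay_seconds=0)
print([url for url, _ in failed])
```

In the article's code, `read_page` would take the place of `demo_handler`, with the URLs collected in `read_post_list` passed in as `urls`.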
