Scraping the Douban Movie Top 250 with a Python Crawler
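I. Set up the environment
1. PyCharm IDE
Download and install the PyCharm IDE from the official website.
2. MySQL database
Install MySQL to store the scraped data in a table. It can be downloaded from the official website.
For installation steps, see a MySQL installation tutorial.
3. Navicat for MySQL
Navicat is a powerful MySQL database management and development tool that provides a graphical user interface for MySQL. It can be downloaded from the official website.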
4. Connect to MySQL with pymysql
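As a minimal sketch of this step, the snippet below opens a connection with pymysql and asks the server for its version. The connection settings are the ones used by the crawler script later in this article (local server, user root, database python); the names DB_CONFIG and check_connection are hypothetical helpers introduced only for illustration, and the values will need adjusting for your own installation.

```python
# Connection settings taken from the crawler script later in this article;
# adjust them to match your own MySQL installation.
DB_CONFIG = {
    "host": "localhost",
    "user": "root",
    "password": "123456",
    "database": "python",
    "charset": "utf8",
    "port": 3306,
}


def check_connection(config=DB_CONFIG):
    """Open a connection, ask the server for its version, then close it."""
    # Imported inside the function so the settings can be inspected
    # even on a machine where pymysql is not installed yet.
    import pymysql

    con = pymysql.connect(**config)
    try:
        with con.cursor() as cursor:
            cursor.execute("SELECT VERSION()")
            return cursor.fetchone()[0]
    finally:
        con.close()
```

Calling check_connection() with the MySQL server running should return a version string; if it raises an error instead, check that the server is started and that the user, password and database exist.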
5. Install the third-party modules
(1) requests
(2) pymysql
(3) Beautiful Soup 4
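The three modules can usually be installed with pip install requests pymysql beautifulsoup4. The short check below is one possible way to confirm they are available before running the crawler; the helper name missing_modules is hypothetical. Note that Beautiful Soup 4 is installed as beautifulsoup4 but imported as bs4.

```python
import importlib.util

# Import names of the three third-party modules listed above.
# Beautiful Soup 4 is installed as "beautifulsoup4" but imported as "bs4".
REQUIRED = ["requests", "pymysql", "bs4"]


def missing_modules(names=REQUIRED):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are installed.")
```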
II. Run the project
1. Create a project folder and add a new Python package
Implementation:
import re
import requests
import pymysql
from bs4 import BeautifulSoup

# Text file that receives one dict per movie (change the path to suit your machine)
qy = open('C:/Users/轻烟/Desktop/db.txt', mode='a', encoding='utf-8')

headers = {  # Mimic a browser so the request is not rejected
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36',
    'Host': 'movie.douban.com'
}

movies = []  # Collected across all pages
for i in range(10):  # 10 pages x 25 movies = 250; use range(1) to test with the first page only
    url = 'https://movie.douban.com/top250?start=' + str(25 * i)
    r = requests.get(url, headers=headers, timeout=10)  # Give up after 10 seconds
    soup = BeautifulSoup(r.text, "html.parser")  # Built-in parser; other parsers also work
    div_list = soup.find_all('div', class_='item')
    for each in div_list:
        movie = {}
        movie['title'] = each.find('div', class_='hd').a.span.text.strip()
        movie['rank'] = each.find('div', class_='pic').em.text.strip()
        info = each.find('div', class_='bd').p.text.strip()
        # Remove newlines, spaces and non-breaking spaces before matching
        info = info.replace('\n', "").replace(" ", "").replace("\xa0", "")
        # Director: the text between "导演:" and "主演:" (or the year if no cast is listed)
        match = re.search(r'导演:(.+?)(?:主演:|[0-9]{4})', info)
        movie['director'] = match.group(1) if match else ''
        # Release year: the first four-digit number in the text
        movie['release_date'] = re.findall(r'[0-9]{4}', info)[0]
        # The text ends with "year/country/genre"; keep the last field (the genre)
        movie['plot'] = info.rsplit('/', 1)[-1]
        star = each.find('div', class_='star')
        movie['star'] = star.find('span', class_='rating_num').text.strip()
        movies.append(movie)
        print(movie, file=qy)  # Save to the text file
qy.close()

con = pymysql.connect(host='localhost', user='root', password='123456', database='python',
                      charset='utf8', port=3306)
print('Connected ->')
cursor = con.cursor()  # Create a cursor
print('Creating table ->')
# The column is called ranks because RANK is a reserved word in MySQL 8.
# "if not exists" lets the script be run more than once.
# director is widened to char(100) because some entries list several directors.
cursor.execute("""create table if not exists douban
                  ( title char(40),
                    ranks char(40),
                    director char(100),
                    release_date char(40),
                    plot char(100),
                    star char(40))
               """)
print('Table ready, inserting data ->')
for i in movies:
    cursor.execute("insert into douban(title,ranks,director,release_date,plot,star) "
                   "values(%s,%s,%s,%s,%s,%s)",
                   (i['title'], i['rank'], i['director'],
                    i['release_date'], i['plot'], i['star']))
print('Insert finished')
con.commit()  # Commit before closing so the rows are actually saved
cursor.close()
con.close()
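To make the paging and text-parsing steps of the script easier to check without touching the network or the database, they can be pulled out into small functions. The sketch below is one possible way to do this, not the author's code: the names page_urls, parse_info and SAMPLE are hypothetical, the director pattern and the "last field is the genre" rule are illustrative choices, and SAMPLE is a hand-written string imitating the page text rather than scraped data.

```python
import re

BASE_URL = "https://movie.douban.com/top250?start="


def page_urls(pages=10, per_page=25):
    """Build the list-page URLs: start=0, 25, 50, ... one per page."""
    return [BASE_URL + str(per_page * i) for i in range(pages)]


def parse_info(raw):
    """Extract director, year and genre from the text of one entry."""
    # Remove newlines, spaces and non-breaking spaces, as the crawler does.
    info = raw.replace("\n", "").replace(" ", "").replace("\xa0", "")
    # Director: text between "导演:" and "主演:" (or the year if no cast is listed).
    director = re.search(r"导演:(.+?)(?:主演:|[0-9]{4})", info)
    # Year: the first four-digit number.
    year = re.search(r"[0-9]{4}", info)
    return {
        "director": director.group(1) if director else "",
        "release_date": year.group(0) if year else "",
        # The text ends with "year/country/genre"; keep the last field.
        "plot": info.rsplit("/", 1)[-1],
    }


# Hand-written sample imitating the page text (an assumption, not scraped data).
SAMPLE = (
    "导演: 弗兰克·德拉邦特 Frank Darabont\xa0\xa0\xa0主演: 蒂姆·罗宾斯 Tim Robbins /...\n"
    "    1994\xa0/\xa0美国\xa0/\xa0犯罪 剧情"
)

if __name__ == "__main__":
    print(page_urls()[:2])
    print(parse_info(SAMPLE))
```

Keeping the parsing in a function like this means a change in the page layout can be diagnosed by feeding it a saved snippet of text, instead of rerunning the whole crawl.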