Today we'll scrape the Douban Top 250 movie chart and store the results in MySQL.

Target: the Douban Top 250 page (https://movie.douban.com/top250)
Open the browser developer tools (F12) and you can see that all the information we need sits under the `<li>` tags. Let's start by scraping a single page.
First, import the libraries we need. All three can be installed directly with pip, so I won't go into detail:

```python
import requests
from bs4 import BeautifulSoup
import pymysql
```
Define the URL and request headers:

```python
url = 'https://movie.douban.com/top250'
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"
}
```
Call the requests library's `get` method to mimic a browser visiting Douban and fetch the page content:

```python
req = requests.get(url=url, headers=headers)
html_data = req.text
```
Create a BeautifulSoup object (passing the parser explicitly avoids a warning and matches the `'lxml'` parser used later):

```python
bs = BeautifulSoup(html_data, 'lxml')
```
Now parse out the fields we need:

```python
movie_list = bs.find(class_='grid_view').find_all('li')
for item in movie_list:
    item_name = item.find(class_='title').string          # movie title
    item_img = item.find('a').find('img').get('src')      # poster image URL
    item_index = item.find(class_='').string              # rank (lives in an <em class=""> tag)
    item_score = item.find(class_='rating_num').string    # rating
    item_author = item.find('p').text                     # director / cast line
    item_intr = item.find(class_='inq').string            # one-line review
```
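These selectors are easy to try offline against a minimal HTML fragment that mimics Douban's markup. The fragment below is made up for illustration, not Douban's actual page; the rank is placed in an `<em class="">` tag, which is what `find(class_='')` in the real scraper targets:

```python
from bs4 import BeautifulSoup

# A made-up fragment using the same class names the scraper relies on
html = '''
<ol class="grid_view">
  <li>
    <em class="">1</em>
    <a href="#"><img src="poster.jpg"></a>
    <span class="title">The Shawshank Redemption</span>
    <span class="rating_num">9.7</span>
    <span class="inq">Hope is a good thing.</span>
  </li>
</ol>
'''
bs = BeautifulSoup(html, 'html.parser')   # stdlib parser is enough here
item = bs.find(class_='grid_view').find('li')
print(item.find(class_='title').string)       # The Shawshank Redemption
print(item.find(class_='rating_num').string)  # 9.7
print(item.find('em').string)                 # 1
```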
We've now scraped one page of content. Next, let's write it into the database. First, connect to MySQL:

```python
connect = pymysql.Connect(
    host='localhost',
    port=3306,
    user='root',
    password='123456',
    db='douban',
    charset='utf8'
)
```
I'm using Navicat: the host is localhost and the port is 3306. I created a database named douban and a table named movie_imfromation to hold the scraped data **(I was typing too fast here — the word should be "information")**.
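For reference, the table can also be created with plain SQL instead of Navicat's GUI. This is only a sketch: the table and column names are fixed by the INSERT statements used later, but the column types are my own guesses:

```python
# Column types are assumptions; table/column names match the article's INSERTs.
create_sql = """
CREATE TABLE IF NOT EXISTS movie_imfromation (
    id     INT AUTO_INCREMENT PRIMARY KEY,
    name   VARCHAR(255),
    img    VARCHAR(512),
    score  VARCHAR(16),
    author TEXT,
    intr   VARCHAR(512)
) DEFAULT CHARSET = utf8;
"""
# cursor.execute(create_sql)  # run once against the douban database
print(create_sql.strip().splitlines()[0])
```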
Get a cursor: `cursor = connect.cursor()`
Write the scraped content into the database with an SQL statement:

```python
sql = "INSERT INTO movie_imfromation(name,img,score,author,intr) VALUES ('%s','%s','%s','%s','%s');" % (
    item_name, item_img, item_score, item_author, item_intr)
cursor.execute(sql)
cursor.close()
connect.commit()
connect.close()
```
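As a side note, pymysql can also fill in the values itself via a parameterized query, which quotes each value safely. A sketch with made-up sample values (`cursor` would be the cursor obtained above):

```python
# Sample values standing in for one scraped movie (made up for illustration)
item_name, item_img, item_score = "肖申克的救赎", "https://example.com/p.jpg", "9.7"
item_author, item_intr = "导演: Frank Darabont", "It's about hope."

sql = "INSERT INTO movie_imfromation(name,img,score,author,intr) VALUES (%s,%s,%s,%s,%s);"
params = (item_name, item_img, item_score, item_author, item_intr)
# cursor.execute(sql, params)  # pymysql escapes each value, so embedded
#                              # single quotes cannot break the statement
print(sql.count('%s'), len(params))  # 5 5
```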
Full code:
```python
import requests
from bs4 import BeautifulSoup
import pymysql

connect = pymysql.Connect(
    host='localhost',
    port=3306,
    user='root',
    password='123456',
    db='douban',
    charset='utf8'
)

url = 'https://movie.douban.com/top250'
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"
}
req = requests.get(url=url, headers=headers)
html_data = req.text
bs = BeautifulSoup(html_data, 'lxml')

movie_list = bs.find(class_='grid_view').find_all('li')
for item in movie_list:
    item_name = item.find(class_='title').string
    item_img = item.find('a').find('img').get('src')
    item_index = item.find(class_='').string
    item_score = item.find(class_='rating_num').string
    item_author = item.find('p').text
    item_intr = item.find(class_='inq').string
    print('Scraped movie: ' + item_index + ' | ' + item_name + ' | ' + item_img + ' | ' +
          item_score + ' | ' + item_author + ' | ' + item_intr)
    cursor = connect.cursor()
    sql = "INSERT INTO movie_imfromation(name,img,score,author,intr) VALUES ('%s','%s','%s','%s','%s');" % (
        item_name, item_img, item_score, item_author, item_intr)
    cursor.execute(sql)
    cursor.close()
    connect.commit()
connect.close()
```
Result:
Next, a function-based version:

```python
import requests
from bs4 import BeautifulSoup
import pymysql

connect = pymysql.Connect(
    host='localhost',
    port=3306,
    user='root',
    password='123456',
    db='douban',
    charset='utf8'
)

# Fetch the page HTML
def requests1(url, headers):
    req = requests.get(url=url, headers=headers)
    return req.text

# Parse the content and write each movie to the database
def bs_douban(soup):
    movie_list = soup.find(class_='grid_view').find_all('li')
    for item in movie_list:
        item_name = item.find(class_='title').string
        item_img = item.find('a').find('img').get('src')
        item_index = item.find(class_='').string
        item_score = item.find(class_='rating_num').string
        item_author = item.find('p').text
        item_author = str(item_author).replace("'", "''")   # escape single quotes for SQL
        if item.find(class_='inq') is not None:
            item_intr = item.find(class_='inq').string
            item_intr = str(item_intr).replace("'", "''")
        else:
            item_intr = None                                # some movies have no one-line review
        cursor = connect.cursor()
        sql = "INSERT INTO movie_imfromation(name,img,score,author,intr) VALUES ('%s','%s','%s','%s','%s');" % (
            item_name, item_img, item_score, item_author, item_intr)
        print(sql)
        cursor.execute(sql)
        cursor.close()
        connect.commit()

def main(page):
    url = 'https://movie.douban.com/top250?start=' + str(page * 25) + '&filter='
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"
    }
    html_data = requests1(url, headers)
    soup = BeautifulSoup(html_data, 'lxml')
    bs_douban(soup)

if __name__ == '__main__':
    for i in range(0, 10):
        main(i)
```
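The paging logic in `main()` can be checked in isolation: Douban serves 25 movies per page and takes the offset through the `start` query parameter, so ten pages cover all 250 movies. A small standalone sketch:

```python
def page_url(page):
    # page 0 -> start=0, page 1 -> start=25, ..., page 9 -> start=225
    return 'https://movie.douban.com/top250?start=' + str(page * 25) + '&filter='

print(page_url(0))  # https://movie.douban.com/top250?start=0&filter=
print(page_url(9))  # https://movie.douban.com/top250?start=225&filter=
```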
While writing the function version I still ran into several problems. For example, some movies' one-line reviews contain a single quote, which breaks the generated SQL statement. The fix is to use `replace` to escape each single quote by doubling it, so the statement executes normally:

```python
item_intr = str(item_intr).replace("'", "''")
```
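The effect of that replace is easy to see in isolation; in SQL, two single quotes inside a quoted string literal stand for one literal quote character:

```python
item_intr = "It's a wonderful life"   # sample review text with a single quote
escaped = item_intr.replace("'", "''")
print(escaped)  # It''s a wonderful life
```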
Another issue: further down the chart, some movies have no one-line review at all, which raised an error and prevented the row from being written to the database. The fix is a check: if the tag is missing, assign `None` so the SQL statement still executes.

```python
if item.find(class_='inq') is not None:
    item_intr = item.find(class_='inq').string
    item_intr = str(item_intr).replace("'", "''")
else:
    item_intr = None
```
I'm new to web scraping, so if anything in this post is wrong, corrections are very welcome.