It's been a while since I last wrote a web scraper, so here's a beginner-level one to get back into the swing of things.
Mtime Top 100 TV Rankings
- The anti-scraping measures on Mtime's top rankings are relatively simple (well, not really), which makes them a good practice target.
1. Use the selenium library in headless mode to speed up scraping
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('disable-infobars')
options.add_argument('--headless')  # headless mode
# Point webdriver at the chromedriver executable manually
browser = webdriver.Chrome(
    executable_path=r'C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe',
    options=options)
2. Inspect the Mtime page URLs and express the pattern
num = ['', '-2', '-3', '-4', '-5', '-6', '-7', '-8', '-9', '-10']
Name = []
Director = []
Actor = []
Introduction = []
for j in num:
    browser.get('http://www.mtime.com/top/tv/top100/index' + j + '.html')
- `num` holds the URL suffixes for the different pages of the top 100, and a separate list is created for each piece of information we want to scrape.
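The suffix list above can also be generated programmatically instead of being written out by hand; a minimal sketch, using the URL pattern from the code above:

```python
# Build the ten page URLs: page 1 has no suffix, pages 2-10 use "-2" .. "-10".
base = 'http://www.mtime.com/top/tv/top100/index{}.html'
suffixes = [''] + ['-{}'.format(n) for n in range(2, 11)]
urls = [base.format(s) for s in suffixes]

print(urls[0])  # http://www.mtime.com/top/tv/top100/index.html
print(urls[9])  # http://www.mtime.com/top/tv/top100/index-10.html
```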
3. Scrape each page and save the data to a CSV file
import pandas as pd
from selenium.common.exceptions import NoSuchElementException

for i in range(1, 11):
    Name.append(browser.find_element_by_xpath(
        '//*[@id="asyncRatingRegion"]/li[' + str(i) + ']/div[3]/h2/a').text)
    try:
        Director.append(browser.find_element_by_xpath(
            '//*[@id="asyncRatingRegion"]/li[' + str(i) + ']/div[3]/p[1]/a').text)
    except NoSuchElementException:
        Director.append('No director listed~')
    try:
        Actor.append(browser.find_element_by_xpath(
            '//*[@id="asyncRatingRegion"]/li[' + str(i) + ']/div[3]/p[2]/a').text)
    except NoSuchElementException:
        Actor.append('No cast listed~')
    try:
        Introduction.append(browser.find_element_by_xpath(
            '//*[@id="asyncRatingRegion"]/li[' + str(i) + ']/div[3]/p[3]').text)
    except NoSuchElementException:
        Introduction.append('No synopsis available~~')
movies = pd.DataFrame.from_dict(
    {'Title': Name, 'Director': Director, 'Cast': Actor, 'Synopsis': Introduction}, orient='index')
movies.to_csv(r'D:\time100.csv', encoding='utf-8')
- XPath is used to extract the information we need from the page; Chrome's DevTools can copy an element's XPath directly, which is quite handy. Once you spot the pattern, you can substitute the row index as a parameter and scrape the entries in bulk.
- After a few trial runs I found that some titles in the ranking are missing part of their information, which caused errors or misaligned rows in the saved file; on top of that, my flaky WiFi killed the run outright orz... So I added try/except blocks so the program keeps running through these failure cases, and used pandas' DataFrame to structure the data and save it to a file.
4. Complete code
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import pandas as pd

options = webdriver.ChromeOptions()
options.add_argument('disable-infobars')
options.add_argument('--headless')  # headless mode
browser = webdriver.Chrome(
    executable_path=r'C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe',
    options=options)

num = ['', '-2', '-3', '-4', '-5', '-6', '-7', '-8', '-9', '-10']
Name = []
Director = []
Actor = []
Introduction = []

for j in num:
    browser.get('http://www.mtime.com/top/tv/top100/index' + j + '.html')
    for i in range(1, 11):
        Name.append(browser.find_element_by_xpath(
            '//*[@id="asyncRatingRegion"]/li[' + str(i) + ']/div[3]/h2/a').text)
        try:
            Director.append(browser.find_element_by_xpath(
                '//*[@id="asyncRatingRegion"]/li[' + str(i) + ']/div[3]/p[1]/a').text)
        except NoSuchElementException:
            Director.append('No director listed~')
        try:
            Actor.append(browser.find_element_by_xpath(
                '//*[@id="asyncRatingRegion"]/li[' + str(i) + ']/div[3]/p[2]/a').text)
        except NoSuchElementException:
            Actor.append('No cast listed~')
        try:
            Introduction.append(browser.find_element_by_xpath(
                '//*[@id="asyncRatingRegion"]/li[' + str(i) + ']/div[3]/p[3]').text)
        except NoSuchElementException:
            Introduction.append('No synopsis available~~')

movies = pd.DataFrame.from_dict(
    {'Title': Name, 'Director': Director, 'Cast': Actor, 'Synopsis': Introduction}, orient='index')
movies.to_csv(r'D:\time100.csv', encoding='utf-8')
print(movies)
browser.close()