Web Crawler Series (Part 1)
Web crawlers are used more and more these days, and crawling is also an important part of data mining. As a scripting language, Python plays a major role in this area. There are plenty of crawler examples online, most of them based on the urllib library, or on the requests and bs4 libraries. Since I wanted to build some crawler projects, I gathered some material to study. Below is a simple web-crawler example (scraping basic movie information from Douban) that illustrates the core functions and usage of these libraries.
Basic Implementation Code
This code was written on 2017-03-23 at 23:06:32. Major sites are redesigned from time to time, so there is no guarantee the code below will still work later; the point here is to share the basic approach and some of my own study techniques.
import requests
from bs4 import BeautifulSoup as bs
from urllib import parse
import time
URL_GET = 'https://movie.douban.com/subject_search'
def url_api():
    """
    Build the URLs for requests; change the range to fetch more pages.
    :return: a generator of URLs.
    """
    for number in range(0, 50):
        page = number * 15  # Douban shows 15 results per page
        param = {'start': page, 'search_text': '科幻'}
        url = '?'.join([URL_GET, '%s']) % parse.urlencode(param)
        yield url
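url_api relies on urllib.parse.urlencode to turn the parameter dict into a percent-encoded query string. A minimal standalone sketch of what it produces (the offset 15, i.e. the second page, is just an example value):

```python
from urllib import parse

URL_GET = 'https://movie.douban.com/subject_search'

# Second page of results: Douban offsets results by 15 per page
param = {'start': 15, 'search_text': '科幻'}
url = '?'.join([URL_GET, parse.urlencode(param)])
print(url)
```

The non-ASCII keyword is percent-encoded automatically, so the resulting URL is safe to pass straight to requests.get.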
def get_response(url):
    res = []
    response = requests.get(url)
    if response.status_code == 200:
        time.sleep(1)  # be polite: pause between requests
        res.append(response)
        return res
    else:
        raise Exception('RequestError')
def local_data(response):
    """
    Return the name, actors, score, and number of raters for each movie,
    skipping movies that have no score yet.
    :param response: a requests.Response object
    :return: a list of (name, actors, score, number of raters) tuples.
    """
    amovie = []
    bsobj = bs(response.text, "lxml")
    movies = bsobj.find_all('div', {'class': 'pl2'})
    for movie in movies:
        name = movie.find('a').get_text().strip('\n').replace(' ', '')
        actors = movie.find('p').get_text().strip('\n').replace(' ', '')
        try:
            comment = movie.find('div').find_all('span')
            score = comment[1].get_text()
            comm = comment[2].get_text()[1:-4]  # strip the surrounding '(...人评价)'
            amovie.append((name, actors, score, comm))
        except (AttributeError, IndexError):
            continue  # unrated movies lack these spans; skip them
    return amovie
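To see what local_data extracts without hitting the network, here is a sketch run against an invented HTML fragment shaped like Douban's 'pl2' result blocks (the movie, spans, and numbers are made up for illustration):

```python
from bs4 import BeautifulSoup

# Invented fragment mimicking a Douban 'pl2' result block
html = '''
<div class="pl2">
  <a>Interstellar</a>
  <p>Matthew McConaughey / Anne Hathaway</p>
  <div><span class="allstar45"></span><span>9.4</span><span>(1500000人评价)</span></div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
block = soup.find('div', {'class': 'pl2'})
name = block.find('a').get_text().strip()
spans = block.find('div').find_all('span')
score = spans[1].get_text()
raters = spans[2].get_text()[1:-4]  # strip '(' and '人评价)'
print(name, score, raters)
```

The `[1:-4]` slice works because the rater count is always wrapped in exactly `(` and `人评价)`.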
def local_movie_page(response):
    """
    Return the name and index page of each movie.
    :param response: a requests.Response object
    :return: a dict mapping movie names to URLs.
    """
    movie_page = {}
    bsobj = bs(response.text, "lxml")
    mov_url = bsobj.find_all('table', {'class': ""})
    for url in mov_url:
        ac = url.find('a')
        name = ac.find('img')
        movie_page[name['alt']] = ac['href']
    return movie_page
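local_movie_page takes the link from the `<a>` tag's href and the title from the nested `<img>`'s alt attribute. A standalone sketch on an invented table fragment (the URL and title are made up):

```python
from bs4 import BeautifulSoup

# Invented fragment shaped like Douban's search-result tables
html = '''
<table class="">
  <tr><td>
    <a href="https://movie.douban.com/subject/1889243/">
      <img alt="Interstellar" src="poster.jpg">
    </a>
  </td></tr>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')
movie_page = {}
for table in soup.find_all('table'):
    ac = table.find('a')
    img = ac.find('img')
    movie_page[img['alt']] = ac['href']  # alt holds the title, href the index page
print(movie_page)
```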
def write_file(filename, diction):
    """
    Write the movie names and their index URLs to a file.
    :param filename: file the movies are saved to
    :param diction: a dict mapping movie names to index URLs
    :return: None
    """
    with open(filename, 'a+', encoding='utf-8') as f:
        for name, url in diction.items():
            f.write('{0}:{1}\n'.format(name, url))

if __name__ == '__main__':
    # Locate each movie's index page
    for url in url_api():
        for response in get_response(url):
            write_file('b.txt', local_movie_page(response))
Error Messages Explained
TypeError: expected string or buffer
Explanation: a type error means a function was passed an argument of the wrong type; check that the types you pass in match the types the function expects.
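A quick, self-contained way to reproduce this class of error (the exact wording varies across versions; Python 3's re module says "expected string or bytes-like object"):

```python
import re

# Works: the pattern is matched against a str
text = 'Rating: 9.4'
print(re.search(r'\d+\.\d+', text).group())

# Fails: a float is passed where a str is expected
try:
    re.search(r'\d+\.\d+', 9.4)
except TypeError as e:
    print('TypeError:', e)
```

In the crawler above, the same error typically appears if you pass a requests.Response object to a function that expects `response.text`.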