Goal: scrape the basic info for the latest US TV shows, 5 fields per show. url: https://www.meijutt.com/new100.html
Environment: Windows 10
Enough talk, let's get started~
1. Create the project
scrapy startproject meiju100
scrapy genspider meiju meijutt.com  # spider name, then the site's domain
2. Fill in items.py to define the information you want to scrape
The field names can be whatever you like
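For instance, matching the five fields the spider below fills in, items.py might look like this (a sketch; the field names come from the spider code in this post, and the comments are my guesses at what each field holds):

```python
# items.py - declare one Field per piece of information to scrape
import scrapy

class Meiju100Item(scrapy.Item):
    storyName = scrapy.Field()   # show title
    storyState = scrapy.Field()  # airing state
    category = scrapy.Field()    # genre
    tvStation = scrapy.Field()   # broadcasting network
    updateTime = scrapy.Field()  # last update time
```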
3. Write the spider, meiju.py; this is where the parsing logic goes
import scrapy
from meiju100.items import Meiju100Item

class MeijuSpider(scrapy.Spider):
    name = 'meiju'
    # allowed_domains should list domains, not a full URL with a path
    allowed_domains = ['meijutt.com']
    start_urls = ['http://www.meijutt.com/new100.html']

    def parse(self, response):
        items = []
        selector = response.xpath('//ul[@class="top-list fn-clear"]/li')
        for li in selector:
            item = Meiju100Item()
            item['storyName'] = li.xpath('./h5/a/text()').extract()
            item['storyState'] = li.xpath('./span[1]/font/text()').extract()
            item['category'] = li.xpath('./span[2]/text()').extract()
            item['tvStation'] = li.xpath('./span[3]/text()').extract()
            item['updateTime'] = li.xpath('./div[2]/font/text()').extract()
            if not item['updateTime']:
                # fall back to the div's own text when there is no <font>
                item['updateTime'] = li.xpath('./div[2]/text()').extract()
            items.append(item)
        return items
The if is needed because some entries don't match the first XPath completely.
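The fallback works because `extract()` returns a list, and an empty list is falsy. Seen in isolation (the values here are hypothetical placeholders, just to show the idiom):

```python
# extract() returns a list of matched strings; an empty list means no match
primary = []                # e.g. li.xpath('./div[2]/font/text()').extract()
fallback = ['2017-09-21']   # e.g. li.xpath('./div[2]/text()').extract()

# Equivalent to the if/else in the spider: use the fallback when primary is empty
update_time = primary or fallback
print(update_time)  # ['2017-09-21']
```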
4. Define a downloader middleware that sets a random User-Agent on each request, to avoid getting banned
from meiju100.UA import ua_list
import random

class UserAgentmiddleware(object):
    def process_request(self, request, spider):
        agent = random.choice(ua_list)
        request.headers['User-Agent'] = agent
UA.py just defines a list of User-Agent strings.
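A minimal sketch of that UA.py (the specific User-Agent strings below are illustrative examples, not from the original post):

```python
# UA.py - a pool of User-Agent strings for the middleware to pick from
ua_list = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/603.3.8 "
    "(KHTML, like Gecko) Version/10.1.2 Safari/603.3.8",
    "Mozilla/5.0 (X11; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0",
]
```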
5. Define the item pipeline (to store the data in MongoDB)
This follows the official docs, with a few tweaks
import pymongo

class Meiju100Pipeline(object):
    def __init__(self, host, port, db_name, doc_name):
        client = pymongo.MongoClient(host=host, port=port)
        db = client[db_name]
        self.post = db[doc_name]  # the collection to write to

    @classmethod
    def from_crawler(cls, crawler):
        # scrapy.conf is deprecated; read settings through the crawler instead
        settings = crawler.settings
        return cls(
            settings['MONGODB_HOST'],
            settings['MONGODB_PORT'],
            settings['MONGODB_DBNAME'],
            settings['MONGODB_DOCNAME'],
        )

    def process_item(self, item, spider):
        movie = dict(item)
        self.post.insert_one(movie)  # insert() is deprecated in pymongo 3
        return item
6. Configure settings.py
DOWNLOADER_MIDDLEWARES = {
    'meiju100.middlewares.UserAgentmiddleware': 543,
}
ITEM_PIPELINES = {
    'meiju100.pipelines.Meiju100Pipeline': 301,
}
MONGODB_HOST = '127.0.0.1'
MONGODB_PORT = 27017
MONGODB_DBNAME = 'text2'
MONGODB_DOCNAME = 'meiju'
Adjust these values to your own setup.
7. Run it
scrapy crawl meiju
Finally, check the data from the command line or with a GUI tool such as Robo 3T.
Success!!!