Python Crawler (1): Scraping Comic Information from Dongman Manhua (Without Downloading the Comics)

Latest recommended article published on 2024-08-11 22:06:01

留小星

Category: Crawlers. Tags: crawler, python, xpath, redis

Article link: https://blog.csdn.net/jerry_liufeng/article/details/109155935

Python crawler: Dongman Manhua comic information (without downloading the comics)

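Goal: review web scraping and practice storing data in Redis.
Python libraries used: requests, bs4, redis, lxml
Redis storage layout: each comic is stored as a hash whose key (name) is the comic's title; the hash holds the key-value pairs that describe that comic.
The code was written in Jupyter Notebook on Linux. The Python code itself does not depend on the operating system, so it should port to other platforms with little effort.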
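To make the hash layout described above concrete, here is a minimal sketch of the data for one comic, written as a plain Python dictionary so it runs without a Redis server. The helper name `build_comic_hash` and the sample values are hypothetical; only the four field names come from the crawler code later in this post.

```python
def build_comic_hash(name, style, score, author):
    """Return (redis_key, field_mapping) for one comic.

    The key is the comic title; the mapping holds the fields of the hash.
    """
    fields = {
        "anime_name": name,
        "anime_style": style,
        "anime_score": score,
        # Fall back to a placeholder when no author was found (assumed wording)
        "anime_author": author or "Unknown author",
    }
    return name, fields


# Hypothetical sample values, for illustration only
key, fields = build_comic_hash("Example Comic", "Fantasy", "9.5", "")
print(key)
print(fields)
```

With redis-py 3.5 or later, a mapping like this can be written in a single call, `r.hset(key, mapping=fields)`, where `r` is a connected Redis client.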

1. Imports

import os
import sys

import requests
import bs4
import redis
from lxml import etree
# import urllib.request  # only needed if you download images

2. Connecting to Redis

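# Connect to Redis; decode_responses=True makes the client return str instead of bytes
pool = redis.ConnectionPool(host='localhost',port=6379,decode_responses=True)
r = redis.Redis(connection_pool=pool)
# Verify the connection (prints True on success)
print(r.ping())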
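The `decode_responses=True` option in the connection pool tells redis-py to return `str` values instead of raw `bytes`. As a rough illustration of what that saves you, the sketch below decodes a hash reply by hand. `decode_record` is a hypothetical helper, and the `raw` dictionary is a made-up stand-in for what `r.hgetall(name)` would return from a client created without that option.

```python
def decode_record(record, encoding="utf-8"):
    """Convert a {bytes: bytes} hash reply into {str: str}.

    Values that are already str pass through unchanged.
    """
    def to_text(value):
        return value.decode(encoding) if isinstance(value, bytes) else value

    return {to_text(k): to_text(v) for k, v in record.items()}


# Made-up reply, shaped like hgetall() output when decode_responses is False
raw = {b"anime_name": "示例漫画".encode("utf-8"), b"anime_score": b"9.5"}
print(decode_record(raw))
```

With `decode_responses=True`, as in the connection code above, this manual step is unnecessary, which matters here because comic titles are non-ASCII text.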

3. Crawler design

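def deal_url(url):
    '''Fetch a page and return its HTML as text.'''
    # A timeout keeps a stalled request from hanging the whole crawl
    response = requests.get(url, timeout=10)
    text = response.text
    return text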

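def parse_index_page(text):
    '''Collect the URLs of all comics listed on the index page.'''
    html = etree.HTML(text)
    anime_urls = html.xpath("//ul[@class='daily_card daily_limit_img_container']//a/@href")
    for i,url in enumerate(anime_urls):
        # The hrefs are expected to be protocol-relative ("//host/path");
        # strip only the leading slashes so a trailing slash is preserved
        url = 'http://'+url.lstrip('/')
        print(url)
        anime_urls[i] = url
    return anime_urls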

def parse_detail_page(text):
    '''
    Extract the details of one comic and store them in Redis.
    anime_name: comic title
    anime_style: comic genre
    anime_score: comic rating
    anime_author: comic author(s)
    '''
    html = etree.HTML(text)
    anime_name = ''.join(html.xpath("//div[@class='info']/h1/text()")).strip()
    anime_style = ''.join(html.xpath("//div[@class='info']/h2/text()")).strip()
    anime_score = ''.join(html.xpath("//em[@class='cnt']/text()")).strip()
    # The summary is extracted here but is not written to Redis below
    anime_summary = ''.join(html.xpath("//p[@class='summary']/text()")).strip()
    anime_author = html.xpath("//div[@class='info']//span[@class='author']/text()")
    if len(anime_author)==0:
        anime_author='Unknown author'
    else:
        anime_author = '/'.join(anime_author).strip()

    print(anime_name)
    print(anime_author)
    print('*'*50)
    # Store the fields in a hash keyed by the comic title
    r.hset(anime_name,'anime_name',anime_name)
    r.hset(anime_name,'anime_style',anime_style)
    r.hset(anime_name,'anime_score',anime_score)
    r.hset(anime_name,'anime_author',anime_author)
    

def spider():
    # Start from the Monday schedule page
    url = 'https://www.dongmanmanhua.cn/dailySchedule?weekday=MONDAY'
    text = deal_url(url)
    anime_urls = parse_index_page(text)
    # Fetch and parse each comic's detail page in turn
    for anime_url in anime_urls:
        anime_text = deal_url(anime_url)
        parse_detail_page(anime_text)
        

if __name__ == '__main__':
    spider()
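`parse_index_page` builds absolute URLs by string concatenation, which only works if every href has the same shape. As an alternative sketch, the standard library's `urllib.parse.urljoin` can resolve each href against the URL of the page it was found on, whether the href is protocol-relative, root-relative, or already absolute. The name `normalize_href` and the sample hrefs below are hypothetical and not taken from the site.

```python
from urllib.parse import urljoin

# The schedule page used by spider(); hrefs are resolved relative to it
BASE_URL = "https://www.dongmanmanhua.cn/dailySchedule?weekday=MONDAY"


def normalize_href(href, base=BASE_URL):
    """Resolve an href against the page it came from and return an absolute URL."""
    return urljoin(base, href.strip())


# Hypothetical hrefs, for illustration only
samples = [
    "//www.dongmanmanhua.cn/example/list?title_no=1",        # protocol-relative
    "/example/list?title_no=2",                               # root-relative
    "https://www.dongmanmanhua.cn/example/list?title_no=3",   # already absolute
]
for href in samples:
    print(normalize_href(href))
```

One design note: `urljoin` takes the scheme from the base URL, so protocol-relative links inherit `https` from the schedule page rather than a hard-coded `http://` prefix.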