(21) Web Scraping: Beautiful Soup in Practice

小蜗笔记

Category: Web Scraping in Practice

Article link: https://blog.csdn.net/qq_42830971/article/details/107442922


import requests
from fake_useragent import UserAgent
from time import sleep
from random import randint
from bs4 import BeautifulSoup

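def get_html(url):
    """Fetch a page and return its HTML text, or None if the request fails."""
    headers = {
        # Firefox User-Agent string supplied by fake_useragent
        'User-Agent': UserAgent().firefox
    }
    # Only plain-HTTP requests use this proxy; HTTPS URLs need an "https" key too.
    proxies = {
        "http": "http://35.236.158.232:8080"
    }
    # Pause 3-9 seconds before each request to avoid hammering the server
    sleep(randint(3, 9))
    try:
        html_response = requests.get(url, headers=headers, proxies=proxies, timeout=10)
    except requests.RequestException as e:
        print(e)
        return None
    html_response.encoding = 'utf-8'

    if html_response.status_code == 200:
        return html_response.text
    print(html_response.status_code)
    return None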

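def parse_html(html):
    """Extract the book title from a detail page; return None if it is missing."""
    if html is None:
        return None
    soup = BeautifulSoup(html, 'lxml')
    # The title is the <em> inside the <h1> of the book-info block
    name_tags = soup.select('div.book-info > h1 > em')
    if not name_tags:
        return None
    all_info = {
        'names': name_tags[0].text
    }
    return all_info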

def parse_get_url(index_url):
    """Collect the detail-page URLs linked from the ranking page."""
    html_r = get_html(url=index_url)
    if html_r is None:
        return []
    soup = BeautifulSoup(html_r, 'lxml')
    all_a = soup.select('div.name-box > a.name')
    # The code assumes the hrefs are protocol-relative ("//..."), so it prepends the scheme
    return ['https:{}'.format(a.attrs['href']) for a in all_a]

def main():
    index_url = 'https://www.qidian.com/rank'
    print(index_url)
    movie_url_list = parse_get_url(index_url=index_url)
    print(movie_url_list)
    for movie_url in movie_url_list:
        response = get_html(movie_url)
        # Skip pages that could not be downloaded
        if response is None:
            continue
        outcome = parse_html(response)
        print(outcome)

if __name__ == '__main__':
    main()
    print('Crawler finished')
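
The two CSS selectors used above can be tried offline, without the proxy or the live site. The following is a minimal sketch under stated assumptions: the HTML snippets, the `example.com` URLs, and the helper names `extract_links` and `extract_name` are hypothetical and only imitate the structure the selectors expect. It also swaps in the built-in `html.parser` instead of `lxml`, and uses `urljoin` rather than string formatting to resolve protocol-relative links.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup imitating the structure the selectors expect.
RANK_HTML = """
<div class="name-box"><a class="name" href="//book.example.com/info/1">Book One</a></div>
<div class="name-box"><a class="name" href="//book.example.com/info/2">Book Two</a></div>
"""
DETAIL_HTML = '<div class="book-info"><h1><em>Book One</em><span>author</span></h1></div>'


def extract_links(html, base_url):
    """Return absolute URLs for every a.name directly inside div.name-box."""
    soup = BeautifulSoup(html, "html.parser")
    # urljoin resolves "//host/path" against the scheme of base_url
    return [urljoin(base_url, a["href"]) for a in soup.select("div.name-box > a.name")]


def extract_name(html):
    """Return the text of the title <em>, or None when it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    em = soup.select_one("div.book-info > h1 > em")
    return em.get_text(strip=True) if em else None


links = extract_links(RANK_HTML, "https://www.example.com/rank")
name = extract_name(DETAIL_HTML)
print(links)
print(name)
```

`select_one` returns `None` when nothing matches, which avoids the `IndexError` that indexing `select(...)[0]` raises on an empty result.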
