Notes on scraping novels while first learning web crawlers

Latest recommended article published on 2024-04-28 19:45:27

VIP article, author: linushio

Article tags: web crawler, python

Article link: https://blog.csdn.net/qq_42941698/article/details/84294040

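I had just finished learning the basics of web scraping when I wrote this. The code predates my learning any crawler framework, so it may look somewhat verbose, but it contains many of my own ideas. Feel free to use it as a reference and to suggest improvements.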

I wrote it in a hurry at the time and did not apply object-oriented design, so the code uses only plain functions.

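import json
import os
import re
import time
import urllib.parse
import urllib.request
from pprint import pprint

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service


# The novel site's content is loaded by JavaScript, so Selenium is used to simulate a browser.
def gethtml(url):
    chrome_options = Options()
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--disable-gpu')

    # Local path to chromedriver; adjust it for your own machine.
    path = r'E:\pycharm\课件\chromedriver_win32\chromedriver.exe'

    # Selenium 4 takes the driver path through Service and the options through `options`;
    # the older executable_path/chrome_options keywords are no longer accepted.
    driver = webdriver.Chrome(service=Service(executable_path=path), options=chrome_options)
    try:
        driver.get(url)
        # Give the page's JavaScript time to render the content.
        time.sleep(7)

        # pprint(driver.page_source)

        return driver.page_source
    finally:
        # Always close the browser so headless Chrome processes do not pile up.
        driver.quit()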
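
The fixed `time.sleep(7)` in `gethtml` always waits the full seven seconds, even when the page is ready sooner, and may still be too short on a slow connection. One possible alternative is to poll until the content shows up. The sketch below is only an illustration of that idea: `wait_until` and its parameters are hypothetical names that do not appear in the original code, and Selenium's own `WebDriverWait` offers a similar mechanism.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_until(probe: Callable[[], Optional[T]],
               timeout: float = 7.0,
               interval: float = 0.25) -> T:
    """Call probe() repeatedly until it returns a truthy value.

    Hypothetical helper, not part of the original post. Raises
    TimeoutError if nothing truthy is returned within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = probe()
        if result:
            # The condition is met: hand back whatever the probe produced.
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        # Sleep briefly so the loop does not spin the CPU.
        time.sleep(interval)
```

Inside `gethtml`, the probe could be a small function that returns `driver.page_source` once some expected marker text is present and `None` otherwise. Which marker to look for depends on the target site and is not specified in the original.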

def set_request(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows
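
The rest of `set_request` is not shown here. As a rough sketch of what a helper like this commonly does with `urllib.request`, the example below attaches a `User-Agent` header to a request object. Everything in it is an assumption rather than the author's code: `build_request` is a hypothetical name, the User-Agent string is a placeholder and not the truncated one above, and percent-encoding the URL with `urllib.parse.quote` is only a guess at why `urllib.parse` was imported.

```python
import urllib.parse
import urllib.request

# Placeholder value, NOT the (truncated) User-Agent from the original post.
PLACEHOLDER_USER_AGENT = "Mozilla/5.0 (compatible; example-crawler/0.1)"


def build_request(url, user_agent=PLACEHOLDER_USER_AGENT):
    """Return a urllib Request carrying a browser-like User-Agent header.

    Hypothetical helper for illustration only.
    """
    # Percent-encode non-ASCII characters (e.g. Chinese path segments)
    # while leaving the URL delimiters untouched.
    safe_url = urllib.parse.quote(url, safe=":/?&=#%")
    return urllib.request.Request(safe_url, headers={"User-Agent": user_agent})
```

The returned object can then be passed to `urllib.request.urlopen`. Note that `urllib` normalizes header names when it stores them, so the header is read back with `request.get_header("User-agent")`.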
