How to crawl all the directories of a website with Python: scraping a site's blog index

Latest recommended article published 2024-07-21 22:35:51

weixin_39699070


Tags: how to crawl all the directories of a website with Python

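This article shows how to use the requests-html library together with youtube-dl to extract information from web pages. It first upgrades pip and urllib3 and installs requests-html. It then creates a class XxxIE that inherits from InfoExtractor, defines the regular expression that matches the URLs it handles, and implements the extraction method. In that method the page is downloaded, parsed with requests-html, and the JSON embedded in the page is loaded to pull out the key information. Finally, it shows the logic for downloading and parsing multiple pages.

Summary generated automatically by CSDN.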


Step 1: install requests-html

Upgrade pip:

    pip install --upgrade pip

Upgrade urllib3:

    sudo python3 -m pip install urllib3 --upgrade

Install requests-html:

    sudo python3 -m pip install requests-html

Step 1.1: install requests-html for the project

Edit setup.py and add:

    install_requires=[
        'requests-html',
    ],

Edit launch.json and add:

    "pythonPath": "/usr/bin/python3"

Install from the command line:

    sudo python3 -m setup install

Then use it in a Python file:

    from requests_html import HTMLSession

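Step 2: continue with youtube-dl

Create a new information extractor class:

    class XxxIE(InfoExtractor):

Define the regular expression that matches the URLs it handles:

    _VALID_URL = r'https?://(?:www\.|m\.)?xxx\.com.+posts?.+'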

The corresponding source code

On startup, youtube-dl first goes through extract_info in YoutubeDL.py:

    def extract_info(self, url, download=True, ie_key=None, extra_info={},
                     process=True, force_generic_extractor=False):
        # ...
        for ie in ies:
            if not ie.suitable(url):
                continue
            # ...

It then goes through suitable in common.py, under the extractor folder:

    @classmethod
    def suitable(cls, url):
        if '_VALID_URL_RE' not in cls.__dict__:
            cls._VALID_URL_RE = re.compile(cls._VALID_URL)
        # ...
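The dispatch described above can be imitated in a few lines. This is a simplified stand-in, not youtube-dl's actual code: `BaseExtractor`, `BlogExtractor`, `FallbackExtractor` and `pick_extractor` are hypothetical names, and the final `return` in `suitable` is an assumption about what the elided part does. The point it illustrates is that checking `cls.__dict__` gives each subclass its own cached compiled pattern rather than inheriting one from a parent.

```python
import re


class BaseExtractor:
    """Hypothetical stand-in for an extractor base class."""
    _VALID_URL = None

    @classmethod
    def suitable(cls, url):
        # Look in cls.__dict__ (not getattr) so that each subclass compiles
        # and caches its own pattern instead of reusing a parent's.
        if '_VALID_URL_RE' not in cls.__dict__:
            cls._VALID_URL_RE = re.compile(cls._VALID_URL)
        return cls._VALID_URL_RE.match(url) is not None


class BlogExtractor(BaseExtractor):
    _VALID_URL = r'https?://(?:www\.|m\.)?xxx\.com.+posts?.+'


class FallbackExtractor(BaseExtractor):
    # Accepts anything, so it should come last in the list.
    _VALID_URL = r'.*'


def pick_extractor(url, extractors):
    """Return the first extractor class whose pattern accepts the URL."""
    for ie in extractors:
        if not ie.suitable(url):
            continue
        return ie
    return None


if __name__ == '__main__':
    ies = [BlogExtractor, FallbackExtractor]
    print(pick_extractor('https://www.xxx.com/someone/posts/1', ies).__name__)
    print(pick_extractor('https://example.org/', ies).__name__)
```

Because the first suitable extractor wins, the order of the list matters: a catch-all pattern placed first would shadow every more specific extractor.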

2.1 The rest is left to class XxxIE(InfoExtractor)

First, import XxxIE in extractors.py, under the extractor folder. Then do the downloading and scraping inside XxxIE:

    import json

    from requests_html import HTML

    from .common import InfoExtractor


    class XxxIE(InfoExtractor):
        _GEO_COUNTRIES = ['CN']
        IE_NAME = 'xxx: blog'
        IE_DESC = 'wo qu'
        _VALID_URL = r'https?://(?:www\.|m\.)?xxx\.com.+posts?.+'
        _TEMPLATE_URL = '%s://www.xxx.com/%s/posts/%s/'
        # Not used by the code below.
        _LIST_VIDEO_RE = r'<a[^>]+?href="(?P<url>/%s/sound/(?P<id>\d+)/?)"[^>]+?title="(?P<title>[^>]+)">'

        def _real_extract(self, url):
            scheme = 'https' if url.startswith('https') else 'http'
            print("start ya yay ya")
            print("\n\n\n")
            # Page 1 is the URL itself; pages 2 to 19 add a ?page=N query.
            self.downloadX(url, 1)
            for index in range(2, 20):
                src = url + "?page=" + str(index)
                self.downloadX(src, index)
            print("\n\n\n")
            # Nothing is handed back to youtube-dl; the titles are only printed.
            return {}

        def downloadX(self, src, index):
            # Placeholder id; it only appears in the progress messages.
            audio_id = 123456
            webpage = self._download_webpage(
                src, audio_id,
                note='Download sound page for %s' % audio_id,
                errnote='Unable to get sound page')
            html = HTML(html=webpage)
            # The page embeds its state as JSON in the #js-initialData element.
            jsonElement = html.find('#js-initialData')
            jsonInfo = jsonElement[0].text
            jsonX = json.loads(jsonInfo)
            dic = jsonX['initialState']['entities']['articles']
            print("page: " + str(index) + " : ")
            for k, v in dic.items():
                t = v.get('title')
                print(t)
                print("\n")
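The parsing step can be tried without youtube-dl or a network connection. The sketch below swaps requests-html for the standard library's `html.parser` so that it runs with no third-party packages. It assumes the target element is a `<script>` tag with the id `js-initialData`. `ScriptTextParser`, `extract_titles`, `page_urls` and the sample page are hypothetical, written only to mirror the structure the extractor reads.

```python
import json
from html.parser import HTMLParser


class ScriptTextParser(HTMLParser):
    """Collect the text inside the <script> element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'script' and dict(attrs).get('id') == self.target_id:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == 'script':
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def extract_titles(webpage):
    """Return the article titles stored in the page's embedded JSON."""
    parser = ScriptTextParser('js-initialData')
    parser.feed(webpage)
    parser.close()
    data = json.loads(''.join(parser.chunks))
    articles = data['initialState']['entities']['articles']
    return [article.get('title') for article in articles.values()]


def page_urls(base_url, last_page):
    """Page 1 is the base URL; later pages append ?page=N."""
    return [base_url] + ['%s?page=%d' % (base_url, n)
                         for n in range(2, last_page + 1)]


# Made-up page with the same JSON layout the extractor expects.
SAMPLE_PAGE = '''<html><body>
<script id="js-initialData" type="text/json">{"initialState": {"entities": {"articles": {"1": {"title": "First post"}, "2": {"title": "Second post"}}}}}</script>
</body></html>'''

if __name__ == '__main__':
    print(extract_titles(SAMPLE_PAGE))
    print(page_urls('https://www.xxx.com/someone/posts', 3))
```

Keeping the parsing in a function that takes the page text as a string makes it easy to test against a saved page before running it inside the extractor.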

Code link
