Some beginners practicing web scraping are not yet fluent with XPath, so while debugging their selectors they end up re-requesting the same page over and over; too many requests in a short time can get your IP blocked, and that gets ugly fast!
To avoid that, here is a simple approach:
1. Request the target site once.
2. Save the returned HTML structure to a local file.
3. Read the local file and practice locating elements against it, comparing with the live page.
import requests
import lxml.etree
from pypinyin import pinyin, Style
from fake_useragent import UserAgent

print('This script downloads by page count; each page holds 40 songs')
print('----------------------------------------')
print('Steps: 1. enter the keyword to search for. 2. enter the number of pages to download')
print('--------------------------------------------')
print(' ')
muiss_name = input('Enter the search keyword: ')
page_in = int(input('Enter the number of pages to download: '))

def user_muiss_name(user_name, page):
    # pinyin() returns nested lists, e.g. [['ni'], ['hao']]
    muiss_pinyin = pinyin(user_name, style=Style.NORMAL)
    muiss_json = ''.join([''.join(p) for p in muiss_pinyin])
    url_list = []
    for i in range(1, page + 1):
        url = f'你的网址'  # "your URL" -- placeholder left as in the original post
        url_list.append(url)
    return url_list

def user_html_url(user_muiss):
    for i in user_muiss:
        ua = UserAgent()
        headers = {'User-Agent': ua.random}
        response = requests.get(i, headers=headers)
        html_parser = lxml.etree.HTMLParser()
        html = lxml.etree.fromstring(response.text, parser=html_parser)
        with open('html.text', 'w', encoding='utf-8') as e:
            tree = lxml.etree.ElementTree(html)
            e.write(lxml.etree.tostring(tree, pretty_print=True, method="html", encoding="utf-8").decode())

user_muiss_url = user_muiss_name(user_name=muiss_name, page=page_in)
user_html = user_html_url(user_muiss=user_muiss_url)
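The comment in the function above notes that pinyin() returns nested lists. A quick standalone check of the flattening step (using a literal of that shape in place of a real pypinyin call, so it runs without the library installed):

```python
# Shape that pypinyin's pinyin('你好', style=Style.NORMAL) returns:
muiss_pinyin = [['ni'], ['hao']]

# The double join flattens the nested lists into one string:
muiss_json = ''.join([''.join(p) for p in muiss_pinyin])
print(muiss_json)  # nihao
```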
*******************************************************************
with open('html.text', 'r', encoding='utf-8') as e:
    content = e.read()
html_parser = lxml.etree.HTMLParser()
html_tree = lxml.etree.fromstring(content, parser=html_parser)
**************************************************************
Now you can use XPath on html_tree to locate and extract data. Since the data lives in a local file, you can take your time.
Because the tree is built from the local file, you can rehearse the same lookup as many times as you like; once you have found the data you need, carry the working XPath expressions over to the real scraper.
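The local practice loop looks like this. A minimal sketch using an inline HTML snippet in place of the saved html.text, with a hypothetical class name, just to show the shape of an XPath query:

```python
import lxml.etree

# Inline sample standing in for the contents of html.text
content = '<html><body><ul><li class="song">A</li><li class="song">B</li></ul></body></html>'
html_parser = lxml.etree.HTMLParser()
html_tree = lxml.etree.fromstring(content, parser=html_parser)

# Practice the selector locally until it returns what you expect
titles = html_tree.xpath('//li[@class="song"]/text()')
print(titles)  # ['A', 'B']
```

Since no network request is made, a wrong selector costs nothing; you just tweak the expression and run again.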