Approach
1. Use the requests module to fetch the source of the whole listing page.
2. Parse it with XPath and find the links that lead from the main page to each child (detail) page.
3. Fetch each child page, parse it with XPath again, and extract the download link and the template name.
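Before the step-by-step walkthrough, here is a compressed sketch of that flow. It is only a preview under the assumptions developed below (the XPath expressions and the site URL come from the later steps; the variable names are purely illustrative), not the final script:
import requests
from lxml import etree

headers = {'User-Agent': 'Mozilla/5.0'}                                # UA spoofing (Step 3)
resp = requests.get("https://sc.chinaz.com/ppt/", headers=headers)     # Step 1: fetch the listing page
resp.encoding = 'UTF-8'
tree = etree.HTML(resp.text)                                           # Step 2: locate the child-page links
for div in tree.xpath('//div[@class="bot-div"]'):
    child_url = "https://sc.chinaz.com" + div.xpath('./a/@href')[0]
    name = div.xpath('./a/text()')[0]
    print(name, child_url)   # Step 3 (parsing each child page) is covered in Steps 6-7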
Step 1: import the packages
import requests
from lxml import etree
Step 2: specify the URL
url = "https://sc.chinaz.com/ppt/"
Step 3: spoof the User-Agent (the UA string here is truncated; copy your browser's full string)
headers = { 'User-Agent': 'Mozilla/5.0' }
Step 4: send the request
page_text = requests.get(url=url, headers=headers)
# manually set the encoding of the response data
page_text.encoding = 'UTF-8'
response = page_text.text
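If you are not sure the page really is UTF-8, requests can also guess the charset from the response body; a small alternative sketch using the same variables (requests exposes the detected charset as apparent_encoding):
page_text = requests.get(url=url, headers=headers)
page_text.encoding = page_text.apparent_encoding  # use the charset requests detects in the body
response = page_text.text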
Step 5: parse the main page and collect the links to the child pages
tree = etree.HTML(response)
div_list = tree.xpath('//div[@class="bot-div"]')
for li in div_list:
    a_list = "https://sc.chinaz.com" + li.xpath('./a/@href')[0]
    a_list_name = li.xpath('./a/text()')[0]
Step 6: request each child page (this code runs inside the Step 5 loop)
child_response = requests.get(url=a_list, headers=headers)
child_response.encoding = 'UTF-8'
child_re = child_response.text
Step 7: locate the download link and name on the child page (also inside the Step 5 loop)
tree = etree.HTML(child_re)
child_div = tree.xpath('//div[@class="download-url"]')
for li in child_div:
    download = li.xpath('./a/@href')[0]
    download_path = li.xpath('./a/text()')[0]
    print(a_list_name, download_path, download)
Fetching data from different listing pages
Pattern:
The link for page 1 is "https://sc.chinaz.com/ppt/"
The link for page 2 is "https://sc.chinaz.com/ppt/index_2.html"
The link for page 3 is "https://sc.chinaz.com/ppt/index_3.html"
So add a check when building the url:
base_url = "https://sc.chinaz.com/ppt/"
page = input('Enter the page number to scrape: ')
if page == '1':   # input() returns a string, so compare against '1' at this point
    url = base_url
if page != '1':
    url = f"https://sc.chinaz.com/ppt/index_{page}.html"
If it is the first page, url keeps the original value; otherwise url is reassigned to the per-page link. Once page has been converted to an integer (see below), the two if statements can also be shortened to a single conditional expression:
url = base_url if page == 1 else f"https://sc.chinaz.com/ppt/index_{page}.html"
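As a quick sanity check of the pattern, the expression above reproduces the three example links (a throwaway snippet, not part of the scraper):
base_url = "https://sc.chinaz.com/ppt/"
for page in (1, 2, 3):
    url = base_url if page == 1 else f"https://sc.chinaz.com/ppt/index_{page}.html"
    print(page, url)
# 1 https://sc.chinaz.com/ppt/
# 2 https://sc.chinaz.com/ppt/index_2.html
# 3 https://sc.chinaz.com/ppt/index_3.html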
Before the page == 1 check, you can also validate the input: if it is not a positive integer, prompt the user to re-enter it; otherwise convert the input to an integer:
if not page.isdigit() or int(page) < 1:
    print("That is not a valid positive page number, please try again.")
else:
    page = int(page)  # convert the input to an integer
Nest all of this inside a while loop and you can keep fetching data for whichever pages you want (a small standalone sketch of that input loop follows, and the complete script after it puts everything together).
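A minimal sketch of just that input loop on its own (the helper name ask_page is only for illustration; the complete script below inlines the same logic):
def ask_page():
    while True:
        page = input('Enter the page number to scrape: ')
        if page.isdigit() and int(page) >= 1:
            return int(page)  # valid: hand back the page number
        print("That is not a valid positive page number, please try again.")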
Complete code
import requests
from lxml import etree
if __name__ == '__main__':
    while True:
        base_url = "https://sc.chinaz.com/ppt/"
        page = input('Enter the page number to scrape: ')
        if not page.isdigit() or int(page) < 1:
            print("That is not a valid positive page number, please try again.")
            continue  # go back and prompt again
        else:
            page = int(page)  # convert the input to an integer
        if page == 1:
            url = base_url
        if page != 1:
            url = f"https://sc.chinaz.com/ppt/index_{page}.html"
        # url = base_url if page == 1 else f"https://sc.chinaz.com/ppt/index_{page}.html"
        headers = {
            'User-Agent': 'Mozilla/5.0 '  # truncated; paste your browser's full User-Agent string
        }
        page_text = requests.get(url=url, headers=headers)
        page_text.encoding = 'UTF-8'
        response = page_text.text
        tree = etree.HTML(response)
        # main page: collect the child-page links
        div_list = tree.xpath('//div[@class="bot-div"]')
        for li in div_list:
            a_list = "https://sc.chinaz.com" + li.xpath('./a/@href')[0]
            a_list_name = li.xpath('./a/text()')[0]
            # print(a_list_name, a_list)
            # child page: fetch and parse it
            child_response = requests.get(url=a_list, headers=headers)
            child_response.encoding = 'UTF-8'
            child_re = child_response.text
            tree = etree.HTML(child_re)
            child_div = tree.xpath('//div[@class="download-url"]')
            for li in child_div:
                download = li.xpath('./a/@href')[0]
                download_path = li.xpath('./a/text()')[0]
                print(a_list_name, download_path, download)