Applies to: sites where the URL in the address bar does not change as you flip through pages.
Open the Network panel with F12 and filter by XHR; the page data turns out to come from a GET request.
Copying the Request URL and opening it returns JSON data.
So we can write the following code:
import requests
import json

ls = []
url = 'http://www.cistc.gov.cn/handlers/cistcMenuInfoList.ashx?columnid=221&isall=1&keyword=&year=&pagenum=1'
text = requests.get(url).text
js = json.loads(text)
for i in js['infolist']:
    wanzhengURL = 'http://www.cistc.gov.cn/' + i['InfoUrl']
    ls.append(wanzhengURL)
print(ls)
Running this prints a list of the full URL of every article on that page.
Next, optimize the code into a script that loops over multiple pages:
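Incidentally, the URL-building step can be exercised without touching the network. A minimal sketch, assuming the handler returns an infolist array with InfoUrl fields as described above (the sample JSON values here are made up for illustration):

```python
import json
from urllib.parse import urljoin

# Made-up sample of the JSON shape the handler returns; field names
# (infolist, InfoUrl) are taken from the article, values are fabricated.
sample = '{"infolist": [{"InfoTitle": "t1", "InfoUrl": "handlers/cistcInfo.ashx?infoid=1"}]}'

js = json.loads(sample)
# urljoin handles the slash between site root and relative path cleanly.
urls = [urljoin('http://www.cistc.gov.cn/', i['InfoUrl']) for i in js['infolist']]
print(urls)
```

Using urljoin instead of plain string concatenation avoids accidental double or missing slashes.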
import requests
import json

ls = []
for page in range(1, 339):
    url = 'http://www.cistc.gov.cn/handlers/cistcMenuInfoList.ashx?columnid=221&isall=1&keyword=&year=&pagenum=' + str(page)
    r = requests.get(url)
    js = json.loads(r.text)
    # print(js)
    for i in js['infolist']:
        wanzhengURL = 'http://www.cistc.gov.cn/' + i['InfoUrl']
        ls.append(wanzhengURL)
print(ls)
Running this yields a list of the full URL of every article across all pages.
Create a new .py file and write the following code:
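Rather than concatenating str(page) onto the end of the URL, the query string can also be built with the standard library's urlencode. A small sketch of the same URL construction (the parameter names are exactly those from the URL above):

```python
from urllib.parse import urlencode

base = 'http://www.cistc.gov.cn/handlers/cistcMenuInfoList.ashx'

def page_url(page):
    # Build the query string from a dict instead of string concatenation;
    # urlencode takes care of the '&' separators and '=' signs.
    params = {'columnid': 221, 'isall': 1, 'keyword': '', 'year': '', 'pagenum': page}
    return base + '?' + urlencode(params)

print(page_url(3))
```

This keeps each parameter visible and editable on its own, which is handy if you later want to change columnid or add a keyword.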
import time
import requests
import re
from bs4 import BeautifulSoup
import json

def get某一页面的HTML源码(url):
    # Fetch and return the raw source (here: JSON text) of one URL.
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0'}
    try:
        r = requests.get(url, headers=headers, timeout=8)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        # time.sleep(0.3)
        return r.text
    except requests.RequestException:
        print('Failed to fetch {}'.format(url))

def get对应页面内的文本并写入本地(i):
    # Parse one article's JSON and append its plain text to a local file.
    r = get某一页面的HTML源码(i)
    js = json.loads(r)
    bt = js['InfoTitle']
    soup = BeautifulSoup(js['InfoContent'], 'html.parser')
    nr = soup.text
    try:
        with open(r'D:\Tech-texts\techtest20-8-06\中国国际科技合作网-信息.txt', 'a+', encoding='utf-8') as f:
            # f.write('\n\n\n\n-------------------------------------------------\n\n\n\n')
            f.write('\n')
            # f.write(bt)
            # f.write('\n')
            f.write(nr)
            f.write('\n')
    except OSError:
        print('Failed to write to file')

ls = ['http://www.cistc.gov.cn/handlers/cistcInfo.ashx?infoid=100649&contentLenth=&column=221', 'http://www.cistc.gov.cn/handlers/cistcInfo.ashx?infoid=100639&contentLenth=&column=221']
# There are many links, so only a few are shown here.
num = 1
for i in ls:
    try:
        print('Item {} of {}'.format(num, len(ls)))
        get对应页面内的文本并写入本地(i)
    except Exception:
        print('Your shoddy code crashed again')
    num += 1
The output looks like this:
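If you would rather not depend on BeautifulSoup just to strip the tags out of InfoContent, the standard library's html.parser can do the same job. A minimal sketch (the sample HTML fed in here is made up; a real InfoContent fragment would go in its place):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the text content of an HTML fragment, dropping all tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        self.parts.append(data)

    def text(self):
        return ''.join(self.parts)

extractor = TextExtractor()
extractor.feed('<p>Hello <b>world</b></p>')  # made-up stand-in for InfoContent
print(extractor.text())
```

For simple tag stripping this avoids a third-party dependency, though BeautifulSoup remains more forgiving of malformed markup.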