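Day1 - Get all a tags from one sub-page of the 协和 (PUMCH) home page
Introduction:
Sample site: 协和 (PUMCH), https://www.pumch.cn
What it does: given an index string, traverse the current page and its related pages; when a page contains the index value, return the corresponding address. This is used to search the data on the pages.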
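The goal stated above (return the address of every page that contains the index string) can be sketched in a few lines. This is only an illustration under assumptions: `find_pages_with_keyword`, `sample_pages`, and the sample HTML are hypothetical names and data, and the pages are assumed to be downloaded already, so no network access is needed.

```python
def find_pages_with_keyword(pages, keyword):
    """Return the URLs whose page text contains the keyword.

    `pages` is a hypothetical mapping of URL -> already downloaded HTML text.
    """
    return [url for url, html in pages.items() if keyword in html]


# Hypothetical sample data standing in for downloaded pages.
sample_pages = {
    "https://www.pumch.cn/learning.html": "<a href='/notice.html'>notice</a>",
    "https://www.pumch.cn/staff.html": "<p>staff portal</p>",
}

matches = find_pages_with_keyword(sample_pages, "notice")
print(matches)
```

A plain substring test is the simplest possible match; it also matches text inside tags and attributes, which may or may not be what a real search needs.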
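part1 Fetch the page with request headers
import requests
from bs4 import BeautifulSoup
# Replace the placeholder with your own browser's User-Agent string.
header = {
    "user-agent": "your own User-Agent string"
}
# Pass the dict with headers=; the second positional argument of requests.get is params.
res = requests.get('https://www.pumch.cn/index.html', headers=header)
html = res.text
soup = BeautifulSoup(html, 'html.parser')
# Each <li> under <ul class="links"> holds one home page link.
items = soup.find('ul', class_='links').find_all('li')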
part2 Get the home page links
# Collect the link URLs from the PUMCH home page
def get_home():
    home = []
    url = 'https://www.pumch.cn'
    for item in items:
        name = item.find('a').get('href').strip()
        # Prefix site-relative links with the site root; keep full URLs as they are.
        if name.find(url) < 0:
            home.append(url + name)
        else:
            home.append(name)
    return home
#print(get_home())
Output:
['https://www.pumch.cn/patient.html', 'https://www.pumch.cn/learning.html', 'https://www.pumch.cn/centenary.html', 'https://www.pumch.cn/staff.html', 'https://www.pumch.cn/en.html']
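As an alternative to inserting the site root by hand, the standard library's `urllib.parse.urljoin` can resolve a link against a base URL. The sketch below is not the method used in this post; `BASE_URL` and `to_absolute` are hypothetical names introduced for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.pumch.cn"  # site root, taken from the post


def to_absolute(href, base=BASE_URL):
    """Resolve href against base; absolute URLs pass through unchanged."""
    return urljoin(base, href.strip())


examples = [
    to_absolute("/patient.html"),                 # site-relative link
    to_absolute("https://www.pumch.cn/en.html"),  # already absolute
    to_absolute("http://paper.pumch.cn/"),        # absolute, different host
]
print(examples)
```

The advantage over string insertion is that links on other hosts or with other schemes are left alone without an explicit check.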
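part3 Get all a tags from one of those pages
The a tags collected here are not all usable as they are; they still need further processing.
# get_home()[1] is https://www.pumch.cn/learning.html
res_learning = requests.get(get_home()[1], headers=header)
html_learning = res_learning.text
# Name the parser explicitly, as in part1.
soup1 = BeautifulSoup(html_learning, 'html.parser')
tags = soup1.find_all('a')
#print(tags)
#for tag in tags:
#    print(tag.get('href'))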
Representative part of the output:
/html/index.html?scene_id=51117035
/register.html
http://telemedicine.pumch.cn
http://paper.pumch.cn/
http://mjpumch.cbpt.cnki.net/WKC3/WebPublication
javascript:void(0);
javascript:;
#
None
......
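part4 Data processing
Processing rules:
1. Discard entries that contain javascript, are None, or contain #.
2. Keep entries that start with http:.
3. For entries that do not contain https://www.pumch.cn and match neither of the two conditions above, insert https://www.pumch.cn at the front.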
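The three rules can be written as one small function that handles a single href, which makes them easy to check against the sample values from part3 without fetching anything. This is a minimal sketch: `clean_href` and `BASE_URL` are hypothetical names, and treating every `javascript:` prefix and every `https://` link the same way as the listed cases is an assumption.

```python
BASE_URL = "https://www.pumch.cn"  # site root, taken from the post


def clean_href(href, base=BASE_URL):
    """Apply the three rules to one href; return None when it is dropped."""
    # Rule 1: drop missing hrefs, javascript pseudo-links and '#' entries.
    if href is None:
        return None
    href = href.strip()
    if not href or href.startswith("javascript:") or "#" in href:
        return None
    # Rule 2: keep links that already carry a scheme (assumed to include https://).
    if href.startswith(("http://", "https://")):
        return href
    # Rule 3: prefix everything else with the site root.
    return base + href


samples = ["/register.html", "http://paper.pumch.cn/", "javascript:;", "#", None]
cleaned = [u for u in map(clean_href, samples) if u is not None]
print(cleaned)
```

Returning None for dropped entries keeps the rule logic separate from the loop that collects the results.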
## Search by keyword
def get_learn():
    home = []
    url = 'https://www.pumch.cn'
    for tag in tags:
        learning_a = tag.get('href')
        # Skip a tags that have no href (None)
        if learning_a is None:
            continue
        # Skip javascript pseudo-links and anything containing '#'
        if ('javascript:;' in learning_a) or ('javascript:void(0);' in learning_a) or ('#' in learning_a):
            continue
        # Prefix entries that lack both the site root and an http:// scheme
        if (url not in learning_a) and ('http://' not in learning_a):
            full_url = url + learning_a
            print(full_url)
            home.append(full_url)
        else:
            home.append(learning_a)
    return home
#print(tag.get('href'))
print(get_learn())
Output:
https://www.pumch.cn/register.html
https://www.pumch.cn/visitinfo.html
https://www.pumch.cn/reportquery.html
https://www.pumch.cn/centenary.html
https://www.pumch.cn/learning.html
https://www.pumch.cn/research/gudie.html
https://www.pumch.cn/research/gudie.html
https://www.pumch.cn/single/21556.html
https://www.pumch.cn/trend.html
https://www.pumch.cn/notice.html
......
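The listing above contains https://www.pumch.cn/research/gudie.html twice, because every matching a tag is appended. If each address should appear only once, one possible follow-up step is an order-preserving de-duplication. `dedupe_keep_order` is a hypothetical helper, not part of the original code, and `links` is a short excerpt of the output above.

```python
def dedupe_keep_order(urls):
    """Remove repeated URLs while keeping the position of the first occurrence."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(urls))


links = [
    "https://www.pumch.cn/learning.html",
    "https://www.pumch.cn/research/gudie.html",
    "https://www.pumch.cn/research/gudie.html",
    "https://www.pumch.cn/single/21556.html",
]
unique_links = dedupe_keep_order(links)
print(unique_links)
```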