Web Scraping: PUMCH Data

Latest recommended article published 2024-10-03 09:02:12

侯代翔

Category: Web Scraping. Tags: python, html, web scraping

Article link: https://blog.csdn.net/weixin_44564140/article/details/119769290

Day 1: Getting all <a> tags from one sub-page linked on the PUMCH home page

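Introduction:
Sample site: PUMCH (https://www.pumch.cn)
Task: given a search string, traverse the current page and the pages it links to; whenever a page contains the search string, return that page's address. This is used to search the data on those pages.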
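The search task described above can be sketched offline as follows. This is only an illustration, not the author's code: the function name `find_pages_containing` and the in-memory `pages` dictionary are hypothetical stand-ins for pages that a real crawler would download.

```python
def find_pages_containing(keyword, pages):
    """Return the addresses of all pages whose text contains keyword.

    pages is a hypothetical dict mapping URL -> page text; in a real
    crawler the text would come from an HTTP response.
    """
    return [url for url, text in pages.items() if keyword in text]


# Made-up sample data for illustration only
pages = {
    'https://www.pumch.cn/learning.html': 'research and teaching',
    'https://www.pumch.cn/patient.html': 'registration and visits',
}
matches = find_pages_containing('research', pages)
print(matches)
```

The rest of this post covers the first half of that task: collecting the addresses of the pages to search.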

part1 Fetching data with request headers

import requests
from bs4 import BeautifulSoup

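header = {
    # Replace with the User-Agent string of your own browser
    "user-agent": "your own User-Agent string"
}

# Pass the headers by keyword: the second positional argument of requests.get is params
res = requests.get('https://www.pumch.cn/index.html', headers=header)
html = res.text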

soup = BeautifulSoup(html,'html.parser')
items=soup.find('ul',class_='links').find_all('li')
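To check which headers a request will carry without contacting the server, one option is to build the request object first. The sketch below swaps in the standard library's `urllib.request` in place of `requests`, and the User-Agent value is a placeholder. With `requests`, the equivalent point is that `requests.get` takes `params` as its second positional parameter, so headers are passed by keyword as `headers=...`.

```python
from urllib.request import Request

# Placeholder User-Agent value; replace it with your own browser's string
headers = {'User-Agent': 'Mozilla/5.0 (placeholder)'}

# Building the Request object does not contact the server
req = Request('https://www.pumch.cn/index.html', headers=headers)

# urllib stores header names capitalized, hence 'User-agent'
user_agent = req.get_header('User-agent')
print(user_agent)
```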

part2 Getting the home page links

# Get the links listed on the PUMCH home page
def get_home():
    url = 'https://www.pumch.cn'
    home = []
    for item in items:
        name = item.find('a').get('href').strip()
        # Prepend the site root to links that do not already contain it
        if url not in name:
            name = url + name
        home.append(name)
    return home
#print(get_home())

Output:

['https://www.pumch.cn/patient.html', 'https://www.pumch.cn/learning.html', 'https://www.pumch.cn/centenary.html', 'https://www.pumch.cn/staff.html', 'https://www.pumch.cn/en.html']

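part3 Getting all <a> tags from one of the pages

The href values of the <a> tags collected here are not all usable as they are; they still need to be processed.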

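# Fetch the second home page link (learning.html), again passing the headers by keyword
res_learning = requests.get(get_home()[1], headers=header)

html_learning = res_learning.text
# Name the parser explicitly so that bs4 does not have to guess one
soup1 = BeautifulSoup(html_learning, 'html.parser')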

tags=soup1.find_all('a')
#print(tags)
#for tag in tags:
#    print(tag.get('href'))

Representative part of the output:

/html/index.html?scene_id=51117035
/register.html
http://telemedicine.pumch.cn
http://paper.pumch.cn/
http://mjpumch.cbpt.cnki.net/WKC3/WebPublication
javascript:void(0);
javascript:;
#
None
......
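The kinds of href value visible in this output can be told apart with a few string checks. The sketch below is illustrative only: `classify_href` and its category names are hypothetical, and the sample list repeats the values shown above.

```python
from collections import Counter


def classify_href(href):
    """Assign an href value to one of the kinds seen in the output above."""
    if href is None:
        return 'missing'
    if href.startswith('javascript:'):
        return 'javascript'
    if href.startswith('#'):
        return 'fragment'
    if href.startswith(('http://', 'https://')):
        return 'absolute'
    return 'relative'


# Sample values copied from the output above
hrefs = [
    '/html/index.html?scene_id=51117035',
    '/register.html',
    'http://telemedicine.pumch.cn',
    'http://paper.pumch.cn/',
    'http://mjpumch.cbpt.cnki.net/WKC3/WebPublication',
    'javascript:void(0);',
    'javascript:;',
    '#',
    None,
]
counts = Counter(classify_href(h) for h in hrefs)
print(counts)
```

Only the 'relative' values need the site root prepended, the 'absolute' ones can be kept unchanged, and the remaining kinds are not page addresses at all.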

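part4 Processing the data

Processing rules:
1. Discard entries that contain javascript, entries that are None, and entries that contain #.
2. Keep entries that start with http: unchanged.
3. For entries that do not contain https://www.pumch.cn and match neither of the two rules above, insert https://www.pumch.cn at the front.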

## Filter the links by keyword
def get_learn():
    url = 'https://www.pumch.cn'
    home = []
    for tag in tags:
        learning_a = tag.get('href')
        # Rule 1: skip missing hrefs, javascript pseudo-links and anything containing '#'
        if learning_a is None:
            continue
        if 'javascript:;' in learning_a or 'javascript:void(0);' in learning_a or '#' in learning_a:
            continue
        if url not in learning_a and 'http://' not in learning_a:
            # Rule 3: prepend the site root to links that lack a URL prefix
            full_url = url + learning_a
            print(full_url)
            home.append(full_url)
        else:
            # Rule 2: keep links that already carry a URL prefix
            home.append(learning_a)
    return home
print(get_learn())

Output:

https://www.pumch.cn/register.html
https://www.pumch.cn/visitinfo.html
https://www.pumch.cn/reportquery.html
https://www.pumch.cn/centenary.html
https://www.pumch.cn/learning.html
https://www.pumch.cn/research/gudie.html
https://www.pumch.cn/research/gudie.html
https://www.pumch.cn/single/21556.html
https://www.pumch.cn/trend.html
https://www.pumch.cn/notice.html
......
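The output above contains a repeated entry (research/gudie.html appears twice). As a possible refinement, not the author's method, the sketch below resolves links with the standard library's `urllib.parse.urljoin` instead of string insertion and drops duplicates while preserving order. The function name `normalize_links` and the sample list are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.pumch.cn'


def normalize_links(hrefs, base=BASE_URL):
    """Drop unusable hrefs, make the rest absolute, and remove duplicates."""
    seen = set()
    result = []
    for href in hrefs:
        # Rule 1: skip missing values, javascript pseudo-links and '#' anchors
        if href is None or href.startswith('javascript:') or '#' in href:
            continue
        # urljoin leaves absolute URLs unchanged and resolves relative ones
        full_url = urljoin(base, href.strip())
        if full_url not in seen:
            seen.add(full_url)
            result.append(full_url)
    return result


# Made-up sample modeled on the href values shown earlier
sample = ['/register.html', 'http://paper.pumch.cn/', 'javascript:;', '#',
          None, '/research/gudie.html', '/research/gudie.html']
links = normalize_links(sample)
print(links)
```

Because `urljoin` recognizes any scheme, links that start with https:// on other hosts are also kept unchanged, which a plain check for 'http://' would not do.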