Python Web Scraper: Collecting Info on All 363 CCF-Recommended Conferences (Detailed Tutorial Included)

1. Results

There are 363 CCF-recommended conferences in total, and this post uses a scraper to collect all of their information into a single xlsx spreadsheet. In the screenshot, the sheet is sorted by submission deadline (DDL), which makes it easy to spot conferences that are still open for submission; you can also sort by subject area or CCF rank, whatever suits your needs.
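If you prefer to sort with pandas instead of inside Excel, here is a minimal sketch. It assumes the all_ccf_conference_info.xlsx file produced by the code in the next section, whose DDL column holds YYYY-MM-DD strings:

import pandas as pd

# Load the spreadsheet produced by the scraper below.
df = pd.read_excel('all_ccf_conference_info.xlsx')

# Parse the YYYY-MM-DD strings into real dates so the sort is chronological;
# errors='coerce' turns any malformed entry into NaT instead of crashing.
df['DDL'] = pd.to_datetime(df['DDL'], errors='coerce')
print(df.sort_values('DDL').head(10))   # the ten nearest deadlines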

2. The Code

Running the code produces a spreadsheet named all_ccf_conference_info.xlsx holding all of the conference information. Note that besides requests, lxml, and pandas, writing .xlsx files also requires an Excel engine such as openpyxl.

import requests
from lxml import etree
import pandas as pd

def get_info(url, header):
    # Fetch one conference's detail page and pull each field out by its XPath.
    response_c = requests.get(url=url, headers=header).text
    tree_c = etree.HTML(response_c)
    leaf_type = tree_c.xpath('/html/body/div[2]/div/div/h5')
    conf_type = leaf_type[0].text.replace("\t", "").replace("\r", "").replace("\n", "")
    leaf_short_name = tree_c.xpath('/html/body/div[2]/div/div/h3')
    short_name = leaf_short_name[0].text
    leaf_full_name = tree_c.xpath('/html/body/div[2]/div/div/h4')
    full_name = leaf_full_name[0].text
    leaf_text_ddl = tree_c.xpath('/html/body/div[2]/div/div/div[2]/h5[2]')
    text_ddl = leaf_text_ddl[0].text
    leaf_conference_time = tree_c.xpath('/html/body/div[2]/div/div/div[2]/h5[3]')
    conference_time = leaf_conference_time[0].text
    leaf_ccf_level = tree_c.xpath('/html/body/div[2]/div/div/div[2]/h5[5]')
    ccf_level = leaf_ccf_level[0].text
    leaf_place = tree_c.xpath('/html/body/div[2]/div/div/div[2]/h5[6]')
    place = leaf_place[0].text
    leaf_guanwang = tree_c.xpath('/html/body/div[2]/div/div/div[5]/input')
    guanwang = leaf_guanwang[0].attrib['onclick']
    # Slice each raw string down to the value itself: the deadline and the
    # conference date are the last 10 characters (YYYY-MM-DD), the CCF rank
    # is the last 2 characters, the venue follows a 6-character label, and
    # the official website URL sits inside the onclick handler's quotes.
    return conf_type, short_name, full_name, text_ddl[-10:], ccf_level[-2:], conference_time[-10:], place[6:], guanwang[9:-2]

def paccf(url, info_dic):
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:123.0) Gecko/20100101 Firefox/123.0'}
    response = requests.get(url=url, headers=header).text
    tree = etree.HTML(response)
    for table in range(3):
        # The three tables on each page hold the CCF-A, CCF-B and CCF-C conferences.
        father_div = f'/html/body/div[2]/div[2]/div[{table + 1}]'
        # XPath's count() returns a float: how many child divs the table has.
        count_of_child_divs = tree.xpath('count(' + father_div + '/div)')
        number_of_child_divs = int(count_of_child_divs)
        for j in range(1, number_of_child_divs):
            # Skip the first child div (the table header); each remaining div
            # is one conference row whose link ends in the conference id.
            leaf = tree.xpath(f'/html/body/div[2]/div[2]/div[{table + 1}]/div[{j + 1}]/div[2]/h4/a')
            C_num = leaf[0].attrib['href'][-10:]
            c = 'http://123.57.137.208/detail.jsp?url=' + C_num
            conf_type, short_name, full_name, text_ddl, ccf_level, conference_time, place, guanwang = get_info(c, header)
            info_dic['Abbreviation'].append(short_name)
            info_dic['Full_name'].append(full_name)
            info_dic['DDL'].append(text_ddl)
            info_dic['ccf_level'].append(ccf_level)
            info_dic['Conference_time'].append(conference_time)
            info_dic['Type'].append(conf_type)
            info_dic['Location'].append(place)
            info_dic['Official website'].append(guanwang)
            print(short_name, full_name, text_ddl, ccf_level, conference_time, conf_type, place, guanwang)
            print("")
    return info_dic

info_dic = {
    'Abbreviation': [],
    'Full_name': [],
    'DDL': [],
    'ccf_level': [],
    'Conference_time': [],
    'Type': [],
    'Location': [],
    'Official website': [],
}

# Ten subject-area pages: ccf-1.jsp through ccf-10.jsp.
for i in range(1, 11):
    url = f'http://123.57.137.208/ccf/ccf-{i}.jsp'
    info_dic = paccf(url, info_dic)
df = pd.DataFrame(info_dic)
df.to_excel('all_ccf_conference_info.xlsx', index=False)
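One practical caveat: the script fires several hundred sequential requests, so a single network hiccup aborts the whole run and discards everything scraped so far. A minimal hardening sketch follows; get_with_retry is a hypothetical helper, and the retry count, delay, and timeout are arbitrary choices, not part of the original script:

import time
import requests

def get_with_retry(url, header, retries=3, timeout=10):
    # Hypothetical helper: retry a flaky request a few times before giving up.
    for attempt in range(retries):
        try:
            return requests.get(url, headers=header, timeout=timeout).text
        except requests.RequestException:
            time.sleep(2)  # brief pause before the next attempt
    raise RuntimeError(f'failed to fetch {url} after {retries} attempts')

Swapping this in for the bare requests.get calls in get_info and paccf keeps one transient failure from wasting the whole run.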

3. How It Works


If you've read this far, you're clearly not just here to grab the spreadsheet and run; you came prepared.

First, define a dictionary to hold all of the fields; at the end it will be written out as a spreadsheet.


info_dic = {
    'Abbreviation': [],
    'Full_name': [],
    'DDL': [],
    'ccf_level': [],
    'Conference_time': [],
    'Type': [],
    'Location': [],
    'Official website': [],
}

Next, take a look at the Call4Papers homepage: Call4Papers - CCF Recommended List.

The site itself offers no sorting, and the ten subject areas are kept on separate pages, so hunting for the nearest open deadline by eye is tedious.

Notice that the URL for Computer Architecture is

http://123.57.137.208/ccf/ccf-1.jsp

the URL for Computer Networks is

http://123.57.137.208/ccf/ccf-2.jsp

and the URL for Cross-disciplinary/Emerging/Comprehensive is

http://123.57.137.208/ccf/ccf-10.jsp

Each area's link differs only in the number after ccf-, so all ten areas can be traversed like this:

for i in range(1, 11):
    url = f'http://123.57.137.208/ccf/ccf-{i}.jsp'
    info_dic = paccf(url, info_dic)

Next we implement the paccf function, which scrapes the conference list at each URL and stores it in info_dic. On each page, the conferences are split into three tables by rank: A, B, and C.

Press F12 and inspect the XPath of the A table (right-click the element, then Copy → XPath): it is /html/body/div[2]/div[2]/div[1]; the B table is /html/body/div[2]/div[2]/div[2], and the C table is /html/body/div[2]/div[2]/div[3].

So the code below iterates over the three tables. Each table's div contains a different number of conferences, so we use XPath's count() function to count the child div elements under it.

for table in range(3):
    father_div = f'/html/body/div[2]/div[2]/div[{table + 1}]'
    # count() returns a float giving the number of child divs.
    count_of_child_divs = tree.xpath('count(' + father_div + '/div)')
    number_of_child_divs = int(count_of_child_divs)

Next, open any conference's detail page and use the same method to find the XPaths for each field: conference name, submission deadline, conference date, CCF rank, location, official website, and subject area. Then use some string slicing to extract the value from each element's text, as illustrated after the snippet below.

for j in range(1, number_of_child_divs):
    # The first child div is the table header, so rows start at div[2].
    leaf = tree.xpath(f'/html/body/div[2]/div[2]/div[{table + 1}]/div[{j + 1}]/div[2]/h4/a')
    C_num = leaf[0].attrib['href'][-10:]
    c = 'http://123.57.137.208/detail.jsp?url=' + C_num
    conf_type, short_name, full_name, text_ddl, ccf_level, conference_time, place, guanwang = get_info(c, header)
    info_dic['Abbreviation'].append(short_name)
    info_dic['Full_name'].append(full_name)
    info_dic['DDL'].append(text_ddl)
    info_dic['ccf_level'].append(ccf_level)
    info_dic['Conference_time'].append(conference_time)
    info_dic['Type'].append(conf_type)
    info_dic['Location'].append(place)
    info_dic['Official website'].append(guanwang)
    print(short_name, full_name, text_ddl, ccf_level, conference_time, conf_type, place, guanwang)
    print("")

Summary

Searching on the website is slow; filtering in a spreadsheet is much quicker. The finished spreadsheet is attached here for everyone to use.

