招标网站信息爬取

Soonyang Zhang

已于 2022-11-30 10:01:45 修改

阅读量4.1k

点赞数 3

分类专栏：代码分享日常记录 tool 文章标签： beautifulsoup github http

于 2022-11-30 09:47:43 首次发布

本文链接：https://blog.csdn.net/u010643777/article/details/128108431

版权

代码分享同时被 3 个专栏收录

17 篇文章

订阅专栏

tool

8 篇文章

订阅专栏

日常记录

3 篇文章

订阅专栏

目标网站

某采购与招标网
代码链接code-repo

准备工作

参考博客[1]，使用谷歌浏览器的开发者工具，提取http的表单信息。
在这里插入图片描述
http post 中的表单信息，需要含有_qt信息。网站使用_qt做反爬虫措施。_qt由服务器返回，在不同的会话中，值是变化的。如果缺少_qt的信息，post的返回状态码是403。
在会话建立后，当客户端发送http get信息后，返回的页面中含有_qt的信息。主页另存为html，用文本编辑工具打开，可以看到_qt。

$.ajax({
			type : "POST",
			url : url,
			cache : false,
			processData : true,
			data : formData+ /*TZYz*/"&"//"/*"
				///'"'*/
			/*0!caigou8*/+'_'/*JTZY*/+'q'/*ZYzNh5*/+'t'/*zNhFm076*/+'='/**/
				+//"/*"
				//*/
				/*NhFmYiZj5"*/'TNjJTZYzNhFmYi'+/*"*'mYi5*///"
				/*JTZY*/'ZjN3IjYlFTOzgzNlFTZ2ImNxUWNwQ'///*ANjJTZYzNhFJyu*/
			,success : function(responseData) {
				$("#searchResult").html(responseData);
				if ($("#totalRecordNum").val() == 0) {
					var msg="<font color='red' size='3em'>查无结果！</font>";
					$("#searchResult").html(msg);
				} else {
					$("#searchResult").html(responseData);
				}
				
				$('body').loadingmask('close');
			}
		});

可以看到javascript中添加了一些注释，做了混淆。本项目中提取_qt值的代码，通过字符串"0!caigou8"定位第一行，"success"定位最后一行。

def _parser_qt(text):
    f= io.StringIO(text)
    content=[]
    locate_caigou8=False
    while True:
        line=f.readline()
        if not line:
            break
        else:
            if locate_caigou8 is False and "0!caigou8" in line:
                locate_caigou8=True
            if locate_caigou8:
                if "success" in line:
                    break
                t=line.strip()
                t=_remove_annotation(t)
                if len(t)>0:
                    content.append(t)
    length=len(content)
    qt=_strip_token(content[2],"'+")+_strip_token(content[3],"'+")
    return qt

获取网页索引信息

http post相关代码。如果想要请求其他省份的数据，请修form_data信息。

def request_page_index(session,cookie,qt,post_headers,page_index,retry=1,info_dir=None):
    success=False
    per_page_size=20
    province='YN'
    #https://zhuanlan.zhihu.com/p/31856224
    province_ch='云南'
    form_data={
        'page.currentPage':str(page_index),
        'page.perPageSize':str(per_page_size),
        'noticeBean.sourceCH':province_ch,
        'noticeBean.source':province,
        'noticeBean.title':'',
        'noticeBean.startDate':'',
        'noticeBean.endDate':'',
        '_qt':qt
    }
    post_headers['cookie']=cookie
    #https://blog.csdn.net/weixin_51111267/article/details/124616848
    post_content=post_emit(session,post_url,post_headers,form_data,retry)
    if post_content is None:
        logging.warn("fail to download notice page {}".format(page_index))
        return success
    success=True
    if info_dir:
        pathname=info_dir+province+"_"+str(page_index)+".txt"
        save_notice_page(pathname,post_content)
    return success

客户端发送http post消息后，服务器返回招标索引信息。对应的结果在浏览器中的展示如图。
在这里插入图片描述
save_notice_page提取返回网页中包含的项目名称和id信息。处理获的结果可以在page-index文件夹中看到。

def save_notice_page(pathname,html_content):
    title2id=hp.get_onclick_info(html_content)
    fm.save_title2id(title2id,pathname)

html_parser.py

def get_onclick_info(html_content):
   soup = BeautifulSoup(html_content, 'lxml')
   tr_all=soup.find_all("tr")
   length=len(tr_all)
   dic={}
   for i in range(length):
   #{'class': [], 'onmousemove': 'cursorOver(this)', 'onmouseout': 'cursorOut(this)', 'onclick': "selectResult('901888')"}
   # type(attrs) is  dict
       if tr_all[i].attrs and 'onclick' in tr_all[i].attrs:
           id=_parser_id(tr_all[i].attrs['onclick'])
           if len(id)>0:
               ahref=tr_all[i].find("a")
               if 'title' in ahref.attrs:
                   dic.update({ahref.attrs['title']:id})
               elif len(ahref.contents)>0:
                   dic.update({ahref.contents[0]:id})
   return dic

请求具体界面

根据page_index中保存的id信息，构造page_url，并下载网页内容，结构保存在resource文件夹中。

def download_notice_batch(index_list,resource_dir,suffix=".txt"):
   requests.packages.urllib3.util.ssl_.DEFAULT_CIPHERS += ':HIGH:!DH:!aNULL'
   for i in range(len(index_list)):
       user_agent=random.choice(ua_list)
       login_headers['User-Agent']=user_agent
       session=requests.session()
       path_name=index_list[i]
       id2title=fm.extrac_id2title(path_name)
       pos2=path_name.rfind('.')
       pos1=path_name.rfind('/')
       name=path_name[pos1+1:pos2]
       new_path=resource_dir+name+'/'
       fm.mkdir(new_path)
       for item in id2title.items():
           id=item[0]
           page_url=url_base+id+""
           dest_name=new_path+id+suffix
           if open_page(session,page_url,login_headers,dest_name) is False:
               logging.warn("download failure {} {}".format(name,dest_name))
       session.close()

批量处理数据

按照甲方要求，提取数据。如需提取其他信息，请对parser_notice_content函数进行定制。

def process_page_batch(index_list,resource_dir,csv_name,suffix=".txt"):
   csv_f=open(csv_name,"w")
   delimiter='|'
   csv_f.write("index|page_id|title|no_tax|with_tax|duration|company|people|tel|\n")