Scraping information on companies resident at HKSTP

Latest recommended article published on 2024-09-13 22:22:02

suvieu

Categories: #Crawler, PYTHON

Article link: https://blog.csdn.net/suvieu/article/details/101108312

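Requirements:

HKSTP company directory: open the directory, visit the page of every company resident at HKSTP, and collect its details: company name, email, website, telephone, contact person, and company introduction.

Company directory page:
[screenshot]

Company information page:
[screenshot]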

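Step 1: Import the required modules

Note that when I first wrote from lxml import etree directly, I got an error.
From what I found online, etree supposedly can no longer be imported directly starting with Python 3.5. I have not verified this, and since from lxml import etree is the import used in lxml's own documentation, the error may equally come from the IDE or the installation.
The workaround I used is to import the html module first and reach etree through it.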

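import re

import pandas as pd
import requests
from lxml import html

# Browser-like User-Agent, the same header used in the complete code at the end
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Mobile Safari/537.36'}

url = 'https://www.hkstp.org/en/reach-us/company-directory/?i=&t=All&c=-1&s=-1&s=-1&k=&page=1'
response = requests.get(url, headers=headers).text
etree = html.etree  # reach etree through the html module
htmldiv = etree.HTML(response)  # parse the page source into an element tree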
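
As a quick check of the import workaround, the sketch below compares the two ways of reaching etree and parses a tiny document. The HTML string is a made-up stand-in for the directory page, not real markup from the site.

```python
from lxml import etree, html

# lxml.html imports lxml.etree itself, so both names point at the same module object
same_module = html.etree is etree
print(same_module)

# Made-up snippet standing in for the directory page source
tree = etree.HTML('<div id="companyList"><ul><li><a href="/x">X</a></li></ul></div>')
hrefs = tree.xpath('//*[@id="companyList"]//a/@href')
print(hrefs)
```

If the direct import works in your environment, either spelling can be used interchangeably in the rest of the script.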

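Step 2: Parse the source of the company directory page

Inspect the directory page and find the links that lead to each company information page.

As the screenshot below shows, the link to a single company's page is the href attribute of the a tag at DIV[id="companyList"]/ul/li/div/div/a.

[screenshot]
Use XPath to extract the company links. These links are relative, so they still have to be joined with the site's domain.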

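# Collect the href of every company link on the directory page
link = htmldiv.xpath('//*[@id="companyList"]/ul/li/div/div/a/@href')
# The hrefs are relative, so prepend the domain
url2 = 'https://www.hkstp.org' + str(link[0])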
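
An alternative to string concatenation is urljoin from the standard library, which resolves a relative link against the URL of the page it was found on. This is only a sketch: the href value below is a hypothetical placeholder, not a real path from the site.

```python
from urllib.parse import urljoin

# URL of the directory page the link was found on
page_url = 'https://www.hkstp.org/en/reach-us/company-directory/?page=1'

# Hypothetical relative href, standing in for link[0] from the XPath query
href = '/en/example-company-page/'

# urljoin keeps the scheme and host of page_url and swaps in the new path
url2 = urljoin(page_url, href)
print(url2)
```

urljoin also leaves an href untouched when it is already an absolute URL, which plain concatenation would break.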

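Step 3: Parse the company information page

All company information sits under the h1 tag with class="content-sub-title", but here I ran into a couple of headaches:
1) We only want to scrape company name, email, website, telephone, contact person, and introduction, but Info-list also contains the company logo, address, products, and other details.
2) Even among the fields we want, only the company name is present on every page; the others are missing for some companies. On top of that, email and website are stored in tag attributes, while telephone and introduction are in the tag text.

So how do you filter all this with XPath? As a scraping beginner, I haven't figured that out yet!
[screenshot]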
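
One possible way to do this filtering with XPath is to select each field by its label and treat an empty result as a missing value. The sketch below runs against made-up markup: the h5 labels, the item divs, and the helper names first, field, and parse_company are all assumptions for illustration, so the expressions would need adjusting to the real page structure.

```python
from lxml import html

# Hypothetical markup; the real company page may be structured differently
SAMPLE = '''
<div class="info-list">
  <h1 class="content-sub-title">Example Co </h1>
  <div class="item"><h5>Tel</h5><p>1234 5678</p></div>
  <div class="item"><h5>Email</h5><p><a href="mailto:info@example.com">info@example.com</a></p></div>
</div>
'''


def first(nodes):
    """Return the first XPath result stripped, or None when nothing matched."""
    return nodes[0].strip() if nodes else None


def parse_company(page_source):
    tree = html.fromstring(page_source)

    def field(label, tail):
        # Find the heading whose text equals the label, then read from the <p> after it.
        # $label is an XPath variable that lxml fills from the keyword argument.
        query = '//h5[normalize-space()=$label]/following-sibling::p[1]' + tail
        return first(tree.xpath(query, label=label))

    mail = field('Email', '/a/@href')      # value lives in an attribute
    if mail and mail.startswith('mailto:'):
        mail = mail[len('mailto:'):]

    return {
        'name': first(tree.xpath('//h1[@class="content-sub-title"]/text()')),
        'tel_no': field('Tel', '/text()'),  # value lives in the tag text
        'mail': mail,
        'web': field('Website', '/a/@href'),
    }


record = parse_company(SAMPLE)
print(record)
```

Because every field is looked up by label, unwanted entries such as logo or address are never selected, and a company without a website simply gets None for that key.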

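Step 4: A detour via regular expressions

My workaround is to capture each field with its own regular expression and store the results in a separate list per field.
That gives six lists: company name, telephone, email, website, contact person, and introduction. Apart from the name list, which holds plain names, every element is a tuple (company name, field value); the telephone list, for example, looks like [(company 1, tel 1), (company 2, tel 2)]. The company name serves as the key for combining the six lists with merge at the end. If a company lacks a field, say it has no email, its name simply does not appear in the email list, and the merge fills that cell with an empty value (merge works much like VLOOKUP).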

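# response2 holds the HTML source of one company page;
# the six *_list variables are created in the complete code at the end.

# Company name: text of the h1 with class "content-sub-title"
name_pattern = re.compile(r'content-sub-title">(.*?)\s*?</h1>')
name = name_pattern.findall(response2)
name_list.extend(name)

# Telephone: (name, text of the first <p> after the "Tel" label)
tel_pattern = re.compile(r'content-sub-title">(.*?)\s*?</h1>[\s\S]*?Tel[\s\S]*?<p>(.*?)</p>')
tel = tel_pattern.findall(response2)
tel_list.extend(tel)

# Email: (name, address taken from the mailto: link)
email_pattern = re.compile(r'content-sub-title">(.*?)\s*?</h1>[\s\S]*?Email[\s\S]*?mailto:(.*?)"')
mail = email_pattern.findall(response2)
mail_list.extend(mail)

# Website: (name, href of the first link after the "Website" label)
web_pattern = re.compile(r'content-sub-title">(.*?)\s*?</h1>[\s\S]*?Website[\s\S]*?href="(.*?)"')
web = web_pattern.findall(response2)
web_list.extend(web)

# Contact person: (name, text of the first <p> after the "Contact Person" label)
contact_pattern = re.compile(r'content-sub-title">(.*?)\s*?</h1>[\s\S]*?Contact Person[\s\S]*?<p>(.*?)</p>')
contact = contact_pattern.findall(response2)
person_list.extend(contact)

# Introduction: (name, possibly multi-line text of the first <p> after "Introduction")
intro_pattern = re.compile(r'content-sub-title">(.*?)\s*?</h1>[\s\S]*?Introduction[\s\S]*?<p>([\s\S]*?)</p>')
intro = intro_pattern.findall(response2)
intro_list.extend(intro)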
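
A variation on the same regex idea is to build one dict per company instead of six parallel lists, storing None when a pattern finds nothing. The sketch below shows this for three of the fields; the sample HTML is made up, and PATTERNS and extract are hypothetical names rather than part of the original script.

```python
import re

# One pattern per field, each with a single capture group
PATTERNS = {
    'name': re.compile(r'content-sub-title">(.*?)\s*</h1>', re.S),
    'tel_no': re.compile(r'Tel[\s\S]*?<p>(.*?)</p>'),
    'mail': re.compile(r'mailto:(.*?)"'),
}


def extract(page_source):
    """Return one record per page; fields that are absent become None."""
    record = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(page_source)
        record[field] = match.group(1).strip() if match else None
    return record


# Made-up page fragment with a name and a telephone number but no email
sample = '<h1 class="content-sub-title">Example Co </h1><h5>Tel</h5><p>1234 5678</p>'
record = extract(sample)
print(record)
```

A list of such records can be passed straight to pd.DataFrame, which would make the merge step unnecessary because every company already occupies exactly one row.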

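Step 5: Combine everything with pandas

Here I hit another annoying problem: I have six lists, but merge seems to join only two tables at a time, so I ended up merging five times. (If anyone knows a simpler way, please tell me. I looked at concat but have not found a suitable solution yet.)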

name_df = pd.DataFrame(name_list, columns=['name'])
tel_df = pd.DataFrame(tel_list,columns=['name','tel_no'])
mail_df = pd.DataFrame(mail_list, columns=['name', 'mail'])
web_df = pd.DataFrame(web_list, columns=['name', 'web'])
person_df = pd.DataFrame(person_list, columns=['name', 'contact_person'])
intro_df = pd.DataFrame(intro_list,columns=['name','Introduction'])

result=pd.merge(name_df,tel_df,on='name',how='left')
result1=pd.merge(result,mail_df,on='name',how='left')
result2=pd.merge(result1,web_df,on='name',how='left')
result3=pd.merge(result2,person_df,on='name',how='left')
result4=pd.merge(result3,intro_df,on='name',how='left')
result4.to_excel('HKSTP.xlsx')
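
One way to avoid writing the merge five times is to fold the list of DataFrames with functools.reduce, which applies the same two-table merge repeatedly from left to right. A minimal sketch with made-up data and only three of the six tables:

```python
from functools import reduce

import pandas as pd

# Made-up data: B Ltd has no telephone, A Ltd has no email
name_df = pd.DataFrame(['A Ltd', 'B Ltd'], columns=['name'])
tel_df = pd.DataFrame([('A Ltd', '1111 1111')], columns=['name', 'tel_no'])
mail_df = pd.DataFrame([('B Ltd', 'b@example.com')], columns=['name', 'mail'])

frames = [name_df, tel_df, mail_df]

# reduce calls pd.merge(left, right) pairwise, carrying the running result forward
result = reduce(
    lambda left, right: pd.merge(left, right, on='name', how='left'),
    frames,
)
print(result)
```

The result is the same as chaining the merges by hand; adding a seventh table only means appending it to frames.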

And that's basically it!

The complete code:

import re
import time

import pandas as pd
import requests
from lxml import html

start = time.time()
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Mobile Safari/537.36'}
name_list =[]
tel_list = []
mail_list = []
web_list = []
person_list = []
intro_list = []
for n in range(1, 48):
    print('Scraping page {}'.format(n))
    print(time.strftime("%H:%M:%S", time.localtime(time.time())))
    url = 'https://www.hkstp.org/en/reach-us/company-directory/?i=&t=All&c=-1&s=-1&s=-1&k=&page={}'.format(n)
    response = requests.get(url, headers=headers).text
    etree = html.etree
    htmldiv = etree.HTML(response)
    # Iterate over the entries actually present, so a page with fewer than 20 companies cannot raise IndexError
    for item in htmldiv.xpath('//*[@id="companyList"]/ul/li'):
        link = item.xpath('./div/div/a/@href')
        if not link:
            continue
        url2 = 'https://www.hkstp.org' + str(link[0])
        # headers must be passed by keyword; the second positional argument of requests.get is params
        response2 = requests.get(url2, headers=headers).text

        name_pattern = re.compile(r'content-sub-title">(.*?)\s*?</h1>')
        name = name_pattern.findall(response2)
        name_list.extend(name)

        tel_pattern = re.compile(r'content-sub-title">(.*?)\s*?</h1>[\s\S]*?Tel[\s\S]*?<p>(.*?)</p>')
        tel = tel_pattern.findall(response2)
        tel_list.extend(tel)

        email_pattern = re.compile(r'content-sub-title">(.*?)\s*?</h1>[\s\S]*?Email[\s\S]*?mailto:(.*?)"')
        mail = email_pattern.findall(response2)
        mail_list.extend(mail)

        web_pattern = re.compile(r'content-sub-title">(.*?)\s*?</h1>[\s\S]*?Website[\s\S]*?href="(.*?)"')
        web = web_pattern.findall(response2)
        web_list.extend(web)

        Contact_pattern = re.compile(r'content-sub-title">(.*?)\s*?</h1>[\s\S]*?Contact Person[\s\S]*?<p>(.*?)</p>')
        contact = Contact_pattern.findall(response2)
        person_list.extend(contact)

        intro_pattern = re.compile(r'content-sub-title">(.*?)\s*?</h1>[\s\S]*?Introduction[\s\S]*?<p>([\s\S]*?)</p>')
        intro = intro_pattern.findall(response2)
        intro_list.extend(intro)
    time.sleep(4)

name_df = pd.DataFrame(name_list, columns=['name'])
tel_df = pd.DataFrame(tel_list,columns=['name','tel_no'])
mail_df = pd.DataFrame(mail_list, columns=['name', 'mail'])
web_df = pd.DataFrame(web_list, columns=['name', 'web'])
person_df = pd.DataFrame(person_list, columns=['name', 'contact_person'])
intro_df = pd.DataFrame(intro_list,columns=['name','Introduction'])

result=pd.merge(name_df,tel_df,on='name',how='left')
result1=pd.merge(result,mail_df,on='name',how='left')
result2=pd.merge(result1,web_df,on='name',how='left')
result3=pd.merge(result2,person_df,on='name',how='left')
result4=pd.merge(result3,intro_df,on='name',how='left')
pd.set_option('display.max_columns', None)
result4.to_excel('HKSTP.xlsx')
end=time.time()
print("Running time: %s seconds"%(end - start))