Preface
I came across an interesting crawler idea and decided to try it out here.
Throughout the crawl the request rate is throttled, and only a small amount of data is fetched, just enough to verify that the program logic is sound.
References
Blog post:
https://blog.csdn.net/bone_ace/article/details/71055153
GitHub repository:
https://github.com/LiuXingMing/LinkedinSpider
Approach
I follow the original author's third approach: use a third-party platform such as Baidu to find the employees of a given company on LinkedIn.
This means first writing a crawler for Baidu search results, whose job is to collect the LinkedIn profile URLs of the company's employees.
Once the profile URLs are in hand, a second crawler visits each profile and scrapes the information on the page.
Data handling: pandas, numpy
Parsing and extracting text: re, BeautifulSoup4, lxml
Making page requests: requests, Selenium WebDriver
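For reference, a minimal import block covering the libraries above might look like the following (the original post omits the import section; ACCOUNT and PASSWORD are placeholders for your own Baidu credentials):

import json
import re
import time
from urllib.parse import quote

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

# Placeholders: Baidu credentials used by login() below
ACCOUNT = 'your_baidu_account'
PASSWORD = 'your_baidu_password'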
Practice
While experimenting I found that requests calls to Baidu and LinkedIn failed frequently, so I switched to the webdriver module from the selenium package for fetching pages, which worked much better.
BeautifulSoup4 is used to parse the fetched pages.
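The basic fetch-and-parse pattern used throughout is roughly the following (a minimal sketch relying on the imports above; the URL is only a placeholder):

driver = webdriver.Chrome()
driver.get('https://www.example.com/')            # the real browser fetches and renders the page
soup = BeautifulSoup(driver.page_source, 'lxml')  # hand the rendered HTML to BeautifulSoup
print(soup.title)
driver.close()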
Crawling Baidu search results
The target company is Midea (美的); the corresponding Baidu query is "美的 site:linkedin.com".
Simulated login to obtain cookies
Requests made without cookies often get no response, so the first step is to log in with webdriver, grab the cookies and save them locally.
def login():
    # Open the Baidu passport page and switch to username/password login
    # (find_element_by_id is the Selenium 3.x API)
    driver = webdriver.Chrome()
    driver.get("https://passport.baidu.com/v2/?login")
    driver.find_element_by_id("TANGRAM__PSP_3__footerULoginBtn").click()
    driver.find_element_by_id("TANGRAM__PSP_3__userName").send_keys(ACCOUNT)
    driver.find_element_by_id("TANGRAM__PSP_3__password").send_keys(PASSWORD)
    driver.find_element_by_id("TANGRAM__PSP_3__submit").click()
    # Dump the session cookies to a local file for later reuse
    dict_cookies = driver.get_cookies()
    json_cookie = json.dumps(dict_cookies)
    print(json_cookie)
    with open('./cookie2.txt', 'w') as f:
        f.write(json_cookie)
    driver.close()
Loading the cookies and starting the crawl
Build the request URL; number selects the result page (Baidu's pn parameter advances in steps of 10):
url = 'http://www.baidu.com/s?ie=UTF-8&wd=' + quote(company_name) + '%20site%3Alinkedin.com&pn=' + str(number)
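For example, quote() percent-encodes the company name, and stepping number by 10 walks through the result pages (a small illustration of the URL scheme, not part of the original script):

company_name = '美的'
for number in range(0, 30, 10):   # first three Baidu result pages
    url = ('http://www.baidu.com/s?ie=UTF-8&wd=' + quote(company_name)
           + '%20site%3Alinkedin.com&pn=' + str(number))
    print(url)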
Read the cookies back from disk and load them into the browser session:
driver = webdriver.Chrome()
driver.get(url)
with open('./cookie2.txt', 'r', encoding='utf8') as f:
    list_cookies = json.loads(f.read())
for cookie in list_cookies:
    # add_cookie() sometimes rejects the 'expiry' field, so drop it first
    if 'expiry' in cookie:
        del cookie['expiry']
    driver.add_cookie(cookie)
Note: you must call driver.get() first and only add the cookies afterwards, rather than doing both together as you would with requests; otherwise Selenium raises an invalid cookie domain exception.
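This get-then-add_cookie order can be wrapped in a small helper; the function name load_cookies is mine, not the original author's:

def load_cookies(driver, path):
    # The browser must already be on a page of the target domain,
    # otherwise add_cookie() raises an invalid cookie domain exception.
    with open(path, 'r', encoding='utf8') as f:
        for cookie in json.loads(f.read()):
            cookie.pop('expiry', None)   # the field add_cookie() may reject
            driver.add_cookie(cookie)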
The crawling code:
I first tried driving pagination through webdriver click events, but that proved unreliable, so the page number is instead encoded directly into the URL via str(number).
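The loop below relies on a few variables being initialised beforehand, along these lines (the names follow the loop body; the exact starting values are my assumption):

company_name = '美的'
page = 1                            # Baidu result page counter
number = 0                          # value of the pn URL parameter, +10 per page
series = pd.Series(dtype=object)    # collects the cleaned LinkedIn profile URLs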
while page <= 15:
    url = 'http://www.baidu.com/s?ie=UTF-8&wd=' + quote(company_name) + '%20site%3Alinkedin.com&pn=' + str(number)
    driver.get(url)
    # Baidu wraps every result in a redirect link; collect the unique ones
    hrefs = list(set(re.findall(r'"(http://www\.baidu\.com/link\?url=.*?)"', driver.page_source)))
    print(len(hrefs))
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, compress',
        'Accept-Language': 'en-us;q=0.5,en;q=0.3',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive',
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0'
    }
    for href in hrefs:
        real_url = ''
        try:
            # Resolve the Baidu redirect without following it; the Location
            # header of the 302 response is the real LinkedIn URL
            resp = requests.get(href, headers=headers, allow_redirects=False)
            if resp.status_code == 302:
                real_url = resp.headers['Location']
        except Exception:
            continue
        # Keep only personal profile pages, not company or job listing pages
        if not real_url or '/company/' in real_url or '/jobs/' in real_url:
            continue
        # Normalise regional subdomains (cn., hk., ...) to www.
        true_url = re.sub(r'cn\.|ve\.|hk\.|de\.|jm\.|li\.', 'www.', real_url)
        print(true_url)
        series = series.append(pd.Series(true_url), ignore_index=True)  # Series.append needs pandas < 2.0
        time.sleep(2)
    page = page + 1
    number = number + 10
print(series.values)
series.to_csv('url.csv', index=False)
The collected links (held in a pandas Series here) are written to disk with to_csv.
This run yielded roughly 140 employee profile URLs to work with.
Crawling LinkedIn employee profile pages
As before, webdriver is used to simulate a login, and the cookies are saved locally (to cookie.txt this time).
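The original post does not show this step; one straightforward way (a sketch, not the author's exact code) is to open the LinkedIn login page, sign in by hand in the browser window, and dump the cookies the same way as for Baidu:

def linkedin_login():
    driver = webdriver.Chrome()
    driver.get('https://www.linkedin.com/login')
    input('Log in manually in the browser window, then press Enter here...')
    # Save the authenticated session cookies for the profile crawler below
    with open('./cookie.txt', 'w') as f:
        f.write(json.dumps(driver.get_cookies()))
    driver.close()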
Scraping the information on a profile page:
def message_get(source, url):
    # Parse the page source of a single LinkedIn profile
    soup = BeautifulSoup(source, 'lxml')
    # Name, headline, location and connection count are located by their CSS classes
    name = soup.find_all(name='li', attrs='inline t-24 t-black t-normal break-words')[0].get_text().replace(' ', '')
    position = soup.find_all(name='h2', attrs='mt1 t-18 t-black t-normal break-words')
    if position == []:
        position = 'None'
    else:
        position = position[0].get_text().replace(' ', '')
    country = soup.find_all(name='li', attrs='t-16 t-black t-normal inline-block')
    if country == []:
        country = 'None'
    else:
        country = country[0].get_text().replace(' ', '')
    friend_number = soup.find_all(name='span', attrs='t-16 t-black t-normal')
    if friend_number == []:
        friend_number = 'None'
    else:
        friend_number = friend_number[0].get_text().replace(' ', '')
    # The experience/education sections carry one of two class names depending on
    # whether the list is expandable, so try both
    working_experiences = soup.find_all(
        name='ul',
        attrs='pv-profile-section__section-info section-info pv-profile-section__section-info--has-no-more')
    if working_experiences == []:
        working_experiences = soup.find_all(
            name='ul',
            attrs='pv-profile-section__section-info section-info pv-profile-section__section-info--has-more')
    li = []
    for message in working_experiences:
        message = message.get_text().replace(' ', '')
        message = re.sub(r'\n', '', message)
        li.append(message)
    name = re.sub(r'\n', '', name)
    position = re.sub(r'\n', '', position)
    friend_number = re.sub(r'\n', '', friend_number)
    # Pack everything into a dict; the first <ul> holds the work experience,
    # the second (if present) holds the education history
    if li == []:
        dic = {'name': name,
               'position': position,
               'country': country,
               'friend_number': friend_number,
               'working_experiences': 'None',
               'education_experiences': 'None'}
    elif len(li) == 1:
        dic = {'name': name,
               'position': position,
               'country': country,
               'friend_number': friend_number,
               'working_experiences': li[0],
               'education_experiences': li[0]}
    else:
        dic = {'name': name,
               'position': position,
               'country': country,
               'friend_number': friend_number,
               'working_experiences': li[0],
               'education_experiences': li[1]}
    return dic
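For a quick offline check, message_get can be fed a page source that was saved to disk; profile.html and the URL below are just placeholders:

with open('profile.html', encoding='utf8') as f:
    print(message_get(f.read(), 'https://www.linkedin.com/in/someone'))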
Iterating over all the profile URLs and collecting the employee information
Only 50 profiles are scraped here, enough to confirm the approach works. The records are accumulated in a DataFrame and written to disk with to_csv.
def get_linkedin_message():
    # Load the profile URLs collected from Baidu and drop duplicates
    df = pd.read_csv('./url.csv')
    remove_re = df.drop_duplicates(keep='first')
    series = remove_re.loc[:, '0']
    urls = series.values
    # Open LinkedIn, inject the saved cookies, then reload as a logged-in user
    driver = webdriver.Chrome()
    driver.get('https://www.linkedin.com/')
    with open('./cookie.txt', 'r', encoding='utf8') as f:
        list_cookies = json.loads(f.read())
    for cookie in list_cookies:
        if 'expiry' in cookie:
            del cookie['expiry']
        driver.add_cookie(cookie)
    driver.get('https://www.linkedin.com/')
    number = 0
    df2 = pd.DataFrame(columns=['name',
                                'position',
                                'country',
                                'friend_number',
                                'working_experiences',
                                'education_experiences'])
    for url in urls:
        driver.get(url)
        source = driver.page_source
        dic = message_get(source, url)
        se1 = pd.Series(dic)
        df2 = df2.append(se1, ignore_index=True)  # DataFrame.append needs pandas < 2.0
        print(dic)
        number = number + 1
        print('-----------', number, '------------')
        # Sleep 1-3 seconds at random to keep the request rate low
        random_number = np.random.uniform(1, 3)
        time.sleep(round(random_number, 2))
        if number == 50:
            # Stop after 50 profiles, deduplicate by name and save the result
            df2 = df2.drop_duplicates('name', keep='first')
            df2.to_csv('result.csv', index=False, encoding='GB18030')
            time.sleep(3)
            driver.close()
            print('Done')
            exit()
    print(df2)
The results obtained:
Summary
This code exists only to verify that the program logic is correct, so it has not been refactored; it is bloated and repetitive, and I will make targeted improvements later.