Preface
I came across an interesting crawler idea and decided to try it out here.
Throughout the crawl the request rate is throttled, and only a small amount of data is fetched, just enough to verify that the program logic is sound.
References
Blog post:
https://blog.csdn.net/bone_ace/article/details/71055153
GitHub repository:
https://github.com/LiuXingMing/LinkedinSpider
Approach
I follow the original author's third approach: use a third-party platform such as Baidu to find the employees of a given company on LinkedIn.
This means first writing a crawler for Baidu search results, whose job is to collect the LinkedIn profile URLs of the company's employees.
Once the profile URLs are in hand, a second crawler visits each profile and scrapes the information on the page.
Data handling: pandas, numpy
Parsing and extracting text: re, BeautifulSoup4, lxml
Making page requests: requests, Selenium WebDriver
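For reference, a minimal import block covering the libraries above might look like the following (the original post omits the import section; ACCOUNT and PASSWORD are placeholders for your own Baidu credentials):

import json
import re
import time
from urllib.parse import quote

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

# Placeholders: Baidu credentials used by login() below
ACCOUNT = 'your_baidu_account'
PASSWORD = 'your_baidu_password'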
Practice
While experimenting I found that requests calls to Baidu and LinkedIn failed frequently, so I switched to the webdriver module from the selenium package for fetching pages, which worked much better.
BeautifulSoup4 is used to parse the fetched pages.
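The basic fetch-and-parse pattern used throughout is roughly the following (a minimal sketch relying on the imports above; the URL is only a placeholder):

driver = webdriver.Chrome()
driver.get('https://www.example.com/')            # the real browser fetches and renders the page
soup = BeautifulSoup(driver.page_source, 'lxml')  # hand the rendered HTML to BeautifulSoup
print(soup.title)
driver.close()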
Crawling Baidu search results
The target company is Midea (美的); the corresponding Baidu query is "美的 site:linkedin.com".
Simulated login to obtain cookies
Requests made without cookies often get no response, so the first step is to log in with webdriver, grab the cookies and save them locally.
def login():
    # Open the Baidu passport page and switch to username/password login
    # (find_element_by_id is the Selenium 3.x API)
    driver = webdriver.Chrome()
    driver.get("https://passport.baidu.com/v2/?login")
    driver.find_element_by_id("TANGRAM__PSP_3__footerULoginBtn").click()
    driver.find_element_by_id("TANGRAM__PSP_3__userName").send_keys(ACCOUNT)
    driver.find_element_by_id("TANGRAM__PSP_3__password").send_keys(PASSWORD)
    driver.find_element_by_id("TANGRAM__PSP_3__submit").click()
    # Dump the session cookies to a local file for later reuse
    dict_cookies = driver.get_cookies()
    json_cookie = json.dumps(dict_cookies)
    print(json_cookie)
    with open('./cookie2.txt', 'w') as f:
        f.write(json_cookie)
    driver.close()
Loading the cookies and starting the crawl
Build the request URL; number selects the result page (Baidu's pn parameter advances in steps of 10):
url = 'http://www.baidu.com/s?ie=UTF-8&wd=' + quote(company_name) + '%20site%3Alinkedin.com&pn=' + str(number)
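For example, quote() percent-encodes the company name, and stepping number by 10 walks through the result pages (a small illustration of the URL scheme, not part of the original script):

company_name = '美的'
for number in range(0, 30, 10):   # first three Baidu result pages
    url = ('http://www.baidu.com/s?ie=UTF-8&wd=' + quote(company_name)
           + '%20site%3Alinkedin.com&pn=' + str(number))
    print(url)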
Read the cookies back from disk and load them into the browser session:
driver = webdriver.Chrome()
driver.get(url)
with open('./cookie2.txt', 'r', encoding='utf8') as f:
    list_cookies = json.loads(f.read())
for cookie in list_cookies:
    # add_cookie() sometimes rejects the 'expiry' field, so drop it first
    if 'expiry' in cookie:
        del cookie['expiry']
    driver.add_cookie(cookie)
Note: you must call driver.get() first and only add the cookies afterwards, rather than doing both together as you would with requests; otherwise Selenium raises an invalid cookie domain exception.
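This get-then-add_cookie order can be wrapped in a small helper; the function name load_cookies is mine, not the original author's:

def load_cookies(driver, path):
    # The browser must already be on a page of the target domain,
    # otherwise add_cookie() raises an invalid cookie domain exception.
    with open(path, 'r', encoding='utf8') as f:
        for cookie in json.loads(f.read()):
            cookie.pop('expiry', None)   # the field add_cookie() may reject
            driver.add_cookie(cookie)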
The crawling code:
I first tried driving pagination through webdriver click events, but that proved unreliable, so the page number is instead encoded directly into the URL via str(number).
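The loop below relies on a few variables being initialised beforehand, along these lines (the names follow the loop body; the exact starting values are my assumption):

company_name = '美的'
page = 1                            # Baidu result page counter
number = 0                          # value of the pn URL parameter, +10 per page
series = pd.Series(dtype=object)    # collects the cleaned LinkedIn profile URLs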
while page <= 15:
    url = 'http://www.baidu.com/s?ie=UTF-8&wd=' + quote(company_name) + '%20site%3Alinkedin.com&pn=' + str(number)
    driver.get(url)
    # Baidu wraps every result in a redirect link; collect the unique ones
    hrefs = list(set(re.findall(r'"(http://www\.baidu\.com/link\?url=.*?)"', driver.page_source)))
    print(len(hrefs))
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, compress',
        'Accept-Language': 'en-us;q=0.5,en;q=0.3',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive',
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0'
    }
    for href in hrefs:
        real_url = ''
        try:
            # Resolve the Baidu redirect without following it; the Location
            # header of the 302 response is the real LinkedIn URL
            resp = requests.get(href, headers=headers, allow_redirects=False)
            if resp.status_code == 302:
                real_url = resp.headers['Location']
        except Exception:
            continue
        # Keep only personal profile pages, not company or job listing pages
        if not real_url or '/company/' in real_url or '/jobs/' in real_url:
            continue
        # Normalise regional subdomains (cn., hk., ...) to www.
        true_url = re.sub(r'cn\.|ve\.|hk\.|de\.|jm\.|li\.', 'www.', real_url)
        print(true_url)
        series = series.append(pd.Series(true_url), ignore_index=True)  # Series.append needs pandas < 2.0
        time.sleep(2)
    page = page + 1
    number = number + 10
print(series.values)
series.to_csv('url.csv', index=False)
The collected links (held in a pandas Series here) are written to disk with to_csv.
This run yielded roughly 140 employee profile URLs to work with.
Crawling LinkedIn employee profile pages
As before, webdriver is used to simulate a login, and the cookies are saved locally (to cookie.txt this time).
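The original post does not show this step; one straightforward way (a sketch, not the author's exact code) is to open the LinkedIn login page, sign in by hand in the browser window, and dump the cookies the same way as for Baidu:

def linkedin_login():
    driver = webdriver.Chrome()
    driver.get('https://www.linkedin.com/login')
    input('Log in manually in the browser window, then press Enter here...')
    # Save the authenticated session cookies for the profile crawler below
    with open('./cookie.txt', 'w') as f:
        f.write(json.dumps(driver.get_cookies()))
    driver.close()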
Scraping the information on a profile page:
def message_get(source, url):
    # Parse the page source of a single LinkedIn profile
    soup = BeautifulSoup(source, 'lxml')
    # Name, headline, location and connection count are located by their CSS classes
    name = soup.find_all(name='li', attrs='inline t-24 t-black t-normal break-words')[0].get_text().replace(' ', '')
    position = soup.find_all(name='h2', attrs='mt1 t-18 t-black t-normal break-words')
    if position == []:
        position = 'None'
    else:
        position = position[0].get_text().replace(' ', '')
    country = soup.find_all(name='li', attrs='t-16 t-black t-normal inline-block')
    if country == []:
        country = 'None'
    else:
        country = country[0].get_text().replace(' ', '')
    friend_number = soup.find_all(name='span', attrs='t-16 t-black t-normal')
    if friend_number == []:
        friend_number = 'None'
    else:
        friend_number = friend_number[0].get_text().replace(' ', '')
    # The experience/education sections carry one of two class names depending on
    # whether the list is expandable, so try both
    working_experiences = soup.find_all(
        name='ul',
        attrs='pv-profile-section__section-info section-info pv-profile-section__section-info--has-no-more')
    if working_experiences == []:
        working_experiences = soup.find_all(
            name='ul',
            attrs='pv-profile-section__section-info section-info pv-profile-section__section-info--has-more')
    li = []
    for message in working_experiences:
        message = message.get_text().replace(' ', '')
        message = re.sub(r'\n', '', message)
        li.append(message)
    name = re.sub(r'\n', '', name)
    position = re.sub(r'\n', '', position)
    friend_number = re.sub(r'\n', '', friend_number)
    # Pack everything into a dict; the first <ul> holds the work experience,
    # the second (if present) holds the education history
    if li == []:
        dic = {'name': name,
               'position': position,
               'country': country,
               'friend_number': friend_number,
               'working_experiences': 'None',
               'education_experiences': 'None'}
    elif len(li) == 1:
        dic = {'name': name,
               'position': position,
               'country': country,
               'friend_number': friend_number,
               'working_experiences': li[0],
               'education_experiences': li[0]}
    else:
        dic = {'name': name,
               'position': position,
               'country': country,
               'friend_number': friend_number,
               'working_experiences': li[0],
               'education_experiences': li[1]}
    return dic
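For a quick offline check, message_get can be fed a page source that was saved to disk; profile.html and the URL below are just placeholders:

with open('profile.html', encoding='utf8') as f:
    print(message_get(f.read(), 'https://www.linkedin.com/in/someone'))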
Iterating over all the profile URLs and collecting the employee information
Only 50 profiles are scraped here, enough to confirm the approach works. The records are accumulated in a DataFrame and written to disk with to_csv.
def get_linkedin_message():
    # Load the profile URLs collected from Baidu and drop duplicates
    df = pd.read_csv('./url.csv')
    remove_re = df.drop_duplicates(keep='first')
    series = remove_re.loc[:, '0']
    urls = series.values
    # Open LinkedIn, inject the saved cookies, then reload as a logged-in user
    driver = webdriver.Chrome()
    driver.get('https://www.linkedin.com/')
    with open('./cookie.txt', 'r', encoding='utf8') as f:
        list_cookies = json.loads(f.read())
    for cookie in list_cookies:
        if 'expiry' in cookie:
            del cookie['expiry']
        driver.add_cookie(cookie)
    driver.get('https://www.linkedin.com/')
    number = 0
    df2 = pd.DataFrame(columns=['name',
                                'position',
                                'country',
                                'friend_number',
                                'working_experiences',
                                'education_experiences'])
    for url in urls:
        driver.get(url)
        source = driver.page_source
        dic = message_get(source, url)
        se1 = pd.Series(dic)
        df2 = df2.append(se1, ignore_index=True)  # DataFrame.append needs pandas < 2.0
        print(dic)
        number = number + 1
        print('-----------', number, '------------')
        # Sleep 1-3 seconds at random to keep the request rate low
        random_number = np.random.uniform(1, 3)
        time.sleep(round(random_number, 2))
        if number == 50:
            # Stop after 50 profiles, deduplicate by name and save the result
            df2 = df2.drop_duplicates('name', keep='first')
            df2.to_csv('result.csv', index=False, encoding='GB18030')
            time.sleep(3)
            driver.close()
            print('Done')
            exit()
    print(df2)
The results obtained:
Summary
This code exists only to verify that the program logic is correct, so it has not been refactored; it is bloated and repetitive, and I will make targeted improvements later.