Scraping Job Listings from 51job

浅汐王

Article link: https://blog.csdn.net/qq_32252917/article/details/78175046

    #!/usr/bin/env python3
    # Pipeline: website -> page source -> job fields -> match with findall -> write to file

    import re
    import urllib.request

    # Search URL for the keyword "java"; {} is filled in with the page number
    URL = ('http://search.51job.com/jobsearch/search_result.php?fromJs=1&jobarea=000000%2C00&district=000000&funtype=0000&industrytype=00&issuedate=9&providesalary=99&keyword=java&keywordtype=2&curr_page={}'
           '&lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&list_type=0&fromType=14&dibiaoid=0&confirmdate=9')

    # One capture group per column (t1 to t5) of a result row
    PATTERN = re.compile(r'class="t1 ">.*?<a target="_blank" title="(.*?)".*?<span class="t2"><a target="_blank" title="(.*?)".*?<span class="t3">(.*?)</span>.*?<span class="t4">(.*?)</span>.*?<span class="t5">(.*?)</span>',
                         re.S)


    # Open the page and fetch its source
    def get_content(page):
        with urllib.request.urlopen(URL.format(page)) as response:  # open the page
            html = response.read()  # read the source as bytes
        return html.decode('gbk')  # convert from GBK to str


    # Match the listing rows
    def get(html):
        items = PATTERN.findall(html)
        return items  # list of 5-tuples


    # Walk through the pages and append every row to a tab-separated file
    with open('51job.txt', 'a', encoding='utf-8') as f:
        for page in range(1, 2000):
            html = get_content(page)  # fetch the page source
            for item in get(html):
                print(item[0], item[1], item[2], item[3], item[4])
                f.write('\t'.join(item) + '\n')
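To see what the matching step returns without hitting the network, the same regular expression used in get() can be run against a small HTML fragment. The sketch below is written for Python 3. The fragment, its field values, and the helper names parse_rows and to_tsv_line are all made up for illustration; the fragment is hand-written to fit the pattern and is not real 51job output.

```python
import re

# Same regular expression as in get(), split across lines for readability
PATTERN = re.compile(
    r'class="t1 ">.*?<a target="_blank" title="(.*?)".*?'
    r'<span class="t2"><a target="_blank" title="(.*?)".*?'
    r'<span class="t3">(.*?)</span>.*?'
    r'<span class="t4">(.*?)</span>.*?'
    r'<span class="t5">(.*?)</span>',
    re.S,  # let "." match newlines, since one row spans several lines
)

# Hypothetical fragment written by hand to fit the pattern (not real site output)
SAMPLE_HTML = '''
<div class="el">
  <p class="t1 ">
    <span><a target="_blank" title="Java Developer" href="#">Java Developer</a></span>
  </p>
  <span class="t2"><a target="_blank" title="Example Co." href="#">Example Co.</a></span>
  <span class="t3">Shanghai</span>
  <span class="t4">10-15k/month</span>
  <span class="t5">10-08</span>
</div>
'''


def parse_rows(html):
    """Return one 5-tuple of captured strings per result row in html."""
    return PATTERN.findall(html)


def to_tsv_line(row):
    """Join the captured fields with tabs, as the script does when writing."""
    return '\t'.join(row) + '\n'


rows = parse_rows(SAMPLE_HTML)
for row in rows:
    print(to_tsv_line(row), end='')
```

Because the pattern has five capture groups, findall returns a list of 5-tuples, which is why the script can index each item from 0 to 4. A pattern this tightly bound to the markup stops matching as soon as the site changes its class names or attribute order, so checking it against a saved page before a long crawl is a cheap safeguard.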
