python--51job crawler (2.2)

哈哈哈也不行吗

Category: python. Tags: python, BeautifulSoup, regular expressions

Article link: https://blog.csdn.net/qq_40210633/article/details/83501035

This version uses regular expressions.

Reference I learned from: https://blog.csdn.net/qq_38317509/article/details/79400094?utm_source=blogxgwz0

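# -*- coding:utf-8 -*-
import urllib.request
import re

# Fetch the page source. The argument is the page number, because the site
# names its result pages 1.html, 2.html, ...
def get_content(page):
    url = 'https://search.51job.com/list/010000%252C020000%252C030200%252C040000,000000,0000,00,9,99,python%2520java,2,' + str(page) + '.html'
    a = urllib.request.urlopen(url)  # open the URL
    html = a.read().decode('gbk')  # read the source bytes and decode them from GBK to str
    #print(html)
    return html

# Extract the fields with a regular expression. Compared with the previous
# method, the differences include re.S and how characters are handled;
# I need to check the textbook on these later.
def get(html):
    reg = re.compile(r'class="t1 ">.*? <a target="_blank" title="(.*?)".*? <span class="t3">(.*?)</span>.*?<span class="t4">(.*?)</span>.*? <span class="t5">(.*?)</span>', re.S)
    items = re.findall(reg, html)
    return items

# Process multiple pages: crawl the first 10
for j in range(1, 11):
    print("Crawling page " + str(j) + "...")
    html = get_content(j)  # fetch the page source
    for i in get(html):
        # Call the functions here; note which field each index of i holds:
        # i[0] is the title attribute, i[1] to i[3] are the t3, t4 and t5 spans
        print(i[1] + '\t' + i[2] + '\t' + i[3] + '\n')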
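
The paging and decoding steps can be tried without touching the network. The sketch below is only an illustration: the helper names build_url and decode_page are hypothetical (they are not in the original code), and the bytes being decoded are a made-up sample rather than a real 51job response. It assumes the site still serves GBK, as the code above does.

```python
# Hypothetical helpers for illustration; build_url and decode_page are not
# part of the original post.
BASE_URL = ('https://search.51job.com/list/'
            '010000%252C020000%252C030200%252C040000,000000,0000,00,9,99,'
            'python%2520java,2,')


def build_url(page):
    """Return the search-result URL for a 1-based page number."""
    return BASE_URL + str(page) + '.html'


def decode_page(raw_bytes):
    """Decode GBK bytes to str, replacing undecodable bytes instead of raising."""
    return raw_bytes.decode('gbk', errors='replace')


# Pages 1..10 map to 1.html .. 10.html
urls = [build_url(page) for page in range(1, 11)]
print(urls[0])
print(len(urls))

# Made-up sample bytes standing in for a.read() in the original code
sample_bytes = '职位名'.encode('gbk')
decoded = decode_page(sample_bytes)
print(decoded == '职位名')
```

Passing errors='replace' is a design choice of this sketch: a single stray byte in a page then shows up as a replacement character instead of stopping the whole crawl with a UnicodeDecodeError.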

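This does produce results, but it is an adaptation of someone else's code, and I still do not really understand how the regular expression itself works.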
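
To see what re.S and the (.*?) groups are doing, here is a small sketch that runs the same pattern on a made-up HTML fragment. The sample_html string and its values are my own invention and only imitate the structure the pattern expects; they are not real 51job output.

```python
import re

# Same pattern as in get(); the raw string is split across lines only for readability.
PATTERN = (r'class="t1 ">.*? <a target="_blank" title="(.*?)".*? '
           r'<span class="t3">(.*?)</span>.*?<span class="t4">(.*?)</span>.*? '
           r'<span class="t5">(.*?)</span>')

# Made-up fragment (an assumption, not real site data). The tags sit on
# separate lines, which is exactly the case re.S is needed for.
sample_html = (
    '<p class="t1 ">\n'
    '  <a target="_blank" title="Python Developer" href="#">Python Developer</a>\n'
    '</p>\n'
    '  <span class="t3">Beijing</span>\n'
    '  <span class="t4">10-15k/month</span>\n'
    '  <span class="t5">10-29</span>\n'
)

# With re.S, "." also matches newlines, so ".*?" can skip across lines.
with_flag = re.findall(PATTERN, sample_html, re.S)

# Without re.S, "." stops at each newline, so nothing matches here.
without_flag = re.findall(PATTERN, sample_html)

print(with_flag)
print(without_flag)

# Each match is a tuple with one entry per (...) group, in pattern order.
for item in with_flag:
    print('title:', item[0])
    print('t3:', item[1], '| t4:', item[2], '| t5:', item[3])
```

Each (.*?) is a non-greedy group: it captures as little text as possible before the next literal piece of the pattern matches. Because the pattern has four groups, findall returns 4-tuples, which is why the loop in the crawler indexes i[1], i[2] and i[3].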

Off to read the textbook.
