urljoin() from urlparse: a must-have for web crawlers


XCCS_澍


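Reposted from: https://blog.csdn.net/nciaebupt/article/details/7644757

First, import the module and use help() to view its documentation. The session below is from Python 2; in Python 3 the same function is imported with from urllib.parse import urljoin.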


>>> from urlparse import urljoin
>>> help(urljoin)
Help on function urljoin in module urlparse:

urljoin(base, url, allow_fragments=True)
    Join a base URL and a possibly relative URL to form an absolute
    interpretation of the latter.


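In other words, it combines a base URL with a relative URL to produce an absolute URL. That description is rather abstract, though.

So let's look at a few examples and work out the pattern from them.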


>>> urljoin("http://www.google.com/1/aaa.html","bbbb.html")
'http://www.google.com/1/bbbb.html'
>>> urljoin("http://www.google.com/1/aaa.html","2/bbbb.html")
'http://www.google.com/1/2/bbbb.html'
>>> urljoin("http://www.google.com/1/aaa.html","/2/bbbb.html")
'http://www.google.com/2/bbbb.html'
>>> urljoin("http://www.google.com/1/aaa.html","http://www.google.com/3/ccc.html")
'http://www.google.com/3/ccc.html'
>>> urljoin("http://www.google.com/1/aaa.html","http://www.google.com/ccc.html")
'http://www.google.com/ccc.html'
>>> urljoin("http://www.google.com/1/aaa.html","javascript:void(0)")
'javascript:void(0)'
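The cases in the session above can be collected into one small script. This is only a sketch: the first four inputs come from the session, while the `../bbbb.html` case and the `results` dictionary are additions made here for illustration. The import falls back to the Python 2 module name used above.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2, as in the session above

base = "http://www.google.com/1/aaa.html"

# (relative URL, what happens to the base)
cases = [
    ("bbbb.html", "replaces the last path segment of the base"),
    ("2/bbbb.html", "is appended to the directory of the base"),
    ("/2/bbbb.html", "replaces the whole path, keeps scheme and host"),
    ("http://www.google.com/3/ccc.html", "already absolute, returned as is"),
    ("../bbbb.html", "goes up one directory (case added for illustration)"),
]

# Map each relative URL to the absolute URL that urljoin() produces.
results = {relative: urljoin(base, relative) for relative, _ in cases}

for relative, note in cases:
    print("%-36s -> %-36s (%s)" % (relative, results[relative], note))
```

Read top to bottom, the output follows one rule: the more complete the second argument is, the less of the base URL survives.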


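The pattern is not hard to spot, but that is not the end of the story. Special cases still need handling, such as a link that points back to the page itself or a link that contains invalid characters.

# This fragment runs inside the loop over the links found on a page.
url = urljoin("****", "****")
# find() searches a string: it returns the index of the first match, or -1 if there is none.
if url.find("'") != -1:
    continue
# Keep only the part before the '#'.
url = url.split('#')[0]
# isindexed() is a function I defined myself; it checks whether the link is already in the database of saved links.
if url[0:4] == 'http' and not self.isindexed(url):
    # newpages = set(), an unordered collection with no duplicate elements
    newpages.add(url)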
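For reference, here is a self-contained version of that filtering step. It is a minimal sketch under stated assumptions rather than the author's crawler: `collect_links`, `page_url`, `hrefs` and `indexed` are hypothetical names, a plain set stands in for the database behind `isindexed()`, and the explicit self-link check is an addition made here to cover the "link points to the page itself" case.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2


def collect_links(page_url, hrefs, indexed):
    """Return the new absolute http(s) URLs found on page_url.

    page_url: URL of the page being crawled (hypothetical input)
    hrefs:    raw href values extracted from that page (hypothetical input)
    indexed:  set of already-saved URLs, standing in for isindexed()
    """
    newpages = set()
    for href in hrefs:
        url = urljoin(page_url, href)
        # Skip links containing a single quote, as in the fragment above.
        if "'" in url:
            continue
        # Keep only the part before the '#'.
        url = url.split('#')[0]
        # Skip links that resolve to the page itself (added assumption).
        if url == page_url:
            continue
        # Keep only http/https links that have not been saved yet.
        if url.startswith('http') and url not in indexed:
            newpages.add(url)
    return newpages


links = collect_links(
    "http://www.google.com/1/aaa.html",
    ["bbbb.html", "/2/bbbb.html#top", "#top",
     "javascript:void(0)", "it's.html", "2/bbbb.html"],
    indexed={"http://www.google.com/1/2/bbbb.html"},
)
print(sorted(links))
```

Of the six sample links, only `bbbb.html` and `/2/bbbb.html#top` survive: the others are a self-link, a non-http scheme, a URL with a quote character, and a URL that is already in the `indexed` set.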
