Python: Web Crawler Series 01

南郭竽

Columns: 使用python (Using Python), 大猫学python (Big Cat Learns Python). Tags: python, web crawler

Article link: https://blog.csdn.net/DucklikeJAVA/article/details/73698636

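I have been reading Learning Python for a while and have gotten roughly as far as the chapters on classes, but I had never actually sat down and practiced.
So I decided to write something small. I was not sure what to write, so I chose to start with a web crawler.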

Following crawler tutorials found online, I wrote a small exercise that collects the links found in a web page.
- common_var.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @author : cat
# @date   : 2017/6/25.

user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
headers = {"User-Agent": user_agent}

if __name__ == '__main__':
    pass

- http_file.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @author : cat
# @date   : 2017/6/24.
from urllib import request
import ssl
from web.common_var import headers
import re

# URL-validation regex adapted from Django
regex = re.compile(
    r'^(?:http|ftp)s?://'  # http:// or https://
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|'  # domain...
    r'localhost|'  # localhost...
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'  # ...or ip
    r'(?::\d+)?'  # optional port
    r'(?:/?|[/?]\S+)$', re.IGNORECASE)

csdn = 'http://www.csdn.com'


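def get_urls(url_in=csdn, key="href="):
    """
    Collect the URLs found on the page at an entry URL.
    :param url_in: the entry URL
    :param key: marker that precedes a link, 'href=' by default
    :return: a set of the absolute URLs found
    """
    url_sets = set()
    # NOTE: certificate verification is disabled here; acceptable for an exercise only.
    ssl_context = ssl._create_unverified_context()
    req = request.Request(url_in, headers=headers)
    with request.urlopen(req, context=ssl_context) as resp:
        for line in resp:
            # Ignore undecodable bytes so a non-UTF-8 page does not abort the run.
            line_html = line.decode('utf-8', errors='ignore')
            if key not in line_html:
                continue
            # Only the first occurrence of the key on each line is examined.
            index = line_html.index(key)
            # Take the text after the first double quote, up to the next double quote or '#'.
            parts = line_html[index + len(key):].replace('"', "#").split('#')
            if len(parts) < 2:
                continue  # no double-quoted value after the key on this line
            match = regex.search(parts[1])
            if match:
                url_sets.add(match.group())
    return url_sets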


if __name__ == '__main__':
    # print(list(get_urls("http://news.baidu.com/?tn=news")))
    baidu_news = "http://news.baidu.com/?tn=news"
    urls = get_urls(baidu_news)
    # print(urls)
    for u in urls:
        print(u)
    print("total url size in this website({}) = {}"
          .format(baidu_news, len(urls)))

The code is not especially concise, but it is reasonably easy to follow.

The output is as follows:

/web/http_file.py

https://baijia.baidu.com/s?id=1571043179126899
http://net.china.cn/chinese/index.htm
http://newsalert.baidu.com/na?cmd=0
http://tech.baidu.com/
http://tv.cctv.com/2017/06/24/VIDE9KYKPMTmLLENgIgdhyut170624.shtml
http://xinwen.eastday.com/a/170624122900408.html
http://shehui.news.baidu.com/
… # Many more URLs follow; they are not all listed here.
…

total url size in this website(http://news.baidu.com/?tn=news) = 116

Process finished with exit code 0
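
The quote-splitting approach in get_urls looks at only the first href= on each line and skips single-quoted or relative links. As a possible alternative, here is a minimal sketch that uses the standard library's html.parser and urllib.parse.urljoin instead. The names LinkCollector and extract_links are hypothetical helpers introduced for illustration; they are not part of the original exercise, and the sample HTML is made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute http(s) links from <a href=...> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                absolute = urljoin(self.base_url, value)
                # Drop javascript:, mailto: and other non-web schemes.
                if absolute.startswith(("http://", "https://")):
                    self.links.add(absolute)


def extract_links(html_text, base_url):
    """Return the set of absolute links found in html_text."""
    parser = LinkCollector(base_url)
    parser.feed(html_text)
    return parser.links


if __name__ == "__main__":
    # Made-up sample page, so the sketch runs without network access.
    sample = ('<a href="/a">A</a> <a href=\'http://example.com/b\'>B</a> '
              '<a href="javascript:void(0)">C</a>')
    for link in sorted(extract_links(sample, "http://example.com/")):
        print(link)
```

To combine it with the code above, the decoded response body would be passed to extract_links together with the URL that was requested.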

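Next step:

Next I plan to visit the sub-links as well and see how many links there are in total. That looks like a huge undertaking, and I am not sure whether I will actually finish it…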
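
One way the next step might look is a breadth-first crawl with a depth limit and a page limit, so the job stays bounded. This is only a sketch under assumptions: crawl, max_depth and max_pages are hypothetical names, and the link-fetching function is passed in as a parameter (for example get_urls from http_file.py), so the demo below can run on a made-up in-memory link graph without network access.

```python
from collections import deque


def crawl(start_url, get_links, max_depth=1, max_pages=50):
    """Breadth-first crawl; returns the set of all URLs discovered.

    get_links is any callable that maps a URL to an iterable of URLs.
    max_depth=0 fetches only the start page; max_depth=1 also fetches
    the pages it links to, and so on. max_pages caps the number of fetches.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    fetched = 0
    while queue and fetched < max_pages:
        url, depth = queue.popleft()
        try:
            links = get_links(url)
        except Exception as exc:  # e.g. network or HTTP errors
            print("skip {}: {}".format(url, exc))
            continue
        fetched += 1
        for link in links:
            if link in seen:
                continue  # never record or queue the same URL twice
            seen.add(link)
            if depth + 1 <= max_depth:
                queue.append((link, depth + 1))
    return seen


if __name__ == "__main__":
    # Made-up link graph standing in for real pages.
    graph = {"a": ["b", "c"], "b": ["d"], "c": ["a"], "d": ["e"]}
    found = crawl("a", lambda u: graph.get(u, []), max_depth=1)
    print(sorted(found))
    print("total url size = {}".format(len(found)))
```

With the original code, a call such as crawl(baidu_news, get_urls, max_depth=1) would visit the entry page and each page it links to; a polite crawler would probably also pause between requests.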
