Scraping CSDN Links with a Crawler

Latest recommended article published on 2023-05-28 16:53:24

与人.ޓ


Article tags: crawler articles, urllib

Article link: https://blog.csdn.net/BlingBlingBlingd/article/details/102864071


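Crawling CSDN with regular expressions and the urllib library

I am a beginner: this is my first blog post and also my first crawler. It crawls some of the URLs on the CSDN home page (CSDN, please look the other way...). This is just a brief record of the process.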

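The re library

re (regular expressions) is a very handy text-matching module in Python 3. The post originally showed a usage reference chart for re at this point (taken from the internet; it will be removed if it infringes).

It is also very important to tell apart ( ), [ ], and { } in regular expressions, because each does something different: ( ) groups and captures, [ ] defines a character class, and { } gives a repetition count. Link: https://www.cnblogs.com/langren1992/p/9782191.html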
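To make the difference between the three kinds of brackets concrete, here is a small sketch. The sample strings are made up for illustration; only the re calls come from the standard library.

```python
import re

text = "ids: 42, 2019, 102864071; tags: py, re"  # made-up sample text

# [ ] is a character class: it matches any ONE character from the set.
digits = re.findall(r"[0-9]", "a1b22")

# { } is a repetition count: here, standalone runs of exactly four digits.
four_digit = re.findall(r"\b[0-9]{4}\b", text)

# ( ) groups and captures: findall returns only the captured part.
user = re.findall(
    r"blog\.csdn\.net/(\w+)/article",
    "https://blog.csdn.net/qq_37338761/article/details/102824008",
)

print(digits)      # ['1', '2', '2']
print(four_digit)  # ['2019']
print(user)        # ['qq_37338761']
```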

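Using urllib.request

urllib.request gives Python fairly complete support for fetching web page content.
urllib.request.Request(url, data) builds a request to send to url. data is the request body (supplying it makes the request a POST); request headers are passed separately through the headers parameter. There are many other parameters, but I don't know exactly how to use them, so I won't go into them. urllib.request.urlopen(request) sends the request and returns the site's response as an http.client.HTTPResponse object, so we need to call its read() method to get the content, which comes back as bytes. We also need to decode those bytes. Most pages are encoded as "utf-8", but there are exceptions; in those cases the Python chardet library can be used to detect the encoding.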

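import urllib.request

target_url = "https://blog.csdn.net/nav/python"  # page to fetch (the URL used later in this post)
request = urllib.request.Request(target_url)
response = urllib.request.urlopen(request)  # http.client.HTTPResponse
content = response.read()  # raw bytes
content = content.decode("utf-8")  # bytes -> str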
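As a rough illustration of how the data and headers parameters of Request differ, the sketch below only builds two request objects and never sends them. The User-Agent value and the form field "q" are placeholders chosen for the example, not something the original crawler used.

```python
import urllib.parse
import urllib.request

url = "https://blog.csdn.net/nav/python"

# headers= carries request headers such as User-Agent (placeholder value).
get_request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})

# data= is the request body and must be bytes; supplying it makes the request a POST.
form = urllib.parse.urlencode({"q": "python"}).encode("utf-8")  # hypothetical form field
post_request = urllib.request.Request(url, data=form)

print(get_request.get_method())   # GET
print(post_request.get_method())  # POST
```

Nothing is sent until one of these objects is passed to urllib.request.urlopen().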

Final part

To store the crawled content in a txt file, I use the open function.
Definitions up front:

import re
import urllib.request

url = "https://blog.csdn.net/nav/python"
href = open("href", "w+")  # output file for the matched links
html = open("html", "r")  # previously saved page source; this file must already exist
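The post does not show how the "html" file gets its content, so the following is only one possible sketch of that step. The helper names decode_body and save_page are hypothetical. Instead of chardet, it tries the charset declared in the response headers and then falls back to utf-8 and gbk, which is a simpler substitute rather than real encoding detection.

```python
import urllib.request


def decode_body(raw, declared=None):
    """Decode bytes, trying the declared charset first, then common fallbacks."""
    for encoding in (declared, "utf-8", "gbk"):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding, try the next one
    # Last resort: decode as utf-8 and replace the bytes that do not fit.
    return raw.decode("utf-8", errors="replace")


def save_page(url, path="html"):
    """Fetch url and write its decoded source to path (assumed file name)."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()
        declared = response.headers.get_content_charset()
    text = decode_body(raw, declared)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return text


# Example call (needs network access):
# save_page("https://blog.csdn.net/nav/python")
```

Because this sketch writes the file as utf-8, whatever reads "html" afterwards would need to open it with the same encoding on systems whose default is different.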

After that, I looked for the right URLs and found that most of them look like this:

https://blog.csdn.net/qq_37338761/article/details/102824008

These are the URLs we want, so the regular expression can be written as 'https?://blog\.csdn\.net/\w+?/article/details/[0-9]+'

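def get_href(html):
    # Read the saved page source and collect every CSDN article URL in it
    content = html.read()
    # Dots are escaped so they match a literal "."; "/" needs no escaping in Python
    link = re.compile(r'https?://blog\.csdn\.net/\w+?/article/details/[0-9]+').findall(content)
    for i in link:
        # One URL per line in the href output file
        href.write(i)
        href.write("\n")
    return href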

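re.compile() defines the matching pattern of the regular expression, so that the URLs we want can be matched in the crawled HTML. findall() scans the entire HTML and returns every successful match as a list.

Final code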
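A quick way to see what compile() and findall() return is to run the pattern on a short string. The HTML snippet below is made up for illustration and stands in for the saved page source.

```python
import re

# Hypothetical snippet standing in for the saved page source.
sample_html = """
<a href="https://blog.csdn.net/qq_37338761/article/details/102824008">post</a>
<a href="https://blog.csdn.net/BlingBlingBlingd/article/details/102864071">post</a>
<a href="https://blog.csdn.net/nav/python">nav</a>
<a href="https://blog.csdn.net/qq_37338761/article/details/102824008">again</a>
"""

# compile() turns the pattern into a reusable pattern object.
pattern = re.compile(r"https?://blog\.csdn\.net/\w+?/article/details/[0-9]+")

# findall() returns every non-overlapping match, in order, as a list of strings.
links = pattern.findall(sample_html)

for link in links:
    print(link)
```

The navigation link is skipped because it has no /article/details/ part, while the repeated article link shows up twice, which is why a deduplication step is still needed afterwards.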

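import re
import urllib.request


url = "https://blog.csdn.net/nav/python"
href = open("href", "w+")  # output file for the matched links
html = open("html", "r")  # previously saved page source; this file must already exist


def get_href(html):
    """Write every CSDN article URL found in the saved page to href."""
    content = html.read()
    link = re.compile(r'https?://blog\.csdn\.net/\w+?/article/details/[0-9]+').findall(content)
    for i in link:
        href.write(i)
        href.write("\n")
    return href

get_href(html)
html.close()
href.close()  # flush the links to disk before clear_href() reopens the file


def clear_href():
    """Remove duplicate links from the href file, keeping first-seen order."""
    with open("href", "r") as href:
        content = href.read().split("\n")
    unique = []
    for i in content:
        if i and i not in unique:  # skip blank lines and repeats
            unique.append(i)
    # Opening in "w" mode truncates the file before the deduplicated links are written
    with open("href", "w") as href1:
        for i in unique:
            href1.write(i + "\n")
    return unique

clear_href()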

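That is all of the final code. It did not take much time, so I have written it down briefly so that I can recall it later, and I hope others can also get something out of it.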
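As a side note, the deduplication step can probably be written more compactly. The sketch below relies on dict keys keeping insertion order (Python 3.7+); the helper name unique_lines is made up for this example.

```python
def unique_lines(lines):
    """Drop blank lines and repeats, keeping first-seen order."""
    # dict.fromkeys keeps only the first occurrence of each key, in insertion order.
    return list(dict.fromkeys(line for line in lines if line))


links = [
    "https://blog.csdn.net/qq_37338761/article/details/102824008",
    "https://blog.csdn.net/BlingBlingBlingd/article/details/102864071",
    "",
    "https://blog.csdn.net/qq_37338761/article/details/102824008",
]
print(unique_lines(links))
```

Unlike the `if i not in list` check, which rescans the list for every line, this looks each line up in a dict, so it should scale better for long link files.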
