Python Web Scraping: Scraping an Article

Latest recommended article published on 2024-05-02 12:54:59

py.鸽鸽

Tags: python, web scraping, programming language, beautifulsoup

Article link: https://blog.csdn.net/2301_80339607/article/details/135662554

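I. Scraping an Article

1. Approach

Use requests to download the page, use BeautifulSoup to extract the content from it, and then save the result to a file.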

2. Libraries to install

requests 2.31.0

beautifulsoup4 4.12.2

fake-useragent 1.4.0

Install each one with pip install <library name>.
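Before running the scraper it can help to confirm that the three libraries are actually installed. The following is a minimal sketch using the standard library's importlib.metadata; the helper name installed_version is made up for this illustration and is not part of any of the libraries above.

```python
from importlib.metadata import PackageNotFoundError, version

# Package names as they appear on PyPI (the names passed to pip install)
REQUIRED = ["requests", "beautifulsoup4", "fake-useragent"]


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


for name in REQUIRED:
    print(name, installed_version(name) or "not installed")
```

If a line prints "not installed", run pip install for that package and try again.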

3. Implementation

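import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent


def link():
    """Download the page and return its HTML as text."""
    url = 'Enter the URL of the page to scrape here'
    header = {
        # Pick a random browser User-Agent string for this request
        'User-Agent': UserAgent().random
    }
    resqonse = requests.get(url, headers=header, timeout=10)
    # Stop here if the server answered with an error status (4xx or 5xx)
    resqonse.raise_for_status()
    # Uncomment if the page is GBK-encoded and the text comes out garbled
    # resqonse.encoding = 'gbk'
    r = resqonse.text
    return r


def extract(r):
    """Return the text inside <div id="content">."""
    soup = BeautifulSoup(r, "html.parser")
    node = soup.find("div", id="content")
    # find() returns None when nothing matches, so check before using .text
    if node is None:
        raise ValueError('no <div id="content"> found in the page')
    content = node.text
    return content


def save(content):
    """Write the text to a UTF-8 file; the ../save directory must already exist."""
    with open('../save/page.txt', 'w', encoding='utf-8') as f:
        f.write(content)


def main():
    r = link()
    content = extract(r)
    save(content)


if __name__ == '__main__':
    main()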

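4. Code walkthrough

1. Fetching the page with requests

The get function is called with three arguments: the page address (url), the request headers (headers), and a timeout (timeout).

url is the address to scrape and is the only required argument.

headers and timeout are optional, and for a one-off fetch you can leave them out.

In detail:

timeout is the maximum number of seconds to wait for the server to respond. Without it, a request to an unresponsive server can hang indefinitely. It does not add a delay between requests.

headers makes the request look like it comes from a user's browser. When the same user visits many times, the server's anti-scraping system is sometimes triggered, so the requests should appear to come from different users. The code below imports the fake_useragent library, which holds many browser User-Agent strings, and picks one at random.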

from fake_useragent import UserAgent
header = {
    'User-Agent': UserAgent().random
}

Finally, resqonse.text gives the page content. If you are curious, run print(resqonse.text) to see it; what gets printed is the page's front-end (HTML) code.
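To make the roles of headers and timeout concrete, here is a minimal sketch of a fetch helper that handles failures instead of crashing. It is an illustration under assumptions, not the code from this post: the names USER_AGENTS, random_headers and fetch are made up, and the short hard-coded User-Agent list stands in for the much larger pool that fake_useragent provides.

```python
import random

import requests

# Hypothetical, hard-coded pool of browser User-Agent strings (an assumption
# for this sketch; fake_useragent supplies a far larger collection).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def random_headers():
    """Build request headers with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}


def fetch(url, timeout=10):
    """Return the page HTML, or None if the request fails or times out."""
    try:
        resp = requests.get(url, headers=random_headers(), timeout=timeout)
        # Treat 4xx/5xx answers as failures too
        resp.raise_for_status()
    except requests.exceptions.RequestException as exc:
        # Timeout, ConnectionError, HTTPError and invalid URLs all land here
        print(f"request failed: {exc}")
        return None
    return resp.text
```

Calling fetch with the address of the page you want returns its HTML as a string, and a slow or unreachable server gives None after at most roughly timeout seconds of waiting per connection attempt or read.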

2. Extracting the content with BeautifulSoup

There are three steps.

Step 1:

soup=BeautifulSoup(r,"html.parser")

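BeautifulSoup is the parsing class from the beautifulsoup4 package installed earlier. Here it takes two arguments: the first is the HTML content just obtained with requests, and the second names the parser to use, in this case html.parser, which is built into Python.

Step 2: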

node=soup.find("div",id="content")

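This gets a node from the HTML. On the page I scraped, the text was inside a div that is identified by its id attribute, and the id is content.

The text you want is usually inside a div. Sometimes the div is identified by a class instead of an id, in which case change id='content' to class_='class name'.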
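The difference between selecting by id and by class can be tried out without any network access. The sketch below parses a small made-up HTML string; the class name article and the id no-such-id are invented for the example.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page
html = """
<html><body>
  <div id="content">Article body selected by id.</div>
  <div class="article">Article body selected by class.</div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Select by id, as in the code above
by_id = soup.find("div", id="content")

# Select by class; the trailing underscore is needed because
# class is a reserved word in Python
by_class = soup.find("div", class_="article")

# find() returns None when nothing matches
missing = soup.find("div", id="no-such-id")

print(by_id.text)
print(by_class.text)
print(missing)
```

Because find returns None when there is no match, it is worth checking the result before reading .text from it.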

Step 3:

content=node.text

This gets the text on the node. If you do not want to save it, you can print it directly with print(node.text).
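When the node contains several paragraphs, node.text joins their text with nothing in between. As an optional variation (not what the code in this post does), get_text can strip each piece and put a separator between them. A small sketch on made-up HTML:

```python
from bs4 import BeautifulSoup

# Made-up HTML with two paragraphs inside the content div
html = '<div id="content"><p> First paragraph. </p><p>Second paragraph.</p></div>'

soup = BeautifulSoup(html, "html.parser")
node = soup.find("div", id="content")

# .text concatenates every string under the node exactly as it appears
raw = node.text

# get_text can trim each string and join the pieces with a separator
clean = node.get_text(separator="\n", strip=True)

print(repr(raw))
print(repr(clean))
```

The second form tends to produce a more readable text file when each paragraph should end up on its own line.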

3. Saving

with open('../save/page.txt','w',encoding='utf-8') as f:
    f.write(content)

Saving to a file was covered in the previous post; take a look if you are interested.
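One practical detail: open with mode 'w' creates the file but not its parent directory, so the save step fails if the save folder does not exist yet. Below is a minimal sketch of a variant that creates the folder first, using pathlib. The extra path parameter is an assumption added for this example, and the temporary directory is used only so the demo does not touch real files.

```python
import tempfile
from pathlib import Path


def save(content, path):
    """Write content to path as UTF-8, creating parent folders if needed."""
    path = Path(path)
    # Create the parent directory (and any missing ancestors) if absent
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content, encoding="utf-8")
    return path


# Demo in a throwaway directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    target = save("scraped text\n", Path(tmp) / "save" / "page.txt")
    print(target.read_text(encoding="utf-8"))
```

In the scraper itself, the call would be save(content, '../save/page.txt').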