[Python] Getting Started with Web Scraping

Latest recommended article published on 2024-08-23 16:53:34

想七想八不如11408

Category: Python. Tags: python, web scraping

Article link: https://blog.csdn.net/m0_74183164/article/details/135270650

import requests

# Send a GET request to the practice site
response = requests.get("https://books.toscrape.com/")
if response.ok:  # True when the status code is below 400
    print(response.text)
else:
    print("Request failed")

The requests library builds and sends HTTP requests. It is a third-party package, so install it first:

pip install requests

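requests.get returns a Response object. Its status_code attribute holds the HTTP status code, and response.ok is True when that code is below 400. The meaning of each code can be looked up here:

HTTP response status codes - HTTP | MDN (mozilla.org)

For example:

418 (officially "I'm a teapot") is what some sites return to requests that do not look like they come from a browser. Since we are sending the request from a program, we need to make the program present itself as a browser.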
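As a quick way to look up what a status code means without leaving Python, the sketch below uses the standard library's http.HTTPStatus table. The helper names describe_status and is_success are made up for this illustration; is_success only mirrors the "below 400" idea behind response.ok and is not part of requests.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return the code followed by its standard reason phrase."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # The code is not in the standard library's table
        # (which codes are listed depends on the Python version)
        return f"{code} (unknown status code)"
    return f"{code} {status.phrase}"


def is_success(code: int) -> bool:
    """Hypothetical helper: treat anything below 400 as not an error."""
    return code < 400


for code in (200, 404, 418):
    print(describe_status(code), "| ok:", is_success(code))
```

In a real script you would pass response.status_code to these helpers after calling requests.get.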

Specifying the User-Agent

import requests

# Pretend to be a browser by sending a browser User-Agent string
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0"
}
response = requests.get("https://movie.douban.com/top250", headers=headers)
if response.ok:
    print(response.text)
else:
    print("Request failed")

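The User-Agent request header tells the server about the client making the request, so we need to set it to a browser's value. Which value? You can copy the one your own browser sends:

1. Right-click anywhere on the page and choose Inspect.

2. Open the Network tab.

3. Refresh the page and click any request in the list.

4. Scroll to the User-Agent: request header and copy the value to its right.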
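To see why a site can tell a script from a browser, the sketch below prints the User-Agent that requests sends by default and then the one sent after overriding it. It uses a requests.Session and prepares the request without sending it, so nothing goes over the network. The variable names are illustrative, and the User-Agent string is simply the one used in this article; the one copied from your own browser will differ.

```python
import requests

# Assumption: any browser User-Agent string copied from the developer tools works here
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0"
)

session = requests.Session()

# By default requests identifies itself as python-requests/<version>
default_ua = session.headers["User-Agent"]
print("Default:", default_ua)

# Headers set on the session are merged into every request it makes
session.headers.update({"User-Agent": BROWSER_UA})

# Build the request without sending it, to inspect the outgoing headers
request = requests.Request("GET", "https://movie.douban.com/top250")
prepared = session.prepare_request(request)
print("Sent as:", prepared.headers["User-Agent"])
```

Setting the header once on a session is convenient when a script makes many requests, as the paginated example at the end of this article does.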

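What is the structure of an HTML page?

A web page is built on three core technologies: HTML, CSS, and JavaScript. A scraper cares most about the data on the page, so it mainly deals with HTML.

Common HTML tag types

Tables:

Practicing common HTML tags

Create a new file in VS Code (saving it with an .html extension lets you open it in a browser).

Shortcut for indenting the selected lines: CTRL+]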

<!DOCTYPE html>
<html>
    <head>
        <title>Title</title>
    </head>

    <body>
        <div style="background-color:cadetblue">
            <h1>Level 1 heading</h1>
            <h2>Level 2 heading</h2>
            <h6>Level 6 heading</h6>
            <p>Text paragraph
                Text paragraph
            </p>
            <p>Text paragraph</p>
            <p><u>Text paragraph</u></p>
        </div>

        <p><b><span style="background-color:blue">Text paragraph</span></b><i><span style="background-color:rgb(22, 92, 213)">Text paragraph</span></i></p>


        <img src="https://p.sda1.dev/14/013c83f4597979dd7c0fe9e446901462/0TY9%60FWR2R_EK_92VCK__RL.png" width="500px">

        <!-- Hyperlink -->
        <a href="https://blog.csdn.net/m0_74183164?spm=1011.2266.3001.5343" target="_blank">My blog</a>
        
        <ol>
            <li>num1</li>
            <li>num2</li>
        </ol>

        <ul>
            <li>1</li>
            <li>2</li>
        </ul>

        <table border="1" class="data-table">
            <thead>
                <tr>
                    <td>Parent</td>
                    <td>Y</td>
                    <td>y</td>
                </tr>
            </thead>
            <tbody>
                <tr>
                    <td>X</td>
                    <td>XY</td>
                    <td>Xy</td>
                </tr>
                <tr>
                    <td>x</td>
                    <td>xY</td>
                    <td>xy</td>
                </tr>
            </tbody>
        </table>


    </body>
</html>

Result: [screenshot of the rendered page]

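Beautiful Soup

Picking information out of HTML by hand is too slow. Python has a library for parsing HTML: Beautiful Soup.

Install (the package is published on PyPI as beautifulsoup4):

pip install beautifulsoup4

Import:

from bs4 import BeautifulSoup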

BeautifulSoup parses HTML that looks complicated into a tree structure, with each tag as a node that contains its child tags and text.
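A minimal sketch of walking that tree, run on a small made-up HTML string rather than a live page. The HTML content and variable names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# A small, made-up page used only for this illustration
html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <h1>Heading</h1>
    <ul>
      <li>first</li>
      <li>second</li>
    </ul>
    <a href="https://example.com" target="_blank">A link</a>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

page_title = soup.title.string                      # text inside <title>
items = [li.string for li in soup.find_all("li")]   # every <li> in the document
link_href = soup.find("a")["href"]                  # attributes are read like dict keys
parent_name = soup.find("li").parent.name           # walk up from a node to its parent

print(page_title, items, link_href, parent_name)
```

find returns the first matching tag, find_all returns all of them, and each result is itself a node you can search again.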

All products | Books to Scrape - Sandbox

Extract every price and book title on the page:

from bs4 import BeautifulSoup
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0"
}
content = requests.get("https://books.toscrape.com/", headers=headers).text
soup = BeautifulSoup(content, "html.parser")

# find_all returns a list of every matching tag
all_prices = soup.find_all("p", attrs={"class": "price_color"})
for price in all_prices:
    # .string is the text enclosed by the tag; drop the leading currency
    # symbol (plus the stray "Â" that can appear if the page is mis-decoded)
    print(price.string.lstrip("Â£"))

all_titles = soup.find_all("h3")
for title in all_titles:
    link = title.find("a")  # find returns only the first match
    print(link.string)

# Equivalent, calling find_all on each <h3> instead of find:
# for title in all_titles:
#     for name in title.find_all("a"):
#         print(name.string)
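The loop above prints prices and titles separately. The sketch below pairs them per book and converts the price to a number. It runs on a made-up snippet that imitates one product card, so the markup (an article tag with class product_pod, and a title attribute on the link) is an assumption about the page; check it against the real HTML in your browser's developer tools. parse_books is a hypothetical helper name.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical snippet imitating one product card on the page
sample_html = """
<article class="product_pod">
  <h3><a href="catalogue/a-light-in-the-attic_1000/index.html"
         title="A Light in the Attic">A Light in the ...</a></h3>
  <p class="price_color">£51.77</p>
</article>
"""


def parse_books(html):
    """Return a list of (title, price) tuples, one per product card."""
    soup = BeautifulSoup(html, "html.parser")
    books = []
    for card in soup.find_all("article", class_="product_pod"):
        link = card.find("h3").find("a")
        # Prefer the title attribute; fall back to the visible link text
        title = link.get("title") or link.get_text(strip=True)
        price_text = card.find("p", class_="price_color").get_text()
        # Pull out the number regardless of which symbols precede it
        match = re.search(r"\d+(?:\.\d+)?", price_text)
        price = float(match.group()) if match else None
        books.append((title, price))
    return books


books = parse_books(sample_html)
print(books)
```

Searching inside each card keeps a title and its price together, and the regular expression avoids depending on how many characters the currency symbol occupies.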

Hands-on practice

Scraping the Douban Movie Top 250

Douban Movie Top 250 (douban.com)

from bs4 import BeautifulSoup
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0"
}
# The list is paginated: 25 movies per page, selected by the start parameter
for start_num in range(0, 250, 25):
    response = requests.get(f"https://movie.douban.com/top250?start={start_num}", headers=headers)
    html = response.text

    soup = BeautifulSoup(html, "html.parser")
    # find_all returns a list of every matching tag
    all_titles = soup.find_all("span", attrs={"class": "title"})
    for title in all_titles:
        s = title.get_text()  # unlike .string, get_text() never returns None
        # Skip the spans that contain "/"; keep only the primary title
        if "/" not in s:
            print(s)
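The same logic can be split into two small functions, one that builds the page URLs and one that filters titles, which makes each part easy to check without hitting the site. This is a sketch: page_urls and main_titles are hypothetical names, and the sample HTML only imitates the title spans on the page.

```python
from bs4 import BeautifulSoup

PAGE_SIZE = 25   # movies per page
TOTAL = 250      # size of the whole list


def page_urls():
    """Build the URL of every page: start=0, 25, ..., 225."""
    return [
        f"https://movie.douban.com/top250?start={start}"
        for start in range(0, TOTAL, PAGE_SIZE)
    ]


def main_titles(html):
    """Return the text of title spans that do not contain a slash."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for span in soup.find_all("span", class_="title"):
        text = span.get_text()
        if "/" not in text:
            titles.append(text)
    return titles


# Made-up snippet imitating the two title spans of one movie
sample_html = """
<div class="hd">
  <span class="title">肖申克的救赎</span>
  <span class="title">&nbsp;/&nbsp;The Shawshank Redemption</span>
</div>
"""

urls = page_urls()
titles = main_titles(sample_html)
print(len(urls), urls[-1])
print(titles)
```

In the full scraper you would loop over page_urls(), fetch each page with requests.get and the browser headers, and pass response.text to main_titles.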