Web crawlers: not slacking off today either. Study notes

a2488220557

Last modified 2024-02-01 22:43:27

Tags: study notes, crawler, python, programming languages

First published 2024-02-01 22:40:48

Article link: https://blog.csdn.net/a2488220557/article/details/135981890

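Still to learn: regular expressions (they make extraction simpler), with the re module

And multithreading (doing several things at the same time), with the threading module

As a beginner, my current understanding of a crawler is:

send a request first, then receive the response, then extract what you need.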

1. Crawler 1

2. Crawler 2

3. Crawler 3

4. Crawler 4

5. Crawler 5

6. Crawler 6

1. Crawler 1: code first

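import requests  # import the library

# Add a "header" so the request looks like it comes from a browser
head = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

# The address to request, with the header attached
response = requests.get("http://books.toscrape.com/", headers=head)

if response.ok:  # on success print the page, otherwise report the failure
    print(response.text)
else:
    print("Request failed")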

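head = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}  # add a "header" to imitate a browser

response = requests.get("http://books.toscrape.com/", headers=head)

# The address to request, with the header attached: this is how the code sends a request to the website's server.

Some websites do not want to be accessed by code or crawled; they only want visits from a normal browser. That is why we add the header.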
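To see what the header dictionary actually does without touching the network, here is a minimal sketch. It swaps in the standard library's urllib.request.Request in place of requests, only so that it runs offline; the helper name build_request is my own and not part of any library.

```python
from urllib.request import Request

# Same header dictionary as in the crawler above.
HEAD = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}


def build_request(url, headers=None):
    """Build (but do not send) a GET request carrying the given headers."""
    return Request(url, headers=headers or {})


plain = build_request("http://books.toscrape.com/")
disguised = build_request("http://books.toscrape.com/", HEAD)

# urllib stores header names capitalized as "User-agent".
print(plain.get_header("User-agent"))      # None: no header was set explicitly
print(disguised.get_header("User-agent"))  # the browser-like string from HEAD
```

The request without the dictionary carries no explicit User-Agent, so the library's default identifies it as a script; the second one presents the browser-like string.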

2. Crawler 2: accessing Douban

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36 Edg/121.0.0.0"
}  # the "header"
response = requests.get("https://movie.douban.com/top250", headers=headers)
if response.ok:
    print(response.text)
else:
    print("Request failed")

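How to get this header:

Open any website and bring up the browser's developer tools (the Network panel).

Click any request in the list, find "User-Agent:", and copy the value after it into the header. Mind the format: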

"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36 Edg/121.0.0.0"

If the request succeeds, it returns that site's page content (its HTML).

3. Crawler 3: building a web page

<!DOCTYPE html>
<html lang="en">
    <head>
        <meta charset="UTF-8">
        <title>This is a website</title>
    </head>
    <body>
        <div style="background-color:red;">
            <h1>Level-1 heading</h1>
            <h2>Level-2 heading</h2>
        </div>
        <h1>Level-1 heading</h1>
        <h2>Level-2 heading</h2>

        <h1>Level-1 <span style="background-color:red;">12</span> heading</h1>
        <h2>Level-2 heading</h2>

        <p>This is another <br><b>piece of text</b></p>
        <img src="C:\\Users\\24882\\Desktop\\_20240131220126.jpg" width="500">
        <a href="https://www.baidu.com" target="_blank">Baidu</a>

        <ol>
            <li>I am the first item</li>
            <li>I am the second item</li>
        </ol>

        <ul>
            <li>I am the first item</li>
            <li>I am the second item</li>
        </ul>

        <table border="1">
            <thead> <!-- header section -->
                <tr> <!-- row -->
                    <td>Header 1</td>
                    <td>Header 2</td>
                    <td>Header 3</td>
                </tr>
                <tr> <!-- row -->
                    <td>Header 1</td>
                    <td>Header 2</td>
                    <td>Header 3</td>
                </tr>
            </thead>

            <tbody> <!-- body section -->
                <tr> <!-- row -->
                    <td>111</td>
                    <td>222</td>
                    <td>333</td>
                </tr>
                <tr> <!-- row -->
                    <td>111</td>
                    <td>222</td>
                    <td>333</td>
                </tr>
                <tr> <!-- row -->
                    <td>111</td>
                    <td>222</td>
                    <td>333</td>
                </tr>
            </tbody>
        </table>
    </body>
</html>

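Every web page is built on this skeleton:

<!DOCTYPE html>
<html lang="en">
<head> </head> <!-- page metadata, such as the title -->
<body> </body> <!-- the main content -->
</html>

In the code above,

reading from top to bottom:

1. <meta charset="UTF-8"> matters a lot: it declares the page's character encoding, UTF-8, which should look familiar.

2. <h1> is a level-1 heading and <h2> is a level-2 heading:
  <h1>Level-1 heading</h1>
  <h2>Level-2 heading</h2>
3. <div></div> wraps one section of the page; here that whole section gets a red background: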

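<div style="background-color:red;">
    <h1>Level-1 heading</h1>
    <h2>Level-2 heading</h2>
</div>

4. <span></span> styles just the part of the text it wraps; here only that part gets a red background:

<h1>Level-1 <span style="background-color:red;">12</span> heading</h1>
<h2>Level-2 heading</h2>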

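5. <p></p> is a paragraph of body text, and <br> is a line break. <br> is a void element, so it is written on its own, without a closing </br>. (You will notice that a plain newline typed inside <p> has no effect; only <br> breaks the line.)

<p>This is another <br><b>piece of text</b></p>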

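6. <img> inserts an image, but the image first has to be put on an image host. I don't know how to do that yet.

<img src="C:\\Users\\24882\\Desktop\\_20240131220126.jpg" width="500">

7. <a> adds a link. The attribute target="_blank" opens the link in a new tab.

<a href="https://www.baidu.com" target="_blank">Baidu</a>

8. <ol></ol> is an ordered list (items are numbered); <ul></ul> is an unordered list (items get bullets).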

<ol>
    <li>I am the first item</li>
    <li>I am the second item</li>
</ol>

<ul>
    <li>I am the first item</li>
    <li>I am the second item</li>
</ul>

9. <table></table> is a table

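<table border="1">
    <thead> <!-- header section -->
        <tr> <!-- row -->
            <td></td>
        </tr>
    </thead>
    <tbody> <!-- body section -->
        <tr> <!-- row -->
        </tr>
    </tbody>
</table>

This is the general shape; border sets the width of the table border.

<thead></thead>: the header section

<tbody></tbody>: the body section

<tr></tr>: one per row, so there are as many as there are rows

<td></td>: the content of one cell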

<table border="1">
    <thead> <!-- header section -->
        <tr> <!-- row -->
            <td>Header 1</td>
            <td>Header 2</td>
            <td>Header 3</td>
        </tr>
        <tr> <!-- row -->
            <td>Header 1</td>
            <td>Header 2</td>
            <td>Header 3</td>
        </tr>
    </thead>

    <tbody> <!-- body section -->
        <tr> <!-- row -->
            <td>111</td>
            <td>222</td>
            <td>333</td>
        </tr>
        <tr> <!-- row -->
            <td>111</td>
            <td>222</td>
            <td>333</td>
        </tr>
        <tr> <!-- row -->
            <td>111</td>
            <td>222</td>
            <td>333</td>
        </tr>
    </tbody>
</table>
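Looking ahead to crawling, a table like this can be turned back into rows of data. Below is a minimal sketch that uses the standard library's html.parser module (not the BeautifulSoup library used later in these notes). The class name TableRows and the sample values are made up for illustration.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []         # finished rows, each a list of cell strings
        self._row = None       # cells of the row currently being read
        self._in_cell = False  # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Text between tags only counts when it sits inside a cell.
        if self._in_cell:
            self._row[-1] += data.strip()


SAMPLE = """
<table border="1">
  <thead><tr><td>Header 1</td><td>Header 2</td><td>Header 3</td></tr></thead>
  <tbody>
    <tr><td>111</td><td>222</td><td>333</td></tr>
    <tr><td>444</td><td>555</td><td>666</td></tr>
  </tbody>
</table>
"""

parser = TableRows()
parser.feed(SAMPLE)
for row in parser.rows:
    print(row)
```

Each <tr> becomes one list and each <td> one string in it, which mirrors the "one <tr> per row, one <td> per cell" rule.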

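Phew, Crawler 3 is finished.

The image still needs an image host, so it does not display.

4. Crawler 4: getting the prices from the practice site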

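from bs4 import BeautifulSoup
import requests

content = requests.get("http://books.toscrape.com/").text
soup = BeautifulSoup(content, "html.parser")
# find_all is the current name of the method; findAll is an older alias
all_prices = soup.find_all("p", attrs={"class": "price_color"})
for price in all_prices:
    print(price.string)  # .string returns the text enclosed by the tag

First open the site.

Our goal is to extract the prices, so look at how the prices are marked up and what sets them apart: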

<p class="price_color">£51.77</p>

<p class="price_color">£53.74</p>

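The <p> element represents a paragraph of text.

You can see that every price sits in a <p> tag whose class is "price_color".

1. soup = BeautifulSoup(content, "html.parser"): hand the page's HTML to BeautifulSoup; "html.parser" names the parser to use.
2. all_prices = soup.find_all("p", attrs={"class": "price_color"}): find every <p> whose class is "price_color" (findAll is an older alias of find_all).

3. print(price.string): .string returns the text enclosed by the tag.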

Printing price on its own shows the whole <p> tag.

With .string added, only the price text is printed.

4. You can also add a slice

print(price.string[2:])  # the enclosed text with its first two characters sliced off

Done!
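A fixed slice only works if the prefix before the number always has the same length. If the pound sign is decoded as a single character, [2:] would also cut off the first digit; if the page is read with the wrong encoding, the sign can show up as two characters. As a hedged alternative, here is a small sketch that pulls the number out with the re module. The function name parse_price is hypothetical, and the sample strings are illustrative.

```python
import re


def parse_price(text):
    """Return the first decimal number found in text as a float."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return float(match.group())


# Works whether the currency prefix is one character or two.
print(parse_price("£51.77"))
print(parse_price("Â£53.74"))
```

Converting to float also makes the prices ready for sorting or summing, which a sliced string is not.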

5. Crawler 5: extracting the book titles from that book site

from bs4 import BeautifulSoup
import requests

content = requests.get("http://books.toscrape.com/").text
soup = BeautifulSoup(content, "html.parser")
all_tittle = soup.find_all("h3")  # find_all is the current name of findAll
for tittle in all_tittle:
    all_links = tittle.find_all("a")
    for link in all_links:
        print(link.string)

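Inspecting a book title on that site

shows that it is inside an <a> tag,

and that the <a> in turn sits inside an <h3>.

So:

all_tittle = soup.find_all("h3")          # every <h3>, one per book title (findAll is an older alias)
for tittle in all_tittle:
    all_links = tittle.find_all("a")      # every <a> inside that <h3>
    for link in all_links:
        print(link.string)                # the text of each <a>

Done!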
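One thing to watch: the visible link text can be shortened with "...", while the full name sits in the link's title attribute. That is my assumption about how the practice site marks up its titles, and the sample HTML below is made up to show the idea. The sketch uses the standard library's html.parser so it runs without network access; the class name BookTitles is hypothetical.

```python
from html.parser import HTMLParser


class BookTitles(HTMLParser):
    """Collect link text and title attribute of every <a> nested in an <h3>."""

    def __init__(self):
        super().__init__()
        self.titles = []      # (visible text, title attribute) pairs
        self._in_h3 = False
        self._current = None  # [text, title attribute] of the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self._in_h3 = True
        elif tag == "a" and self._in_h3:
            self._current = ["", dict(attrs).get("title")]

    def handle_data(self, data):
        if self._current is not None:
            self._current[0] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.titles.append(tuple(self._current))
            self._current = None
        elif tag == "h3":
            self._in_h3 = False


# Illustrative markup, not copied from the real page.
SAMPLE = (
    '<h3><a href="a.html" title="A Light in the Attic">A Light in the ...</a></h3>'
    '<p><a href="b.html">not a book title</a></p>'
    '<h3><a href="c.html" title="Tipping the Velvet">Tipping the Velvet</a></h3>'
)

parser = BookTitles()
parser.feed(SAMPLE)
for text, full in parser.titles:
    print(text, "|", full)
```

The link inside <p> is skipped because it is not nested in an <h3>, which is the same "h3 first, then a" order used above.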

6. Crawler 6: extracting the Douban Top 250 list

import requests
from bs4 import BeautifulSoup

num = 1  # running rank number
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36 Edg/121.0.0.0"
}
for start_num in range(0, 250, 25):  # 10 pages, 25 films per page
    response = requests.get(f"https://movie.douban.com/top250?start={start_num}", headers=headers)
    if response.ok:
        html = response.text
        soup = BeautifulSoup(html, "html.parser")  # "html.parser" is the HTML parser to use
        all_titles = soup.find_all("span", attrs={"class": "title"})  # find_all is the current name of findAll
        for title in all_titles:
            title_string = title.string
            if "/" not in title_string:  # keep only the titles without "/"
                print(f"{num}.{title_string}")
                num += 1
    else:
        print("Request failed")
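The two ideas in this crawler, stepping through the pages with start and filtering out titles that contain "/", can be tried without any network access. A minimal sketch follows; the function names page_urls and main_titles and the sample strings are my own, chosen for illustration.

```python
BASE_URL = "https://movie.douban.com/top250"


def page_urls(total=250, page_size=25):
    """Return one URL per page; start is the index of the first film on it."""
    return [f"{BASE_URL}?start={start}" for start in range(0, total, page_size)]


def main_titles(title_strings):
    """Keep only the strings without a slash; skip None values."""
    return [t for t in title_strings if t is not None and "/" not in t]


urls = page_urls()
print(len(urls), urls[0], urls[-1])

# .string can be None when a tag has more than one child, hence the None check.
sample = ["肖申克的救赎", "\xa0/\xa0The Shawshank Redemption", None]
print(main_titles(sample))
```

range(0, 250, 25) yields 0, 25, ..., 225, so ten requests cover all 250 entries. The None check is a small safety net: testing "/" in None would raise a TypeError.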

First open the website