My First Web Crawler

不才一首歌

Category: Python study notes. Tag: web crawler

Article link: https://blog.csdn.net/Albert_ycl/article/details/80301371

1. Basic crawler operations

Fetch the content at a given URL:

- Send an HTTP request: http://www.XXX.com.cn/X..

- Extract the content with regular expressions
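The second step above can be sketched as follows. This is a minimal illustration, not the author's code: `SAMPLE_HTML`, `LINK_PATTERN`, and `extract_links` are hypothetical names, and the sample page stands in for a body that would normally come from an HTTP response. Regular expressions work for simple, regular markup like this, but they break easily on real pages, which is one reason to prefer an HTML parser.

```python
import re

# Hypothetical sample page; in practice the HTML would come from an HTTP
# response body, e.g. requests.get(url).text.
SAMPLE_HTML = """
<html><body>
  <a href="/news/1.html">First story</a>
  <a class="more" href="https://example.com/news/2.html">Second story</a>
</body></html>
"""

# Match <a ... href="..."> ... </a> and capture the href value and the link text.
LINK_PATTERN = re.compile(
    r'<a\b[^>]*?\bhref="([^"]*)"[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)


def extract_links(html):
    """Return a list of (href, text) pairs for every <a> tag in html."""
    return [(href, text.strip()) for href, text in LINK_PATTERN.findall(html)]


links = extract_links(SAMPLE_HTML)
for href, text in links:
    print(href, text)
```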

Public-opinion monitoring system: take each keyword in a list and substitute it for %s in a search URL:

keys = ['XXX','XX','......',.......]

https://www.sogou.com/web?query=%s

https://search.sina.com.cn/?q=%s&c=new&from=channel&ie=utf-8
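One way to combine the keyword list with the two URL templates is sketched below. The function name `build_search_urls` and the example keywords are assumptions made for illustration; the templates are the ones quoted above. Keywords are percent-encoded first so that spaces and non-ASCII characters are safe inside a URL.

```python
from urllib.parse import quote

# Search URL templates taken from the text above; %s is the keyword slot.
SEARCH_TEMPLATES = [
    'https://www.sogou.com/web?query=%s',
    'https://search.sina.com.cn/?q=%s&c=new&from=channel&ie=utf-8',
]


def build_search_urls(keys):
    """Return one URL per (keyword, template) pair, with keywords percent-encoded."""
    urls = []
    for key in keys:
        encoded = quote(key)  # UTF-8 percent-encoding, e.g. spaces become %20
        for template in SEARCH_TEMPLATES:
            urls.append(template % encoded)
    return urls


# Hypothetical keywords, for illustration only.
for url in build_search_urls(['python', 'web crawler']):
    print(url)
```

Each resulting URL could then be passed to `requests.get` and the response parsed like any other page.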

Implementing a crawler in Python:

import requests
from bs4 import BeautifulSoup

response = requests.get('http://www.XXX.com')
response.text  # the page body as text

obj = BeautifulSoup(response.text, ......)

tag = obj.find('a')  # the first tag that matches
tag.find(...)        # a tag can be searched in turn

tags = obj.find_all('a')  # a list of all tags that match
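The difference between `find` and `find_all` is easier to see on a small document. The HTML below is a made-up fragment used only for this sketch, so no network access is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration.
html = """
<div id="menu">
  <a href="/a">A</a>
  <a href="/b">B</a>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

first = soup.find('a')                   # a single Tag: the first <a> in the document
all_links = soup.find_all('a')           # a list-like result holding every <a>
nested = soup.find(id='menu').find('a')  # searches can be chained on a Tag

print(first.get('href'))
print([a.get('href') for a in all_links])
print(nested.get('href'))
```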

The code below fetches a single item, the first h3 heading in the news list:

import requests
from bs4 import BeautifulSoup
# requests fetches the page; BeautifulSoup parses the HTML

response = requests.get('https://www.autohome.com.cn/news/')
response.encoding = 'gbk'  # decode the body as GBK; set this before reading .text
# print(response.content)  # .content is the raw bytes
# print(response.text)     # .text is the decoded text
# 'html.parser' is Python's built-in HTML parser; it turns the HTML into an object tree, and soup is the root object
soup = BeautifulSoup(response.text, 'html.parser')
tag = soup.find(id='auto-channel-lazyload-article')
h3 = tag.find(name='h3')
print(h3)
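`find` returns `None` when nothing matches, so a chained lookup like the one above raises `AttributeError` if the page layout changes. A defensive variant might look like the sketch below; `first_headline` is a hypothetical helper, and the sample string is a stand-in for the downloaded page rather than real site markup.

```python
from bs4 import BeautifulSoup


def first_headline(html, container_id='auto-channel-lazyload-article'):
    """Return the text of the first <h3> inside the container, or None."""
    soup = BeautifulSoup(html, 'html.parser')
    container = soup.find(id=container_id)
    if container is None:  # container missing, e.g. the layout changed
        return None
    h3 = container.find(name='h3')
    return h3.get_text(strip=True) if h3 is not None else None


# Hypothetical stand-in for the downloaded page.
sample = '<div id="auto-channel-lazyload-article"><ul><li><h3> Headline </h3></li></ul></div>'
print(first_headline(sample))
print(first_headline('<p>nothing here</p>'))
```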

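Crawling multiple items

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Find every news item
# Title, summary, URL, image
page_url = 'https://www.autohome.com.cn/news/'
response = requests.get(page_url)
response.encoding = 'gbk'
soup = BeautifulSoup(response.text, 'html.parser')

li_list = soup.find(id='auto-channel-lazyload-article').find_all(name='li')

for li in li_list:
    title = li.find('h3')
    if not title:
        continue  # skip list items that are not news entries
    summary = li.find('p').text
    # li.find('a').attrs returns all of the tag's attributes as a dict, e.g. li.find('a').attrs['href']
    # A single attribute can also be read with li.find('a').get('href')
    url = li.find('a').get('href')
    # Resolve relative or protocol-relative src values against the page URL
    img = urljoin(page_url, li.find('img').get('src'))
    print(title.text, url, summary, img)

    # Download the image; use the title text (not the tag object) in the file name
    res = requests.get(img)
    file_name = '%s.jpg' % title.text
    with open(file_name, 'wb') as f:
        f.write(res.content)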
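Headline text can contain characters such as `/` or `:` that are not valid in file names, so writing the image may fail for some titles. One possible fix is to clean the title first. `safe_filename` below is a hypothetical helper, and the replacement rules are an assumption about what common file systems reject, not part of the original code.

```python
import re


def safe_filename(title, extension='.jpg', max_length=100):
    """Turn a headline into a file name that is valid on common file systems."""
    # Replace characters that Windows or POSIX file systems reject, plus control characters.
    cleaned = re.sub(r'[\\/:*?"<>|\x00-\x1f]', '_', title)
    # Collapse runs of whitespace and trim the ends.
    cleaned = re.sub(r'\s+', ' ', cleaned).strip()
    # Fall back to a placeholder if nothing usable is left.
    cleaned = cleaned[:max_length] or 'untitled'
    return cleaned + extension


print(safe_filename('Model X/Y: what is new?'))
```

In the loop, the file name would then be built with `safe_filename(title.text)`.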

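Summary of the workflow above:

First module: requests

obj = requests.get("url") fetches the page at url and stores the response in obj

obj.content is the response body as raw bytes

obj.encoding = 'gbk' sets the encoding used to decode the bytes into a string; requests guesses it from the response headers, and you can read it or set it explicitly

obj.text is the response body as decoded text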
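The relationship between the raw bytes and the decoded text can be shown without any network access. In this sketch, `raw` is a hypothetical response body built by hand; it plays the role of `obj.content`, and decoding it plays the role of `obj.text`.

```python
# Hypothetical response body: the bytes a GBK-encoded page would send for this text.
raw = '汽车之家'.encode('gbk')

print(raw)                # the bytes, comparable to obj.content
print(raw.decode('gbk'))  # decoded with the right codec, comparable to obj.text

# Decoding GBK bytes as UTF-8 does not give the original text back, which is
# why the encoding has to match the page.
wrong = raw.decode('utf-8', errors='replace')
print(wrong == '汽车之家')
```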

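Once the content has been fetched:

soup = BeautifulSoup(obj.text, 'html.parser')

soup provides the following two methods (they take the same parameters but return different things):

tag = soup.find(name='XX', ...) returns a single tag object

tags = soup.find_all(attrs={'XX': 'xx'}, ...) returns a list of tag objects

For each tag you can read its text with tag.text and its attributes with tag.attrs (a dict); to read a single attribute, use tag.get(key).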
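The three accessors can be compared on a single tag. The HTML fragment is made up for this sketch.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment for illustration.
soup = BeautifulSoup('<a href="/news/1.html" class="link big">Read more</a>', 'html.parser')
tag = soup.find('a')

print(tag.text)           # the text inside the tag
print(tag.attrs['href'])  # dictionary access; raises KeyError if the attribute is missing
print(tag.get('href'))    # same value, read through get()
print(tag.get('title'))   # None: get() does not raise for a missing attribute
print(tag.get('class'))   # multi-valued attributes such as class come back as a list
```

Using `get` is the safer choice inside a crawl loop, since a missing attribute yields `None` instead of stopping the run.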
