My First Web Crawler

Author: 不才一首歌

Category: Python study notes. Tags: web crawler

Article link: https://blog.csdn.net/albert_ycl/article/details/80301371

1. Basic crawler operations

Fetch the content at a given URL:

- Send an HTTP request: http://www.XXX.com.cn/X..

- Extract the content with regular expressions
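
The regex step can be sketched roughly as follows. This is a minimal illustration, not a robust HTML parser: the sample HTML string and the helper name `extract_links` are assumptions made for this example. In a real crawler the string would be the body of the HTTP response.

```python
import re

# Hypothetical sample page; in practice this would be the HTTP response body.
SAMPLE_HTML = """
<html><body>
<a href="/news/1.html">First story</a>
<a href="/news/2.html">Second story</a>
</body></html>
"""

# Capture the href value and the link text of each <a> tag.
LINK_PATTERN = re.compile(r'<a\s+href="([^"]+)"[^>]*>(.*?)</a>', re.S)


def extract_links(html):
    """Return a list of (href, text) pairs found in html."""
    return LINK_PATTERN.findall(html)


links = extract_links(SAMPLE_HTML)
for href, text in links:
    print(href, text)
```

Regular expressions work for simple, predictable markup like this; for real pages an HTML parser is usually less fragile.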

Public-opinion monitoring system (search each keyword on several sites):

keys = ['XXX','XX','......',.......]

https://www.sogou.com/web?query=%s

https://search.sina.com.cn/?q=%s&c=new&from=channel&ie=utf-8
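
One way to combine the keyword list with the `%s` templates is sketched below. The keywords are placeholders chosen for this example (the original list is elided), and `build_search_urls` is a hypothetical helper name. The keyword is percent-encoded before substitution so that non-ASCII text is safe in a URL.

```python
from urllib.parse import quote

# Hypothetical keywords to monitor; the original list is elided.
keys = ['python', '爬虫']

# Search URL templates from the notes; %s is replaced by the keyword.
templates = [
    'https://www.sogou.com/web?query=%s',
    'https://search.sina.com.cn/?q=%s&c=new&from=channel&ie=utf-8',
]


def build_search_urls(keywords, url_templates):
    """Return one URL per (keyword, template) pair, percent-encoding the keyword."""
    return [t % quote(k) for k in keywords for t in url_templates]


urls = build_search_urls(keys, templates)
for u in urls:
    print(u)
```

Each resulting URL could then be requested and parsed in the same way as any other page.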

Implementing a crawler in Python (outline):

import requests
from bs4 import BeautifulSoup

response = requests.get('http://www.XXX.com')
response.text  # the response body as text
obj = BeautifulSoup(response.text, ...)
tag = obj.find('a')  # the first tag that matches
tag.find(...)  # search again inside that tag
tags = obj.find_all('a')  # a list of all tags that match
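
The difference between `find` and `find_all` can be seen on a small inline document. The HTML string below is made up for this example, and `'html.parser'` is assumed as the parser, as in the later code in this post.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration.
html = '<div id="box"><a href="/a">A</a><p><a href="/b">B</a></p></div>'

soup = BeautifulSoup(html, 'html.parser')

first = soup.find('a')          # first matching tag, or None if nothing matches
all_links = soup.find_all('a')  # list-like collection of every matching tag

print(first.get('href'))
print([a.text for a in all_links])

# find() can be chained to search inside a tag that was already found.
inner = soup.find('p').find('a')
print(inner.text)
```

Because `find` returns `None` when nothing matches, chained calls on a real page are worth guarding with a check.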

The code below fetches a single item, the first h3 heading in the news list:

import requests
from bs4 import BeautifulSoup
# The two imports above are used to fetch the page and parse its HTML

response = requests.get('https://www.autohome.com.cn/news/')
response.encoding = 'gbk'  # decode the body as GBK; set this before reading response.text
# print(response.content)  # content is the raw bytes
# print(response.text)     # text is the decoded string
# 'html.parser' is Python's built-in parser; it turns the HTML into an object (soup)
soup = BeautifulSoup(response.text, 'html.parser')
tag = soup.find(id='auto-channel-lazyload-article')
h3 = tag.find(name='h3')  # first <h3> inside the article list
print(h3)
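
The relationship between `response.content` (bytes) and `response.text` (decoded string) can be illustrated without a network request. In this sketch a hand-made GBK byte string stands in for the response body; the sample text is an assumption for the example.

```python
# A GBK-encoded body, standing in for response.content (bytes).
raw = '汽车新闻'.encode('gbk')

# Decoding with the right codec corresponds to what response.text returns
# once response.encoding = 'gbk' has been set.
text = raw.decode('gbk')

# Decoding with the wrong codec either raises or produces garbled text.
try:
    wrong = raw.decode('utf-8')
except UnicodeDecodeError:
    wrong = None

print(type(raw).__name__, type(text).__name__)
print(text)
print(wrong)
```

This is why the encoding is set before `response.text` is read: the bytes are the same, only the way they are interpreted changes.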

Crawling multiple items
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Find every news item
# Fields: title, summary, URL, image
response = requests.get('https://www.autohome.com.cn/news/')
response.encoding = 'gbk'
soup = BeautifulSoup(response.text, 'html.parser')

li_list = soup.find(id='auto-channel-lazyload-article').find_all(name='li')

for li in li_list:
    title = li.find('h3')
    if not title:
        continue  # skip list items that are not news entries
    summary = li.find('p').text
    # li.find('a').attrs returns all of the tag's attributes as a dict, e.g. li.find('a').attrs['href']
    # A single attribute can also be read with li.find('a').get('href')
    url = li.find('a').get('href')
    img = li.find('img').get('src')
    print(title.text, url, summary, img)

    # src may be relative or scheme-relative, so resolve it against the page URL
    res = requests.get(urljoin(response.url, img))
    file_name = '%s.jpg' % title.text  # use the title text, not the Tag object
    with open(file_name, 'wb') as f:
        f.write(res.content)
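
Two details of the download step are easy to get wrong: a title can contain characters that are not valid in a file name, and an image `src` is not always a full URL. The helpers below are a hedged sketch of one way to handle both; `safe_file_name` and `absolute_url` are hypothetical names introduced here, and the sample title and image host are invented for illustration.

```python
import re
from urllib.parse import urljoin


def safe_file_name(title, ext='.jpg'):
    """Replace characters that are invalid in file names on common systems."""
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', '_', title).strip('_')
    return (cleaned or 'untitled') + ext


def absolute_url(page_url, src):
    """Resolve a relative or scheme-relative src against the page URL."""
    return urljoin(page_url, src)


# Hypothetical values for illustration.
print(safe_file_name('New SUV: price/specs?'))
print(absolute_url('https://www.autohome.com.cn/news/', '//img.example.com/a.jpg'))
```

In the loop, the file name could be built with `safe_file_name(title.text)` and the image requested from `absolute_url(response.url, img)`.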