Web Scraping Basics in Practice, Part 1

夜明二

Category: Python Web Scraping

Article link: https://blog.csdn.net/xphouziyu/article/details/81838868

Install the required libraries

First, confirm that the pip package manager is installed, then install the libraries:

pip install requests
pip install beautifulsoup4
pip install jupyter

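Start Jupyter

This tutorial uses Jupyter rather than PyCharm.
Starting it is simple. On the command line, run:

jupyter notebook

A page then opens in your browser.
The notebook interface is straightforward, so it is not covered here.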

Fetch the first web page

import requests
newsurl = 'http://news.sina.com.cn/c/2018-08-17/doc-ihhvciiw9327767.shtml'
res = requests.get(newsurl)
res.encoding = 'utf-8'
print(res.text)
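
The call above assumes the request always succeeds. A slightly more defensive sketch, with a timeout and a status check, might look like the following. The helper name fetch_html and the 10-second timeout are illustrative choices, not something from the original example.

```python
import requests


def fetch_html(url, encoding='utf-8', timeout=10):
    """Download a page and return its text (hypothetical helper)."""
    # Without a timeout, requests can wait indefinitely on a stalled server
    res = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page
    res.raise_for_status()
    # Override the encoding, as the example above does
    res.encoding = encoding
    return res.text
```

With this in place, fetch_html(newsurl) would return the same text that res.text holds above.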

Extract text

from bs4 import BeautifulSoup
html_sample = '\
<html> \
<body> \
<h1 id="title">Hello World</h1> \
<a href ="#" class="link">This is link1</a> \
</body> \
</html> '

soup = BeautifulSoup(html_sample, 'html.parser')  # 'html.parser' selects the parser; omitting it triggers a warning
print(soup.text)

Combining the first two examples

from bs4 import BeautifulSoup
import requests
res = requests.get('http://news.sina.com.cn/c/2018-08-17/doc-ihhvciiw9327767.shtml')
res.encoding = 'utf-8'

soup = BeautifulSoup(res.text,'html.parser')
print(soup.text)

Filter elements

Use the select method to find the h1 tag.

soup = BeautifulSoup(html_sample, 'html.parser')
header = soup.select('h1')
print(header)

Find the a elements

soup = BeautifulSoup(html_sample, 'html.parser')
alink = soup.select('a')
print(alink)

select returns a list, so you cannot read .text from the result directly.
Loop over it with for instead:

soup = BeautifulSoup(html_sample, 'html.parser')
alink = soup.select('a')
for i in alink:
    print(i.text)

Besides tag names, elements can also be selected by id and by class.

soup.select('#title')  # prefix an id with "#"
soup.select('.link')  # prefix a class with "."
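
As a small self-contained sketch of both selectors, the following runs them against the same sample markup used earlier. The variable names title and link_texts are chosen here for illustration.

```python
from bs4 import BeautifulSoup

html_sample = """
<html>
<body>
<h1 id="title">Hello World</h1>
<a href="#" class="link">This is link1</a>
</body>
</html>
"""

soup = BeautifulSoup(html_sample, 'html.parser')

# select() returns a list even when the id matches a single element
title = soup.select('#title')[0].text
# Class selector: every element whose class is "link"
link_texts = [a.text for a in soup.select('.link')]

print(title)
print(link_texts)
```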

Get the link from an a tag

This comes up often.

alinks = soup.select('a')
for link in alinks:
    print(link['href'])
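
Two things tend to come up with real pages: some a tags have no href at all, and many links are relative. The sketch below handles both with link.get and urllib.parse.urljoin. The base_url and the sample markup are made up for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical base URL and markup, used only to demonstrate the idea
base_url = 'http://example.com/news/'
html = '<a href="/a.html">A</a><a href="b.html">B</a><a name="no-href">C</a>'

soup = BeautifulSoup(html, 'html.parser')
links = []
for link in soup.select('a'):
    # .get returns None when the attribute is missing; link['href'] would raise KeyError
    href = link.get('href')
    if href:
        # Resolve relative links against the page they were found on
        links.append(urljoin(base_url, href))

print(links)
```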

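Summary of the first part

Open the page you want to scrape, open the browser's Inspect tool, switch to the Network tab, and reload the page. Click Doc and check whether the Response contains the content you want.
Once you have the URL, find the tag that holds the content (by class or id).
Then fetch the page with requests.get and parse the content with BeautifulSoup.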

!!! Ten thousand curses

!!! Infuriating: this page blocks scrapers !!!
Let's switch to a different target.
The steps stay the same, though.

import requests
from bs4 import BeautifulSoup
# Fetch the page
res = requests.get('https://www.goddess8.com/meinvmote/3044-4.html')
# Set the character encoding
res.encoding = 'gbk'

soup = BeautifulSoup(res.text,'html.parser')
h1_link = soup.select('.panel-title')
print(h1_link[0].text)

That gets us the title we wanted.
!!! If you can tell what this site is, keep it to yourself. Consider it a bonus.
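
As for the page that blocked the scraper earlier: it is not clear how that site detects scrapers, but one common case is a server that rejects the default python-requests User-Agent. Under that assumption, a session that sends a browser-like header is a reasonable first thing to try. The header value below is only an example string.

```python
import requests

# Assumption: the site filters on the User-Agent header.
# Other anti-scraping measures (cookies, JavaScript checks) need other approaches.
session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/120.0 Safari/537.36',
})

# Usage (performs a real request, so it is left commented out here):
# res = session.get(newsurl, timeout=10)
```

Every request made through session then carries the header, so the rest of the steps stay unchanged.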

A complete scrape

Stage one: fetch the page

import requests
from bs4 import BeautifulSoup
res = requests.get('http://news.sina.com.cn/china/')
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text,'html.parser')

Stage two

Extract the content we need

We need the time, the title, and the link.

for news in soup.select('.news-item'):
    if len(news.select('h2')) > 0:  # skip items with no headline
        h2 = news.select('h2')[0].text  # title
        time = news.select('.time')[0].text  # time
        a = news.select('a')[0]['href']  # link
        print(time, h2, a)
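
Printing is fine for exploring, but collecting the results makes them easier to reuse. The sketch below wraps the same loop in a function that returns a list of dicts. The function name parse_news_items and the sample markup are hypothetical; the markup only mimics the structure the loop above relies on.

```python
from bs4 import BeautifulSoup


def parse_news_items(html):
    """Return a list of dicts with time, title and link for each news item."""
    soup = BeautifulSoup(html, 'html.parser')
    items = []
    for news in soup.select('.news-item'):
        # select_one returns the first match, or None when nothing matches
        h2 = news.select_one('h2')
        time_tag = news.select_one('.time')
        a = news.select_one('a')
        # Skip items that are missing any of the three pieces
        if h2 is None or time_tag is None or a is None or not a.get('href'):
            continue
        items.append({
            'time': time_tag.text.strip(),
            'title': h2.text.strip(),
            'link': a['href'],
        })
    return items


# Made-up markup for demonstration; the second item is empty and gets skipped
sample = '''
<div class="news-item"><h2><a href="http://example.com/1.shtml">Title one</a></h2><div class="time">8月17日 10:00</div></div>
<div class="news-item"></div>
'''
print(parse_news_items(sample))
```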

Next, follow each link into the article to get more content, such as the main headline.

import requests
from bs4 import BeautifulSoup
res = requests.get('http://news.sina.com.cn/china/')
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text,'html.parser')
for news in soup.select('.news-item'):
    if len(news.select('h2')) > 0:
        a = news.select('a')[0]['href']
        res_1 = requests.get(a)
        res_1.encoding = 'utf-8'
        soup_1 = BeautifulSoup(res_1.text, 'html.parser')
        print(soup_1.select('.main-title')[0].text)
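
One caveat with the loop above: indexing with [0] raises IndexError if an article page has no .main-title element, for example a page that uses a different layout. A minimal sketch of a more tolerant version, using a hypothetical helper named extract_main_title:

```python
from bs4 import BeautifulSoup


def extract_main_title(html):
    """Return the text of the first .main-title element, or None if absent."""
    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.select_one('.main-title')
    return tag.text.strip() if tag is not None else None


# Made-up snippets to show both outcomes
with_title = extract_main_title('<h1 class="main-title"> Headline </h1>')
without_title = extract_main_title('<h1>No matching class</h1>')
print(with_title)
print(without_title)
```

Inside the loop, this would be called as extract_main_title(res_1.text), skipping the article when the result is None.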

Converting time strings

from datetime import datetime

String to datetime: strptime

# timesource is the time string to parse
dt = datetime.strptime(timesource, '%Y年%m月%d日%H:%M')

Datetime to string: strftime

dt.strftime('%Y-%m-%d')

from datetime import datetime
time = "2018年01月22日 08:17"
datestr = datetime.strptime(time,'%Y年%m月%d日 %H:%M')
print(datestr)
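
Putting the two directions together, a round trip might look like the sketch below. The variable names and the output format '%Y-%m-%d %H:%M' are chosen here for illustration.

```python
from datetime import datetime

time_text = "2018年01月22日 08:17"

# String -> datetime: the format must match the input exactly, including the space
dt = datetime.strptime(time_text, '%Y年%m月%d日 %H:%M')

# datetime -> string in a different layout
formatted = dt.strftime('%Y-%m-%d %H:%M')

print(dt)
print(formatted)
```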
