Sharing the notes I took while learning to scrape Sina news!
Install the packages via pip
pip install openpyxl
pip install pandas -i http://pypi.douban.com/simple --trusted-host pypi.douban.com
pip install requests
pip install beautifulsoup4
pip install jupyter
jupyter notebook
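To make sure everything installed cleanly, a quick import check (a minimal sketch; any ImportError means the matching pip install needs to be rerun):

import openpyxl
import pandas
import requests
import bs4
print('all imports OK')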
First, check that we can fetch the page content
import requests

res = requests.get('https://www.sina.com.cn/')
res.encoding = 'utf-8'  # Sina serves UTF-8; set this so the Chinese text decodes correctly
print(res.text)
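A slightly more defensive version of the same fetch (a sketch; the 10-second timeout is my own choice, not part of the original) confirms the request succeeded before using the body:

import requests

res = requests.get('https://www.sina.com.cn/', timeout=10)  # timeout value is an arbitrary choice
res.raise_for_status()  # raises an HTTPError for 4xx/5xx responses
res.encoding = 'utf-8'
print(res.status_code)  # 200 means the page was fetched successfully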
A brief rundown of how BeautifulSoup is used
from bs4 import BeautifulSoup
html_sample = '\
<html> \
<body> \
<h1 id="title">Hello World</h1> \
<a href="#" class="link">This is link1</a> \
<a href="# link2" class="link">This is link2</a> \
</body> \
</html>'
soup = BeautifulSoup(html_sample, 'html.parser')
print(type(soup))
print(soup.text)
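soup.text strips all the markup and returns only the text. To inspect the parsed tree itself, prettify() re-indents the HTML (a quick sketch):

print(soup.prettify())  # the parsed HTML, indented one tag per line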
1. Use select to find the elements containing an h1 tag
soup = BeautifulSoup(html_sample, 'html.parser')  # pass the parser explicitly to avoid a warning
header = soup.select('h1')
print(header)
print(header[0])       # the element itself, without the surrounding []
print(header[0].text)  # extract just the text
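Since select always returns a list, BeautifulSoup also offers select_one, which returns the first match directly, or None when nothing matches (a small sketch):

header = soup.select_one('h1')  # first matching element, or None
if header is not None:
    print(header.text)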
2. Use select to find the elements containing an a tag
soup = BeautifulSoup(html_sample, 'html.parser')
alink = soup.select('a')
print(alink)
for link in alink:
    print(link)       # each element on its own
    print(link.text)  # each element's text on its own
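To collect all the link texts at once, a list comprehension works too (a small sketch):

texts = [link.text for link in soup.select('a')]
print(texts)  # ['This is link1', 'This is link2']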
3. Use select to find all elements whose id is title (prefix the id with #)
alink = soup.select('#title')
print(alink)
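The same id lookup can be written with find, which returns the element directly rather than in a list (a small sketch):

print(soup.find(id='title'))  # <h1 id="title">Hello World</h1>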
4. Use select to find all elements whose class is link (prefix the class with .)
soup = BeautifulSoup(html_sample, 'html.parser')
for link in soup.select('.link'):
    print(link)
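An equivalent query uses find_all with the class_ keyword (a small sketch; note the trailing underscore, since class is a Python keyword):

for link in soup.find_all('a', class_='link'):
    print(link)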
5. Use select to get the href link of every a tag
alinks = soup.select('a')
for link in alinks:
    print(link['href'])
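Indexing with link['href'] raises KeyError if a tag has no href attribute; Tag.get returns None instead. CSS selectors can also be combined, e.g. tag plus class (a small sketch):

for link in soup.select('a.link'):   # a tags that also carry class="link"
    print(link.get('href'))          # None instead of KeyError when href is missing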
Now for the main event: scraping the Sina news page
import requests
from bs4 import BeautifulSoup

res = requests.get('https://news.sina.com.cn/world/')
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')
for news in soup.select('.news-item'):
    if len(news.select('h2')) > 0:
        print(news.select('h2')[0].text)  # print each headline; the full field extraction follows below
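Since pandas and openpyxl were installed at the top, the scraped items can also be collected into a DataFrame and saved to Excel. A sketch under assumptions: the h2 and .time class names are taken from the page structure above, and Sina may change its markup at any time, so none of these selectors are guaranteed:

import requests
import pandas as pd
from bs4 import BeautifulSoup

res = requests.get('https://news.sina.com.cn/world/')
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')

rows = []
for news in soup.select('.news-item'):
    h2 = news.select('h2')
    if len(h2) > 0:
        a = h2[0].select('a')     # the headline is usually wrapped in an a tag (assumption)
        t = news.select('.time')  # '.time' is an assumed class name for the timestamp
        rows.append({
            'title': h2[0].text,
            'link': a[0]['href'] if a else None,
            'time': t[0].text if t else None,
        })

df = pd.DataFrame(rows)
df.to_excel('sina_news.xlsx', index=False)  # openpyxl does the .xlsx writing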