Web Crawler Basics: Practice Exercises

aoxue4029

Original link: http://www.cnblogs.com/AAAAAAAA/p/8696725.html

0. Create an HTML file to practice on and open it in a browser.

 
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>
    <h1>This is the document body</h1>
    <P ID = "p1Node">This is paragraph 1.</P>
    <P ID = "p2Node">段落2</P>
    <a href="http://www.gzcc.cn/">广州商学院</a>

    <li>
        <a href="http://news.gzcc.cn/html/2018/xiaoyuanxinwen_0328/9113.html">
            <div class="news-list-text">
                <div class="news-list-title" style="">我校校长杨文轩教授讲授新学期“思政第一课”</div>
                <div class="news-list-description">3月27日下午，我校校长杨文轩教授在第四教学楼310室为学生讲授了新学期“思政第一课”。</div>
                <div class="news-list-info"><span><i class="fa fa-clock-o"></i>2018-03-28</span><span><i class="fa fa-building-o"></i>马克思主义学院</span></div>
            </div>
        </a>
</body>
</html>

1. Use requests.get(url) to fetch the HTML of a web page

import requests

newurl = 'http://news.gzcc.cn/html/2018/xiaoyuanxinwen_0328/9113.html'
res = requests.get(newurl)
res.encoding = "utf-8"
print(res.text)
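The fetch above assumes the request succeeds. A minimal sketch of a slightly more defensive variant follows. The helper name fetch_html, the 10-second timeout, and the injectable get parameter are illustrative assumptions and not part of the original exercise.

```python
import requests


def fetch_html(url, timeout=10, get=requests.get):
    """Return the page at `url` as text, decoded as UTF-8.

    `fetch_html`, `timeout` and `get` are hypothetical names for this sketch.
    `get` defaults to requests.get and exists only so the function can be
    tried out without network access.
    """
    res = get(url, timeout=timeout)
    res.raise_for_status()   # raises requests.HTTPError on 4xx/5xx responses
    res.encoding = "utf-8"   # same encoding override as in the exercise
    return res.text
```

With this helper, the step above would become html = fetch_html(newurl), and a failed request raises an exception instead of silently returning an error page.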

2. Use BeautifulSoup's HTML parser to build the parse tree

from bs4 import BeautifulSoup
soup = BeautifulSoup(res.text,'html.parser')
print(soup)
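The same parser also works on HTML that is already in memory, which is handy for trying things out on the practice page from step 0 without any network access. The sketch below uses a shortened copy of that page. The variable names practice_html, practice_soup and page_title are assumptions made for this example.

```python
from bs4 import BeautifulSoup

# A shortened copy of the practice page from step 0 (trimmed for brevity).
practice_html = """<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>
    <h1>This is the document body</h1>
    <p id="p1Node">This is paragraph 1.</p>
</body>
</html>"""

practice_soup = BeautifulSoup(practice_html, "html.parser")

page_title = practice_soup.title.string   # the text inside <title>
print(page_title)
print(practice_soup.prettify())           # indented view of the parse tree
```

BeautifulSoup also accepts an open file object in place of the string, so a saved practice file can be parsed the same way.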

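3. Find HTML elements by tag name

soup.p  # access by tag name; returns the first match

soup.head

soup.p.name  # string: the tag name

soup.p.attrs  # dict of all the tag's attributes

soup.p.contents  # list of all direct children

soup.p.text  # string: all text inside the tag

soup.p.string  # the tag's single text child

soup.select('li')  # list of all li tags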
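The difference between .text and .string is easy to miss, so here is a small sketch on a made-up fragment (the HTML string and the names demo, first_p, all_p and second_p are assumptions for illustration). .string is only set when a tag has exactly one child, while .text joins all nested text.

```python
from bs4 import BeautifulSoup

html = '<p id="p1Node">This is paragraph 1.</p><p>Hello <b>world</b></p>'
demo = BeautifulSoup(html, "html.parser")

first_p = demo.p               # attribute access returns only the first <p>
all_p = demo.find_all("p")     # every <p>, as a list
second_p = all_p[1]

print(first_p.name)            # the tag name as a string
print(first_p.attrs)           # dict of the tag's attributes
print(first_p.string)          # the single text child
print(second_p.text)           # all nested text joined together
print(second_p.string)         # None, because this <p> has two children
```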

4. Select elements by CSS id or class

soup.select('#p1Node')  # by id

soup.select('.news-list-title')  # by class
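select always returns a list, even for an id selector, so the result has to be indexed or looped over. The sketch below shows this on a made-up fragment, together with select_one, which returns the first match or None. The HTML string and the variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

html = """
<p id="p1Node">This is paragraph 1.</p>
<div class="news-list-text">
    <div class="news-list-title">Sample title</div>
</div>
"""
doc = BeautifulSoup(html, "html.parser")

by_id = doc.select("#p1Node")                    # list with one element
by_class = doc.select(".news-list-title")        # list of all matches
nested = doc.select("div.news-list-text .news-list-title")  # descendant selector
missing = doc.select(".no-such-class")           # empty list, no exception

first_title = doc.select_one(".news-list-title")  # a single Tag, or None
print(by_id[0].text, first_title.text, len(missing))
```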

5. Exercises:

Extract the h1 tag

a = soup.select('h1')
a1 = a[0].text
print(a1)

Extract the link of the a tag

a = soup.a.attrs['href']  # href of the first a tag in the page

soup.li.a.attrs['href']  # href of the a tag inside the first li

print(a)

Extract all the content of every li tag

for li in soup.select('li'):  # loop over every li, not just the first
    print(li.text)

Extract one news item's title, link, publication time, and source

a0 = soup.select('.news-list-title')[0].text  # title

a1 = soup.select('.news-list-info')[0].contents[0].text  # publication time

a2 = soup.select('.news-list-info')[0].contents[1].text  # source

a3 = soup.body.li.a.attrs['href']  # link

print(a0, a1, a2, a3)
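Indexing into .contents works when the two span tags sit directly next to each other, but .contents also includes whitespace text nodes when the markup is indented, so fixed indexes may not land on the spans. One possible alternative, sketched below, selects the spans by CSS selector and wraps the four lookups in a function. The function name parse_news_item, the dict keys, and the sample markup with its placeholder title and source are assumptions for illustration.

```python
from bs4 import BeautifulSoup


def parse_news_item(li):
    """Collect title, link, time and source from one news <li> tag.

    `parse_news_item` and the dict keys are hypothetical names for this sketch.
    """
    # Selecting the spans directly skips any whitespace text nodes.
    info_spans = li.select(".news-list-info span")
    return {
        "title": li.select_one(".news-list-title").text.strip(),
        "link": li.a.attrs["href"],
        "time": info_spans[0].text.strip(),
        "source": info_spans[1].text.strip(),
    }


# Indented sample markup with the same class names as the practice page.
sample = """
<li>
  <a href="http://news.gzcc.cn/html/2018/xiaoyuanxinwen_0328/9113.html">
    <div class="news-list-text">
      <div class="news-list-title">Sample title</div>
      <div class="news-list-info">
        <span><i class="fa fa-clock-o"></i>2018-03-28</span>
        <span><i class="fa fa-building-o"></i>Sample source</span>
      </div>
    </div>
  </a>
</li>
"""
item = parse_news_item(BeautifulSoup(sample, "html.parser").li)
print(item)
```

Returning a dict keeps the four fields labeled, which makes it straightforward to apply the same function to each li on a list page.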

Reposted from: https://www.cnblogs.com/AAAAAAAA/p/8696725.html