Learning Python Web Scraping 1

zhenjiangxzy

Article link: https://blog.csdn.net/zhenjiangxzy/article/details/58072880

1. Overview

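I have been learning Python recently and am particularly interested in web scraping, so I started this blog to record my progress.

Instead of the urllib2 module used in some online tutorials, I use the requests module directly, which turns out to be much simpler.

In this post I use a scraper to fetch news content from Sina (news.sina.com.cn).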

2. Fetching the HTML with requests

Here is the code:

import requests
newsurl = "http://news.sina.com.cn/china/"
res = requests.get(newsurl)
print(res.text)

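The output is garbled, so check which encoding requests assumed:

print(res.encoding)  # show the encoding requests used to decode the body

To display the Chinese text correctly, the response has to be decoded as utf-8.
The final code for requesting the page is: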

import requests  
newurl = 'http://news.sina.com.cn/china/'  
res = requests.get(newurl)  
res.encoding = 'utf-8'  
print(res.text)
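The garbling can be reproduced without any network access. The sketch below is only an illustration: the sample string and the choice of ISO-8859-1 as the wrong codec are assumptions, not values taken from the Sina response. It shows that decoding UTF-8 bytes with the wrong codec yields unreadable text rather than an error, which is why setting res.encoding fixes the output.

```python
# Simulate the raw bytes a server would send for a UTF-8 page (sample text is made up).
raw = "新浪新闻".encode("utf-8")

# Decoding with the wrong codec does not fail; it silently produces garbled text.
garbled = raw.decode("iso-8859-1")

# Decoding with the codec the page was written in restores the original text.
fixed = raw.decode("utf-8")

print(garbled)
print(fixed)
```

The bytes themselves never change; only the codec used to interpret them does.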

3. Parsing the page with BeautifulSoup4

Suppose we have the following HTML snippet:

<html>
    <body>
        <h1 id="title"> Hello World </h1>
        <a href="!" class="link"> This is link1 </a>
        <a href="! link2" class="link"> This is link2 </a>
    </body>
</html>

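Next, import the BeautifulSoup4 library and parse the snippet (html_sample is a string holding the HTML above):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_sample, 'html.parser')  # use Python's built-in html.parser
print(soup.text)       # all of the text in the document

# Find all HTML elements with a given tag
header = soup.select("h1")
print(header)          # select() returns a list of matching elements
print(header[0])       # the first element, without the list brackets
print(header[0].text)  # the text inside that element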

Running this prints the matching h1 element as a one-item list, then the element itself, then its text.
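For reference, here is a self-contained sketch of the same steps that can be pasted into a file and run. The html_sample string is an assumption: it mirrors the snippet above with the surrounding whitespace trimmed and placeholder href values.

```python
from bs4 import BeautifulSoup

# Assumed sample document, modelled on the snippet in this section.
html_sample = """
<html>
    <body>
        <h1 id="title">Hello World</h1>
        <a href="#" class="link">This is link1</a>
        <a href="#link2" class="link">This is link2</a>
    </body>
</html>
"""

soup = BeautifulSoup(html_sample, "html.parser")

header = soup.select("h1")     # list of every <h1> element
title_text = header[0].text    # text of the first match

print(header)
print(title_text)
```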

4. Other similar operations

# Find all HTML elements with a given tag
soup = BeautifulSoup(html_sample, 'html.parser')
header = soup.select("h1")
print(header)


# Get elements by CSS attribute
# Use select to find the element whose id is title (prefix the id with #)
alink = soup.select('#title')
print(alink)

# Use select to find all elements whose class is link (prefix the class with .)
soup = BeautifulSoup(html_sample, 'html.parser')
for link in soup.select('.link'):
    print(link)



# Use select to find the href of every a tag
alinks = soup.select("a")
for link in alinks:
    print(link["href"])
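The three kinds of lookup shown so far (by id, by class, by attribute) can be tried together in one runnable sketch. The sample markup and its href values are assumptions made for illustration. Tag.get is used instead of indexing so that a link without an href yields None rather than raising KeyError.

```python
from bs4 import BeautifulSoup

# Assumed sample document; the href values are placeholders.
html_sample = """
<html>
    <body>
        <h1 id="title">Hello World</h1>
        <a href="#" class="link">This is link1</a>
        <a href="#link2" class="link">This is link2</a>
    </body>
</html>
"""

soup = BeautifulSoup(html_sample, "html.parser")

by_id = soup.select("#title")                        # "#" selects by id
by_class = soup.select(".link")                      # "." selects by class
hrefs = [a.get("href") for a in soup.select("a")]    # read an attribute from each tag

print(by_id[0].text, len(by_class), hrefs)
```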


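# Extract fields from each item by tag
# (assumes soup was built from a page that contains .news-item elements)
for news in soup.select('.news-item'):
    if len(news.select('h2')) > 0:   # skip items that have no headline
        h2 = news.select('h2')[0].text
        time = news.select('.time')[0].text
        a = news.select('a')[0]['href']
        print(time, h2, a)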

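Notes on the code above:
a) In a selector, an id is prefixed with a hash sign (#) and a class is prefixed with a period (.).
b) In the last snippet, check that select('h2') returned a non-empty list before indexing it; items without an h2 are skipped.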
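Note b) can be checked offline. The sketch below uses made-up markup that only imitates a news list (the class names follow the loop above; the headlines, times, and links are hypothetical). It uses select_one, which returns None when nothing matches, as one possible alternative to testing the length of the list returned by select.

```python
from bs4 import BeautifulSoup

# Hypothetical listing: the middle item has no <h2> and should be skipped.
listing = """
<div class="news-item"><h2><a href="/a1.shtml">First story</a></h2><div class="time">10:00</div></div>
<div class="news-item"></div>
<div class="news-item"><h2><a href="/a2.shtml">Second story</a></h2><div class="time">11:30</div></div>
"""

soup = BeautifulSoup(listing, "html.parser")

rows = []
for news in soup.select(".news-item"):
    h2 = news.select_one("h2")       # None when the item has no headline
    if h2 is None:
        continue
    published = news.select_one(".time").text
    link = news.select_one("a")["href"]
    rows.append((published, h2.text, link))

print(rows)
```

Without the guard, indexing the empty result for the middle item would raise IndexError and stop the loop.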

5. Scraping page content


# Fetch the article page
import requests
from bs4 import BeautifulSoup

url = "http://news.sina.com.cn/c/nd/2017-02-27/doc-ifyavvsh6939815.shtml"
res = requests.get(url)
res.encoding = "utf-8"
print (res.text)
soup = BeautifulSoup(res.text, 'html.parser')



# Extract the title
soup.select("#artibodyTitle")[0].text


# Source and time
soup.select('.time-source')[0]


# contents splits the element's children into a list
soup.select('.time-source')[0].contents[0].strip()  # strip() removes the surrounding whitespace
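To see what contents returns, the following sketch parses a hypothetical fragment shaped like a time-and-source line. The markup, the timestamp, and the source name are assumptions, not copied from the real page. The first child is the bare text node, and the nested link holds the source.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the time and source line of an article page.
fragment = """
<span class="time-source">
    2017年02月27日09:15
    <span><a href="http://example.com/">Example Source</a></span>
</span>
"""

soup = BeautifulSoup(fragment, "html.parser")
node = soup.select(".time-source")[0]

published = node.contents[0].strip()     # first child: the text before the nested span
source = node.select("span a")[0].text   # text of the link inside the nested span

print(published, source)
```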


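# Get the article body
article = []
for p in soup.select('#artibody p')[:-1]:   # [:-1] drops the last paragraph
    article.append(p.text.strip())
" ".join(article)   # join the paragraphs with spaces
# The same list can be built with a list comprehension
[p.text.strip() for p in soup.select("#artibody p")[:-1]]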
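The paragraph extraction can be tried on a small made-up page. In this sketch the markup is an assumption, with the last paragraph standing in for a trailing line that is not part of the article text. It shows that the slice drops that paragraph before the remaining ones are joined.

```python
from bs4 import BeautifulSoup

# Hypothetical article body; the last <p> stands in for a trailing non-article line.
page = """
<div id="artibody">
    <p> First paragraph. </p>
    <p>Second paragraph.</p>
    <p>Editor: someone</p>
</div>
"""

soup = BeautifulSoup(page, "html.parser")

# "#artibody p" matches every <p> nested inside the element with id artibody.
paragraphs = [p.text.strip() for p in soup.select("#artibody p")[:-1]]
body = " ".join(paragraphs)

print(body)
```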


# Get the editor's name by removing the leading "责任编辑：" (editor in charge) label
editor = soup.select('.article-editor')[0].text.strip().lstrip('责任编辑：')
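One caveat when removing a label this way: str.strip and str.lstrip treat their argument as a set of characters, not as a literal prefix. The sketch below uses a hypothetical editor line whose name happens to begin with a character from the label, and contrasts lstrip with an explicit prefix check.

```python
label = "责任编辑："
text = "责任编辑：任某 "   # hypothetical editor line, with trailing whitespace

# lstrip(chars) removes any leading characters found in chars,
# so it also eats the first character of this name.
stripped = text.strip().lstrip(label)

# Removing the label as a prefix keeps the name intact.
clean = text.strip()
if clean.startswith(label):
    clean = clean[len(label):]

print(stripped, clean)
```

For names that share no characters with the label the two approaches give the same result.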


# Get the comment count
soup.select("#commentCount1")
# Find where the comment count comes from