Basic web scraping with Python

夏微凉秋微暖

Category: python

Article link: https://blog.csdn.net/pengbin790000/article/details/88050594

Python version: 2.7

Part 1: Querying the Douban API

This part uses urllib2 and json.

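import urllib2
import json

try:
    # Request a single book record from the Douban API
    response = urllib2.urlopen('https://api.douban.com/v2/book/1220562')
    html = response.read()
    print html  # the raw JSON text
    # Decode the JSON text into Python dicts and lists
    hjson = json.loads(html)
    print hjson
    print hjson['id']
    print hjson['rating']['max']
    print hjson['tags'][0]['name']
except urllib2.URLError:
    # The request failed (network or HTTP error): exit
    exit()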
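
The lookups above, such as hjson['tags'][0]['name'], raise KeyError or IndexError when a field is missing or the tag list is empty. Below is a minimal sketch of more defensive access. It is written for Python 3 rather than the 2.7 used in this post, the payload in SAMPLE is invented for illustration and is not a real Douban response, and first_tag_name and summarize are hypothetical helper names.

```python
import json

# Invented payload that only mimics the three fields read in the post;
# it is NOT an actual response from the Douban API.
SAMPLE = '{"id": "1220562", "rating": {"max": 10}, "tags": [{"name": "novel"}]}'


def first_tag_name(book):
    """Return the name of the first tag, or None if there are no tags."""
    tags = book.get("tags") or []
    if not tags:
        return None
    return tags[0].get("name")


def summarize(raw):
    """Decode JSON text and pull out id, rating.max and the first tag name."""
    book = json.loads(raw)
    return {
        "id": book.get("id"),
        "rating_max": (book.get("rating") or {}).get("max"),
        "first_tag": first_tag_name(book),
    }


if __name__ == "__main__":
    print(summarize(SAMPLE))
    print(summarize("{}"))  # missing fields become None instead of raising
```

Returning None for absent fields lets the caller decide whether a missing value is an error, instead of stopping at the first gap in the data.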

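Part 2: Scraping Baidu News

This part uses requests, time, BeautifulSoup, and lxml.

1. Fetch the Baidu News home page with requests

import requests

url = "http://news.baidu.com/"
# Request the Baidu News URL and get the response body as text
wbdata = requests.get(url).text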

2. Parse the page with BeautifulSoup, using the lxml parser

from bs4 import BeautifulSoup

# Parse the text that was fetched
soup = BeautifulSoup(wbdata, 'lxml')

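3. Analyze the page source

Inspecting the source shows that most news items are a elements inside the li elements of a ul.

# Locate the target elements in the parsed document with a CSS selector; select() returns a list
news_titles = soup.select("ul.focuslistnews > li > a")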
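
To make the meaning of the selector concrete, in particular the > (direct child) combinator, here is a dependency-free sketch that finds the same kind of elements with the standard library's html.parser. This is a different technique from the BeautifulSoup call used in this post and not how BeautifulSoup works internally. It is written for Python 3, the markup in SAMPLE_HTML is invented (the real Baidu News page will differ), FocusLinkParser and extract_focus_links are hypothetical names, and it assumes well-nested markup without unclosed tags such as a bare br.

```python
from html.parser import HTMLParser

# Invented markup for illustration; the real Baidu News page will differ.
SAMPLE_HTML = """
<ul class="focuslistnews">
  <li><a href="http://example.com/a">Headline A</a></li>
  <li><span><a href="http://example.com/nested">Not a direct child</a></span></li>
</ul>
<ul class="other">
  <li><a href="http://example.com/b">Headline B</a></li>
</ul>
"""


class FocusLinkParser(HTMLParser):
    """Collect (title, href) for <a> matching ul.focuslistnews > li > a."""

    def __init__(self):
        super().__init__()
        self.stack = []        # (tag, class list) for every open element
        self.links = []        # finished (title, href) pairs
        self._href = None      # href of the matching <a> being read
        self._text = []        # text pieces collected inside that <a>
        self._in_link = False  # True while inside a matching <a>

    def _is_direct_match(self):
        # Parent must be <li>, grandparent must be <ul class="focuslistnews">.
        if len(self.stack) < 2:
            return False
        (gp_tag, gp_classes), (p_tag, _) = self.stack[-2], self.stack[-1]
        return p_tag == "li" and gp_tag == "ul" and "focuslistnews" in gp_classes

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and self._is_direct_match():
            self._in_link = True
            self._href = attrs.get("href")
            self._text = []
        classes = (attrs.get("class") or "").split()
        self.stack.append((tag, classes))

    def handle_data(self, data):
        if self._in_link:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self.links.append(("".join(self._text).strip(), self._href))
            self._in_link = False
        if self.stack and self.stack[-1][0] == tag:
            self.stack.pop()


def extract_focus_links(html):
    """Return the (title, href) pairs matched in the given HTML text."""
    parser = FocusLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    print(extract_focus_links(SAMPLE_HTML))
```

Only the first link in the sample matches: the second sits inside a span, so it is not a direct child of the li, and the third belongs to a ul with a different class.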

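Then simply iterate over the result:

import time

news = []  # assumed to start empty; the original does not show where it is created
for n in news_titles:
    # Extract the title and the link
    title = n.get_text()
    href = n.get("href")
    new = New()  # New is the author's own class; its definition is not shown
    new.title = title
    new.href = href
    new.queryTime = time.strftime('%Y-%m-%d', time.localtime())
    news.append(new)

# Save the news items to a file (saveNewsToText belongs to the enclosing class, not shown)
self.saveNewsToText(news)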
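
The loop relies on a New class and a saveNewsToText method that this post does not show. The following is a rough guess at what they might look like, not the author's code. It is written for Python 3; the constructor arguments, the tab-separated line format, the default file name news.txt, and the module-level function save_news_to_text (standing in for the method) are all assumptions.

```python
import io
import time


class New(object):
    """Hypothetical container matching the attributes set in the loop."""

    def __init__(self, title="", href="", queryTime=""):
        self.title = title
        self.href = href
        self.queryTime = queryTime


def save_news_to_text(news, path="news.txt"):
    """Write one item per line as queryTime<TAB>title<TAB>href; return the count."""
    # An explicit UTF-8 encoding keeps Chinese titles intact.
    with io.open(path, "w", encoding="utf-8") as f:
        for item in news:
            f.write(u"{}\t{}\t{}\n".format(item.queryTime, item.title, item.href))
    return len(news)


if __name__ == "__main__":
    today = time.strftime("%Y-%m-%d", time.localtime())
    items = [New(u"Example headline", "http://example.com/1", today)]
    print(save_news_to_text(items))
```

Storing the query date first on each line keeps the file easy to sort or filter by day if the scraper is run repeatedly.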
