Python: Lab 9

beyond谚语

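1. Use the standard library urllib to download the images from the Pingdingshan University news page "http://news.pdsu.edu.cn/info/1005/31269.htm". Requirements: save them to the pic directory on drive F, and name each file "your own name" + "_image number". For example, the first image for someone named 张三 is saved as "张三_1.jpg".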

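from os import makedirs
from os.path import splitext
from re import findall
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

url = 'http://news.pdsu.edu.cn/info/1005/31269.htm'
with urlopen(url) as fp:
    content = fp.read().decode('utf-8')

# Find every image link; the capture group returns only the src value
pattern = '<img width="500" src="(.+?)"'
result = findall(pattern, content)

# Read each image and write its bytes to a local file
path = 'f:/pic/'
name = "烟雨"
makedirs(path, exist_ok=True)  # create the target directory if it does not exist
for index, item in enumerate(result, start=1):  # numbering starts at 1
    picture = urljoin(url, item)  # resolve the link against the page URL
    # Keep the extension used on the server; fall back to .jpg if there is none
    ext = splitext(urlparse(picture).path)[1] or '.jpg'
    with urlopen(picture) as fp:
        with open(path + name + '_' + str(index) + ext, 'wb') as fp1:
            fp1.write(fp.read())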

The result is shown below:
[Screenshot of the result]
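
The link extraction and the naming rule from task 1 can be checked offline before any download is attempted. The following is a minimal sketch, not the author's code: SAMPLE_HTML and its image paths are made up for illustration, and image_targets is a hypothetical helper name. It assumes the page marks its article images with width="500", as the pattern above does, and it hard-codes the .jpg suffix from the task statement.

```python
from re import findall
from urllib.parse import urljoin

# Hypothetical sample markup standing in for the downloaded page
SAMPLE_HTML = '''
<p><img width="500" src="/__local/A/01/photo_a.jpg" alt=""></p>
<p><img width="500" src="/__local/B/02/photo_b.jpg" alt=""></p>
<p><img width="120" src="/images/logo.png"></p>
'''

PAGE_URL = 'http://news.pdsu.edu.cn/info/1005/31269.htm'


def image_targets(html, page_url, person):
    """Return (absolute_url, file_name) pairs for the 500-pixel-wide images."""
    links = findall('<img width="500" src="(.+?)"', html)
    return [
        (urljoin(page_url, link), '{}_{}.jpg'.format(person, number))
        for number, link in enumerate(links, start=1)
    ]


targets = image_targets(SAMPLE_HTML, PAGE_URL, '张三')
for absolute_url, file_name in targets:
    print(file_name, absolute_url)
```

Using urljoin instead of string concatenation means that both root-relative links (starting with "/") and absolute links resolve to a usable URL.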

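2. Use the Scrapy crawler framework to scrape the Pingdingshan University news site (http://news.pdsu.edu.cn/). Specific requirement: extract the news section names and write the results to lm.txt.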

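Open cmd and keep the window open; all of the commands below run in the same session.
scrapy startproject wsq    (wsq is the project name)
cd wsq
scrapy genspider lm news.pdsu.edu.cn    (lm is the spider name; news.pdsu.edu.cn is the domain where the crawl starts)
Analysis: write a query that selects exactly the section headings.
The key markup is: <h2 class="fl">媒体平院</h2>
This is matched with a BeautifulSoup query rather than a regular expression: soup.find_all('h2', class_='fl')
Find lm.py, the spider file created above.
Edit it: copy and paste the code below into it.
Both third-party libraries need to be installed first:
pip install beautifulsoup4
pip install scrapy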

# -*- coding: utf-8 -*-
import scrapy
from bs4 import BeautifulSoup


class LmSpider(scrapy.Spider):
    name = 'lm'  # must match the name passed to "scrapy crawl"
    allowed_domains = ['pdsu.edu.cn']
    start_urls = ['http://news.pdsu.edu.cn/']

    def parse(self, response):
        soup = BeautifulSoup(response.text, 'html.parser')
        # Each news section heading is an <h2 class="fl"> element
        sections = soup.find_all('h2', class_='fl')
        content = ''
        for lm in sections:
            print(lm.text)
            content += lm.text + '\n'
        # The output path can be changed as needed
        with open('f:\\lm.txt', 'a+', encoding='utf-8') as fp:
            fp.write(content)

scrapy crawl lm    (lm is the spider name)
The result is shown below:
[Screenshot of the result]
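
For readers who want to try the heading selection without installing anything, here is a rough standard-library counterpart to soup.find_all('h2', class_='fl'). It swaps BeautifulSoup for html.parser.HTMLParser, so it is an alternative sketch rather than the method used above. SectionTitleParser is a hypothetical class name, and SAMPLE_HTML is invented for illustration; only its first heading is quoted from the page.

```python
from html.parser import HTMLParser


class SectionTitleParser(HTMLParser):
    """Collect the text of every <h2> whose class list contains "fl"."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._inside = False  # True while we are inside a matching <h2>

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'h2' and 'fl' in classes:
            self._inside = True
            self.titles.append('')

    def handle_endtag(self, tag):
        if tag == 'h2':
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.titles[-1] += data


# Hypothetical sample markup standing in for the news home page
SAMPLE_HTML = '''
<div><h2 class="fl">媒体平院</h2><a href="#">more</a></div>
<div><h2 class="fl">Sample section</h2></div>
<h2 class="title">Not a section heading</h2>
'''

parser = SectionTitleParser()
parser.feed(SAMPLE_HTML)
titles = [t.strip() for t in parser.titles]
print('\n'.join(titles))
```

The same list of titles could then be joined with newlines and written to lm.txt, as the spider does.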

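3. Use the requests module to scrape the title of every chapter of the course "Python语言及应用" (Python Language and Applications) on the Pingdingshan University online teaching platform (http://mooc1.chaoxing.com/course/206046270.html).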

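Open cmd and keep the window open; all of the commands below run in the same session.
scrapy startproject yy    (yy is the project name)
cd yy
scrapy genspider beyond mooc1.chaoxing.com/course/206046270.html    (beyond is the spider name; mooc1.chaoxing.com/course/206046270.html is where the crawl starts)
Analysis: write a query that selects exactly the chapter titles.
The key markup is: <div class="f16 chapterText">第一章 python概述</div>
This is matched with a BeautifulSoup query rather than a regular expression: soup.find_all('div', class_='f16 chapterText') (findAll is an older alias of the same method)
Find beyond.py, the spider file created above.
Edit it: copy and paste the code below into it. Note that the code fetches the page with requests rather than through a Scrapy spider class, so it can also be run directly as a standalone script.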

# -*- coding: utf-8 -*-
import requests
import bs4

# Send a browser-like User-Agent header with the request
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'
}

url = 'http://mooc1.chaoxing.com/course/206046270.html'
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # stop early on an HTTP error status
soup = bs4.BeautifulSoup(response.text, 'html.parser')
# Each chapter title is a <div class="f16 chapterText"> element
chapters = soup.find_all('div', class_='f16 chapterText')
for ml in chapters:
    print(ml.text)

The result is shown below:
[Screenshot of the result]
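
Task 3 sends a custom User-Agent header through requests. For comparison with the urllib approach in task 1, the sketch below builds an equivalent request object with the standard library. It only constructs the request and does not contact the server; the constant names URL and HEADERS are chosen here for illustration.

```python
from urllib.request import Request

URL = 'http://mooc1.chaoxing.com/course/206046270.html'
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'
}

# Build the request object only; nothing is sent until urlopen(request) is called
request = Request(URL, headers=HEADERS)

print(request.get_method())              # GET, because no data is attached
print(request.get_header('User-agent'))  # urllib stores the header name as "User-agent"
```

Passing this object to urlopen, as in task 1, would send the request with the header attached.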
