Python: Web Crawlers

Latest recommended article published on 2020-06-04 19:02:14

万绿从中一点红


Category: Python

Article link: https://blog.csdn.net/weixin_41860600/article/details/106504714


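1. Use the standard library urllib to download the images from the Pingdingshan University news page “http://news.pdsu.edu.cn/info/1005/31269.htm”. Requirements: save them to the pic directory on drive F and name each file “your own name” + “_image number”; for example, the first image for someone named 张三 (Zhang San) is saved as “张三_1.jpg”.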

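import os
from re import findall
from urllib.parse import urljoin
from urllib.request import urlopen

url = 'http://news.pdsu.edu.cn/info/1005/31269.htm'
with urlopen(url) as fp:
    content = fp.read().decode('utf-8')

# Capture the src attribute of every 500px-wide image on the page
pattern = '<img width="500" src="(.+?)"'
result = findall(pattern, content)

# Read each image and write it to a local file
os.makedirs('f:/pic', exist_ok=True)
path = 'f:/pic/嘻嘻嘻_'  # 嘻嘻嘻 is the name prefix; replace it with your own name
for index, item in enumerate(result, start=1):
    # Resolve the link against the page URL so relative paths also work
    with urlopen(urljoin(url, item)) as fp:
        # Keep the extension used on the server; fall back to .jpg
        ext = os.path.splitext(item)[1] or '.jpg'
        with open(path + str(index) + ext, 'wb') as fp1:
            fp1.write(fp.read())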

Running the script produces the following result:
[Image: screenshot of the result]
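
The link extraction and file naming in the script above can be checked without touching the network. The sketch below is only an illustration: SAMPLE_HTML, the image paths inside it, the helper image_targets, and the prefix 'ZhangSan' are all made up for the example, and the real page's markup may differ.

```python
import os
import re
from urllib.parse import urljoin

PAGE_URL = 'http://news.pdsu.edu.cn/info/1005/31269.htm'

# Hypothetical sample markup; the real page's image paths may differ.
SAMPLE_HTML = (
    '<p><img width="500" src="/__local/a/first.jpg" alt=""></p>'
    '<p><img width="500" src="/__local/b/second.png" alt=""></p>'
)


def image_targets(html, page_url, prefix):
    """Return (absolute_url, local_filename) pairs for each matched image."""
    links = re.findall(r'<img width="500" src="(.+?)"', html)
    targets = []
    for number, link in enumerate(links, start=1):
        # Resolve root-relative or relative links against the page URL
        absolute = urljoin(page_url, link)
        # Reuse the extension from the link; fall back to .jpg if there is none
        ext = os.path.splitext(link)[1] or '.jpg'
        targets.append((absolute, f'{prefix}_{number}{ext}'))
    return targets


targets = image_targets(SAMPLE_HTML, PAGE_URL, 'ZhangSan')
for absolute, filename in targets:
    print(absolute, '->', filename)
```

Separating "what to download and where to save it" from the actual download makes it easy to confirm the numbering starts at 1 and that each file keeps its own extension before any request is sent.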

2. Use the Scrapy framework to crawl the Pingdingshan University news site (http://news.pdsu.edu.cn/). Requirement: extract the news column titles and write the result to lm.txt.

import scrapy
from bs4 import BeautifulSoup


class LmSpider(scrapy.Spider):
    # The name must match the one passed to `scrapy crawl`
    name = 'lm'
    allowed_domains = ['pdsu.edu.cn']
    start_urls = ['http://news.pdsu.edu.cn/']

    def parse(self, response):
        soup = BeautifulSoup(response.text, 'html.parser')
        # Each news column title is an <h2 class="fl"> element
        columns = soup.find_all('h2', class_='fl')
        content = ''
        for lm in columns:
            print(lm.text)
            content += lm.text + '\n'
        # Append the column titles to lm.txt on drive F
        with open('f:\\lm.txt', 'a+', encoding='utf-8') as fp:
            fp.write(content)

Open a command prompt as administrator and install the bs4 and Scrapy libraries by entering:

pip install bs4 
pip install scrapy

Then enter:

scrapy startproject xinwen  
cd xinwen
scrapy genspider lm news.pdsu.edu.cn  
scrapy crawl lm

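Here xinwen is the project name, lm is the spider name, and news.pdsu.edu.cn is the domain to crawl.
Find the generated lm.py in the project folder (under xinwen/spiders), open it, paste in the code above, and run scrapy crawl lm. The result is then in lm.txt on drive F.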
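
The selection step, find_all('h2', class_='fl'), can be tried on a small piece of markup before running the spider. The sketch below deliberately swaps BeautifulSoup for the standard library's html.parser so it runs with no extra installs; it is a stand-in for illustration, not the spider's actual code. The class ColumnTitleParser and the markup in SAMPLE_HTML are hypothetical.

```python
from html.parser import HTMLParser


class ColumnTitleParser(HTMLParser):
    """Collect the text of every <h2> whose class list contains 'fl'."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._inside = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'h2' and 'fl' in classes:
            self._inside = True
            self._buffer = []

    def handle_data(self, data):
        # Text of nested tags (such as <a>) is collected as well
        if self._inside:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == 'h2' and self._inside:
            self.titles.append(''.join(self._buffer).strip())
            self._inside = False


# Hypothetical markup standing in for the news home page.
SAMPLE_HTML = '''
<div><h2 class="fl">Campus News</h2><a href="#">more</a></div>
<div><h2 class="fl"><a href="#">Media Coverage</a></h2></div>
<div><h2 class="fr">Ignored</h2></div>
'''

parser = ColumnTitleParser()
parser.feed(SAMPLE_HTML)
titles = parser.titles
print('\n'.join(titles))
```

Only the two headings carrying the fl class are kept, which mirrors what the spider writes to lm.txt one title per line.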

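3. Use the requests module to fetch the title of every chapter of the “Python Language and Applications” course on the Pingdingshan University online teaching platform (http://mooc1.chaoxing.com/course/206046270.html).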

import requests
import bs4

# Send a browser-like User-Agent with the request
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'
}

url = 'http://mooc1.chaoxing.com/course/206046270.html'
response = requests.get(url, headers=headers).text
soup = bs4.BeautifulSoup(response, 'html.parser')
# Each chapter title is a <div class="f16 chapterText"> element
t = soup.find_all('div', class_='f16 chapterText')
for ml in t:
    print(ml.text)

Open cmd as administrator and enter:

scrapy startproject chaoxing
cd chaoxing
scrapy genspider lm mooc1.chaoxing.com
scrapy crawl lm

The command formats are:
scrapy startproject <project name>
scrapy genspider <spider name> <domain> (for example, mooc1.chaoxing.com; the domain is a bare host name without a path such as /course/206046270.html)

Run the script shown above; the result is as follows:

[Image: screenshot of the result]
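
The script passes a browser-like User-Agent through the headers argument, presumably because some sites respond differently to a client's default identifier. As a rough standard-library counterpart, the sketch below builds the same request with urllib.request without sending it, so the header can be inspected offline. The helper name build_request is hypothetical, and this is not how requests works internally.

```python
from urllib.request import Request

USER_AGENT = (
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'
)


def build_request(url):
    """Return a GET request for url that carries a browser-like User-Agent."""
    return Request(url, headers={'User-Agent': USER_AGENT})


req = build_request('http://mooc1.chaoxing.com/course/206046270.html')
# Request stores header names capitalized as 'User-agent'
print(req.get_method(), req.full_url)
print(req.get_header('User-agent'))
```

Passing req to urllib.request.urlopen would send it; that step is left out here so the example runs without network access.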
