How Much Do You Know About Web Crawling?

yzhua_777

Category: Web crawling techniques. Tags: python, web crawler

Article link: https://blog.csdn.net/yzhua_777/article/details/105669562

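I. Scraping the Zen of Python
The most common workflow of a web crawler is:
1. Request the site.
2. Find and locate the information you need.
3. Process the information once you have it.

Show the code: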

import requests

url = 'https://www.python.org/dev/peps/pep-0020/'
res = requests.get(url)  # download the page
text = res.text          # the HTML source as one string
print(text)

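Look at the result: what comes back is the page's HTML source, essentially what you see under Elements in the browser's developer tools, only as a single string. Next we use Python's string method find to locate the index of the Zen of Python and slice it out of that string.
Inspecting the page (Ctrl+Shift+C jumps straight to an element) shows that the passage sits inside a pre tag, so we can use that tag as the landmark for find.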

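# Save the scraped text to a txt file
# Jumping to the first '>' after '<pre' skips the opening tag without
# hard-coding its length (a fixed offset of 28 when this was written)
start = text.find('>', text.find('<pre')) + 1
end = text.find('</pre>')  # index where the closing tag begins
zen = text[start:end].strip()
with open('zon_of_python.txt', 'w', encoding='utf-8') as f:
    f.write(zen)
print(zen)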
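The find-based slicing above can be tried offline on a small hand-written HTML string. The sketch below is a minimal illustration rather than the author's code: the helper name extract_between and the sample markup are hypothetical, and it simply takes the first matching tag in the page.

```python
def extract_between(html, open_prefix, close_tag):
    """Return the text inside the first tag that starts with open_prefix.

    Hypothetical helper for illustration; it uses only str.find and slicing.
    """
    tag_start = html.find(open_prefix)
    if tag_start == -1:
        return None  # opening tag not found
    tag_end = html.find('>', tag_start)
    if tag_end == -1:
        return None  # opening tag is never closed
    content_end = html.find(close_tag, tag_end + 1)
    if content_end == -1:
        return None  # closing tag not found
    return html[tag_end + 1:content_end].strip()


# Made-up sample page, shaped like the pre block described above
sample = (
    '<html><body><pre class="literal-block">\n'
    'Beautiful is better than ugly.\n'
    'Explicit is better than implicit.\n'
    '</pre></body></html>'
)
print(extract_between(sample, '<pre', '</pre>'))
```

Because the helper looks for the end of the opening tag instead of counting characters, it keeps working if the attributes on the tag change.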

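Next, we translate the Zen of Python we just scraped with iCIBA (金山词霸).
We start with iCIBA because Youdao Translate, Baidu Translate and Google Translate all use encryption on their requests; you can try those on your own later.
First open the iCIBA home page, http://www.iciba.com/.
Then open the Network tab of the developer tools and translate a sentence, for example the first line we scraped, "Beautiful is better than ugly."
After you click Translate, a new entry whose request method is POST appears in the Name column; its Preview tab shows the translation we want.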

import requests

def translate(word):
    url = 'http://fy.iciba.com/ajax.php?a=fy'
    # Form fields of the POST request; 'w' carries the text to translate
    data = {
        'f': 'auto',
        't': 'auto',
        'w': word,
    }
    # The User-Agent tells the server which client is making the request.
    # Requests that look like a crawler are often refused, while requests
    # that look like a browser get an answer.
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36',
    }
    response = requests.post(url, data=data, headers=headers)  # send the request
    json_data = response.json()  # parse the JSON body
    return json_data

def run(word):
    result = translate(word)['content']['out']
    print(result)
    return result

def main():
    with open('zon_of_python.txt', encoding='utf-8') as f:
        # Strip the trailing newline from each line and skip blank lines
        zh = [run(line.strip()) for line in f if line.strip()]
    with open('zon_of_python_zh-CN.txt', 'w', encoding='utf-8') as g:
        for i in zh:
            g.write(i + '\n')

if __name__ == '__main__':
    main()
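Two details of the script above can be checked without touching the network: cleaning the lines read from the file, and pulling the translation out of the response. The sketch below assumes the response has the shape the script relies on, a 'content' object holding an 'out' field; the helper names clean_lines and extract_translation and the sample payload are hypothetical, and the live service may answer differently.

```python
def clean_lines(lines):
    """Strip surrounding whitespace and drop blank lines."""
    return [line.strip() for line in lines if line.strip()]


def extract_translation(json_data):
    """Return json_data['content']['out'], or None when that shape is missing."""
    content = json_data.get('content') if isinstance(json_data, dict) else None
    if isinstance(content, dict):
        return content.get('out')
    return None


raw = ['Beautiful is better than ugly.\n', '\n', 'Explicit is better than implicit.\n']
print(clean_lines(raw))

fake_response = {'content': {'out': '优美胜于丑陋。'}}  # made-up payload, not a real reply
print(extract_translation(fake_response))
print(extract_translation({}))  # missing 'content' yields None instead of a KeyError
```

Returning None for an unexpected payload lets the caller decide whether to skip the line or retry, instead of stopping the whole run on one bad response.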

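II. Scraping the titles and posters of the Douban Top 250 movies
Opening https://movie.douban.com/top250 shows only 25 movies per page, so scraping all 250 means walking through several pages and watching how the URL changes from one page to the next:
https://movie.douban.com/top250?start=<offset>&filter=
Only the start value changes between pages (0, 25, 50, ..., 225).
We also need to check where the information we want sits in the page source.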
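The paging pattern above can be sketched as a small function that builds every page URL. This is a minimal illustration assuming 250 entries at 25 per page, as described in the text; the function name page_urls and its parameters are hypothetical.

```python
def page_urls(total=250, page_size=25):
    """Build the list URLs for every page of the ranking."""
    base = 'https://movie.douban.com/top250?start={}&filter='
    return [base.format(offset) for offset in range(0, total, page_size)]


urls = page_urls()
print(len(urls))   # number of pages to request
print(urls[0])     # first page
print(urls[-1])    # last page
```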
Without further ado, here is the code:

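import os
import requests

# Identify as a regular browser; requests without a browser-like
# User-Agent may be rejected
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'
}

# Create the folder for the posters if it does not exist yet
os.makedirs('image', exist_ok=True)

def parse_html(url):
    """Download one page and return a list of (name, image_url) tuples."""
    res = requests.get(url, headers=HEADERS)
    text = res.text
    items = []
    for _ in range(25):
        pos = text.find('alt')
        if pos == -1:  # fewer than 25 matches on this page
            break
        text = text[pos + 3:]  # keep only what follows the next "alt"
        items.append(extract(text))
    return items

def extract(text):
    """text starts with ="<name>" src="<image url>", so split on double quotes."""
    parts = text.split('"')
    return parts[1], parts[3]

def write_movies_file(item, rank):
    name, image_url = item
    print(item)
    with open('douban_film.txt', 'a', encoding='utf-8') as f:
        f.write('Rank: %d\tTitle: %s\n' % (rank, name))
    r = requests.get(image_url, headers=HEADERS)  # same headers for the image
    safe_name = name.replace('/', '_')  # a slash would break the file path
    with open(os.path.join('image', safe_name + '.jpg'), 'wb') as f:
        f.write(r.content)

def main():
    rank = 1
    for offset in range(0, 250, 25):  # 10 pages, 25 movies each
        url = 'https://movie.douban.com/top250?start=' + str(offset) + '&filter='
        for item in parse_html(url):
            write_movies_file(item, rank)
            rank += 1

if __name__ == '__main__':
    main()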

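Running the script prints each (name, image URL) pair, appends the ranking to douban_film.txt, and saves the posters in the image folder.
OK, that is all there is to it!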
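The name and poster extraction in parse_html and extract depends on the alt attribute coming immediately before src inside each poster's img tag. The sketch below runs the same find-and-split steps on a hand-written snippet so that assumption is visible; the function name parse_items, the markup, the titles, and the URLs are made up for illustration and are not copied from Douban.

```python
def extract(text):
    """text starts with ="<name>" src="<image url>", so split on double quotes."""
    parts = text.split('"')
    return parts[1], parts[3]


def parse_items(html, count):
    """Collect up to count (name, image_url) pairs by scanning for 'alt'."""
    items = []
    for _ in range(count):
        pos = html.find('alt')
        if pos == -1:
            break  # no more posters in the remaining text
        html = html[pos + 3:]  # keep only what follows this "alt"
        items.append(extract(html))
    return items


# Hypothetical markup: alt first, then src, both in double quotes
sample = (
    '<img width="100" alt="Movie A" src="https://example.com/a.jpg" class="">'
    '<img width="100" alt="Movie B" src="https://example.com/b.jpg" class="">'
)
print(parse_items(sample, 25))
```

If the site reorders the attributes, or the letters "alt" appear elsewhere in the page, this approach returns the wrong fields, which is the main reason to move to an HTML parser once the basics are clear.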
