Python Web Scraping Notes (1)

蜻蜓队长TTT

Tags: python, data mining

Article link: https://blog.csdn.net/weixin_44880916/article/details/105656585

1. Scraping the Zen of Python

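import requests

url = 'https://www.python.org/dev/peps/pep-0020/'
res = requests.get(url)
text = res.text
# Locate the text inside the <pre> element with str.find.
# The +28 is a hard-coded offset that skips the rest of the opening <pre ...> tag,
# so it depends on the page's exact markup; the -1 drops the trailing newline.
start = text.find('<pre') + 28
end = text.find('</pre>') - 1
zen = text[start:end]
with open('zon_of_python.txt', 'w') as f:
    f.write(zen)
print(zen)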

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
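
The fixed offset used with find above only works while the opening tag keeps exactly the same length. As a hedged alternative, here is a minimal sketch that pulls the text of the first <pre> element with the standard library's html.parser. The class name PreTextParser, the helper extract_pre, and the sample HTML are all made up for illustration; the sample stands in for res.text so the snippet runs without network access.

```python
from html.parser import HTMLParser


class PreTextParser(HTMLParser):
    """Collect the text inside the first <pre> element of a page."""

    def __init__(self):
        # convert_charrefs defaults to True, so entities such as &#39; are decoded
        super().__init__()
        self.in_pre = False
        self.done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'pre' and not self.done:
            self.in_pre = True

    def handle_endtag(self, tag):
        if tag == 'pre' and self.in_pre:
            self.in_pre = False
            self.done = True  # ignore any later <pre> elements

    def handle_data(self, data):
        if self.in_pre:
            self.chunks.append(data)


def extract_pre(html_text):
    """Return the stripped text of the first <pre> element, or '' if none exists."""
    parser = PreTextParser()
    parser.feed(html_text)
    parser.close()
    return ''.join(parser.chunks).strip()


# Made-up sample standing in for res.text
sample = ('<html><body><pre class="literal-block">\n'
          'Beautiful is better than ugly.\n'
          'Special cases aren&#39;t special enough.\n'
          '</pre></body></html>')
print(extract_pre(sample))
```

Because the parser tracks tags rather than character positions, a change in the tag's attributes would not shift the extracted text.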

2. Translating the Zen of Python with iCIBA (金山词霸) via POST

import requests
def translate(word):
    url="http://fy.iciba.com/ajax.php?a=fy"

    data={
        'f': 'auto',
        't': 'auto',
        'w': word,
    }
    
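    # The User-Agent header tells the server which client is making the request.
    # Requests that look like a crawler are often rejected, while a browser
    # User-Agent usually gets a normal response.
    headers={
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
    }
    response = requests.post(url,data=data,headers=headers)     # send the POST request
    json_data=response.json()   # parse the JSON response body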
    #print(json_data)
    return json_data
    
def run(word):    
    result = translate(word)['content']['out']   
    print(result)
    return result

def main():
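    with open('zon_of_python.txt') as f:
        # Translate line by line; strip the trailing newline and skip blank lines
        zh = [run(line.strip()) for line in f if line.strip()]

    with open('zon_of_python_zh-CN.txt', 'w', encoding='utf-8') as g:
        for i in zh:
            # strip() removes the stray leading space some results come back with
            g.write(i.strip() + '\n')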
            
if __name__ == '__main__':
    main()

美丽胜过丑陋。
 外显优于内隐..
 简单胜于复杂。
 复杂胜于复杂。
 平比嵌套好..
 疏而不密..
 可读性计数。
 特殊情况不足以打破规则。
 尽管实用性胜过纯度。
 错误永远不应该悄悄地过去。
除非有明确的沉默。
 面对暧昧，拒绝猜测的诱惑..
 应该有一种----最好只有一种----明显的办法来做到这一点。
 虽然这种方式一开始可能不明显，除非你是荷兰人。
 现在总比永远好。
虽然从来没有比现在更好。
 如果实施很难解释，那是个坏主意。
 如果实现很容易解释，这可能是个好主意。
 命名空间是一个伟大的想法-让我们做更多的这些！
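
The lookup translate(word)['content']['out'] raises an exception whenever the service returns a payload without those keys, for example an error response. A minimal sketch of a more defensive lookup follows. The helper name extract_translation and both sample payloads are hypothetical; only the 'content' and 'out' keys come from the code above.

```python
def extract_translation(json_data):
    """Return the translated text, or None if the expected keys are missing."""
    content = json_data.get('content') if isinstance(json_data, dict) else None
    if not isinstance(content, dict):
        return None
    out = content.get('out')
    # strip() also removes the stray leading space seen in some results
    return out.strip() if isinstance(out, str) else None


# Made-up payloads shaped like the response used above
ok = {'content': {'out': ' 美丽胜过丑陋。'}}
bad = {'content': []}

print(extract_translation(ok))
print(extract_translation(bad))
```

A caller such as run could then skip or retry a line when the helper returns None instead of stopping the whole job.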

3. Scraping the rank, title, and poster image of the Douban Top 250 movies

(This again relies mainly on simple string-matching rules; each item holds a movie name and its image URL.)
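
To make the rule concrete, here is a small sketch of the find-then-split idea applied to a single tag. The sample string is a made-up fragment imitating one poster <img> tag (the example.com URL is a placeholder), so the snippet runs without requesting the real page.

```python
# Made-up fragment imitating one poster <img> tag on the list page
sample = '<img width="100" alt="肖申克的救赎" src="https://example.com/p1.jpg" class="">'

# Step 1: cut everything up to and including the first 'alt'
rest = sample[sample.find('alt') + 3:]
# rest now starts with: ="肖申克的救赎" src="https://example.com/p1.jpg" ...

# Step 2: split on double quotes; the attribute values land at odd indexes
parts = rest.split('"')
name, image = parts[1], parts[3]
print(name, image)
```

The rule assumes that alt is immediately followed by src inside the tag; if the attribute order changed, index 3 would no longer be the image URL.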

import requests
import os

if not os.path.exists('image'):
    os.mkdir('image')
# Each item is a (name, image URL) tuple
def parse_html(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"}
    res = requests.get(url, headers=headers)
    text = res.text
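    item = []
    # Each page lists 25 movies, one poster <img alt="..." src="..."> per movie
    for i in range(25):
        # Drop everything up to and including the next 'alt',
        # so the remaining text starts with ="name" src="url" ...
        text = text[text.find('alt')+3:]
        item.append(extract(text))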
    return item
       
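def extract(text):
    # text starts with ="name" src="url" ..., so splitting on '"' puts
    # the name at index 1 and the image URL at index 3
    parts = text.split('"')
    name = parts[1]
    image = parts[3]
    return name, image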

def write_movies_file(item, stars):
    print(item)
    with open('douban_film.txt','a',encoding='utf-8') as f:
        f.write('Rank: %d\tTitle: %s\n' % (stars, item[0]))
    r = requests.get(item[1])
    with open('image/' + str(item[0]) + '.jpg', 'wb') as f:
        f.write(r.content)
        
def main():
    stars = 1
    for offset in range(0, 250, 25):
        url = 'https://movie.douban.com/top250?start=' + str(offset) +'&filter='
        for item in parse_html(url):
            write_movies_file(item, stars)
            stars += 1

if __name__ == '__main__':
    main()

('肖申克的救赎', 'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p480747492.jpg')
('霸王别姬', 'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2561716440.jpg')
('阿甘正传', 'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p1484728154.jpg')
('这个杀手不太冷', 'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p511118051.jpg')
('美丽人生', 'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2578474613.jpg')
('泰坦尼克号', 'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p457760035.jpg')
('千与千寻', 'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2557573348.jpg')
('辛德勒的名单', 'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p492406163.jpg')
('盗梦空间', 'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p513344864.jpg')
('忠犬八公的故事', 'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p524964016.jpg')
('海上钢琴师', 'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2574551676.jpg')
('楚门的世界', 'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p479682972.jpg')
('三傻大闹宝莱坞', 'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p579729551.jpg')
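
write_movies_file uses the movie name directly as the image file name, which fails if a name contains a character such as / or : that the file system does not accept. A minimal sketch of one way to guard against that is shown here. The helper safe_filename, its default replacement character, and the 'untitled' fallback are assumptions for illustration, not part of the original script.

```python
import re


def safe_filename(name, replacement='_'):
    """Replace characters that are not allowed in file names on common systems."""
    cleaned = re.sub(r'[\\/:*?"<>|]', replacement, name).strip()
    # Fall back to a fixed name if nothing usable is left
    return cleaned or 'untitled'


print(safe_filename('肖申克的救赎'))
print(safe_filename('Face/Off: Part 1?'))
```

Under this sketch the image path would be built as 'image/' + safe_filename(item[0]) + '.jpg'.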