A First Look at Python: The Self-Cultivation of a Crawler, Part 1

Latest recommended article published on 2024-10-08 12:37:10

倾听彼岸

Category: python  Tags: python

Article link: https://blog.csdn.net/qq_46031627/article/details/112238939

Foreword
I skipped a few lessons to learn web crawling first -_-

URL

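The general format of a URL is (parts in square brackets [] are optional):
protocol://hostname[:port]/path/[;parameters][?query]#fragment

A URL consists of three parts.

The first part is the protocol: http, https, ftp, file, ed2k…

The second part is the domain name or IP address of the server that hosts the resource (sometimes a port number must be included; every transfer protocol has a default port, for example 80 for http).

The third part is the specific location of the resource, such as a directory or file name.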
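
To make the parts concrete, here is a small sketch that splits a URL with the standard library's urllib.parse.urlparse. The URL is made up for illustration (it is not from the original post) and was chosen only so that every optional part is present.

```python
from urllib.parse import urlparse

# Hypothetical URL, chosen only to exercise every part of the format
url = "http://www.example.com:8080/docs/index.html;type=a?lang=en#top"
parts = urlparse(url)

print(parts.scheme)    # protocol   -> http
print(parts.hostname)  # hostname   -> www.example.com
print(parts.port)      # port       -> 8080
print(parts.path)      # path       -> /docs/index.html
print(parts.params)    # parameters -> type=a
print(parts.query)     # query      -> lang=en
print(parts.fragment)  # fragment   -> top
```

When the port is left out of the URL, parts.port is None and the protocol's default port applies.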

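First crawler

import urllib.request
# Choose the address to open
response = urllib.request.urlopen("http://www.baidu.com")
# Read the data (returned as bytes)
html = response.read()
# Print the raw bytes
print(html)
# Decode the bytes as UTF-8 text
html = html.decode("utf-8")
# The printed text is the page's source code
print(html)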
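
The two print calls differ because read() returns bytes, while decode() turns those bytes into a str. A tiny offline sketch of that step, assuming a hand-written byte string in place of a real page:

```python
# Stand-in for what response.read() returns: raw bytes, not text
raw = "<title>百度一下</title>".encode("utf-8")
print(type(raw))

# decode() interprets the bytes as UTF-8 and yields a str
text = raw.decode("utf-8")
print(type(text))
print(text)

# If the bytes are not valid UTF-8, errors="replace" substitutes U+FFFD
# instead of raising UnicodeDecodeError
broken = b"caf\xe9"
recovered = broken.decode("utf-8", errors="replace")
print(recovered)
```

Pages that are not served as UTF-8 need the matching codec name passed to decode() instead.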

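Downloading an image from placekitten

import urllib.request
# response = urllib.request.urlopen('http://placekitten.com/500/600')
req = urllib.request.Request('http://placekitten.com/500/600')
# Passing a Request object has the same effect as passing the URL string
response = urllib.request.urlopen(req)
cat_img = response.read()

# Save the image bytes in binary mode
with open("cat_500_600.jpg", 'wb') as f:
	f.write(cat_img)
# The URL that was actually accessed
response.geturl()
# The information (headers) returned by the remote server
response.info()
# The HTTP status code
response.getcode()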
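
The three response methods can also be tried without depending on an external site. The following is only a rough sketch under stated assumptions: DemoHandler and fetch_from_local_server are hypothetical names, the "image" is a few placeholder bytes, and a throwaway server on 127.0.0.1 stands in for the image host.

```python
import http.server
import threading
import urllib.request


class DemoHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical stand-in for a remote image server."""

    def do_GET(self):
        body = b"fake image bytes"  # placeholder, not a real JPEG
        self.send_response(200)
        self.send_header("Content-Type", "image/jpeg")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch_from_local_server():
    # Port 0 lets the operating system pick a free port
    server = http.server.HTTPServer(("127.0.0.1", 0), DemoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    # An empty ProxyHandler ignores any proxy configured in the environment
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        url = "http://127.0.0.1:%d/500/600" % server.server_port
        with opener.open(url) as response:
            return (response.geturl(), response.getcode(),
                    response.info()["Content-Type"], response.read())
    finally:
        server.shutdown()
        server.server_close()


url, code, content_type, body = fetch_from_local_server()
print(url, code, content_type, len(body))
```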

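Simulating NetEase Youdao Dictionary

I did not see a translation result, but it looks like it should work.

import urllib.request
import urllib.parse

url = 'http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule'
data = {}
data['type'] = 'AUTO'
# Assign with '=': writing data['i']: '...' is only an annotation and stores nothing
data['i'] = 'I love XXX'
data['smartresult'] = 'dict'
data['client'] = 'fanyideskweb'
data['doctype'] = 'json'
data['version'] = '2.1'
data['keyfrom'] = 'fanyi.web'
# urlopen expects the POST body as bytes, so URL-encode the dict and then encode it
data = urllib.parse.urlencode(data).encode('utf-8')

response = urllib.request.urlopen(url, data)
html = response.read().decode('utf-8')
print(html)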
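
What urlencode produces, and why passing data turns the request into a POST, can be checked without sending anything. A minimal sketch, assuming a shortened form dict (only three of the fields above) to keep the output readable:

```python
import urllib.parse
import urllib.request

url = 'http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule'
form = {'type': 'AUTO', 'i': 'I love XXX', 'doctype': 'json'}

# urlencode joins key=value pairs with '&' and escapes spaces as '+'
encoded = urllib.parse.urlencode(form)
print(encoded)  # type=AUTO&i=I+love+XXX&doctype=json

# The request body must be bytes
body = encoded.encode('utf-8')

# Building a Request does not contact the server
get_req = urllib.request.Request(url)
post_req = urllib.request.Request(url, body)
print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
```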

Below is the brand-new version

One-click translation: Mom never has to worry about my studies again

import urllib.request
import urllib.parse
import json

content = input("Enter the text to translate: ")

url = 'http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule'
data = {}
data['type'] = 'AUTO'
data['i'] = content
data['smartresult'] = 'dict'
data['client'] = 'fanyideskweb'
data['doctype'] = 'json'
data['version'] = '2.1'
data['keyfrom'] = 'fanyi.web'
data = urllib.parse.urlencode(data).encode('utf-8')

response = urllib.request.urlopen(url, data)
html = response.read().decode('utf-8')
# print(html)

target = json.loads(html)
print("Translation result: %s" % (target['translateResult'][0][0]['tgt']))
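
The parsing step can be tried offline. In the sketch below the sample string is hand-written, not a captured server response; only the nesting translateResult[0][0]['tgt'] comes from the code above. extract_translation is a hypothetical helper that returns None when the response has a different shape, instead of crashing.

```python
import json

# Hand-written sample that mirrors the nesting used above (not real server output)
sample = '{"translateResult": [[{"src": "I love XXX", "tgt": "我爱XXX"}]]}'


def extract_translation(raw_json):
    """Return the first translated string, or None if the shape differs."""
    try:
        target = json.loads(raw_json)
        return target['translateResult'][0][0]['tgt']
    except (ValueError, KeyError, IndexError, TypeError):
        # ValueError covers invalid JSON; the others cover a missing field
        return None


print(extract_translation(sample))
# A made-up response without the expected field
print(extract_translation('{"errorCode": 50}'))
```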

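Modifying the header

Modify it through the headers parameter of Request

head = {}
head['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.65 Safari/537.36'
# Set the client (browser) information, then pass the dict as headers
req = urllib.request.Request(url, data, head)

Modify it through the Request.add_header() method

req = urllib.request.Request(url, data)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.65 Safari/537.36')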
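
Both approaches end up with the same header on the Request, which can be confirmed without sending it. A small sketch, assuming a placeholder URL; note that Request normalizes header names with str.capitalize(), so the stored key is 'User-agent'.

```python
import urllib.request

UA = ('Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 '
      '(KHTML, like Gecko) Chrome/39.0.2171.65 Safari/537.36')
url = 'http://www.example.com/'  # placeholder URL, nothing is sent

# Approach 1: the headers parameter
req_a = urllib.request.Request(url, headers={'User-Agent': UA})

# Approach 2: add_header() after construction
req_b = urllib.request.Request(url)
req_b.add_header('User-Agent', UA)

# The header name is stored capitalized as 'User-agent'
print(req_a.headers)
print(req_b.get_header('User-agent'))
```

Because of that normalization, get_header('User-Agent') returns None; look the header up as 'User-agent'.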

The time module

import time
# Sleep for 5 seconds
time.sleep(5)
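
In a crawler, the usual reason to sleep is to leave a pause between requests. One possible sketch: crawl_politely and fake_fetch are hypothetical names, and fake_fetch stands in for a real download so that the example runs without network access.

```python
import time


def crawl_politely(urls, fetch, delay=1.0):
    """Call fetch(url) for each URL, sleeping `delay` seconds between calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay)  # pause before every request except the first
        results.append(fetch(url))
    return results


def fake_fetch(url):
    # Stand-in for urllib.request.urlopen(url).read()
    return "fetched " + url


start = time.monotonic()
pages = crawl_politely(["page1", "page2", "page3"], fake_fetch, delay=0.1)
elapsed = time.monotonic() - start
print(pages)
print("elapsed: %.2f s" % elapsed)
```

Three URLs mean two pauses, so the run takes roughly twice the delay.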

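Proxies

Steps
1. The argument is a dictionary: {'type': 'proxy ip:port'}
proxy_support = urllib.request.ProxyHandler({})
2. Customize and create an opener
opener = urllib.request.build_opener(proxy_support)
3a. Install the opener (every later urlopen call then uses it)
urllib.request.install_opener(opener)
3b. Or call the opener directly
opener.open(url)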
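
The steps above can be sketched with the dictionary filled in. The proxy address is hypothetical (192.0.2.1 belongs to a range reserved for documentation), so nothing is actually requested here. Setting opener.addheaders is an addition that the steps above do not mention; it is shown only as one way to attach a User-Agent to everything the opener sends.

```python
import urllib.request

# Step 1: {'type': 'proxy ip:port'}, with a hypothetical address
proxy_support = urllib.request.ProxyHandler({'http': '192.0.2.1:8080'})

# Step 2: build an opener that routes http requests through the proxy
opener = urllib.request.build_opener(proxy_support)

# Optional extra (not in the steps above): default headers for this opener
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]

print(proxy_support.proxies)
print(opener.addheaders)

# Step 3a would be urllib.request.install_opener(opener)
# Step 3b would be opener.open(url); both are left out because the proxy is not real
```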