The Xici proxy page is: http://www.xicidaili.com/nn
Notes:
1. Do not crawl Xici through a proxy. At the moment I cannot fetch the Xici pages through either the 66ip proxies or Xici's own proxies.
2. You must send a User-Agent header.
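Point 2 can be illustrated with a minimal, network-free sketch (the User-Agent string here is just an example value; the `pa()` helper used later serves the same purpose):

```python
from urllib import request

# Build a request that carries a browser-like User-Agent header.
# Without one, sites like Xici typically reject the request.
url = 'http://www.xicidaili.com/nn'
ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'  # example value
req = request.Request(url, headers={'User-Agent': ua})

# The header is attached to the request object; no network call is made here.
print(req.get_header('User-agent'))  # → Mozilla/5.0 (Windows NT 10.0; Win64; x64)
```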
1. Code to scrape the page into a CSV file
from urllib import request   # urllib request module
from piaot import *          # custom helper package (see the blog post "python disguise package"); provides pa(), a random User-Agent
import re                    # regular expressions

# Open a CSV file and append the scraped records to it
f = open('C:/Users/黑神/Desktop/爬虫/西刺代理ip.csv', 'a', encoding='utf-8')
a = ''
# Loop over the pages
for t in range(1, 3):
    if t == 1:
        url = 'http://www.xicidaili.com/nn'
    else:
        url = 'http://www.xicidaili.com/nn/' + str(t)
    print(url)
    # Add the User-Agent header
    headers = {'User-Agent': pa()}
    res = request.Request(url, headers=headers)
    # Open the connection and fetch the page
    html = request.urlopen(res)
    # Decode the response as UTF-8
    html = html.read().decode('utf-8')
    # Regex: grab the proxy table
    data = re.compile(r'<table id="ip_list">(.*?)</table>', re.S)
    html = data.findall(html)[-1]
    # Regex: grab every <td> cell
    data1 = re.compile(r'<td>(.*?)</td>|<td class="country">(.*?)</td>')
    html = data1.findall(html)
    a = ''
    for i in html:
        for j in i:
            if j != '':
                if 'img src' not in j:
                    a += j + ','
                if '-' in j:
                    a = a[:-1]
                    a += '\n'
    # Save this page to the CSV file
    f.write(a)
f.close()
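To see what the two regular expressions above actually capture, here is a self-contained sketch run against a hand-written table fragment (the HTML is a simplified stand-in for the real page, not its actual markup):

```python
import re

# Simplified stand-in for Xici's markup: one row with a flag image, an IP, and a port.
sample = '''<table id="ip_list">
<tr>
<td class="country"><img src="cn.png"/></td>
<td>61.135.217.7</td>
<td>80</td>
</tr>
</table>'''

# Same two-step extraction as above: grab the table body, then every <td>.
table = re.compile(r'<table id="ip_list">(.*?)</table>', re.S).findall(sample)[-1]
cells = re.compile(r'<td>(.*?)</td>|<td class="country">(.*?)</td>').findall(table)

# Each match is a 2-tuple; exactly one side of the alternation is non-empty.
# The 'img src' filter drops the flag-image cell, as in the scraper above.
values = [j for i in cells for j in i if j != '' and 'img src' not in j]
print(values)  # → ['61.135.217.7', '80']
```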
2. Code to clean the Xici proxy data in the CSV, filtering out the IPs and port numbers
# Open the CSV file
with open('C:/Users/黑神/Desktop/爬虫/西刺代理ip.csv', 'r', encoding='utf-8') as f:
    x = f.readlines()
lbiao = []
for i in range(len(x)):
    x1 = x[i].split(',')
    # Strip carriage returns, newlines, tabs, and the BOM
    row = x1[0].replace('\r', '').replace('\n', '').replace('\t', '').replace('\ufeff', '')
    # Join into ip:port form
    lbiao.append(row + ':' + x1[1])
print(lbiao)
# Save to a txt file
with open('C:/Users/黑神/Desktop/爬虫/备份.txt', 'w') as f:
    f.write(str(lbiao))
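The cleaning loop above can be exercised without the CSV file; this sketch feeds it two hand-made rows in the format step 1 produces (sample data, not real proxies):

```python
# Sample lines as they would come out of step 1's CSV: ip,port,... with a BOM on the first line
x = ['\ufeff61.135.217.7,80,高匿,HTTP\n', '112.95.18.1,8088,高匿,HTTP\n']

lbiao = []
for line in x:
    x1 = line.split(',')
    # Strip BOM and whitespace artifacts from the IP field, then join ip:port
    ip = x1[0].replace('\r', '').replace('\n', '').replace('\t', '').replace('\ufeff', '')
    lbiao.append(ip + ':' + x1[1])

print(lbiao)  # → ['61.135.217.7:80', '112.95.18.1:8088']
```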
3. Testing whether the IP addresses in the txt file work
from urllib import request
from piaot import *  # pa() returns a random User-Agent

# Test against the 66ip site; it currently works well for me
url = 'http://www.66ip.cn/'
# Likewise, working IPs will be saved to a txt file
with open('C:/Users/黑神/Desktop/爬虫/备份.txt', 'r', encoding='utf-8') as f:
    x = f.readlines()
a = []
for j in x:
    # The file holds the str() of a list, so eval it back into one
    x = eval(j)
for t in x[0:50]:
    try:
        proxy = {'http': t}
        print('Using proxy: ' + proxy['http'])
        # Create a ProxyHandler
        proxy_support = request.ProxyHandler(proxy)
        # Create an opener
        opener = request.build_opener(proxy_support)
        # Add a User-Agent
        opener.addheaders = [('User-Agent', pa())]
        # Install the opener globally
        request.install_opener(opener)
        print('Disguise ready, starting the crawl...')
        # Use the installed opener
        response = request.urlopen(url, timeout=6)
        print(response)
    except:
        print('Crawl failed; dropping this IP!')
        continue
    print('Crawl finished successfully! (^-^)')
    a.append(t)
# Save to file
with open('C:/Users/黑神/Desktop/爬虫/代理ip地址.txt', 'w') as f:
    f.write(str(a))
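Since the files in steps 2 and 3 store the str() of a Python list, reading them back can be done more safely with ast.literal_eval than with eval(), which would execute arbitrary code if the file were tampered with. A minimal sketch (the temp-file path and proxy strings are stand-ins, not real data):

```python
import ast
import os
import tempfile

# Simulate the saved file: the str() of a list of ip:port strings.
saved = str(['61.135.217.7:80', '112.95.18.1:8088'])
path = os.path.join(tempfile.gettempdir(), 'proxies_demo.txt')  # stand-in path
with open(path, 'w') as f:
    f.write(saved)

# literal_eval parses the list literal without executing arbitrary code.
with open(path, 'r') as f:
    proxies = ast.literal_eval(f.read())

print(proxies[0])  # → 61.135.217.7:80
```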