Python Web Scraping Basics: A Comparison with Java Web Scraping (1)

Latest recommended article published on 2024-11-10 15:01:17

2401_84572928


Category: Programmer. Tags: python, learning, interview

Article link: https://blog.csdn.net/2401_84572928/article/details/138632588


Java code:

// Fragment: the enclosing method, which declares link, reader and writer, is missing from the source.
try {
    URL url = new URL(link);
    URLConnection con = url.openConnection();
    reader = new BufferedReader(new InputStreamReader(con.getInputStream()));
    writer = new FileWriter("f://test1.txt");
    String buff = null;
    // Copy the page to the file line by line; readLine() strips the line terminator, so write it back
    while ((buff = reader.readLine()) != null) {
        writer.write(buff);
        writer.write(System.lineSeparator());
    }
} catch (MalformedURLException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
} finally {
    try {
        // reader or writer is still null if opening the connection or the file failed
        if (reader != null) reader.close();
        if (writer != null) writer.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

Python code:

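import urllib.error
import urllib.request

def downloadHtml(link):
    try:
        # Fetch the page and decode it, ignoring bytes that are not valid UTF-8
        data = urllib.request.urlopen(link).read()
        text = data.decode('utf-8', 'ignore')
        # Write with an explicit encoding so the output does not depend on the platform default
        with open('f:\\test2.txt', 'w', encoding='utf-8') as file:
            file.write(text)
    except urllib.error.URLError as e:
        print(e.reason)

downloadHtml("http://www.bilibili.com")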

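However, scraping a site this way is fragile, for two reasons. First, a site can block requests based on the type of client making them. Second, a crawler hits the site very frequently, so the site may ban our IP address.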
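The second problem, requests that arrive too frequently, can be eased by spacing them out. The following is a minimal sketch of that idea, not something from the original post: the `Throttle` class, its `min_interval` parameter, and the injectable `clock` and `sleep` arguments are all hypothetical names chosen for illustration.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests (hypothetical helper)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        # min_interval: seconds that must pass between two calls to wait()
        # clock and sleep are injectable so the class can be tested without real delays
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, None before the first one

    def wait(self):
        """Block until at least min_interval seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A crawler loop would call `wait()` on a shared `Throttle` instance before each download. A suitable interval depends on the target site, so any value chosen here is an assumption rather than a recommendation.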

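We need to make the crawler look like an ordinary visitor. In Java, a common way to do this is with the Apache HttpClient jar.

The Java code is as follows: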

import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.http.HttpEntity;
import org.apache.http.HttpHost;
import org.apache.http.HttpRequest;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class HttpclientTest {

    public static void main(String[] args) {
        System.out.println(new HttpclientTest().gethtml("http://www.bilibili.com"));
    }

    public String gethtml(String url) {
        String html = null;
        // Create the client; CloseableHttpClient is a class from the HttpClient jar
        CloseableHttpClient httpclient = HttpClients.createDefault();
        // Reader for the HTML stream
        BufferedReader reader = null;
        // The GET request
        HttpGet getmethod = new HttpGet(url);
        // The response
        HttpResponse response = null;
        // Proxy IP address and port
        HttpHost proxy = new HttpHost("124.88.67.81", 80);
        // Request configuration: route through the proxy and set the timeouts (milliseconds)
        RequestConfig config = RequestConfig.custom()
                .setProxy(proxy)
                .setConnectTimeout(5000)
                .setConnectionRequestTimeout(5000)
                .setSocketTimeout(5000)
                .build();
        // Add a browser-like User-Agent request header
        getmethod.addHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36");
        getmethod.setConfig(config);
        try {
            // Execute the request and get the response
            response = httpclient.execute(getmethod);
            // Response body
            HttpEntity entity = response.getEntity();
            // Open a reader on the body
            reader = new BufferedReader(new InputStreamReader(entity.getContent()));
            // The source listing is cut off here; the rest of the method is missing.
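For comparison, the same three measures (a browser-like User-Agent, a proxy, and a timeout) can be sketched in Python with only the standard `urllib.request` module. This is an illustrative sketch rather than the original author's code: the names `build_opener_with_proxy` and `get_html` are hypothetical, and the proxy address is simply the one from the Java listing, which may no longer be reachable.

```python
import urllib.request

# Same User-Agent string as in the Java listing
USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36")


def build_opener_with_proxy(proxy_host, proxy_port):
    """Return an opener that sends HTTP requests through the given proxy
    and identifies itself with a browser-like User-Agent (hypothetical helper)."""
    proxy = urllib.request.ProxyHandler(
        {"http": f"http://{proxy_host}:{proxy_port}"}
    )
    opener = urllib.request.build_opener(proxy)
    # addheaders replaces the default "Python-urllib/x.y" User-Agent
    opener.addheaders = [("User-Agent", USER_AGENT)]
    return opener


def get_html(url, opener, timeout=5):
    """Download url with the given opener and return the decoded page text."""
    with opener.open(url, timeout=timeout) as response:
        return response.read().decode("utf-8", "ignore")
```

With a working proxy, `get_html("http://www.bilibili.com", build_opener_with_proxy("124.88.67.81", 80))` would play the role of `gethtml` in the Java version. The Python side needs no third-party package for this, whereas the Java version depends on the HttpClient jar.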
