Miscellaneous Notes on Getting Started with Web Crawlers

Sharry洗手溢

Last modified 2022-02-10 11:49:53

Category: Miscellaneous Notes. Tags: crawler, python, java

First published 2022-02-09 09:00:00

Article link: https://blog.csdn.net/weixin_43779187/article/details/122445770

Contents

Web crawlers
1. Crawler concepts
2. Getting started with crawlers
3. Crawler solutions
4. Summary

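Web crawlers

1. Crawler concepts

1.1 What is a crawler

A crawler is a program or script that automatically fetches information from web pages according to a set of rules.

1.2 What crawlers do

They fetch information.

1.3 Common uses

Collecting user information on large websites and combining it with algorithms to push targeted content to specific users.

Scientific research.

Data surveys conducted over the Internet.

And so on.

1.4 Caveats

To crawler experts, anything can be crawled.

Obey the Cybersecurity Law: learning to write crawlers should not end in a criminal sentence.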

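2. Getting started with crawlers

2.1 How crawlers work

Put simply, a crawler is a program that simulates the behavior of a browser and automates it to some degree, helping the user fetch information from the web.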
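To make "simulating a browser" concrete, here is a minimal Python sketch using only the standard library. It sends an HTTP request that carries browser-like headers and reads back the HTML. The names `BROWSER_USER_AGENT`, `build_browser_request`, and `fetch_html` are made up for this illustration, and the User-Agent value is just an assumed example string.

```python
import urllib.request

# Assumed browser-like User-Agent string; any real browser's value would do.
BROWSER_USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def build_browser_request(url):
    """Build a GET request that carries browser-like headers."""
    return urllib.request.Request(url, headers={
        "User-Agent": BROWSER_USER_AGENT,
        "Accept": "text/html",
    })


def fetch_html(url, timeout=10):
    """Send the request and return the response body as text."""
    request = build_browser_request(url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Use the charset declared by the server, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch_html("https://example.com")` (a placeholder URL, not one from this article) would return the page source as a string. The automation part of a crawler is then a matter of calling this in a loop over many URLs.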

2.2 Getting started with Java crawlers

2.2.1 HttpClient starter example


import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.NameValuePair;
import org.apache.commons.httpclient.methods.PostMethod;

// Uses the legacy Apache Commons HttpClient 3.x API.
public class RetrivePage {

    private static HttpClient httpClient = new HttpClient();

    // Configure the proxy server
    static {
        // IP address and port of the proxy server
        httpClient.getHostConfiguration().setProxy("172.17.18.84", 8080);
    }

    public static boolean downloadPage(String path) throws HttpException, IOException {
        // Create the POST method
        PostMethod postMethod = new PostMethod(path);

        // Set the POST parameters
        NameValuePair[] postData = new NameValuePair[2];
        postData[0] = new NameValuePair("name", "lietu");
        postData[1] = new NameValuePair("password", "*****");
        postMethod.addParameters(postData);

        try {
            // Execute the request and get the status code
            int statusCode = httpClient.executeMethod(postMethod);

            // For simplicity, only handle status code 200
            if (statusCode != HttpStatus.SC_OK) {
                return false;
            }

            // Derive the file name from the URL. The "index.html" fallback is an
            // assumption added here for URLs that end with '/'.
            String filename = path.substring(path.lastIndexOf('/') + 1);
            if (filename.isEmpty()) {
                filename = "index.html";
            }

            // Copy the response body to the file; read() returns -1 at end of stream
            try (InputStream input = postMethod.getResponseBodyAsStream();
                 OutputStream output = new FileOutputStream(filename)) {
                int tempByte;
                while ((tempByte = input.read()) != -1) {
                    output.write(tempByte);
                }
            }
            return true;
        } finally {
            // Return the connection to the client
            postMethod.releaseConnection();
        }
    }

    /** Test code */
    public static void main(String[] args) {
        // Fetch the Baidu home page and write it to a file
        try {
            RetrivePage.downloadPage("https://www.baidu.com/");
        } catch (HttpException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

2.2.2 jsoup starter example

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import java.io.IOException;
import java.net.URL;

/**
 * @Author: Sharry
 * @CreateTime: 2022/1/11 15:49
 * @Version: Version-1.0
 * Java jsoup crawler starter example
 */
public class ParseWebsite {

    public static void main(String[] args) throws IOException {
        // Target URL (placeholder; replace it with the site to crawl)
        String url = "https://www.xxxx.com/cn";

        // Fetch the page and parse it into a Document; the timeout is in milliseconds
        Document document = Jsoup.parse(new URL(url), 200000);

        // Select every element with the given class name
        Elements elements = document.getElementsByClass("ClassName");

        // For a first example, printing the inner HTML of the matches is enough
        System.out.println(elements.html());
    }
}
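For comparison with section 3.3, the same "pick elements by class name" step can be roughly sketched in Python with only the standard library. This is not a port of jsoup: the helper names `ClassTextCollector` and `extract_text_by_class` are invented here, the sketch collects plain text rather than inner HTML, and a match nested inside another match is folded into the outer one.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassTextCollector(HTMLParser):
    """Collect the text inside every element that carries a given class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0    # > 0 while the parser is inside a matching element
        self.texts = []   # one entry per matching element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0:
            self.depth += 1
        elif self.class_name in classes:
            self.depth = 1
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.texts[-1] += data


def extract_text_by_class(html, class_name):
    """Return the stripped text of each element with the given class."""
    collector = ClassTextCollector(class_name)
    collector.feed(html)
    collector.close()
    return [text.strip() for text in collector.texts]
```

In practice a third-party parser would likely be used instead; the point of the sketch is only that "get elements by class" boils down to walking the tag stream and tracking where a matching element starts and ends.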

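3. Crawler solutions

For a programmer at the learning stage, studying and understanding crawler solutions and algorithms is the core of building competitiveness.

3.1 Search strategies

Breadth-first, depth-first, and so on. Breadth-first and depth-first are old friends that come up often in data structures. The crawler search strategies closely resemble their data-structure namesakes: a breadth-first crawler searches horizontally first, visiting every link on the current level before moving on to the next level, while a depth-first crawler searches vertically first, following a chain of links downward before backtracking.

Other commonly used crawler search strategies include best-first search.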
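The difference between the two strategies is easiest to see on a small example. The sketch below runs both over a tiny in-memory link graph; `LINKS` is hypothetical data and `crawl_order` is a name invented for this illustration, with a dictionary lookup standing in for downloading a page and extracting its links.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "home": ["news", "blog"],
    "news": ["news/1", "news/2"],
    "blog": ["blog/1"],
    "news/1": [],
    "news/2": [],
    "blog/1": [],
}


def crawl_order(links, start, strategy="bfs"):
    """Return the order in which pages are visited under the given strategy."""
    frontier = deque([start])
    seen = set()
    order = []
    while frontier:
        # BFS takes the oldest URL (queue); DFS takes the newest (stack).
        page = frontier.popleft() if strategy == "bfs" else frontier.pop()
        if page in seen:
            continue
        seen.add(page)
        order.append(page)
        children = links.get(page, [])
        if strategy == "dfs":
            # Reverse so the first link on the page is followed first.
            children = list(reversed(children))
        for child in children:
            if child not in seen:
                frontier.append(child)
    return order


print(crawl_order(LINKS, "home", "bfs"))
# ['home', 'news', 'blog', 'news/1', 'news/2', 'blog/1']
print(crawl_order(LINKS, "home", "dfs"))
# ['home', 'news', 'news/1', 'news/2', 'blog', 'blog/1']
```

The only difference between the two is which end of the frontier the next URL comes from. A best-first crawler would replace the deque with a priority queue ordered by some relevance score.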

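3.2 Data analysis

At the macro level, production crawlers generally run a topology analysis of the target network first, using a topology analysis algorithm, and only then crawl it.

At the micro level, the crawled data is analyzed and organized with a variety of algorithms. Research on text content analysis methods and algorithms combined with artificial intelligence is currently widespread in universities.

3.3 Pros and cons of Python crawlers

3.3.1 Advantages of Python crawlers

Cross-platform, strong visualization, support for complex networking, and strength in fields such as scientific computing.

Put simply, Python has more third-party libraries and packages that support crawling than other languages, and the language itself is fairly concise.

3.3.2 Disadvantages of Python crawlers

Some Python-based crawling strategies are inherently flawed.

Reference blog

Because Python's syntax is relatively simple, relying on more third-party packages is hard to avoid once the business logic and features become too complex.

Overall, though, Python is currently the better choice for writing crawlers.

3.3.3 Python crawler example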

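import urllib.parse
import urllib.request


def loadPage(url):
    """
    Send a request to the given URL and return the HTML as text.
    :param url: full URL of the page to fetch
    :return: response body decoded as UTF-8
    """
    request = urllib.request.Request(url)
    html = urllib.request.urlopen(request).read()
    return html.decode('utf-8')


def writePage(html, filename):
    """
    Write the HTML to a local file.
    :param html: page content returned by the server
    :param filename: name of the file to write
    """
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(html)
    print('-' * 30)


def tiebaSpider(url, beginPage, endPage):
    """
    Tieba crawler scheduler: builds and processes the URL of each page.
    :param url: base URL that already contains the kw query parameter
    :param beginPage: first page to fetch
    :param endPage: last page to fetch (inclusive)
    """
    for page in range(beginPage, endPage + 1):
        # The pn offset advances by 50 for each page
        pn = (page - 1) * 50
        fullurl = url + "&pn=" + str(pn)
        print(fullurl)
        filename = 'page_' + str(page) + '.html'
        # Fetch the paginated URL, not the base URL
        html = loadPage(fullurl)
        writePage(html, filename)


if __name__ == "__main__":
    kw = input('Enter the name of the Tieba forum to crawl: ')
    beginPage = int(input('Enter the first page: '))
    endPage = int(input('Enter the last page: '))
    url = 'https://tieba.baidu.com/f?'
    key = urllib.parse.urlencode({'kw': kw})
    fullurl = url + key
    tiebaSpider(fullurl, beginPage, endPage)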
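The URL arithmetic in the example above can be checked without touching the network. The sketch below isolates it in a small helper; `page_url` and `PAGE_SIZE` are names invented for this illustration, while the base URL and the step of 50 are taken from the example.

```python
import urllib.parse

BASE_URL = "https://tieba.baidu.com/f?"
PAGE_SIZE = 50  # pn step per page, as in the example above


def page_url(keyword, page):
    """Build the list URL for a 1-based page number of the given forum."""
    query = urllib.parse.urlencode({"kw": keyword, "pn": (page - 1) * PAGE_SIZE})
    return BASE_URL + query


print(page_url("python", 1))
# https://tieba.baidu.com/f?kw=python&pn=0
print(page_url("爬虫", 3))
# https://tieba.baidu.com/f?kw=%E7%88%AC%E8%99%AB&pn=100
```

Note how `urlencode` percent-encodes the Chinese keyword as UTF-8 bytes, which is why the original script passes the keyword through it rather than concatenating the raw string into the URL.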