A Brief Look at Java and Python Crawlers

微醺小熊

Category: Crawlers. Tags: python, java, crawler

Article link: https://blog.csdn.net/SayFz/article/details/78930416

 
A crawler is a program that automatically fetches the information you can see on a web page and saves it locally.

 
Common frameworks:

Java: WebMagic, http://webmagic.io/docs/zh/

Python: Scrapy, http://blog.csdn.net/sunnyxiaohu/article/details/50787430

 
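As AJAX keeps spreading and single-page application frameworks such as AngularJS appear, more and more pages are rendered by JavaScript. For a crawler these pages are a nuisance: extracting the raw HTML alone often yields no useful information. How can such pages be handled? Broadly, there are two approaches: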

 
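1. At fetch time, embed a browser engine in the crawler, let it execute the JavaScript and render the page, then scrape the result. Tools for this include Selenium, HtmlUnit and PhantomJS. They are noticeably slower than plain HTTP requests and not always stable, but the extraction rules are written exactly as for a static page.

2. The data in a JavaScript-rendered page also comes from the backend, almost always through AJAX, so analysing the AJAX traffic and finding the request that returns the data is another workable approach. Such interfaces are also less likely to change than the page layout. The drawback is that finding the request and reproducing it is comparatively hard and takes more analysis experience.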

 
On Windows any mainstream browser can be driven. A Linux server without a display can only run a headless browser; PhantomJS is a common choice.
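The second approach can be sketched with the Python standard library alone. This is only an illustration under stated assumptions: the endpoint `https://example.com/api/articles`, its `page` and `size` parameters, and the `items`/`title`/`url` JSON layout are all hypothetical. The real request has to be read from the browser's developer tools.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoint; find the real one in the browser's
# developer tools (Network tab, XHR filter).
API_URL = "https://example.com/api/articles"


def build_request(page, page_size=20):
    """Build a GET request shaped like the one the page's JavaScript sends."""
    query = urlencode({"page": page, "size": page_size})
    headers = {
        "User-Agent": "Mozilla/5.0",
        # Some backends use this header to recognise AJAX calls.
        "X-Requested-With": "XMLHttpRequest",
    }
    return Request(API_URL + "?" + query, headers=headers)


def parse_items(body):
    """Pull (title, url) pairs out of the assumed JSON layout."""
    payload = json.loads(body)
    return [(item["title"], item["url"]) for item in payload["items"]]


def fetch_page(page):
    """Send the request and parse the response (needs network access)."""
    with urlopen(build_request(page), timeout=10) as response:
        return parse_items(response.read().decode("utf-8"))
```

Keeping request building and response parsing in separate functions makes it easy to adjust either one when the interface turns out to differ from the guess.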

 
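Selenium operations in Python:

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# Selenium 3 API: webdriver.PhantomJS and find_element_by_* were removed in Selenium 4
dcap = dict(DesiredCapabilities.PHANTOMJS)  # copy the default PhantomJS capabilities
# set the User-Agent; use whatever browser string you need
dcap['phantomjs.page.settings.userAgent'] = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:25.0)' ' Gecko/20100101 Firefox/25.0')
browser = webdriver.PhantomJS(desired_capabilities=dcap)  # start the browser with these capabilities
browser.get(url)  # open the URL
element = browser.find_element_by_xpath('//div[@id="nav_content_5"]/a')  # locate an element
browser.find_element_by_class_name("query").send_keys(u'百度')  # locate an element and type into it (e.g. a search box)
browser.find_element_by_class_name("oBtn").click()  # locate an element and click it (the most common case)
ActionChains(browser).move_to_element(element).perform()  # hover the mouse over the element
handles = browser.window_handles  # handles of all open windows
browser.switch_to.window(handles[1])  # switch to a window (e.g. after a click opened a new tab)
browser.find_element_by_link_text("下一页")  # locate a link by its visible text ("下一页" means "next page")
data = browser.page_source  # get the page source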

 
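Selenium operations in Java:

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.interactions.Actions;

WebDriver driver = new ChromeDriver();
driver.get(url);
driver.findElement(By.xpath("//*[@id=\"username\"]")).sendKeys(userName); // locate an element and type into it
driver.findElement(By.xpath("//*[@id=\"btn-login\"]")).click(); // locate an element and click it
new Actions(driver).moveToElement(driver.findElement(By.xpath("//*[@id='topPanel']/ul/li[3]/a"))).perform(); // hover the mouse over the element
String source = driver.getPageSource(); // get the page source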

 
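Extraction details:

Data can be pulled out of the page source either by locating elements with XPath or with regular expressions.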

 
Python:

import re

pattern = '<li>.*?<a href="(.*?)">(.*?)</a>.*?<span.*?>(.*?)</span>'
data = browser.page_source
# re.S (DOTALL) lets '.' match newlines too, so one match can span several lines
contents = re.findall(pattern, data, re.S)
for content in contents:
    # process each match here;
    # content[0], content[1] and content[2] are the first, second and third groups
    pass
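To see what `re.S` changes, here is a small self-contained run of the same pattern. The HTML is made up for the illustration and simply has the shape the pattern expects.

```python
import re

# Made-up markup in the shape the pattern expects.
SAMPLE_HTML = """
<ul>
<li>
  <a href="/post/1">First post</a>
  <span class="date">2017-12-01</span>
</li>
<li>
  <a href="/post/2">Second post</a>
  <span class="date">2017-12-02</span>
</li>
</ul>
"""

PATTERN = r'<li>.*?<a href="(.*?)">(.*?)</a>.*?<span.*?>(.*?)</span>'

# With re.S, '.' also matches the newlines between the tags.
with_dotall = re.findall(PATTERN, SAMPLE_HTML, re.S)

# Without it, '.' stops at each line end, so nothing matches here.
without_dotall = re.findall(PATTERN, SAMPLE_HTML)

for href, title, date in with_dotall:
    print(href, title, date)
```

Because the pattern has three groups, `re.findall` returns one 3-tuple per match, which is why the loop can unpack each result directly.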

 
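Java:

Regex:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String regexNumber = "&coNum=(.*?)&";
Pattern pattern = Pattern.compile(regexNumber, Pattern.DOTALL);
Matcher matcher = pattern.matcher(url);
while (matcher.find()) {
    // process each match here;
    // matcher.group(0) is the whole matched string, matcher.group(1) is the first group,
    // and matcher.group(2) would be the second group if the pattern had one
}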
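One thing to watch with the pattern `&coNum=(.*?)&`: it needs an `&` on both sides, so it finds nothing when `coNum` is the last parameter in the URL. The sketch below shows this in Python and, as an alternative that is not in the original code, parses the query string instead. The URLs are hypothetical; only the `coNum` parameter name comes from the snippet above.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Hypothetical URLs; only the coNum parameter name comes from the article.
URL_MIDDLE = "http://example.com/detail?type=1&coNum=12345&page=2"
URL_LAST = "http://example.com/detail?type=1&coNum=12345"


def co_num_by_regex(url):
    """Same pattern as the Java snippet: needs '&' on both sides."""
    match = re.search(r"&coNum=(.*?)&", url)
    return match.group(1) if match else None


def co_num_by_query(url):
    """Parse the query string instead of pattern matching."""
    values = parse_qs(urlsplit(url).query).get("coNum")
    return values[0] if values else None


for url in (URL_MIDDLE, URL_LAST):
    print(co_num_by_regex(url), co_num_by_query(url))
```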

 
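XPath:

import java.util.List;
import us.codecraft.webmagic.selector.Html; // WebMagic's HTML selector

Html html = new Html(source);
// get a single value
String name = html.xpath("//*[@id=\"custName\"]/text()").toString();
// get a list of values
List<String> strings = html.xpath("//*[@id=\"content\"]/ul/li").all();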
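For comparison, the same two lookups can be sketched in Python with the standard library's `xml.etree.ElementTree`. Two assumptions apply: the markup below is made up and well-formed (real-world HTML usually needs a more forgiving parser), and ElementTree supports only a subset of XPath, so the text is read from `.text` rather than with a `text()` step.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup; the ids mirror the Java snippet.
SOURCE = """
<html><body>
  <span id="custName">Alice</span>
  <div id="content">
    <ul><li>first</li><li>second</li></ul>
  </div>
</body></html>
"""

root = ET.fromstring(SOURCE)

# Single value: ElementTree has no text() step, so read .text instead.
name = root.find(".//*[@id='custName']").text

# List of values: one entry per <li> under the element with id="content".
items = [li.text for li in root.findall(".//*[@id='content']/ul/li")]

print(name, items)
```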

 
Appendix: 1. How I understand regex groups

Each pair of parentheses () is one group.
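A short Python illustration of how groups are numbered. The date string is made up. One refinement to the rule of thumb: only capturing parentheses count, so a `(?:...)` pair does not create a group.

```python
import re

# Groups are numbered by the position of their opening parenthesis.
match = re.search(r"((\d{4})-(\d{2}))-(?:\d{2})", "posted 2017-12-29")

whole = match.group(0)         # the entire match
year_month = match.group(1)    # outer group
year = match.group(2)          # first nested group
month = match.group(3)         # second nested group
group_count = match.re.groups  # (?:...) is non-capturing and not counted

print(whole, year_month, year, month, group_count)
```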

 
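2. A quick way to get an XPath

In the browser, right-click the element and choose Inspect; the developer tools jump to its markup. Right-click that node and choose Copy XPath.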

 
3. Java code for downloading a file through Chrome

String downloadFilepath = "D:\\test\\excel"; // directory the downloads are saved to
HashMap<String, Object> chromePrefs = new HashMap<String, Object>();
chromePrefs.put("profile.default_content_settings.popups", 0);
chromePrefs.put("download.default_directory", downloadFilepath);
ChromeOptions options = new ChromeOptions();
options.setExperimentalOption("prefs", chromePrefs);
options.addArguments("--test-type");
DesiredCapabilities cap = DesiredCapabilities.chrome();
cap.setCapability(CapabilityType.ACCEPT_SSL_CERTS, true);
cap.setCapability(ChromeOptions.CAPABILITY, options);
WebDriver driver = new ChromeDriver(cap);
// ----------------------- assorted fancy operations omitted here -----------------------
JavascriptExecutor js = (JavascriptExecutor) driver;
// click the button through JavaScript
js.executeScript("arguments[0].click();", driver.findElement(By.xpath("//*[@id=\"content\"]/div[1]/div[5]/input[1]")));
// regular WebDriver click on the same button
driver.findElement(By.xpath("//*[@id=\"content\"]/div[1]/div[5]/input[1]")).click();
