Notes on Implementing Batch, Scheduled Scraping of WeChat Official Account Articles


吃西兰花的man


Article link: https://blog.csdn.net/DC3987/article/details/84324745


Notes and preparation before scraping


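The scraper is written in Java. This article does not paste the project's full code; it provides only the core code and an explanation of the scraping approach.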

Scraping the data

The articles are scraped from the Sogou WeChat search site (weixin.sogou.com).
[Figure: the Sogou WeChat search site]

The scraping approach is as follows

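Scraping a WeChat official account usually starts from the account's id as the search keyword. A URL of the form url + keyword jumps straight to the search results for the account we want to scrape, where keyword is the name or the id of the official account;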

// Entry URL for the Sogou WeChat search
       String sogou_search_url = "http://weixin.sogou.com/weixin?type=1&query="
               + keyword + "&ie=utf8&s_from=input&_sug_=n&_sug_type_=";
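A keyword that contains spaces, `&`, or non-ASCII characters (account names are often Chinese) can produce a malformed query string when concatenated directly. The sketch below shows one way to percent-encode the keyword first. The class and method names (`SogouSearchUrl`, `build`) are hypothetical and not part of the original project.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not from the original project.
final class SogouSearchUrl {
    private static final String PREFIX = "http://weixin.sogou.com/weixin?type=1&query=";
    private static final String SUFFIX = "&ie=utf8&s_from=input&_sug_=n&_sug_type_=";

    // Percent-encodes the keyword as UTF-8 before placing it in the query string.
    static String build(String keyword) {
        try {
            return PREFIX + URLEncoder.encode(keyword, "UTF-8") + SUFFIX;
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException("UTF-8 is not available", e);
        }
    }

    public static void main(String[] args) {
        // The space becomes "+" and "&" becomes "%26".
        System.out.println(build("a b&c"));
    }
}
```

The result of `build(keyword)` can then be used wherever `sogou_search_url` is used.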

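To get past the site's first-line crawler detection, we can use Selenium (a browser automation testing framework) so that the crawler looks like a normal browser. We use Chrome; note that the Chrome version must match the version of the webdriver in use;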

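       ChromeOptions chromeOptions = new ChromeOptions();
       // Start maximized, in preparation for the anti-scraping handling later on
       chromeOptions.addArguments("--start-maximized");
       // chromedriver holds the path to the chromedriver executable
       System.setProperty("webdriver.chrome.driver", chromedriver);
       WebDriver webDriver = new ChromeDriver(chromeOptions);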

On the official account list page, get the link to the target official account.

              // Get the official accounts listed on the current page
              List<WebElement> weixin_list = webDriver
                      .findElements(By.cssSelector("div[class='txt-box']"));
              // Find the link of the account whose name matches the keyword
              String weixin_url = "";
              // "<" rather than "<=", so the index never runs past the end of the list
              for (int i = 0; i < weixin_list.size(); i++) {
                  WebElement link = weixin_list.get(i)
                          .findElement(By.cssSelector("p[class='tit']"))
                          .findElement(By.tagName("a"));
                  if (link.getText().equals(keyword)) {
                      weixin_url = link.getAttribute("href");
                      break;
                  }
              }

              // Navigate only when a matching account was found
              if (!weixin_url.isEmpty()) {
                  webDriver.get(weixin_url);
              }

This leads to the official account's article list page.
The information for each article is extracted by analyzing the page elements;

             // Get the list of WeChat articles
             List<WebElement> weixin_article_list = webDriver.findElements(
                     By.cssSelector("div[class='weui_media_box appmsg']"));
             for (WebElement weixin_article : weixin_article_list) {

                 // Get the article thumbnail URL. The fixed offsets assume a style value
                 // shaped like: background-image: url("...");
                 // (23 leading characters and 3 trailing characters are cut off)
                 String thumbNail = weixin_article
                         .findElement(By
                                 .cssSelector("span[class='weui_media_hd']"))
                         .getAttribute("style").substring(23);
                 String thumbNailUrl = thumbNail.substring(0,
                         thumbNail.length() - 3);
                 // Get the article URL (read from the element's "hrefs" attribute)
                 String articleUrl = "http://mp.weixin.qq.com"
                         + weixin_article
                                 .findElement(By.cssSelector(
                                         "h4[class='weui_media_title']"))
                                 .getAttribute("hrefs");
                 // Get the article summary
                 String desc = weixin_article
                         .findElement(By
                                 .cssSelector("p[class='weui_media_desc']"))
                         .getText();
                 // Get the article date
                 String dateString = weixin_article
                         .findElement(By.cssSelector(
                                 "p[class='weui_media_extra_info']"))
                         .getText();
                 Date date = null;
                 try {
                     date = new SimpleDateFormat("yyyy年MM月dd日")
                             .parse(dateString);
                 } catch (ParseException e) {
                     // TODO finish the exception handling
                     e.printStackTrace();
                 }

                 Date today = new Date();
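The fixed `substring` offsets in the fragment above depend on the exact shape of the `style` value, and `SimpleDateFormat` yields a `Date` that still carries a time of day. As a hedged alternative, the sketch below pulls the URL out with a regular expression and parses the list date into a `LocalDate`, which compares cleanly against today's date. The class and method names are hypothetical, and the sample strings in `main` are assumptions about what the page returns.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not from the original project.
final class ArticleFieldSketch {
    // Matches url(...), with or without quotes around the address.
    private static final Pattern CSS_URL =
            Pattern.compile("url\\(\\s*[\"']?(.*?)[\"']?\\s*\\)");
    // Same date layout as the SimpleDateFormat pattern in the original code,
    // but accepting one- or two-digit months and days.
    private static final DateTimeFormatter LIST_DATE =
            DateTimeFormatter.ofPattern("yyyy年M月d日");

    // Returns the address inside the first url(...) of a style value, if any.
    static Optional<String> thumbnailUrl(String style) {
        if (style == null) {
            return Optional.empty();
        }
        Matcher m = CSS_URL.matcher(style);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Parses a date such as 2018年11月21日; throws DateTimeParseException otherwise.
    static LocalDate parseListDate(String text) {
        return LocalDate.parse(text.trim(), LIST_DATE);
    }

    public static void main(String[] args) {
        System.out.println(thumbnailUrl("background-image: url(\"http://example.com/t.jpg\");"));
        LocalDate published = parseListDate("2018年11月21日");
        System.out.println(published.equals(LocalDate.now()) ? "published today" : "older article");
    }
}
```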

Batch scraping

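Batch scraping is implemented with Spring Batch, a lightweight, comprehensive batch-processing framework designed to help enterprises build robust, efficient batch applications.
Spring Batch is organized into parts such as Reader, Processor, Writer, and Listener. Because the scraping runs in batches, we can keep a history list of scraped articles and use it to decide whether an article is new, and therefore whether to scrape it.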
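The history check can be sketched as a set of keys that have already been seen. This is only an illustration: the original does not show how its history list is stored, so the in-memory set, the class name `ScrapeHistorySketch`, and the choice of the article URL as the key are all assumptions. If the article links carry temporary parameters, a key such as title plus date may be more stable.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in for the scrape history list.
final class ScrapeHistorySketch {
    private final Set<String> seenKeys = ConcurrentHashMap.newKeySet();

    // Returns true the first time a key is offered and false on every later call.
    boolean markIfNew(String articleKey) {
        return seenKeys.add(articleKey);
    }

    public static void main(String[] args) {
        ScrapeHistorySketch history = new ScrapeHistorySketch();
        System.out.println(history.markIfNew("article-1")); // first sighting
        System.out.println(history.markIfNew("article-1")); // already in the history
    }
}
```

In Spring Batch, an `ItemProcessor` that returns `null` filters the item out so it never reaches the writer, so a check like `markIfNew` could sit in the processor step.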


@Configuration
@EnableBatchProcessing
@EnableScheduling
public class WechatOfficialAccountsStoryImportBatchConfig {
   
   @Bean
   public WechatOfficialAccountsStoryReader<WechatLink> wechatOfficialAccountsStoryReader() throws BusinessException, IOException {
       return new WechatOfficialAccountsStoryReader<WechatLink>();
   }
   
   @Bean
   public ItemProcessor<LinkHolder<WechatLink>, NrStory> wechatOfficialAccountsStoryProcessor() {
       return new WechatOfficialAccountsStoryProcessor();
   }

   @Bean
   public ItemWriter<NrStory> wechatOfficialAccountsStoryWriter() {
       return new WechatOfficialAccountsStoryWriter();
   }
   
   @Bean
   public Job importStoryWechatOfficialAccountsJob(JobBuilderFactory jobs, StepBuilderFactory stepBuilderFactory) throws BusinessException, IOException {
       return jobs.get("importStoryWechatOfficialAccountsJob")
               .incrementer(new RunIdIncrementer())
               .listener(listener(wechatOfficialAccountsStoryReader()))
               .flow(importStoryWechatOfficialAccountsStep(stepBuilderFactory))
               .end()
               .build();
   }
   
   @Bean
   public Step importStoryWechatOfficialAccountsStep(StepBuilderFactory stepBuilderFactory) throws BusinessException, IOException {
       return stepBuilderFactory.get("importStoryWechatOfficialAccountsStep")
               .<LinkHolder<WechatLink>, NrStory> chunk(1).faultTolerant()
               .skip(SkippingException.class).skip(BusinessException.class).skipLimit(100000)
               .reader(wechatOfficialAccountsStoryReader())
               .processor(wechatOfficialAccountsStoryProcessor())
               .writer(wechatOfficialAccountsStoryWriter())
               .build();
   }
   
   @Bean
   public WechatOfficialAccountsStoryJobListener listener(WechatOfficialAccountsStoryReader<WechatLink> reader) {
       return new WechatOfficialAccountsStoryJobListener(reader);
   }
}

Scheduled scraping

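The batch job can be run on a schedule with Spring's @Scheduled annotation (enabled by @EnableScheduling on the configuration class); see the @Scheduled documentation for the cron rules.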


@Scheduled(cron = "0 0 7,9,11,13,15,17,23 * * ?")
    public synchronized void executeWechatOfficialAccountsStoryImport() throws BusinessException, IOException {
        try {
            long begin = System.currentTimeMillis();
            LOG.info("############## WeChat official account scheduled job started ##############");
            Job importStoryJob = wechatOfficialAccountsConfig.importStoryWechatOfficialAccountsJob(
                    jobs, stepBuilderFactory);
            // A fresh "time" parameter makes every run a new job instance
            JobParameters jobParameters = new JobParametersBuilder().addLong(
                    "time", System.currentTimeMillis()).toJobParameters();
            JobExecution jobExecution = jobLauncher.run(importStoryJob,
                    jobParameters);
            LOG.info("jobExecution" + jobExecution);
            long end = System.currentTimeMillis();
            LOG.info("############## WeChat official account scheduled job finished in [" + (end - begin) / 1000 + "] seconds ##############");

        } catch (JobExecutionException e) {
            // jobLauncher.run declares several checked exceptions; they all extend
            // JobExecutionException, so one catch clause covers them
            LOG.error("Error while importing WeChat official account articles! " + e.getLocalizedMessage());
        }
    }
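The cron expression above has six fields (second, minute, hour, day of month, month, day of week), so it fires at minute 0, second 0 of the hours 7, 9, 11, 13, 15, 17, and 23 every day. To make that schedule concrete, here is a small stdlib-only sketch that computes the next fire time for that hour list. It does not use Spring's own cron parser, and the class and method names are hypothetical.

```java
import java.time.LocalDateTime;

// Hypothetical illustration of the schedule; Spring evaluates the cron expression itself.
final class CronHoursSketch {
    // The hour field of the cron expression, in ascending order.
    private static final int[] HOURS = {7, 9, 11, 13, 15, 17, 23};

    // Returns the first scheduled time strictly after "now".
    static LocalDateTime nextRun(LocalDateTime now) {
        for (int hour : HOURS) {
            LocalDateTime candidate = now.toLocalDate().atTime(hour, 0);
            if (candidate.isAfter(now)) {
                return candidate;
            }
        }
        // Past 23:00, so the next run is the first slot of the following day.
        return now.toLocalDate().plusDays(1).atTime(HOURS[0], 0);
    }

    public static void main(String[] args) {
        System.out.println(nextRun(LocalDateTime.of(2024, 1, 1, 8, 30)));  // 2024-01-01T09:00
        System.out.println(nextRun(LocalDateTime.of(2024, 1, 1, 23, 30))); // 2024-01-02T07:00
    }
}
```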

Some workarounds for the anti-scraping mechanism

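When the scraping frequency is high, the page asks for a captcha. I tried a number of image-recognition jar packages, but the results were not very good. We ended up using a third-party captcha-solving platform; sometimes people still do better than automation. The platform we chose is Ruokuai (若快打码); its official site documents how to use it, and it provides third-party API interfaces. Once the captcha is recognized, it is typed into the page and scraping continues.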
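Because a recognized captcha can be wrong and the page may ask again, it can help to cap the number of attempts so the scraper cannot loop indefinitely. The sketch below is a minimal, stdlib-only bounded retry. The class and method names are hypothetical, and `attempt` stands for one round of "screenshot, recognize, submit, check whether the captcha page is gone".

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper, not from the original project.
final class CaptchaRetrySketch {
    // Runs "attempt" until it reports success or maxAttempts is reached.
    static boolean solveWithRetries(BooleanSupplier attempt, int maxAttempts) {
        for (int i = 1; i <= maxAttempts; i++) {
            if (attempt.getAsBoolean()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated attempt that only succeeds on the third try.
        boolean solved = solveWithRetries(() -> ++calls[0] == 3, 5);
        System.out.println("solved=" + solved + " after " + calls[0] + " attempts");
    }
}
```

When `solveWithRetries` returns false, the job could skip the account for this run and try again at the next scheduled time.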


// Get the title of the current page
                String title = webDriver.getTitle();
                // "请输入验证码" ("Please enter the verification code") is the title of the captcha page
                while (title.contentEquals("请输入验证码")) {
                    LOG.info("############## Access to the official account failed, a captcha is required ##############");
                    LOG.info("############## Starting captcha recognition ##############");

                    // Take a full-page screenshot
                    File srcFile = ((TakesScreenshot) webDriver)
                            .getScreenshotAs(OutputType.FILE);
                    Image src = Toolkit.getDefaultToolkit()
                            .getImage(srcFile.getPath());
                    BufferedImage originalImage = WechatOfficialAccountStoryJobUtils
                            .toBufferedImage(src);

                    WebElement verify = webDriver
                            .findElement(By.className("page_verify"));
                    Point location = verify.getLocation();
                    Dimension size = verify.getSize();

                    // Crop the captcha out of the screenshot
                    int getX = location.getX();
                    int getY = location.getY();
                    int getWidth = size.getWidth();
                    int getHeight = size.getHeight();
                    BufferedImage verifyImg = originalImage.getSubimage(
                            getX + getWidth / 3, getY + getHeight / 3,
                            getWidth / 2, getHeight / 3);

                    // File name and path of the captcha image
                    String dirName = "D:/verifyImg";
                    File dir = new File(dirName);
                    Date time = new Date();
                    String verifyImgPath = dirName + "/" + time.getTime()
                            + ".jpg";

                    // Check whether the captcha folder exists
                    if (dir.exists()) {
                        LOG.info("############## Directory already exists, nothing to create ##############");
                    } else {
                        LOG.info("############## Directory does not exist, creating it ##############");
                        if (dir.mkdirs()) {
                            LOG.info("############## Directory created ##############");
                        }
                    }

                    // Save the captcha image
                    try {
                        ImageIO.write(verifyImg, "jpg",
                                new File(verifyImgPath));
                    } catch (IOException e) {
                        // TODO finish the exception handling
                        e.printStackTrace();
                    }

                    LOG.info("############## Captcha image saved ##############");

                    // Recognize the captcha through the captcha-solving platform
                    String ret = WechatOfficialAccountStoryJobUtils
                            .createByPost(ruokuaiUsername, ruokuaiPassword,
                                    ruokuaiWeixintypeid, ruokuaiTimeout,
                                    ruokuaiSoftid, ruokuaiSoftkey,
                                    verifyImgPath);
                    // The four captcha characters sit at a fixed offset in the response
                    String verifyLetter = ret.substring(37, 41);
                    LOG.info("############## Captcha recognized, the captcha is: ##############");
                    LOG.info(verifyLetter);

                    // Type in the captcha
                    WebElement input = webDriver.findElement(By.id("input"));
                    input.sendKeys(verifyLetter);

                    WebElement bt = webDriver.findElement(By.id("bt"));
                    bt.click();

                    // Refresh the page
                    webDriver.navigate().refresh();
                    title = webDriver.getTitle();
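The crop step in the loop above passes a computed rectangle to `getSubimage`, which throws a `RasterFormatException` if the rectangle extends outside the image. That can happen when the screenshot and the element coordinates do not line up, which is part of why the browser is started maximized. The sketch below keeps the same thirds-and-halves arithmetic but clamps the region to the screenshot bounds first. The class and method names are hypothetical.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

// Hypothetical helper, not from the original project.
final class CaptchaCropSketch {
    // Same arithmetic as the original crop, clamped to the screenshot bounds.
    static Rectangle captchaRegion(int x, int y, int width, int height,
                                   int imageWidth, int imageHeight) {
        Rectangle wanted = new Rectangle(x + width / 3, y + height / 3,
                width / 2, height / 3);
        return wanted.intersection(new Rectangle(0, 0, imageWidth, imageHeight));
    }

    // Crops the captcha area, failing clearly when it lies entirely off-screen.
    static BufferedImage crop(BufferedImage screenshot, int x, int y,
                              int width, int height) {
        Rectangle r = captchaRegion(x, y, width, height,
                screenshot.getWidth(), screenshot.getHeight());
        if (r.isEmpty()) {
            throw new IllegalArgumentException("captcha region lies outside the screenshot");
        }
        return screenshot.getSubimage(r.x, r.y, r.width, r.height);
    }

    public static void main(String[] args) {
        BufferedImage screenshot = new BufferedImage(1000, 1000, BufferedImage.TYPE_INT_RGB);
        BufferedImage captcha = crop(screenshot, 300, 300, 600, 300);
        System.out.println(captcha.getWidth() + "x" + captcha.getHeight()); // 300x100
    }
}
```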