A Twitch live-stream URL search crawler built on Java Selenium and Spring Boot

1. Runtime environment

Selenium needs chromedriver, so download it first. The table below shows which Google Chrome versions each chromedriver version supports:

Driver version                     Supported browser versions
ChromeDriver v2.43 (2018-10-16)    Chrome v69-71
ChromeDriver v2.42 (2018-09-13)    Chrome v68-70
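
As a small illustration of reading the table, the sketch below looks up a driver version for a given Chrome major version. The class and method names are hypothetical, and only the two rows listed above are encoded; where the ranges overlap it simply prefers the newer driver.

```java
import java.util.Optional;

final class DriverVersionTable {
    // Rows copied from the table above, newest driver first.
    private static final String[] DRIVERS = {"2.43", "2.42"};
    // Inclusive {min, max} Chrome major versions supported by each driver.
    private static final int[][] RANGES = {{69, 71}, {68, 70}};

    /** Returns the newest listed driver whose range contains the given Chrome major version. */
    static Optional<String> driverFor(int chromeMajor) {
        for (int i = 0; i < DRIVERS.length; i++) {
            if (chromeMajor >= RANGES[i][0] && chromeMajor <= RANGES[i][1]) {
                return Optional.of(DRIVERS[i]);
            }
        }
        return Optional.empty(); // version not covered by the two rows above
    }

    public static void main(String[] args) {
        System.out.println(driverFor(70).orElse("not listed"));
        System.out.println(driverFor(68).orElse("not listed"));
        System.out.println(driverFor(72).orElse("not listed"));
    }
}
```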

Presumably nobody is still using older versions. chromedriver can be downloaded from:
http://npm.taobao.org/mirrors/chromedriver/

2. What the program does

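The program types a given keyword into Twitch's search box and returns the URL of the matching live stream.
It was written in IDEA and is meant to be packaged and run as a war, so the project was created with the settings shown in the screenshot below; the key point is switching the packaging from jar to war:
[Screenshot: IDEA project creation settings with packaging set to war]
The project's pom file is as follows: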

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
	<modelVersion>4.0.0</modelVersion>
	<parent>
		<groupId>org.springframework.boot</groupId>
		<artifactId>spring-boot-starter-parent</artifactId>
		<version>2.1.1.RELEASE</version>
		<relativePath/> <!-- lookup parent from repository -->
	</parent>
	<groupId>learn</groupId>
	<artifactId>twlive</artifactId>
	<version>0.0.1-SNAPSHOT</version>
	<packaging>war</packaging>
	<name>twlive</name>
	<description>Demo project for Spring Boot</description>

	<properties>
		<java.version>1.8</java.version>
	</properties>

	<dependencies>

		<dependency>
			<groupId>org.springframework.boot</groupId>
			<artifactId>spring-boot-starter-web</artifactId>
		</dependency>

		<dependency>
			<groupId>org.springframework.boot</groupId>
			<artifactId>spring-boot-starter-tomcat</artifactId>
			<scope>provided</scope>
		</dependency>

		<dependency>
			<groupId>org.springframework.boot</groupId>
			<artifactId>spring-boot-starter-test</artifactId>
			<scope>test</scope>
		</dependency>

		<dependency>
			<groupId>org.seleniumhq.selenium</groupId>
			<artifactId>selenium-java</artifactId>
			<version>3.141.59</version>
		</dependency>

		<dependency>
			<groupId>org.jsoup</groupId>
			<artifactId>jsoup</artifactId>
			<version>1.10.3</version>
		</dependency>

	</dependencies>

	<build>
		<plugins>
			<plugin>
				<groupId>org.springframework.boot</groupId>
				<artifactId>spring-boot-maven-plugin</artifactId>
			</plugin>
		</plugins>
	</build>

</project>

The corresponding application.properties file is shown below. It holds the absolute path of chromedriver on the local system:

spring.jmx.default-domain=twitch-live
# Path to the Chrome webdriver
web.driver = C:\\bin\\chromedriver.exe
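
The doubled backslashes in the path matter because a backslash is an escape character in .properties files. The sketch below shows this with the JDK's own java.util.Properties as a stand-in; Spring Boot uses its own loader, so treat this as an illustration of the escaping rule rather than of Spring's internals. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class DriverPathDemo {
    /** Parses .properties text and returns the value stored under web.driver. */
    static String readDriverPath(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("web.driver");
    }

    public static void main(String[] args) {
        // Java source adds its own escaping layer: "\\\\" here is "\\" in the file.
        String escaped = "web.driver = C:\\\\bin\\\\chromedriver.exe\n";
        System.out.println(readDriverPath(escaped));   // C:\bin\chromedriver.exe

        // With single backslashes in the file, the separators are silently dropped.
        String unescaped = "web.driver = C:\\bin\\chromedriver.exe\n";
        System.out.println(readDriverPath(unescaped)); // C:binchromedriver.exe
    }
}
```

Forward slashes (C:/bin/chromedriver.exe) are a common way to sidestep the escaping altogether.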

The two default startup classes are shown below. Class 1:

package learn.twlive;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.scheduling.annotation.EnableScheduling;

@SpringBootApplication
@EnableScheduling
public class TwliveApplication {

	public static void main(String[] args) {
		SpringApplication.run(TwliveApplication.class, args);
	}
}

Class 2:

package learn.twlive;

import org.springframework.boot.builder.SpringApplicationBuilder;
import org.springframework.boot.web.servlet.support.SpringBootServletInitializer;

public class ServletInitializer extends SpringBootServletInitializer {
	@Override
	protected SpringApplicationBuilder configure(SpringApplicationBuilder application) {
		return application.sources(TwliveApplication.class);
	}
}

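The scheduled task class is below; it is intended to run once every 50 s. An online cron expression generator is available at http://cron.qqe2.com/ (it apparently used to be ad-free, but now carries ads).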

package learn.twlive.web;

import org.springframework.beans.factory.annotation.Value;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class Web {

    @Value("${web.driver}")
    private String webDriverPath;

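    // Spring cron fields: second minute hour day-of-month month day-of-week.
    // "0/50" in the seconds field fires at seconds 0 and 50 of every minute; the minute field
    // must be "*" (not "0"), otherwise the task only runs during the first minute of each hour.
    // For a strict 50 s interval, @Scheduled(fixedRate = 50000) is an alternative.
    @Scheduled(cron = "0/50 * * * * ?")
    public void getTwurl(){
        TwitchSearch twitchSearch = new TwitchSearch();
        String searchKey = "dota2 psg.lgd";
        String url = twitchSearch.downLoadSeliunm(searchKey,webDriverPath);
        // downLoadSeliunm returns "" when nothing is found, so check for empty as well as null
        if(url!=null && !url.isEmpty())
            System.out.println(url);
    }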

}
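
A seconds field of 0/50 is easy to misread as "every 50 seconds". The sketch below expands a start/step seconds field by hand to show when it fires within one minute. It is a simplified illustration, not Spring's parser, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class CronSecondsDemo {
    /** Expands a "start/step" seconds field into the seconds (0-59) at which it fires. */
    static List<Integer> expand(String field) {
        String[] parts = field.split("/");
        int start = Integer.parseInt(parts[0]);
        int step = Integer.parseInt(parts[1]);
        List<Integer> seconds = new ArrayList<>();
        // The field restarts every minute, so values never run past 59.
        for (int s = start; s < 60; s += step) {
            seconds.add(s);
        }
        return seconds;
    }

    public static void main(String[] args) {
        System.out.println(expand("0/50")); // [0, 50]
        System.out.println(expand("0/20")); // [0, 20, 40]
    }
}
```

Because the field restarts every minute, 0/50 produces gaps of 50 s and then 10 s. A step that divides 60 evenly (such as 0/20) gives a uniform interval with cron; for other periods a fixed-rate schedule is the usual choice.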

The search and URL-extraction class is below. It uses jsoup to parse the page, although Selenium's built-in selectors would also work:

package learn.twlive.web;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import org.openqa.selenium.By;
import org.openqa.selenium.Dimension;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.ArrayList;

public class TwitchSearch {

    Logger logger = LoggerFactory.getLogger(this.getClass());

    public String downLoadSeliunm(String keyword,String webDriverPath){
        String url = "https://www.twitch.tv/";
        // Absolute path of the chromedriver executable
        System.setProperty("webdriver.chrome.driver",webDriverPath);
        // Chrome launch arguments; uncomment the next line to run headless
        ArrayList<String> command = new ArrayList<String>();
        //command.add("--headless");
        ChromeOptions options = new ChromeOptions();
        options.addArguments(command);
        WebDriver driver = new ChromeDriver(options);
        try{
            driver.manage().window().setSize(new Dimension(1920,1080));
            driver.get(url);
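            driver.findElement(By.id("nav-search-input")).sendKeys(keyword);
            // Assumption: results render asynchronously after typing, so wait up to 10 s for the
            // result block before reading the page (fully qualified names avoid new imports)
            new org.openqa.selenium.support.ui.WebDriverWait(driver, 10).until(
                    org.openqa.selenium.support.ui.ExpectedConditions.presenceOfElementLocated(
                            By.className("search-result-section__block")));
            String docString = driver.getPageSource();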
            Document doc  =  Jsoup.parse(docString);
            String liveUrl= parserTwitch(doc);
            return liveUrl;
        }catch (Exception ex){
            // Log at error level and keep the stack trace
            logger.error("Twitch search failed", ex);
        }finally {
            driver.quit();
        }
        return "";
    }
    
    
    public String parserTwitch(Document doc){
        // First block of search results
        Elements searchResultSectionBlock = doc.getElementsByClass("search-result-section__block");
        if(searchResultSectionBlock.size()>0){
            Elements urlA = searchResultSectionBlock.get(0).getElementsByClass("tw-interactive tw-block tw-full-width tw-interactable tw-interactable--inverted");
            // Guard against a result block that contains no matching link
            if(urlA.isEmpty()){
                return "";
            }
            String href = urlA.get(0).attr("href");
            return  "https://www.twitch.tv"+href;
        }
        return "";
    }
}
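
The parser builds the final URL by concatenating the site root with the href. A slightly more defensive variant, sketched below with the JDK's java.net.URI, also copes with hrefs that are already absolute or missing. The class name, method name and the example channel path are hypothetical.

```java
import java.net.URI;

final class TwitchUrlDemo {
    private static final URI BASE = URI.create("https://www.twitch.tv/");

    /** Resolves an href against the Twitch base URL; returns "" for a missing href. */
    static String toAbsolute(String href) {
        if (href == null || href.isEmpty()) {
            return "";
        }
        // resolve() leaves absolute URLs untouched and joins relative ones to the base.
        // It throws IllegalArgumentException if href is not a syntactically valid URI.
        return BASE.resolve(href).toString();
    }

    public static void main(String[] args) {
        System.out.println(toAbsolute("/some_channel"));
        System.out.println(toAbsolute("https://www.twitch.tv/some_channel"));
    }
}
```

Both calls in main print https://www.twitch.tv/some_channel.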
