If a web request comes from the Google crawler (Googlebot), its "user agent" header should look similar to the following:
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
or
(rarely used): Googlebot/2.1 (+http://www.google.com/bot.html)
Source: Google crawlers
1. Java example
In Java, you can get the "user agent" from HttpServletRequest.
Example : Service hosted at abcdefg.com
@Autowired
private HttpServletRequest request;
//...
String userAgent = request.getHeader("user-agent");
System.out.println("User Agent : " + userAgent);
if (!StringUtils.isEmpty(userAgent)) {
    if (userAgent.toLowerCase().contains("googlebot")) {
        System.out.println("This is Google bot");
    } else {
        System.out.println("Not from Google");
    }
}
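Note
The solution above works for honest clients, but it cannot detect a fake or spoofed user agent.
2. Fake user agent
Creating a request with a fake/spoofed user agent is easy. For example: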
Example : Send a fake user agent request to abcdefg.com
package com.mkyong.web;

import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.HttpClientBuilder;

public class test {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClientBuilder.create().build();
        // HttpGet needs an absolute URI, including the scheme
        HttpGet request = new HttpGet("http://abcdefg.com");
        // any client can claim to be Googlebot
        request.setHeader("user-agent", "fake googlebot");
        HttpResponse response = client.execute(request);
    }
}
Output at abcdefg.com:
User Agent : fake googlebot
This is Google bot
3. Verify Googlebot
To verify a real Googlebot, you can run a "reverse DNS lookup" manually, followed by a forward lookup on the returned name:
> host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.
> host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1
Source: Verifying Googlebot
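The two manual steps above (reverse lookup, then forward lookup on the returned name) can also be sketched in plain Java with java.net.InetAddress, without calling an external command. This is a minimal sketch, not the article's method: the class GooglebotDnsVerifier, the Dns interface and the SYSTEM_DNS constant are hypothetical names introduced here, and only the googlebot.com domain shown above is accepted.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Sketch only: all names in this class are illustrative assumptions.
final class GooglebotDnsVerifier {

    /** Small DNS abstraction so the logic can be exercised without a network. */
    interface Dns {
        String reverse(String ip) throws UnknownHostException;         // IP -> host name
        List<String> forward(String host) throws UnknownHostException; // host name -> IPs
    }

    /** Lookups backed by java.net.InetAddress. Pass IP literals only. */
    static final Dns SYSTEM_DNS = new Dns() {
        @Override
        public String reverse(String ip) throws UnknownHostException {
            // For an IP literal, getCanonicalHostName() does the reverse lookup
            // and falls back to the textual IP when no name is found.
            return InetAddress.getByName(ip).getCanonicalHostName();
        }

        @Override
        public List<String> forward(String host) throws UnknownHostException {
            List<String> ips = new ArrayList<>();
            for (InetAddress address : InetAddress.getAllByName(host)) {
                ips.add(address.getHostAddress());
            }
            return ips;
        }
    };

    /** Step 1: the IP must map to *.googlebot.com. Step 2: that name must map back to the IP. */
    static boolean isGooglebot(String ip, Dns dns) {
        try {
            String host = dns.reverse(ip);
            if (host == null) {
                return false;
            }
            host = host.toLowerCase(Locale.ROOT);
            if (host.endsWith(".")) {
                host = host.substring(0, host.length() - 1); // drop the trailing root dot
            }
            if (!host.endsWith(".googlebot.com")) {
                return false;
            }
            // Plain string comparison: adequate for IPv4 literals such as 66.249.66.1.
            return dns.forward(host).contains(ip);
        } catch (UnknownHostException e) {
            return false;
        }
    }
}
```

A call such as GooglebotDnsVerifier.isGooglebot(requestIp, GooglebotDnsVerifier.SYSTEM_DNS) would perform real lookups. Because each lookup can be slow, caching the result per IP is probably worthwhile in a real service.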
4. Verify Googlebot – Java example
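Based on the theory above, we can emulate the first part of the "reverse DNS lookup": use the host command to determine where the request's IP points to.
If the request comes from Googlebot, the output will have the following format: xx *.googlebot.com.
P.S. The host command is available only on *nix systems.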
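Since the expected format ends in .googlebot.com, the output of host can be checked a little more strictly than with a plain substring search, which would also accept a name such as googlebot.com.evil.example. The following is a minimal sketch under that assumption. HostOutputCheck and pointsToGooglebot are hypothetical names, and the regular expression assumes the "domain name pointer" wording shown in the host output in section 3.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: class and method names are illustrative assumptions.
final class HostOutputCheck {

    // Matches lines such as:
    // 1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.
    private static final Pattern POINTER =
            Pattern.compile("domain name pointer\\s+(\\S+)");

    /** Returns true if any pointer name in the host output ends in .googlebot.com. */
    static boolean pointsToGooglebot(String hostOutput) {
        if (hostOutput == null) {
            return false;
        }
        Matcher matcher = POINTER.matcher(hostOutput);
        while (matcher.find()) {
            String name = matcher.group(1).toLowerCase(Locale.ROOT);
            if (name.endsWith(".")) {
                name = name.substring(0, name.length() - 1); // drop the trailing root dot
            }
            if (name.endsWith(".googlebot.com")) {
                return true;
            }
        }
        return false;
    }
}
```

This only tightens the first half of the check. A complete verification would still resolve the returned name forward and compare it with the request's IP.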
Example : Detect fake user agent
@Autowired
private HttpServletRequest request;
//...
String requestIp = getRequestIp();
String userAgent = request.getHeader("user-agent");
System.out.println("User Agent : " + userAgent);
if (!StringUtils.isEmpty(userAgent)) {
    if (userAgent.toLowerCase().contains("googlebot")) {
        //check fake user agent
        String output = executeCommand("host " + requestIp);
        System.out.println("Output : " + output);
        if (output.toLowerCase().contains("googlebot.com")) {
            System.out.println("This is Google bot");
        } else {
            System.out.println("This is fake user agent");
        }
    } else {
        System.out.println("Not from Google");
    }
}
//get requested IP
//note: X-FORWARDED-FOR is sent by the client or a proxy, so it can be spoofed as well
private String getRequestIp() {
    String ipAddress = request.getHeader("X-FORWARDED-FOR");
    if (ipAddress == null) {
        ipAddress = request.getRemoteAddr();
    }
    return ipAddress;
}
// execute external command and return its standard output
// (needs java.io.BufferedReader and java.io.InputStreamReader)
private String executeCommand(String command) {
    StringBuilder output = new StringBuilder();
    try {
        Process p = Runtime.getRuntime().exec(command);
        // read the output before waiting, so a full pipe buffer cannot block the process
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                output.append(line).append("\n");
            }
        }
        p.waitFor();
    } catch (Exception e) {
        e.printStackTrace();
    }
    return output.toString();
}
Try the fake user agent example from step 2 again. This time, you will get the following output:
User Agent : fake googlebot
Output : Host 142.1.168.192.in-addr.arpa. not found: 3(NXDOMAIN) //this output may vary.
This is fake user agent
Note
This simple solution may not stop 100% of fake/spoofed user agents, but as an extra layer of security it should stop most basic user agent spoofing attempts.
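One reason the check is not watertight is that the X-FORWARDED-FOR header is itself supplied by the client or by a proxy, so a spoofed request could also claim a Googlebot IP. A possible mitigation, sketched here under the assumption that the application sits behind a single reverse proxy whose address is known, is to honor the header only when the direct peer is that proxy. ClientIpResolver, resolve and trustedProxies are hypothetical names, not part of the original example.

```java
import java.util.Set;

// Sketch only: class, method and parameter names are illustrative assumptions.
final class ClientIpResolver {

    /**
     * Returns the address to verify. The forwarded header is used only when the
     * direct peer (remoteAddr) is one of our own trusted proxies.
     */
    static String resolve(String forwardedFor, String remoteAddr, Set<String> trustedProxies) {
        if (forwardedFor == null || !trustedProxies.contains(remoteAddr)) {
            return remoteAddr;
        }
        // With one trusted proxy, the right-most entry is the address that the
        // proxy itself saw. Entries further left may have been sent by the client.
        String[] parts = forwardedFor.split(",");
        String candidate = parts[parts.length - 1].trim();
        return candidate.isEmpty() ? remoteAddr : candidate;
    }
}
```

In the example above, getRequestIp() could return something like ClientIpResolver.resolve(request.getHeader("X-FORWARDED-FOR"), request.getRemoteAddr(), trustedProxies), where trustedProxies would be filled from configuration.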
If you have a better solution, please share it below. Thanks.
References
Translated from: https://mkyong.com/java/java-check-if-web-request-is-from-google-crawler/