166.URL基本用法

URI: Universa1 Resource Identifier统一资源标志符,用来标识抽象或物理资源的一个紧凑字符串。

URL:Universal Resource Locator统一资源定位符,一种定位资源的主要访问机制的字符串,一个标准的URL必须包括:
protocol、host、port、path、parameter、anchor。

URN:universal Resource Name统一资源名称,通过特定命名空间中的唯—名称或ID来标识资源。

在www上,每一信息资源都有统一且唯一的地址URL。

URL的组成结构

URL的组成结构:协议+存放资源的主机域名+端口号+资源文件名 1

URL示例:http://www.google.com:80/index.html 的组成部分如下:

  1. 协议:http://

  2. 存放资源的主机域名:www.google.com

  3. 端口号:80

  4. 资源文件名:index.html

URL类

类 URL 代表一个统一资源定位符,它是指向互联网“资源”的指针。资源可以是简单的文件或目录,也可以是对更为复杂的对象的引用,例如对数据库或搜索引擎的查询

构造方法
ConstructorDescription
URL(String spec)Creates a URL object from the String representation.
URL(String protocol, String host, int port, String file)Creates a URL object from the specified protocol, host, port number, and file.
URL(String protocol, String host, int port, String file, URLStreamHandler handler)Creates a URL object from the specified protocol, host, port number, file, and handler.
URL(String protocol, String host, String file)Creates a URL from the specified protocol name, host name, and file name.
URL(URL context, String spec)Creates a URL by parsing the given spec within a specified context.
URL(URL context, String spec, URLStreamHandler handler)Creates a URL by parsing the given spec with the specified handler within a specified context.
成员方法
Method Modifier and TypeDescription
boolean equals(Object obj)Compares this URL for equality with another object.
String getAuthority()Gets the authority part of this URL.
Object getContent()Gets the contents of this URL.
Object getContent(Class<?>[] classes)Gets the contents of this URL.
int getDefaultPort()Gets the default port number of the protocol associated with this URL.
String getFile()Gets the file name of this URL.
String getHost()Gets the host name of this URL, if applicable.
String getPath()Gets the path part of this URL.
int getPort()Gets the port number of this URL.
String getProtocol()Gets the protocol name of this URL.
String getQuery()Gets the query part of this URL.
String getRef()Gets the anchor (also known as the “reference”) of this URL.
String getUserInfo()Gets the userInfo part of this URL.
int hashCode()Creates an integer suitable for hash table indexing.
URLConnection openConnection()Returns a URLConnection instance that represents a connection to the remote object referred to by the URL.
URLConnection openConnection(Proxy proxy)Same as openConnection(), except that the connection will be made through the specified proxy; Protocol handlers that do not support proxying will ignore the proxy parameter and
InputStream openStream()Opens a connection to this URL and returns an InputStream for reading from that connection.
boolean sameFile(URL other)Compares two URLs, excluding the fragment component.
static void setURLStreamHandlerFactory(URLStreamHandlerFactory fac)Sets an application’s URLStreamHandlerFactory.
String toExternalForm()Constructs a string representation of this URL.
String toString()Constructs a string representation of this URL.
URI toURI()Returns a URI equivalent to this URL.
示例
import java.net.MalformedURLException;
import java.net.URL;

public class urlTest01 {
	public static void main(String[] args) throws MalformedURLException {
		URL url = new URL("http://www.baidu.com:80/index.html?uname=shsxt&age=18#a");
		System.out.println("协议:"+url.getProtocol());
		System.out.println("域名|ip:"+url.getHost());
		System.out.println("端口:"+url.getPort());
		System.out.println("请求资源:"+url.getFile()); //文件+参数
		System.out.println("获取路径:"+url.getPath());
		
		//参数
		System.out.println("参数:"+url.getQuery());
		//锚点
		System.out.println("锚点:"+url.getRef());
	}
}
写一个最简单的网络爬虫

抓取京东网站主页源码

import java.net.URL;
import java.io.IOException;
import java.io.InputStream;
import java.io.BufferedReader;
import java.io.InputStreamReader;
public class spiderTest01 {
	public static void main(String[] args) throws IOException {
		URL url = new URL("http://www.jd.com");
		
		//下载Opens a connection to this URL and returns an InputStream for reading from that connection.
		InputStream is = url.openStream();
		BufferedReader br = new BufferedReader(new InputStreamReader(is, "UTF-8"));
		String msg = null;
		while(null != (msg = br.readLine())) {
			System.out.println(msg);
		}
		br.close();
	}
}

有的网站是不允许抓包的,因此模拟浏览器浏览方式抓取大众点评网站主页源码

要设置连接调用URL的openConnection方法,将会返回HttpURLConnection、JarURLConnection的远程对象的连接,从而可以设置连接的参数,进而伪装访问抓取资源

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
public class spiderTest02 {
	public static void main(String[] args) throws IOException {
		URL url = new URL("http://www.dianping.com");
		
		//下载(模拟浏览器访问)
		HttpURLConnection conn = (HttpURLConnection)url.openConnection();
		conn.setRequestMethod("GET");
		conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36");
		
		BufferedReader br = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));
		String msg = null;
		while(null != (msg = br.readLine())) {
			System.out.println(msg);
		}
		br.close();
	}
}

  1. 更完整的后面还有参数和锚点 ↩︎

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值