166. Basic Usage of URL

云疏不知数

Category: JavaSE

Article link: https://blog.csdn.net/qq_43808700/article/details/109034384

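URI: Uniform Resource Identifier, a compact string that identifies an abstract or physical resource.

URL: Uniform Resource Locator, a string that locates a resource through its primary access mechanism. A fully specified URL consists of:
protocol, host, port, path, parameter, anchor (the port, parameter, and anchor are optional).

URN: Uniform Resource Name, which identifies a resource by a unique name or ID within a particular namespace.

On the WWW, every information resource has a uniform and unique address: its URL.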
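The difference between a name (URN) and a location (URL) can be seen with java.net.URI, which accepts both kinds of identifier. The following is a minimal sketch; the class name UriKindsDemo, its two helper methods, and the sample ISBN URN are illustrative assumptions, not part of the original article.

```java
import java.net.URI;

class UriKindsDemo {
    // True when the identifier uses the "urn" scheme, i.e. it only names the resource.
    static boolean isUrn(String text) {
        return "urn".equalsIgnoreCase(URI.create(text).getScheme());
    }

    // True when the identifier carries a host, i.e. it says where the resource lives.
    static boolean hasLocation(String text) {
        return URI.create(text).getHost() != null;
    }

    public static void main(String[] args) {
        String urn = "urn:isbn:0451450523"; // hypothetical sample URN
        String url = "http://www.google.com:80/index.html";
        System.out.println(urn + " -> isUrn=" + isUrn(urn) + ", hasLocation=" + hasLocation(urn));
        System.out.println(url + " -> isUrn=" + isUrn(url) + ", hasLocation=" + hasLocation(url));
    }
}
```

Both strings are valid URIs; only the second one tells a client which host to contact.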

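Structure of a URL

A URL is composed of: protocol + domain name of the host that stores the resource + port number + resource file name ¹

Example: http://www.google.com:80/index.html breaks down as follows:

Protocol: http
Domain name of the host that stores the resource: www.google.com
Port number: 80
Resource file name: index.html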
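The port is often left out of a URL, in which case the protocol's default port applies. With java.net.URL, getPort() returns -1 when no port is written and getDefaultPort() gives the protocol default. The sketch below is an illustration under that assumption; the class PortDemo and the method effectivePort are hypothetical names.

```java
import java.net.MalformedURLException;
import java.net.URL;

class PortDemo {
    // Returns the explicit port if one is written in the URL, otherwise the protocol default.
    static int effectivePort(String spec) {
        try {
            URL url = new URL(spec);
            return url.getPort() != -1 ? url.getPort() : url.getDefaultPort();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + spec, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(effectivePort("http://www.google.com:80/index.html"));   // explicit port
        System.out.println(effectivePort("http://www.google.com/index.html"));      // default for http
        System.out.println(effectivePort("https://www.google.com/index.html"));     // default for https
    }
}
```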

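The URL class

The URL class represents a Uniform Resource Locator, a pointer to a "resource" on the Internet. A resource can be something as simple as a file or a directory, or it can be a reference to a more complicated object, such as a query to a database or to a search engine.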

Constructors

Constructor	Description
URL(String spec)	Creates a URL object from the String representation.
URL(String protocol, String host, int port, String file)	Creates a URL object from the specified protocol, host, port number, and file.
URL(String protocol, String host, int port, String file, URLStreamHandler handler)	Creates a URL object from the specified protocol, host, port number, file, and handler.
URL(String protocol, String host, String file)	Creates a URL from the specified protocol name, host name, and file name.
URL(URL context, String spec)	Creates a URL by parsing the given spec within a specified context.
URL(URL context, String spec, URLStreamHandler handler)	Creates a URL by parsing the given spec with the specified handler within a specified context.
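To make the table more concrete, here is a small sketch of two of the constructors: building a URL from its parts, and resolving a relative spec against a context URL. The class UrlConstructorDemo, its helper methods, and the sample paths are assumptions made for illustration. On recent JDKs these constructors are deprecated in favor of URI.toURL(); the sketch keeps the constructor style used in this article.

```java
import java.net.MalformedURLException;
import java.net.URL;

class UrlConstructorDemo {
    // URL(String protocol, String host, int port, String file)
    static String fromParts(String protocol, String host, int port, String file) {
        try {
            return new URL(protocol, host, port, file).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // URL(URL context, String spec): spec is resolved relative to the context URL
    static String resolve(String context, String spec) {
        try {
            return new URL(new URL(context), spec).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fromParts("http", "www.google.com", 80, "/index.html"));
        // A relative spec replaces the last path segment of the context
        System.out.println(resolve("http://www.google.com:80/docs/index.html", "images/logo.png"));
        // A spec starting with "/" replaces the whole path
        System.out.println(resolve("http://www.google.com:80/docs/index.html", "/about.html"));
    }
}
```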

Methods

Modifier and Type, Method	Description
boolean equals(Object obj)	Compares this URL for equality with another object.
String getAuthority()	Gets the authority part of this URL.
Object getContent()	Gets the contents of this URL.
Object getContent(Class<?>[] classes)	Gets the contents of this URL.
int getDefaultPort()	Gets the default port number of the protocol associated with this URL.
String getFile()	Gets the file name of this URL.
String getHost()	Gets the host name of this URL, if applicable.
String getPath()	Gets the path part of this URL.
int getPort()	Gets the port number of this URL.
String getProtocol()	Gets the protocol name of this URL.
String getQuery()	Gets the query part of this URL.
String getRef()	Gets the anchor (also known as the “reference”) of this URL.
String getUserInfo()	Gets the userInfo part of this URL.
int hashCode()	Creates an integer suitable for hash table indexing.
URLConnection openConnection()	Returns a URLConnection instance that represents a connection to the remote object referred to by the URL.
URLConnection openConnection(Proxy proxy)	Same as openConnection(), except that the connection will be made through the specified proxy; protocol handlers that do not support proxying will ignore the proxy parameter and make a normal connection.
InputStream openStream()	Opens a connection to this URL and returns an InputStream for reading from that connection.
boolean sameFile(URL other)	Compares two URLs, excluding the fragment component.
static void setURLStreamHandlerFactory(URLStreamHandlerFactory fac)	Sets an application’s URLStreamHandlerFactory.
String toExternalForm()	Constructs a string representation of this URL.
String toString()	Constructs a string representation of this URL.
URI toURI()	Returns a URI equivalent to this URL.
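A few of the less obvious accessors from the table can be tried without any network access. The sketch below is illustrative only: the class UrlAccessorDemo, the parse helper, and the user:pw credentials in the sample URL are hypothetical.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

class UrlAccessorDemo {
    // Wraps the checked exception so the helper is easy to call from examples.
    static URL parse(String spec) {
        try {
            return new URL(spec);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + spec, e);
        }
    }

    public static void main(String[] args) throws Exception {
        URL url = parse("http://user:pw@www.baidu.com:80/index.html?uname=shsxt#a");
        System.out.println("authority: " + url.getAuthority()); // user info + host + port
        System.out.println("userInfo:  " + url.getUserInfo());
        System.out.println("file:      " + url.getFile());      // path plus query
        System.out.println("path:      " + url.getPath());      // path only
        System.out.println("external:  " + url.toExternalForm());
        URI uri = url.toURI();                                   // equivalent java.net.URI
        System.out.println("fragment:  " + uri.getFragment());
    }
}
```

Note that, according to the Javadoc, URL.equals may need to resolve host names and can therefore block; converting to a URI with toURI() and comparing URIs avoids that.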

Example

import java.net.MalformedURLException;
import java.net.URL;

public class urlTest01 {
	public static void main(String[] args) throws MalformedURLException {
		URL url = new URL("http://www.baidu.com:80/index.html?uname=shsxt&age=18#a");
		System.out.println("Protocol: " + url.getProtocol());
		System.out.println("Host name or IP: " + url.getHost());
		System.out.println("Port: " + url.getPort());
		System.out.println("Requested resource: " + url.getFile()); // path plus query string
		System.out.println("Path: " + url.getPath());

		// Query parameters
		System.out.println("Query: " + url.getQuery());
		// Anchor (fragment)
		System.out.println("Anchor: " + url.getRef());
	}
}
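getQuery() returns the whole query string in one piece, so individual parameters still have to be split out by hand. The following sketch shows one simple way to do that; QueryParamsDemo and parseQuery are hypothetical names, and the helper deliberately skips percent-decoding and repeated keys.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.Map;

class QueryParamsDemo {
    // Splits "k1=v1&k2=v2" into an insertion-ordered map; values are left undecoded.
    static Map<String, String> parseQuery(String query) {
        Map<String, String> params = new LinkedHashMap<>();
        if (query == null || query.isEmpty()) {
            return params; // URL has no query part
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            String key = eq >= 0 ? pair.substring(0, eq) : pair;
            String value = eq >= 0 ? pair.substring(eq + 1) : "";
            params.put(key, value);
        }
        return params;
    }

    public static void main(String[] args) throws MalformedURLException {
        URL url = new URL("http://www.baidu.com:80/index.html?uname=shsxt&age=18#a");
        System.out.println(parseQuery(url.getQuery())); // prints {uname=shsxt, age=18}
    }
}
```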

Writing the simplest possible web crawler

Fetch the source of the JD.com home page

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class spiderTest01 {
	public static void main(String[] args) throws IOException {
		URL url = new URL("http://www.jd.com");

		// Download: openStream() opens a connection to this URL and returns an InputStream for reading from it.
		// try-with-resources closes the reader and the underlying stream even if reading fails.
		try (InputStream is = url.openStream();
				BufferedReader br = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8))) {
			String msg;
			while ((msg = br.readLine()) != null) {
				System.out.println(msg);
			}
		}
	}
}
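openStream() is not tied to HTTP: it works for any protocol the JDK has a handler for, including file: URLs. That makes it possible to try the same reading loop without a network connection. The sketch below is an illustration under that assumption; UrlReadDemo, readAll, and roundTrip are hypothetical names, and the temporary file stands in for a remote page.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Collectors;

class UrlReadDemo {
    // Reads everything behind the URL as UTF-8 text, joining lines with '\n'.
    static String readAll(URL url) {
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            return br.lines().collect(Collectors.joining("\n"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes the text to a temporary file and reads it back through a file: URL.
    static String roundTrip(String text) {
        try {
            Path tmp = Files.createTempFile("url-demo", ".html");
            try {
                Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));
                return readAll(tmp.toUri().toURL());
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("<html>\n<body>hello</body>\n</html>"));
    }
}
```

The same readAll helper could be pointed at an http: URL, in which case it behaves like the crawler above.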

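Some sites reject requests that do not look like they come from a browser, so this example imitates a browser to fetch the source of the Dianping home page.

To configure the connection, call the URL's openConnection method. It returns a URLConnection to the remote object, of a protocol-specific type such as HttpURLConnection or JarURLConnection. Request parameters can then be set on that connection so that the request looks like it comes from a browser.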
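Which connection class comes back depends on the protocol of the URL, which is why the cast to HttpURLConnection is only safe for http and https URLs. The sketch below checks the returned type; ConnectionTypeDemo and isHttpConnection are hypothetical names, and the file: path is an arbitrary example. Calling openConnection() only creates the connection object; nothing is sent until the connection is actually used.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;

class ConnectionTypeDemo {
    // Reports whether openConnection() yields an HttpURLConnection for this URL.
    static boolean isHttpConnection(String spec) {
        try {
            URLConnection conn = new URL(spec).openConnection(); // no network traffic yet
            return conn instanceof HttpURLConnection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("http:  " + isHttpConnection("http://www.dianping.com"));
        System.out.println("https: " + isHttpConnection("https://www.dianping.com"));
        System.out.println("file:  " + isHttpConnection("file:///tmp/index.html"));
    }
}
```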

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class spiderTest02 {
	public static void main(String[] args) throws IOException {
		URL url = new URL("http://www.dianping.com");

		// Download while imitating a browser: set the request method and a browser User-Agent header
		HttpURLConnection conn = (HttpURLConnection) url.openConnection();
		conn.setRequestMethod("GET");
		conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36");

		// getInputStream() sends the request; try-with-resources closes the reader when done
		try (BufferedReader br = new BufferedReader(new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
			String msg;
			while ((msg = br.readLine()) != null) {
				System.out.println(msg);
			}
		}
	}
}
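In practice it can help to set timeouts and to check the HTTP status code before reading the body, since a site may still refuse the request. The following is a sketch under those assumptions rather than a definitive crawler: BrowserLikeRequestDemo, prepare, and the timeout values of 5 and 10 seconds are illustrative choices, not taken from the original article.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class BrowserLikeRequestDemo {
    static final String USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36";

    // Creates and configures the connection without sending anything yet.
    static HttpURLConnection prepare(String spec) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(spec).openConnection();
            conn.setRequestMethod("GET");
            conn.setRequestProperty("User-Agent", USER_AGENT);
            conn.setConnectTimeout(5_000);  // assumed value: give up connecting after 5 s
            conn.setReadTimeout(10_000);    // assumed value: give up waiting for data after 10 s
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HttpURLConnection conn = prepare("http://www.dianping.com");
        try {
            int status = conn.getResponseCode(); // this call actually sends the request
            System.out.println("HTTP status: " + status);
            if (status == HttpURLConnection.HTTP_OK) {
                try (BufferedReader br = new BufferedReader(
                        new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                    br.lines().forEach(System.out::println);
                }
            }
        } catch (IOException e) {
            System.out.println("Request failed: " + e);
        } finally {
            conn.disconnect();
        }
    }
}
```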

¹ A more complete URL is followed by parameters and an anchor.
