URI: Universa1 Resource Identifier统一资源标志符,用来标识抽象或物理资源的一个紧凑字符串。
URL:Universal Resource Locator统一资源定位符,一种定位资源的主要访问机制的字符串,一个标准的URL必须包括:
protocol、host、port、path、parameter、anchor。
URN:universal Resource Name统一资源名称,通过特定命名空间中的唯—名称或ID来标识资源。
在www上,每一信息资源都有统一且唯一的地址URL。
URL的组成结构
URL的组成结构:协议+存放资源的主机域名+端口号+资源文件名 1
URL示例:http://www.google.com:80/index.html 的组成部分如下:
-
协议:http://
-
存放资源的主机域名:www.google.com
-
端口号:80
-
资源文件名:index.html
URL类
类 URL 代表一个统一资源定位符,它是指向互联网“资源”的指针。资源可以是简单的文件或目录,也可以是对更为复杂的对象的引用,例如对数据库或搜索引擎的查询
构造方法
Constructor | Description |
---|---|
URL(String spec) | Creates a URL object from the String representation. |
URL(String protocol, String host, int port, String file) | Creates a URL object from the specified protocol, host, port number, and file. |
URL(String protocol, String host, int port, String file, URLStreamHandler handler) | Creates a URL object from the specified protocol, host, port number, file, and handler. |
URL(String protocol, String host, String file) | Creates a URL from the specified protocol name, host name, and file name. |
URL(URL context, String spec) | Creates a URL by parsing the given spec within a specified context. |
URL(URL context, String spec, URLStreamHandler handler) | Creates a URL by parsing the given spec with the specified handler within a specified context. |
成员方法
Method Modifier and Type | Description |
---|---|
boolean equals(Object obj) | Compares this URL for equality with another object. |
String getAuthority() | Gets the authority part of this URL. |
Object getContent() | Gets the contents of this URL. |
Object getContent(Class<?>[] classes) | Gets the contents of this URL. |
int getDefaultPort() | Gets the default port number of the protocol associated with this URL. |
String getFile() | Gets the file name of this URL. |
String getHost() | Gets the host name of this URL, if applicable. |
String getPath() | Gets the path part of this URL. |
int getPort() | Gets the port number of this URL. |
String getProtocol() | Gets the protocol name of this URL. |
String getQuery() | Gets the query part of this URL. |
String getRef() | Gets the anchor (also known as the “reference”) of this URL. |
String getUserInfo() | Gets the userInfo part of this URL. |
int hashCode() | Creates an integer suitable for hash table indexing. |
URLConnection openConnection() | Returns a URLConnection instance that represents a connection to the remote object referred to by the URL. |
URLConnection openConnection(Proxy proxy) | Same as openConnection(), except that the connection will be made through the specified proxy; Protocol handlers that do not support proxying will ignore the proxy parameter and |
InputStream openStream() | Opens a connection to this URL and returns an InputStream for reading from that connection. |
boolean sameFile(URL other) | Compares two URLs, excluding the fragment component. |
static void setURLStreamHandlerFactory(URLStreamHandlerFactory fac) | Sets an application’s URLStreamHandlerFactory. |
String toExternalForm() | Constructs a string representation of this URL. |
String toString() | Constructs a string representation of this URL. |
URI toURI() | Returns a URI equivalent to this URL. |
示例
import java.net.MalformedURLException;
import java.net.URL;
public class urlTest01 {
public static void main(String[] args) throws MalformedURLException {
URL url = new URL("http://www.baidu.com:80/index.html?uname=shsxt&age=18#a");
System.out.println("协议:"+url.getProtocol());
System.out.println("域名|ip:"+url.getHost());
System.out.println("端口:"+url.getPort());
System.out.println("请求资源:"+url.getFile()); //文件+参数
System.out.println("获取路径:"+url.getPath());
//参数
System.out.println("参数:"+url.getQuery());
//锚点
System.out.println("锚点:"+url.getRef());
}
}
写一个最简单的网络爬虫
抓取京东网站主页源码
import java.net.URL;
import java.io.IOException;
import java.io.InputStream;
import java.io.BufferedReader;
import java.io.InputStreamReader;
public class spiderTest01 {
public static void main(String[] args) throws IOException {
URL url = new URL("http://www.jd.com");
//下载Opens a connection to this URL and returns an InputStream for reading from that connection.
InputStream is = url.openStream();
BufferedReader br = new BufferedReader(new InputStreamReader(is, "UTF-8"));
String msg = null;
while(null != (msg = br.readLine())) {
System.out.println(msg);
}
br.close();
}
}
有的网站是不允许抓包的,因此模拟浏览器浏览方式抓取大众点评网站主页源码
要设置连接调用URL的openConnection方法,将会返回HttpURLConnection、JarURLConnection的远程对象的连接,从而可以设置连接的参数,进而伪装访问抓取资源
![](https://img-blog.csdnimg.cn/20201012200012886.png)
![](https://img-blog.csdnimg.cn/20201012200030561.png)
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
public class spiderTest02 {
public static void main(String[] args) throws IOException {
URL url = new URL("http://www.dianping.com");
//下载(模拟浏览器访问)
HttpURLConnection conn = (HttpURLConnection)url.openConnection();
conn.setRequestMethod("GET");
conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36");
BufferedReader br = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));
String msg = null;
while(null != (msg = br.readLine())) {
System.out.println(msg);
}
br.close();
}
}
更完整的后面还有参数和锚点 ↩︎