URLs in the HTTP protocol: how much do you know?

Latest recommended article published on 2023-09-17 17:13:26

kendyhj9999


URLs in the HTTP protocol: how much do you know? | {流芒}-{土豆}

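How much do you know about URLs in the HTTP protocol?
Today someone asked me how to match the anchor part of a URL, and the question stumped me. How do you match an anchor, and how do you decide whether two URLs are equivalent?

I had to go back to the HTTP specification and the URI specification. This is what they define: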

1. Are two URLs equivalent?

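The HTTP specification defines the format of an http URL as follows:
http_URL = "http:" "//" host [ ":" port ] [ abs_path [ "?" query ]]
As you can see, both port and query are optional.
If no port is given, the default is the well-known port 80. abs_path is the absolute path of the resource, which a server typically resolves relative to its configured document root. If it is empty, it refers to the root path "/".
The HTTP specification states that URLs are compared case-sensitively, with the following exceptions:
1 An omitted port is equivalent to the default port.
2 The host (host name or domain name) is case-insensitive, and so is the scheme name.
3 An empty path is equivalent to "/".
4 A character outside the "reserved" and "unsafe" sets is equivalent to its percent-encoded ("%" HEX HEX) form, and the hex digits are case-insensitive. For example, ~ may be written as %7e or %7E.
5 These rules do not mention the anchor (#test) that I run into so often. The http_URL grammar above has no fragment component, so the anchor takes no part in the comparison.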

For example, the following three URLs are equivalent:

http://abc.com:80/~smith/home.html

http://ABC.com/%7Esmith/home.html

http://ABC.com:/%7esmith/home.html
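
The comparison rules above can be sketched in Python roughly as follows. This is only an illustration, not a complete implementation of the specification: the names `normalize_http_url` and `same_http_url` are made up for this example, and the set of characters that may be safely decoded is assumed to be the RFC 3986 "unreserved" set, which is narrower than the wording in RFC 2616.

```python
import re
from urllib.parse import urlsplit

DEFAULT_PORT = 80  # default port for the http scheme

# Assumption: only these characters are decoded from their %XX form.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)


def _normalize_escapes(text):
    """Decode %XX for unreserved characters; uppercase the hex digits of the rest."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        return "%" + match.group(1).upper()

    return re.sub(r"%([0-9A-Fa-f]{2})", fix, text)


def normalize_http_url(url):
    """Return a tuple that is equal for URLs the rules treat as equivalent."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()              # scheme is case-insensitive
    host = (parts.hostname or "").lower()      # host is case-insensitive
    port = parts.port if parts.port is not None else DEFAULT_PORT
    path = _normalize_escapes(parts.path) or "/"   # empty path means "/"
    query = _normalize_escapes(parts.query)
    # parts.fragment is deliberately left out: the anchor is not compared.
    return (scheme, host, port, path, query)


def same_http_url(a, b):
    return normalize_http_url(a) == normalize_http_url(b)
```

With this sketch, the three URLs listed above normalize to the same tuple, while URLs whose paths differ only in letter case remain distinct.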

The original English text:

If the port is empty or not given, port 80 is assumed. The semantics are that the identified resource is located at the server listening for TCP connections on that port of that host, and the Request-URI for the resource is abs_path (section 5.1.2). The use of IP addresses in URLs SHOULD be avoided whenever possible (see RFC 1900 [24]). If the abs_path is not present in the URL, it MUST be given as “/” when used as a Request-URI for a resource (section 5.1.2). If a proxy receives a host name which is not a fully qualified domain name, it MAY add its domain to the host name it received. If a proxy receives a fully qualified domain name, the proxy MUST NOT change the host name.

When comparing two URIs to decide if they match or not, a client SHOULD use a case-sensitive octet-by-octet comparison of the entire URIs, with these exceptions:

      - A port that is empty or not given is equivalent to the default
        port for that URI-reference;

      - Comparisons of host names MUST be case-insensitive;

      - Comparisons of scheme names MUST be case-insensitive;

      - An empty abs_path is equivalent to an abs_path of "/".

Characters other than those in the "reserved" and "unsafe" sets (see RFC 2396 [42]) are equivalent to their ""%" HEX HEX" encoding.

Source: http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.2.2

2. Anchors

The anchor (the fragment identifier) is the part of a URL that runs from the # character to the end of the URL.
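
To answer the original question of how to match the anchor, one option in Python is the standard library function `urllib.parse.urldefrag`, which splits a URL at the first `#`. The wrapper name `split_anchor` below is made up for this sketch.

```python
from urllib.parse import urldefrag


def split_anchor(url):
    """Return (url_without_anchor, anchor); the anchor is '' when absent."""
    base, fragment = urldefrag(url)
    return base, fragment
```

A URL without `#` comes back unchanged, paired with an empty anchor string.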

Excerpt from the generic URI syntax specification (RFC 3986):
The generic URI syntax consists of a hierarchical sequence of components referred to as the scheme, authority, path, query, and fragment.

   URI         = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

   hier-part   = "//" authority path-abempty
               / path-absolute
               / path-rootless
               / path-empty

The scheme and path components are required, though the path may be empty (no characters). When authority is present, the path must either be empty or begin with a slash (“/”) character. When authority is not present, the path cannot begin with two slash characters (“//”). These restrictions result in five different ABNF rules for a path (Section 3.3), only one of which will match any given URI reference.

The following are two example URIs and their component parts:

      foo://example.com:8042/over/there?name=ferret#nose
      \_/   \______________/\_________/ \_________/ \__/
       |           |            |            |        |
    scheme     authority       path        query   fragment
       |   _____________________|__
      / \ /                        \
      urn:example:animal:ferret:nose
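
As a rough illustration of the diagram, Python's `urllib.parse.urlsplit` breaks a URI into the same five components. The helper name `components` is an assumption of this sketch, and `urlsplit` calls the authority `netloc`.

```python
from urllib.parse import urlsplit


def components(uri):
    """Map the five generic URI components to their values."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,
        "authority": parts.netloc,
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


first = components("foo://example.com:8042/over/there?name=ferret#nose")
second = components("urn:example:animal:ferret:nose")
```

For the `urn:` example there is no `//`, so the authority is empty and everything after the scheme is the path.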

The fragment identifier component of a URI allows indirect identification of a secondary resource by reference to a primary resource and additional identifying information. The identified secondary resource may be some portion or subset of the primary resource, some view on representations of the primary resource, or some other resource defined or described by those representations. A fragment identifier component is indicated by the presence of a number sign (“#”) character and terminated by the end of the URI.

   fragment    = *( pchar / "/" / "?" )

The semantics of a fragment identifier are defined by the set of representations that might result from a retrieval action on the primary resource. The fragment’s format and resolution is therefore dependent on the media type [RFC2046] of a potentially retrieved representation, even though such a retrieval is only performed if the URI is dereferenced. If no such representation exists, then the semantics of the fragment are considered unknown and are effectively unconstrained. Fragment identifier semantics are independent of the URI scheme and thus cannot be redefined by scheme specifications.

Individual media types may define their own restrictions on or structures within the fragment identifier syntax for specifying different types of subsets, views, or external references that are identifiable as secondary resources by that media type. If the primary resource has multiple representations, as is often the case for resources whose representation is selected based on attributes of the retrieval request (a.k.a., content negotiation), then whatever is identified by the fragment should be consistent across all of those representations. Each representation should either define the fragment so that it corresponds to the same secondary resource, regardless of how it is represented, or should leave the fragment undefined (i.e., not found).

As with any URI, use of a fragment identifier component does not imply that a retrieval action will take place. A URI with a fragment identifier may be used to refer to the secondary resource without any implication that the primary resource is accessible or will ever be accessed.

Fragment identifiers have a special role in information retrieval systems as the primary form of client-side indirect referencing, allowing an author to specifically identify aspects of an existing resource that are only indirectly provided by the resource owner. As such, the fragment identifier is not used in the scheme-specific processing of a URI; instead, the fragment identifier is separated from the rest of the URI prior to a dereference, and thus the identifying information within the fragment itself is dereferenced solely by the user agent, regardless of the URI scheme. Although this separate handling is often perceived to be a loss of information, particularly for accurate redirection of references as resources move over time, it also serves to prevent information providers from denying reference authors the right to refer to information within a resource selectively. Indirect referencing also provides additional flexibility and extensibility to systems that use URIs, as new media types are easier to define and deploy than new schemes of identification.
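
The statement that the fragment is separated before a dereference can be pictured with a small sketch: when a client builds the Request-URI for an HTTP request, only the path and query are used, and the anchor stays on the client side. The function name `request_target` is hypothetical, and a real client does more than this (proxies, encoding, and so on).

```python
from urllib.parse import urlsplit


def request_target(url):
    """Build the path-and-query part that would be sent to the server."""
    parts = urlsplit(url)
    target = parts.path or "/"   # an empty path is sent as "/"
    if parts.query:
        target += "?" + parts.query
    return target                # parts.fragment is never included
```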

The characters slash (“/”) and question mark (“?”) are allowed to represent data within the fragment identifier. Beware that some older, erroneous implementations may not handle this data correctly when it is used as the base URI for relative references (Section 5.1).
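
A short check of this rule with Python's `urlsplit`, using a made-up URL: because the fragment is cut off at the first `#`, a later `/` or `?` is treated as fragment data and not as a path or query delimiter.

```python
from urllib.parse import urlsplit

# Hypothetical URL whose fragment contains both "/" and "?".
parts = urlsplit("http://example.com/doc#section/2?view=full")

path = parts.path          # "/doc"
query = parts.query        # empty: the "?" belongs to the fragment
fragment = parts.fragment  # "section/2?view=full"
```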

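A regular expression for parsing a URL (from RFC 3986, Appendix B):

   ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
    12            3  4          5       6  7        8 9

Matched against http://www.ics.uci.edu/pub/ietf/uri/#Related, the subexpressions capture: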

   $1 = http:
   $2 = http
   $3 = //www.ics.uci.edu
   $4 = www.ics.uci.edu
   $5 = /pub/ietf/uri/
   $6 = <undefined>
   $7 = <undefined>
   $8 = #Related
   $9 = Related
   scheme    = $2
   authority = $4
   path      = $5
   query     = $7
   fragment  = $9
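
The same expression can be tried out with Python's `re` module. This is a minimal sketch; the names `URI_PATTERN` and `split_uri` are made up here. Groups that do not participate in the match come back as `None`, which corresponds to `<undefined>` in the listing above.

```python
import re

# The regular expression quoted above, unchanged.
URI_PATTERN = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri(uri):
    """Return the five components, using the group numbers from the listing."""
    match = URI_PATTERN.match(uri)  # every part is optional, so this always matches
    return {
        "scheme": match.group(2),
        "authority": match.group(4),
        "path": match.group(5),
        "query": match.group(7),
        "fragment": match.group(9),
    }


result = split_uri("http://www.ics.uci.edu/pub/ietf/uri/#Related")
```

Matching the anchor is then just a matter of reading `result["fragment"]`.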

