PHP Source Analysis: parse_url in the Streams Implementation

刘泽奇1990

Article link: https://blog.csdn.net/a_lzq/article/details/107675050


<scheme>://<user>:<password>@<host>:<port>/<path>;<params>?<query>#<frag>

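This is the full layout of a URL. Each component is described only briefly here.
scheme: the protocol, for example http.
user and password: the user name and password defined by the original specification. Taking HTTP as an example, if the server requires credentials and the request does not supply them, the server responds with status 401.
host: the domain name, for example baidu.com.
port: the port number, for example 80 for HTTP.
path: the path of the requested resource, for example aa/bb.php.
params: ignored here (few URLs use this component today, and PHP does not parse it).
query: the query string, for example a=1&b=2.
frag: the fragment (anchor).
For more detail, see other articles or the book "HTTP: The Definitive Guide".
Some example URLs: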

http://www.ori.com/index.html
http://www.yahoo.com/images/logo.gif
http://www.zous.com/inven-check.cgi?item=12731
ftp://joe:tools4U@ftp.zous.com/looking.gif

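Going by the standard definition, the only fixed delimiter characters in a URL are : / @ ; ? #. Of these, the colon can appear in the most places (after the scheme, between user and password, and between host and port), so it is a natural anchor for splitting a URL into its fields (this can be argued more formally, but that is not the focus here).
For example:
Find the first colon. If the two characters right after it are //, the text before the colon is the <scheme>.
Starting just after the //, scan forward. If an @ is found, the text in between is <user> and <password>; if that text contains no colon, it is a <user> with no password.
From the character after the @, scan up to the first /. If the text in between contains a colon, it holds <host> and <port>; otherwise it is only a <host>.
Then take what follows that / and process the rest of the string: everything after # is the <frag>.
With <frag> removed, everything after ? is the <query>.
With <frag> and <query> removed, what remains is <path> plus <params>.
That is a first-pass analysis of a URL.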
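The first-pass steps above can be sketched in C roughly as follows. This is only an illustration of the naive approach and not PHP's code: the naive_url struct, copy_range and naive_split are hypothetical names, the field sizes are arbitrary, and the sketch assumes a well-formed absolute URL containing "://". Unlike the description above, it keeps the leading / as part of the path.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical result type for this sketch; it is not PHP's php_url. */
typedef struct {
    char scheme[32], user[64], pass[64], host[128], port[8];
    char path[256], query[256], frag[128];
} naive_url;

/* Copy the range [b, e) into dst, truncating to cap - 1 bytes. */
static void copy_range(char *dst, size_t cap, const char *b, const char *e)
{
    size_t n = (size_t)(e - b);
    if (n >= cap) n = cap - 1;
    memcpy(dst, b, n);
    dst[n] = '\0';
}

/* Split an absolute URL. Returns 0 on success, -1 if there is no "scheme://". */
int naive_split(const char *url, naive_url *out)
{
    assert(url != NULL && out != NULL);
    memset(out, 0, sizeof *out);

    const char *end = url + strlen(url);
    const char *sep = strstr(url, "://");
    if (sep == NULL || sep == url) return -1;
    copy_range(out->scheme, sizeof out->scheme, url, sep);

    /* The authority runs from after "://" to the first '/', '?' or '#'. */
    const char *auth = sep + 3;
    const char *auth_end = auth + strcspn(auth, "/?#");

    /* user[:password] is whatever precedes '@' inside the authority. */
    const char *at = memchr(auth, '@', (size_t)(auth_end - auth));
    if (at != NULL) {
        const char *uc = memchr(auth, ':', (size_t)(at - auth));
        if (uc != NULL) {
            copy_range(out->user, sizeof out->user, auth, uc);
            copy_range(out->pass, sizeof out->pass, uc + 1, at);
        } else {
            copy_range(out->user, sizeof out->user, auth, at);
        }
        auth = at + 1;
    }

    /* host[:port] */
    const char *hc = memchr(auth, ':', (size_t)(auth_end - auth));
    if (hc != NULL) {
        copy_range(out->host, sizeof out->host, auth, hc);
        copy_range(out->port, sizeof out->port, hc + 1, auth_end);
    } else {
        copy_range(out->host, sizeof out->host, auth, auth_end);
    }

    /* Fragment first, then query; what is left is the path (plus params). */
    const char *rest_end = end;
    const char *hash = memchr(auth_end, '#', (size_t)(rest_end - auth_end));
    if (hash != NULL) {
        copy_range(out->frag, sizeof out->frag, hash + 1, rest_end);
        rest_end = hash;
    }
    const char *q = memchr(auth_end, '?', (size_t)(rest_end - auth_end));
    if (q != NULL) {
        copy_range(out->query, sizeof out->query, q + 1, rest_end);
        rest_end = q;
    }
    copy_range(out->path, sizeof out->path, auth_end, rest_end);
    return 0;
}
```

For ftp://joe:tools4U@ftp.zous.com/looking.gif this fills scheme, user, pass, host and path. It rejects inputs such as a.com:80 or mailto:..., which is exactly the kind of case the real parser has to handle.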
Now let's look at how PHP's parse_url function is implemented.

PHP_FUNCTION(parse_url)
{
	// some initialization omitted
	// the core work is done by this underlying call
	resource = php_url_parse_ex(str, str_len);
	// more initialization omitted
}

This calls php_url_parse_ex, which takes two arguments: the URL and its length.
It begins with:

char port_buf[6];
php_url *ret = ecalloc(1, sizeof(php_url));
char const *s, *e, *p, *pp, *ue;

s = str;
ue = s + length;

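This declares a temporary buffer for the port and five pointers into the string. char const * is the same as const char *: the pointers themselves can be moved, but the characters they point to cannot be modified through them, so the input is only ever read.
s then points to the first character of str.
Note that ue is assigned s + length, not s + length - 1, so it points one past the last character and serves as an exclusive end bound in comparisons such as p < ue.
For an ordinary NUL-terminated string that position holds the '\0', but the scanning is bounded by length rather than by the terminator.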
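To make the role of ue concrete, here is a small sketch of a length-bounded search that uses the same one-past-the-end convention. first_colon_offset is a hypothetical helper written for this note, not a function from PHP.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Offset of the first ':' within the first `length` bytes of str, or -1.
 * Hypothetical helper for illustration only. */
ptrdiff_t first_colon_offset(const char *str, size_t length)
{
    assert(str != NULL);
    const char *s = str;
    const char *ue = s + length;  /* one past the last byte; never dereferenced */
    const char *e = memchr(s, ':', (size_t)(ue - s));
    return e ? e - s : -1;
}
```

Because the search is bounded by length, an embedded '\0' does not stop it, and a colon that lies beyond length is not seen.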
Now the first check:

	/* parse scheme (a short, clear comment) */
	if ((e = memchr(s, ':', length)) && e != s) {
		/* validate scheme */
		p = s;
		while (p < e) {
			/* scheme = 1*[ lowalpha | digit | "+" | "-" | "." ] */
			// if this test fires, the colon is not a scheme delimiter; it can only be the one before a password or a port (subtle, worth rereading)
			if (!isalpha(*p) && !isdigit(*p) && *p != '+' && *p != '.' && *p != '-') {
				if (e + 1 < ue && e < s + strcspn(s, "?#")) {
					goto parse_port;
				} else if (s + 1 < ue && *s == '/' && *(s + 1) == '/') { /* relative-scheme URL */
					s += 2;
					e = 0;
					goto parse_host;
				} else {
					goto just_path;
				}
			}
			p++;
		}

		if (e + 1 == ue) { /* only scheme is available */
			ret->scheme = zend_string_init(s, (e - s), 0);
			php_replace_controlchars_ex(ZSTR_VAL(ret->scheme), ZSTR_LEN(ret->scheme));
			return ret;
		}

		/*
		 * certain schemas like mailto: and zlib: may not have any / after them
		 * this check ensures we support those.
		 */
		if (*(e+1) != '/') {
			/* check if the data we get is a port this allows us to
			 * correctly parse things like a.com:80
			 */
			p = e + 1;
			while (p < ue && isdigit(*p)) {
				p++;
			}

			if ((p == ue || *p == '/') && (p - e) < 7) {
				goto parse_port;
			}

			ret->scheme = zend_string_init(s, (e-s), 0);
			php_replace_controlchars_ex(ZSTR_VAL(ret->scheme), ZSTR_LEN(ret->scheme));

			s = e + 1;
			goto just_path;
		} else {
			ret->scheme = zend_string_init(s, (e-s), 0);
			php_replace_controlchars_ex(ZSTR_VAL(ret->scheme), ZSTR_LEN(ret->scheme));

			if (e + 2 < ue && *(e + 2) == '/') {
				s = e + 3;
				if (zend_string_equals_literal_ci(ret->scheme, "file")) {
					if (e + 3 < ue && *(e + 3) == '/') {
						/* support windows drive letters as in:
						   file:///c:/somedir/file.txt
						*/
						if (e + 5 < ue && *(e + 5) == ':') {
							s = e + 4;
						}
						goto just_path;
					}
				}
			} else {
				s = e + 1;
				goto just_path;
			}
		}
	}

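memchr is a C library function.
Prototype: void *memchr(const void *s, int c, size_t n);
Header: #include <string.h>
Purpose: searches the first n bytes of the memory area pointed to by s for the byte c.
Notes: the search stops at the first occurrence of c. On success it returns a pointer to that byte; otherwise it returns NULL.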

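Here the code looks for the first occurrence of a colon in the URL. (There is a catch: the condition contains an assignment, and the value assigned to e is still used afterwards, for example in the later else if (e) branch. This style is generally discouraged in ordinary code because it makes the control flow harder to read.)
e != s
That is, a colon was found and it is not the first character.
Next the code checks whether the text before the colon is a valid scheme.
p is set to the start of the string,
and the loop tests every character before the colon: anything other than a letter, a digit, +, . or - means the prefix is not a scheme (this is how input without a scheme is detected).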

while (p < e) {
	/* scheme = 1*[ lowalpha | digit | "+" | "-" | "." ] */
	if (!isalpha(*p) && !isdigit(*p) && *p != '+' && *p != '.' && *p != '-') {
		// rest omitted
}
}
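The validation loop can be isolated into a small predicate, sketched below. scheme_chars_ok is a hypothetical name. The cast to unsigned char is a defensive habit for the <ctype.h> functions and is an addition of this sketch; the PHP code shown here passes *p directly.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Returns 1 if every byte in [s, e) passes the scheme character test
 * (letter, digit, '+', '.', '-'), otherwise 0. Illustrative helper. */
int scheme_chars_ok(const char *s, const char *e)
{
    assert(s != NULL && e != NULL && s <= e);
    for (const char *p = s; p < e; p++) {
        unsigned char c = (unsigned char)*p;
        if (!isalpha(c) && !isdigit(c) && c != '+' && c != '.' && c != '-') {
            return 0;
        }
    }
    return 1;
}
```

An empty range returns 1 here; in the parser that case never arises because the e != s test has already excluded a leading colon.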

Then, inside that check:

if (!isalpha(*p) && !isdigit(*p) && *p != '+' && *p != '.' && *p != '-') {
				if (e + 1 < ue && e < s + strcspn(s, "?#")) {
					goto parse_port;
				} else if (s + 1 < ue && *s == '/' && *(s + 1) == '/') { /* relative-scheme URL */
					s += 2;
					e = 0;
					goto parse_host;
				} else {
					goto just_path;
				}
			}
			p++;

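The first condition uses another function.
Function: strcspn
Header: #include <string.h>
Prototype: size_t strcspn(const char *s, const char *reject);
Description: strcspn() counts the characters at the start of s that are not in reject. Put simply, if strcspn() returns n, the first n characters of s contain none of the characters in reject.
Return value: the length of the initial segment of s that contains no character from reject.
Here it returns the length of the longest prefix of the string that contains neither ? nor #.
So the condition means: the colon at e is not the last character, and it occurs before any ? or #.
In that case the colon can only be the one before a password or before a port, and the code jumps to port parsing.
Otherwise, if s starts with a double slash, s is advanced by two, e is reset, and the code jumps to host parsing.
If neither applies, the input is treated as a path and the code jumps to path parsing.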
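As a rough sketch, the condition e + 1 < ue && e < s + strcspn(s, "?#") can be tried on its own. first_colon_is_port_candidate is a hypothetical wrapper that assumes a NUL-terminated input, since strcspn relies on the terminator.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if the first ':' in s is not the last character and occurs
 * before any '?' or '#'; returns 0 otherwise or if there is no ':'.
 * Illustrative only; s must be NUL-terminated. */
int first_colon_is_port_candidate(const char *s)
{
    assert(s != NULL);
    size_t length = strlen(s);
    const char *ue = s + length;
    const char *e = memchr(s, ':', length);
    if (e == NULL) {
        return 0;
    }
    return e + 1 < ue && e < s + strcspn(s, "?#");
}
```

A colon inside the query string, as in /index.php?next=a:b, fails the second half of the test, so such input is treated as a plain path.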

Further down, the loop exits normally only when everything before the colon consists of letters, digits, +, . or -.
After the loop:

// if nothing follows the colon, only a scheme is present: store it and return
if (e + 1 == ue) { /* only scheme is available */
	ret->scheme = zend_string_init(s, (e - s), 0);
	php_replace_controlchars_ex(ZSTR_VAL(ret->scheme), ZSTR_LEN(ret->scheme));
	return ret;
}

Further down:

/*
		 * certain schemas like mailto: and zlib: may not have any / after them
		 * this check ensures we support those.
		 */
		// this branch supports URLs in which the scheme is not followed by a / character
		if (*(e+1) != '/') {
			/* check if the data we get is a port this allows us to
			 * correctly parse things like a.com:80
			 */
			p = e + 1;
			while (p < ue && isdigit(*p)) {
				p++;
			}

			if ((p == ue || *p == '/') && (p - e) < 7) {
				goto parse_port;
			}

			ret->scheme = zend_string_init(s, (e-s), 0);
			php_replace_controlchars_ex(ZSTR_VAL(ret->scheme), ZSTR_LEN(ret->scheme));

			s = e + 1;
			goto just_path;
		} else {
			ret->scheme = zend_string_init(s, (e-s), 0);
			php_replace_controlchars_ex(ZSTR_VAL(ret->scheme), ZSTR_LEN(ret->scheme));

			if (e + 2 < ue && *(e + 2) == '/') {
				s = e + 3;
				if (zend_string_equals_literal_ci(ret->scheme, "file")) {
					if (e + 3 < ue && *(e + 3) == '/') {
						/* support windows drive letters as in:
						   file:///c:/somedir/file.txt
						*/
						if (e + 5 < ue && *(e + 5) == ':') {
							s = e + 4;
						}
						goto just_path;
					}
				}
			} else {
				s = e + 1;
				goto just_path;
			}
		}

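The if branch handles input in which the character after the colon is not a /.
If the colon is followed only by digits (up to the end of the string or up to a /), the digits are taken to be a port number and the code jumps to port parsing (parse_port).
Otherwise the text before the colon is stored as the scheme and the rest is parsed as a path, which covers schemes such as mailto: and zlib:.
The else branch handles the usual form (http://xxxx): s moves past the :// and execution continues with host parsing.
It contains a special case for URLs of the form file:///.
Those, like a scheme followed by a single /, jump to path parsing.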
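The digit check that separates a.com:80 from mailto:someone can be sketched as below. colon_tail_is_port is a hypothetical helper that assumes a NUL-terminated string and looks only at the first colon. The bound (p - e) < 7 is copied from the source and allows at most five digits, plus the special case of six digits followed by a /.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Returns 1 if the text after the first ':' is a short run of digits that
 * ends at the end of the string or at a '/', mirroring the check above.
 * Note: "host:" with nothing after the colon also returns 1 here; the real
 * parser has already handled that case (e + 1 == ue) before this point. */
int colon_tail_is_port(const char *url)
{
    assert(url != NULL);
    const char *ue = url + strlen(url);
    const char *e = strchr(url, ':');
    if (e == NULL) {
        return 0;
    }
    const char *p = e + 1;
    while (p < ue && isdigit((unsigned char)*p)) {
        p++;
    }
    return (p == ue || *p == '/') && (p - e) < 7;
}
```

With this test, a.com:80 and a.com:80/x are sent to port parsing, while mailto:joe@example.com keeps mailto as its scheme.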

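Now back to the first condition:
((e = memchr(s, ':', length)) && e != s)
When it does not hold
but e is non-NULL, that is, the colon is the very first character, the else if (e) branch runs port parsing.
Port parsing is fairly simple. The part to note is: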

if (s + 1 < ue && *s == '/' && *(s + 1) == '/') { /* relative-scheme URL */
		s += 2;
}

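This follows on from the if (e + 1 < ue && e < s + strcspn(s, "?#")) check above: once a valid port has been read, a string that starts with // has those two characters skipped and execution continues into host parsing.
Everything else jumps to path parsing.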
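The port conversion itself follows a common pattern: copy at most five digits into a small buffer, convert, then range-check. The sketch below uses plain strtol where PHP uses ZEND_STRTOL, and parse_port_digits is a hypothetical name. It is also stricter than the host-side code in the full source, because it rejects non-digit bytes before converting.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Parse the bytes in [p, end) as a decimal port number.
 * Returns the port (1..65535) or -1 if the range is empty, longer than
 * five bytes, contains a non-digit, or is out of range. Illustrative only. */
long parse_port_digits(const char *p, const char *end)
{
    assert(p != NULL && end != NULL && p <= end);
    char port_buf[6];                 /* five digits plus the terminator */
    size_t n = (size_t)(end - p);
    if (n == 0 || n > 5) {
        return -1;
    }
    for (size_t i = 0; i < n; i++) {
        if (!isdigit((unsigned char)p[i])) {
            return -1;
        }
    }
    memcpy(port_buf, p, n);
    port_buf[n] = '\0';
    long port = strtol(port_buf, NULL, 10);
    return (port > 0 && port <= 65535) ? port : -1;
}
```

Copying into port_buf first is what lets the conversion work on a slice of the URL that is not itself NUL-terminated.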

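Host parsing begins by resetting e to ue.
It then moves e back to the first /, ? or # in the remaining string, if there is one (the source comment calls this a binary-safe strcspn(s, "/?#")).
This part is also fairly simple.
zend_memrchr can be regarded as Zend's version of memrchr, which finds the last occurrence of a byte.
It is used here to locate the last @ before e, which separates <user> and <password> from the host.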
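The two scans in host parsing can be sketched as follows. last_byte is a portable stand-in written for this note (memrchr is not part of standard C), and authority_end and host_start are hypothetical names, not functions from PHP.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Last occurrence of byte c in [s, s + n), or NULL. Stand-in for memrchr. */
static const char *last_byte(const char *s, int c, size_t n)
{
    while (n > 0) {
        n--;
        if ((unsigned char)s[n] == (unsigned char)c) {
            return s + n;
        }
    }
    return NULL;
}

/* First '/', '?' or '#' in [s, ue), or ue if none: the bounded search
 * described above, narrowing e after each hit. */
const char *authority_end(const char *s, const char *ue)
{
    assert(s != NULL && ue != NULL && s <= ue);
    const char *e = ue, *p;
    if ((p = memchr(s, '/', (size_t)(e - s)))) e = p;
    if ((p = memchr(s, '?', (size_t)(e - s)))) e = p;
    if ((p = memchr(s, '#', (size_t)(e - s)))) e = p;
    return e;
}

/* Start of the host: the byte after the last '@' in [s, e), or s. */
const char *host_start(const char *s, const char *e)
{
    assert(s != NULL && e != NULL && s <= e);
    const char *at = last_byte(s, '@', (size_t)(e - s));
    return at ? at + 1 : s;
}
```

Searching for the last @ rather than the first means a password that itself contains @ still ends up entirely on the user and password side.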

The rest is straightforward, so it is not covered in detail here.

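This took about two hours to write. There are probably mistakes here and there; corrections and discussion are welcome (^^▽^)
Finally, here is PHP's php_url_parse_ex source code: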

/* {{{ php_url_parse
 */
PHPAPI php_url *php_url_parse_ex(char const *str, size_t length)
{
	char port_buf[6];
	php_url *ret = ecalloc(1, sizeof(php_url));
	char const *s, *e, *p, *pp, *ue;

	s = str;
	ue = s + length;

	/* parse scheme */
	if ((e = memchr(s, ':', length)) && e != s) {
		/* validate scheme */
		p = s;
		while (p < e) {
			/* scheme = 1*[ lowalpha | digit | "+" | "-" | "." ] */
			if (!isalpha(*p) && !isdigit(*p) && *p != '+' && *p != '.' && *p != '-') {
				if (e + 1 < ue && e < s + strcspn(s, "?#")) {
					goto parse_port;
				} else if (s + 1 < ue && *s == '/' && *(s + 1) == '/') { /* relative-scheme URL */
					s += 2;
					e = 0;
					goto parse_host;
				} else {
					goto just_path;
				}
			}
			p++;
		}

		if (e + 1 == ue) { /* only scheme is available */
			ret->scheme = zend_string_init(s, (e - s), 0);
			php_replace_controlchars_ex(ZSTR_VAL(ret->scheme), ZSTR_LEN(ret->scheme));
			return ret;
		}

		/*
		 * certain schemas like mailto: and zlib: may not have any / after them
		 * this check ensures we support those.
		 */
		if (*(e+1) != '/') {
			/* check if the data we get is a port this allows us to
			 * correctly parse things like a.com:80
			 */
			p = e + 1;
			while (p < ue && isdigit(*p)) {
				p++;
			}

			if ((p == ue || *p == '/') && (p - e) < 7) {
				goto parse_port;
			}

			ret->scheme = zend_string_init(s, (e-s), 0);
			php_replace_controlchars_ex(ZSTR_VAL(ret->scheme), ZSTR_LEN(ret->scheme));

			s = e + 1;
			goto just_path;
		} else {
			ret->scheme = zend_string_init(s, (e-s), 0);
			php_replace_controlchars_ex(ZSTR_VAL(ret->scheme), ZSTR_LEN(ret->scheme));

			if (e + 2 < ue && *(e + 2) == '/') {
				s = e + 3;
				if (zend_string_equals_literal_ci(ret->scheme, "file")) {
					if (e + 3 < ue && *(e + 3) == '/') {
						/* support windows drive letters as in:
						   file:///c:/somedir/file.txt
						*/
						if (e + 5 < ue && *(e + 5) == ':') {
							s = e + 4;
						}
						goto just_path;
					}
				}
			} else {
				s = e + 1;
				goto just_path;
			}
		}
	} else if (e) { /* no scheme; starts with colon: look for port */
		parse_port:
		p = e + 1;
		pp = p;

		while (pp < ue && pp - p < 6 && isdigit(*pp)) {
			pp++;
		}

		if (pp - p > 0 && pp - p < 6 && (pp == ue || *pp == '/')) {
			zend_long port;
			memcpy(port_buf, p, (pp - p));
			port_buf[pp - p] = '\0';
			port = ZEND_STRTOL(port_buf, NULL, 10);
			if (port > 0 && port <= 65535) {
				ret->port = (unsigned short) port;
				if (s + 1 < ue && *s == '/' && *(s + 1) == '/') { /* relative-scheme URL */
				    s += 2;
				}
			} else {
				php_url_free(ret);
				return NULL;
			}
		} else if (p == pp && pp == ue) {
			php_url_free(ret);
			return NULL;
		} else if (s + 1 < ue && *s == '/' && *(s + 1) == '/') { /* relative-scheme URL */
			s += 2;
		} else {
			goto just_path;
		}
	} else if (s + 1 < ue && *s == '/' && *(s + 1) == '/') { /* relative-scheme URL */
		s += 2;
	} else {
		goto just_path;
	}

	parse_host:
	/* Binary-safe strcspn(s, "/?#") */
	e = ue;
	if ((p = memchr(s, '/', e - s))) {
		e = p;
	}
	if ((p = memchr(s, '?', e - s))) {
		e = p;
	}
	if ((p = memchr(s, '#', e - s))) {
		e = p;
	}

	/* check for login and password */
	if ((p = zend_memrchr(s, '@', (e-s)))) {
		if ((pp = memchr(s, ':', (p-s)))) {
			ret->user = zend_string_init(s, (pp-s), 0);
			php_replace_controlchars_ex(ZSTR_VAL(ret->user), ZSTR_LEN(ret->user));

			pp++;
			ret->pass = zend_string_init(pp, (p-pp), 0);
			php_replace_controlchars_ex(ZSTR_VAL(ret->pass), ZSTR_LEN(ret->pass));
		} else {
			ret->user = zend_string_init(s, (p-s), 0);
			php_replace_controlchars_ex(ZSTR_VAL(ret->user), ZSTR_LEN(ret->user));
		}

		s = p + 1;
	}

	/* check for port */
	if (s < ue && *s == '[' && *(e-1) == ']') {
		/* Short circuit portscan,
		   we're dealing with an
		   IPv6 embedded address */
		p = NULL;
	} else {
		p = zend_memrchr(s, ':', (e-s));
	}

	if (p) {
		if (!ret->port) {
			p++;
			if (e-p > 5) { /* port cannot be longer then 5 characters */
				php_url_free(ret);
				return NULL;
			} else if (e - p > 0) {
				zend_long port;
				memcpy(port_buf, p, (e - p));
				port_buf[e - p] = '\0';
				port = ZEND_STRTOL(port_buf, NULL, 10);
				if (port > 0 && port <= 65535) {
					ret->port = (unsigned short)port;
				} else {
					php_url_free(ret);
					return NULL;
				}
			}
			p--;
		}
	} else {
		p = e;
	}

	/* check if we have a valid host, if we don't reject the string as url */
	if ((p-s) < 1) {
		php_url_free(ret);
		return NULL;
	}

	ret->host = zend_string_init(s, (p-s), 0);
	php_replace_controlchars_ex(ZSTR_VAL(ret->host), ZSTR_LEN(ret->host));

	if (e == ue) {
		return ret;
	}

	s = e;

	just_path:

	e = ue;
	p = memchr(s, '#', (e - s));
	if (p) {
		p++;
		if (p < e) {
			ret->fragment = zend_string_init(p, (e - p), 0);
			php_replace_controlchars_ex(ZSTR_VAL(ret->fragment), ZSTR_LEN(ret->fragment));
		}
		e = p-1;
	}

	p = memchr(s, '?', (e - s));
	if (p) {
		p++;
		if (p < e) {
			ret->query = zend_string_init(p, (e - p), 0);
			php_replace_controlchars_ex(ZSTR_VAL(ret->query), ZSTR_LEN(ret->query));
		}
		e = p-1;
	}

	if (s < e || s == ue) {
		ret->path = zend_string_init(s, (e - s), 0);
		php_replace_controlchars_ex(ZSTR_VAL(ret->path), ZSTR_LEN(ret->path));
	}

	return ret;
}
/* }}} */