Notes on using the string-splitting functions strtok and strsep


llljjlj


Reposted from https://blog.csdn.net/astrotycoon/article/details/50813959

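Why I wrote this article

At work I have recently had to parse a lot of strings that share one trait: each consists of substrings separated by one or more delimiters, and my job is to extract those substrings.

Consider an example. Suppose a DHCP server returns information in the following format:

network address:subnet mask:default gateway:DNS address 1:DNS address 2

To keep things simple, the string uses a single delimiter, ":", which I believe is also the most common case in practice. The task is to extract the network address, the subnet mask, the default gateway, and the two DNS addresses correctly. Note that the DHCP server sometimes misbehaves and returns incomplete information, for example 192.168.6.138:255.255.255.0::202.38.64.1:114.114.114.114. Here the default gateway is missing, so the string contains two consecutive ":" delimiters with nothing, not even a space, between them.

My first idea was to use sscanf, as follows: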

const char *dhcpargs = "192.168.6.138:255.255.255.0::202.38.64.1:114.114.114.114";
char ip[32], netmask[32], gateway[32], dns[2][32];
int ret = sscanf(dhcpargs, "%[^:]:%[^:]:%[^:]:%[^:]:%s", ip, netmask, gateway, dns[0], dns[1]);
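This turned out not to work. sscanf returns 2: the network address and the subnet mask are parsed correctly, but the gateway and both DNS addresses are not. In other words, sscanf cannot parse an empty field and stops as soon as it meets one. More precisely, the %[ conversion specifier must match at least one character, so it cannot match an empty string.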
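The failure can be reproduced with a small wrapper. This is only a sketch: parse_with_sscanf is a hypothetical helper name, and the %31 field widths are an addition (not in the snippet above) that keeps each conversion inside its 32-byte buffer.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: returns how many of the five fields sscanf converted. */
int parse_with_sscanf(const char *dhcpargs)
{
    char ip[32], netmask[32], gateway[32], dns[2][32];

    assert(dhcpargs != NULL);
    /* %31[^:] reads at most 31 non-':' characters and needs at least one,
       so an empty field makes the conversion fail and sscanf stops there. */
    return sscanf(dhcpargs, "%31[^:]:%31[^:]:%31[^:]:%31[^:]:%31s",
                  ip, netmask, gateway, dns[0], dns[1]);
}
```

With the incomplete reply from above the function returns 2; with a reply whose five fields are all non-empty it returns 5.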

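I then learned that the C library offers functions for splitting strings: strtok and strsep. strtok is part of the standard C library; strsep is not (it originated in BSD), but almost every C library in use today provides it. Even the Linux kernel switched to strsep long ago and dropped strtok.

In practice I found that strtok cannot handle the case above, so I ended up choosing strsep. Along the way I noticed a number of similarities and differences between the two functions, as well as mistakes that are easy to make with them. This article uses the source code of both functions to briefly analyze how they are alike and how they differ.

Source code

The code comes from glibc-2.24. The strtok source is as follows: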

/* Copyright (C) 1991-2016 Free Software Foundation, Inc.
This file is part of the GNU C Library.
The GNU C Library is free software; you can redistribute it and/or
modify it under the terms of the GNU Lesser General Public
License as published by the Free Software Foundation; either
version 2.1 of the License, or (at your option) any later version.
The GNU C Library is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public
License along with the GNU C Library; if not, see
<http://www.gnu.org/licenses/>. */

#include <string.h>

static char *olds;

#undef strtok

#ifndef STRTOK
# define STRTOK strtok
#endif

/* Parse S into tokens separated by characters in DELIM.
If S is NULL, the last string strtok() was called with is
used. For example:
   char s[] = "-abc-=-def";
   x = strtok(s, "-");       // x = "abc"
   x = strtok(NULL, "-=");       // x = "def"
   x = strtok(NULL, "=");       // x = NULL
       // s = "abc\0=-def\0"
*/
char *
STRTOK (char *s, const char *delim)
{
  char *token;

  if (s == NULL)
    s = olds;

  /* Scan leading delimiters. */
  s += strspn (s, delim);
  if (*s == '\0')
    {
      olds = s;
      return NULL;
    }

  /* Find the end of the token. */
  token = s;
  s = strpbrk (token, delim);
  if (s == NULL)
    /* This token finishes the string. */
    olds = __rawmemchr (token, '\0');
  else
    {
      /* Terminate the token and make OLDS point past it. */
      *s = '\0';
      olds = s + 1;
    }
  return token;
}
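The overall flow of strtok is as follows:

(1) Check whether the argument s is NULL. If it is not, splitting starts at s. If it is NULL, this is not the first call, so splitting resumes from the position saved in olds by the previous call.

(2) Skip all delimiter characters at the starting position until the first non-delimiter character. strspn computes how many consecutive delimiter characters the string begins with.

(3) If the character reached after that is '\0', the string has been fully split, so the function returns NULL. Otherwise it continues.

(4) Save the start address of the substring about to be split off in token. Then call strpbrk to find the next delimiter. If strpbrk returns NULL, no delimiters remain, so olds is set to the end of the string (the terminating '\0') and the function returns token. If a delimiter is found, it is overwritten with '\0', olds is set to the character right after it, and the function returns token.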
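The effect of step (2) is easiest to see when delimiters are adjacent. The following is a minimal sketch; split_strtok is a hypothetical helper that collects the token pointers strtok hands back.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: split buf in place with strtok, store up to max
   token pointers in out, and return how many tokens were stored. */
size_t split_strtok(char *buf, const char *delim, char **out, size_t max)
{
    size_t n = 0;
    char *tok;

    assert(buf != NULL && delim != NULL && out != NULL);
    /* The first call passes the buffer; later calls pass NULL to resume
       from the position strtok saved internally. */
    for (tok = strtok(buf, delim); tok != NULL && n < max;
         tok = strtok(NULL, delim))
        out[n++] = tok;
    return n;
}
```

Called on a writable copy of "::a::b:" with ":" as the delimiter, it stores just two tokens, "a" and "b": every run of delimiters is skipped at the start of a call, so no empty token is ever produced.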

Now the strsep source:

/* Copyright (C) 1992-2016 Free Software Foundation, Inc.
This file is part of the GNU C Library.
The GNU C Library is free software; you can redistribute it and/or
modify it under the terms of the GNU Lesser General Public
License as published by the Free Software Foundation; either
version 2.1 of the License, or (at your option) any later version.
The GNU C Library is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public
License along with the GNU C Library; if not, see
<http://www.gnu.org/licenses/>. */

#include <string.h>

#undef __strsep
#undef strsep

char *
__strsep (char **stringp, const char *delim)
{
  char *begin, *end;

  begin = *stringp;
  if (begin == NULL)
    return NULL;

  /* A frequent case is when the delimiter string contains only one
     character. Here we don't need to call the expensive `strpbrk'
     function and instead work using `strchr'. */
  if (delim[0] == '\0' || delim[1] == '\0')
    {
      char ch = delim[0];

      if (ch == '\0')
        end = NULL;
      else
        {
          if (*begin == ch)
            end = begin;
          else if (*begin == '\0')
            end = NULL;
          else
            end = strchr (begin + 1, ch);
        }
    }
  else
    /* Find the end of the token. */
    end = strpbrk (begin, delim);

  if (end)
    {
      /* Terminate the token and set *STRINGP past NUL character. */
      *end++ = '\0';
      *stringp = end;
    }
  else
    /* No more delimiters; this is the last token. */
    *stringp = NULL;

  return begin;
}
weak_alias (__strsep, strsep)
strong_alias (__strsep, __strsep_g)
libc_hidden_def (__strsep_g)
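Personally I think the strsep source is better written than the strtok source, mainly in the naming of the local variables, which is simple and clear. The overall flow of strsep is as follows:

(1) begin is set to *stringp, the start address of the string to split. If that address is NULL, the function does nothing and returns NULL. Otherwise it continues.

(2) Find the position of the first delimiter in the remaining characters. If there is none, set *stringp to NULL and return begin. Otherwise overwrite the delimiter with '\0', make *stringp point to the character right after it, and return begin.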
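Because strsep is not part of ISO C, the two steps above are sometimes re-created by hand. Here is a simplified sketch of that flow; my_strsep is a hypothetical name, it is not the glibc code, and it leaves out glibc's single-character fast path.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified sketch of the strsep flow; not the glibc implementation. */
char *my_strsep(char **stringp, const char *delim)
{
    char *begin, *end;

    assert(stringp != NULL && delim != NULL);
    begin = *stringp;
    if (begin == NULL)
        return NULL;                 /* step 1: nothing left to split */

    end = strpbrk(begin, delim);     /* step 2: first delimiter, if any */
    if (end != NULL) {
        *end = '\0';                 /* terminate the token in place */
        *stringp = end + 1;          /* resume right after the delimiter */
    } else {
        *stringp = NULL;             /* this was the last token */
    }
    return begin;
}
```

On a writable copy of "a::b" with ":" as the delimiter, successive calls return "a", an empty string, "b", and finally NULL.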

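Similarities

Having read the source, you have probably already spotted some things the two functions have in common. To summarize:

(1) Both functions modify the original string, so it must never live in read-only memory (the usual culprit is a string literal); otherwise the program will typically crash with a segmentation fault at run time. The string can be a character array (in the .data segment or on the stack) or a string held in dynamically allocated memory (the heap).

(2) If no delimiter is found in a non-empty string, both functions return the start address of the string that was passed in.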
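For point (1), one common way to stay safe is to split a private copy rather than the original. A minimal sketch, with writable_copy as a hypothetical helper name:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a heap copy of a (possibly read-only) string
   that is safe to pass to strtok or strsep. The caller frees the result. */
char *writable_copy(const char *src)
{
    size_t len;
    char *copy;

    assert(src != NULL);
    len = strlen(src) + 1;           /* include the terminating '\0' */
    copy = malloc(len);
    if (copy != NULL)
        memcpy(copy, src, len);
    return copy;                     /* NULL if the allocation failed */
}
```

Declaring the data as an array, as in char buf[] = "a:b";, achieves the same thing without an allocation, because the array is initialized with its own writable copy of the literal.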

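Differences

(1) strtok keeps its position in a static variable, which makes it neither thread-safe nor reentrant. strsep replaces that hidden pointer with a pointer-to-pointer argument supplied by the caller, so it keeps no hidden state and can be used safely from several threads working on different strings. glibc also provides strtok_r (a POSIX function), which achieves the same thing by taking a pointer-to-pointer argument. When portability to non-POSIX systems is not a concern, prefer strtok_r over strtok.

(2) When scanning a string, strtok skips leading delimiters, whereas strsep does not: it overwrites the delimiter with '\0' and returns an empty string. strtok therefore has only two kinds of return value, the address of a non-empty string or NULL, while strsep has three: the address of an empty string, the address of a non-empty string, or NULL. This point is especially important. It is what makes strtok and strsep behave differently, and it is the most confusing part.

(3) Calling convention: the first call to strtok must pass the start address of the string to split as its first argument, and every subsequent call must pass NULL. strsep is called the same way every time.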
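Difference (1) matters even in single-threaded code, for example when one tokenization is nested inside another. The sketch below assumes a POSIX system that provides strtok_r; count_pairs and the "key=value;key=value" input format are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L  /* strtok_r is POSIX, not ISO C */
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count well-formed "key=value" items in a
   ';'-separated list, splitting each item while the outer split is
   still in progress. */
size_t count_pairs(char *buf)
{
    size_t pairs = 0;
    char *outer_state, *inner_state;
    char *item;

    assert(buf != NULL);
    for (item = strtok_r(buf, ";", &outer_state); item != NULL;
         item = strtok_r(NULL, ";", &outer_state)) {
        /* The inner split has its own state, so it cannot clobber the
           position of the outer split the way plain strtok would. */
        char *key = strtok_r(item, "=", &inner_state);
        char *value = strtok_r(NULL, "=", &inner_state);
        if (key != NULL && value != NULL)
            pairs++;
    }
    return pairs;
}
```

For a writable buffer holding "a=1;b=2;c" the function returns 2, since the last item has no value.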

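A practical example

First, why did I say at the start that I ended up choosing strsep? By difference (2) above, strsep returns an empty string when it reaches the gateway field, which is exactly what I want: an empty field tells me that the DHCP server misbehaved. strtok, by contrast, skips straight ahead and parses the DNS address, so the DNS address is mistaken for the gateway, which is not at all the result I want.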
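Applied to the DHCP reply from the beginning of the article, this could look like the sketch below. It assumes a C library that provides strsep; parse_dhcp and DHCP_FIELDS are hypothetical names introduced here.

```c
#define _DEFAULT_SOURCE  /* exposes strsep on glibc; it is not ISO C */
#include <assert.h>
#include <string.h>

enum { DHCP_FIELDS = 5 };  /* address, netmask, gateway, DNS 1, DNS 2 */

/* Hypothetical helper: split reply in place on ':' and store up to
   DHCP_FIELDS field pointers. A missing field is stored as an empty
   string. Returns the number of fields stored. */
int parse_dhcp(char *reply, char *fields[DHCP_FIELDS])
{
    int n = 0;
    char *cursor = reply;
    char *tok;

    assert(reply != NULL && fields != NULL);
    while (n < DHCP_FIELDS && (tok = strsep(&cursor, ":")) != NULL)
        fields[n++] = tok;
    return n;
}
```

For the incomplete reply the function returns 5, fields[2] is the empty string (so fields[2][0] == '\0' signals the missing gateway), and fields[3] is still "202.38.64.1".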

Now for a simple and fairly typical example. Its purpose is to show how differently the two functions scan a string.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, const char *argv[])
{
    char buf[] = "abadcbaf";
    char *result;

#if 0
    for (result = strtok(buf, "ab"); result != NULL; result = strtok(NULL, "ab")) {
        printf("result = %s\n", result);
    }
#else
    char *pbuf = buf;
    while ((result = strsep(&pbuf, "ab")) != NULL) {
        printf("result = %s\n", result);
    }
#endif

    exit(EXIT_SUCCESS);
}
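Next I trace the scanning process of both functions step by step. You will see that the number of scans differs.

How strtok processes the string:

(1) Initial state

(2) After the first split (the three leading delimiters are skipped, the first delimiter found after them, b, is overwritten with '\0', olds points to the following character a, and the string "dc" is returned)

(3) After the second split (the delimiter a is skipped, olds points to the end of the string, and the string "f" is returned)

(4) After the last (third) call (the function returns NULL and olds points to the end of the string)

Now the strsep process:

(1) Initial state

(2) First split (the character a is overwritten with '\0', *stringp points to the character b, and an empty string is returned)

(3) Second split (the character b is overwritten with '\0', *stringp points to the following character a, and an empty string is returned)

(4) Third split (the character a is overwritten with '\0', *stringp points to the following character d, and an empty string is returned)

(5) Fourth split (the character b is overwritten with '\0', *stringp points to the following character a, and the string "dc" is returned)

(6) Fifth split (the character a is overwritten with '\0', *stringp points to the following character f, and an empty string is returned)

(7) Sixth split (*stringp is set to NULL and the string "f" is returned)

(8) Seventh call (*stringp is NULL, so nothing is done and the function returns NULL)

The comparison shows that strtok actually performed 2 splits, while strsep performed 6, four of which produced an empty string.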
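The two counts can be checked mechanically. The sketch below assumes a C library that provides strsep; count_strtok and count_strsep are hypothetical helpers that count the non-NULL results of each function.

```c
#define _DEFAULT_SOURCE  /* exposes strsep on glibc; it is not ISO C */
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of tokens strtok yields for buf. */
size_t count_strtok(char *buf, const char *delim)
{
    size_t n = 0;
    char *tok;

    assert(buf != NULL && delim != NULL);
    for (tok = strtok(buf, delim); tok != NULL; tok = strtok(NULL, delim))
        n++;
    return n;
}

/* Hypothetical helper: number of tokens strsep yields for buf,
   empty tokens included. */
size_t count_strsep(char *buf, const char *delim)
{
    size_t n = 0;
    char *cursor = buf;

    assert(delim != NULL);
    while (strsep(&cursor, delim) != NULL)
        n++;
    return n;
}
```

Each function needs its own writable copy of "abadcbaf", because both overwrite delimiters in place. With "ab" as the delimiter set, count_strtok gives 2 and count_strsep gives 6.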

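Summary

Plenty of articles online explain strtok and strsep, and I have found that many of them contain errors. The root cause is that they do not analyze the source code and sometimes even rely on guesswork, so their understanding of the functions is incomplete. Since the source of strtok and strsep is readily available, it is better to read it directly: that is more satisfying and leaves no ambiguity.

References: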

《Linux C函数strtok解析》

《关于函数strtok和strtok_r的使用要点和实现原理（一）》

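《What are the differences between strtok and strsep in C》

Addendum:

Today I took a quick look at the strtok implementation that ships with VS and found that it differs somewhat from the GNU implementation, so here is a brief analysis.

First, the source: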

/***
*strtok.c - tokenize a string with given delimiters
*
* Copyright (c) Microsoft Corporation. All rights reserved.
*
*Purpose:
* defines strtok() - breaks string into series of token
* via repeated calls.
*
*******************************************************************************/

#include <cruntime.h>
#include <string.h>
#ifdef _SECURE_VERSION
#include <internal.h>
#else /* _SECURE_VERSION */
#include <mtdll.h>
#endif /* _SECURE_VERSION */

/***
*char *strtok(string, control) - tokenize string with delimiter in control
*
*Purpose:
* strtok considers the string to consist of a sequence of zero or more
* text tokens separated by spans of one or more control chars. the first
* call, with string specified, returns a pointer to the first char of the
* first token, and will write a null char into string immediately
* following the returned token. subsequent calls with zero for the first
* argument (string) will work thru the string until no tokens remain. the
* control string may be different from call to call. when no tokens remain
* in string a NULL pointer is returned. remember the control chars with a
* bit map, one bit per ascii char. the null char is always a control char.
*
*Entry:
* char *string - string to tokenize, or NULL to get next token
* char *control - string of characters to use as delimiters
*
*Exit:
* returns pointer to first token in string, or if string
* was NULL, to next token
* returns NULL when no more tokens remain.
*
*Uses:
*
*Exceptions:
*
*******************************************************************************/

#ifdef _SECURE_VERSION
#define _TOKEN *context
#else /* _SECURE_VERSION */
#define _TOKEN ptd->_token
#endif /* _SECURE_VERSION */

#ifdef _SECURE_VERSION
char * __cdecl strtok_s (
char * string,
const char * control,
char ** context
)
#else /* _SECURE_VERSION */
char * __cdecl strtok (
char * string,
const char * control
)
#endif /* _SECURE_VERSION */
{
        unsigned char *str;
        const unsigned char *ctrl = control;

        unsigned char map[32];
        int count;

#ifdef _SECURE_VERSION

        /* validation section */
        _VALIDATE_RETURN(context != NULL, EINVAL, NULL);
        _VALIDATE_RETURN(string != NULL || *context != NULL, EINVAL, NULL);
        _VALIDATE_RETURN(control != NULL, EINVAL, NULL);

        /* no static storage is needed for the secure version */

#else /* _SECURE_VERSION */

        _ptiddata ptd = _getptd();

#endif /* _SECURE_VERSION */

        /* Clear control map */
        for (count = 0; count < 32; count++)
                map[count] = 0;

        /* Set bits in delimiter table */
        do {
                map[*ctrl >> 3] |= (1 << (*ctrl & 7));
        } while (*ctrl++);

        /* Initialize str */

        /* If string is NULL, set str to the saved
         * pointer (i.e., continue breaking tokens out of the string
         * from the last strtok call) */
        if (string)
                str = string;
        else
                str = _TOKEN;

        /* Find beginning of token (skip over leading delimiters). Note that
         * there is no token iff this loop sets str to point to the terminal
         * null (*str == '\0') */
        while ( (map[*str >> 3] & (1 << (*str & 7))) && *str )
                str++;

        string = str;

        /* Find the end of the token. If it is not the end of the string,
         * put a null there. */
        for ( ; *str ; str++ )
                if ( map[*str >> 3] & (1 << (*str & 7)) ) {
                        *str++ = '\0';
                        break;
                }

        /* Update nextoken (or the corresponding field in the per-thread data
         * structure */
        _TOKEN = str;

        /* Determine if a token has been found. */
        if ( string == str )
                return NULL;
        else
                return string;
}
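The most puzzling part is probably the array map. What is it for, and why is its size 32?

It is easy to see that map records the delimiters, but it is not a plain list. Strictly speaking it is a bit set: each byte has 8 bits, so 32*8 gives 256 bits, one for every possible unsigned char value, which covers all of ASCII.

Each element of map therefore covers 8 character codes. If a character is one of the delimiters, its bit is set to 1; otherwise the bit is 0.

Take a concrete example. In the ASCII table the characters @ through G have the decimal values 64 to 71, which map exactly onto the ninth element of the array, map[8]. In other words, each bit of the byte map[8] records whether the corresponding character from @ to G is among the delimiters.

That makes things clearer. Shifting right by 3 is the same as dividing by 8, and & 7 is the same as taking the remainder modulo 8.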

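/* Set bits in delimiter table */
do {
        map[*ctrl >> 3] |= (1 << (*ctrl & 7));
} while (*ctrl++);
This code maps the delimiters into the map array. Because it is a do-while loop, the terminating '\0' of control is recorded as well, which matches the remark in the header comment that the null char is always a control char.
/* Find beginning of token (skip over leading delimiters). Note that
 * there is no token iff this loop sets str to point to the terminal
 * null (*str == '\0') */
while ( (map[*str >> 3] & (1 << (*str & 7))) && *str )
        str++;
This code skips the leading delimiters. The extra && *str test stops the loop at the end of the string, since '\0' itself is marked in map.
That should be reasonably clear by now.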
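The bit-set idea can also be tried in isolation. The following is a minimal sketch, not the Microsoft code: the charset type and the charset_init and charset_has functions are hypothetical names, and unlike the do-while loop above this version does not mark the terminating '\0'.

```c
#include <assert.h>
#include <string.h>

/* 256-bit set: one bit per possible unsigned char value. */
typedef struct { unsigned char bits[32]; } charset;

/* Mark every character of chars as a member of the set. */
void charset_init(charset *set, const char *chars)
{
    const unsigned char *p = (const unsigned char *)chars;

    assert(set != NULL && chars != NULL);
    memset(set->bits, 0, sizeof set->bits);
    for (; *p != '\0'; p++)
        /* c >> 3 selects the byte (c / 8), c & 7 the bit (c % 8). */
        set->bits[*p >> 3] |= (unsigned char)(1u << (*p & 7));
}

/* Return 1 if c is in the set, 0 otherwise. */
int charset_has(const charset *set, unsigned char c)
{
    return (set->bits[c >> 3] >> (c & 7)) & 1;
}
```

After charset_init(&s, "@G"), both characters land in s.bits[8]: '@' (64) sets bit 0 and 'G' (71) sets bit 7, so that byte holds 0x81 and every other byte stays 0.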

References:

《 strtok源码剖析位操作与空间压缩》

《杭电水题--排序关于strtok的一些问题》
---------------------
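Author: astrotycoon
Source: CSDN
Original: https://blog.csdn.net/astrotycoon/article/details/50813959
Copyright notice: this is an original article by the author; please include a link to the post when reposting.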