Lua Source Code: Strings

-沧海流云-

Article link: https://blog.csdn.net/wyk223344/article/details/106959363

Lua version: 5.3.5

Data Structures

Lua strings come in two variants, short strings and long strings:

/* Variant tags for strings */
#define LUA_TSHRSTR	(LUA_TSTRING | (0 << 4))  /* short strings */
#define LUA_TLNGSTR	(LUA_TSTRING | (1 << 4))  /* long strings */
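
To make the bit layout of these tags concrete, here is a minimal sketch. It assumes LUA_TSTRING is 4, as in Lua 5.3's lua.h; the helper functions base_type, variant_bits and is_long_string are hypothetical names used only for illustration (the real source uses macros instead).

```c
#include <assert.h>

/* Assumption: LUA_TSTRING is 4, as in Lua 5.3's lua.h. */
#define LUA_TSTRING 4
#define LUA_TSHRSTR (LUA_TSTRING | (0 << 4))  /* evaluates to 4 */
#define LUA_TLNGSTR (LUA_TSTRING | (1 << 4))  /* evaluates to 20 */

/* Hypothetical helpers, not part of the Lua source. */

/* The low 4 bits carry the basic type. */
static int base_type(int tt) { return tt & 0x0F; }

/* Bits 4-5 carry the variant (0 = short string, 1 = long string). */
static int variant_bits(int tt) { return (tt >> 4) & 0x03; }

static int is_long_string(int tt) { return tt == LUA_TLNGSTR; }
```

Both variants report the same basic type, so code that only cares whether a value is a string can mask off the variant bits, while the string implementation can still tell the two kinds apart.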

The string struct is defined as follows:

/*
** Common Header for all collectable objects (in macro form, to be
** included in other objects)
*/
#define CommonHeader	GCObject *next; lu_byte tt; lu_byte marked

/*
** Header for string value; string bytes follow the end of this structure
** (aligned according to 'UTString'; see next).
*/
typedef struct TString {
  CommonHeader;
  lu_byte extra;  /* reserved words for short strings; "has hash" for longs */
  lu_byte shrlen;  /* length for short strings */
  unsigned int hash;
  union {
    size_t lnglen;  /* length for long strings */
    struct TString *hnext;  /* linked list for hash table */
  } u;
} TString;

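CommonHeader: the fields used by the GC.
TString: the string struct.
- extra: for short strings, a nonzero value marks a reserved word (keywords such as and and or, which are never collected); for long strings, it records whether the hash has already been computed.
- shrlen: the length of a short string.
- hash: a short string's hash is computed when the string is created, and that hash determines its bucket in stringtable, a hash table with separate chaining. A long string is stored on its own, and its hash is not computed up front but only when it is first needed.
- union {lnglen, hnext}: a union. For a short string it holds hnext, a pointer to the next TString in the same hash bucket; for a long string it holds lnglen, the length of the string.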
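
The layout described above (a header followed directly by the string bytes, with a union whose meaning depends on the variant) can be sketched roughly as follows. This is a simplified stand-in and not the real TString: MiniString, mini_getstr, mini_len and mini_new are hypothetical names, the GC header is omitted, and the 40-byte short-string limit is assumed from Lua 5.3's default LUAI_MAXSHORTLEN.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: 40 is Lua 5.3's default LUAI_MAXSHORTLEN. */
#define MINI_MAXSHORTLEN 40

/* Simplified stand-in for TString (no GC header, illustrative names). */
typedef struct MiniString {
  unsigned char is_long;  /* stands in for the variant tag in tt */
  unsigned char shrlen;   /* length, used only by short strings */
  union {
    size_t lnglen;             /* length, used only by long strings */
    struct MiniString *hnext;  /* bucket chain link, short strings only */
  } u;
} MiniString;

/* The bytes start right after the header, in the spirit of getstr(). */
static char *mini_getstr(MiniString *s) { return (char *)(s + 1); }

/* Read the length from whichever field is valid for this variant. */
static size_t mini_len(const MiniString *s) {
  return s->is_long ? s->u.lnglen : s->shrlen;
}

/* Allocate header and bytes in one block and copy the contents in. */
static MiniString *mini_new(const char *src, size_t len) {
  MiniString *s = malloc(sizeof(MiniString) + len + 1);
  if (s == NULL) return NULL;
  s->is_long = (len > MINI_MAXSHORTLEN);
  s->shrlen = 0;
  if (s->is_long) {
    s->u.lnglen = len;
  } else {
    s->shrlen = (unsigned char)len;
    s->u.hnext = NULL;
  }
  memcpy(mini_getstr(s), src, len);
  mini_getstr(s)[len] = '\0';
  return s;
}
```

Because a string is either short or long for its whole lifetime, the two union members are never needed at the same time, which is what lets them share storage.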

The global hash table that stores short strings, which is resized dynamically:

typedef struct stringtable {
  TString **hash;
  int nuse;  /* number of elements */
  int size;
} stringtable;

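hash: the bucket array that stores the short strings.
nuse: the number of elements currently in the table.
size: the length of the bucket array.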
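
The interning code in this article picks a bucket with lmod(h, size), which in Lua 5.3 relies on size being a power of two. A minimal sketch of that index computation, using the hypothetical name bucket_index rather than the real macro:

```c
#include <assert.h>

/* Hypothetical helper mirroring the idea behind lmod: when size is a
   power of two, h % size equals h & (size - 1). */
static unsigned int bucket_index(unsigned int h, int size) {
  assert(size > 0 && (size & (size - 1)) == 0);  /* power of two only */
  return h & (unsigned int)(size - 1);
}
```

Masking avoids a division, and doubling or halving the table keeps the size a power of two, so the same trick keeps working after every resize.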

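Implementation

String Comparison

Each short string is stored only once, so two short strings are equal exactly when their addresses are equal.
Long strings are compared with the following function: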

/*
** equality for long strings
*/
int luaS_eqlngstr (TString *a, TString *b) {
  size_t len = a->u.lnglen;
  lua_assert(a->tt == LUA_TLNGSTR && b->tt == LUA_TLNGSTR);
  return (a == b) ||  /* same instance or... */
    ((len == b->u.lnglen) &&  /* equal length and ... */
     (memcmp(getstr(a), getstr(b), len) == 0));  /* equal contents */
}

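The function first checks whether the two addresses are the same; if they are, the strings are certainly equal. Next it compares the lengths, since strings of different lengths cannot be equal. Only then does it compare the contents byte by byte with memcmp.

String Hashing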
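
The same three-stage check can be tried outside the Lua code base. The sketch below assumes a simplified LongStr type (a length plus a byte pointer) in place of TString; LongStr and longstr_equal are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative stand-in for a long string: a length plus its bytes. */
typedef struct LongStr {
  size_t len;
  const char *bytes;
} LongStr;

/* Cheapest checks first: identity, then length, then contents. */
static int longstr_equal(const LongStr *a, const LongStr *b) {
  if (a == b) return 1;            /* same instance */
  if (a->len != b->len) return 0;  /* different lengths */
  return memcmp(a->bytes, b->bytes, a->len) == 0;  /* byte comparison */
}
```

Since the length is stored explicitly and memcmp is used instead of strcmp, strings with embedded zero bytes compare correctly.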

unsigned int luaS_hash (const char *str, size_t l, unsigned int seed) {
  unsigned int h = seed ^ cast(unsigned int, l);
  size_t step = (l >> LUAI_HASHLIMIT) + 1;
  for (; l >= step; l -= step)
    h ^= ((h<<5) + (h>>2) + cast_byte(str[l - 1]));
  return h;
}

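The hash mixes in a random seed, which makes Hash DoS attacks harder. It also uses a step so that long inputs hash quickly: the loop walks backward from the end of the string and visits only every step-th byte.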
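
To see what the step does in practice, the function can be reproduced as a standalone sketch. It assumes LUAI_HASHLIMIT is 5, the default in Lua 5.3's lstring.c; str_hash and bytes_sampled are hypothetical names, and bytes_sampled only counts loop iterations.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 5 is Lua 5.3's default LUAI_HASHLIMIT. */
#define HASHLIMIT 5

/* Standalone copy of the hashing loop, with plain C casts. */
static unsigned int str_hash(const char *str, size_t l, unsigned int seed) {
  unsigned int h = seed ^ (unsigned int)l;
  size_t step = (l >> HASHLIMIT) + 1;
  for (; l >= step; l -= step)
    h ^= ((h << 5) + (h >> 2) + (unsigned char)str[l - 1]);
  return h;
}

/* Counts how many bytes the loop above would read for length l. */
static size_t bytes_sampled(size_t l) {
  size_t step = (l >> HASHLIMIT) + 1;
  size_t n = 0;
  for (; l >= step; l -= step)
    n++;
  return n;
}
```

Under that assumption, strings shorter than 32 bytes have every byte hashed, while a 1000-byte string uses a step of 32 and reads only 31 bytes. The trade-off is that two long strings differing only in skipped bytes get the same hash.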

unsigned int luaS_hashlongstr (TString *ts) {
  lua_assert(ts->tt == LUA_TLNGSTR);
  if (ts->extra == 0) {  /* no hash? */
    ts->hash = luaS_hash(getstr(ts), ts->u.lnglen, ts->hash);
    ts->extra = 1;  /* now it has its hash */
  }
  return ts->hash;
}

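A long string is not hashed when it is created; the hash is computed only when this function is called (currently that happens only when a long string is used as a table key). The extra field records whether the hash has been computed. In addition, when a long string is created its hash field is initialized to the global seed, which is why ts->hash is passed as the seed argument here.

Short String Interning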
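
The compute-once pattern can be sketched in isolation. The example assumes a simplified LazyStr type and a stand-in hash function; LazyStr, toy_hash, lazy_hash and the hash_calls counter are hypothetical and exist only to show that the hash runs a single time.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for a long string with a cached hash. */
typedef struct LazyStr {
  const char *bytes;
  size_t len;
  unsigned int hash;    /* holds the seed until the hash is computed */
  unsigned char extra;  /* 0 = not hashed yet, 1 = hash is valid */
} LazyStr;

static int hash_calls = 0;  /* counts how often the hash really runs */

/* Stand-in hash that visits every byte (no step, for brevity). */
static unsigned int toy_hash(const char *s, size_t l, unsigned int seed) {
  unsigned int h = seed ^ (unsigned int)l;
  hash_calls++;
  for (; l > 0; l--)
    h ^= ((h << 5) + (h >> 2) + (unsigned char)s[l - 1]);
  return h;
}

/* Compute the hash on first use, then serve the cached value. */
static unsigned int lazy_hash(LazyStr *s) {
  if (s->extra == 0) {
    s->hash = toy_hash(s->bytes, s->len, s->hash);  /* seed lives in hash */
    s->extra = 1;
  }
  return s->hash;
}
```

Storing the seed in the hash field until it is needed means no extra field is required, and a long string that is never used as a table key never pays for hashing.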

/*
** checks whether short string exists and reuses it or creates a new one
*/
static TString *internshrstr (lua_State *L, const char *str, size_t l) {
  TString *ts;
  global_State *g = G(L);
  unsigned int h = luaS_hash(str, l, g->seed);
  TString **list = &g->strt.hash[lmod(h, g->strt.size)];
  lua_assert(str != NULL);  /* otherwise 'memcmp'/'memcpy' are undefined */
  for (ts = *list; ts != NULL; ts = ts->u.hnext) {
    if (l == ts->shrlen &&
        (memcmp(str, getstr(ts), l * sizeof(char)) == 0)) {
      /* found! */
      if (isdead(g, ts))  /* dead (but not collected yet)? */
        changewhite(ts);  /* resurrect it */
      return ts;
    }
  }
  if (g->strt.nuse >= g->strt.size && g->strt.size <= MAX_INT/2) {
    luaS_resize(L, g->strt.size * 2);
    list = &g->strt.hash[lmod(h, g->strt.size)];  /* recompute with new size */
  }
  ts = createstrobj(L, l, LUA_TSHRSTR, h);
  memcpy(getstr(ts), str, l * sizeof(char));
  ts->shrlen = cast_byte(l);
  ts->u.hnext = *list;
  *list = ts;
  g->strt.nuse++;
  return ts;
}

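This function creates or fetches a short string. It proceeds as follows:

1. Call luaS_hash to compute the hash.
2. Search stringtable for an identical string. If one exists, return it (resurrecting it first if it is dead but not yet collected); otherwise continue.
3. Check the size of stringtable: if the number of elements is greater than or equal to the array length, and the array length is no more than half of MAX_INT, resize the table to twice its size.
4. Create and initialize the short string struct, then link it into the hash table.
5. Update the element count of the hash table.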
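
The steps above can be sketched as a small standalone intern table. This is a simplified illustration and not Lua's implementation: Node, Table, toy_hash, table_init, table_resize and intern are hypothetical names, plain malloc replaces Lua's allocator, and the GC resurrection step, the seed and the MAX_INT guard are left out.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct Node {
  struct Node *hnext;  /* next string in the same bucket */
  unsigned int hash;
  size_t len;
  char *bytes;
} Node;

typedef struct Table {
  Node **buckets;
  int nuse;  /* number of interned strings */
  int size;  /* bucket count, must be a power of two */
} Table;

static unsigned int toy_hash(const char *s, size_t l) {
  unsigned int h = (unsigned int)l;
  for (; l > 0; l--)
    h ^= ((h << 5) + (h >> 2) + (unsigned char)s[l - 1]);
  return h;
}

static int table_init(Table *t, int size) {
  t->buckets = calloc((size_t)size, sizeof(Node *));
  t->nuse = 0;
  t->size = (t->buckets != NULL) ? size : 0;
  return t->buckets != NULL;
}

/* Move every node into a fresh bucket array of 'newsize' slots. */
static int table_resize(Table *t, int newsize) {
  Node **nb = calloc((size_t)newsize, sizeof(Node *));
  if (nb == NULL) return 0;
  for (int i = 0; i < t->size; i++) {
    Node *n = t->buckets[i];
    while (n != NULL) {
      Node *next = n->hnext;
      unsigned int idx = n->hash & (unsigned int)(newsize - 1);
      n->hnext = nb[idx];
      nb[idx] = n;
      n = next;
    }
  }
  free(t->buckets);
  t->buckets = nb;
  t->size = newsize;
  return 1;
}

static Node *intern(Table *t, const char *s, size_t l) {
  unsigned int h = toy_hash(s, l);                      /* step 1 */
  Node *n = t->buckets[h & (unsigned int)(t->size - 1)];
  for (; n != NULL; n = n->hnext)                       /* step 2 */
    if (n->len == l && memcmp(n->bytes, s, l) == 0)
      return n;  /* already interned: reuse it */
  if (t->nuse >= t->size)                               /* step 3 */
    table_resize(t, t->size * 2);  /* on failure the old size is kept */
  n = malloc(sizeof(Node));                             /* step 4 */
  if (n == NULL) return NULL;
  n->bytes = malloc(l + 1);
  if (n->bytes == NULL) { free(n); return NULL; }
  memcpy(n->bytes, s, l);
  n->bytes[l] = '\0';
  n->len = l;
  n->hash = h;
  Node **list = &t->buckets[h & (unsigned int)(t->size - 1)];
  n->hnext = *list;  /* push onto the front of the bucket chain */
  *list = n;
  t->nuse++;                                            /* step 5 */
  return n;
}
```

Interning the same bytes twice returns the same Node pointer, which is what makes pointer comparison sufficient for short-string equality. Note that the bucket is recomputed after a resize, just as the original code recomputes list.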

/*
** If possible, shrink string table
*/
static void checkSizes (lua_State *L, global_State *g) {
  if (g->gckind != KGC_EMERGENCY) {
    l_mem olddebt = g->GCdebt;
    if (g->strt.nuse < g->strt.size / 4)  /* string table too big? */
      luaS_resize(L, g->strt.size / 2);  /* shrink it a little */
    g->GCestimate += g->GCdebt - olddebt;  /* update estimate */
  }
}

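During a GC cycle (other than an emergency collection), if the number of elements is less than a quarter of the bucket array size, the hash table is shrunk to half its size.

Summary

Short strings:
- Stored in stringtable, a hash table with separate chaining; identical strings are stored only once.
- The hash is computed when the string is created.
- stringtable adjusts its size dynamically.
Long strings:
- Several copies of the same long string may exist.
- The hash is not computed at creation, only when it is first needed.
Other:
- Hashing uses a random seed to reduce exposure to Hash DoS attacks.
- Hashing uses a step to speed up the computation.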
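
The growth rule from internshrstr and the shrink rule from checkSizes can be placed side by side as two small functions. This is only a sketch of the thresholds: size_before_insert and size_after_gc are hypothetical names, and INT_MAX is assumed to stand in for Lua's MAX_INT.

```c
#include <assert.h>
#include <limits.h>

/* Growth rule, applied when a new short string is about to be added. */
static int size_before_insert(int nuse, int size) {
  if (nuse >= size && size <= INT_MAX / 2)
    return size * 2;  /* table is full: double it */
  return size;
}

/* Shrink rule, applied during a non-emergency GC cycle. */
static int size_after_gc(int nuse, int size) {
  if (nuse < size / 4)
    return size / 2;  /* table is mostly empty: halve it */
  return size;
}
```

One consequence of the two thresholds is that a resize in one direction is not immediately undone by the other: after halving, the table is less than half full, so it does not grow right away, and after doubling it is at least half full, so it does not shrink right away.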

