How Lua's Data Structures Work

popcorn丶

Article link: https://blog.csdn.net/qq_17347313/article/details/100698348

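I have been developing games for several years and have always wondered how Lua lets a variable hold a number one moment and a string the next, and how those values are actually represented underneath. After recently reading through the Lua source code, I am writing down what I learned. The excerpts below appear to come from the Lua 5.1 source tree.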

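This article covers the following topics:

How Lua implements its generic value representation
How strings are represented in Lua
How tables are represented in Lua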

How Lua implements its generic value representation:

/* Type tags for all of Lua's basic types (defined in lua.h) */
#define LUA_TNONE		(-1)

#define LUA_TNIL		0
#define LUA_TBOOLEAN		1
#define LUA_TLIGHTUSERDATA	2
#define LUA_TNUMBER		3
#define LUA_TSTRING		4
#define LUA_TTABLE		5
#define LUA_TFUNCTION		6
#define LUA_TUSERDATA		7
#define LUA_TTHREAD		8

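As you can see, the types from LUA_TSTRING onward (strings, tables, functions, userdata, threads) are not plain value types; they are managed by Lua's GC. GC is not the focus here and will get its own article later, but a brief note is still worthwhile.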

#define CommonHeader	GCObject *next; lu_byte tt; lu_byte marked
typedef struct GCheader {
  CommonHeader;
} GCheader;

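This is the header shared by every GC-managed object. The next pointer links the object into a GC list. tt is a lu_byte (unsigned char) holding the object's type tag; an unsigned type is used so that adding more type tags later causes no sign problems. marked holds the GC mark bits; see Lua's tri-color marking algorithm.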

Next, the generic representation Lua uses for values of every type:

/* generic value, lobject.h */
typedef struct lua_TValue {
  TValuefields;
} TValue;

/* TValuefields, lobject.h */
#define TValuefields	Value value; int tt

/* Value, lobject.h */
typedef union {
  GCObject *gc;
  void *p;
  lua_Number n;
  int b;
} Value;

/* GCObject, lstate.h */
union GCObject {
  GCheader gch;
  union TString ts;
  union Udata u;
  union Closure cl;
  struct Table h;
  struct Proto p;
  struct UpVal uv;
  struct lua_State th;  /* thread */
};

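As you can see, the Lua source combines union and struct to represent a generic value in a memory-efficient way. The original post illustrated this with a figure taken from the book 《Lua设计与实现》 ("Lua Design and Implementation").

If you have read carefully up to this point, the generic value representation should now be clear.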
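The tag-plus-union idea can be tried out in isolation. The following is a minimal sketch, not Lua's actual code: the names MiniValue, MiniTValue, the T_* tags and the helper functions are all made up for illustration, and the "collectable" case is reduced to a bare pointer with no GC header.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, simplified type tags (illustrative names, not Lua's). */
enum { T_NIL, T_BOOLEAN, T_NUMBER, T_STRING };

/* One storage slot shared by every kind of value. */
typedef union {
  void *gc;    /* collectable objects: only a pointer is stored */
  double n;    /* numbers are stored inline */
  int b;       /* booleans are stored inline */
} MiniValue;

/* The union plus a tag saying which member is currently valid. */
typedef struct {
  MiniValue value;
  int tt;
} MiniTValue;

static void set_nil(MiniTValue *v) { v->tt = T_NIL; }
static void set_boolean(MiniTValue *v, int b) { v->value.b = b; v->tt = T_BOOLEAN; }
static void set_number(MiniTValue *v, double n) { v->value.n = n; v->tt = T_NUMBER; }
static void set_string(MiniTValue *v, const char *s) {
  assert(s != NULL);
  v->value.gc = (void *)s;   /* stand-in for a pointer to a GC object */
  v->tt = T_STRING;
}

/* Dispatch on the tag, never on the union contents. */
static const char *type_name(const MiniTValue *v) {
  switch (v->tt) {
    case T_NIL:     return "nil";
    case T_BOOLEAN: return "boolean";
    case T_NUMBER:  return "number";
    case T_STRING:  return "string";
    default:        return "unknown";
  }
}
```

The same MiniTValue variable can be set to a number and then to a string, which is exactly what an assignment such as x = 1 followed by x = "a" needs: only the tag and the union contents change, while the slot itself keeps a fixed size.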

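How strings are represented in Lua

Lua strings are interned. A string is not stored separately for each variable; instead, every string lives in one global hash table, the strt field of global_State, and variables only hold a reference to it.

An example: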

test = "123"
test = test .. "456"

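The first line creates the string "123" in the global string table, and test refers to it. After the concatenation with "456", a new string "123456" is created in the same table and test is rebound to it. If "123" is no longer used afterwards, it is reclaimed during GC.

The data structure for strings is: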

typedef union TString {
  L_Umaxalign dummy;  /* ensures maximum alignment for strings */
  struct {
    CommonHeader;
    lu_byte reserved;
    unsigned int hash;
    size_t len;
  } tsv;
} TString;

/* L_Umaxalign forces maximum alignment: the largest of double, void * and long, typically 8 bytes */
typedef LUAI_USER_ALIGNMENT_T L_Umaxalign;
#define LUAI_USER_ALIGNMENT_T	union { double u; void *s; long l; }

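What the fields mean:

CommonHeader: the header shared by all GC objects
reserved: nonzero for Lua's reserved words (keywords); such strings are kept alive and are never collected by the GC
hash: the string's hash value
len: the length of the string, a size_t (an unsigned integer of 4 or 8 bytes depending on the platform)

Now for the global string table mentioned earlier. It is the strt field of global_State and has this structure: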

typedef struct stringtable {
  GCObject **hash;
  lu_int32 nuse;  /* number of elements */
  int size;       /* size of the hash array */
} stringtable;

A few important functions:

void luaS_resize (lua_State *L, int newsize) {
  GCObject **newhash;
  stringtable *tb;
  int i;
  if (G(L)->gcstate == GCSsweepstring)
    return;  /* cannot resize during GC traverse */
  newhash = luaM_newvector(L, newsize, GCObject *);
  tb = &G(L)->strt;
  for (i=0; i<newsize; i++) newhash[i] = NULL;
  /* rehash */
  for (i=0; i<tb->size; i++) {
    GCObject *p = tb->hash[i];
    while (p) {  /* for each node in the list */
      GCObject *next = p->gch.next;  /* save next */
      unsigned int h = gco2ts(p)->hash;
      int h1 = lmod(h, newsize);  /* new position */
      lua_assert(cast_int(h%newsize) == lmod(h, newsize));
      p->gch.next = newhash[h1];  /* chain it */
      newhash[h1] = p;
      p = next;
    }
  }
  luaM_freearray(L, tb->hash, tb->size, TString *);
  tb->size = newsize;
  tb->hash = newhash;
}

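Why rehash? When the table holds many strings, each bucket's chain grows long and the linear walk along a chain becomes slow. luaS_resize rebuilds the table with a new number of buckets and redistributes every string according to its stored hash.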

TString *luaS_newlstr (lua_State *L, const char *str, size_t l) {
  GCObject *o;
  unsigned int h = cast(unsigned int, l);  /* seed */
  size_t step = (l>>5)+1;  /* if string is too long, don't hash all its chars */
  size_t l1;
  for (l1=l; l1>=step; l1-=step)  /* compute hash */
    h = h ^ ((h<<5)+(h>>2)+cast(unsigned char, str[l1-1])); /* hash mixing step, a technique worth borrowing */
  for (o = G(L)->strt.hash[lmod(h, G(L)->strt.size)];
       o != NULL;
       o = o->gch.next) {
    TString *ts = rawgco2ts(o);
    if (ts->tsv.len == l && (memcmp(str, getstr(ts), l) == 0)) {
      /* string may be dead */
      if (isdead(G(L), o)) changewhite(o);
      return ts;
    }
  }
  return newlstr(L, str, l, h);  /* not found */
}

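The logic is straightforward. First the hash is computed with XOR and shifts, using a step so that long strings are only sampled, which keeps it fast. The hash selects a bucket, and that bucket's chain is searched sequentially for an identical string. If one exists it is returned directly; otherwise a new one is created in newlstr: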
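To see the effect of the step on long strings, the hash loop can be lifted out of luaS_newlstr into a standalone function. This is a sketch for experimentation only; the name str_hash is made up here, and cast(...) is replaced by ordinary C casts.

```c
#include <assert.h>
#include <stddef.h>

/* Same arithmetic as the hash loop in luaS_newlstr, as a free function.
   The function name is hypothetical. */
static unsigned int str_hash(const char *str, size_t l) {
  unsigned int h = (unsigned int)l;   /* seed with the length */
  size_t step = (l >> 5) + 1;         /* 1 for short strings, larger for long ones */
  size_t l1;
  assert(str != NULL);
  /* walk backwards from the last byte, skipping `step` bytes each time */
  for (l1 = l; l1 >= step; l1 -= step)
    h = h ^ ((h << 5) + (h >> 2) + (unsigned char)str[l1 - 1]);
  return h;
}
```

For a 32-byte string the step is 2, so only the bytes at odd indices (31, 29, ..., 1) are read. Two 32-byte strings that differ only in byte 0 therefore get the same hash and land in the same bucket; the memcmp in the lookup loop is what finally tells them apart.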

static TString *newlstr (lua_State *L, const char *str, size_t l,
                                       unsigned int h) {
  TString *ts;
  stringtable *tb;
  if (l+1 > (MAX_SIZET - sizeof(TString))/sizeof(char))
    luaM_toobig(L);
  ts = cast(TString *, luaM_malloc(L, (l+1)*sizeof(char)+sizeof(TString)));
  ts->tsv.len = l;
  ts->tsv.hash = h;
  ts->tsv.marked = luaC_white(G(L));
  ts->tsv.tt = LUA_TSTRING;
  ts->tsv.reserved = 0;
  memcpy(ts+1, str, l*sizeof(char));
  ((char *)(ts+1))[l] = '\0';  /* ending 0 */
  tb = &G(L)->strt;
  h = lmod(h, tb->size);
  ts->tsv.next = tb->hash[h];  /* chain new entry */
  tb->hash[h] = obj2gco(ts);
  tb->nuse++;
  if (tb->nuse > cast(lu_int32, tb->size) && tb->size <= MAX_INT/2)
    luaS_resize(L, tb->size*2);  /* too crowded */
  return ts;
}

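This simply allocates a new TString and links it into the hash table.

You may wonder where the next in ts->tsv.next is defined. Look back at the earlier definitions: tsv begins with CommonHeader, and that macro declares the next field.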
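The lookup, insert and resize steps fit together as in the following stripped-down interning table. It is a sketch under simplifying assumptions: IStr, ITable, ihash, itable_init, itable_resize and intern are hypothetical names, there is no GC (nothing is ever freed), and allocation failure is handled with assert only.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct IStr {
  struct IStr *next;   /* chain within one bucket */
  unsigned int hash;   /* cached so a resize never re-reads the bytes */
  size_t len;
  /* the characters follow this header, as with TString */
} IStr;

typedef struct {
  IStr **bucket;
  int nuse;            /* number of interned strings */
  int size;            /* number of buckets, always a power of two */
} ITable;

static char *istr_data(IStr *s) { return (char *)(s + 1); }

static unsigned int ihash(const char *str, size_t l) {
  unsigned int h = (unsigned int)l;
  size_t step = (l >> 5) + 1, l1;
  for (l1 = l; l1 >= step; l1 -= step)
    h = h ^ ((h << 5) + (h >> 2) + (unsigned char)str[l1 - 1]);
  return h;
}

static void itable_init(ITable *tb, int size) {
  assert(size > 0 && (size & (size - 1)) == 0);
  tb->bucket = calloc((size_t)size, sizeof(IStr *));
  assert(tb->bucket != NULL);
  tb->nuse = 0;
  tb->size = size;
}

/* Move every string into a fresh bucket array, as luaS_resize does. */
static void itable_resize(ITable *tb, int newsize) {
  IStr **nb = calloc((size_t)newsize, sizeof(IStr *));
  int i;
  assert(nb != NULL);
  for (i = 0; i < tb->size; i++) {
    IStr *p = tb->bucket[i];
    while (p) {
      IStr *next = p->next;                                /* save next */
      unsigned int h1 = p->hash & (unsigned int)(newsize - 1);
      p->next = nb[h1];                                    /* chain it */
      nb[h1] = p;
      p = next;
    }
  }
  free(tb->bucket);
  tb->bucket = nb;
  tb->size = newsize;
}

/* Return the unique IStr for these bytes, creating it if needed. */
static IStr *intern(ITable *tb, const char *str, size_t l) {
  unsigned int h = ihash(str, l);
  unsigned int idx = h & (unsigned int)(tb->size - 1);
  IStr *s;
  for (s = tb->bucket[idx]; s != NULL; s = s->next)
    if (s->len == l && memcmp(str, istr_data(s), l) == 0)
      return s;                                            /* already interned */
  s = malloc(sizeof(IStr) + l + 1);
  assert(s != NULL);
  s->hash = h;
  s->len = l;
  memcpy(istr_data(s), str, l);
  istr_data(s)[l] = '\0';
  s->next = tb->bucket[idx];
  tb->bucket[idx] = s;
  if (++tb->nuse > tb->size)
    itable_resize(tb, tb->size * 2);                       /* too crowded */
  return s;
}
```

Interning the same bytes twice returns the same pointer, so string equality can be decided by comparing pointers. This is also why every concatenation result has to pass through the table: it must be hashed and looked up before it can be used as a value.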

After all this you might be thinking: what is any of it good for? Here is an example:

-- case 1
local str = ""
for i=1,1000000 do
    str = str .. tostring(i)
end


-- case 2
local str = ""
local tab = {}
for i=1,1000000 do
    tab[#tab + 1] = tostring(i)
end
str = table.concat(tab)

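The loop in case 1 creates 1,000,000 intermediate strings, each longer than the last, and every one of them has to be copied, hashed and entered into the global string table before the GC eventually reclaims it. That is a frightening amount of work. Case 2 performs roughly 10 times better. With that, on to tables.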
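A rough cost model shows where the difference comes from. The sketch below only counts bytes copied, under the simplifying assumption that every piece has the same length; the function names are made up, and the model ignores hashing, GC work and the internal buffering that table.concat itself performs.

```c
#include <assert.h>

/* Bytes copied when a string is built by appending n pieces of `piece`
   bytes one at a time: each append produces a whole new string. */
static unsigned long long copied_by_repeated_concat(unsigned long long n,
                                                    unsigned long long piece) {
  unsigned long long total = 0, len = 0, i;
  for (i = 0; i < n; i++) {
    len += piece;    /* length of the new intermediate string */
    total += len;    /* all of it is copied */
  }
  return total;
}

/* Bytes copied when the n pieces are collected first and joined once. */
static unsigned long long copied_by_single_join(unsigned long long n,
                                                unsigned long long piece) {
  return n * piece;
}
```

In this model repeated concatenation grows quadratically with the number of pieces while a single join grows linearly. The measured gap in a real Lua program is much smaller than the model suggests, since allocation and per-iteration interpreter overhead affect both versions.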

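How tables are represented in Lua

Can a Lua table really be used any way you like at no cost? I think the following will noticeably improve the Lua code you write.

First, the table data structure: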


typedef struct Table {
  CommonHeader;
  lu_byte flags;  /* 1<<p means tagmethod(p) is not present */ 
  lu_byte lsizenode;  /* log2 of size of `node' array */
  struct Table *metatable;
  TValue *array;  /* array part */
  Node *node;
  Node *lastfree;  /* any free position is before this position */
  GCObject *gclist;
  int sizearray;  /* size of `array' array */
} Table;

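CommonHeader: #define CommonHeader GCObject *next; lu_byte tt; lu_byte marked
flags: a bitmap recording which metamethods are known to be absent (bit p set means tag method p is not present)
lsizenode: log2 of the size of the hash part, so the hash part always has a power-of-two size and grows or shrinks by factors of 2
metatable: looks familiar, right? This is the metatable
array: the array part
node: pointer to the Node array that forms the hash part
lastfree: where the search for a free node starts; every free position lies before it
gclist: list link used by the GC
sizearray: size of the array part

So a table contains both an array and a hash table, which may come as a surprise. The array part is just a plain array of TValue and needs no further explanation. The hash part uses these structures: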

typedef union TKey {
  struct {
    TValuefields;
    struct Node *next;  /* for chaining */
  } nk;
  TValue tvk;
} TKey;

typedef struct Node {
  TValue i_val;
  TKey i_key;
} Node;

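That makes it simple. The hash part is a single array of Node. Colliding entries are chained through the next pointer stored in each key, and those links point at other slots of the same array.

Now, how does a new key get added to a table?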

/*
** inserts a new key into a hash table; first, check whether key's main 
** position is free. If not, check whether colliding node is in its main 
** position or not: if it is not, move colliding node to an empty place and 
** put new key in its main position; otherwise (colliding node is in its main 
** position), new key goes to an empty position. 
*/
static TValue *newkey (lua_State *L, Table *t, const TValue *key) {
  Node *mp = mainposition(t, key);
  if (!ttisnil(gval(mp)) || mp == dummynode) {
    Node *othern;
    Node *n = getfreepos(t);  /* get a free place */
    if (n == NULL) {  /* cannot find a free place? */
      rehash(L, t, key);  /* grow table */
      return luaH_set(L, t, key);  /* re-insert key into grown table */
    }
    lua_assert(n != dummynode);
    othern = mainposition(t, key2tval(mp));
    if (othern != mp) {  /* is colliding node out of its main position? */
      /* yes; move colliding node into free position */
      while (gnext(othern) != mp) othern = gnext(othern);  /* find previous */
      gnext(othern) = n;  /* redo the chain with `n' in place of `mp' */
      *n = *mp;  /* copy colliding node into free pos. (mp->next also goes) */
      gnext(mp) = NULL;  /* now `mp' is free */
      setnilvalue(gval(mp));
    }
    else {  /* colliding node is in its own main position */
      /* new node will go into free position */
      gnext(n) = gnext(mp);  /* chain new position */
      gnext(mp) = n;
      mp = n;
    }
  }
  gkey(mp)->value = key->value; gkey(mp)->tt = key->tt;
  luaC_barriert(L, t, key);
  lua_assert(ttisnil(gval(mp)));
  return gval(mp);
}

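That is a lot of code, but don't panic. mainposition finds the slot a key hashes to (its main position), whatever the key's type. If that node is empty, the key is stored there and the function returns. If it is occupied, getfreepos supplies a free node from the same array. When the occupant is not in its own main position, the occupant is moved to the free node and the new key takes the main position; otherwise the new key goes into the free node and is linked into the occupant's chain. If no free node is left, the table is rehashed and the insertion is retried. In essence this is a hash table insert with chaining.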
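The collision handling is easier to follow with integer keys and a fixed-size array. The following is a simplified sketch of the same scheme, not the Lua code: MiniHash, Slot, NSLOTS, EMPTY and the mh_* functions are hypothetical, keys must be non-negative, there is no deletion, and a full table returns -1 where Lua would rehash.

```c
#include <assert.h>
#include <stddef.h>

#define NSLOTS 8      /* fixed size for the sketch */
#define EMPTY  (-1)   /* marks an unused slot */

typedef struct Slot {
  int key;            /* EMPTY when unused */
  int val;
  struct Slot *next;  /* next slot in the same collision chain */
} Slot;

typedef struct {
  Slot slot[NSLOTS];
  Slot *lastfree;     /* free slots are searched downward from here */
} MiniHash;

static void mh_init(MiniHash *t) {
  int i;
  for (i = 0; i < NSLOTS; i++) {
    t->slot[i].key = EMPTY;
    t->slot[i].val = 0;
    t->slot[i].next = NULL;
  }
  t->lastfree = t->slot + NSLOTS;
}

static Slot *mainpos(MiniHash *t, int key) {
  assert(key >= 0);
  return &t->slot[key % NSLOTS];
}

static Slot *getfree(MiniHash *t) {
  while (t->lastfree > t->slot) {
    t->lastfree--;
    if (t->lastfree->key == EMPTY) return t->lastfree;
  }
  return NULL;        /* no free slot left */
}

static Slot *mh_find(MiniHash *t, int key) {
  Slot *n = mainpos(t, key);
  while (n != NULL) {
    if (n->key == key) return n;
    n = n->next;
  }
  return NULL;
}

/* Returns 0 on success, -1 when the table is full. */
static int mh_insert(MiniHash *t, int key, int val) {
  Slot *mp = mh_find(t, key);
  if (mp != NULL) { mp->val = val; return 0; }   /* existing key: overwrite */
  mp = mainpos(t, key);
  if (mp->key != EMPTY) {                        /* main position is taken */
    Slot *n = getfree(t);
    Slot *other;
    if (n == NULL) return -1;
    other = mainpos(t, mp->key);
    if (other != mp) {        /* occupant is out of its main position: move it */
      while (other->next != mp) other = other->next;   /* find its predecessor */
      other->next = n;        /* relink the chain through the free slot */
      *n = *mp;               /* occupant moves, its next pointer goes with it */
      mp->next = NULL;        /* main position is now free for the new key */
    } else {                  /* occupant is in its own main position */
      n->next = mp->next;     /* new key goes into the free slot */
      mp->next = n;
      mp = n;
    }
  }
  mp->key = key;
  mp->val = val;
  return 0;
}
```

Inserting 1, then 9, then 7 exercises both branches: 9 collides with 1 in slot 1 and is placed in the free slot 7; 7 then finds its main position held by 9, which is not at home there, so 9 is moved to slot 6 and 7 takes slot 7. The payoff of evicting such occupants is that a lookup starting at a key's main position only ever walks the chain of keys that share that main position.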

The process includes a rehash step:

static void rehash (lua_State *L, Table *t, const TValue *ek) {
  int nasize, na;
  int nums[MAXBITS+1];  /* nums[i] = number of keys between 2^(i-1) and 2^i */
  int i;
  int totaluse;
  for (i=0; i<=MAXBITS; i++) nums[i] = 0;  /* reset counts */
  nasize = numusearray(t, nums);  /* count keys in array part */
  totaluse = nasize;  /* all those keys are integer keys */
  totaluse += numusehash(t, nums, &nasize);  /* count keys in hash part */
  /* count extra key */
  nasize += countint(ek, nums);
  totaluse++;
  /* compute new size for array part */
  na = computesizes(nums, &nasize);
  /* resize the table to new computed sizes */
  resize(L, t, nasize, totaluse - na);
}

static int computesizes (int nums[], int *narray) {
  int i;
  int twotoi;  /* 2^i */
  int a = 0;  /* number of elements smaller than 2^i */
  int na = 0;  /* number of elements to go to array part */
  int n = 0;  /* optimal size for array part */

  for (i = 0, twotoi = 1; twotoi/2 < *narray; i++, twotoi *= 2) {
    if (nums[i] > 0) {
      a += nums[i];
      if (a > twotoi/2) {  /* more than half elements present? */
        n = twotoi;  /* optimal size (till now) */
        na = a;  /* all elements smaller than n will go to array part */
      }
    }
    if (a == *narray) break;  /* all elements already counted */
  }
  *narray = n;
  lua_assert(*narray/2 <= na && na <= *narray);
  return na;
}

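The code is fairly simple, so I won't go through it line by line. The idea is to look at all integer keys and find the largest power of two n such that more than half of the slots 1..n would be in use. That n becomes the new size of the array part, and integer keys larger than n go into the hash part. So not every integer key is stored in the array part.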
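The sizing rule can be tried on concrete key sets with a small standalone version. This is a sketch that combines the counting and the computesizes loop into one function; array_size_for and ceil_log2 are hypothetical names, and the keys are assumed to be distinct positive integers no larger than 2^MAXBITS.

```c
#include <assert.h>

#define MAXBITS 26

/* Smallest i such that 2^i >= k, for k >= 1. */
static int ceil_log2(unsigned int k) {
  int i = 0;
  unsigned int p = 1;
  while (p < k) { p *= 2; i++; }
  return i;
}

/* Returns the array-part size the rule picks for the given integer keys;
   *in_array receives how many of the keys would live in the array part. */
static int array_size_for(const int *keys, int nkeys, int *in_array) {
  int nums[MAXBITS + 1] = {0};  /* nums[i] = keys k with 2^(i-1) < k <= 2^i */
  int i, twotoi;
  int a = 0;    /* number of keys not larger than 2^i */
  int na = 0;   /* number of keys that go to the array part */
  int n = 0;    /* best array size found so far */
  for (i = 0; i < nkeys; i++) {
    assert(keys[i] >= 1 && keys[i] <= (1 << MAXBITS));
    nums[ceil_log2((unsigned int)keys[i])]++;
  }
  for (i = 0, twotoi = 1; twotoi / 2 < nkeys; i++, twotoi *= 2) {
    if (nums[i] > 0) {
      a += nums[i];
      if (a > twotoi / 2) {     /* more than half of 1..2^i in use? */
        n = twotoi;
        na = a;
      }
    }
    if (a == nkeys) break;      /* all keys already counted */
  }
  *in_array = na;
  return n;
}
```

For the keys {1, 2, 3, 100} the result is an array part of size 4 holding three keys, with 100 left to the hash part. For {1, 2, 3, 4, 5} the array part has size 8, and for the single key {1000} it has size 0, so the key is stored in the hash part even though it is an integer.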

As you can see, rehash is expensive. We often used to write code like this:

for i = 1,2000000 do
    local a = {}
    a[1] = 100;
    a[2] = 100;
    a[3] = 100;
    -- ... and so on
end

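This triggers three rehashes in every iteration: the array part grows from 0 to 1, then to 2, then to 4 slots.

Pre-filling the table in its constructor, as in local a = {100, 100, 100}, lets Lua size the table once and is more efficient.

Some suggestions for using Lua tables:

Try not to mix use of the array part and the hash part
Try to avoid nil values inside tables
Avoid rehash operations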
