LevelDB源码解析(10) TableBuilder（Memtable序列化）

背景

在LevelDB源码解析(4) MemTable中我们介绍了Memtable，LevelDB会先把key-value插入到MemTable中。如果MemTable写满了，就会新建一个MemTable进行写入。旧的Memtable(也叫Immutable Memtable）会被转成一个SST(Sorted String Table)文件存储，转换过程是由TableBuilder完成的，Table指的就是Sroted String Table。

TableBuilder类定义

TableBuilder用于把Memtable转成SST文件存储，下面是TableBuilder的类定义:

//源码文件：table/table_builder.h
class TableBuilder {
    public:
    TableBuilder(const Options& options, WritableFile* file);
    ~TableBuilder();

    void Add(const Slice& key, const Slice& value);
    void Flush();
    Status Finish();
    uint64_t FileSize() const;
    Status status() const;
    void Abandon();
    uint64_t NumEntries() const;
    Status ChangeOptions(const Options& options);

    private:
    void WriteBlock(BlockBuilder* block, BlockHandle* handle);
    void WriteRawBlock(const Slice& data, CompressionType, BlockHandle* handle);
    bool ok() const { return status().ok(); }

    struct Rep;
    Rep* rep_;
};

只有一个Rep类型的成员变量rep_，所有的数据、状态都存放在Rep中。Rep是一个struct类型，作为数据容器存在。定义如下：

//源码文件：table/table_builder.h
struct TableBuilder::Rep
{
  Rep(const Options& opt, WritableFile* f);

  Options options;
  Options index_block_options;
  WritableFile* file;
  uint64_t offset;
  Status status;
  BlockBuilder data_block;
  BlockBuilder index_block;
  std::string last_key;
  int64_t num_entries;
  bool closed;
  FilterBlockBuilder* filter_block;
  bool pending_index_entry;
  BlockHandle pending_handle;
  std::string compressed_output;
};

下面将分别介绍TableBuilder的关键成员函数，在介绍的过程中逐渐引入Rep的成员变量。

TableBuilder构造函数

//源码文件：table/table_builder.cc
TableBuilder::TableBuilder(const Options& options, WritableFile* file) : rep_(
    new Rep(options, file))
{
  if( rep_->filter_block != nullptr )
  {
    rep_->filter_block->StartBlock(0);
  }
}

这里先构造rep_，就是给rep_的options和file赋值，然后把rep_的其他域初始化为默认值。然后初始化filter_block_(其实StartBlock啥也没做。。。）。filter_block_类型是FilterBlockBuilder，用于构造filter block。

TableBuilder::WriteRawBlock

WriteRawBlock把一段数据(raw block)写入文件中，并返回block在文件中的offset和size。

//源码文件：table/table_builder.cc
void TableBuilder::WriteRawBlock(const Slice& block_contents, CompressionType type,
                                 BlockHandle* handle)
{
  Rep* r = rep_;

  //Step1
  handle->set_offset(r->offset);
  handle->set_size(block_contents.size());

  //Step2
  r->status = r->file->Append(block_contents);

  if( r->status.ok())
  {
    //Step3
    char trailer[kBlockTrailerSize];
    trailer[0] = type;
    uint32_t crc = crc32c::Value(block_contents.data(), block_contents.size());
    crc = crc32c::Extend(crc, trailer, 1);
    EncodeFixed32(trailer + 1, crc32c::Mask(crc));

    r->status = r->file->Append(Slice(trailer, kBlockTrailerSize));
    if( r->status.ok())
    {
      //Step4
      r->offset += block_contents.size() + kBlockTrailerSize;
    }
  }
}

handle的类型BlockHandle只有两个成员变量，用于保存一个block在文件中的offset和size。

uint64_t offset_;
uint64_t size_;

Step1：把raw block的offset和size存入handle。

Step2：把raw block数据写入文件，WritableFile是一个文件操作的抽象类，用于兼容不用操作系统的文件IO接口。

Step3：生成trailer并写入文件，trailer总共5个字节，trailer[0]存储CompressionType，用于表示数据是否压缩及压缩类型。trailer[1-4]存储crc校验和，crc校验和是基于raw block的数据域和trailer域计算出来的。

Step4：把rep_的offset加上数据域size和trailer域size之和，表示文件当前已写入的总size。

WriteRawBlock执行后，block的数据和trailer被写入了文件，handle存储了该block在文件中offset和size。

TableBuilder::WriteBlock

WriteBlock把一个block写入文件中，这里有别于WriteRawBlock，WriteBlock需要先对block做一下处理，然后调用WriteRawBlock完成写入。

//源码文件：table/table_builder.cc
void TableBuilder::WriteBlock(BlockBuilder* block, BlockHandle* handle)
{
  assert(ok());
  Rep* r = rep_;

  //Step1
  Slice raw = block->Finish();

  //Step2
  Slice block_contents;
  CompressionType type = r->options.compression;
  switch( type )
  {
    case kNoCompression:
      block_contents = raw;
      break;

    case kSnappyCompression:
    {
      std::string* compressed = &r->compressed_output;
      if( port::Snappy_Compress(raw.data(), raw.size(), compressed) &&
          compressed->size() < raw.size() - (raw.size() / 8u))
      {
        block_contents = *compressed;
      }else
      {
        block_contents = raw;
        type = kNoCompression;
      }
      break;
    }
  }

  //Step3
  WriteRawBlock(block_contents, type, handle);
  r->compressed_output.clear();
  block->Reset();
}

Step1：调用BlockBuilder的Finish完成block数据的最终构建，Finish会在尾部写入key的重启点列表，重启点是帮助多个key共用前缀的，以减少空间占用。详细逻辑请见LevelDB源码解析(8）BlockBuilder。

Step2：根据是否压缩来决定是否对block的数据进行压缩，如果需要压缩就调用压缩算法压缩后放到block_contents中，反之直接把原始数据赋给block_contents，这里使用Slice作为载体，所以不会拷贝完整数据，开销不大。

Step3：调用WriteRawBlock把block数据写入文件，然后清空compressed_output和block。

TableBuilder::Add

//源码文件：table/table_builder.cc
void TableBuilder::Add(const Slice &key, const Slice &value)
{
  Rep *r = rep_;
  
  //Step1
  assert(!r->closed);
  if ( !ok())
  {
    return;
  }
  if ( r->num_entries > 0 )
  {
    assert(r->options.comparator->Compare(key, Slice(r->last_key)) > 0);
  }

  //Step2
  if ( r->pending_index_entry )
  {
    assert(r->data_block.empty());
    r->options.comparator->FindShortestSeparator(&r->last_key, key);

    std::string handle_encoding;
    r->pending_handle.EncodeTo(&handle_encoding);
    r->index_block.Add(r->last_key, Slice(handle_encoding));

    r->pending_index_entry = false;
  }

  //Step3
  if ( r->filter_block != nullptr )
  {
    r->filter_block->AddKey(key);
  }

  //Step4
  r->last_key.assign(key.data(), key.size());
  r->num_entries++;
  r->data_block.Add(key, value);

  //Step5
  const size_t estimated_block_size = r->data_block.CurrentSizeEstimate();
  if ( estimated_block_size >= r->options.block_size )
  {
    Flush();
  }
}

这部分会涉及到data block、index entry、index block、filter block等概念，为了方便理解，这里简单说一下他们之间的关系，data block是key-value数据的block，一个data block包含成千上万(随便说的)个key-value数据。每个data block对应一个index_entry，所有index entry组成了index block。每个data block也对应一个filter block(比如BloomFilter)，不过一个filter block可能对应多个data block。

Step1：检查是否存在错误，以及检查本次Add的key是否大于last_key，TableBuilder只接受升序添加key，因为这个是为Memtable服务的，所以key必然是升序的。

Step2：每个data block完成后，会产生一个待写入的index entry，pending_index_entry表示是否有index_entry等待写入。如果有待写入的index etnry，就向index block中写入这个index_entry。写入逻辑如下：

Step2.1：先调用FindShortestSeparator，这个函数会比较last_key和key，找到短的str，使得last_key < str < key，并把last_key置为str（实际逻辑更复杂一些，但是核心思想就是这个）。主要目的是为了减少index entry占用的空间。
Step2.2：pending_handle是BlockHandle类型，存储了这个index entry对应的data block（即上一个data block）在文件中的offset和size。EncodeTo把offset和size编码到8字节的存储中。
Step2.3：调用index_block.Add添加index_entry，index_block和data_block复用了相同的类BlockBuilder，BlockBuilder的Add其实是添加key-value用的，index_entry也被当做key-value来处理的，key就是last_key，value是pending_handler编码产生的8字节。
Step2.4: 把pending_index_entry置为false，因为没有待写入的index entry了。

Step3：把key加入到filter block中，filter block用于查找时快速判断key是否在当前data block中，如果filter block判断不在，就一定不在，只有filter block判断可能在时，才会进入data_block进行查找。详细内容请见LevelDB源码解析(9）FilterBlockBuilder

Step4：更新last_key，对于下次Add来说，last_key就是本次Add的key。num_entries表示当前data block中有多少个key-value，这里加1。然后执行data_block.Add把本次Add的key-value加入到data block中。data block会把数据序列化后放到自己的buffer中。详细内容请见LevelDB源码解析(8）BlockBuilder

Step5：如果data_block的size达到阈值（默认为4K），就调用Flush把data block的buffer数据写入文件，这个data block也就完成了，同时pending_index_entry被置为true，表示有一个index entry（即这个data block的index entry）等待写入。

TableBuilder::Flush

Flush完成一个data block的落盘，以及对应filter block和index entry的创建。

//源码文件：table/table_builder.cc
void TableBuilder::Flush()
{
  Rep* r = rep_;
  assert(!r->closed);
  if( !ok())
  {
    return;
  }
  if( r->data_block.empty())
  {
    return;
  }
  assert(!r->pending_index_entry);
  
  //Step1
  WriteBlock(&r->data_block, &r->pending_handle);
  if( ok())
  {
    //Step2
    r->pending_index_entry = true;
    r->status = r->file->Flush();
  }
  if( r->filter_block != nullptr )
  {
    //Step3
    r->filter_block->StartBlock(r->offset);
  }
}

Step1：把data_block写入文件，并在pending_handle中保存offset和size，这个就是前面提到的index entry的value了。

Step2：把pending_index_entry置为true，并调用WritableFile::Flush把WritableFile内部的buffer写入磁盘(注意，不是Sync）

Step3：创建filter block，FilterBlockBuilder每2KB创建一个filter block，所以会出现多个data block共用一个filter block的情况。但是一个data block最多对应一个BF。这种策略的好处是，当data block很小时，避免创建太多的filter。

TableBuilder::Finish

Finish是TableBuilder的终结者，调用Finish时，就会完成SST文件的全部构建。

//源码文件：table/table_builder.cc
Status TableBuilder::Finish()
{
  Rep* r = rep_;

  //Step1
  Flush();
  assert(!r->closed);

  r->closed = true;

  BlockHandle filter_block_handle, metaindex_block_handle, index_block_handle;

  //Step2
  if( ok() && r->filter_block != nullptr )
  {
    WriteRawBlock(r->filter_block->Finish(), kNoCompression, &filter_block_handle);
  }

  //Step3
  if( ok())
  {
    BlockBuilder meta_index_block(&r->options);
    if( r->filter_block != nullptr )
    {
      std::string key = "filter.";
      key.append(r->options.filter_policy->Name());
      std::string handle_encoding;
      filter_block_handle.EncodeTo(&handle_encoding);
      meta_index_block.Add(key, handle_encoding);
    }
    WriteBlock(&meta_index_block, &metaindex_block_handle);
  }

  //Step4
  if( ok())
  {
    if( r->pending_index_entry )
    {
      r->options.comparator->FindShortSuccessor(&r->last_key);
      std::string handle_encoding;
      r->pending_handle.EncodeTo(&handle_encoding);
      r->index_block.Add(r->last_key, Slice(handle_encoding));
      r->pending_index_entry = false;
    }
    WriteBlock(&r->index_block, &index_block_handle);
  }

  //Step5
  if( ok())
  {
    Footer footer;
    footer.set_metaindex_handle(metaindex_block_handle);
    footer.set_index_handle(index_block_handle);
    std::string footer_encoding;
    footer.EncodeTo(&footer_encoding);
    r->status = r->file->Append(footer_encoding);
    if( r->status.ok())
    {
      r->offset += footer_encoding.size();
    }
  }
  return r->status;
}

Step1：调用Flush，把当前的data block落盘，并添加对应的index entry和filter

Step2：前面说过filter block复用了BlockBuilder，这里完成filter block的构建，并写入文件，逻辑和前面介绍的WriteBlock类似。

Step3：创建meta_index_block，同样复用BlockBuilder，里面只有一个key-value，key是filter.$Name，Name就是filter policy的名字，比如BloomFilter。value是filter block的offset和size的编码结果。随后把meta_index_block写入文件。metaindex_block_handle中存储了meta_index_block的offset和size。meta_index_block这个名字起得特别的不友好，这里应该叫filter_index_block更合适一些，即filter block的index。相应的，data block的index就是index_block。

Step4：如果有pending_index_entry就加入index block，逻辑和Add函数里添加index_entry类似，主要区别是找最短的最大key调用的是FindShortSuccessor，把last_key置为最短的比last_key大的string，也是为了节省空间。因为这是整个Memtable最后一个key了，所以新产生的key只需要大于last_key，不需要小于下一个key。最后调用WriteBlock把index block写入文件。index_block_handle中存储了index block的offset和size。

Step5：生成footer，并写入文件。Footer类有且只有两个BlockHandle成员变量，分别是metaindex_handle_和index_handle_，分别对应metaindex_block_handle和index_block_handle，分别赋值后，调用EncodeTo把footer编码到字符串中。编码格式如下表所示

8字节	8字节	8字节	8字节	8字节
metaindex block offset	metaindex block size	index block offset	index block size	魔数

最后把footer写入文件，更新rep_的offset_，整个SST文件到这里就写完了。

总结

最终TableBuilder产生的SST文件由5个区块组成，如下图所示。其中file index block就是代码里的metaindex block。这里为了直观一点，用filter index block的名字。

leveldb_SST_简版格式

详细的SST文件结构请见LevelDB源码解析(11) SST文件结构

源码版本

https://github.com/google/leveldb/releases/tag/1.22