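DuckDB's Buffer Management is designed and implemented with reference to LeanStore [1]. By implementing Pointer Swizzling, it aims to retain as much of the performance of an in-memory database as possible while handling out-of-core workloads more robustly.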

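In DuckDB's Buffer Management, the Block is the unit of caching, analogous to a Page in a traditional buffer manager. DuckDB has two kinds of block: on-disk blocks (unswizzled blocks) and in-memory blocks (swizzled blocks). A block whose block id is less than MAXIMUM_BLOCK (equal to 2^62) is an on-disk block; conversely, a block id greater than or equal to MAXIMUM_BLOCK denotes an in-memory block. See BlockManager::UnregisterBlock: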

void BlockManager::UnregisterBlock(block_id_t block_id, bool can_destroy) {
    if (block_id >= MAXIMUM_BLOCK) {
        // in-memory buffer: destroy the buffer
        if (!can_destroy) {
            // buffer could have been offloaded to disk 
            // remove the file
            buffer_manager.DeleteTemporaryFile(block_id);
        }
    } else {
        lock_guard<mutex> lock(blocks_lock);
        // on-disk block: erase from list of blocks in manager
        blocks.erase(block_id);
    }
}
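The id-based split that UnregisterBlock relies on can be sketched in isolation. This is a minimal illustration, not DuckDB code: the helper IsInMemoryBlock is hypothetical, block_id_t is assumed to be a signed 64-bit integer, and MAXIMUM_BLOCK is set to exactly 2^62 as stated in the text above.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: block ids are signed 64-bit integers.
using block_id_t = std::int64_t;

// Assumption: the threshold is exactly 2^62, as described in the text.
constexpr block_id_t MAXIMUM_BLOCK = block_id_t(1) << 62;

// Hypothetical helper (not part of DuckDB): ids at or above the
// threshold denote in-memory blocks, ids below it denote on-disk blocks.
constexpr bool IsInMemoryBlock(block_id_t id) {
    return id >= MAXIMUM_BLOCK;
}

// Compile-time checks of the boundary behaviour.
static_assert(!IsInMemoryBlock(0), "small ids are on-disk blocks");
static_assert(!IsInMemoryBlock(MAXIMUM_BLOCK - 1), "just below the threshold is on-disk");
static_assert(IsInMemoryBlock(MAXIMUM_BLOCK), "the threshold itself is in-memory");
```

Encoding the block kind in the id means a single comparison is enough to tell the two cases apart, with no extra flag on the handle.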

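The RegisterMemory(idx_t block_size, bool can_destroy) method creates and registers a new block in memory. It takes no block id. Instead it takes two parameters: block_size, the size of the block to allocate, and can_destroy, which says whether the block may simply be freed when it is temporarily not in use. In other words, if can_destroy is true, the block's memory is released once the block is no longer in use rather than being preserved; if it is false, the buffer may instead be offloaded to a temporary file, as the UnregisterBlock code above suggests.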

shared_ptr<BlockHandle> BufferManager::RegisterMemory(
    idx_t block_size, bool can_destroy) {
    D_ASSERT(block_size >= Storage::BLOCK_SIZE);
    auto alloc_size = GetAllocSize(block_size);
    // first evict blocks until we have enough memory to store this buffer
    unique_ptr<FileBuffer> reusable_buffer;
    auto res = EvictBlocksOrThrow(
        alloc_size, maximum_memory, &reusable_buffer,
        "could not allocate block of %lld bytes (%lld/%lld used) %s", 
        alloc_size, GetUsedMemory(), GetMaxMemory()
    );

    auto buffer = ConstructManagedBuffer(block_size, move(reusable_buffer));

    // create a new block pointer for this block
    return make_shared<BlockHandle>(
        *temp_block_manager, 
        ++temporary_id, 
        move(buffer), can_destroy, 
        alloc_size, move(res)
    );
}

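The logic of RegisterMemory is straightforward. It first checks the memory limit: if adding the new buffer would exceed the limit, EvictBlocksOrThrow evicts blocks that are not currently in use (and throws if it cannot free enough memory). It then allocates the buffer by building a FileBuffer through ConstructManagedBuffer. The FileBuffer is where the data actually lives: on initialization it allocates block_size bytes of memory that the executor can use directly, and once the buffer has been spilled to disk it provides the interface for reading from and writing to disk. Finally, the managed buffer is moved into a BlockHandle, which takes over its management. The BlockHandle is created with an id drawn from the atomic variable temporary_id, which is initialized to MAXIMUM_BLOCK (see the BufferManager constructor). Because each call uses ++temporary_id, every newly registered block id is strictly greater than MAXIMUM_BLOCK, which marks the block as in-memory.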
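The id allocation described above can be sketched on its own. This is a minimal illustration under stated assumptions: TemporaryIdAllocator and kTemporaryIdBase are hypothetical names (not DuckDB identifiers), block_id_t is assumed to be a signed 64-bit integer, and the base value is taken to be 2^62, the value the text gives for MAXIMUM_BLOCK.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Assumption: block ids are signed 64-bit integers.
using block_id_t = std::int64_t;

// Assumption: same value the text gives for MAXIMUM_BLOCK (2^62).
constexpr block_id_t kTemporaryIdBase = block_id_t(1) << 62;

// Hypothetical allocator mirroring the ++temporary_id pattern.
class TemporaryIdAllocator {
public:
    // Pre-increment on an atomic is a single read-modify-write, so
    // concurrent callers never receive the same id, and every id
    // handed out is strictly greater than the base.
    block_id_t Next() {
        return ++temporary_id_;
    }

private:
    std::atomic<block_id_t> temporary_id_{kTemporaryIdBase};
};
```

Because the counter starts at the threshold and is incremented before use, the first id handed out is the threshold plus one, so the base value itself is never assigned to a block.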

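RegisterBlock(block_id_t block_id, bool is_meta_block), by contrast, does not allocate a new block id. It takes the block id of a block that already exists on disk and asks the buffer manager to manage that block.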

shared_ptr<BlockHandle> BlockManager::RegisterBlock(
    block_id_t block_id, bool is_meta_block) {
    lock_guard<mutex> lock(blocks_lock);
    // check if the block already exists
    auto entry = blocks.find(block_id);
    if (entry != blocks.end()) {
        // already exists: check if it hasn't expired yet
        auto existing_ptr = entry->second.lock();
        if (existing_ptr) {
            //! it hasn't! return it
            return existing_ptr;
        }
    }
    // create a new block pointer for this block
    auto result = make_shared<BlockHandle>(*this, block_id);
    // for meta block, cache the handle in meta_blocks
    if (is_meta_block) {
        meta_blocks[block_id] = result;
    }
    // register the block pointer in the set of blocks as a weak pointer
    blocks[block_id] = weak_ptr<BlockHandle>(result);
    return result;
}
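The weak_ptr bookkeeping in RegisterBlock can be sketched as a standalone registry. This is a simplified illustration, not DuckDB's implementation: SketchHandle and SketchRegistry are hypothetical types, and the meta-block cache is left out.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <mutex>
#include <unordered_map>

using block_id_t = std::int64_t;

// Hypothetical, stripped-down stand-in for a block handle.
struct SketchHandle {
    explicit SketchHandle(block_id_t id) : block_id(id) {}
    block_id_t block_id;
};

// Hypothetical registry that keeps only weak references, so it never
// keeps a handle alive on its own.
class SketchRegistry {
public:
    std::shared_ptr<SketchHandle> Register(block_id_t id) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto entry = blocks_.find(id);
        if (entry != blocks_.end()) {
            // lock() yields a non-null shared_ptr only if some owner
            // still holds the handle.
            if (auto existing = entry->second.lock()) {
                return existing;
            }
        }
        // No live handle: create one and remember it weakly.
        auto result = std::make_shared<SketchHandle>(id);
        blocks_[id] = result;
        return result;
    }

private:
    std::mutex mutex_;
    std::unordered_map<block_id_t, std::weak_ptr<SketchHandle>> blocks_;
};
```

Registering the same id twice while the first handle is still alive returns the same object; once every owner has released it, the next registration builds a fresh handle. Storing weak pointers lets the map deduplicate handles without extending their lifetime.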

Reference:

  1. Viktor Leis, Michael Haubenschild, Alfons Kemper, and Thomas Neumann. “LeanStore: In-memory data management beyond main memory.” IEEE International Conference on Data Engineering (ICDE), pp. 185-196, 2018.