CMU15445 Lecture 6 Hash Tables

本节主要介绍DBMS如何访问页中的数据

回顾

首先我们需要讨论DBMS怎么用execution engine去读写数据，主要使用两种数据结构：

Hash Tables
Trees

数据结构

可用之处

DBMS的内部在许多地方需要使用数据结构，比如：

Intenal Meata-Data:用于追踪数据库与系统状态的信息
Core Data Storage：用于存储数据库的tuple
Temporary Data Structs：DBMS可以在处理查询的时候，通过动态的创建数据结构来加速执行（比如用hash tale加速join）
Table Indices：辅助的数据结构，使得能够更加轻易的找到特定的tuple

设计数据结构需要做的考虑

数据的组织：我们需要搞清楚内存分布，以及什么信息存储在数据结构中，使得更高效的访问数据
并发性：需要考虑数据的并发访问

Hash Tables

Hash Tables能提供O(1)的时间复杂度(最坏是O(n))与O(n)的空间复杂度
Hash table的实现主要由两部分组成：

Hash Function
- how to map a large key space into a smaller domain
- trade-off between fast execution and collison rate
Hash Scheme
- how to handle key collisions
- trade-off between alloctaing a large hash table to reduce collisions and having to execute additional instructions when a collision occurs

Hash Functions

哈希函数，一般是输入key(可以数字，字母等)，输出Integer；我们需要使用cryptographically secure hash function，（cryptographic hash 函数开销相较于普通的hash大的多），而且hash function由数据库内部使用，不会泄露到外部

静态hash函数

静态hash是指hash table的size是固定的
现实世界中存在三个问题：

不能够提前知晓元素的数量
keys不是独一无二的
hash函数并不是完美的，也就是说不可能实现只要key1 != key2,就一定hash(key1) != hash(key2)

线性探测(Linear Probe Hashing)

线性探测处理碰撞的方法是搜寻下一个空位

删除时的问题
由于可能因为碰撞，将正确的key防在了不正确的slot中，有两种解决方案：
- The most common approach is to use “tombstones”. Instead of deleting the entry, we replace it with
  a “tombstone” entry which tells future lookups to keep scanning.
- The other option is to shift the adjacent data after deleting an entry to fill the now empty slot. However,
  we must be careful to only move the entries which were originally shifted
当 keys 可能出现重复，但 value 不同时，有两种做法：
- Separate Linked List
- Redundant Keys

Robin Hood Hashing

Cukoo Hashing

结论

通常为了减少 collision 和 comparisons，Hash Table 的大小应当是 table 中元素量的两倍以上。静态hash table需要提前预知数据数量，否则每次数量超过范围时都需要重建 Hash Table。它可能被应用在 Hash Join 的场景下，如下图所示：由于 A, B 表的大小都知道，我们就可以预判到 Hash Table 的大小。

动态hash函数

动态hash函数需要适当的时候扩容，The schemes perform this resizing in different ways that can either maximize reads or writes.

Chained Hashing

每个key对应一个链表，每个节点是一个bucket。写操作时需要设置latch

Extendible Hashing

这个方案的主要是拆分桶，而不是要桶一直增长

Linear Hashing

hash table的缺点

对于范围读取,hash不是一个好的选择

参考

note6

CMU15445 公开课数据库

CMU15445 Lecture 6 Hash Tables

回顾

数据结构

可用之处

设计数据结构需要做的考虑

Hash Tables

Hash Functions

静态hash函数

线性探测(Linear Probe Hashing)

Robin Hood Hashing

Cukoo Hashing

结论

动态hash函数

Chained Hashing

Extendible Hashing

Linear Hashing

hash table的缺点

参考

相关

mysql数据库死锁的产生原因及解决办法

【数据库】快速安装 SQL Server Express

关于SQL递归查询在不同数据库中的实现方法

GraphX 在图数据库 Nebula Graph 的图计算实践

主流开源分布式图数据库 Benchmark

图数据库 Nebula Graph 的安装部署

图数据库 Nebula Graph 是什么

Windows安装SQL Server数据库

SQL Server 维护计划实现数据库备份（Step by Step）(转)

mysql查出来的时间跟数据库相差13/14小时

Gin中记录Gorm数据库表生成-查询使用

如何用时序数据库 CTSDB 与 TARS 结合，解决海量监控数据难题

标签