Big Table

What is Bigtable?

1. It's a NoSQL database.
    Bigtable - Google
    HBase - open-source implementation of Bigtable
    Cassandra - Facebook

3-layer NoSQL structure:

1. Row key: usually a hash key, e.g. RowKey = UserId
2. Column key: a compound key, e.g. timestamp + user id (it is ordered, so range queries are possible)
3. Value: usually a string (e.g. storing a friend-relationship value)
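The 3-layer structure above can be sketched as a nested map. This is just a toy illustration (the names `user42`, `put`, and `range_query` are made up), showing how the ordered column key enables range queries over timestamps:

```python
# Toy sketch of the 3-layer key structure: row key -> column key -> value.
# Column keys are (timestamp, user_id) tuples, which sort in order,
# so we can scan a timestamp range for one row.

table = {}  # row_key -> {column_key: value}

def put(row_key, column_key, value):
    table.setdefault(row_key, {})[column_key] = value

def range_query(row_key, start_col, end_col):
    # Column keys are ordered, so we can scan a timestamp range.
    cols = table.get(row_key, {})
    return [(c, v) for c, v in sorted(cols.items()) if start_col <= c < end_col]

put("user42", (1000, "friendA"), "friend")
put("user42", (2000, "friendB"), "friend")
put("user42", (3000, "friendC"), "blocked")
# All column entries for user42 with timestamp in [1000, 3000):
print(range_query("user42", (0, ""), (3000, "")))
```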

How to scale

Using "consistent hashing (on the row key) + horizontal sharding", break Bigtable into mini Bigtables.
Tablet = mini Bigtable
Tablets run on tablet servers; every tablet server is like a smaller DB.
But a tablet server does not store the data itself; it only maintains the mapping to the GFS (Google File System) service.
A tablet server is basically a client of GFS; the data is actually stored on GFS.
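A minimal consistent-hashing sketch (this is an assumption about the general technique, not Google's actual scheme): each tablet server owns a point on a hash ring, and a row key is routed to the first server clockwise from its hash:

```python
import hashlib
from bisect import bisect

# Consistent-hashing sketch: hash servers and row keys onto a ring;
# a row key belongs to the first server clockwise from its hash.

def ring_hash(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2 ** 32)

servers = ["tablet-1", "tablet-2", "tablet-3"]
ring = sorted((ring_hash(s), s) for s in servers)

def locate(row_key):
    h = ring_hash(row_key)
    points = [p for p, _ in ring]
    i = bisect(points, h) % len(ring)  # wrap around the ring
    return ring[i][1]

print(locate("user42"))  # deterministically maps to one of the three servers
```

Adding or removing a server only moves the keys on one arc of the ring, which is why consistent hashing beats naive `hash(key) % n` sharding for rebalancing.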

How to manage tablet servers?

Master + slave model.
The master stores a hashmap with the consistent-hashing information.

How do we read/write in Bigtable?

What is the Bigtable read/write key?
    Key = Row_Key + Column_Key
    Read --> Key; Return: Value
    Write --> Key, Value; Return: Done

How do we read Bigtable?
1. Client sends key --> tablet master.
2. Tablet master --> returns the tablet server id to the client (slave id / mini-table id).
3. Client then communicates with the slave tablet server to get the data.
4. The slave tablet server communicates with the GFS service (which has its own master/slave structure).
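The four steps above can be sketched as a chain of lookups (all names here are stand-ins; real Bigtable/GFS RPCs are far more involved):

```python
# Toy read path: master resolves key -> tablet server,
# tablet server resolves key -> GFS chunk, GFS returns the value.

gfs = {"chunk-7": {"user42:ts1": "friendA"}}   # GFS actually stores the data
tablet_servers = {"tablet-1": "chunk-7"}       # tablet server -> chunk it manages
master = {"user42:ts1": "tablet-1"}            # master's routing map (simplified)

def read(key):
    server = master[key]            # steps 1-2: ask master, get tablet server id
    chunk = tablet_servers[server]  # step 3: talk to the slave tablet server
    return gfs[chunk][key]          # step 4: tablet server reads from GFS

print(read("user42:ts1"))  # -> friendA
```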

How do we write Bigtable?
1. Client sends key --> tablet master.
2. Tablet master --> returns the tablet server id to the client (slave id / mini-table id).
3. Client then communicates with the slave tablet server to write the key/value pair.
4. The slave tablet server returns done.

What if we read while we write?
Race condition.
We need a lock -- a distributed lock (Chubby (Google internal), ZooKeeper (the open-source counterpart)).
Distributed lock = lock server (we store a consistent hashmap in the lock server).
The lock is at the key level, not locking the whole tablet server. Storing the consistent hashmap is just to be able to direct a
client's key to the correct server.
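Key-level locking can be sketched with an in-process stand-in for Chubby/ZooKeeper (the `LockServer` class and its API are invented for illustration): one lock per key, so writes to unrelated keys never block each other.

```python
import threading

# Stand-in for a distributed lock service: hands out one lock per key,
# so readers/writers on the SAME key serialize, but different keys don't.

class LockServer:
    def __init__(self):
        self._locks = {}
        self._guard = threading.Lock()  # protects the lock table itself

    def lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

lock_server = LockServer()
store = {}

def write(key, value):
    with lock_server.lock_for(key):  # key-level lock, not a whole-server lock
        store[key] = value

def read(key):
    with lock_server.lock_for(key):
        return store.get(key)

write("user42", "v1")
print(read("user42"))  # -> v1
```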

The master server and the lock server are not the same server.
    The master server usually does its own job, like maintaining slave servers (reboot, monitoring, etc.).
    The lock server only handles the consistent hashmap.

Advantage of using a lock server:
1. Metadata is stored on the lock server, so the master server does not need to store metadata anymore.


Does GFS have read/write conflicts?
    Once GFS is done writing a file (a chunk), it does not allow changes; you can only rewrite the whole chunk (64 MB).
    But Bigtable allows modification of a value given its key.


How does the tablet server write to GFS?
    It stores data on the tablet server first, as a buffer; once enough data has accumulated, it writes to GFS.
    So what if the tablet server crashes? Don't we lose all the buffered data?
        The answer is a "write-ahead log" written to the tablet server's disk. Writing the log is an append-only write, which is pretty quick
        even on disk. When enough data has accumulated, the log is consumed to write to the GFS service.
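The write-ahead-log idea can be sketched as follows (file layout, threshold, and the `gfs_flushes` list are all assumptions for illustration): every write is appended to an on-disk log before being buffered in memory, so a crash loses nothing that can't be replayed.

```python
import json
import os
import tempfile

# Write-ahead-log sketch: append each write to disk first, buffer it in
# memory, and flush the sorted buffer to "GFS" once it's big enough.

log_path = os.path.join(tempfile.mkdtemp(), "wal.log")
memtable = {}
gfs_flushes = []      # stand-in for sorted tables persisted to GFS
FLUSH_THRESHOLD = 3

def write(key, value):
    with open(log_path, "a") as log:              # append-only: fast even on disk
        log.write(json.dumps([key, value]) + "\n")
    memtable[key] = value
    if len(memtable) >= FLUSH_THRESHOLD:
        gfs_flushes.append(dict(sorted(memtable.items())))  # sorted, SSTable-like
        memtable.clear()

def recover():
    # After a crash, replay the log to rebuild all buffered writes.
    recovered = {}
    with open(log_path) as log:
        for line in log:
            key, value = json.loads(line)
            recovered[key] = value
    return recovered

write("a", "1"); write("b", "2")
print(recover())  # buffered writes can be replayed after a crash
```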

What if memory is not enough on the tablet server?
    Short answer: move memory content to disk.

How exactly do we write to memory?
First we need to know how a tablet server stores data in memory:
    SSTable (Sorted Strings Table)
        Bigtable = a list of tablets (mini Bigtables)
        Tablet = a list of SSTables
        SSTable = a list of sorted <key, value> pairs

    How does a tablet generate an SSTable?
        On each tablet server, there is an <SSTable#, GFS address> map.
        When we write to a tablet server, it stores the data in memory; when there is enough, it sorts by key, then writes to GFS.
        If the tablet server crashes, it can just reboot, ask the master server for the addresses of where all the SSTables for this
        tablet are stored in GFS, then read from GFS to recover.
        Data that was still buffered should exist on the tablet server's disk as the "write-ahead log", so we have a full
        trace of the buffered data; we can read the disk log to know what still needs to be written to GFS.

How to read Bigtable?

Which SSTable do we read once we reach the tablet server?

  1. Read memory; if the key exists, return it.

  2. If not, read the SSTables in timestamp order; the latest SSTable is read first.

  3. Binary-search the key in each SSTable; return when found.
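The three-step read order above can be sketched like this (a minimal model with made-up data; real SSTables live on GFS, not in Python lists):

```python
from bisect import bisect_left

# Read order: in-memory buffer first, then SSTables newest-first,
# binary-searching each sorted table for the key.

memtable = {"k5": "v5-new"}
# SSTables listed newest first; each is a sorted list of (key, value) pairs.
sstables = [
    [("k3", "v3"), ("k5", "v5-old")],    # newer
    [("k1", "v1"), ("k3", "v3-older")],  # older
]

def sstable_get(table, key):
    keys = [k for k, _ in table]
    i = bisect_left(keys, key)           # binary search within one SSTable
    if i < len(keys) and keys[i] == key:
        return table[i][1]
    return None

def read(key):
    if key in memtable:                  # step 1: memory
        return memtable[key]
    for table in sstables:               # step 2: newest SSTable first
        value = sstable_get(table, key)  # step 3: binary search
        if value is not None:
            return value
    return None

print(read("k5"))  # v5-new (memtable wins over the stale SSTable copy)
print(read("k3"))  # v3 (newest SSTable wins over the older one)
```

Reading newest-first is what makes overwrites work: a stale copy of a key in an older SSTable is simply never reached.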

So how do we maintain SSTables to get rid of stale keys (i.e. the same key stored multiple times in different tables)?

  1. Periodically, we take all the SSTables of one tablet server, do a k-way merge, and always discard the older duplicate key entries.
  2. After the merge we get one big SSTable; we then slice it into pieces within the SSTable size limit and store them back to GFS, abandoning the previous SSTables. (The re-write after the k-way merge is just an appending write, which is faster; the tablet server updates its metadata with the new addresses so new requests are navigated to the new set of SSTables in GFS.)
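The compaction steps above can be sketched with a k-way merge (slice size and table contents are illustrative; this is the general merge-and-dedupe technique, not Bigtable's exact compaction code):

```python
import heapq

# Compaction sketch: k-way merge of sorted SSTables, keeping only the
# newest entry per key, then re-slicing into fixed-size SSTables.

def compact(sstables, slice_size=2):
    # Tag entries with the table's age so ties resolve to the newest copy;
    # sstables[0] is assumed newest (age 0).
    tagged = [[(k, age, v) for k, v in t] for age, t in enumerate(sstables)]
    merged, seen = [], set()
    for k, age, v in heapq.merge(*tagged):
        if k not in seen:               # first hit per key = newest copy
            seen.add(k)
            merged.append((k, v))
    # Slice the one big merged table back into SSTable-sized pieces.
    return [merged[i:i + slice_size] for i in range(0, len(merged), slice_size)]

new = compact([
    [("k3", "v3"), ("k5", "v5")],       # newest
    [("k1", "v1"), ("k3", "stale")],    # older
])
print(new)  # two sliced SSTables; the stale copy of k3 is gone
```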

When reading, how do we check whether a key exists in an SSTable?

  1. Binary search - all keys are already sorted. But this costs a lot of disk reads, since binary search needs multiple probes.
  2. Better way? --> Index each SSTable.
    1. Store some of the keys (for keys ABCDEFG, we only store/index ACG in memory), so we binary-search the in-memory index to find the range where the target lies, then go into GFS to do the disk binary search within that range.
  3. Even better way?
    1. Bloom filter!
    2. What is it? --> Something that helps us quickly check whether a key might exist in a table.
    3. How does it work? -->
      1. For every key, we use 2 hash functions and take the mod of each result to limit it to a certain range.
      2. That range can be represented as a bit array, e.g. a 32-bit integer (this is the filter).
      3. So every key, after hashing, marks 2 positions in the filter.
      4. To check whether a key exists in the SSTable, simply hash it with the 2 hash functions and see whether those 2 positions are marked in the filter. If either is unmarked, the key definitely does not exist; if both are marked, the key probably exists (Bloom filters can give false positives, so we still verify with the actual lookup). The filter is updated whenever data is stored into the SSTable.
  4. So for every SSTable, we store an index of some of the keys and a Bloom filter alongside it. A key lookup then becomes much faster: we can quickly rule out tables that don't have the key, and quickly locate the key in GFS when they do.
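The 2-hash-function Bloom filter described above can be sketched like this (the filter size and choice of MD5/SHA-1 as the two hash functions are illustrative; "both bits set" only means the key MAY exist):

```python
import hashlib

# Bloom-filter sketch: 2 hash functions mark 2 bit positions in a
# 32-bit filter. An unset bit proves absence; set bits only suggest
# presence (false positives are possible).

FILTER_BITS = 32

def positions(key):
    h1 = int(hashlib.md5(key.encode()).hexdigest(), 16) % FILTER_BITS
    h2 = int(hashlib.sha1(key.encode()).hexdigest(), 16) % FILTER_BITS
    return h1, h2

def add(bloom, key):
    for p in positions(key):
        bloom |= 1 << p        # mark both positions
    return bloom

def might_contain(bloom, key):
    return all(bloom & (1 << p) for p in positions(key))

bloom = 0
for k in ["k1", "k3", "k5"]:   # filter is updated as keys enter the SSTable
    bloom = add(bloom, k)
print(might_contain(bloom, "k3"))  # True: key was added
```

A real deployment would size the filter relative to the key count (32 bits is far too small in practice), but the mechanics are the same.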
