Understanding Merkle Tree

A Merkle Tree is a cryptographic tree structure used in computer science and distributed systems to efficiently verify the integrity of large sets of data (accuracy and consistency of data over its lifecycle).

Merkle Tree, also known as Hash Tree is a tree of hash values. It has a tree structure in which each leaf node is a hash of a small portion of the data, and each non-leaf node is a hash of its children. It is used in applications such as NoSQL databases, Git, Cryptocurrencies, File Systems, etc. Some key characteristics of Merkle Tree are:

Binary Tree Structure:
The Merkle Tree is a binary tree, where each leaf node represents a hash of data.
Leaf Nodes:
The data set is divided into fixed-size blocks or "leaves".
Each leaf node contains the hash of a specific data block or piece of information.
Non-Leaf Nodes:
Non-leaf nodes in the tree represent the hash of the concatenation of their child node's hashes.
If the number of leaves is odd, the last hash is duplicated to form a pair.
This process continues recursively until a single hash, known as the "Merkle Root", is obtained at the top of the tree.

Verifying Data Integrity: Verify the accuracy and consistency of data over its lifecycle

We can ensure the integrity of the data and precisely identify the location of any changes. For instance, we've downloaded a dataset from a server. Let's consider it to be a directory with four files: January, February, March, and April. If the file "March" got corrupted in our downloaded system, how do we go about verifying the integrity of the downloaded dataset?

Brute force Approach:

Re-download the entire dataset from the original source, resulting in two sets of directories in our downloaded environment.
Verify file changes by iterating through each file of the corrupted directory and checking for differences with the corresponding file of the original source.
The time complexity for this process is O(n * len), where n is the number of files, len is the length of each file, and it requires downloading the entire dataset again.

Merkle Tree Approach:
Generate the Merkle tree with all files of the corrupted directory.
The hash of the "Merkle Root" will differ from the root of the original dataset, indicating tampering with one/more files.
Follow the path from the root of the tree to identify the specific corrupted file (leaf node). In this instance, where "March" changed to "November," only that particular file needs to be re-downloaded.
Begin with the Merkle Root and identify the child node with a hash differing from its corresponding original node.
If the difference is in the Right child, navigate towards the Right child; otherwise, proceed towards the Left child.
Continue this process until reaching the leaf node, which denotes the corrupted file.
The time complexity is logarithmic as it follows a path from root to leaf in the tree. There is a reduced memory footprint since only the corrupted file requires re-downloading.

Real-Life Applications:

Distributed Leaderless Databases like Cassandra:
Data in Cassandra is replicated and synchronized across multiple nodes. Each node is responsible for a portion of the data. For simplicity, let's assume each node of Cassandra should have all the data.
Cassandra hashes each set of data to create a Merkle Tree for each node. Each node maintains a Merkle Tree where each leaf node represents the hash of the data, and the root node (Merkle Root) represents the hash of the entire dataset for that node.
New data/row might be written to one of the nodes, and meanwhile, nodes periodically exchange their Merkle Trees with each other.
Other nodes can compare their Merkle Trees to quickly identify if there are differences in their datasets. Only the differing data needs to be transferred for synchronization.
Version Control Systems:
Git operates on a snapshot-based model rather than a file-based model. Instead of storing the differences between versions, Git stores the entire snapshot of the project at each commit.
Git stores the file structure of a snapshot using a "Tree Object". A tree object is similar to a directory and can contain blobs (file content) or other tree objects (subdirectories).
The Merkle Tree structure is formed by hashing the content of each tree object or blob. The final hash becomes part of the "Commit Object", which additionally includes metadata such as the author, timestamp, etc.
Git can compare the stored Merkle Root(last commit) with a recomputed Merkle Root based on the current state of the working directory or repository. Git can quickly determine if any changes occurred in the repository.
Due to the Merkle Tree structure, if a file or directory remains unchanged between commits, its hash remains the same. This allows Git to efficiently store only the changes, and unchanged content is referenced by its existing hash.
Bit Torrent:
BitTorrent breaks down files into smaller fixed-size blocks. Blocks are grouped to form "pieces".
Hashes of individual pieces are used to construct a Merkle Tree. The resulting Merkle Tree has leaves representing the hashes of individual pieces, and the root represents the hash of the entire file.
The Merkle Root is included in the torrent file, along with other metadata.
Peers in the BitTorrent network can verify the integrity of downloaded pieces by comparing local piece hashes with the Merkle Root provided in the Torrent file.
Peers exchange information about which pieces they have and need. BitTorrent uses Piece Selection Algorithms to determine which pieces to request/download.
When a peer sends a piece to another peer, it includes a "Merkle Proof"—a set of hashes from the leaf (individual piece hash) to the root—allowing the receiving peer to efficiently verify the integrity of the received piece.
Blockchain Technology:
Blockchain relies heavily on Merkle Trees to secure and efficiently represent the transaction history within blocks.
Each block in a blockchain contains a Merkle Tree, and the root of this tree is included in the block header. This ensures efficient verification of the entire transaction history.

Client-Server Communication Model (Part-1)

The foundation of the internet lies in the communication between computers. Computers, acting as servers, own and can provide resources to other computers, serving as clients. The roles of computers shift over time – one moment, a computer may own resources, while at another time, it may require resources from others. Communication is essential for data exchange between them. The predominant protocol facilitating Client-Server communication is the HyperText Transfer Protocol (HTTP). HTTP is a protocol for fetching web pages, documents, images, media files, binaries, etc. Pretty much the whole of the internet runs on HTTP. We have discussed HTTP in detail here . A typical HTTP request flow is as follows: A client opens a connection and requests a resource from a server. The server calculates the response The server sends the response to the client on the same opened connection Some of the most popular Client Server Communication models are: Polling / Short Polling / AJAX Polling Long Po...

Tech With Rahul

Search This Blog