Whenever a software project needs to implement immutable records, people often start thinking of Blockchain or Distributed Ledger Technology like Hyperledger. Blockchain and DLT make use of cryptographic techniques that enable immutability, verifiability and integrity checks. However, Blockchain and DLT need much more than just these checks. It needs to prevent all the attempts of a double-spend by potential malicious users. In a trust-less environment it needs implementation of complex protocols which leads to consumption of huge amount of electricity. This makes writing data on Blockchain costly. DLTs like Hyperledger make use of simpler consensus algorithms between a “set of trusted nodes”, which reduces the cost. However, it still incurs significant costs if nothing more than immutability, verifiability and integrity were the concern.
This post is about how to implement data immutability, verifiability and integrity without using a Blockchain or a DLT in industrial strength software applications.
- Foundational technique
- Immutability and Verifiability
- Immutability, Verifiability and Integrity
Foundational technique – Cryptographic Hashing
To understand the solution, some foundational techniques must be understood. This section describes what is cryptographic hashing. Those who are already aware of cryptographic hashing, they may skip to the next section.
Cryptographic hash function
If you provide a stream of bytes to a cryptographic hash function, it generates a number called hash (also referred to as digest). The following properties make it a very useful tool:
- It is impossible to guess the generated hash value for a stream of bytes. To get the hash value, one has to run the algorithm, there is no shortcut.
- For a given input it always generates the same hash value.
- It is infeasible to deduce the input based on the hash value. That means it is an irreversible mathematical function.
- No two different stream of bytes result in the same hash value. Even a small change in the input stream generates a totally different number. (When two different stream of bytes produce the same hash value, we say the cryptographic hash function is broken. It is also referred to as there is a collision in the cryptographic hash function.)
- Any size of input stream will always result in the same size of hash value. Some hash functions generate hash values that are 256 bits long. If those functions are used, the result will always be 256 bits long.
Examples of commonly used cryptographic hash functions include:
- SHA-256 (returns 256 bit unsigned integers)
- RIPEMD-160 (returns 160 bit unsigned integers)
Sample hash values:
There are two things we observe from the table above:
- With a slight change in input, the hash values change dramatically and by looking only at the hash values, one cannot conclude that the first and second are even closely related. This is also referred to as avalanche effect. It is one of requirements of a cryptographic algorithms to have the avalanche effect.
- The values mentioned are actually text strings and do not look like numbers, although we expected them to be numbers.
They are actually numbers, represented this way to reduce the size of the presented text. For example, binary representation of the number 255 is ‘11111111’. Decimal representation is ‘255’. Hexadecimal representation is ‘ff’ or ‘FF’. The representations in the table above are hexadecimal representations of 32 byte and 20 byte numbers. base64encoding is commonly used to represent the hash values in cryptographic applications.
For quite some time, open source software has been distributed through various mirror sites so that downloads are sped up. Any user could download the software package quickly from a nearby mirror site. These nearby sites could be malicious and could provide compromised open source software packages. In order to overcome this problem, open source software builds would publish a checksum file on their website. This checksum file is used to verify that the downloaded package from a nearby mirror site is authentic. The checksum file actually contains a cryptographic hash of the software package.
For example: Apache OpenOffice – Download checksum files.
This is the content from SHA256 checksum file for Apache Open Office SDK for Linux x86-64 build.
It would not matter which mirror site the SDK is downloaded from. A user can easily verify the authenticity of the download using a simple command like:
$ sha256sum Apache_OpenOffice_incubating-SDK_3.4.1_Linux_x86-64_install-rpm_en-US.tar.gz e08f9c8acecba1ee0046f820b0abed97dfe90511bd733a65936fdf0ea9c22540 Apache_OpenOffice_incubating-SDK_3.4.1_Linux_x86-64_install-rpm_en-US.tar.gz
If the generated SHA256 does not match with the checksum file, the user knows that the software package has been tampered.
We will use the same technique to verify data objects. The data object we are interested in would have an ID. Any application or service asks for objects using IDs. If the data that the application gets does not generate the same hash value as the ID, then the application can easily conclude that the data is not what was asked for or the data has mutated. On a typical hardware the cryptographic hash function would take a few microseconds to generate the hash value. So, this is not costly and offers good verifiability.
In the context of this post, the verifiability that we are seeking is not because we operate in a trust-less or adversarial environment like Blockchains, but mainly because data can become corrupt or can be deliberately changed by hackers / attackers. This is for companies to verify that the data they hold has not been modified undesirably.
Application: Deterministic and Verifiable IDs
In a typical RESTful request, a client posts a request to a service, and the service returns back an ID. The service uses the ID to index the object that got created due to the request. However, with a predetermined ID generation technique, it is possible for the client to know the ID even before the service receives the request. This is done even in blockchains. The transactionID (also referred as TXID) is a hash computed from certain fields of a transaction request.
To be able to know the ID of an object that will be returned by a service even before the object is created in a service has significant benefits.
- Just by computing the ID from the fields of an object, and comparing it with the ID provided in the object, one can determine if the object is the right object. Housekeeping processes can easily determine data corruption or software bugs or potential attacks.
- Request need not be processed synchronously.
- The client can fire a batch of requests in just one call. Each individual request can easily be identified by the request id which will be deterministic for both the client and the service. This eliminates the need to create IDs on the client side, and map them to the server side IDs.
- Non-idempotent requests such as HTTP POST requests can achieve deterministic behaviour. Multiple POST requests (which could be because of software bug or infrastructural replays) will not cause harm, as the request ID is predetermined. The server can easily identify a duplicate request.
Example: To generate an OrderID, one could take the sha256 of a series of bytes of the quantity, price, dateTime of the order, clientID, and the assetID or assetSymbol.
OrderID = sha256(bytes(qty)||bytes(price)||bytes(dateTime)||bytes(clientID)||bytes(assetID))
Data format and ID Generation
While the scheme above for ID generation works, it has a certain drawback. For every type of objects, a developer would have to write an ID generator. This is not desirable. For a majority of the types of objects, the id generator should just be available easily. For this, the object itself can be serialised and the serialised stream of bytes can be hashed.
It is important to note that text based data structures like JSON, XML are not very well suited for this. The main reason behind this is that adding a space or TAB within the document will not alter the data for JSON or XML, however will yield a totally different hash value and therefore a totally different object ID. Hence, it is better to use serialisation formats designed for cross platform, multiple language environments and are deterministic. Compact data serialisation like protocol buffers, FlatBuffers, Apache Avro and even CBOR are much better suited for this.
Immutability and Verifiability
Software professionals often jump to Blockchain to achieve immutability even in a non-adversarial environment like most business applications. Any party explicitly trying to cheat would face the court, and fraudulent transactions can be reverted in the most common business applications seen throughout the world. Therefore, there is no need for all the complexity and consensus algorithms like proof-of-work or proof-of-stake. Even on DLTs there are algorithms like Raft which are used to achieve consensus. Although much lesser in the power consumption, raft could still be an overkill for some of the usual business applications.
Merkel DAG and Immutable Data Structures
There are many applications apart from Blockchains, that are already using Merkel trees and Merkel DAG (Directed Acyclic Graph). Applications like git, bittorent, NoSQL DBs like Cassandra, DynamoDB, IPFS, filesystems like ZFS and more have used these techniques. We will explore how to use this within data structures to achieve immutability. As we explore the Merkel DAG, it will become evident why it is “acyclic” in nature. A Merkel DAG is a tree like data structure which enables Immutability and Verifiability.
Any practical data structure is made from multiple constituent data structures. For example, A client would have an id, name, address, emailId, phoneNumber etc. Let us call each constituent data structure as a constituent block.
The name itself would have a firstName, middleName, lastName. The address would have line1, line2, city, state, country, postalCode etc. Each constituent data structure can be hashed, and its hash value can be stored in the parent data structure. When the parent data structure is hashed, it is hashed with the hash values of the constituent data structure. The image below elaborates this in detail.
Note that storing these in a key value store, also provides a side benefit of “deduplication”. If one hashes a pre-existing value like the City in the above example, then the generated hash will be exactly the same. Therefore, there cannot be 2 entries for “Seattle, Washington, USA” in the Key-Value Store.
If any aspect of the Client object above is changed, for example, if the city is changed, then the generated hash for the city object will change. This will need changes to the Address object to store the new hash. As soon as the new hash is captured in the Address object, the hash value of Address itself changes, and therefore will need a change in the Client object. Then the client ID changes because the hash value of the client object would change due to the new address hash value. Thus even a small change in any of the constituent objects nested within the client will make a change to the client ID itself. Thus creating a completely new object in the key value store. The old object still remains intact.
It is sometimes desired that the object should be mutable, however, as the object evolves, an audit log of all the changes must be captured without any missing steps in between. This is covered in the next section.
Immutability, Verifiability and Integrity
The most common use case is an immutable ledger as a source of truth for a series of events. This could be a series of events between two trading partners or access / requests made by users on a system or any events that have a huge business significance. It is important for the company to store and be sure that the stored data is not corrupt or even incomplete. No records in between can be missed or changed and the stored list must be verifiable. One of the applications of such records could be used for compliance, regulation and audit.
To understand the core of how it is achieved, let us use the Client object described in the previous section. We know that if a change is made to the client object, it gets a new client ID. However, we need to be able to deterministically link the new ID to the old one. As we capture the change the new object would link to the old one using the hash value of the previous version. This is exactly how blocks in a Blockchain are linked.
For this, a small change is needed in the client object. A new field is added at the top level Client object which points to the previous objects hash value. In the example I have called it previousHash. In Bitcoin they call it:
prev_block. It is a part of the block header, and it points to the previous block on Bitcoin.
So, what happens when a record is being captured the first time ever? It points to a genesis block! For people who are not familiar with Blockchain, the genesis block is where the Blockchain began. The first ever Block after going live! We need a similar concept. To keep it in line with the convention, we will call it the genesis record.
The object that will be hashed to create a genesis record in our case will be: hash of an empty object, in which all fields are either null or 0.
Each Client when onboarded would have its previousHash pointing to the genesisClientRecordId. The first record for the client would have some hash value for the name field, address field and the previousHash field would have the value of the genesis record. This first record when hashed would result in a unique clientID. For the sake of example lets say 0x29d1…
Let’s consider that city of the client is changed. In that case, the address hash value would change. When we do that, we would need to put in a previousHash pointing to 0x29d1… as shown below.
All that is really needed to reconstruct the sequence of events is the key value store and the latest hash value of the client object, the clientId: 0x23d1…
Also note that comparing only the hash value of the most recent record can be a check if systems are in sync.
- Synchronisation of requests-responses between two businesses.
- A typical use case is Request-for-quote, Quote, Order, Fills etc. Each message points to a previousHash.
- Mobile apps and other device apps keeping in sync.
- Distributed databases.
- The check for synchronisation is just the latest hash value. This is very cheap and most importantly reliable.
and many more.
The foundational technique for this implementation is Cryptographic Hashing. It is known that over time collisions could be found in hashing algorithms. To overcome this problem, the ID generation must be flexible enough to change the hashing algorithm and prevent a potential attack.
If the data that we are dealing with is small enough, a new hashing algorithm can be used to recompute all the IDs and relink the whole data. Also, all the clients need to be updated with the changed IDs. However;
- It is likely that the data is so large that to make this batch process run and complete the migration of the data, it could take more than just a few hours. In such cases, it should be possible to maintain historic data with an older hashing algorithm, while the new data is hashed using a different algorithm. This can easily be achieved by a technique called multihash, described below.
- Not all client applications may be ready to accept changes to IDs. Therefore, a central system which has a human readable filename and path kind of key can be used to store the hash values. This can be easily achieved by something really simple like an access controlled key value store or slightly more complex system like an LDAP system or Zookeeper or etcd.
Multihash is nothing more than a way of generating and storing the IDs in a way that you can detect which Cryptographic Hashing was used to generate the ID. The ID is nothing but a series of bytes which is encoded in a human readable format. This could be represented in base16 or base64 whatever is the preferred encoding.
As a part of the ID we need to identify which hashing algorithm is being used to generate the ID. This can easily be represented as an integer. Further, different hashing algorithms result in different size of hashes. So, another integer to represent the length of the hash value. Followed by that would be the actually hash value.
There are many ways to achieve the data immutability, verifiability and integrity. Depending upon the environment in which the software is to be operated, different level of security and secure protocols are needed. In most cases, a simple scheme as described above can be borrowed from Blockchain without incurring the overhead of a Blockchain.