I Almost Invented A Revolutionary Technology 20 Years Ago

in blockchain •  7 years ago  (edited)

Hopefully you've heard of the term "blockchain"; if not then it is my pleasure to be the first to whisper it into your ear. It has the potential to be one of the most excitingly disruptive technologies to come along in recent years. Many consider the blockchain to be as important as "the Internet", the Industrial Revolution, or perhaps even the printing press. Are they exaggerating? The more I think about the possibilities of the blockchain the more I realize just how beneficially huge the impact could be.

Did I almost invent the blockchain? Don't get me wrong -- twenty years ago I was nowhere close to what the blockchain has evolved into today, but I was on the path. Still "almost invented" is a mighty bit of a stretch for me to make here, but it seemed like a catchier title for this story. 😏 My intention in sharing this story is not to claim any sort of fame but to help some people to start to understand what the blockchain is and how it works.

20 Years Ago

At the end of the last century (sounds really long ago, doesn't it?) I was working for a pharmaceutical company doing various IT tasks. During a meeting with scientific leadership I learned that the FDA had recently published 21 CFR Part 11. The gist of this new regulation was to define criteria for how electronic data should be protected. The FDA wanted drug companies not only to archive data obtained from lab instruments but also to prove that the data was original, untampered-with, and uncorrupted, no matter how old it was.

The FDA did not offer any helpful suggestions on how to achieve this somewhat lofty goal. The ideas emerging were to store the data on "permanent media" (like a recordable CD) and then send that media to a third-party vendor who would stow it in a secure facility -- only to be retrieved if both a representative of the company and a representative of the FDA signed a request to retrieve it.

This was a decent solution, but it didn't really help prove that the data on the removable media (I'll just use "CD" from here on out) had not been tampered with or corrupted. Consider a hypothetical situation where a company wanted to hide some information that could cost them big financial losses and harm their reputation. I'm certain that they could find a way to secretly have the data at the storage facility either destroyed or modified. Or, in a less sinister scenario, the data might just happen to get corrupted by a stray cosmic ray or one of those pesky scratches on the CD. There had to be a better way to prove that the data had not been tampered with or corrupted. The company didn't have a lot of time or budget to solve this particular problem, so the solution had to be fairly simple.

The solution came to me within a day. To understand it, I will first briefly explain what a "hash function" is. A hash function is a computing operation that takes a chunk of data as input and produces a short, fixed-size output ("the hash" or "message digest") that acts as a fingerprint of the original data. The important property of a hash function is that the same input data always produces the same hash value, while changing any part of that data will, with overwhelming probability, produce a different hash value.
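To make the fingerprint idea concrete, here is a minimal Python sketch using the standard `hashlib` module and MD5 (the function this story ends up using). The helper name `fingerprint` and the sample readings are made up for illustration; this is not the code from the original project.

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """Return the MD5 digest of data as a hexadecimal string."""
    return hashlib.md5(data).hexdigest()

# Hypothetical sample inputs that differ by a single character.
original = b"sample instrument reading: 42.0"
altered = b"sample instrument reading: 42.1"

# The same input always yields the same hash.
same_input_same_hash = fingerprint(original) == fingerprint(original)

# A one-character change yields a different hash.
changed_input_changes_hash = fingerprint(original) != fingerprint(altered)

# Each hex character encodes 4 bits, so 32 characters is 128 bits.
digest_bits = len(fingerprint(original)) * 4
```

However large the input is, the digest stays the same size, which is why a whole CD of data can be represented by one short value.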

The first part of the solution was to take the data at the end of each week, compute its hash value, and save that value. The data generated by these lab instruments came to less than 650 megabytes every week, and it took 10-20 minutes to compute the hash. The resulting hash value is surprisingly small (128 bits).
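A rough sketch of that weekly step might look like the following. It assumes the week's data is a set of files on disk and that they are fed to the hash in sorted path order; the function name `weekly_hash`, the chunk size, and the ordering rule are all assumptions made for this example, since the original procedure is not described at that level of detail.

```python
import hashlib
from pathlib import Path

CHUNK_SIZE = 1024 * 1024  # read 1 MiB at a time so large files never sit fully in memory

def weekly_hash(paths):
    """Return one MD5 hex digest covering the given files, in sorted path order."""
    digest = hashlib.md5()
    for path in sorted(Path(p) for p in paths):
        with open(path, "rb") as handle:
            # Feed the file to the hash piece by piece.
            while chunk := handle.read(CHUNK_SIZE):
                digest.update(chunk)
    return digest.hexdigest()
```

Sorting the paths matters: the same files hashed in a different order would give a different value, so whoever verifies the data later has to use the same order.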

The second part of the solution was to create two identical copies of the data each week. Both the original CD and the copy would be sent to an offsite, third-party vendor. One CD was marked as "retrievable" and could be obtained by an appointed company representative. The other copy was marked "secure" and could only be retrieved by a signed request from both FDA and company representatives. Furthermore, every CD retrieval request would be logged.

The hash generated in the first part of the solution lets anyone check whether a set of instrument data has changed. An agent who changes the data must also compute a new hash value and swap it in for the original. If both the original data and its associated hash can be replaced, no one will know that either was tampered with. Therefore the hashes, as well as the original research data, must be protected from tampering.

The third part of the solution was the chain dependency. After another week of data was collected from the lab instruments, the previous week's hash value would be included with the current week's data. The hash value for the current week is then computed from the combination of this week's data plus the hash of the previous week's data. This new hash is in turn added to the lab data generated in the coming week. This creates a chain of dependencies: each hash value uniquely represents the contents of the previous CD, and the contents of each previous CD also hold the hash value that uniquely represents the contents of the CD prior to it. For this solution we used the hash function known as MD5.

Hash A = MD5 (file001, file002, ..., file036)
Hash B = MD5 (file037, file038, ..., file062, Hash A)
Hash C = MD5 (file063, file064, ..., file089, Hash B)
etc.

CD A1 & CD A2 contain: file001 - file036
CD B1 & CD B2 contain: file037 - file062 and Hash A
CD C1 & CD C2 contain: file063 - file089 and Hash B
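The notation above can be sketched in a few lines of Python. This is an illustration only: the file contents are placeholders, and the way the previous hash is combined with the files (appended as its hex text after the file bytes) is an assumption, since the notation does not pin that down. MD5 is kept to match the story, although it is no longer considered collision-resistant and a present-day version would more likely use something like SHA-256.

```python
import hashlib

def chain_hash(files, previous_hash=None):
    """MD5 over this week's files, followed by the previous week's hash if any."""
    digest = hashlib.md5()
    for content in files:
        digest.update(content)
    if previous_hash is not None:
        # Assumption: the previous hash is appended as ASCII hex text.
        digest.update(previous_hash.encode("ascii"))
    return digest.hexdigest()

# Hypothetical stand-ins for three weeks of instrument files.
week_a = [b"file001 data", b"file002 data"]
week_b = [b"file037 data", b"file038 data"]
week_c = [b"file063 data", b"file064 data"]

hash_a = chain_hash(week_a)          # CD A holds only its files
hash_b = chain_hash(week_b, hash_a)  # CD B holds its files and Hash A
hash_c = chain_hash(week_c, hash_b)  # CD C holds its files and Hash B
```

Because `hash_b` is computed over `hash_a`, and `hash_c` over `hash_b`, the latest hash indirectly depends on every file that came before it.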

When the FDA retrieves a CD they would also retrieve any number of CDs following it. The FDA would then run the MD5 hash computation on the first CD to generate the hash of the contents of that disk. The hash value that the FDA just generated would then be compared to the hash that was archived on the following CD. Ideally this would be repeated until the end of the chain (the last CD) has been reached.
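The verification walk described above could be sketched like this. The `Disc` class and the function names are hypothetical, each disc is modeled as a list of file contents plus the hash of the previous disc, and the hash combination rule is the same assumption as before (file bytes followed by the previous hash as hex text).

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Disc:
    files: List[bytes]
    previous_hash: Optional[str] = None  # None only for the first disc in the chain

def disc_hash(disc):
    """MD5 over the disc's files plus the previous hash stored on it."""
    digest = hashlib.md5()
    for content in disc.files:
        digest.update(content)
    if disc.previous_hash is not None:
        digest.update(disc.previous_hash.encode("ascii"))
    return digest.hexdigest()

def build_chain(weeks):
    """Create one disc per week, each carrying the hash of the disc before it."""
    discs, previous = [], None
    for files in weeks:
        disc = Disc(list(files), previous)
        discs.append(disc)
        previous = disc_hash(disc)
    return discs

def first_broken_link(discs):
    """Return the index of the first disc whose hash does not match the
    value archived on the following disc, or None if every link holds."""
    for index in range(len(discs) - 1):
        if disc_hash(discs[index]) != discs[index + 1].previous_hash:
            return index
    return None

chain = build_chain([[b"week 1"], [b"week 2"], [b"week 3"]])
intact = first_broken_link(chain)      # every link matches
chain[1].files[0] = b"week 2, edited"  # tamper with the middle disc
broken = first_broken_link(chain)      # the middle disc no longer matches
```

Note that this walk can only vouch for a disc that has a successor. The hash of the very last disc has to be checked against a copy kept somewhere else.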

Now if someone attempted to change the data on one of the disks, that would cause the hash value to be different. Since the hash value is stored on the following CD they'd have to change both CDs -- but by changing the following CD, its hash is now different so they must change the CD after it, and so on. Basically, you'd have to change every CD following the one you originally changed. If every CD was checked out and replaced, that would raise suspicions -- especially since one set of copies requires a representative from the FDA to obtain.
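To see how far one change ripples, here is a small sketch from the forger's point of view. Everything in it is illustrative: each disc is a two-item list of `[files, previous_hash]`, the chain has ten made-up weeks, and the hash combination rule is an assumption.

```python
import hashlib

def disc_hash(files, previous_hash):
    """MD5 over a disc's files plus the previous hash stored on it."""
    digest = hashlib.md5()
    for content in files:
        digest.update(content)
    if previous_hash is not None:
        digest.update(previous_hash.encode("ascii"))
    return digest.hexdigest()

def build_chain(weeks):
    """Return a list of [files, previous_hash] discs."""
    discs, previous = [], None
    for files in weeks:
        discs.append([list(files), previous])
        previous = disc_hash(files, previous)
    return discs

def forge(discs, target, new_files):
    """Alter one disc, then repair every later disc so the links still match.
    Returns the indexes of all discs that had to be rewritten."""
    discs[target][0] = list(new_files)
    rewritten = [target]
    for index in range(target + 1, len(discs)):
        discs[index][1] = disc_hash(*discs[index - 1])
        rewritten.append(index)
    return rewritten

chain = build_chain([[f"week {n}".encode()] for n in range(1, 11)])
original_tip = disc_hash(*chain[-1])

touched = forge(chain, 3, [b"week 4, altered"])  # change the 4th of 10 discs
forged_tip = disc_hash(*chain[-1])
```

Changing the fourth disc forces the forger to rewrite it and all six discs after it, and the hash of the final disc still ends up different from what it was before.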

CDs over time will degrade and suffer some data loss/corruption. If one CD fails to generate the correct hash value because of natural "bit rot", hopefully the data on the duplicate CD will still be intact. To avoid possible bit rot of the data on the CDs we recommended that after a period of time the oldest media be retrieved (and have its hash verified) and then cloned onto new media before returning to the secure storage facility. Ideally at some point the recordable CDs would be replaced with a new storage medium that would have a longer data lifespan.
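Falling back to the duplicate when one copy has rotted could be sketched as follows. For simplicity a whole disc is treated as a single byte string, and the names `md5_hex` and `first_intact_copy` are invented for this example.

```python
import hashlib

def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of data as a hexadecimal string."""
    return hashlib.md5(data).hexdigest()

def first_intact_copy(copies, expected_hash):
    """Return the first copy whose MD5 matches expected_hash, or None."""
    for data in copies:
        if md5_hex(data) == expected_hash:
            return data
    return None

original = b"instrument data for one week"
expected = md5_hex(original)

# Simulate bit rot on one copy by flipping a single bit in its first byte.
rotted = bytes([original[0] ^ 0x01]) + original[1:]

recovered = first_intact_copy([rotted, original], expected)
```

The same check is what you would run before cloning an old disc onto fresh media, so that a corrupted copy is never the one that gets duplicated.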

The blockchain is also based on the concept of a chain dependency. Each CD in my solution is a block of data. Each CD/block is chained to the previous CD/block. This chain dependency created by putting a previous block's hash into the following block is a fundamental concept of the blockchain. Of course most blockchain implementations have taken this concept much further.

First of all, the average time to compute the MD5 hash for one CD was over 15 minutes. Now consider two years' worth of CDs. If you were able to obtain all of those CDs in order to change every disk, you'd have to regenerate around a hundred hashes. At 15 minutes per hash, that is roughly a day of computing. If it took hours per hash, you'd need a lot more time and resources. In other words, the harder it is to compute the hash, the more computing power is needed in order to change that many pieces. Many blockchain implementations are designed so that checking existing blocks is quick, while creating new blocks of transaction data on the end of the blockchain is deliberately slow and expensive.
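The arithmetic is easy to check. This sketch assumes one disc per week and 52 weeks per year; the 3-hour figure is just a hypothetical stand-in for "hours per hash".

```python
WEEKS_PER_YEAR = 52

def rehash_hours(years, minutes_per_hash):
    """Hours of computing needed to regenerate one hash per weekly disc."""
    discs = years * WEEKS_PER_YEAR
    return discs * minutes_per_hash / 60

at_15_minutes = rehash_hours(2, 15)  # 104 discs * 15 min = 26.0 hours
at_3_hours = rehash_hours(2, 180)    # 104 discs * 3 h = 312.0 hours (13 days)
```

Making each hash twelve times slower makes rewriting the whole chain twelve times slower too, which is the lever described above.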

With my data archival solution we kept two copies of the chain of CDs at a trusted third-party storage vendor. This is what is known as a "centralized" data store. There are two bad words I just used: "trusted" and "centralized". Blockchain has improved the concept by distributing copies of the blocks across the internet. No longer is "trust" required in any one entity. The idea is the opposite. You don't trust anyone. You share the blockchain with everyone. The multiple copies of the ever-growing blockchain make it incredibly hard (nearly impossible) for anyone to intentionally tamper with it, or for the blockchain to get completely lost or destroyed. If it was hard to obtain 100 CDs and their copies from a centralized location, imagine how hard it would be to get hold of millions of copies spread around the world.

There are a lot more details around the implementation of blockchain protocols that help deal with the lack of trust across the worldwide network and various other specifics around the hooks into the actual applications that use the blockchain. Many of these details are actually quite interesting little problems that have some of their own elegant solutions.

This concept of the blockchain in its purest form (blocks of data linked through hashes) seems so simple it is hard to believe that this hasn't been used thousands of times before it was formally "coined" in 2008. Unfortunately I didn't have the vision to make the connection to the distributed blockchain twenty years ago; and then I largely ignored it for almost 10 years after it surfaced. I regret that. I am definitely ready to start supporting and designing applications that are supercharged via the blockchain.

