Git: The other blockchain

Part 1: Blockchain

Blockchain is the technology behind Bitcoin and other cryptocurrencies. That's about all anyone outside the software industry knows about it; that and the fact that lots of people are claiming that it's going to transform everything. (The financial industry, the Web, manufacturing supply chains, identity, the music industry, ... the list goes on.) If you happen to be in the software industry and have a moderately good idea of what blockchain is, how it works, and what it can and can't do, you may want to skip to Part 2.

Still with me? Here's the fifty-cent summary of blockchain. Blockchain is a distributed, immutable ledger. Buzzword is a buzzword buzzword buzzword? Blockchain is a chain of blocks? That's closer.

The purpose of a blockchain is to keep track of financial transactions (that's the "ledger" part) and other data by making them public (that's half of the "distributed" part) and keeping them in blocks of data (that's the "block" part) that can't be changed (that's the "immutable" part, and it's a really good property for a ledger to have) and are linked together by hashes (that's the "chain" part, and we'll get to what hashes are in a moment), with the integrity of that chain guaranteed by a large group of people (that's the other half of the "distributed" part) called "miners" (WTF?).

Let's start in the middle: how can we link blocks of data together so that they can't be changed? The first step is to make sure that any change to a block, or to the order of the blocks, can be detected. Then, the fact that everything is public makes the data impossible to change without that change being glaringly obvious. We do that with hashes.

A hash function is something that takes a large block of data and turns it into a very long sequence of bits (which we will sometimes refer to as a "number", because any whole number can be represented by a sequence of binary digits, and sometimes as a "hash", because the data has been chopped up and mashed together like the corned beef hash you had for breakfast). A good hash function has two important properties:

It's irreversible. Starting with a hash, it is effectively impossible to construct a block of data that will produce that hash. (It is significantly easier to construct two blocks with the same hash, which is why the security-conscious world moves to larger hashes from time to time.)
It's unpredictable. If two blocks of data differ anywhere, even by a single bit, their hashes will be completely different.

Those two together mean that if two blocks have the same hash, you can safely assume they contain the same data. If somebody sends you a block and a hash, you can compute the hash of the block yourself and compare it with the one you were sent; if they match, you can be confident that the block hasn't been damaged or tampered with before it got to you. And if they also cryptographically sign that hash, you can be certain that whoever signed it holds the key that created that signature.
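Here is a minimal sketch of those two ideas, using SHA-256 from Python's standard library. The sample messages and the helper names (`sha256_hex`, `is_intact`) are made up for illustration; nothing here is specific to any particular blockchain.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 hash of data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def is_intact(block: bytes, expected_hash: str) -> bool:
    """Recompute the hash of a received block and compare it to the one we were sent."""
    return sha256_hex(block) == expected_hash


original = b"Pay Alice 10 coins"
altered = b"Pay Alice 11 coins"  # differs from the original by one character

hash_original = sha256_hex(original)
hash_altered = sha256_hex(altered)

# Count how many of the 64 hex digits differ between the two hashes.
differing_digits = sum(a != b for a, b in zip(hash_original, hash_altered))

print(hash_original)
print(hash_altered)
print(differing_digits, "of 64 hex digits differ")
print(is_intact(original, hash_original))  # True: the block matches its hash
print(is_intact(altered, hash_original))   # False: the block was changed
```

Note that the check only tells you the block matches the hash you were given; whether you can trust that hash is what the signature is for.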

Now let's guarantee the integrity of the sequence of blocks by chaining them together. Every block in the chain contains the hash of the previous block. If block B follows block A in the chain, B's hash depends in part on the hash of block A. If a villain tries to insert a forged transaction into block A, its hash won't match the one in block B.
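The chaining idea fits in a few lines. This is a toy sketch, not how any real blockchain lays out its blocks: the dictionary fields (`prev_hash`, `data`), the JSON encoding, and the function names are all assumptions chosen to keep the example short.

```python
import hashlib
import json


def block_hash(block: dict) -> str:
    # Hash a canonical JSON encoding so the same block always gives the same hash.
    encoded = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def make_chain(payloads):
    """Build a list of blocks, each one recording the hash of the block before it."""
    chain = []
    prev = "0" * 64  # placeholder: the first block has no predecessor
    for data in payloads:
        block = {"prev_hash": prev, "data": data}
        chain.append(block)
        prev = block_hash(block)
    return chain


def first_broken_link(chain):
    """Return the index of the first block whose prev_hash doesn't match, or None."""
    for i in range(1, len(chain)):
        if chain[i]["prev_hash"] != block_hash(chain[i - 1]):
            return i
    return None


chain = make_chain(["A pays B 5", "B pays C 2", "C pays A 1"])
print(first_broken_link(chain))  # None: every link checks out

chain[0]["data"] = "A pays B 500"  # a villain forges the first block
print(first_broken_link(chain))  # 1: block 1 no longer points at block 0's hash
```

To hide the forgery, the villain would have to rewrite every later block as well, and everyone holding a copy of the chain would see the hashes change.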

Now we get to the part that makes blockchain interesting: getting everyone to agree on which transactions go into the next block. This is done by publishing transactions where all of the miners can see them. The miners then get to work with ~~shovels and pickaxes~~ big fast computers, validating the transactions, putting them into a block, and then running a contest to see which of them gets to add their block to the chain and collect the associated reward. Winning the contest requires doing a lot of computation. It's been estimated that miners' computers collectively consume roughly the same amount of electricity as Ireland.
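In Bitcoin-style chains the contest is a brute-force search: miners keep changing a number in the block (usually called a nonce) until the block's hash falls below a target. The sketch below is a simplified stand-in that assumes the target is "the hash starts with a given number of zero hex digits"; the block contents and the `mine` function are hypothetical, and real difficulty levels are enormously higher.

```python
import hashlib


def mine(block_data: bytes, difficulty: int):
    """Try nonces 0, 1, 2, ... until the hash starts with `difficulty` zero hex digits."""
    prefix = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(block_data + str(nonce).encode("ascii")).hexdigest()
        if digest.startswith(prefix):
            return nonce, digest
        nonce += 1


# Each extra zero digit makes the search about 16 times longer on average.
nonce, digest = mine(b"block of transactions", difficulty=4)
print(nonce, digest)
```

Finding the nonce takes many attempts, but checking someone else's answer takes a single hash, which is what lets everyone verify the winner cheaply.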

There's more to it, but that's blockchain in a nutshell. I am not going to say anything about what blockchain might be good for besides keeping track of virtual money -- that's a whole other rabbit hole that I'll save for another time. For now, the important thing is that blockchain is a system for keeping track of financial transactions by using a chain of blocks connected by hashes.

The need for miners to do work is what makes the virtual money they're mining valuable, and makes it possible for everyone to agree on who owns how much of it without anyone having to trust anyone else. It's all that work that makes it possible to detect cheating. It also makes it expensive and slow. The Ethereum blockchain can handle about ten transactions per second. Visa handles about 10,000.

Part 2: The other blockchain

Meanwhile, in another part of cyberspace, software developers are using another system based on hash chains to keep track of their software -- a distributed version control system called git. It's almost completely different, except for the way it uses hashes. How different? Well, for starters it's both free and fast, and you can use it at home. And it has nothing to do with money -- it's a version control system.

If you've been with me for a while, you've probably figured out that I'm extremely fond of git. This post is not an introduction to git for non-programmers -- I'm working on that. However, if you managed to get this far, it does contain enough information to stand on its own.

Git doesn't use transactions and blocks; instead it uses "objects", but just like blocks each object is identified by its hash. Instead of keeping track of virtual money, it keeps track of files and their histories. And just as blockchain keeps a complete history of everyone's coins, git records the complete history of everyone's data.

Git uses several types of object, but the most fundamental one is called a "blob", and consists of a file, its size, and the word "blob". For example, here's how git identifies one of my Songs for Saturday posts:

git hash-object 2019/01/05--s4s-welcome-to-acousticville.html
957259dd1e41936104f72f9a8c451df50b045c57

Everything you do with git starts with the git command. In this case we're using git hash-object and giving it the pathname of the file we want to hash. Hardly anyone needs to use the hash-object subcommand; it's used mainly for testing and the occasional demonstration.
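You can reproduce what hash-object does with a few lines of Python. This sketch assumes the default SHA-1 object format and a file that git stores byte for byte (no line-ending conversion or other filters); the function name `git_blob_hash` is made up for the example.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Hash content the way git hashes a blob: the word "blob", the size, a NUL byte, then the data."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# The empty file has the same blob hash in every git repository.
print(git_blob_hash(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

To check a real file, read it in binary mode and pass the bytes to the function; the result should match the output of git hash-object for that file under the assumptions above.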

Git handles a directory (you may know directories as "folders" if you aren't a programmer) by combining the names, metadata, and hashes of all of its contents into a type of object called a "tree", and taking the hash of the whole thing.
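As a rough sketch, here is how a tree hash can be computed for the simplest case: a single directory holding only ordinary, non-executable files. It assumes the default SHA-1 object format; real trees also record subdirectories, executable files, and symbolic links with other mode values, which this example leaves out. The function names are made up.

```python
import hashlib


def git_object_hash(kind: bytes, body: bytes) -> bytes:
    """Return the raw 20-byte SHA-1 of a git object of the given kind."""
    header = kind + b" " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).digest()


def tree_hash(files: dict) -> str:
    """Hash a flat directory given as {file name: content bytes}."""
    body = b""
    # Entries are sorted by name; each holds a mode, the name, a NUL, and the raw blob hash.
    for name in sorted(files, key=lambda n: n.encode("utf-8")):
        blob_id = git_object_hash(b"blob", files[name])
        body += b"100644 " + name.encode("utf-8") + b"\0" + blob_id
    return git_object_hash(b"tree", body).hex()


# The empty tree also has a well-known hash.
print(tree_hash({}))  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904

before = tree_hash({"notes.txt": b"hello\n", "todo.txt": b"write post\n"})
after = tree_hash({"notes.txt": b"hello!\n", "todo.txt": b"write post\n"})
print(before != after)  # True: changing one file changes the hash of the directory
```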

Here, by the way, is another place where git really differs from blockchain. In a blockchain, all the effort of mining goes into making sure that every block points to its one guaranteed-unique correct predecessor. In other words, the blocks form a chain. Files and directories form a tree, with the ordinary files as the leaves, and directories as branches. The directory at the top is called the root. Top? Top. For some reason software trees grow from the root down. After a while you get used to it.

Actually, that's not quite accurate, because git stores each object in exactly one place, and it's perfectly possible for the same file to be in two different directories. This can be very useful -- if you make a hundred copies of a file, git only has to store one of them. It's also inaccurate because trees, called Merkle trees, are used inside of blocks in a blockchain. But I digress.
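Storing each object in exactly one place falls out naturally when the hash is used as the storage key. The toy `ObjectStore` class below is hypothetical and keeps everything in a dictionary in memory; git itself writes objects to disk, but the deduplication works the same way in principle.

```python
import hashlib


class ObjectStore:
    """A toy content-addressed store: the key for each object is the hash of its contents."""

    def __init__(self):
        self.objects = {}

    def put(self, content: bytes) -> str:
        header = b"blob " + str(len(content)).encode("ascii") + b"\0"
        key = hashlib.sha1(header + content).hexdigest()
        self.objects[key] = content  # storing the same content again reuses the same key
        return key


store = ObjectStore()
keys = {store.put(b"the same file contents\n") for _ in range(100)}

print(len(keys))           # 1: all hundred copies have the same hash
print(len(store.objects))  # 1: so only one object is stored
```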

Technically the hash links in both blockchains and git form a directed acyclic graph -- that means that the links all point in one direction, and there aren't any loops. In order to make a loop you'd have to predict the hash of some later block, and you just can't do that. I have another post about why this is a good thing.

And that brings us to the things that make git, git: commits. ("Commit" is used in the same sense, more or less, as it is in the phrase "commit something to memory", or "commit to a plan of action". It has very little to do with crime. Hashes are even more unique than fingerprints, and we all know what criminals think about fingerprints. In cryptography, the hash of a key is called its fingerprint.)

Anyway, when you're done making changes in a project, you type the command

git commit

... and git will make a new commit object which contains, among other things, the time and date, your name and email address, maybe your cryptographic signature, a brief description of what you did (git puts you into your favorite text editor so you can enter this if you didn't put it on the command line), the hash of the current root, and the hash of the previous commit. Just like a blockchain.
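The sketch below builds the text of a simple, unsigned commit and hashes it. It assumes the default SHA-1 object format, uses the same person as author and committer, and fixes the time zone at +0000; the name, email address, timestamps, and messages are invented for the example.

```python
import hashlib

EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # hash of a tree with no entries


def commit_hash(tree, parent, author, timestamp, message):
    """Hash a minimal commit: root tree, optional parent commit, who, when, and why."""
    lines = [f"tree {tree}"]
    if parent is not None:
        lines.append(f"parent {parent}")  # the link back to the previous commit
    lines.append(f"author {author} {timestamp} +0000")
    lines.append(f"committer {author} {timestamp} +0000")
    body = ("\n".join(lines) + "\n\n" + message + "\n").encode("utf-8")
    header = b"commit " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


who = "A. Hacker <hacker@example.com>"  # hypothetical author
first = commit_hash(EMPTY_TREE, None, who, 1546300800, "Initial commit")
second = commit_hash(EMPTY_TREE, first, who, 1546304400, "Second commit")

print(first)
print(second)  # depends on `first`, so rewriting history changes every later hash
```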

Unlike earlier version control systems, git never has to compare files; all it has to do is compare their hashes. This is fast -- git's hashes are only 20 bytes long, no matter how big the files are or how many are in a directory tree. And if the hashes of two trees are the same, git doesn't have to look at any of the blobs in those trees to know that they are all the same.
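Here is a small sketch of why comparing hashes saves work. Directories are modeled as nested dictionaries and files as byte strings; the hashing is deliberately simplified and is not git's actual object format, and both function names are made up. The point is the early return: when two subtrees have the same hash, nothing beneath them is visited.

```python
import hashlib


def hash_node(node):
    """Return (hash, children) for a file (bytes) or a directory (dict of name -> node)."""
    if isinstance(node, bytes):
        return hashlib.sha1(b"blob\0" + node).hexdigest(), None
    children = {name: hash_node(child) for name, child in node.items()}
    listing = "".join(f"{name}:{children[name][0]}\n" for name in sorted(children))
    return hashlib.sha1(b"tree\0" + listing.encode("utf-8")).hexdigest(), children


def changed_paths(old, new, prefix=""):
    """List the paths that differ between two hashed trees."""
    old_hash, old_children = old
    new_hash, new_children = new
    if old_hash == new_hash:
        return []  # identical subtree: skip everything inside it
    if old_children is None or new_children is None:
        return [prefix or "."]  # a file changed (or a file became a directory)
    paths = []
    for name in sorted(set(old_children) | set(new_children)):
        path = f"{prefix}/{name}" if prefix else name
        if name not in old_children or name not in new_children:
            paths.append(path)  # added or removed
        else:
            paths.extend(changed_paths(old_children[name], new_children[name], path))
    return paths


old = {"README": b"hi", "docs": {"guide": b"long text"}, "src": {"a.py": b"1", "b.py": b"2"}}
new = {"README": b"hi", "docs": {"guide": b"long text"}, "src": {"a.py": b"1", "b.py": b"3"}}

print(changed_paths(hash_node(old), hash_node(new)))  # ['src/b.py']
```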

@ Blockchain 101 — only if you ‘know nothing’! – Hacker Noon
@ When do you need blockchain? Decision models. – Sebastien Meunier
@ Git - Git Objects
@ git ready » how git stores your data
@ Git/Internal structure - Wikibooks, open books for an open world
@ Why Singly-Linked Lists Win* | Stephen Savitzky