I, like a lot of developers these days, really enjoy working with Git. It just makes sense to me. However, not every shop has bought in. Some use Subversion, Perforce, Mercurial, or Fossil, to name a few, and there are more that I have not worked with. Git, though, I have used enough to appreciate. While working with other SCMs, I’ve faced issues where I find myself wishing for one (or many) of Git’s features.
Working with Perforce the other day, I started editing a file. It was an unfamiliar codebase and I just wanted to get an idea of the flow of the application. When I hit save, however, I got a ‘readonly’ message. I could do p4 open, or ignore the message and save anyway. I may have to do this several times, and it’s more than likely that none of the edits I make will end up in the committed code. This is quite a hindrance. So I continued ignoring the readonly message, saving files, and deleting the changes afterwards.
The thing I like about Git is that it stays out of your way for the most part, until it’s time to commit. So I can edit to my heart’s content and do a git status to see if I left anything in the code that shouldn’t be there, and quickly.
That being said, I wanted to build a little tool to abstract away some of the differences between other SCMs. Things like a git status, amongst others. The problem, however, is that I have a hard time finding the time to build such a tool. Also, I haven’t really figured out all of the functionality I would like this tool to have.
Then I came across a video, “Source Control Made Easy”, by Jim Weirich, a man well known in the Ruby community, who recently passed away. I liked his teaching style and feel I’ve learned a lot from his talks. One of my personal favorites is a talk on testing called “Roman Numerals Kata”. I didn’t know Jim, but it seems like he would have been a fun person to be friends with.
“Source Control Made Easy” is kind of a talk about Git, but not directly. Or at least it doesn’t seem that way at first. The following is part of the description for this video:
“In this 49-minute screencast, Jim Weirich takes you on a journey of how you might design and build a source control system from scratch. Along the way you’ll gain a deeper understanding of the first principles behind systems like Git, so things begin to make more sense.”
I highly recommend this video to anyone interested in learning about not only Git, but understanding the principles of source control in general. Also, note that the great people over at the Pragmatic Programmers are donating 100% of the purchase price to Jim’s family.
So, as a tribute to Jim Weirich, I decided to take a shot at implementing the source control system he talks about in Ruby, and to incorporate some test-driven development as well.
I’ll start by writing out some quick user stories.
With those stories in place to drive my development, I wanted to take a few minutes up front to think about or pseudo code a quick possible implementation. I don’t really have much in terms of expectations here. I just want to get some of the ideas into focus.
calling initialize from within a directory should:
create new hidden directory .esc
create new sqlite database to store what?
create HEAD file which contains hash (manifest filename) of the latest snapshot (empty initially) (maybe a db entry)
create metadata file (metadata) which will contain the manifest hash, snapshot author (name, email), timestamp, comments, current head (parent to this snapshot)
create manifest file (manifest) which will contain hashes of the files in the snapshot, maps the hashes to original filenames and directories
get a list of all files with paths in the working directory
iterate through the list, calculate the file hash
search the repository for a file with the calculated hash
if found, just add the hash and the filename with pathname to the manifest file
if not found
check if we need to create a directory for the file (create hash directories: a..z maybe)
copy the file to the repository directory
add the hash and the original filename with pathname to the manifest file
calculate the hash of the manifest file and rename it from manifest to hash
calculate the hash of the metadata file and rename it from metadata to hash; update metadata file with snapshot hash
update HEAD to point to this snapshot (metadata filename)
check .esc for the metadata file (version number/hash)
if metadata file is not found, fail and inform the user
if found, get the manifest hash
print the metadata info out to the console
open the manifest
scan the file line by line,
get the actual path/filename for an entry and see if it exists in the working directory
if it doesn’t, just copy the file changing the hash to the filename and placing it in the correct directory
if it does, calculate the hash for the file in the working directory
if the hash is the same, don’t do anything with that file
if the hash is different, overwrite the existing file with the file from the repository
One more thing before we get on with the actual coding. I am trying to keep this simple. That being said, I may not adhere to any strict standards or practices. I will try to point out where I deviate from them as I go. This will free me up to:
Write code as fast as possible since it’s been hard enough to find the time to write these days.
Refactoring would be a great exercise for any reader who would like to continue this project. Ideally, I will do a follow up post where I refactor. I want to code this almost raw, and think about things as I go almost like the Roman Numeral Kata.
…but before we actually start writing the application code, let’s get our initial testing in place. Create a directory, and call it whatever you want. I am calling mine: custom_source_control.
Now open your favorite editor, create a new file named custom_source_control.rb and add the following to it.
So if you aren’t familiar with minitest, you should continue reading the post. If you like what you’ve seen, check out the README over on GitHub. Basically, we just described the first thing we’d like to test. The DSLs that people write using Ruby are great, and this almost reads like English (or some weird robot form of it). Let’s just look at the words between the quotes:
‘when a repository is initialized’, ‘must create a new hidden directory named .esc’
Another thing to point out is the before block. We can pretty much infer that the before block will run before any of our tests. Then there is that require 'minitest/autorun' thing at the top. That just makes the tests run when we execute the ruby file. Let’s make the ruby file executable, and execute it.
Here we gave the script the ability to be executed as a command. It ran the script… and failed, kind of.
Actually, this is Ruby telling us that we don’t have a constant CustomSourceControl in our script, but we’re acting as if we did. CustomSourceControl is going to be our class, so we’ll need to add it. We are going to write our actual implementation code above our tests and everything else, but below the shebang. Wait, what’s a ‘shebang’? It’s that little #!/usr/bin/env ruby at the top of our file. Remember the chmod u+x custom_source_control.rb we just did? Well, chmod u+x custom_source_control.rb tells the operating system the file is executable, and #!/usr/bin/env ruby tells it to use ruby to execute it.
Note the use of ..., do not type it in the editor. This is just me saying “more text may come before or after”.
We now get an error (denoted by the E) that the CustomSourceControl class doesn’t have a repository_exists? method. We will just add that method and retest:
We are getting an actual failure (denoted by the F) now. I mean, where do we get off expecting true to be returned from the repository_exists? method? We haven’t even implemented it, so of course it’s going to return nil…
So let’s implement it.
…and we’re passing (denoted by the .)!
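For reference, here is a minimal sketch of what the whole file might look like at this point. It is my reconstruction from the descriptions above rather than the exact listing, and it assumes a reasonably recent minitest, where expectations are wrapped in _() (older releases also accepted the bare @csc.repository_exists?.must_equal true form).

```ruby
#!/usr/bin/env ruby
require 'minitest/autorun'

# Implementation code sits above the specs, but below the shebang.
class CustomSourceControl
  # Hard-coded on purpose: just enough to turn the failure into a pass.
  def repository_exists?
    true
  end
end

describe CustomSourceControl do
  # Runs before every `it` block, so each spec gets a fresh instance.
  before do
    @csc = CustomSourceControl.new
  end

  describe 'when a repository is initialized' do
    it 'must create a new hidden directory named .esc' do
      _(@csc.repository_exists?).must_equal true
    end
  end
end
```

Because of require 'minitest/autorun', the specs run as soon as the file is executed.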
Seriously? We passed? Yeah… and I want you to know that I realize just returning true doesn’t mean that the repository actually exists. I know this for a few reasons, but let’s take the most straightforward way and prove this.
ls -la, if you don’t know or couldn’t tell, just lists all the files in a directory. We need the a in order to see all files, including the hidden ones. On *nix, files/directories that begin with a . are hidden.
Nope, no .esc directory here… So what did that prove? The testing stuff? Why do it? We are taking a very systematic and pragmatic approach here. This testing stuff is good for a few reasons. We will see this come to light a bit later. For now though, please just accept it.
With this little bit of testing so far, we’ve really just exercised our minds as well as minitest. As our code grows and we move on to other projects, we will likely forget all the intricacies of what we wrote. Our tests here should give us a bit of confidence, even this early on in our development cycle. Also, our code is small and easy to manually test, so we know minitest is doing its job.
Let’s really implement the repository_exists? method now.
…and we’re failing again :(, that rush I got from passing never quite lasts long. Keep calm and carry on. It’s good that we’re failing, we should be failing. We have yet to create our .esc directory. Now I don’t know about you, but I want to get to passing again. I got that itch now…
There is nothing crazy going on here, just calling the mkdir (make directory) method on the Dir class. But guess what, we’re passing again. If we check the file system, we see our newly added directory.
Let’s move on to our next test.
What? Why do we have 2 errors? We were just passing. If we look at our initial test we can see that the .esc directory already exists. We have to clean up after ourselves like good TDD citizens. First let’s manually remove the existing .esc directory. Then let’s add some code to our tests that will clean up after each test is run.
We added the require 'fileutils' right above our class CustomSourceControl statement, and added an after block inside our main describe block, above our before block. Was that confusing? If so, you can double check your work with the project I have hosted on GitHub.
Rerun our tests and this time our initial test is passing again. We can now deal with the error we are seeing. I don’t know about you, but I am finding this process very helpful. It’s an iterative approach, one that you likely do anyway, just without the testing.
So, if history is any indicator of things to come (and we read the error message), we know the next step is to create the head_exists? method in our CustomSourceControl class.
We could take the same approach we took in our previous test and return true to start, then test, fail, refactor, etc… This would be the right approach, but I’ll leave that for you to do on your own. Get comfortable with the process and messages.
Once you’ve run through the exercise, after a few iterations, you should have a method similar to the repository_exists? method except instead of creating a directory, we’re going to create a file.
…but wait! Why is this, you ask? Well, we only create .esc when we call repository_exists?, and then after each test, it is removed if it exists. So .esc doesn’t exist anymore.
Let’s think about our story for a second. What we really care about, from a high level, is repository initialization. So let’s skip this test and refactor a bit.
That S tells us that we’re skipping a test. Update the CustomSourceControl class with the following:
We’ve created an initialize_repository method. We’ve moved both the .esc directory creation and the HEAD file creation out of the xxx_exists? methods and into initialize_repository. This all makes sense. The xxx_exists? methods should only be responsible for checking that something actually exists, not creating anything. The purpose of initialize_repository, on the other hand, is to handle the tasks involved in initializing the repository. One of those tasks is creation of the repository structure.
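A minimal sketch of that split might look like the following. The method names come from this post; the REPOSITORY and HEAD constants and the use of FileUtils.touch are my own assumptions for illustration.

```ruby
require 'fileutils'

class CustomSourceControl
  REPOSITORY = '.esc'                      # assumed constant name
  HEAD = File.join(REPOSITORY, 'HEAD')     # assumed constant name

  # Creates the repository structure: the hidden directory and an empty HEAD.
  def initialize_repository
    Dir.mkdir(REPOSITORY) unless File.directory?(REPOSITORY)
    FileUtils.touch(HEAD)
  end

  # The xxx_exists? methods only check; they never create anything.
  def repository_exists?
    File.directory?(REPOSITORY)
  end

  def head_exists?
    File.file?(HEAD)
  end
end
```

Everything is relative to the current working directory, which is why the specs need to clean up .esc after each run.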
If we run this now what do you think will happen?
Well, we fail since we haven’t actually called the initialize_repository method anywhere in our code. Is that what you guessed? So where should we call the initialize_repository method? If you guessed ‘in our before block’, you win.
Great, we’re passing again. Let’s remove the skip statement from the test and rerun.
It looks like we’re missing that head_contents method. We should add that.
You did it! You can now initialize a new repository! Let’s move on to our next story…
If you recall from our brief pseudo code, we’ve already kind of mapped out a few steps. If you don’t recall that, try again, try harder, or just reread that part above.
Let’s update our before block to include a new snapshot method.
As you might have guessed, it failed since we haven’t actually created a snapshot method. You also probably guessed that creating it is going to be our next step. And you’d be right.
…and we’re passing. Now, let’s create a new describe block for our snapshot story and test the existence of a metadata file. The test will be inside our main describe CustomSourceControl do block, outside and below the describe 'when a repository is initialized' do.
Which will undoubtedly fail…
Now, let’s make it pass.
First, we are adding the necessary metadata_exists? method and checking that a file .esc/__metadata__ actually exists. It won’t, and has to be created as part of the snapshot so we add that code to our snapshot method.
Here is a homework assignment:
Follow the same process for the manifest file test/creation. Make sure you follow the process as you go.
It should look something like this:
(method to test existence)
…and last but not least:
(actually create the file)
Our next step is to get a list of files in the current working directory.
Notice how I skipped running the test and went straight to implementation? Well I didn’t actually skip the testing part. I just didn’t write it here. Keep this in mind. You should be testing as often as possible. Get familiar with the messages and try to understand what they are telling you is wrong.
We moved pretty quickly in that last cycle of: write a test, watch it fail, write code to make it pass. The last part in that cycle which I have not done (for the most part) is refactor. It’s called Red, Green, Refactor. Refactoring is an important part of the cycle and I normally wouldn’t skip over it. I am doing so here however to get you familiar with the other parts of the cycle with the intention that we will revisit and refactor in another post. I mentioned this before, but want to reiterate the point here.
Let’s create our SHA1 file hashes.
(write a test)
(watch it fail)
(write code to make it pass)
What did we just do here? First, we are adding openssl, which provides the methods necessary to hash files. In the cwd_hashes method, we’re creating an instance of OpenSSL::Digest::SHA1 and later using the hexdigest it provides to hash files in the current working directory.
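As a rough sketch of that idea, the method below hashes every regular file under the current working directory. It is written as a standalone method here rather than inside the class, and the use of Dir.glob (which skips hidden entries such as .esc by default) is an assumption on my part.

```ruby
require 'openssl'

# Returns { 'relative/path' => 'sha1 hex digest' } for every regular file
# below the current working directory.
def cwd_hashes
  sha1 = OpenSSL::Digest::SHA1.new
  files = Dir.glob('**/*').select { |path| File.file?(path) }.sort
  files.each_with_object({}) do |path, hashes|
    # hexdigest(data) resets the digest around each call,
    # so a single instance can be reused for every file.
    hashes[path] = sha1.hexdigest(File.binread(path))
  end
end
```

File.binread is used so the bytes are hashed exactly as they are stored on disk.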
We’re getting the wrong hash, and that’s because we’re actually updating the very file we’re hashing/testing. We can never (at least I can’t think of a clever way) pass like this. We need to account for this and work around it. What we’re going to do is create two files manually, sha1sum them and remove all but those two files from the hash returned from the cwd_hashes method.
Here I am going to use the cat command and paste the text in. You can use any means you’re comfortable with to create the files and add the text to them. It is important, however, that you add the empty newline by hitting the enter key at the end of the sentence. I am also assuming that you have shasum or sha1sum installed. If you don’t, you can safely skip over that command.
Copy and paste in the following:
Then type ctrl+c to quit. Run shasum or sha1sum to get the file’s hash.
Repeat the process for the second test file.
Now let’s update our test.
This gets us passing the test in question, but now we’re failing a previous test. If we inspect the message we see that the newly introduced files are causing the gets a list of files in the current working directory test to fail. We can simply add the new file names to the array of actual file names.
Now we need our list of files in the repository.
Ok, now that we have our list of existing files in the repository, we can compare what is new with what already exists and return both of those lists.
Let’s add that deltas method.
Here, we’re going through our current working directory hashes and checking if any exist in the repository. If they do, we add them to the existing array. If they do not, we add them to the new array. Then we create a hash with the keys :new & :existing, add the arrays, and return that hash.
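A compact sketch of that comparison is shown below. I am assuming the working directory hashes arrive as a { path => hash } Hash and the repository files as an array of file names; the version in this post reads both from the class itself rather than taking arguments.

```ruby
# cwd_hashes: { 'path/in/working/dir' => 'sha1' }
# repo_files: names of the files already stored in the repository
def deltas(cwd_hashes, repo_files)
  # partition splits the pairs into [matching, non-matching]
  existing, fresh = cwd_hashes.partition { |_path, hash| repo_files.include?(hash) }
  { new: fresh.to_h, existing: existing.to_h }
end
```

Only the files under :new have content that still needs to be copied into the repository.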
I think the next step should be to add the files to the manifest, then based off the manifest copy and hash the files added to the snapshot.
Here we’re getting the deltas and writing them to the manifest file. We also added a helper method hash_for_file to return the hash of any file we pass in. I can see this coming in handy.
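To make the manifest format concrete, here is a simplified sketch. It assumes one line per file, made of the 40 character hash, a space, and the path, and it takes a plain list of paths instead of the deltas.

```ruby
require 'openssl'
require 'fileutils'

MANIFEST = File.join('.esc', '__manifest__')

# Returns the SHA1 hex digest of the file's contents.
def hash_for_file(path)
  OpenSSL::Digest::SHA1.hexdigest(File.binread(path))
end

# Writes one "<sha1> <path>" line per file into the working manifest.
def write_manifest(paths)
  FileUtils.mkdir_p(File.dirname(MANIFEST))
  File.open(MANIFEST, 'w') do |manifest|
    paths.sort.each { |path| manifest.puts "#{hash_for_file(path)} #{path}" }
  end
end
```

Putting the fixed-width hash first makes each entry easy to split apart again later.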
We’re going to need to read that manifest file back out, so let’s add that method.
If we take another look at this test we see @csc.write_manifest method call. Really this should be happening in the snapshot method itself. So let’s make that call in snapshot and remove it from the test.
Next we need to copy the files in the manifest to the repository directory
We’re introducing a few methods here so let’s take our time with this.
Add the verify_manifest helper method.
Now that we’re failing, let’s start implementing the verify_manifest method.
Here we’re reading the __manifest__ file, and for each entry we get the 40 character hash (entry[0...40]) and check the repo_files array (file names) for it.
This time we’re returning false, and it makes sense since we’re not actually copying the files just yet. So let’s work on implementing the copy_manifest_files_to_repository method.
There is quite a bit going on here. First, we open the __manifest__ file and break down each entry (line). What’s the deal with =~, and what are $1 and $2? =~ is the match operator in Ruby. It matches a string on one side against a regular expression on the other. It returns nil if a match is not found, and the position of the match if found. Also, if there is a match, $1, $2, …, $9 hold the capture groups (whatever is enclosed in the ()). That is how we break down the entry into a hash and pathname. For the actual copying we created a helper method copy_entry_to_repository.
That was fun, wasn’t it? Let’s take the same approach and add the copy_manifest_files_to_repository call to the snapshot method. This will allow us to remove it from the test as well. Make sure your test still passes before moving on.
…it didn’t pass, did it? Were you able to figure out why? Did you attempt to fix it? Here is what I did. Based on the failure:
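Here is a small, self-contained illustration of =~ and the capture variables. The ENTRY pattern and the parse_entry helper are hypothetical names of mine, and the pattern assumes entries of the form "<40 hex characters> <path>".

```ruby
# \h matches a single hexadecimal digit in Ruby regular expressions.
ENTRY = /\A(\h{40})\s+(.+)\z/

# Splits one manifest line into its hash and its pathname.
def parse_entry(line)
  # =~ returns nil when nothing matches, otherwise the match position.
  return nil unless line.chomp =~ ENTRY

  # $1 and $2 hold the text captured by the first and second ( ) group.
  { hash: $1, path: $2 }
end

position = 'abc' =~ /c/   # position of the match: 2
no_match = 'abc' =~ /x/   # no match: nil
```

Because the second group is greedy, a pathname containing spaces still ends up whole in $2.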
I went right to the gets a list of files in the current working directory test and saw that we’re only accounting for the HEAD file, which should always be in an initialized repository, and then the current working __manifest__ & __metadata__ files. This isn’t the case anymore since our snapshot method is doing more at this point. So what we really want is to make sure that at the point of this test at least those files exist. The must_include assertion provided by minitest is perfect for this.
Let’s update our gets a list of files in the current working directory test to the following:
Now we have to calculate the hash of the manifest file and rename it to the hash.
We added another helper method repository_file_exists?. It simply takes a file name and checks the repository for existence of the filename.
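The hash-and-rename step could be sketched as below. The name hash_and_rename_manifest is hypothetical (the method in this post is called hash_and_copy_manifest), and the sketch assumes the working manifest lives at .esc/__manifest__.

```ruby
require 'openssl'

REPOSITORY = '.esc'

# True when a file with the given name is stored in the repository.
def repository_file_exists?(name)
  File.file?(File.join(REPOSITORY, name))
end

# Renames .esc/__manifest__ to .esc/<sha1 of its contents>
# and returns that hash so the metadata file can refer to it.
def hash_and_rename_manifest
  manifest = File.join(REPOSITORY, '__manifest__')
  hash = OpenSSL::Digest::SHA1.hexdigest(File.binread(manifest))
  File.rename(manifest, File.join(REPOSITORY, hash))
  hash
end
```

Naming the manifest after its own hash means two snapshots with identical contents share a single manifest file.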
Now that we’re passing, let’s add the hash_and_copy_manifest method to the snapshot method and remove @csc.hash_and_copy_manifest from the test. Make sure you’re passing and move on.
We’re almost there. Next, we have to update the metadata file with the necessary info, then hash it.
Let’s clean up our test like we’ve done before. The @csc.write_metadata manifest_hash & @csc.hash_and_copy_metadata calls will happen in the snapshot method, so let’s delete them.
Now that we’re failing, let’s add the necessary calls to the snapshot method.
The last step for our snapshot story is to update HEAD to point to this snapshot (metadata filename).
Now refactor the update_head out of the test.
We expected the previous failure because we removed the update_head call and didn’t add it to snapshot. Then, we added the update_head to the snapshot method, but since that file is not empty anymore we’re failing our must create an empty HEAD file test. It looks like we’re going to have to refactor a bit more.
Let’s refactor the before block. We know all of our tests depend on @csc to be an instance of CustomSourceControl and they all need an initialized repository. The thing is our when a repository is initialized tests don’t require a snapshot. So let’s move that out and into a before block inside the when we take a snapshot tests.
…and our snapshot story is complete! On to our final story!
At some point we’re going to need a way to list all the snapshots csc knows about. One way to do this would be to get the HEAD snapshot then recursively scan through the metadata files and their parents all the way up to root, then just list them out. This might end up in a log subcommand. For now, I am trying to keep the functionality really basic. I am going to manually build up the repository with 2 snapshots, then pick the first snapshot to checkout.
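The history walk described above could be sketched roughly like this. It is not part of the code built in this post: metadata_field and snapshot_history are hypothetical helpers, it loops instead of recursing, and it assumes each metadata file contains a line such as Snapshot Parent: root (or a parent hash) and that HEAD holds the latest metadata hash.

```ruby
REPOSITORY = '.esc'

# Returns the value of a "Label: value" line in a snapshot's metadata file.
def metadata_field(snapshot, label)
  File.foreach(File.join(REPOSITORY, snapshot)) do |line|
    return $1 if line =~ /\A#{Regexp.escape(label)}:\s*(\S+)/
  end
  nil
end

# Follows the parent links from HEAD back to root, newest snapshot first.
def snapshot_history
  history = []
  current = File.read(File.join(REPOSITORY, 'HEAD')).strip
  until current.empty? || current == 'root'
    history << current
    current = metadata_field(current, 'Snapshot Parent').to_s
  end
  history
end
```

For this post, though, the hashes are collected by hand.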
To keep this testable, we’ll do this with a before block for this set of tests.
So that’s going to create the first snapshot. I am going to use the pry gem to suspend the test so that I can manually inspect the .esc directory. If you use it, make sure to type quit when you’re finished inspecting things.
There is a way you can accomplish this without having to install a gem. Add a call to ruby’s sleep method with a time of something long enough for you to carry the tasks out for yourself. That would look like this:
* Be sure to clean up after yourself by deleting the suspends the test so we can inspect the .esc directory test when you are finished.
Getting the hashes:
So if we inspect HEAD we see the metadata file hash.
We then take that file hash, which is the metadata file and inspect that:
This shows that this is the first snapshot, as denoted by the Snapshot Parent: root. So let’s take a look at the manifest next, Snapshot Manifest: 87b17efdc68c9c1d806c4bd05ce70d9baacd22bf
You can quit pry now. If you used the sleep method to do these tasks and it still hasn’t woken up and finished, just hit ctrl+c to kill the tests. Let’s add a new file test_file_3.txt and update an existing one test_file_2.txt:
We’re also going to want to clean up after ourselves again:
Ok let’s run the test again and this time make note of the hashes. For me, HEAD is 35d91f744401d8d4828c65bd65029dc07119d5a7. The metadata file (35d91f744401d8d4828c65bd65029dc07119d5a7) shows:
So let’s take the parent metadata 36e0583c25d5e8107538afa345122e9529b9d6fd and take a look:
Yep, 1b1ab1bef308608786e9a1ae2e30e370dd032939 that’s the one we want. Just to be sure, I ran through this process a few more times. I noticed that 2 of the hashes were changing, while the remaining files stayed the same. So I opened one of the files where the file hash had changed. I spotted the issue right away… The timestamp! Since the timestamp changed each time I ran it I was not getting a consistent set of hashes. In the spirit of keeping it simple, I am just going to change the timestamp to a constant value ‘2014-03-07 23:59:59 -0800’. This may seem hacky, and it is. :)
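To illustrate why the timestamp matters, here is a small sketch. The metadata_text and sha1 helpers and the Snapshot Taken label are hypothetical; only the Snapshot Manifest and Snapshot Parent labels and the fixed timestamp come from this post.

```ruby
require 'openssl'

FIXED_TIMESTAMP = '2014-03-07 23:59:59 -0800'

# Builds the text of a metadata file. The timestamp is a keyword argument
# so the fixed value can be swapped for a real clock outside the tests.
def metadata_text(manifest_hash, parent, timestamp: FIXED_TIMESTAMP)
  <<~METADATA
    Snapshot Manifest: #{manifest_hash}
    Snapshot Parent: #{parent}
    Snapshot Taken: #{timestamp}
  METADATA
end

def sha1(text)
  OpenSSL::Digest::SHA1.hexdigest(text)
end

# Identical text always hashes to the same value...
stable = sha1(metadata_text('abc', 'root')) == sha1(metadata_text('abc', 'root'))
# ...while any change to the timestamp changes the snapshot hash.
drifts = sha1(metadata_text('abc', 'root')) != sha1(metadata_text('abc', 'root', timestamp: Time.now.to_s))
```

Any byte that differs between runs, not just the timestamp, would have the same effect on the hashes.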
This time, our consistent hash is 3b9158d6cd90b07811496330d873d8a71651cd8b.
We can remove the suspends the test so we can inspect the .esc directory test.
You know the drill by now.
Here we’re reading the metadata file and getting manifest hash. Then we’re reading the manifest file and breaking down the entries again, this time calling copy_entry_to_working_directory method to copy the files from the repository to the current working directory.
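A condensed sketch of that flow is shown below. It assumes a flat repository in which every stored file sits directly under .esc and is named after its hash, a metadata line of the form Snapshot Manifest: <hash>, and manifest entries of the form "<hash> <path>". The method bodies are my approximation, not the exact listing.

```ruby
require 'fileutils'
require 'openssl'

REPOSITORY = '.esc'

# Copies one stored file back to its original path, unless the working
# copy already has the same content.
def copy_entry_to_working_directory(hash, path)
  return if File.file?(path) && OpenSSL::Digest::SHA1.hexdigest(File.binread(path)) == hash

  FileUtils.mkdir_p(File.dirname(path))
  FileUtils.cp(File.join(REPOSITORY, hash), path)
end

# Restores every file listed in the manifest of the given snapshot.
def checkout(snapshot)
  metadata = File.join(REPOSITORY, snapshot)
  raise ArgumentError, "unknown snapshot #{snapshot}" unless File.file?(metadata)

  manifest_hash = File.read(metadata)[/^Snapshot Manifest:\s*(\h{40})/, 1]
  raise ArgumentError, "no manifest recorded in #{snapshot}" if manifest_hash.nil?

  File.foreach(File.join(REPOSITORY, manifest_hash)) do |entry|
    copy_entry_to_working_directory($1, $2) if entry.chomp =~ /\A(\h{40})\s+(.+)\z/
  end
end
```

Files in the working directory that the manifest does not mention are left alone in this sketch.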
Ugh… another issue with the hashing. Not so much the hashing actually, but the fact that we’re editing the very file we’re trying to code/test, custom_source_control.rb. This test can never pass. So what are our options? Well, the first that comes to mind is to just run the script from another directory. We can do this by adding the directory we’re working in to our path. This actually would’ve solved an issue we faced earlier as well. However, I tried to avoid it to keep things simple.
First, let’s skip the current failing test.
Ok, we’re passing again so we can restructure a bit. Let’s get the current working directory.
Now let’s add it to our path so we can execute it from a different directory. We can even bring our command a little closer to the command Jim Weirich mentions: csc, by creating a symlink. Create a new directory (it can even be within the current directory), I am calling it test_dir. Then, let’s move the test files into the test_dir and change into that directory.
Some of our tests will fail since I did a few hackish things here and there. Again, I was trying to cut down on the amount of possible new concepts. Oh well… Rerun the tests and let’s see what we get.
Ok, so what are we working with here? Well, we no longer need to account for the custom_source_control.rb file. Let’s go update that. So this:
We can also remove the keep_if’s since we were trying to guard against anything but our control files (test_file_1.txt, test_file_2.txt). So this:
…and we’re back to passing! Let’s remove that skip statement and continue working on that last test. If you recall, we need to suspend the test long enough so that we can go through the metadata files and get our root snapshot.
For me, that hash is 485ac882b4e89e929584acdfed522499f0a45464. With that let’s update the test and run it.
For the win…
We… are… passing! Good job! I really enjoyed writing this post. I hope this was helpful for you. Just a few last things before you go.
How do I use this thing now that it’s built?
Well, while we have the methods to handle some of the functionality, we haven’t added the ability to pass arguments on the command line. You can add something very simple like the following code:
First change the require 'minitest/autorun' to require 'minitest/spec' and add the following to the bottom of the file.
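As one possible shape for that code, here is a minimal dispatch sketch. The subcommand names (init, snapshot, checkout), the run_command and usage helpers, and the assumption that checkout takes the snapshot hash as its argument are all mine.

```ruby
# Prints a short usage message and signals failure.
def usage
  warn 'usage: csc init | snapshot | checkout <snapshot hash>'
  false
end

# Maps the first command line argument onto a method of the given object.
def run_command(csc, argv)
  command, argument = argv
  case command
  when 'init'     then csc.initialize_repository
  when 'snapshot' then csc.snapshot
  when 'checkout'
    return usage if argument.nil?

    csc.checkout(argument)
  else
    usage
  end
end

# At the bottom of custom_source_control.rb this might be wired up as:
#   run_command(CustomSourceControl.new, ARGV)
```

Requiring 'minitest/spec' instead of 'minitest/autorun' keeps the specs from running every time a command is issued.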
You would then be able to call it from the command line like this:
You should notice the timestamp still shows 2014-03-07 23:59:59 -0800. You can remove that line of code, but the tests will fail again.
We don’t really clean up after ourselves so that functionality needs to be added.
Getting the checkout hash is also a manual process so that csc log functionality we talked about would come in handy.
We are not handling any types of errors, mind you
…so it’s not quite production ready.
What do I do next?
Some of the things I’d like to address in a future post include:
Separating the tests from the actual implementation code.
DRY’ing out our code. Many times I have had to fight the urge to do it in this post. I really wanted this information to be approachable by anyone though, so I didn’t use any gems, even minitest/given which was created by Jim Weirich.
Testing for more edge cases, and fixing any bugs we find.
Adding code coverage.
Adding the ability to handle command line arguments with OptionParser.
Adding tests and functionality to diff checkins.
Adding tests and functionality to list the history (metadata file hashes from head all the way back to root).
Possibly turning this into a gem.
Lastly I’d like to thank a few people for helping with this post. Austin Puri, thanks for running through this as a developer and giving some great feedback. Devon Mahnken, for catching a lot of spelling and English grammar mistakes. After all your corrections, for the first time I think my father was right about me being a robot. Really appreciate the help guys!
You can double check your work with the project I have hosted on GitHub.
You may have some questions that this didn’t quite answer. Feel free to email me or leave a comment.