Version Control with Git

Objectives

Know what version control is and why we use it
Understand how Git stores version history as a graph
Practice using Git as a solo programmer

Version control

Version control systems are essential tools of the software engineering world. More or less every project — serious or hobby, open source or proprietary — uses version control. Without version control, coordinating a team of programmers all editing the same project’s code will reach pull-out-your-hair levels of aggravation.

Version control means keeping track of multiple versions of a program or a document. You’ve probably already used some form of version control in your computing experience up to this point:

Dropbox has a version history, with old versions of files that you save in it
Google Docs has a version history for a document
Word processors and editors all have undo, which helps you go back to very recent versions of your editing

You may have even implemented some simple version control yourself, by keeping multiple copies of files with version numbers in the filename. Let’s explore a scenario of doing that in programming, to see what features we want for a good software version control system.

Suppose Alice is working on a problem set by herself.

[Figure: Alice’s working directory, containing Version 1 as hello.ts.]

She starts with one file hello.ts in her pset, which she works on for several days.

At the last minute before she needs to hand in her pset to be graded, she realizes she has made a change that breaks everything. If only she could go back in time and retrieve a past version!

A simple discipline of saving backup files would get the job done.

[Figure: Alice’s files: Version 1 is hello.1.ts, Version 2 is hello.2.ts, and Version 3 is hello.ts, marked HEAD.]

Alice uses her judgment to decide when she has reached some milestone that justifies saving the code. She saves the versions of hello.ts as hello.1.ts, hello.2.ts, and hello.ts. She follows the convention that the most recent version is just hello.ts. We will call the most recent version the head.

Now when Alice realizes that version 3 is fatally flawed, she can just copy version 2 back into the location for her current code. Disaster averted! But what if version 3 included some changes that were good and some that were bad? Alice can compare the files manually to find the changes, and sort them into good and bad changes. Then she can copy the good changes into version 2.

This is a lot of work, and it’s easy for the human eye to miss changes. Luckily, there are standard software tools for comparing text; in the UNIX world, one such tool is diff. A better version control system will make diffs easy to generate.
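To make the idea of a diff concrete, here is a minimal sketch of a line-by-line comparison based on the longest common subsequence. It is only an illustration, not how the UNIX diff tool or Git is implemented; the function name diffLines and the sample file contents are made up for this example.

```typescript
// Hypothetical helper: compare two versions of a file line by line.
// Lines only in the old version are marked "-", lines only in the new one "+".
function diffLines(oldText: string, newText: string): string[] {
  const a = oldText.split("\n");
  const b = newText.split("\n");

  // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..]
  const lcs: number[][] = [];
  for (let i = 0; i <= a.length; i++) {
    lcs.push(new Array<number>(b.length + 1).fill(0));
  }
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      lcs[i][j] = a[i] === b[j]
        ? lcs[i + 1][j + 1] + 1
        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }

  // Walk both versions, emitting unchanged, removed, and added lines.
  const result: string[] = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      result.push("  " + a[i]);
      i++;
      j++;
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      result.push("- " + a[i]);
      i++;
    } else {
      result.push("+ " + b[j]);
      j++;
    }
  }
  while (i < a.length) { result.push("- " + a[i]); i++; }
  while (j < b.length) { result.push("+ " + b[j]); j++; }
  return result;
}

// Made-up contents for Alice's version 2 and version 3.
const version2 = 'const greeting = "Hello";\nconsole.log(greeting);';
const version3 = 'const greeting = "Goodbye";\nconsole.log(greeting);';
console.log(diffLines(version2, version3).join("\n"));
```

With output like this, Alice can see at a glance which lines changed between two versions instead of comparing the files by eye.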

[Figure: the cloud and Alice each hold Version 1 (hello.1.ts), Version 2 (hello.2.ts), and Version 3 (hello.ts).]

Alice also wants to be prepared in case her laptop gets run over by a bus, so she saves a backup of her work in the cloud, uploading the contents of her working directory whenever she’s satisfied with its contents.

If her laptop ever gets flattened, Alice can retrieve the backup and resume work on the pset on a fresh machine, retaining the ability to time-travel back to old versions at will.

Furthermore, she can develop her pset on multiple machines, using the cloud provider as a common interchange point. Alice makes some changes on her laptop and uploads them to the cloud. Then she downloads onto her desktop machine at home, does some more work, and uploads the improved code (complete with old file versions) back to the cloud.

[Figure: Alice on her laptop holds Version 5L of hello.ts and Alice on her desktop holds Version 5D of hello.ts; both exchange files with the cloud.]

If Alice isn’t careful, though, she can run into trouble with this approach. Imagine that she starts editing hello.ts to create “version 5” on her laptop. Then she gets distracted and forgets about her changes. Later, she starts working on a new “version 5” on her desktop machine, including different improvements. We’ll call these versions “5L” and “5D,” for “laptop” and “desktop.”

When it comes time to upload changes to the cloud, there is an opportunity for a mishap! Alice might copy all her local files into the cloud, causing the cloud to contain version 5D only. Later Alice syncs from the cloud to her laptop, potentially overwriting version 5L, losing the worthwhile changes. What Alice really wants here is a merge, to create a new version based on the two version 5’s.
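One way to picture the merge Alice wants is a three-way comparison against the version both edits started from. The sketch below is a simplified model that works on whole files; real tools, Git included, merge at a finer granularity. The names ProjectFiles and mergeVersions, the file notes.txt, and all file contents are assumptions made up for illustration.

```typescript
// A version is modeled as a map from filename to file contents (an assumption of this sketch).
type ProjectFiles = Record<string, string>;

interface MergeResult {
  merged: ProjectFiles;
  conflicts: string[]; // files that both sides changed in different ways
}

// Hypothetical three-way merge at whole-file granularity.
function mergeVersions(base: ProjectFiles, left: ProjectFiles, right: ProjectFiles): MergeResult {
  const merged: ProjectFiles = {};
  const conflicts: string[] = [];
  const filenames = new Set([...Object.keys(base), ...Object.keys(left), ...Object.keys(right)]);
  for (const filename of Array.from(filenames)) {
    const b: string | undefined = base[filename];
    const l: string | undefined = left[filename];
    const r: string | undefined = right[filename];
    let chosen: string | undefined;
    if (l === r) {
      chosen = l;               // both sides agree
    } else if (l === b) {
      chosen = r;               // only the right side changed this file
    } else if (r === b) {
      chosen = l;               // only the left side changed this file
    } else {
      conflicts.push(filename); // both changed it: a human must decide
      chosen = l;               // keep the left version as a placeholder
    }
    if (chosen !== undefined) merged[filename] = chosen;
  }
  return { merged, conflicts };
}

const version4: ProjectFiles = { "hello.ts": "v4", "notes.txt": "todo" };
const version5L: ProjectFiles = { "hello.ts": "v5 laptop", "notes.txt": "todo" };
const version5D: ProjectFiles = { "hello.ts": "v5 desktop", "notes.txt": "done" };
const outcome = mergeVersions(version4, version5L, version5D);
console.log(outcome.conflicts);           // files needing manual attention
console.log(outcome.merged["notes.txt"]); // taken from the desktop version
```

Because hello.ts changed on both machines, the sketch reports it as a conflict rather than silently keeping one copy, which is exactly the mishap Alice wants to avoid.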

At this point, considering just the scenario of one programmer working alone, we already have a list of operations that should be supported by a version control scheme:

reverting to a past version
comparing two different versions
pushing full version history to another location
pulling history back from that location
merging versions that are offshoots of the same earlier version

Multiple developers

Now let’s add Bob, another developer, into the picture. The picture isn’t too different from what we were just thinking about.

[Figure: Alice holds Version 5A of hello.ts and greet.ts, and Bob holds Version 5B of hello.ts and greet.ts; both exchange files with the cloud.]

Alice and Bob here are like the two Alices working on different computers. They no longer share a brain, which makes it even more important to follow a strict discipline in pushing to and pulling from the shared cloud server. The two programmers must coordinate on a scheme for coming up with version numbers. Ideally, the scheme allows us to assign clear names to whole sets of files, not just individual files. (Code files often depend on other code files, so thinking about them in isolation has the risk of creating inconsistencies.)

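[Figure: Alice and Bob exchange files with the cloud, and each now keeps a log. Alice holds Ver. 5A of hello.ts and greet.ts, with log entries 1: Alice, 7pm, ...; ...; 4: Bob, 8pm, ...; 5A: Alice, 9pm, .... Bob holds Ver. 5B of hello.ts and greet.ts, with log entries 1: Alice, 7pm, ...; ...; 4: Bob, 8pm, ...; 5B: Bob, 9pm, ....]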

Merely uploading new source files is not a very good way to communicate to others the high-level idea of a set of changes. So let’s add a log that records for each version who wrote it, when it was finalized, and what the changes were, in the form of a short human-authored message.

Pushing another version now gets a bit more complicated, as we need to merge the logs. It’s easier to merge logs than to merge TypeScript files, since logs have a simpler structure – but without tool support, Alice and Bob will still need to merge their logs manually! We also want to enforce consistency between the logs and the actual sets of available files: for each log entry, it should be easy to extract the complete set of files that were current at the time the entry was made.

But with logs, all sorts of useful operations are enabled. We can look at the log for just a particular file: a view of the log restricted to those changes that involved modifying some file. We can also use the log to figure out which change contributed each line of code, or, even better, which person contributed each line, so we know who to complain to when the code doesn’t work. This sort of operation would be tedious to do manually; the automated operation in version control systems is called annotate (or, unfortunately, blame).
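Here is a rough sketch of the two log operations just described: a per-file view of the log, and a much-simplified annotate. The types LogEntry and FileVersion, both function names, and all the sample data are invented for illustration; real version control systems track lines far more carefully than this.

```typescript
// One entry in the shared log: who, when, which files changed, and why.
interface LogEntry {
  version: string;
  author: string;
  time: string;
  filesChanged: string[];
  message: string;
}

// View of the log restricted to changes that modified one file.
function logForFile(log: LogEntry[], filename: string): LogEntry[] {
  return log.filter(entry => entry.filesChanged.includes(filename));
}

// One saved version of a single file, oldest first when stored in an array.
interface FileVersion {
  author: string;
  lines: string[];
}

// Simplified annotate: credit each line of the newest version to the author of
// the oldest version in the unbroken run of versions that contain that exact line.
function annotate(versions: FileVersion[]): string[] {
  if (versions.length === 0) return [];
  const newest = versions[versions.length - 1];
  return newest.lines.map(line => {
    let author = newest.author;
    for (let i = versions.length - 2; i >= 0 && versions[i].lines.includes(line); i--) {
      author = versions[i].author;
    }
    return `${author}: ${line}`;
  });
}

// Made-up sample data.
const sampleLog: LogEntry[] = [
  { version: "1", author: "Alice", time: "7pm", filesChanged: ["hello.ts"], message: "Start the pset" },
  { version: "2", author: "Bob", time: "8pm", filesChanged: ["hello.ts", "greet.ts"], message: "Add greeting helper" },
];
console.log(logForFile(sampleLog, "greet.ts").map(entry => entry.version));

const helloVersions: FileVersion[] = [
  { author: "Alice", lines: ["const who = 'world';", "console.log(who);"] },
  { author: "Bob", lines: ["const who = 'world';", "console.log('hello ' + who);"] },
];
console.log(annotate(helloVersions).join("\n"));
```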

Multiple branches

It sometimes makes sense for a subset of the developers to go off and work on a branch, a parallel code universe for, say, experimenting with a new feature. The other developers don’t want to pull in the new feature until it is done, even if several coordinated versions are created in the meantime. Even a single developer can find it useful to create a branch, for the same reasons that Alice was originally using the cloud server despite working alone.

In general, it will be useful to have many shared places for exchanging project state. There may be multiple branch locations at once, each shared by several programmers. With the right set-up, any programmer can pull from or push to any location, creating serious flexibility in cooperation patterns.

The shocking conclusion

Of course, it turns out we haven’t invented anything here: Git does all these things for you, and so do many other version control systems.

Distributed vs. centralized

[Figure: a centralized collaboration graph with Dan, Carol, Alice, and Bob around the cloud.]

Traditional centralized version control systems like CVS and Subversion do a subset of the things we’ve imagined above. They support a collaboration graph – who’s sharing what changes with whom – with one primary server, and copies that only communicate with the primary server.

In a centralized system, everyone must share their work to and from the primary repository (often abbreviated to repo). Changes are safely stored in version control if they are in the primary repository, because that’s the only repository.

[Figure: a distributed collaboration graph with Dan, Carol, Alice, and Bob around the cloud.]

In contrast, distributed version control systems like Git and Mercurial allow all sorts of different collaboration graphs, where teams and subsets of teams can experiment easily with alternate versions of code and history, merging versions together as they are determined to be good ideas.

In a distributed system, all repositories are created equal, and it’s up to users to assign them different roles. Different users might share their work to and from different repos, and the team must decide what it means for a change to be in version control. Does a change in one programmer’s repo need to be shared with a designated collaborator or server before the rest of the team considers it official?

Version control terminology

Repository or repo: a local or remote store of the versions in our project
Working copy: a local, editable copy of our project that we can work on
File: a single file in our project
Version or revision: a record of the contents of our project at a point in time
Change or diff: the difference between two versions
Head: the current version

Features of a version control system

Reliable: keep versions around for as long as we need them; allow backups
Multiple files: track versions of a project, not single files
Meaningful versions: what were the changes, why were they made?
Revert: restore old versions, in whole or in part
Compare versions: to see what changed
Review history: for the whole project or individual files
Not just for code: prose, images, …

It should allow multiple people to work together:

Merge: combine versions that diverged from a common previous version
Track responsibility: who made that change, who touched that line of code?
Work in parallel: allow one programmer to work on their own for a while (without giving up version control)
Work-in-progress: allow multiple programmers to share unfinished work (without disrupting others, without giving up version control)

Git

The version control system we’ll use in 6.102 is Git. It’s powerful and worth learning. But Git’s user interface can be terribly frustrating. What is Git’s user interface?

In 6.102, we will use Git on the command line. The command line is a fact of life, ubiquitous because it is so powerful.
The command line can make it very difficult to see what is going on in your repositories. You may find SourceTree for Mac & Windows useful. On any platform, gitk can give you a basic Git GUI. Ask Google for other suggestions.

An important note about tools for Git:

Programming editors like VS Code often have built-in support for Git, with helpful icons or highlighting that shows (for example) which files you have changed and which files have not yet been added to the repo. However, we do not recommend using editor menus or plugins to run git commands like add, commit, or push. The course staff may not be able to help you if you run into problems.
GitHub makes desktop apps for Mac and Windows. Because the GitHub app changes how some Git operations work, if you use the GitHub app, course staff will not be able to help you.

Getting started with Git

On the Git website, you can find two particularly useful resources:

Pro Git documents everything you might need to know about Git.
The Git command reference can help with the syntax of Git commands.

The command line

One thing that makes learning Git harder for many students is that it’s a command-line program. If you’re not familiar with the command line, this can be confusing.

A command line is an interface to your computer, similar to the Mac Finder or Windows Explorer, except that it’s text-based. As the name implies, you interact with it through “commands”: each line of input begins with a command and has zero or more arguments, separated by spaces.
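As a small illustration of that structure, the sketch below splits one line of input into a command and its arguments. It is a deliberate simplification: real shells also handle quoting, escapes, and variables, and parseCommandLine is a name made up for this example.

```typescript
// Split one line of input into the command and its arguments.
// Simplification: words are separated by whitespace only; quoting is not handled.
function parseCommandLine(line: string): { command: string; args: string[] } {
  const words = line.trim().split(/\s+/).filter(word => word.length > 0);
  const [command = "", ...args] = words;
  return { command, args };
}

console.log(parseCommandLine("mkdir goodbye"));
console.log(parseCommandLine("git clone https://github.com/6031/ex05-hello-git hello-git"));
```

In the second line, git is the command, and clone, the URL, and hello-git are its three arguments.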

Make sure you’re using the right command line

On macOS and Linux, open your Terminal application.

On Windows, make sure you open Git Bash.

Don’t use the Windows Command Prompt; it has a different command syntax, not the Unix shell syntax that is used in Mac and Linux.
Don’t use the Windows Subsystem for Linux either; although it does use Unix shell syntax, it differs from the Unix shell in some other important ways (like the line-ending convention in text files) that will cause problems for you in this class.

Common commands

You will have to be able to move around on the command line. Here are the most important commands for doing that.

cd (stands for “change directory”)

Changes the current directory. If you’re in a directory that has a subdirectory called hello, then cd hello moves into that subdirectory.

Use cd .. to move to the parent directory of your current directory.

pwd (“print working directory”)

Prints out the current directory, if you’re not sure where you are.

On a well-configured system, your current directory is displayed as part of the prompt that the system shows when it’s ready to receive a command. If that’s not the case on your system, ask for help configuring your prompt.

ls (“list”)

Lists the files in the current directory.

Use ls -l for extra information (a “long” listing) about the files. Use ls -a (stands for “all”) to show hidden files, which are files and subdirectories whose names begin with a period.

mkdir (“make directory”)

Creates a new directory in the current directory. To create a directory called goodbye, use mkdir goodbye.

up arrow and down arrow

Use up arrow to put the command you just ran back on the command line. You can now edit that command to fix a typo, or just press enter to run it again.

Use the up and down arrow keys to navigate through your history of commands, so you never have to re-type a long command line.

Ctrl+A and Ctrl+E

Use Ctrl+A to bring your cursor to the beginning of a line or Ctrl+E to bring your cursor to the end of a line. (Home or End may also work, if your keyboard has those keys and your terminal prompt interprets them correctly.)

You may find these shortcuts useful for editing long commands, to avoid pressing the left or right arrow keys to move your cursor across the entire command.

Tab completion

You can use the Tab key to autocomplete a file or directory name.

For example, if you want to change your directory to a child directory called source, you can type cd s on the command line and press Tab. If only one child directory starts with s, the command is completed to cd source; otherwise, the shell lists all the child directories that start with s.
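The behavior of tab completion can be modeled in a few lines. This is a hypothetical sketch, not the shell’s actual implementation; completeName and the directory names other than source are made up.

```typescript
// Hypothetical model of tab completion: match what was typed against the
// names in the current directory.
function completeName(
  typed: string,
  entries: string[]
): { completion: string | null; candidates: string[] } {
  const candidates = entries.filter(entry => entry.startsWith(typed));
  // Complete only when exactly one name matches; otherwise just list the candidates.
  const completion = candidates.length === 1 ? candidates[0] : null;
  return { completion, candidates };
}

console.log(completeName("s", ["source", "test"]));            // one match: completes to source
console.log(completeName("s", ["source", "scripts", "test"])); // two matches: lists both
```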

The Git object graph

Read: Pro Git 1.3: What is Git?

That reading introduces the three pieces of a Git repo: .git directory, working directory, and staging area.

All of the operations we do with Git — clone, add, commit, push, log, merge, … — are operations on a graph data structure that stores all of the versions of files in our project, and all the log entries describing those changes. The Git object graph is stored in the .git directory of your local repository. Another copy of the graph, e.g. for PS0, is on github.mit.edu, stored in an IS&T data center somewhere.

Copy an object graph with `git clone`

How do you get the object graph from github.mit.edu (or any other remote storage) to your local machine in order to start working on the problem set? git clone copies the graph.

Suppose your username is bitdiddle and you run:

git clone ssh://github.mit.edu/.../ps0-bitdiddle.git ps0

Git performs several steps:

Create an empty local directory ps0, and ps0/.git.
Connect to github.mit.edu and copy the object graph from ps0-bitdiddle.git into ps0/.git.
Check out the current version of the main branch.


We still haven’t explained what’s in the object graph. But before we do that, let’s understand step 3 of git clone: check out the current version of the main branch.

The object graph is stored on disk in a convenient and efficient structure for performing Git operations, but not in a format we can easily use. In Alice’s invented version control scheme, the current version of hello.ts was just called hello.ts because she needed to be able to edit it normally. In Git, we obtain normal copies of our files by checking them out from the object graph. These are the files we see and edit in VS Code.

We also decided above that it might be useful to support multiple branches in the version history. Multiple branches are essential for large teams working on long-term projects. To keep things simple in 6.102, we will not use branches and we don’t recommend that you create any. Every 6.102 Git repo comes with a default branch called main, and all of our work will be on the main branch. (The default branch used to be called master by convention. If you see that in online Git tutorials or Stack Overflow answers, just replace it with main instead.)

So step 2 of git clone gets us an object graph, and step 3 gets us a working directory full of files we can edit, starting from the current version of the project.
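The three steps of git clone can be sketched with a toy model. This is only an illustration under simplifying assumptions: a commit is modeled as a bare snapshot of files, the types ObjectGraph and LocalRepo and the function cloneRepo are made up, and the commit IDs c1 and c2 and the file contents are placeholders. Git’s real on-disk format is quite different.

```typescript
type FileTree = Record<string, string>;   // filename -> contents

interface ObjectGraph {
  commits: Record<string, FileTree>;      // commit ID -> snapshot of all files
  branches: Record<string, string>;       // branch name -> commit ID
}

interface LocalRepo {
  dotGit: ObjectGraph;                    // stands in for the .git directory
  workingDirectory: FileTree;             // the files we see and edit
}

function cloneRepo(remote: ObjectGraph): LocalRepo {
  // Step 1: create an empty local directory with an empty .git.
  const local: LocalRepo = { dotGit: { commits: {}, branches: {} }, workingDirectory: {} };
  // Step 2: copy the object graph from the remote.
  for (const [id, snapshot] of Object.entries(remote.commits)) {
    local.dotGit.commits[id] = { ...snapshot };
  }
  local.dotGit.branches = { ...remote.branches };
  // Step 3: check out the current version of the main branch.
  const current: string | undefined = local.dotGit.branches["main"];
  if (current !== undefined) {
    local.workingDirectory = { ...local.dotGit.commits[current] };
  }
  return local;
}

// Placeholder remote with two versions of one file.
const remoteGraph: ObjectGraph = {
  commits: {
    c1: { "hello.ts": 'console.log("hello");' },
    c2: { "hello.ts": 'console.log("hello, world");' },
  },
  branches: { main: "c2" },
};
const ps0 = cloneRepo(remoteGraph);
console.log(ps0.workingDirectory["hello.ts"]); // contents from commit c2
```

Editing ps0.workingDirectory afterwards changes neither the local object graph nor the remote one, which mirrors how editing files in VS Code does not by itself change the history.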

Practice with GitStream

This reading includes links to a Git tutor called GitStream. GitStream allows you to practice Git on your machine: for each exercise, you clone a GitStream repository, then follow the instructions on the web page. GitStream will give you feedback in both the terminal and on the web as you complete each exercise.

GitStream will not work with multiple exercise pages open at the same time.

Don’t open exercises in multiple tabs. If an exercise doesn’t work, please close all open GitStream pages and try again.

If you encounter a problem, please ask for help.

GitStream → Practice git clone

Note that GitStream doesn’t keep track of whether you’ve already done this exercise. To see which GitStream exercises you’ve already done, look at Omnivore.

Get the history of the repository

Let’s finally dive into that object graph!

Clone an example repo with some interesting history:

git clone https://github.com/6031/ex05-hello-git hello-git

And make sure you are inside its directory:

cd hello-git

Here are the most useful commands:

git lol

shows a list of all the commits in the repository. It’s an alias for git log plus a few options, which you created when you followed the Getting Started instructions.

git show

shows the most recent commit in the repository. It will show you the commit message as well as all the modifications.

Long output: if git lol or git show generates more output than fits on one page, you will see a colon (:) symbol at the bottom of the screen. You will not be able to type another command! Use the arrow keys to scroll up and down, and quit the output viewer by pressing q. (You can also press h to see the viewer’s other commands, but scrolling and quitting are the most important to know.)

Commit IDs: every Git commit has a unique ID, the hexadecimal numbers you see in git lol or git show. The commit ID is a unique cryptographic hash of the contents of that commit. Every commit, not just within your repository but within the universe of all Git repositories, has a unique ID (with extremely high probability).

You can reference a commit by its ID (usually just by the first few characters). This is useful with a command like git show, where you can look at a particular commit rather than only the most recent one.

You will also see commits identified by ID in tools like github.mit.edu and Didit.
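Referring to a commit by the first few characters of its ID works as long as the prefix matches exactly one commit. The sketch below models that lookup using the abbreviated IDs from this reading’s example repository; resolveCommit is a made-up name, and Git’s own lookup is more involved.

```typescript
// Abbreviated commit IDs from the example repository used in this reading.
const knownCommits = ["b0b54b3", "3e62e60", "6400936", "82e049e", "1255f4e", "41c4b8f"];

// Resolve a prefix to the single commit it identifies, or explain why it cannot.
function resolveCommit(prefix: string, ids: string[]): string {
  const matches = ids.filter(id => id.startsWith(prefix));
  if (matches.length === 0) throw new Error(`unknown commit: ${prefix}`);
  if (matches.length > 1) throw new Error(`ambiguous prefix: ${prefix}`);
  return matches[0];
}

console.log(resolveCommit("1255", knownCommits)); // the commit whose ID starts with 1255
```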

Here’s the output of git lol for this example repository:

* b0b54b3 (HEAD, origin/main, origin/HEAD, main) Greeting in Java
*   3e62e60 Merge
|\  
| * 6400936 Greeting in Scheme
* | 82e049e Greeting in Ruby
|/  
* 1255f4e Change the greeting
* 41c4b8f Initial commit

The history of a Git project is a directed acyclic graph (DAG). The history graph is the backbone of the full object graph stored in .git, so let’s focus on it for a minute.

Each node in the history graph is a commit a.k.a. version a.k.a. revision of the project: a complete snapshot of all the files in the project at that point in time. Each commit is identified by its commit ID, displayed as a hexadecimal number.

Except for the initial commit, each commit has a pointer to a parent commit. For example, commit 1255f4e has parent 41c4b8f: this means 41c4b8f happened first, then 1255f4e.

Some commits have the same parent. They are versions that diverged from a common previous version, for example because two developers were working independently.

And a commit can have two parents. This is a version that ties divergent histories back together, for example because those developers then merged their work together again.

A branch — remember main will be our only branch for now — is just a name that points to a commit.

Finally, HEAD points to our current commit — almost. We also need to remember which branch we’re working on. So HEAD points to the current branch, which points to the current commit.
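To tie these pieces together, here is a sketch of the example history as a data structure, with commits pointing to their parents, a branch pointing to a commit, and HEAD pointing to a branch. The IDs and messages come from the git lol output above; the type CommitNode, the helper ancestorsOf, and the order of the merge commit’s two parents are assumptions of this sketch, not Git’s internal representation.

```typescript
interface CommitNode {
  id: string;
  message: string;
  parents: string[]; // none for the initial commit, two for a merge
}

// The example history, oldest commit first.
const commitList: CommitNode[] = [
  { id: "41c4b8f", message: "Initial commit", parents: [] },
  { id: "1255f4e", message: "Change the greeting", parents: ["41c4b8f"] },
  { id: "82e049e", message: "Greeting in Ruby", parents: ["1255f4e"] },
  { id: "6400936", message: "Greeting in Scheme", parents: ["1255f4e"] },
  { id: "3e62e60", message: "Merge", parents: ["82e049e", "6400936"] }, // parent order assumed
  { id: "b0b54b3", message: "Greeting in Java", parents: ["3e62e60"] },
];

// A branch is just a name that points to a commit.
const branchTips: Record<string, string> = { main: "b0b54b3" };
// HEAD points to the current branch, which points to the current commit.
const headBranch = "main";

// Collect every commit reachable by following parent pointers from startId.
function ancestorsOf(startId: string, commits: CommitNode[]): string[] {
  const byId = new Map(commits.map(c => [c.id, c] as const));
  const seen = new Set<string>();
  const toVisit = [startId];
  while (toVisit.length > 0) {
    const id = toVisit.pop();
    if (id === undefined) break;
    for (const parentId of byId.get(id)?.parents ?? []) {
      if (!seen.has(parentId)) {
        seen.add(parentId);
        toVisit.push(parentId);
      }
    }
  }
  return Array.from(seen);
}

const currentCommit = branchTips[headBranch];
console.log(ancestorsOf(currentCommit, commitList));
// Commits with two parents tie divergent histories back together.
console.log(commitList.filter(c => c.parents.length === 2).map(c => c.id));
```

Because parent pointers only ever lead to commits that already existed, following them can never return to the starting commit, which is why the graph is acyclic.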

reading exercises

HEAD count

Using commands like git lol, or commands from Pro Git or a tool like SourceTree, answer these questions about the hello-git example repo:

How many commits are in this project?


How many different versions of hello.txt are there?


How many new files have been added to the project?

How many times has an existing file been modified?

And how many times has a file been deleted?


(Stuck? Try looking for options or arguments to git log that can answer these questions.)

First impression

What were the original contents of hello.txt?


Graph-ical

Which of these are a correct representation of the history of this repository?

Choose all the correct answers.


Around and around

What would be the meaning of a cycle in the history graph?

Diverging changes were made in parallel

More than two diverging histories were merged in a single merge

Some commit is its own ancestor

Some commit is a descendant of itself

A pair of commits contain inverse changes

This is impossible


Add to the object graph with `git commit`

How do we add new commits to the history graph? git commit creates a new commit.

In some alternate universe, git commit might create a new commit based on the current contents of your working directory. So if you edited hello.ts and then did git commit, the snapshot would include your changes.

We’re not in that universe; in our universe, Git uses that third and final piece of the repository: the staging area (a.k.a. the index, which is only a useful name to know because sometimes it shows up in documentation).

The staging area is like a proto-commit, a commit-in-progress. Here’s how we use the staging area and git add to build up a new snapshot, which we then cast in stone using git commit:

Modify hello.txt, git add hello.txt, git commit

The steps are listed below, followed by the output of git status along the way:

If we haven’t made any changes yet, then the working directory, staging area, and HEAD commit are all identical.
Make a change to a file. For example, let’s edit hello.txt.
Other changes might be creating a new file, or deleting a file.
Stage those changes using git add.
Create a new commit out of all the staged changes using git commit.

$ git status
On branch main
Your branch is up-to-date with 'origin/main'.

nothing to commit, working directory clean

$ git status
On branch main
Your branch is up-to-date with 'origin/main'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

    modified:   hello.txt

no changes added to commit (use "git add" and/or "git commit -a")

$ git add hello.txt 
$ git status
On branch main
Your branch is up-to-date with 'origin/main'.

Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

    modified:   hello.txt

$ git commit
[main 8a8858a] Update the greeting again
 1 file changed, 1 insertion(+), 1 deletion(-)
$ git status
On branch main
Your branch is ahead of 'origin/main' by 1 commit.
  (use "git push" to publish your local commits)

nothing to commit, working directory clean
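The relationship between the working directory, the staging area, and the HEAD commit can be sketched as three copies of the project’s files. This is a toy model under simplifying assumptions: the type RepoState, the functions stageFile and commitStaged, and the file contents are made up, and a real commit also records a parent, an author, and a message.

```typescript
type Files = Record<string, string>; // filename -> contents

interface RepoState {
  workingDirectory: Files; // what we see and edit
  stagingArea: Files;      // the commit-in-progress
  headCommit: Files;       // snapshot stored in the most recent commit
}

// Models git add: copy one file's current contents into the staging area.
function stageFile(repo: RepoState, filename: string): void {
  if (filename in repo.workingDirectory) {
    repo.stagingArea[filename] = repo.workingDirectory[filename];
  } else {
    delete repo.stagingArea[filename]; // the file was deleted, so stage the deletion
  }
}

// Models git commit: the new snapshot is whatever is staged, nothing more.
function commitStaged(repo: RepoState): void {
  repo.headCommit = { ...repo.stagingArea };
}

// Initially all three areas are identical.
const repo: RepoState = {
  workingDirectory: { "hello.txt": "Hello" },
  stagingArea: { "hello.txt": "Hello" },
  headCommit: { "hello.txt": "Hello" },
};
repo.workingDirectory["hello.txt"] = "Hello again";     // edit the file
stageFile(repo, "hello.txt");                            // git add hello.txt
repo.workingDirectory["hello.txt"] = "Hello once more";  // edit again without staging
commitStaged(repo);                                      // git commit
console.log(repo.headCommit["hello.txt"]);               // the staged edit, not the later one
```

The second edit stays in the working directory only, so it is not part of the new commit until it is staged and committed in turn.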

Get the status of your repository

Git has some nice commands for seeing the status of your repository.

The most important of these is git status. Run it any time to see which files are:

untracked (completely new, never before added to Git);
modified but still unstaged;
modified and staged. If you git commit, only staged changes will be included in the commit.

Note that the same file might have both staged and unstaged changes, if you modified the file again after running git add.

When you have unstaged changes, you can see what the changes were by running git diff. This compares your working directory to the staging area, so it will not include changes that were staged (but not committed); if nothing is staged, it is the same as comparing to the last commit. You can see uncommitted, staged changes, relative to the last commit, by running git diff --staged.

Use git status frequently to keep track of whether you have no changes, untracked files, unstaged changes, or staged changes; and whether you have new commits in your local repository that haven’t been pushed, as will be discussed next.
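The categories that git status reports can be derived by comparing the same three areas. The sketch below is a simplified model of that comparison; the type ThreeAreas, the function describeStatus, and the sample files are hypothetical, and real git status reports more detail (renames, deletions, branch information).

```typescript
type FileContents = Record<string, string>; // filename -> contents

interface ThreeAreas {
  workingDirectory: FileContents;
  stagingArea: FileContents;
  headCommit: FileContents;
}

interface StatusReport {
  untracked: string[]; // present on disk but never added
  unstaged: string[];  // working directory differs from the staging area
  staged: string[];    // staging area differs from the last commit
}

function describeStatus(areas: ThreeAreas): StatusReport {
  const report: StatusReport = { untracked: [], unstaged: [], staged: [] };
  const filenames = new Set([
    ...Object.keys(areas.workingDirectory),
    ...Object.keys(areas.stagingArea),
    ...Object.keys(areas.headCommit),
  ]);
  for (const filename of Array.from(filenames).sort()) {
    const inWork: string | undefined = areas.workingDirectory[filename];
    const inIndex: string | undefined = areas.stagingArea[filename];
    const inHead: string | undefined = areas.headCommit[filename];
    if (inIndex !== inHead) report.staged.push(filename);
    if (inIndex === undefined) {
      if (inWork !== undefined) report.untracked.push(filename);
    } else if (inWork !== inIndex) {
      report.unstaged.push(filename);
    }
  }
  return report;
}

// hello.txt was edited, staged, then edited again; notes.txt is brand new.
const report = describeStatus({
  workingDirectory: { "hello.txt": "third draft", "notes.txt": "remember to push" },
  stagingArea: { "hello.txt": "second draft" },
  headCommit: { "hello.txt": "first draft" },
});
console.log(report);
```

Here hello.txt appears as both staged and unstaged, the situation that arises when a file is modified again after running git add.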

Send & receive object graphs with `git push` & `git pull`

We can send new commits to a remote repository using git push:

git push origin main


When we clone a repository, we obtain a copy of the history graph.
Git remembers where we cloned from as a remote repository called origin.
Using git commit, we add new commits to the local history on the main branch.
To send those changes back to the origin remote, use git push origin main.

And we receive new commits using git pull. In addition to fetching new parts of the object graph, git pull also updates the working copy by checking out the latest version (just like git clone checked out a working copy to start with). If the remote repository and the local repository have both changed, git pull will try to merge those changes together.
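Pushing can be pictured as copying the commits the remote lacks and then moving the remote’s main branch forward. The sketch below is a toy model: the type Repo and the functions isAncestor and pushMain are made up, each commit records only its parents, and the two commit IDs are borrowed from the sample push output later in this reading, with their parent relationship assumed. The push is refused when the remote has commits the local repository has not received, which is when a pull and merge are needed first.

```typescript
interface Repo {
  commits: Record<string, string[]>; // commit ID -> parent IDs
  main: string;                      // the commit that the main branch points to
}

// Is ancestorId reachable from descendantId by following parent pointers?
function isAncestor(repo: Repo, ancestorId: string, descendantId: string): boolean {
  const toVisit = [descendantId];
  const seen = new Set<string>();
  while (toVisit.length > 0) {
    const id = toVisit.pop();
    if (id === undefined || seen.has(id)) continue;
    if (id === ancestorId) return true;
    seen.add(id);
    toVisit.push(...(repo.commits[id] ?? []));
  }
  return false;
}

// Models git push origin main. Returns false if the push is refused.
function pushMain(local: Repo, remote: Repo): boolean {
  // Refuse if the remote's main is not part of our history.
  if (!isAncestor(local, remote.main, local.main)) return false;
  // Copy the commits the remote does not have yet.
  for (const [id, parents] of Object.entries(local.commits)) {
    if (!(id in remote.commits)) remote.commits[id] = [...parents];
  }
  // Move the remote's main branch to our current commit.
  remote.main = local.main;
  return true;
}

const originRepo: Repo = { commits: { "a67cc45": [] }, main: "a67cc45" };
const localRepo: Repo = {
  commits: { "a67cc45": [], "b4db9b0": ["a67cc45"] },
  main: "b4db9b0",
};
console.log(pushMain(localRepo, originRepo)); // true: origin only needed the new commit
console.log(originRepo.main);                 // origin's main now points to b4db9b0
```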

Pushing

After you’ve made some commits, you’ll want to push them to a remote repository. In 6.102, you should have only one remote repository to push to, called origin. To push to it, you run the command:

git push origin main

The origin in the command specifies that you’re pushing to the origin remote. By convention, that’s the remote repository you cloned from.

The main refers to the main branch, the default branch in our Git repositories. We won’t use branches other than main in 6.102. All our commits will be on main, and that’s the branch we want to push.

Once you run this, you will be prompted for your password and hopefully everything will push. You’ll get a line like this:

a67cc45..b4db9b0  main -> main

GitStream → Practice the add-commit-push workflow

GitStream → Practice adding a new file

GitStream → Practice deleting a file

