Reference (cache) repositories to speed up clones: git clone --reference

Update: Please read this update on my experience before using the technique in this article.
Update: See the drush implementation of this approach in the comment below.

DamZ taught me a great new piece of git trivia today. You can use a local repository as a kind of cache for a git clone.

Let's create a reference repository for Drupal. It will be bare (--mirror implies that), because we don't need any files checked out:

git clone --mirror git://git.drupal.org/project/drupal.git ~/gitcaches/drupal.reference

That makes a complete clone of Drupal's full history in ~/gitcaches/drupal.reference.

Now when I need to clone the Drupal project's entire history (as I might do often in testing) I can run:

git clone --reference ~/gitcaches/drupal.reference git://git.drupal.org/project/drupal.git

The clone now takes on the order of 2 seconds instead of several minutes, because git reuses the objects already in the reference repository and only fetches what is missing. And yes, it picks up new changes that may have happened in the real remote repository.
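One way to confirm that a clone really borrows from the cache is to look at its .git/objects/info/alternates file. The sketch below is self-contained: it uses throwaway local repositories in a temporary directory instead of git.drupal.org, and the names upstream, cache.reference and work are made up for the demo.

```shell
#!/bin/bash
set -e

# Throwaway sandbox; every name below is hypothetical.
demo=$(mktemp -d)
cd "$demo"

# A stand-in for the real remote repository, with a single commit.
git init -q upstream
git -C upstream -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "first commit"

# The reference (cache) repository, created as in the article.
git clone -q --mirror "file://$demo/upstream" cache.reference

# Clone from the "remote" while borrowing objects from the cache.
git clone -q --reference "$demo/cache.reference" "file://$demo/upstream" work

# The clone records the cache's object directory in this file.
alternates=$(cat work/.git/objects/info/alternates)
echo "$alternates"
```

The printed path should end in cache.reference/objects, which is where git looks for objects before asking the remote for them.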

To go beyond this (again from DamZ) we can have a reference repository that has many projects referenced within it.

mkdir -p ~/gitcaches/reference
cd ~/gitcaches/reference
git init --bare
for repo in drupal views cck examples panels  # whatever you want here
do
  git remote add $repo git://git.drupal.org/project/$repo.git
done

git fetch --all

Now I have just one big bare repo that I can use as a cache. I might update it from time to time with git fetch --all. But I don't have to. And I can use it like this:

cd /tmp
git clone --reference ~/gitcaches/reference git://git.drupal.org/project/drupal.git
git clone --reference ~/gitcaches/reference git://git.drupal.org/project/examples.git
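A clone made with --reference keeps borrowing objects from the cache, so it can break if the cache is later deleted or loses objects the clone depends on. That is general git behaviour rather than something stated above. If it matters for your use, the clone can be made self-contained afterwards. A minimal sketch, again with throwaway local repositories and made-up names (upstream, cache.reference, work):

```shell
#!/bin/bash
set -e

# Throwaway sandbox; every name below is hypothetical.
sandbox=$(mktemp -d)
cd "$sandbox"

# Stand-in remote, cache, and a clone that borrows from the cache.
git init -q upstream
git -C upstream -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "first commit"
git clone -q --mirror "file://$sandbox/upstream" cache.reference
git clone -q --reference "$sandbox/cache.reference" "file://$sandbox/upstream" work

# Copy every borrowed object into the clone's own object store...
git -C work repack -a -d -q
# ...then stop consulting the cache.
rm -f work/.git/objects/info/alternates

# The clone should still be intact once the cache is gone.
rm -rf cache.reference
git -C work fsck
```

Git 2.3 and later can do the same at clone time with git clone --reference <cache> --dissociate <url>.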

We'll try to use this technique for the testbots, which do several clean checkouts per patch tested, as it should speed them up by at least a minute per test.

Edit: Here is the version that I used with the testbots, as it appears as a gist:

This is a repository that has objects for all Drupal projects which
are enabled for testing.


The list of projects can be created with:

echo "select uri from project_projects p,pift_project pp where pp.pid = p.nid" | mysql gitdev >/tmp/projects.txt
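The script below hardcodes its project list, but the list could instead be read from the file that query writes. A rough sketch, assuming the file holds one project short name per line under a header row (mysql prints column names in batch mode unless run with -N); the sample file here is a made-up stand-in for /tmp/projects.txt:

```shell
#!/bin/bash
set -e

# Hypothetical sample standing in for /tmp/projects.txt.
projects_file=$(mktemp)
printf 'uri\ndrupal\nviews\nexamples\n' > "$projects_file"

# Skip the header row and join the names with spaces, which is the
# format the script's "for project in $projects" loop expects.
projects=$(tail -n +2 "$projects_file" | paste -sd' ' -)
echo "$projects"
```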

#!/bin/bash

repo=/var/cache/git/reference_cache
source_base_url=git://git.drupal.org/project
projects="drupal devel image mailhandler poormanscron privatemsg project redirect weather ecommerce captcha nodewords simpletest translation pathauto media comment_notify userpoints taxonews nodequeue porterstemmer g2 vote_up_down google_analytics taxonomy_filter geshifilter opensearch chessboard path_redirect token admin_menu services adminrole potx autoassignrole content_access l10n_server versioncontrol user_delete openlayers rules xmlsitemap css_injector userpoints_contrib talk millennium elements mollom linkchecker piwik storm plugin_manager realname languageicons project_issue_file_review uuid role_change_notify profile_permission userpoints_nc securepages_prevent_hijack search_by_page libraries og_statistics grammar_parser skinr faces nd encrypt password_change entitycache examples blogapi drupal_queue profile2 entity better_exposed_filters clock proxy contact vars simpletest_selenium sshkey multicron errornot fontyourface transformers date_popup_authored smartcrop embeddable edge rtsg field_collection comment_allow_anonymous field_formatter_settings myspace_sync references properties"


# Create a temporary directory and arrange for cleanup on exit.
TEMPDIR=$(mktemp -d)
trap 'rm -Rf "$TEMPDIR"' EXIT

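mkdir -p "$repo"
cd "$repo" || exit 1

# A bare repository has no .git directory; HEAD sits at the top level,
# so test for that to avoid re-initializing on every run.
if ! test -f "HEAD"; then
  git init --bare
  git config core.compression 1
fi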

# For each project:
# * Clone it into the temporary directory
# * Add that clone as a remote in our reference repo
echo "Cloning all projects to temp repos"
for project in $projects
do
  echo "Cloning $project..."
  git clone --bare "$source_base_url/$project.git" "$TEMPDIR/$project"
  git remote add "$project" "$TEMPDIR/$project"
done

# Fetch all the new (local) remotes we gathered
git fetch --all

echo "Fixing URLs on all remotes to point to the real repo"
# Now change the remotes to the correct remote URL.
for project in $projects
do
  git remote set-url "$project" "$source_base_url/$project.git"
done

echo "Re-fetching from the real repo"
# To update, all we need to do is...
git fetch --all

6 Comments

Adding a new repository can be slow

Adding a new repository to the cache can be slow, because on the first fetch Git will try to determine if it has any revision in common with the remote repository, and the only way to do that is to send out the list of *all* the commits in the local repository.

The trick to work around this is to clone into an empty repository first, and then fetch from there into the cache repository. Here is a simple bash script that automates this process. Run it as import.sh [remote name] [remote url] from inside the cache repository:

#!/bin/bash

set -ex

# Create a temporary directory and remove it on exit.
TEMPDIR=$(mktemp -d)
trap 'rm -Rf "$TEMPDIR"' EXIT

# First clone the remote repository separately.
git clone "$2" "$TEMPDIR"

# Then fetch from the temporary clone into our main repo.
git remote add "$1" "$TEMPDIR/"
git fetch "$1" --tags

# Then change the remote URL and fetch normally.
git remote set-url "$1" "$2"
git fetch "$1" --tags

This is the full drush

This is the full drush command to use for those who are interested:
drush dl --package-handler=git_drupalorg --cache drupal

You can also add it as a default setting to your ~/.drush/drushrc.php file:

<?php
$command_specific['dl'] = array(
  'package-handler' => 'git_drupalorg',
  'cache' => TRUE,
);
?>

On demand?

This post got me thinking, so I put up a more generalized bash script that caches on demand and is not specific to Drupal.

https://gist.github.com/8839519ec5b823e047bf

Replacing ‘git’ with ‘git-cached’ will get it working.