Git Accidents — Remove sensitive data from git history

Deepak
Jun 7, 2023

Have you ever pushed a file to a public or private Git repository and later realized that you should not have added it? For example, a file containing private or confidential information such as a password, PII, or a private key, or a file so big that it slows the repository down?

Recognizing the mistake is the first step. Now that you have realized it, you must be wondering how to remove the private information from the repo. You can easily remove it from the current commit, but what about all the commits you have already made in the past?

Thankfully, there are a couple of tools to deal with this situation.

But before going into the different options, I would like to highlight that once your data is up there in the repo, even for a few seconds, you should consider it compromised. Depending on the criticality of the data, you should take appropriate steps. Passwords are the most critical, and it is always recommended to change a password as soon as you suspect that it has leaked (in this case, assume it had already leaked by the time you noticed). If it is the IP of your honeypot server or of other servers in a DMZ, then you might not worry about it too much, but in some cases you will still want to remove its references from a public repo.

Let's talk about the options for removing private information or files from all commits to date in a repo.

  1. git filter-repo
  2. BFG Repo-Cleaner

I found both tools useful, but in this post I am going to talk about BFG in particular since I found it easy and fast.

Let's start by installing BFG Repo-Cleaner. I am running my example on a Mac. For installation instructions on other systems, see the BFG Repo-Cleaner documentation.

I used brew to install BFG Repo-Cleaner

brew install bfg

Once it is installed, check that the bfg command line is working by running "bfg help":

bfg help
bfg 1.14.0
Usage: bfg [options] [<repo>]

  -b, --strip-blobs-bigger-than <size>
        strip blobs bigger than X (eg '128K', '1M', etc)
  -B, --strip-biggest-blobs NUM
        strip the top NUM biggest blobs
  -bi, --strip-blobs-with-ids <blob-ids-file>
        strip blobs with the specified Git object ids
  -D, --delete-files <glob>
        delete files with the specified names (eg '*.class', '*.{txt,log}' - matches on file name, not path within repo)
  --delete-folders <glob>
        delete folders with the specified names (eg '.svn', '*-tmp' - matches on folder name, not path within repo)
  --convert-to-git-lfs <value>
        extract files with the specified names (eg '*.zip' or '*.mp4') into Git LFS
  -rt, --replace-text <expressions-file>
        filter content of files, replacing matched text. Match expressions should be listed in the file, one expression per line - by default, each expression is treated as a literal, but 'regex:' & 'glob:' prefixes are supported, with '==>' to specify a replacement string other than the default of '***REMOVED***'.
  -fi, --filter-content-including <glob>
        do file-content filtering on files that match the specified expression (eg '*.{txt,properties}')
  -fe, --filter-content-excluding <glob>
        don't do file-content filtering on files that match the specified expression (eg '*.{xml,pdf}')
  -fs, --filter-content-size-threshold <size>
        only do file-content filtering on files smaller than <size> (default is 1048576 bytes)
  -p, --protect-blobs-from <refs>
        protect blobs that appear in the most recent versions of the specified refs (default is 'HEAD')
  --no-blob-protection
        allow the BFG to modify even your *latest* commit. Not recommended: you should have already ensured your latest commit is clean.
  --private
        treat this repo-rewrite as removing private data (for example: omit old commit ids from commit messages)
  --massive-non-file-objects-sized-up-to <size>
        increase memory usage to handle over-size Commits, Tags, and Trees that are up to X in size (eg '10M')
  <repo>
        file path for Git repository to clean

Clone a fresh copy of your repo with the --mirror flag:

git clone --mirror git@github.com:deepakjd2004/xxxxxx.git

Cloning into bare repository 'xxxxxxx.git'...
remote: Enumerating objects: 77, done.
remote: Counting objects: 100% (77/77), done.
remote: Compressing objects: 100% (47/47), done.
remote: Total 77 (delta 15), reused 69 (delta 10), pack-reused 0
Receiving objects: 100% (77/77), 45.04 KiB | 217.00 KiB/s, done.
Resolving deltas: 100% (15/15), done.

I encourage you to peek inside the newly downloaded directory. You will see that it does not contain the normal files from your repo; it is a full copy of the Git database for your repo. Take a backup of this full copy for safety.
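One simple way to take that backup is to archive the whole directory before rewriting anything. This is only a sketch: the name example-mirror.git is a hypothetical stand-in for your own mirror clone, and the first few lines create a dummy directory so the snippet runs on its own.

```shell
# Stand-in for the mirror clone so this sketch runs on its own.
# In practice REPO_DIR would be the directory created by `git clone --mirror`.
REPO_DIR="example-mirror.git"   # hypothetical name
mkdir -p "$REPO_DIR"
echo "ref: refs/heads/main" > "$REPO_DIR/HEAD"

# Archive the whole bare repository before rewriting anything.
BACKUP="${REPO_DIR%.git}-backup.tar.gz"
tar -czf "$BACKUP" "$REPO_DIR"

# List the archive contents to confirm the backup is readable.
tar -tzf "$BACKUP"
```

If the rewrite goes wrong, you can extract the archive and start again from the untouched copy.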

In my case I am going to replace every occurrence of the ssh_key value with ***REMOVED*** (the default replacement text that BFG uses), and this needs to happen across all commits. For all the possible options of the bfg tool, see the usage output above or the BFG documentation.

In the current directory, use the editor of your choice to create a file with any name and put in it the text you want replaced with ***REMOVED***, one expression per line. I am calling my file search-replace.txt. For example:

cat search-replace.txt
Q9/KOf2ze1E06qUF2mLPwX5y6dTMCKGkkV+HDQlIp/F0ZTfWIQ9/KOf2ze1E06qUF2mLPwX5y6dTMCKGkkV+HDQlIp/F0ZTfWIQ9/KOf2ze1E06qUF2mLPwX5y6dTMCKGkkV+HDQlIp/F0ZTfWI
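The usage text above mentions that an expression can also carry its own replacement (with '==>') or be a regular expression (with the 'regex:' prefix). As a rough sketch of what such a file might look like, the snippet below writes three expressions to a hypothetical file called expressions-example.txt; every value in it is made up for illustration.

```shell
# Write an example expressions file, one expression per line.
# Line 1: a literal, replaced with the default ***REMOVED***.
# Line 2: a literal with a custom replacement after '==>'.
# Line 3: a regular expression with a custom replacement.
cat > expressions-example.txt <<'EOF'
hunter2-example-password
old-api-token-example==>REDACTED_TOKEN
regex:password=\w+==>password=
EOF

# Show what will be passed to bfg --replace-text.
cat expressions-example.txt
```

Keep the explanations out of the file itself, since each line is treated as an expression.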

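Run the command below to rewrite your commits, branches, and tags so that they are clean. Note that it does not physically delete the unwanted data yet. Also keep in mind that by default BFG protects the contents of your latest commit (HEAD), so make sure the sensitive data has already been removed there.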

bfg --replace-text search-replace.txt xxxxxxxx.git

The above command generates a report in the current folder. Examine the report to make sure that BFG made the changes you intended.

Next, go into the cloned directory (the one you created with the --mirror flag):

cd xxxxx.git

Examine the repo to make sure your history has been updated, and then run the command below, which uses the standard git gc command to strip out the unwanted dirty data that Git will now recognize as surplus to requirements.

git reflog expire --expire=now --all && git gc --prune=now --aggressive

Enumerating objects: 78, done.
Counting objects: 100% (78/78), done.
Delta compression using up to 12 threads
Compressing objects: 100% (58/58), done.
Writing objects: 100% (78/78), done.
Reusing bitmaps: 8, done.
Building bitmaps: 100% (9/9), done.
Total 78 (delta 16), reused 53 (delta 0)
Computing commit graph generation numbers: 100% (9/9), done.
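One way to check the rewritten history is Git's "pickaxe" search, which lists the commits whose changes add or remove a given string. The sketch below is self-contained: it builds a throwaway repo (demo-repo, with a fake secret, both hypothetical) just to show the search itself. In practice you would run only the git log line inside your cleaned mirror clone, with your own secret.

```shell
# Throwaway repo with a fake secret so the sketch runs on its own.
git init -q demo-repo
cd demo-repo
echo "ssh_key=FAKE-SECRET-123" > config.txt
git add config.txt
git -c user.name="Demo" -c user.email="demo@example.com" commit -q -m "add config"

# -S lists commits whose diff adds or removes the string;
# --all extends the search to every branch and tag.
SECRET="FAKE-SECRET-123"
HITS=$(git log --all --oneline -S"$SECRET" | wc -l | tr -d ' ')
echo "commits touching the secret: $HITS"
cd ..
```

In the demo repo the search finds the commit that introduced the fake secret. In a properly cleaned mirror, the same search for your real secret should come back empty.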

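Once you are happy with the result, push it back up. Because the clone was made with the --mirror flag, this push updates all refs on the remote.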

git push

Now go back to your repo and check that the confidential data has been removed from all commits. In future, make sure to add any sensitive file to .gitignore, or manually scrub the sensitive information before uploading.
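As a small illustration of both habits, the sketch below appends a few ignore patterns to .gitignore and then runs a crude scan for private-key headers. The patterns, the scan-demo directory, and the fake key file are all assumptions made up for this example; adjust them to what your project actually contains.

```shell
# Example ignore patterns (assumptions; adjust to your project).
cat >> .gitignore <<'EOF'
.env
*.pem
id_rsa
EOF

# Demo file with a fake key header so the scan has something to find.
mkdir -p scan-demo
printf '%s\n' '-----BEGIN RSA PRIVATE KEY-----' 'not-a-real-key' > scan-demo/leaky.txt

# Crude scan: list files that contain a private-key header.
grep -rl -e '-----BEGIN [A-Z ]*PRIVATE KEY-----' scan-demo
```

A scan like this only catches one obvious pattern, so treat it as a last line of defense rather than a replacement for keeping secrets out of the working tree.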

Also you should note that if the commit that introduced the sensitive data exists in any forks of your repository, it will continue to be accessible, unless the fork owner removes the sensitive data from their fork or deletes the fork entirely. You cannot remove sensitive data from other users’ clones of your repository, but you can permanently remove cached views and references to the sensitive data in pull requests on GitHub by contacting GitHub Support.

Thanks for reading the article.
