
Git mining

Software development leverages version control for multiple reasons. I assume you are familiar with git blame, used to trace the person who committed a particular change, and with git checkout of several branches to find out which versions contain a bug and which don't.

The history of changes, however, is useful not only for survival when something goes wrong – it is also an exceptional source of evolutionary data about the past and the current state of a project.

Here we're going to dive deeper into this fascinating topic, and I'll show you a couple of simple commands and (more complex) shell scripts which you can use to explore git repositories.

This blog post is inspired by the book Software Design X-Rays by Adam Tornhill, which I admire for its scientific approach to the software engineering realm.

Then and now

We're going to explore two areas: the past and the current state. I emphasize this distinction because the feasible level of detail is significantly different – we won't go further than a 2D Excel chart, because anything more complex cannot be used as a business justification anyway.

  • Current state – exploring the situation at a specific point in time; you can dive deep and prepare various tables and charts, but it won't be possible to discover any trends
  • The past – if you want to introduce the time factor and draw conclusions from change over a period of time, you have to limit yourself to simple metrics, because we humans simply cannot picture anything more complex than a 2D chart

Why would you do this?

If you have so far been going through your professional career without these tools, you may wonder why you should care. Well…

  • If you jump into a long-living project and want to find out how it works, you have two ways: ask people or collect data. While talking to colleagues seems convenient, the information collected that way will be biased. You're going to do it anyway, but why not begin with raw data that you can use to prepare specific questions?
  • If you're responsible for a project, other people probably want you to report on its state from time to time. On-demand presentations can be a solution, but collecting data in a continuous manner is something everyone would appreciate.
  • Before making an important architectural decision, one should prepare a thorough analysis. The more accurate data you have, the more certain you can feel about the analysis and therefore the decision as well.
  • Collecting data in a continuous manner will let you make predictions. For instance, how do you know when you should plan to split your repository in two? A simple extrapolation based on growth so far would probably be reasonable, but you're going to need data.

The basics

For all exercises you're going to need a git repository, so please git clone something you're familiar with and let's go!

Before using any publicly available project, it seems sound to check whether it is popular and stable. First of all, let's check how old the last commit is, so that we can evaluate whether the project is still under development or maintenance.

git log -1
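
If all you need is the date, the same information can be narrowed down with a format placeholder (%cs, the committer date in YYYY-MM-DD form, is the same placeholder used by the scripts later in this post):

git log -1 --format='%cs'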

Okay, this was simple. In open source projects we rely on contributors, so let's find out how many contributors took part in the last 100 commits.

git log -100 --format='%ae'| sort | uniq | wc -l
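
A slight variation of the same pipeline shows how the last 100 commits are distributed among those contributors, which gives a quick feel for the bus factor:

git log -100 --format='%ae' | sort | uniq -c | sort -rn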

A useful fact, publicly visible for all GitHub repositories, is the total commit count to date.

git rev-list HEAD | wc -l
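
git rev-list also has a built-in counter, so the same number can be obtained without the extra pipe:

git rev-list --count HEAD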

An interesting trait of a code repository is the number of files with a specific extension, especially the extensions that can contain code. To count them, sometimes you're going to need a well-defined filter in order to explore only specific directories.

git ls-files *.cs | wc -l
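
If you only care about a part of the tree, the pathspec can be narrowed down. For instance, assuming the code you are interested in lives under a src directory (a hypothetical path – substitute your own), you could count only the tracked .cs files below it:

git ls-files 'src/*.cs' | wc -l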

This information is valuable and can provide some useful insight for quality comparison with other repositories you know.

Most of the information we discuss here is useful only in context. You're going to benefit from comparing the analyzed repository against others you know well, so that you know what to expect before diving deeper.

Now let’s check how many actual lines of code there are.

#this will fail to read files with spaces in path
git ls-files *.cs | xargs cat | wc -l
#this will read all files, but runs much slower
git ls-files *.cs | xargs -I{} cat {} | wc -l

#update 3/3/2023: this one I learned later and combines all pros of the above ones
git ls-files -z *.cs | xargs -0 cat | wc -l

Combining these numbers can lead to some rough estimation of quality. In my example there are:

  • 4119 files with code
  • 549096 lines of code

A division gives roughly 133 lines of code per file. I believe we can agree that, in general, the lower this factor, the better the quality of code we expect to find.
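
If you want this factor from a single command, a minimal sketch (assuming a POSIX shell and the same *.cs pathspec as above) could look like this:

#count tracked code files and their total lines, then print the average
files=$(git ls-files '*.cs' | wc -l)
lines=$(git ls-files -z '*.cs' | xargs -0 cat | wc -l)
echo "average lines per file: $(( lines / files ))"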

Now you can explore git documentation yourself and play with these commands – I’m sure you can find other factors that are easy to extract, yet can be very valuable.

Advanced: current state

Most software people are familiar with the pyramid of doom problem and would confirm that the level of indentation of a particular piece of code is at least a significant hint for evaluating its quality. The basic idea is simple: the less indentation, the better.

The power of this concept is that it can be applied to virtually any programming language.

Therefore I have come up with a script which can quickly estimate the "quality" of a given bunch of files. It counts the spaces of indentation in all encountered files.

#this will fail to read files with spaces in path
git ls-files *.cs | xargs grep -ho '^[[:blank:]]*' | tr -d "\n" | wc -c
#this will read all files, but runs much slower
git ls-files *.cs | xargs -I{} grep -ho '^[[:blank:]]*' {} | tr -d "\n" | wc -c

#update 3/3/2023: this one I learned later and combines all pros of the above ones
git ls-files -z *.cs | xargs -0 grep -ho '^[[:blank:]]*' | tr -d "\n" | wc -c

Okay, so for the code snippet below the script will count 148 spaces of indentation, whereas the total number of lines of code is 23, which gives a ratio of 6.43 spaces of indentation per line.

using System;
using System.Collections.Generic;
using System.Collections.Immutable;
using System.Linq;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.Diagnostics;
namespace CodeFix
{
    [DiagnosticAnalyzer(LanguageNames.CSharp)]
    public class CodeFixAnalyzer : DiagnosticAnalyzer
    {
        private static DiagnosticDescriptor Rule
            = new DiagnosticDescriptor(
                DiagnosticId,
                Title,
                MessageFormat,
                Category,
                DiagnosticSeverity.Warning,
                true,
                Description);
    }
}

High quality open source repositories, like aspnetcore, exhibit this factor at about 10.
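
To compute that factor for your own repository, a small sketch (assuming bc is available) can combine the two measurements shown above:

#total indentation characters divided by total lines of code
indent=$(git ls-files -z '*.cs' | xargs -0 grep -ho '^[[:blank:]]*' | tr -d '\n' | wc -c)
lines=$(git ls-files -z '*.cs' | xargs -0 cat | wc -l)
echo "scale=2; $indent / $lines" | bc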

I went much further with it and created a complex bash script able to indirectly investigate the quality of code. As said before, be aware that this information is valuable especially for comparison (tracking quality changes of your repository over time, or comparing its quality with another repository you already know).

You can call the script with the following command.

bash snapshot.sh -f "*.cs" -c | tee file_path

Your file_path should now have a format similar to what you can see below.

,,,,,,,,,,Lines count, Complexity^2, Age
eng,,,,,,,,,,1818,121,590
eng/helix,,,,,,,,,725,144,497
eng/helix/content,,,,,,,,725,144,497
eng/helix/content/RunTests,,,,,,,725,144,497
eng/tools,,,,,,,,,1093,121,652
eng/tools/BaselineGenerator,,,,,,,,399,169,462
eng/tools/RepoTasks,,,,,,,,694,81,761

If you import it to a spreadsheet, the table will look like this.

Directory                     Lines  Complexity^2  Age
eng                            1818           121  590
eng/helix                       725           144  497
eng/helix/content               725           144  497
eng/tools                      1093           121  652
eng/tools/BaselineGenerator     399           169  462
eng/tools/RepoTasks             694            81  761

This table is a partial view of the directories containing *.cs files in the aspnetcore repository. The first column represents the directory structure and the remaining ones provide some information about these directories.

  • Lines is the total number of lines in the code files.
  • Complexity is calculated as the count of spaces of indentation, squared (squaring makes similar values easier to tell apart).
  • Age is the average age of the files, weighted by size (an old "big" file counts as "older" than an old "small" file).

What are we looking for?

  • Complex, young code is a perfect candidate for refactoring, since it is changed often and we assume that these changes are difficult.
  • Complex, old code can be considered stable; it can be extracted to a separate library and reused in your project to save developers the time they would spend reading it.
  • Young code in general should be investigated with encapsulation in mind. If it changes too often, maybe that is because it is changed for too many reasons.

After playing a little bit with a spreadsheet, you're going to come up with a similar diagram. The code residing in the eng/tools/BaselineGenerator directory turns out to be relatively young and complex – a perfect candidate to look for refactoring opportunities. eng/tools/RepoTasks, on the other hand, is quite stable, so it might be a good idea to consider extracting it from the actively maintained repository into a stable library.

Advanced: the past

To be honest, people rarely care about the current state. The questions you're going to be asked are "how much time do you need to do X?" and "when will we have Y?".

Hence, based on ideas from the previous section, we'll now try to show some important factors in the context of time and hopefully make some predictions.

Let's see how our code base is growing. To do so, we're going to check out every 100th commit and collect the number of lines of code at each of them.

#warning: this forcibly checks out old commits, so run it on a clean working tree (ideally a throwaway clone)
git rev-list HEAD | sed -n '0~100p' | xargs -I{} sh -c "git checkout -f --quiet {} ; git log -1 --format='%cs' | tr '\n' '\t' ; git ls-files -z '*.cs' | xargs -0 cat | wc -l"

With this information, you can easily create a linear chart which will let you make predictions and answer questions about expected future of your code base.
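
One practical note: the loop above leaves your working tree checked out at an old commit, so remember to switch back to your branch when you're done (the branch name below is just a placeholder – substitute your own):

#substitute your own branch name here
git checkout -f --quiet master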

Now let's talk about merges. In most big-enough projects we do not care about each and every commit, but focus on pull requests or merge requests, whichever name you use.

Although git does not know whether a merge was reviewed or not, we can still track merges, and under the assumption that your main branch changes only through pull requests, we can proceed.

As an example, let's check how many of the last 100 PRs carried any file with "Test" in its path. With this command, we can roughly estimate how often the tests change and therefore how reliable they are.

git log --merges --max-count=100 --name-only --first-parent --oneline --pretty=format:'' | sed -z 's/\n/\t/g;s/\t\t/\n/g' | grep -ci 'test'
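
To put that count in context, it helps to know how many merges were actually inspected (the history may contain fewer than 100 of them):

git log --merges --max-count=100 --first-parent --oneline | wc -l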

Summary

These examples show how powerful git is in terms of data analysis. Such experiments can lead to quite interesting conclusions which will ultimately simplify making important architectural decisions. Let me know in the comments if you use any similar queries to collect data from git history!
