Date: 2023-04-29



Code at Scale by Phil Norman

At Google, tens of thousands of software engineers contribute to a single repository containing billions of lines of code. Stored in a system called Piper, this repository holds more or less everything code-related: source code for shared libraries, production services, experimental programs, and diagnostic and debugging tools.

This open approach is very powerful. For example, if an engineer doesn’t know how to use a library, they can find examples just by searching. It also allows kind-hearted individuals to perform important updates across the whole repository, such as migrating to new APIs or following language developments like Python 3 and Go generics.

Code does not come for free, however. It is expensive to write, and it also takes real engineering time to maintain. That maintenance cannot easily be skipped, at least not if you want to avoid bigger costs later.

But what if there were less code to maintain? Are all those lines of code really necessary?

Mass delete

Large projects accumulate dead code. There will always be modules that are no longer needed, or programs that were used in the early stages of development but haven’t been run in years. In fact, whole projects are created, serve their purpose for a while, and then become obsolete. Sometimes they are cleaned up, but cleanup takes time and effort, and it’s not always easy to justify the investment.

However, as long as this dead code sits around undeleted, it still incurs a cost. The automated test system does not know that it should stop running dead tests. People doing large-scale cleanups don’t realize that this code is never executed, so they migrate it even though there is no point in doing so.

What if dead code could be cleaned up automatically? That’s exactly what people started thinking about a few years ago at the Zürich Engineering Productivity team’s annual hackathon. The resulting Sensenmann project, named after the German word for death personified, has been a huge success. It submits over 1000 deletion changelists every week and has so far removed nearly 5% of all C++ code at Google.

Its goal is simple (at least in principle): automatically identify dead code and submit a code review request (“changelist”) to remove it.

What do you want to remove?

Google’s build system, Blaze (an internal version of Bazel), helps determine this. Because it represents dependencies between binary targets, libraries, tests, source files, and so on in a consistent and accessible way, we can construct a dependency graph. This lets us find libraries that are not linked into any binary and propose their removal.

However, this is only the beginning of the problem. What about all these binaries? All the one-shot data migration programs and obsolete system diagnostic tools? If they are not removed, all the libraries they depend on are also preserved.

The only real way to know whether a program is useful is to check whether it is being run. So for internal binaries (programs that run in Google’s data centers or on employee workstations), a log entry is written each time a program runs, recording the time and the specific binary. By aggregating these entries, we get a liveness signal for every binary used at Google. If a program has not been used for a long time, Sensenmann tries to send a deletion changelist.
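The aggregation step can be sketched roughly as follows. This is a minimal illustration, not Sensenmann’s actual implementation: the log format (pairs of binary name and timestamp), the function names, and the one-year idle threshold are all assumptions made for the example. It also treats binaries with no log entries at all as stale, which is a simplification.

```python
from datetime import datetime, timedelta

def last_run_times(log_entries):
    """Collapse (binary, timestamp) log entries into the latest run per binary."""
    last = {}
    for binary, ts in log_entries:
        if binary not in last or ts > last[binary]:
            last[binary] = ts
    return last

def stale_binaries(all_binaries, log_entries, now, max_idle=timedelta(days=365)):
    """Return binaries never seen in the logs or idle for longer than max_idle.

    max_idle is a hypothetical threshold chosen for this sketch.
    """
    last = last_run_times(log_entries)
    return sorted(
        b for b in all_binaries
        if b not in last or now - last[b] > max_idle
    )

# Hypothetical log data for illustration.
LOGS = [
    ("main1", datetime(2022, 1, 1)),
    ("main1", datetime(2023, 4, 1)),
    ("main2", datetime(2021, 12, 1)),
]
NOW = datetime(2023, 4, 29)
print(stale_binaries({"main1", "main2", "main3"}, LOGS, NOW))
```

Only the most recent run matters, so the aggregate stays small even when the raw logs are large.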

What should not be deleted

Of course, there are exceptions. Some program code exists merely as an example of how to use an API. Some programs only run in places where no log signal can be collected. There are many other cases where removing the code would be harmful. For this reason, it’s important to have a blocklist system in place so that exceptions can be marked and people are not annoyed by bogus changelists.
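A blocklist check could look something like the sketch below. The target names, the glob-style pattern syntax, and the function names are assumptions for illustration only; the source does not describe how the real blocklist is expressed.

```python
from fnmatch import fnmatchcase

def is_blocked(target, blocklist_patterns):
    """True if the target matches any blocklist pattern (glob-style, case-sensitive)."""
    return any(fnmatchcase(target, pattern) for pattern in blocklist_patterns)

def filter_candidates(candidates, blocklist_patterns):
    """Drop deletion candidates that are covered by the blocklist."""
    return [t for t in candidates if not is_blocked(t, blocklist_patterns)]

# Hypothetical targets: an API example, an old tool, and a binary that
# runs where no log signal is available.
CANDIDATES = ["//examples/api:demo", "//tools/old:migrate", "//embedded/fw:flasher"]
BLOCKLIST = ["//examples/*", "//embedded/fw:flasher"]
print(filter_candidates(CANDIDATES, BLOCKLIST))
```

Applying the filter before any changelist is generated means an exception only has to be recorded once.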

The devil is in the details

Consider a simple case. There are two binaries, each depending on its own library, and both also depending on a third, shared library. Drawing this (ignoring source files and other dependencies) gives a simple structure: main1 depends on lib1, main2 depends on lib2, and both depend on the shared library.

If we see that main1 is in active use but main2 was last used more than a year ago, we can propagate the liveness signal through the build graph, marking main1 and everything it depends on as live. Whatever is left over can be removed. Since main2 depends on lib2, those two targets are removed in the same change.
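The propagation step can be sketched as a plain graph traversal. This is an illustrative sketch only: the names lib1 and shared_lib, the dictionary representation of the build graph, and the function names are assumptions.

```python
# Hypothetical build graph: target -> direct dependencies.
DEPS = {
    "main1": ["lib1", "shared_lib"],
    "main2": ["lib2", "shared_lib"],
    "lib1": [],
    "lib2": [],
    "shared_lib": [],
}

def mark_live(deps, live_roots):
    """Mark every target reachable from a live root as live."""
    live = set()
    stack = list(live_roots)
    while stack:
        target = stack.pop()
        if target in live:
            continue
        live.add(target)
        stack.extend(deps.get(target, []))
    return live

def dead_targets(deps, live_roots):
    """Everything not reached by the liveness signal is a deletion candidate."""
    return sorted(set(deps) - mark_live(deps, live_roots))

print(dead_targets(DEPS, ["main1"]))
```

The shared library survives because main1 still reaches it, while main2 and lib2 are left over together.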

So far so good, but real production code has unit tests whose build targets depend on the libraries under test. This immediately makes graph traversal much more complicated.

The test infrastructure runs all tests, including lib2_test, even though lib2 is never “actually” run. This means that test execution cannot be used as a “liveness” signal. If it were, lib2_test would be considered live and lib2 would be kept forever. We would only be able to clean up untested code, which would seriously hinder our efforts.

What we really want is for each test to share the fate of the library it is testing. To achieve this, we make the library and its test depend on each other, which creates cycles in the graph.

This makes each library and its test a strongly connected component. We can use the same technique as before, marking the “live” nodes and then looking for collections of “dead” nodes to remove, but this time using Tarjan’s strongly connected components algorithm to handle the cycles.
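A rough sketch of this idea follows, under stated assumptions: the graph representation, the explicit test-to-library mapping, and the function names are hypothetical, and the recursive Tarjan implementation is a textbook version rather than Sensenmann’s code. Each test gets a reverse edge from its library, components are computed, and any component containing no live node is reported as a unit to delete.

```python
def tarjan_scc(graph):
    """Return the strongly connected components of graph (node -> successors)."""
    index_of, lowlink = {}, {}
    stack, on_stack, components = [], set(), []
    counter = 0

    def visit(node):
        nonlocal counter
        index_of[node] = lowlink[node] = counter
        counter += 1
        stack.append(node)
        on_stack.add(node)
        for nxt in graph.get(node, []):
            if nxt not in index_of:
                visit(nxt)
                lowlink[node] = min(lowlink[node], lowlink[nxt])
            elif nxt in on_stack:
                lowlink[node] = min(lowlink[node], index_of[nxt])
        if lowlink[node] == index_of[node]:
            # node is the root of a component: pop its members off the stack.
            component = set()
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.add(member)
                if member == node:
                    break
            components.append(component)

    for node in list(graph):
        if node not in index_of:
            visit(node)
    return components

def dead_components(deps, tests, live_roots):
    """Group dead targets into components that should be removed together."""
    graph = {target: list(ds) for target, ds in deps.items()}
    for test, lib in tests.items():
        graph.setdefault(lib, []).append(test)  # library -> test closes the cycle
    live, todo = set(), list(live_roots)
    while todo:
        target = todo.pop()
        if target not in live:
            live.add(target)
            todo.extend(graph.get(target, []))
    return [c for c in tarjan_scc(graph) if not (c & live)]

# Hypothetical graph: tests depend on the libraries they test.
DEPS = {
    "main1": ["lib1"], "main2": ["lib2"],
    "lib1": [], "lib2": [],
    "lib1_test": ["lib1"], "lib2_test": ["lib2"],
}
TESTS = {"lib1_test": "lib1", "lib2_test": "lib2"}
print(dead_components(DEPS, TESTS, ["main1"]))
```

Because lib1 now points back at lib1_test, the liveness signal from main1 reaches the test as well, while lib2 and lib2_test form a dead component of their own.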

Simple, isn’t it? If the relationship between the test and the library under test can be easily identified, then yes. Unfortunately this is not always the case. In the example above, we have a simple naming convention for matching tests to libraries, but in general you can’t rely on that heuristic.

Consider the following two cases.

On the left is an implementation of the LZW compression algorithm, split into separate compression and decompression libraries. The test actually exercises both, to make sure the data is not corrupted after being compressed and decompressed. On the right is web_test, which tests the web server library. It uses a URL encoder library for support, but does not actually test the URL encoder itself. On the left, we want to treat the LZW test and both LZW libraries as one connected component, while on the right we want to exclude the URL encoder and treat only web_test and web_lib as a connected component.

Despite requiring different treatment, these two cases have the same structure. In practice, we can encourage engineers to mark libraries like url_encoder_lib as “test-only” (i.e. used only to support unit testing), which helps in the web_test case. Otherwise, the current approach is to use the edit distance between the test name and the library names to pick the library most likely to match a given test. Identifying cases where one test covers two libraries, as in the LZW example, will probably require processing test coverage data, and this has not yet been explored.
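The name-matching heuristic can be sketched as follows. This is a minimal illustration using Levenshtein distance; the source only says “edit distance”, so the exact metric, the function names, and the way test-only libraries are filtered out are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]

def match_test_to_library(test_name, dep_names, test_only=()):
    """Pick the dependency whose name is closest to the test name.

    Libraries marked test-only are never considered the library under test.
    Returns None if no candidate remains.
    """
    candidates = [d for d in dep_names if d not in test_only]
    if not candidates:
        return None
    return min(candidates, key=lambda lib: edit_distance(test_name, lib))

print(match_test_to_library("web_test", ["url_encoder_lib", "web_lib"]))
```

A heuristic like this returns a single library per test, which is why it cannot recognise a case such as the LZW example where one test covers two libraries.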

Focus on users…

The ultimate beneficiaries of dead code removal are the software engineers themselves, many of whom appreciate help in keeping their projects tidy. But not everyone is happy to receive automated changelists that try to delete code they wrote. This is where the social engineering aspect of the project comes into play, and it is every bit as important as the software engineering.

Automatic code removal is a concept that is unfamiliar to many engineers, and many resist it just as they did when unit testing was introduced 20 years ago. Changing people’s minds takes time, effort, and a lot of careful communication.

Sensenmann’s communication strategy has three main parts. The change description is the most important, as it is the first thing reviewers will see. It should be concise, but should provide enough background for every reviewer to make a decision. Achieving this balance is difficult. If it’s too short, many people won’t be able to find the information they need. Too long and you end up with a wall of text that no one wants to read. Well-labeled links to supporting documentation and FAQs are very helpful here.

The second part is the supporting documentation. Again, concise and clear language is important, as is an easy-to-navigate structure. Different people need different information. Some just want the reassurance that the source control system can roll back a deletion. Others need guidance on how best to deal with a bad change, for example by fixing a misuse of the build system. With careful thought and repeated rounds of user feedback, the supporting documentation can become a useful resource.

The third part is handling user feedback. This can be the hardest part at times. Feedback is more often negative than positive, and it can sometimes require a cool head and a great deal of diplomacy. However, accepting that feedback and acting on it is the best way to avoid negative feedback in the future.

Onwards and upwards

It may sound strange to automatically remove code. Code is expensive to write and is generally considered an asset. However, unused code takes time and effort to maintain, and also to clean up. Once a code base reaches a certain size, it makes sense to invest engineering time in automating the cleanup process. At Google’s scale, it’s estimated that automatic code removal has paid for itself dozens of times over in saved maintenance costs.

Implementing it requires solutions to problems that are both technical and social in nature. Much progress has been made in both of these areas, but they are not yet fully solved. As improvements are made, however, the acceptance rate of deletion changelists will increase, making automatic deletion more and more impactful. This kind of investment doesn’t make sense everywhere, but if you have a huge single repository, it probably makes sense for you too. At least at Google, reducing the C++ maintenance burden by 5% is a huge benefit.

