Home

Awesome

Andersen's pointer analysis

Andersen's algorithm is a famous pointer-analysis algorithm proposed in L.O. Andersen's 1994 Ph.D. thesis. The core idea of his algorithm is to translate the input program with statements of the form "p = q" to constraints of the form "q's points-to set is a subset of p's points-to set", hence it is sometimes also referred to as "inclusion-based" algorithm. The analysis is flow-insensitive and context-insensitive, meaning that it just completely ignores the control-flows in the input program and considers that all statements can be executed in arbitrary order.

This project is an implementation of Andersen's analysis in LLVM. The entire algorithm is broken down into three phases:

In phase 1, we treats structs in LLVM-IR field-insensitively. This will yield worse result, but the analysis efficiency and correctness can be more easily guaranteed. We plan to move to a field-sensitive implementation in the future, but for now we want to do the quick dirty things first. Dynamic memory allocations are modelled by their allocation site.

In phase 2, two constraint optimization techniques called HVN and HU are used. The basic idea is to search for pointers that have equivalent points-to set and merge together their representations. Details can be found in Ben Hardekopf's SAS'07 paper.

In phase 3, two constraint solving techniques called HCD and LCD are used. The basic idea is to search for strongly-connected-components in the constraint graph on-the-fly. Details can be found in Ben Hardekopf's PLDI'07 paper ("The Ant and the Grasshopper").

Publications

Program Analysis and Specialization for the C Programming Language. Ph.D. thesis

Exploiting Pointer and Location Equivalence to Optimize Pointer Analysis. International Static Analysis Symposium (SAS), 2007

The Ant and the Grasshopper: Fast and Accurate Pointer Analysis for Millions of Lines of Code. ACM Conference on Programming Language Design and Implementation (PLDI), 2007

Building the project

To build Andersen's analysis, you need to have a C++ compiler with C++14 support installed (e.g. g++ 4.9 or later, clang++ 3.4 or later) as well as cmake 2.8.8 or later. It should compile without trouble on most recent Linux or MacXOS machines.

  1. Making sure that LLVM 6.0 is installed somewhere on your system and CMake is able to find it by setting LLVM_DIR variable if necessary (see here for more details). You might want to look at the commit history for how to cope with the API changes if you are interested in targeting older versions of LLVM (the project was initially written for LLVM 3.9 and patches were made whenever LLVM makes a new release).

  2. Checkout this project

  3. Build this project

cd <directory-you-want-to-build-this-project>
cmake <project-source-code-dir> -DCMAKE_BUILD_TYPE=<specify build type (Debug or Release)> -DBUILD_TESTS=<specify whether you want to build test files (ON or OFF)>
make

Note that in the configuration step you might want to consider setting the build mode (Release/Debug with or without Asserts) to match the build mode of your LLVM library.

Using Andersen's analysis

The analysis is implemented as an LLVM pass. By default it does not dump anything into the console, hence the only way you can extract information from it is to write another pass that take the AndersenAA pass as a prerequisite and make alias queries using AndersenAA's public interfaces. AndersenAA conforms to the standard LLVM AliasAnalysis pass, so it shouldn't be too difficult if you know how to use other build-in alias analysis in LLVM (like basicaa).

If you want points-to information rather than alias information, things become trickier. The Andersen pass does have all the points-to information available: check out Andersen::getPointsToSet(). Note that memory objects, in our case, are represented by their corresponding allocation site.

Limitations

Related projects

Check out my tpa repo for a full-blown flow-sensitive, context-sensitive, field-sensitive pointer analysis for LLVM. It hasn't been updated for a while and is woefully undocumented, but I hope it could give up some ideas of how to write such a thing.

If you only want a more precise alias analysis for your compiler pipeline, I'd recommend using the new CFLAliasAnalysis I contributed to in GSoC 2016. The Andersen variant is highly experimental, but it is in-tree and is much less of a hassle to use.