What is Reachability-Based Dependency Analysis?

Dependencies bring in risk, both operational and security. These risks can be difficult to detect and triage because, while developers release and track packages, actual reuse happens at the code level. Reachability analysis tracks the exact portion of a dependency's code that is reused by a client program. This allows us to tell whether a security problem in the dependency code affects the client, whether a dependency update is safe to perform, whether the client is linking against incompatibly licensed code, and whether code removals will affect downstream clients.

Written by Georgios Gousios (Chief Researcher at Endor Labs) and George Apostolopoulos (Engineer at Endor Labs)
Published on December 21, 2022

Engineering challenges

The convenience of declaratively specifying dependencies has led to the massive adoption of package managers and package repositories. While the startup cost for projects seems low, dependency reuse does not come for free. The software engineering research community has thoroughly documented several issues around dependency reuse. From the perspective of a library user, it is hard to keep track of dependency updates, especially for transitive dependencies, and to assess their impact on the client code base; the API provisions of semantic versioning are rarely respected in practice by library maintainers; and entrusting valuable data to code that the package manager automatically downloads is often not a wise choice. From the perspective of the library maintainer, it is difficult to evolve APIs, for example by removing methods, without breaking clients, while incentives for professional maintenance of library code are lacking. In addition, since libraries may themselves depend on other libraries, library maintainers face the same issues as library users.

In addition to these development challenges, dependencies bring in risk, both operational and security. For a long time the software industry has used tools to manage some aspects of operational risk, like license exposure. But as organizations realize the extent of their reliance on third-party software, their understanding of operational risk is evolving. Today, organizations are starting to worry about the possibility that a particular piece of third-party software is abandoned by its maintainers (often referred to as the "bus factor") or becomes a vehicle for political statements.

Security

And of course, there is the security risk of vulnerabilities that may be lurking inside third-party code. Again, the software industry has been using tools to track this risk for some time, but in recent years two watershed events added a whole new urgency to the task. First, the SolarWinds compromise made everyone aware of the concept of the "software supply chain" and of how a compromise in one of the early stages of the chain can have catastrophic, very hard to prevent consequences for the users who sit at the end of it, with a devastating multiplier effect. Then, the Log4j vulnerability created the nightmare scenario of a flaw that is easy to exploit remotely and is present in one of the most popular open source packages in the Java ecosystem. It immediately made a massive amount of software and systems vulnerable, resulted in very significant remediation costs, and exposed along the way the limitations of existing tools and organizational response playbooks (Do I use Log4j? In which software? What is the impact?). All of this even prompted governments to both help companies address the danger and nudge them to take action.

So today the security landscape for software has changed. New types of attacks target package managers and the whole infrastructure for discovering and consuming open source software. Consumers of open source software are starting to worry about the security practices of the projects they use and the possibility that this software gets compromised. Efforts are underway to identify, audit, and harden "critical" open source packages [ref: Google's project]. These are all new use cases with little coverage from existing tools. At the same time, traditional vulnerability management tools (and the organizations that use them) are collapsing under the weight of a constantly accelerating stream of newly discovered vulnerabilities. Simply looking for CVEs published against open source software is gradually becoming useless, since any medium-sized project with multiple open source dependencies is almost guaranteed to have some vulnerabilities in its dependency graph. Even though these vulnerabilities may never be triggered, either because there is no path in the code that leads to them or because the particular configuration of the vulnerable software makes them impossible to reach, security teams are forced to investigate and act on hundreds of potential vulnerabilities because they lack the tools to get the proper context.

Changing the Abstraction

At Endor Labs, we believe that the current tools used to keep track of dependencies (SBOM generators, update notifiers, ...) work at the wrong level of abstraction: while developers release and track packages, actual reuse happens at the code level. This disconnect sits at the root of most of the problems described above. If we could somehow track the exact portion of a dependency's code that is reused by a client program, we would be able to tell whether a security problem in the dependency code affects the client, whether a dependency update is safe to perform, whether the client is linking against incompatibly licensed code, and whether code removals will affect downstream clients.

We therefore use (static) call graphs as a more precise instrument for the analysis of dependency management.

What are Call Graphs?

A call graph is the result of applying a program analysis to determine whether a function f calls another function y. Call graphs can be either static or dynamic.

Static Call Graphs

Static call graphs are constructed as follows: we instantiate the analysis by populating a graph with a set V of unique nodes that correspond to the functions in the program and its dependencies. For each function f, we go over its body and identify all locations (call sites) where a call to another function happens. We then need to identify the target y of each call within the set V. Depending on the language used and its calling semantics, we may need to create one or more edges in E to the identified targets. The process continues until we have covered all call sites. The result is a graph G = (V, E), where an edge in E statically identifies a potential function call.
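
To make this concrete, here is a minimal sketch of the construction for Python source, using the standard-library ast module. It resolves call targets purely by name within a single module, which is a drastic simplification of the analysis described above; the example functions are made up for illustration.

```python
import ast

def build_static_call_graph(source: str) -> dict[str, set[str]]:
    """Naive static call graph: one node per function, name-based call resolution."""
    tree = ast.parse(source)
    call_graph: dict[str, set[str]] = {}

    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            edges = call_graph.setdefault(node.name, set())  # add the node to V
            # Scan the function body for call sites and record each target.
            for inner in ast.walk(node):
                if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                    edges.add(inner.func.id)  # add an edge to E: node.name -> callee
    return call_graph

example = """
def parse(data):
    return validate(data)

def validate(data):
    return bool(data)

def main():
    parse("payload")
"""

print(build_static_call_graph(example))
# {'parse': {'validate'}, 'validate': {'bool'}, 'main': {'parse'}}
```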

Why potential and not exact calls? Because of calling semantics. In languages with static dispatch and no function pointers, we can indeed create complete call graphs. But no such useful language exists in practice, because abstraction mechanisms require dynamism. Virtual dispatch, i.e., the mechanism object-oriented languages use to determine the target of an interface call at runtime, requires a dynamically updatable table of potential call targets (known as the vtable). For static analysis to work, we need to imitate the vtable's behavior, and therefore we need to link all potential call targets.
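
The hypothetical snippet below illustrates why: at the call site inside load, the concrete type of the receiver is only known at runtime, so a sound static call graph has to include an edge to every override that could be the target.

```python
class Parser:
    def parse(self, data): ...

class JsonParser(Parser):
    def parse(self, data):
        return {"json": data}

class XmlParser(Parser):
    def parse(self, data):
        return {"xml": data}

def load(parser: Parser, data):
    # Statically, parser.parse(...) may dispatch to JsonParser.parse or
    # XmlParser.parse, so the call graph conservatively gets both edges:
    #   load -> JsonParser.parse
    #   load -> XmlParser.parse
    return parser.parse(data)
```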

Dynamic Call Graphs

Dynamic call graphs are more straightforward to construct, provided that the runtime environment of a program (language or operating system) can be instrumented. If this is the case, we can inject probes (triggers) that fire i) every time a function call is made, and ii) every time a call returns. We then maintain a stack per running thread: when a function is called, we push its name onto the stack; when it returns, we pop the stack and write out a caller -> callee pair. At the end of the execution, the written pairs naturally form a graph, which is the call graph for that particular execution. Tools like DTrace, Callgrind, and various profilers enable developers to create dynamic call graphs in a straightforward manner.
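
As a small illustration, here is a minimal dynamic call graph collector in Python. sys.setprofile plays the role of the injected probes, and the interpreter already maintains the per-thread call stack for us through frame objects, so we only record caller -> callee pairs; the example functions are made up.

```python
import sys

edges: set[tuple[str, str]] = set()

def probe(frame, event, arg):
    # Fires on every Python-level call; the caller is the previous stack frame.
    if event == "call" and frame.f_back is not None:
        edges.add((frame.f_back.f_code.co_name, frame.f_code.co_name))

def helper():
    return 42

def work():
    return helper()

sys.setprofile(probe)   # start collecting
work()
sys.setprofile(None)    # stop collecting

print(edges)  # includes ('work', 'helper') for this particular execution
```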

Tradeoffs

Various trade-offs exist between static and dynamic call graphs. For the reasons outlined above, static call graphs are over-approximations of a program's behavior in practice, and therefore imprecise. This means that, if they are constructed naively, they are bound to flood developers with false positive findings (for example, calls to vulnerable dependency code that may never happen in practice). Various techniques exist to make static call graphs more precise. Dynamic call graphs, on the other hand, are de facto precise, as they represent real program executions; however, they are not complete, as they are the result of one (and not every) possible program execution.

At Endor Labs, we rely on static call graphs to perform dependency analysis at a fine-grained level. We chose static analysis as our weapon because it is minimally intrusive to the developer workflow, can provide results while the program is still being developed, does not depend on good test coverage and, most importantly, because completeness is an absolute must for any downstream security analysis task. Our world-class team of static analysis experts applies the most advanced static analysis available per programming language to minimize false positives to the degree enabled by the current state of the art. We also use dynamic call graphs internally to form benchmarks against which our static call graphs are measured and optimized.

To deal with ecosystem scale, we create a call graph per package version, store these call graphs, and then, given a client's dependency set, stitch them together on the fly (a minimal sketch of this step follows). In an upcoming post, we will elaborate on how we scale our static analysis to entire ecosystems by presenting our Java analysis stack.
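
To give a flavor of the stitching step, here is a minimal sketch that merges pre-computed per-package call graphs for a resolved dependency set. The adjacency-map representation keyed by fully qualified function names and the package names are illustrative assumptions, not our actual storage format.

```python
def stitch(dependency_set: list[str],
           per_package_graphs: dict[str, dict[str, set[str]]]) -> dict[str, set[str]]:
    """Merge the stored call graph of every package in the resolved dependency set."""
    stitched: dict[str, set[str]] = {}
    for package in dependency_set:
        for caller, callees in per_package_graphs.get(package, {}).items():
            stitched.setdefault(caller, set()).update(callees)
    return stitched

# Illustrative per-package graphs (fully qualified function names are made up).
per_package = {
    "client": {"client.main": {"libA.parse"}},
    "libA":   {"libA.parse": {"libB.decode"}},
    "libB":   {"libB.decode": set()},
}

graph = stitch(["client", "libA", "libB"], per_package)
print(graph["client.main"])  # {'libA.parse'}
```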

Reachability and Impact Analysis

Once we have a call graph, a whole new level of clarity is available for dependency analysis.

The contextual information provided by the call graph is extremely valuable in addressing multiple use cases in both security and operational risk. Assuming we have built the dependency graph of our software, knowing the call graph that underlies it allows us to determine two key pieces of information:

  • Reachability: which answers the question "is my code calling a vulnerable function?"
  • Impact: which answers the question "if I change this function, which parts of the codebase will be impacted?"

The two may sound similar, but they are computed differently (a minimal sketch of both passes follows the list):

  • Reachability involves a "forward" pass over the call graph, starting from the client and checking which packages in the dependency set can be reached by following call edges. If no path leads to a package, that package is unreachable.
  • Impact involves a "backward" pass over the call graph, starting from the target function and walking back along call edges to determine the places from which this function can be called. If none of these paths leads to the client code, the impact set of the target function is empty.
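
Here is a minimal sketch of both passes over a stitched call graph, using the same illustrative adjacency-set representation as above: reachability is a forward traversal from the client's entry points, and impact is the same traversal over the reversed graph. The graph contents and entry points are made up.

```python
from collections import deque

# Illustrative stitched call graph (fully qualified function names are made up).
call_graph = {
    "client.main": {"libA.parse"},
    "libA.parse":  {"libB.decode"},
    "libB.decode": set(),
    "libC.unused": {"libC.helper"},   # never called from the client
    "libC.helper": set(),
}

def reachable(graph: dict[str, set[str]], start: set[str]) -> set[str]:
    """Breadth-first traversal: every node reachable from the start set."""
    seen, queue = set(start), deque(start)
    while queue:
        for callee in graph.get(queue.popleft(), set()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen

def impact_of(graph: dict[str, set[str]], target: str) -> set[str]:
    """Backward pass: every function from which `target` can be reached."""
    reverse: dict[str, set[str]] = {}
    for caller, callees in graph.items():
        for callee in callees:
            reverse.setdefault(callee, set()).add(caller)
    return reachable(reverse, {target})

# Forward pass from the client's entry point: libC never shows up, so it is unreachable.
print(reachable(call_graph, {"client.main"}))   # {'client.main', 'libA.parse', 'libB.decode'}

# Backward pass from a (hypothetically vulnerable) function: the client is affected.
print(impact_of(call_graph, "libB.decode"))     # {'libB.decode', 'libA.parse', 'client.main'}
```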

It is not hard to see how useful this context can be. At the dependency level, if we know that a dependency is not reachable from a client application, we can try to remove it, eliminating any operational and security risk coming from that dependency. If we are investigating vulnerabilities and we know which vulnerable functions are involved, we can determine which of these vulnerabilities are not reachable and prioritize dependency updates accordingly. Other, more subtle use cases also become possible. The call graph tells us how "much" of a dependency is used: are we just using one API out of the many that the dependency provides? Could we replace or remove a dependency that is not tightly coupled with our code, especially if it brings a lot of potential risk? As an anecdotal example, our own codebase had a dependency that pulled in a very large number of other dependencies, only because we used one very simple API from it. Without these tools it is not even possible to figure out that this is happening, let alone think about how to fix it.
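
As a rough illustration of the "how much of a dependency do we use" signal, one can compare a package's functions against the reachable set computed above; the function names here are again hypothetical.

```python
def usage_ratio(package_functions: set[str], reachable_functions: set[str]) -> float:
    """Share of a package's functions that are reachable from the client."""
    if not package_functions:
        return 0.0
    return len(package_functions & reachable_functions) / len(package_functions)

print(usage_ratio({"libC.unused", "libC.helper"},
                  {"client.main", "libA.parse", "libB.decode"}))
# 0.0 -> nothing in libC is reachable, so it is a candidate for removal
```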

Combining the above with the visibility that comes from the combined dependency and call graphs, we are building a very powerful tool that allows developers and security engineers to see how dependencies are used in their application and where dependencies are imported from, to determine the operational and security risk that comes with each dependency, and to assess the best strategy for remediating that risk.

There are caveats: computing call graphs is complex and inherently imprecise, so there is always a risk of false positives (where we believe something is reachable while it is not) and false negatives (where something that is reachable is deemed unreachable). Different use cases have different sensitivity to false positives and false negatives; for security risk, for example, false negatives can be particularly problematic. Ultimately, the right workflows make these tools most effective: reachability information should be used to prioritize attention and resources. Security teams have limited budgets and cannot deal with every single vulnerability or review every new dependency. By providing the right context, we can help teams prioritize and use their limited resources most effectively.

Acknowledgements

Much of the initial work described here was done at the first author's Software Analytics lab at the TU Delft and then, at a larger scale, by the FASTEN project. Portions of this post are based on: P. Boldi and G. Gousios, “Fine-Grained Network Analysis for Modern Software Ecosystems,” ACM Trans. Internet Technol., vol. 21, no. 1, Dec. 2020.
