Dependency Resolution in Python: Beware The Phantom Dependency

Phantom dependencies are dependencies used by your code that are not declared in the manifest. If you miss them, they can sneak reachable risks into your application, lead to false positives, or produce inaccurate SBOMs. All very spooky. This article breaks down how phantom dependencies happen, and how to catch them.

Written by

Anand Sawant

Published on

September 28, 2023

Topics

SCA

Package managers are tools that simplify the process of installing and managing packages that your application depends on. All mainstream programming languages have their own ecosystem of packages and package managers to go with them. Java has Maven, C# has NuGet, JavaScript has npm and Python has pip.

Using a package manager to install dependencies involves defining a manifest file that declares all the dependencies that the application requires. For Python projects, this can be done with setup.py, pyproject.toml or requirements.txt files. For example, here is an excerpt from the package installs for the OpenAI Baselines project:

Here we see seven dependencies explicitly declared, with version restrictions applying only to the `gym` package. For the other packages, the package manager will resolve the most appropriate version given the Python version, operating system version, operating system architecture and compatibility with other packages.

Obstacles to creating an accurate dependency tree

While package managers make installing packages easy, they do not make package management easy. By package management we mean: (1) keeping package versions up to date, (2) removing packages that are no longer used, and (3) identifying issues in the package versions currently in use and flagging them to the developer.

As a result of this, there is a lot of onus on developers/engineers/DevOps to ensure that manifest files are regularly updated. However, this can often go wrong in one of the following ways:

Adding a dependency to the virtual environment but not updating the manifest file

Python makes it easy to install a dependency into a project’s virtual environment, or directly into global scope, by invoking `pip install <package_version>` without updating any manifest file. In addition, dependencies may be provided by the host operating system, and those automatically become part of Python’s library search path. Java is different: Maven or Gradle builds and runs the application, so every dependency the code uses must be declared for the code to compile and to be added to the runtime classpath. In Python, if a package has been added to the virtual environment and used without updating requirements.txt, the runtime will not complain and will simply keep running the application. What makes this worse is that, instead of updating the manifest file, engineers often update the platform with the new packages directly, and the only documentation of this might be in a README file or buried deep in a Slack conversation. As a result, the build is not easily reproducible in new environments where the packages are not already installed.
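
One way to make this drift visible is to compare what is installed in the environment with what the manifest declares. The following is a minimal sketch under stated assumptions, not a complete requirements parser: the helper names (`normalize`, `declared_names`, `installed_names`, `undeclared`) are hypothetical, and it only handles simple one-requirement-per-line requirements.txt content.

```python
import re
from importlib import metadata


def normalize(name: str) -> str:
    """Normalize a distribution name (lowercase; runs of -, _ and . become -)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def declared_names(requirements_text: str) -> set[str]:
    """Extract distribution names from simple requirements.txt content.

    Assumption: one requirement per line. Option lines (-r, -e, --hash),
    URLs and environment markers are skipped or ignored in this sketch.
    """
    names = set()
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.add(normalize(match.group(0)))
    return names


def installed_names() -> set[str]:
    """Names of the distributions visible to the current interpreter."""
    names = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name") if dist.metadata else None
        if name:
            names.add(normalize(name))
    return names


def undeclared(installed: set[str], declared: set[str]) -> set[str]:
    """Distributions present in the environment but absent from the manifest."""
    return installed - declared
```

Calling `undeclared(installed_names(), declared_names(text))` on a real environment will also list transitive dependencies and tooling such as pip itself, so the result is a starting point for review rather than a list of phantom dependencies.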

Allowing the platform to provide packages and their versions

With Python being the dominant platform for all forms of AI development, packages such as TensorFlow and Torch are often required by an application. However, choosing the specific version of tensorflow or torch is often left to the platform where the application runs, because the version and setup of the package depend heavily on the platform and (if using Anaconda) the conda environment. If we look at the OpenAI Baselines project’s manifest that we showed earlier, we see no mention of tensorflow. However, the README file for the project indicates that anyone using the project must install tensorflow. For provided dependencies such as this, it is by design that the package manifest and the actual dependency set are out of sync.

Removing packages that are no longer used

In Go, the package manager and the compiler are tightly intertwined: unused imports are compile errors, and `go mod tidy` prunes modules that are no longer needed from the manifest file. Python has no such mechanism. This can lead to manifest file bloat and the declaration of dependencies on packages that are no longer used.
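
Declared-but-unused packages can be approximated by comparing the manifest with the modules the code actually imports. The sketch below uses the standard `ast` module; the function names are hypothetical, and the mapping from a distribution to the import names it provides (for example `opencv-python` providing `cv2`) is assumed to be supplied by the caller.

```python
import ast


def imported_top_level_modules(source: str) -> set[str]:
    """Top-level module names imported anywhere in one Python source string."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level > 0 means a relative import of the project's own code
            modules.add(node.module.split(".")[0])
    return modules


def unused_declarations(
    declared: dict[str, set[str]], imported: set[str]
) -> set[str]:
    """Declared distributions none of whose import names appear in the code.

    `declared` maps a distribution name to the import names it provides;
    building that map is outside the scope of this sketch.
    """
    return {dist for dist, names in declared.items() if not names & imported}
```

Static import scanning misses dynamic imports (`importlib.import_module`, plugins loaded by name), so a package reported here is a candidate for removal, not proof that it is unused.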

Direct usage of transitive dependencies is allowed

Newer build tools such as Bazel, which are not tied to one language or ecosystem, can enforce that an application uses only the dependencies it has declared directly, thereby disallowing direct usage of transitive dependencies. Python does not impose such restrictions. Applications have direct access to any package that has been installed in the virtual environment or in global scope, irrespective of whether it was declared as a direct dependency.
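
To illustrate, imports that resolve to an installed distribution the manifest never declares can be flagged with a simple lookup. This is a sketch under stated assumptions: `phantom_imports` and `rough_normalize` are hypothetical helpers, and the import-name-to-distribution mapping is passed in by the caller.

```python
from collections.abc import Mapping


def rough_normalize(name: str) -> str:
    """Rough name normalization: lowercase, underscores and dots to hyphens."""
    return name.lower().replace("_", "-").replace(".", "-")


def phantom_imports(
    imported: set[str],
    declared: set[str],
    import_to_dists: Mapping[str, list[str]],
) -> dict[str, list[str]]:
    """Imports provided by an installed distribution that is not declared.

    imported:        top-level module names used by the application
    declared:        distribution names listed in the manifest
    import_to_dists: import name -> distributions providing it
    """
    declared_norm = {rough_normalize(d) for d in declared}
    phantoms = {}
    for module in sorted(imported):
        providers = import_to_dists.get(module, [])
        # Modules with no provider (stdlib, local code) are not flagged here.
        if providers and not any(
            rough_normalize(p) in declared_norm for p in providers
        ):
            phantoms[module] = list(providers)
    return phantoms
```

In a real environment the mapping could come from `importlib.metadata.packages_distributions()` (available in Python 3.10 and later), which reports which installed distributions provide each top-level import name.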

So what do the intricacies of Python’s dependency management system mean for the SCA tools world?

Incorrect SBOMs

Traditional SCA tools rely on the Python manifest file to resolve dependencies. However, as highlighted above, such an approach can be problematic. It can make the SBOM incorrect, which leads to compliance issues and erodes the trust of downstream consumers.

False negatives

Not having an accurate picture of what is being used is a major obstacle to flagging what vulnerabilities affect an application. This limits the utility of the SCA tool being used while simultaneously providing the engineering/AppSec team with a false sense of security. While false positives may create noise, false negatives are more nefarious, allowing for potential incidents to occur.

False positives

Dependencies that remain in the manifest but are no longer used generate unnecessary noise. This increases overhead for the engineering team, which has to deal with the false positives, confirm that a package is indeed not being used, and then justify the findings to security peers, who often rely on the reporting from their security tooling, such as SCA.

Consequences of poor dependency management

So, how do we fix this? The traditional answer from most SCA vendors is to ask the organization to fix its dependency management so that a clean scan can be performed. However, fixing it is often absurdly complex, because most of the institutional knowledge about which package is used, why it is used and where it is used lives in Slack channels or other internal communication tools and is not explicitly documented. Having a DevOps team track all this down while not knowing the intricacies of the application is a recipe for disaster. We here at Endor Labs believe that there is a better way to do dependency resolution.

If we revisit the OpenAI Baselines project and do what every other SCA tool does, we would see the following dependency tree:

[Figure: dependency tree resolved from the manifest alone]

However, we know that this project requires tensorflow as a provided dependency, and this is not something that the SCA tool picks up. We use our static analysis framework to process the project’s source code and understand which packages are being imported and used in the application. For example, here is an excerpt of imports from one of the files in the baselines project:

[Code excerpt: imports from one of the files in the baselines project]

We see clearly that tensorflow is used, and that this is not one of those cases where the project states that it needs a package but does not use it. By tracking the direct imports of the application and connecting them to the file that declares the class or method in the virtual environment or in global scope, we are able to recover the first-level (direct) dependencies of the application. We then recursively traverse all the files of these dependencies to find their dependencies (that is, the application’s transitive dependencies) until we have traversed the entire dependency tree. With our approach we see the following dependency tree:

[Figure: dependency tree resolved with static analysis of the source code]

What we see here is that tensorflow is indeed a dependency and that the version in use is the macOS-specific build, version 2.13.0. Furthermore, with this dependency tree we also see other packages that are directly depended on (despite being brought in transitively by tensorflow), such as werkzeug, six and others. This shows that our approach is able to overcome the first, third and fourth challenges to Python dependency resolution that were outlined before. The second challenge is something we address with our reachability analysis.

Conclusion

Overall, what we have seen here is that dependency resolution for Python is non-trivial and requires more than a reliance on manifest files to do correctly. Compared with traditional SCA vendors, our approach, which relies on static analysis of the code, results in a more complete resolved dependency tree. Software asset inventory has been a critical security control for decades, but without proper dependency resolution and visibility, organizations can’t protect what they don’t know is being used. If there is one main message that all readers should take away from this blog post, it is: Declaration (or lack thereof) of dependencies != actual usage of dependencies.
