Project popularity indicators in the PyPI repository
Reinventing the wheel is an exciting and very important activity in the life of every developer. But sometimes you're lacking inspiration, or the clock is ticking, and you go in search of a ready-made project to solve your problem. For example, let's say you want to find a library that works like requests, but with request caching and TTL support. You just have to search the Python Package Index, the repository that "pip install ..." downloads packages from by default. Searching for "requests cache" immediately turns up several packages to choose from:
- django-in-request-cache
- django-request-cache
- requests-cache
- requests-cache-latest
- requests-etag-cache
- requests-filecache
Then you can choose the package that best suits your needs. Checking the info on the page, you see that, in addition to the description, the projects have reputation statistics.
Figure 1. An example of a popular package
A solid 1000 stars on a GitHub package is a strong argument for choosing it, since it seems to be popular among developers. And even if another package on the list seems to fit your needs slightly better, chances are that the package's reputation will play a major role in your final decision.
Project statistics on GitHub are one of the most commonly used ways to assess the quality of a project and the trust the community has in it. However, this only works for package developers who act in good faith. When investigating malicious packages, we often find that attackers "tweak" the statistics of their projects in PyPI to make them more attractive to potential victims. We decided to find out how serious this problem is.
The perspective of a package publisher
The administrators of the Python Package Index have tried to make it very simple to add a package to PyPI.org: all you have to do is create an account and then upload your package release via the twine console utility. For added convenience, there are detailed instructions explaining all the possible pitfalls in the process: Python Packaging User Guide: Packaging Python Projects.
When following the tutorial on creating a package on PyPI.org, in the Configuring metadata section you need to create a pyproject.toml with the following contents:
[project]
name = "example_package_YOUR_USERNAME_HERE"
version = "0.0.1"
authors = [
    { name="Example Author", email="author@example.com" },
]
description = "A small example package"
readme = "README.md"
requires-python = ">=3.7"
classifiers = [
    "Programming Language :: Python :: 3",
    "License :: OSI Approved :: MIT License",
    "Operating System :: OS Independent",
]

[project.urls]
"Homepage" = "https://github.com/pypa/sampleproject"
"Bug Tracker" = "https://github.com/pypa/sampleproject/issues"
In this section, you're asked to change only the name of the project, leaving the rest of the information the same.
After completing the guide, you might end up with a suspicious situation on your page: your newly minted project already has some nice looking statistics.
Figure 2. Hello, Habr!
At this stage, you realize that something really strange is going on: nothing validates that the "sampleproject" repository really belongs to you, yet the statistics were taken from it. This leads to the kind of malpractice that we see with malicious packages.
Figure 3. An example of a malicious package
Figure 4. The package installs the miner
Statistics
Our cybersecurity threat research team began monitoring PyPI.org a year ago, in March 2022. We collected information not only on all the existing projects on PyPI, but also on those that were removed during the year. In this analysis, we will try not to resort to auditing the code itself, but refer only to the metainformation available on the project page: the malicious contents can differ depending on the attacker's goals (we wrote about these goals in the previous article), but we're interested in looking just at the data, independent of the project's logic.
Figure 5. GitHub repository rankings according to use as a homepage in PyPI projects
As you can see in the diagram above, the most popular repository is sampleproject. This is not surprising: it's owned by the Python Packaging Authority and is used in the packaging instructions mentioned above.
Among the projects that link to sampleproject, one in five has "test", "hello", or "example" in its name. This suggests that they were probably posted by users learning how to publish packages. You can also find packages named in a manner similar to cybersquatting (the registration of a domain using a name similar to that of an existing company or trademark unrelated to the registrant): gptcloud, scancer, onelibrary, onetrees, scfl. Besides that, there are clear cases of typosquatting (using a package name similar to an original, in the hope of "trojaning" a developer who makes a typo): pandass, google.com, source_s3. There are also attractive offers: cash_app_money_generator_2022_updated_free_cash_app_hack_no_verification_1_6yd3wc, google_play_gift_card_redeem_code_generator. In the latter case, it's worth noting that such phishing accounts are easy to identify not only by the number of separators in the name, but also by the reuse of metainformation fields (for example, the same contact email address or the same homepage appears in every package):
Figure 6. The incredible creativity of package creator
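The two signals mentioned above, many separators in the name and reused metainformation fields, are easy to express in code. The following is a rough sketch under our own assumptions, not the logic of any production service: the record layout (name, email, homepage) and the example email addresses are hypothetical.

```python
from collections import defaultdict


def separator_count(name: str) -> int:
    """Count the '-', '_' and '.' separators in a package name."""
    return sum(name.count(sep) for sep in "-_.")


def reused_metadata(packages, field, min_count=2):
    """Group package names by a metadata field and keep values shared by several packages."""
    groups = defaultdict(list)
    for pkg in packages:
        value = (pkg.get(field) or "").strip().lower()
        if value:
            groups[value].append(pkg["name"])
    return {value: names for value, names in groups.items() if len(names) >= min_count}


# Hypothetical records: the package names come from the text, the emails are made up.
PACKAGES = [
    {"name": "google_play_gift_card_redeem_code_generator",
     "email": "spam@example.com",
     "homepage": "https://github.com/pypa/sampleproject"},
    {"name": "cash_app_money_generator_2022_updated_free_cash_app_hack_no_verification_1_6yd3wc",
     "email": "spam@example.com",
     "homepage": "https://github.com/pypa/sampleproject"},
    {"name": "pandass",
     "email": "someone@example.org",
     "homepage": "https://github.com/pypa/sampleproject"},
]

shared_emails = reused_metadata(PACKAGES, "email")
for email, names in shared_emails.items():
    print(f"{email} is reused by {len(names)} packages")
for pkg in PACKAGES:
    print(pkg["name"][:30], "separators:", separator_count(pkg["name"]))
```

Neither signal proves anything on its own; together they only help prioritize which packages deserve a closer look.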
Over the course of the year, a tenth of the packages created with the sampleproject homepage were removed. Among them were both legitimate packages and malicious ones. Examples of malicious packages include tiktok8 (a trojan that downloads a payload), gardenscapes_hack_coins_free_working_2022 (adware that opens a website via selenium), and kfaction (a Discord account stealer).
Let's also highlight the GitHub repositories that have a large number of removed PyPI packages:
- gmyrianthous/example-publish-pypi (45 out of 110 packages, 41%). Another guide for creating packages.
- psf/requests (74 out of 80 packages, 93%). The official repository of the requests project. There's an interesting detail here: almost all the removed packages are malicious and have names that imitate the name of the original package: requst, requests, reeuests, request, etc. This is an example of the previously mentioned typosquatting technique.
- kotko/bravado-decorators (40 out of 41 packages, 98%). This situation is also remarkable: this is part of the activity of the security researcher kotko. In some of the packages, system information is collected along with an external IP address and then sent to the server. For the user, this is already undesirable activity.
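The percentages in this list follow directly from the counts. As a quick check, here is the arithmetic in Python; the helper name removed_share is our own, and it rounds half up because Python's built-in round() rounds 92.5 down to 92.

```python
import math

# (removed, total) package counts per repository, as quoted in the list above.
REMOVED_COUNTS = {
    "gmyrianthous/example-publish-pypi": (45, 110),
    "psf/requests": (74, 80),
    "kotko/bravado-decorators": (40, 41),
}


def removed_share(removed: int, total: int) -> int:
    """Percentage of removed packages, rounded half up."""
    return math.floor(100 * removed / total + 0.5)


for repo, (removed, total) in REMOVED_COUNTS.items():
    print(f"{repo}: {removed_share(removed, total)}% removed")
```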
How PT PyAnalysis detects the reuse of metainformation
Figure 7. Statistics on the connections between PyPI projects and GitHub repositories
Detecting StarJacking can be done step by step. First, out of just under 450,000 packages, we set aside those with no homepage specified: this eliminates a third of the packages. For another 4% of all packages, the homepage returns a 404 error, so technically StarJacking doesn't work: no stars are imported, and in practice nothing is falsified (more on that in the "Other interesting observations" section).
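These first filtering steps could be sketched roughly as follows. This is an illustration under our own assumptions, not the PT PyAnalysis implementation: the Triage categories and the fetch_status callback are hypothetical, and the separate bucket for homepages outside GitHub is an addition made here for completeness. The HTTP lookup is injected so the example runs without network access.

```python
from enum import Enum
from typing import Callable, Optional
from urllib.parse import urlparse


class Triage(Enum):
    NO_HOMEPAGE = "no homepage specified"
    NOT_GITHUB = "homepage outside GitHub"
    BROKEN_LINK = "GitHub link returns 404"
    NEEDS_TWO_WAY_CHECK = "working GitHub link"


def triage(homepage: Optional[str], fetch_status: Callable[[str], int]) -> Triage:
    """Sort a package into one of the early buckets based on its homepage."""
    if not homepage or not homepage.strip():
        return Triage.NO_HOMEPAGE
    url = homepage.strip()
    host = (urlparse(url).hostname or "").lower()
    if host not in ("github.com", "www.github.com"):
        return Triage.NOT_GITHUB
    if fetch_status(url) == 404:
        return Triage.BROKEN_LINK  # no stars can be imported from a missing repository
    return Triage.NEEDS_TWO_WAY_CHECK


# Canned HTTP statuses standing in for real requests; the second URL is made up.
FAKE_STATUSES = {
    "https://github.com/pypa/sampleproject": 200,
    "https://github.com/example-user/missing-repo": 404,
}


def fake_fetch(url: str) -> int:
    return FAKE_STATUSES.get(url, 200)


for url in (None, "https://example.com/docs", *FAKE_STATUSES):
    print(url, "->", triage(url, fake_fetch).value)
```

Only packages that end up in the last bucket need the more expensive two-way verification.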
It gets more difficult from here. The project points to a working GitHub link. In order to confirm the connection, it must be two-way. A GitHub repository can be considered to have "recognised" a PyPI package if one of these two situations occurs:
- The installation files of the Python project in the repository (setup.py, setup.cfg, pyproject.toml or Pipfile) explicitly name the PyPI package.
- The repository's documentation files state that the project can be installed using the "pip install project_name_to_pypi" command.
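The two conditions above can be approximated with simple text matching. The sketch below is a simplification written for illustration and not the actual detection logic: the function names are hypothetical, the regular expressions will miss dynamic metadata (for example a name computed in setup.py), and the name pattern can produce false positives on unrelated "name = ..." assignments.

```python
import re

# Matches name = "pkg", name='pkg' or name = pkg in setup.py, setup.cfg or pyproject.toml.
NAME_DECLARATION = re.compile(r"""\bname\s*=\s*["']?([A-Za-z0-9][A-Za-z0-9._-]*)""")
# Matches "pip install pkg", optionally with flags such as -U before the name.
PIP_INSTALL = re.compile(r"pip3?\s+install\s+(?:-\S+\s+)*([A-Za-z0-9][A-Za-z0-9._-]*)")


def normalize(name: str) -> str:
    """Normalize a project name the way PyPI compares names (PEP 503)."""
    return re.sub(r"[-_.]+", "-", name.strip(".-_")).lower()


def declares_package(build_files: dict, package: str) -> bool:
    """Condition 1: a build file in the repository names the PyPI package."""
    target = normalize(package)
    return any(
        normalize(match.group(1)) == target
        for text in build_files.values()
        for match in NAME_DECLARATION.finditer(text)
    )


def readme_mentions_install(readme: str, package: str) -> bool:
    """Condition 2: the documentation says 'pip install <package>'."""
    target = normalize(package)
    return any(normalize(m.group(1)) == target for m in PIP_INSTALL.finditer(readme))


def repository_recognises(build_files: dict, readme: str, package: str) -> bool:
    return declares_package(build_files, package) or readme_mentions_install(readme, package)


# Hypothetical repository contents for a package called dream-core.
build_files = {"pyproject.toml": '[project]\nname = "dream-core"\nversion = "1.0"\n'}
readme = "Install with:\n\n    pip install -U dream-core.\n"

print(repository_recognises(build_files, readme, "dream_core"))
print(repository_recognises(build_files, readme, "dream-sdk"))
```

Comparing normalized names matters because PyPI treats dream_core, dream.core and Dream-Core as the same project.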
We cannot rely on the presence of identical developer nicknames (it's difficult to directly confirm that the PyPI and GitHub accounts belong to the same person; more in the "Other interesting observations" section) or project names, since an attacker can copy them relatively easily.
It turns out that if a PyPI package links to a repository that is not a Python project (for example, if the package is an API client to a service in another programming language), then it must be checked manually. According to our statistics, the proportion of such packages is 1%. All we can do is warn users about the possible dangers, since most of the packages are legitimate. Meanwhile, there's nothing to stop attackers from linking to them in the guise of, say, an SDK. That's why we are closely monitoring such packages.
Figure 8. Even if this developer is in fact the author of the "broken" package, the repository has no way of knowing. An attacker can create both packages with little difficulty
Some of the packages, which link to "not their own" repositories, can be cleared of suspicion with the help of a transitive connection of packages through the very same developer:
Figure 9. Tracking the connections between packages through a single developer
Here "dream-sdk" is validated through the confirmed "dream-core" package, connected to it by the same developer.
When we understood that the presence of the same author in both packages means that the suspected package can be considered valid, we managed to clear almost 20,000 packages of suspicion (5% of the total).
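The transitive step can be sketched as a set intersection. This is only an illustration of the idea under our own assumptions: the maintainers mapping is assumed to come from PyPI account ownership rather than from self-declared author fields, and the developer names and the third package are made up.

```python
def clear_by_shared_developer(confirmed, suspects, maintainers):
    """Return the suspects that share at least one developer with a confirmed package."""
    trusted = set()
    for package in confirmed:
        trusted |= maintainers.get(package, set())
    return {pkg for pkg in suspects if maintainers.get(pkg, set()) & trusted}


# Hypothetical ownership data; "dream-core" and "dream-sdk" are the names from Figure 9.
maintainers = {
    "dream-core": {"dev_a"},
    "dream-sdk": {"dev_a"},
    "unrelated-pkg": {"dev_b"},
}

cleared = clear_by_shared_developer(
    confirmed={"dream-core"},
    suspects={"dream-sdk", "unrelated-pkg"},
    maintainers=maintainers,
)
print(sorted(cleared))
```

The check is only as strong as the identity data behind it, which is why a copied author name or email address should not count as a shared developer.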
Which packages constitute the 4% of confirmed packages using StarJacking
We have identified the reasons why a package should be categorized as untrustworthy. Fortunately, most of them fell into this category as a result of the developer's error, rather than out of malicious intent.
A small proportion of packages using StarJacking (5.9%, 1153 cases) consist of copying metainformation from popular Python packages. Among the "victims of popularity" are pyscaffold (with clearly suspicious imitators: ceritfi, pycparsre, reqjests), poetry (juwpyter), and certifi (bettercolor, virtualenvy, reqyests). But in most cases, the packages look like forks or simply capability tests.
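The text does not say how such imitators are spotted. One common approach, shown here purely as an illustration and not as the method used in this study, is to compare a new package name against a list of popular names with a string-similarity measure. The POPULAR list and the 0.8 threshold below are arbitrary assumptions.

```python
from difflib import SequenceMatcher

# Illustrative list only; a real check would use the actual top PyPI downloads.
POPULAR = ("requests", "certifi", "poetry", "pyscaffold")


def lookalikes(name: str, popular=POPULAR, threshold: float = 0.8):
    """Return (popular_name, similarity) pairs that the given name closely resembles."""
    candidate = name.lower()
    hits = []
    for target in popular:
        if candidate == target:
            continue  # the genuine package is not an imitator of itself
        score = SequenceMatcher(None, candidate, target).ratio()
        if score >= threshold:
            hits.append((target, round(score, 3)))
    return hits


for name in ("reqyests", "reqjests", "numpy"):
    print(name, "->", lookalikes(name))
```

A high similarity score is only a hint: legitimate forks and plugins often have names close to the original, so the result should feed into manual review rather than automatic blocking.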
The remaining StarJacking cases are not particularly sinister either: they are either the results of the sampleproject tutorial we've looked at (which, believe it or not, is not included in the list of popular packages downloaded from PyPI.org), or automatically generated dummy packages which don't display any suspicious activity. Such results of automation include, for example, packages from the user alexjxd:
Figure 10. A marvelous display of productivity :)
Other interesting observations
The abnormal number of "broken" pages on GitHub.
We noticed that among the many PyPI packages with non-existent GitHub repositories, a significant number belong to a single account:
Figure 11. The share of the most productive developers with "broken" names
The developer wizardforcel is very productive: since 2020 he has been involved in 13,000 projects, often creating dozens of packages per hour. The vast majority have meaningless names, but some could be deceptive: re_for_beginners (hello to the author), kubernetes_aws_shouce.
Such behavior cannot be ignored. We notified the Python Package Index administrators about this user and they removed him and all his achievements :)
But we didn't have to wait long for a comeback! A successor to wizardforcel quickly appeared: apachecn. He has recreated 9 wizardforcel projects and is not showing signs of spam activity. We believe that this developer was running an experiment for his own purposes, but the removal of his epic achievements led him to put a stop to his activities, leaving only the necessary packages.
Figure 12. Our feelings while investigating wizardforcel / apachecn
Matching email addresses across different services proves nothing
When registering for GitHub and PyPI, email verification is required. However, in the project card, the email is taken not from the developer's profile, but rather from the metainformation specified by the user. This is understandable: if the development is being carried out on behalf of a company or the product has its own dedicated email address, then it makes sense to specify it here. The email is not displayed in the user profile, either. Thus, we cannot trust the email specified in the PyPI project, as it can be copied by an attacker.
If you had the email addresses of all existing projects, as well as those that were removed during the year, you could half-jokingly try to find a correlation between an email address and the chance of its removal.
Conclusion
Open source software is an exciting area when it comes to studying attackers' behavior.
In this article, we've been discussing the StarJacking problem on PyPI.org — manipulating imported project reputation statistics to create a more attractive image and deceive users.
Copying an impressive number of stars to packages with incomprehensible names doesn't make much sense, since users are unlikely to install packages such as "gggggghghghghghghfyrtfyuhgjuh" or "h8shdf89d" (real packages, StarJacking on requests), even if they have more than forty thousand stars. But by combining this technique with typosquatting or coming up with more convincing names, it's possible to create more appealing results: we periodically hear about packages appearing with names like "requeste", "fast_http", "websocket_cli". But the "rickquests" package, which fell victim to the fight against StarJacking, will remain forever in our hearts:
Figure 13. Don't judge a book by its cover ;)
Based on the results of our study, we've added a warning to the PT PyAnalysis service about this technique of developers taking stars from a repository that doesn't actually belong to them. This should spur our users to be careful and question the trustworthiness of any projects used. The manipulation of project statistics will most likely attract increased attention to the project, even if done unintentionally; the developer's inattention to the project's metainformation could simply indicate a low coding culture.
In our next article, we will discuss obfuscation in open-source Python projects: how to analyze it and how popular it is among attackers.
Author: Stanislav Rakovsky, Senior Threat Analysis Specialist at Positive Technologies