All sufficiently big public package registries are a mess full of malware, name squatting, and drama:
- crates.io has a single user owning names like “any”, “bash”, and “class”.
- npmjs.com had a drama with left-pad when a single maintainer of a single one-liner package broke the internet.
- pypi.org appears in tech news monthly with another group of researchers discovering another malware campaign.
Today PyPI malware made news yet again, so I decided to take a look at the other side of PyPI: name squatting and some other interesting stats along the way.
Get the data
We could manually try random package names and check their owners, but there is a better way. Seth Michael Larson, the Security Developer-in-Residence at the Python Software Foundation, has a public repository, pypi-data, with a partial dump of the PyPI database.
- Download the latest dump. If you want to reproduce my results, pick the same one I’m going to use: 2023-10-31 (spooky! 🎃).
- Extract: `gunzip pypi.db.gz`.
- Either open the dump in the sqlite CLI (`sqlite3 pypi.db`) or use the DB Browser for SQLite GUI, which is very cool (but may crash if you’re not careful with the queries you run).
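If you prefer scripting over the CLI or the GUI, Python's built-in sqlite3 module can read the same file. Here is a minimal sketch (not part of my original workflow) that lists every table and its columns. The helper name `describe_schema` is made up, and to keep the snippet runnable without the dump it builds a tiny in-memory stand-in with only a few of the real columns.

```python
import sqlite3


def describe_schema(conn: sqlite3.Connection) -> dict[str, list[str]]:
    """Map every table name in the database to the list of its column names."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    schema = {}
    for table in tables:
        # PRAGMA table_info returns one row per column; index 1 is the column name.
        # The table names come from sqlite_master, not from user input.
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [column[1] for column in columns]
    return schema


# Tiny stand-in for the real dump (assumption: only a subset of columns).
# To inspect the real thing, use sqlite3.connect("pypi.db") instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE packages (name TEXT, version TEXT, downloads INTEGER)")
conn.execute("CREATE TABLE maintainers (name TEXT, package_name TEXT)")

for table, columns in describe_schema(conn).items():
    print(table, columns)
```

Running it against the real dump is a quick way to see which columns are available before writing queries.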
Probing the data
The table `packages` contains all packages with their name, the latest released version number, the last update date, and some other info. For example, let’s select stats for textdistance:
SELECT * FROM packages WHERE name = 'textdistance';
field | value |
---|---|
name | textdistance |
version | 4.6.0 |
requires_python | >=3.5 |
yanked | 0 |
has_binary_wheel | 0 |
has_vulnerabilities | 0 |
first_uploaded_at | 2023-09-28T08:30:50 |
last_uploaded_at | 2023-09-28T08:30:51 |
recorded_at | 2023-10-30 21:49:00 |
downloads | 308733 |
scorecard_overall | 4.8 |
in_google_assured_oss | 0 |
Unfortunately, we don’t have any information about past releases, like how many releases the package had, how many files, when the first one was uploaded, etc. Also, maintainers are in a separate table because a single package may have multiple maintainers and maintainers may have multiple packages (many-to-many):
SELECT * FROM maintainers WHERE package_name = 'textdistance';
field | value |
---|---|
name | orsinium |
package_name | textdistance |
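To make the many-to-many relation concrete, here is a small sketch that walks it in both directions. The table and column names are the ones from the dump, but the rows (`alice`, `bob`, `pkg-a`, `pkg-b`) and the helper names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE packages (name TEXT, version TEXT);
    CREATE TABLE maintainers (name TEXT, package_name TEXT);
    -- Invented rows, for illustration only.
    INSERT INTO packages VALUES ('pkg-a', '1.2.0'), ('pkg-b', '0.0.1');
    INSERT INTO maintainers VALUES
        ('alice', 'pkg-a'), ('bob', 'pkg-a'), ('alice', 'pkg-b');
    """
)


def maintainers_of(conn, package):
    """All maintainer names attached to one package."""
    rows = conn.execute(
        "SELECT name FROM maintainers WHERE package_name = ? ORDER BY name",
        (package,),
    )
    return [row[0] for row in rows]


def packages_of(conn, maintainer):
    """All (package, latest version) pairs attached to one maintainer."""
    rows = conn.execute(
        """
        SELECT packages.name, packages.version
        FROM packages
        JOIN maintainers ON packages.name = maintainers.package_name
        WHERE maintainers.name = ?
        ORDER BY packages.name
        """,
        (maintainer,),
    )
    return rows.fetchall()


print(maintainers_of(conn, "pkg-a"))  # ['alice', 'bob']
print(packages_of(conn, "alice"))     # [('pkg-a', '1.2.0'), ('pkg-b', '0.0.1')]
```

Every query in the rest of the post is a variation of the second function: join the two tables on the package name, then group by the maintainer.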
Finding the most prolific users
Who published the most packages?
SELECT maintainers.name, COUNT(*) as cnt
FROM packages, maintainers
WHERE packages.name = maintainers.package_name
GROUP BY maintainers.name
ORDER BY cnt DESC
LIMIT 20;
name | cnt |
---|---|
OCA | 14928 |
alexjxd | 1577 |
wix-ci | 1539 |
yandex-bot | 1196 |
openstackci | 735 |
vemel | 734 |
microsoft | 671 |
davisagli | 520 |
hansalemao | 501 |
hannosch | 500 |
icemac | 449 |
google_opensource | 415 |
faassen | 401 |
agroszer | 361 |
dlech | 360 |
thejcannon | 360 |
adafruit-travis | 352 |
pycopy-lib | 347 |
azure-sdk | 343 |
aws-cdk | 337 |
You may recognize some names on the list.
- The apparent leader is OCA, also known as the Odoo Community Association. Odoo is a popular open-source enterprise CRM with a Python backend. Their PyPI account holds a bunch of Odoo plugins.
- Next goes alexjxd, also known as Alex Jiang. This is an Alibaba employee, and their account holds alibabacloud-python-sdk components. It is poorly documented but what I noticed is that all components have a date suffix, like `ddosbgp-20180201`. So, it’s some kind of additional versioning going on.
- The third place goes to wix-ci, holding a bunch of plugins for wix.com.
- The yandex-bot account, which claims to be owned by “Yandex Security Team”, owns 1196 names, including and, nu, aiostat, apilib, cpp_grader, tmp2, minify, and many other generic names. Each description says: “A package to prevent Dependency Confusion attacks against Yandex”. So, we see name squatting to prevent name squatting. “The best defense is a good offense”. Should this be allowed? And the whole situation suddenly takes a political turn when you consider that Yandex LLC is a Russian company.
You can check the rest of the list yourself if you’re curious. For now, let’s find something more interesting.
Finding the top name squatters
Name squatting is when someone registers a bunch of common names to sell them later. It is very common with DNS, social media, and package registries. This is why Steam is steampowered.com.
The best heuristic would be to find users with the most single-release packages, but we don’t have this information in the dataset. Instead, we can have a look at users with all packages having the same version number. The assumption is that when all names are registered using one tool or one placeholder project metadata, they all will have the same version.
SELECT
maintainers.name,
packages.name,
version,
COUNT(*) as cnt_prj,
COUNT(DISTINCT version) as cnt_ver
FROM packages, maintainers
WHERE packages.name = maintainers.package_name
GROUP BY maintainers.name
HAVING cnt_ver = 1
ORDER BY cnt_prj DESC
LIMIT 20;
maintainer | package | version | projects |
---|---|---|---|
wix-ci | artifactory-check | 0.0.1 | 1539 |
alexanderkjall | abyss-airflow-reprocessor | 0.0.1 | 243 |
doxops | data-dags | 0.0.1 | 53 |
akarmakar | nvidia-cudf-cu11 | 0.0.1.dev5 | 48 |
shadowwalker2718 | audiolm | 0.0.1.dev0 | 41 |
tanium-security | macmiller-common | 0.0.dev1 | 29 |
wxpay_sec_team | autogencase | 0.0.1 | 29 |
squadrone | algorand-wallet-client | 0.0.0 | 28 |
GHGSat | gfa-ghg-hres | 0.1.1 | 24 |
aws-solutions-konstruk-support | aws-solutions-konstruk-aws-apigateway-dynamodb | 0.8.1 | 24 |
coalgo | coalg | 0.0.0 | 24 |
elula-ai | elulalib | 0.0.0 | 23 |
felya152 | felya-1-1 | 0.1.0 | 23 |
girder-robot | girder | 3.1.24 | 22 |
deeznuts1337 | cloudsec | 0.0.0 | 19 |
mapsme | omim-airmaps | 10.3.0rc2 | 19 |
souljaboy | eai | 0.1 | 19 |
hashemshaiban | aladrisy | 0.0.1 | 18 |
stastnypremysl | pycom-artifactory-automation | 0.0.1 | 18 |
edtb | testwizard-android-set-top-box | 3.7.0 | 17 |
- The thing I hadn’t noticed about wix-ci before is that all the packages were released in one go, between 2021-02-11 and 2021-02-14, and haven’t been touched since. When I checked the content of the packages, they were all empty, without any code inside. Busted!
- alexanderkjall, also known as Alexander Kjäll, holds 243 packages with the description “PyPi package created by Schibsted’s Product & Application Security team”. Yet another example of “to prevent squatting, let’s squat first”. The names include schlearn (which sounds like sklearn), s3-helpers, christian, ip-library, datadog-linter, etc.
- doxops is yet another company squatting their private names.
- akarmakar squats package names for nvidia, like nvidia-raft-dask-cu116. If you try to install any of these, you’ll get an installation failure telling you to use NVIDIA Python Package Index. This is similar to other cases of “safety squatting” but at least this time it serves a purpose for public project users, not just employees of a single company.
- shadowwalker2718 is the first squatter on the list that is not a big company. All the names they hold are names of real ML projects that you can find on GitHub but that don’t provide a PyPI distribution. They squatted chatdoctor for ChatDoctor, controlnet for ControlNet, autogpt for AutoGPT, etc. Most of the registered projects have the description copied from the real project, and even some dependencies, but no code inside.
I checked more users from the list. Lots and lots of squatters. Some are companies squatting their internal names, some are individuals holding nice names for sale.
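For anyone who wants to script the heuristic, here is a sketch of the same-version query wrapped in a Python function and run on invented toy rows. It differs from my query in two ways that are my own choices here: it uses an explicit JOIN, and it picks the example package with MIN() instead of a bare column, because SQLite returns the value from an arbitrary row for bare columns in a grouped query. The function name `few_versions` and its parameters are made up; the bounds on the number of distinct versions are parameters so the filter can be loosened.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE packages (name TEXT, version TEXT);
    CREATE TABLE maintainers (name TEXT, package_name TEXT);
    -- Invented rows, for illustration only.
    INSERT INTO packages VALUES
        ('squat-a', '0.0.1'), ('squat-b', '0.0.1'), ('squat-c', '0.0.1'),
        ('real-x', '2.3.1'), ('real-y', '0.9.0'),
        ('semi-a', '0.0.0'), ('semi-b', '0.0.0'), ('semi-c', '0.0.2');
    INSERT INTO maintainers VALUES
        ('squatter', 'squat-a'), ('squatter', 'squat-b'), ('squatter', 'squat-c'),
        ('developer', 'real-x'), ('developer', 'real-y'),
        ('semi', 'semi-a'), ('semi', 'semi-b'), ('semi', 'semi-c');
    """
)


def few_versions(conn, min_versions=1, max_versions=1, limit=20):
    """Maintainers whose packages share only a few distinct latest versions.

    Returns (maintainer, example package, package count, distinct versions).
    """
    return conn.execute(
        """
        SELECT
            maintainers.name,
            MIN(packages.name) AS example_package,
            COUNT(*) AS cnt_prj,
            COUNT(DISTINCT packages.version) AS cnt_ver
        FROM packages
        JOIN maintainers ON packages.name = maintainers.package_name
        GROUP BY maintainers.name
        HAVING COUNT(DISTINCT packages.version) BETWEEN ? AND ?
        ORDER BY cnt_prj DESC, maintainers.name
        LIMIT ?
        """,
        (min_versions, max_versions, limit),
    ).fetchall()


print(few_versions(conn))
# [('squatter', 'squat-a', 3, 1)]
print(few_versions(conn, min_versions=2, max_versions=5))
# [('semi', 'semi-a', 3, 2), ('developer', 'real-x', 2, 2)]
```

Note that on the toy data the loosened filter also catches `developer`, who is not a squatter at all. The heuristic only produces candidates, and each one still needs a manual look.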
Finding more squatters
We can tweak the query above to show us maintainers whose packages have between 2 and 5 distinct version numbers. Some of the squatters might slightly change the version number or re-release a package with new fake content.
SELECT
maintainers.name,
packages.name,
version,
COUNT(*) as cnt_prj,
COUNT(DISTINCT version) as cnt_ver
FROM packages, maintainers
WHERE packages.name = maintainers.package_name
GROUP BY maintainers.name
HAVING cnt_ver BETWEEN 2 AND 5
ORDER BY cnt_prj DESC
LIMIT 20;
maintainer | package | version | cnt_prj | cnt_ver |
---|---|---|---|---|
thejcannon | botocore-a-la-carte | 1.31.73 | 360 | 3 |
stale.pettersen.schibsted | apikeycheck | 0.0.1 | 224 | 2 |
anon_ssregistrar | addr-match | 0.0.0 | 218 | 3 |
noteed | openerp-account | 7.0.406 | 206 | 2 |
pokoli | proteus | 7.0.0 | 193 | 5 |
DnA_DGAT_Chapter | abcdefg | 0.0.0 | 160 | 2 |
wangc | lab-b | 1.0 | 119 | 3 |
tcw | an | 0.0.4 | 114 | 5 |
sifer | aaaaa | 1.0.1 | 98 | 2 |
aws-solutions-constructs-team | aws-solutions-constructs-aws-alb-fargate | 2.45.0 | 86 | 3 |
takealot | ab-test-client | 0.0.1rc0 | 82 | 4 |
kafkaservices | audit-friday | 0.1 | 74 | 4 |
yinsuo.mys | haas-python-ads1xx5 | 0.0.8 | 74 | 3 |
Pinkyy | aisi-od-training | 0.0.1rc1 | 72 | 3 |
se2862890720 | ci-connector | 0.0.47 | 57 | 2 |
mdazam1942 | car-connector-framework | 4.0.1 | 56 | 5 |
rieder | amuse | 2023.10.0 | 50 | 5 |
cloudwright | cloudwright-airtable | 0.0.0.post1 | 49 | 2 |
doerlbh | aikido | 0.0.0 | 49 | 5 |
elad_pt | adios2 | 0.0.1 | 48 | 5 |
Another interesting query is to count, for each maintainer, the packages whose version is one of a few predefined placeholder numbers:
SELECT
maintainers.name,
packages.name,
version,
COUNT(*) as cnt
FROM packages, maintainers
WHERE packages.name = maintainers.package_name AND version IN ('0.0.0', '0.0.1', '0.1.0', '1.0.0')
GROUP BY maintainers.name
ORDER BY cnt DESC
LIMIT 20;
maintainer | package | version | cnt |
---|---|---|---|
wix-ci | artifactory-check | 0.0.1 | 1539 |
alexjxd | alibabacloud-acm20200206 | 1.0.0 | 415 |
platform-kiwi | alertlib | 0.0.0 | 269 |
alexanderkjall | abyss-airflow-reprocessor | 0.0.1 | 243 |
stale.pettersen.schibsted | apikeycheck | 0.0.1 | 223 |
anon_ssregistrar | addr-match | 0.0.0 | 218 |
airbyte-engineering | airbyte-source-gong | 0.1.0 | 186 |
DnA_DGAT_Chapter | abcdefg | 0.0.0 | 160 |
pycopy-lib | pycopy-aifc | 0.0.0 | 151 |
micropython-lib | micropython-aifc | 0.0.0 | 138 |
workiva | admin-frugal | 0.0.0 | 123 |
openstackci | act | 0.0.1 | 112 |
microsoft | archai | 1.0.0 | 107 |
sifer | bbbb | 1.0.0 | 96 |
datakund_test | allmovie-scraper | 1.0.0 | 91 |
azure-sdk | azure-agrifood-nspkg | 1.0.0 | 75 |
heyWFeng | decrypt4pdf | 0.0.1 | 63 |
mvinyard2 | adata-query | 0.0.1 | 60 |
doxops | data-dags | 0.0.1 | 53 |
abhishek4273 | monk-colab | 0.0.1 | 50 |
This method gives quite a few false positives (legit people who release lots of one-off packages), but it still finds some interesting cases.
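One possible way to cut down on those false positives is to look at the share of placeholder versions rather than the raw count, since the query above never checks how many other packages a maintainer has. The sketch below is an assumption-laden illustration, not something I ran on the real dump: the function name `placeholder_share`, the `min_packages` threshold, and all the rows are invented.

```python
import sqlite3

# The same placeholder versions as in the query above.
PLACEHOLDERS = ("0.0.0", "0.0.1", "0.1.0", "1.0.0")

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE packages (name TEXT, version TEXT);
    CREATE TABLE maintainers (name TEXT, package_name TEXT);
    -- Invented rows, for illustration only.
    INSERT INTO packages VALUES
        ('squat-a', '0.0.1'), ('squat-b', '0.0.1'), ('squat-c', '0.0.1'),
        ('tool-a', '0.0.1'), ('tool-b', '1.0.0'),
        ('tool-c', '2.1.0'), ('tool-d', '3.4.5'),
        ('real-x', '2.3.1'), ('real-y', '0.9.0');
    INSERT INTO maintainers VALUES
        ('squatter', 'squat-a'), ('squatter', 'squat-b'), ('squatter', 'squat-c'),
        ('oneoff', 'tool-a'), ('oneoff', 'tool-b'),
        ('oneoff', 'tool-c'), ('oneoff', 'tool-d'),
        ('developer', 'real-x'), ('developer', 'real-y');
    """
)


def placeholder_share(conn, min_packages=2):
    """Return (maintainer, total, placeholder count, share), highest share first."""
    marks = ", ".join("?" for _ in PLACEHOLDERS)
    query = f"""
        SELECT
            maintainers.name,
            COUNT(*) AS total,
            -- The IN expression evaluates to 0 or 1, so SUM counts the matches.
            SUM(packages.version IN ({marks})) AS placeholders
        FROM packages
        JOIN maintainers ON packages.name = maintainers.package_name
        GROUP BY maintainers.name
        HAVING COUNT(*) >= ?
    """
    rows = conn.execute(query, (*PLACEHOLDERS, min_packages)).fetchall()
    result = [(name, total, hits, hits / total) for name, total, hits in rows]
    # Highest share first, then the biggest accounts, then by name.
    result.sort(key=lambda row: (-row[3], -row[1], row[0]))
    return result


for row in placeholder_share(conn):
    print(row)
# ('squatter', 3, 3, 1.0)
# ('oneoff', 4, 2, 0.5)
# ('developer', 2, 0, 0.0)
```

A share of 1.0 on a large account is a much stronger signal than a large count alone, although a prolific author of genuine one-off packages would still score high.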
Putting it all together
So, how many squatters have we found? Combining all the methods above and manually removing false positives:
Companies:
- airbyte-engineering (Airbyte)
- akarmakar (Nvidia)
- alexanderkjall (Schibsted)
- alexjxd (Alibaba)
- doxops (Dox)
- elad_pt (Cycode)
- Pinkyy (SBB)
- platform-kiwi (Kiwi)
- wix-ci (Wix)
- workiva (Workiva)
- yandex-bot (Yandex)
Individual squatters:
- anon_ssregistrar
- datakund_test
- DnA_DGAT_Chapter
- doerlbh
- eywalker
- kafkaservices
- kislyuk
- mvinyard2
- rebelliondefense
- se2862890720
- shadowwalker2718
- sifer
- takealot
- tcw
- wangc
With a better dataset, we could have better heuristics. Maybe, one day, I’ll go and find packages with only one small release with almost no code inside. Or a bunch of packages reserved in one go.
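The “reserved in one go” idea can be roughly approximated even with the current dataset, using the upload timestamps in the packages table. The sketch below assumes that `last_uploaded_at` is a fair proxy for when a name was registered, which holds only for packages that were never updated after the first release. The function name `upload_bursts`, the thresholds, and the rows are all invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE packages (name TEXT, version TEXT, last_uploaded_at TEXT);
    CREATE TABLE maintainers (name TEXT, package_name TEXT);
    -- Invented rows, for illustration only.
    INSERT INTO packages VALUES
        ('bulk-a', '0.0.1', '2021-02-11T09:00:00'),
        ('bulk-b', '0.0.1', '2021-02-12T10:00:00'),
        ('bulk-c', '0.0.1', '2021-02-14T18:30:00'),
        ('lib-a', '1.4.0', '2019-05-01T12:00:00'),
        ('lib-b', '0.7.2', '2021-08-15T12:00:00'),
        ('lib-c', '3.0.0', '2023-10-01T12:00:00'),
        ('solo', '0.1.0', '2022-01-01T00:00:00');
    INSERT INTO maintainers VALUES
        ('bulk', 'bulk-a'), ('bulk', 'bulk-b'), ('bulk', 'bulk-c'),
        ('steady', 'lib-a'), ('steady', 'lib-b'), ('steady', 'lib-c'),
        ('small', 'solo');
    """
)


def upload_bursts(conn, min_packages=3, max_days=7.0):
    """Maintainers whose packages were all last uploaded within max_days."""
    return conn.execute(
        """
        SELECT
            maintainers.name,
            COUNT(*) AS cnt,
            MIN(packages.last_uploaded_at) AS first_seen,
            MAX(packages.last_uploaded_at) AS last_seen
        FROM packages
        JOIN maintainers ON packages.name = maintainers.package_name
        GROUP BY maintainers.name
        HAVING COUNT(*) >= ?
           -- julianday() converts an ISO timestamp to a fractional day number.
           AND julianday(MAX(packages.last_uploaded_at))
             - julianday(MIN(packages.last_uploaded_at)) <= ?
        ORDER BY cnt DESC, maintainers.name
        """,
        (min_packages, max_days),
    ).fetchall()


print(upload_bursts(conn))
# [('bulk', 3, '2021-02-11T09:00:00', '2021-02-14T18:30:00')]
```

On the toy data only `bulk` is reported: `steady` spreads uploads over years, and `small` has too few packages to say anything.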
Questions to think about
- Should name squatting be allowed? Should the PyPI team care?
- Should we do something?
- Should we allow private companies to reserve names from their internal registry “for security reasons”?
- Should all package names be namespaced to the author, like on GitHub or Docker Hub?
- Should we limit the number of packages per user? Should we tell Microsoft to go and maintain their own PyPI instance?