All sufficiently big public package registries are a mess full of malware, name squatting, and drama:

  • crates.io has a single user owning names like “any”, “bash”, and “class”.
  • npmjs.com had a drama with left-pad when a single maintainer of a single one-liner package broke the internet.
  • pypi.org appears in tech news monthly with another group of researchers discovering another malware campaign.

Today PyPI malware made news yet again, so I decided to take a look at the other side of PyPI: name squatting and some other interesting stats along the way.

Get the data

We could manually try random package names and check their owners, but there is a better way. Seth Michael Larson, the Security Developer-in-Residence at the Python Software Foundation, has a public repository, pypi-data, with a partial dump of the PyPI database.

  1. Download the latest dump. If you want to reproduce my results, pick the same one I’m going to use: 2023-10-31 (spooky! 🎃).
  2. Extract: gunzip pypi.db.gz.
  3. Either open the dump in the sqlite CLI (sqlite3 pypi.db) or use the DB Browser for SQLite GUI which is very cool (but may crash if you’re not careful with queries you run).
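
If you would rather script these steps, a minimal Python sketch using only the standard library could look like the following. The file names pypi.db.gz and pypi.db come from the steps above; the helper names extract_dump and list_tables are made up for illustration.

```python
import gzip
import shutil
import sqlite3
from contextlib import closing


def extract_dump(gz_path, db_path):
    """Decompress the gzipped dump (like gunzip, but the archive is kept)."""
    with gzip.open(gz_path, "rb") as src, open(db_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


def list_tables(db_path):
    """Return the names of all tables in the SQLite file, sorted by name."""
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]
```

Calling extract_dump("pypi.db.gz", "pypi.db") followed by list_tables("pypi.db") is a quick way to see which tables the dump contains before writing any queries.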

Probing the data

The table packages contains all packages with their name, the latest released version number, the last update date, and some other info. For example, let’s select stats for textdistance:

SELECT * FROM packages WHERE name = 'textdistance';
field | value
name | textdistance
version | 4.6.0
requires_python | >=3.5
yanked | 0
has_binary_wheel | 0
has_vulnerabilities | 0
first_uploaded_at | 2023-09-28T08:30:50
last_uploaded_at | 2023-09-28T08:30:51
recorded_at | 2023-10-30 21:49:00
downloads | 308733
scorecard_overall | 4.8
in_google_assured_oss | 0

Unfortunately, we don’t have any information about past releases, like how many releases the package had, how many files, when the first one was uploaded, etc. Also, maintainers are in a separate table because a single package may have multiple maintainers and maintainers may have multiple packages (many-to-many):

SELECT * FROM maintainers WHERE package_name = 'textdistance';
field | value
name | orsinium
package_name | textdistance
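
To make the many-to-many relationship concrete, here is a small sketch on an in-memory database. The schema is an assumption trimmed down to the columns shown above, and the rows (alice, bob, pkg-a, pkg-b) are invented toy data, not taken from the dump.

```python
import sqlite3

# Assumed, trimmed-down schema: only the columns needed for the join.
SCHEMA = """
CREATE TABLE packages (name TEXT PRIMARY KEY, version TEXT);
CREATE TABLE maintainers (name TEXT, package_name TEXT);
"""


def make_toy_db():
    """Build an in-memory database with a few invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO packages VALUES (?, ?)",
        [("pkg-a", "0.0.1"), ("pkg-b", "1.2.0")],
    )
    # pkg-a has two maintainers, and alice maintains two packages.
    conn.executemany(
        "INSERT INTO maintainers VALUES (?, ?)",
        [("alice", "pkg-a"), ("bob", "pkg-a"), ("alice", "pkg-b")],
    )
    return conn


def packages_of(conn, maintainer):
    """All (package, version) pairs a maintainer has access to."""
    return conn.execute(
        "SELECT p.name, p.version FROM packages AS p "
        "JOIN maintainers AS m ON m.package_name = p.name "
        "WHERE m.name = ? ORDER BY p.name",
        (maintainer,),
    ).fetchall()


def maintainers_of(conn, package):
    """All maintainer names attached to a package."""
    rows = conn.execute(
        "SELECT name FROM maintainers WHERE package_name = ? ORDER BY name",
        (package,),
    ).fetchall()
    return [name for (name,) in rows]
```

One consequence worth keeping in mind for the queries below: a package with several maintainers is counted once for each of them.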

Finding the most prolific users

Who published the most packages?

SELECT      maintainers.name, COUNT(*) as cnt
FROM        packages, maintainers
WHERE       packages.name = maintainers.package_name
GROUP BY    maintainers.name
ORDER BY    cnt DESC
LIMIT       20;
name | cnt
OCA | 14928
alexjxd | 1577
wix-ci | 1539
yandex-bot | 1196
openstackci | 735
vemel | 734
microsoft | 671
davisagli | 520
hansalemao | 501
hannosch | 500
icemac | 449
google_opensource | 415
faassen | 401
agroszer | 361
dlech | 360
thejcannon | 360
adafruit-travis | 352
pycopy-lib | 347
azure-sdk | 343
aws-cdk | 337

You may recognize some names on the list.

  • The apparent leader is OCA, also known as the Odoo Community Association. Odoo is a popular open-source ERP and CRM suite with a Python backend. Their PyPI account holds a bunch of Odoo plugins.
  • Next comes alexjxd, also known as Alex Jiang, an Alibaba employee whose account holds alibabacloud-python-sdk components. They are poorly documented, but what I noticed is that all components have a date suffix, like ddosbgp-20180201, so there is some kind of additional versioning going on.
  • The third place goes to wix-ci, holding a bunch of plugins for wix.com.
  • The yandex-bot account, which claims to be owned by “Yandex Security Team”, owns about 1200 names (1196 in this dump), including and, nu, aiostat, apilib, cpp_grader, tmp2, minify, and many other generic names. Each description says: “A package to prevent Dependency Confusion attacks against Yandex”. So, we see name squatting to prevent name squatting. “The best defense is a good offense”. Should this be allowed? And the whole situation suddenly takes a political turn when you consider that Yandex LLC is a Russian company.

You can check the rest of the list yourself if you’re curious. For now, let’s find something more interesting.

Finding the top name squatters

Name squatting is when someone registers a bunch of common names to sell them later. It is very common with DNS, social media, and package registries. This is why Steam is steampowered.com.

The best heuristic would be to find users with the most single-release packages, but we don’t have this information in the dataset. Instead, we can look at users whose packages all have the same version number. The assumption is that when all names are registered using one tool or the same placeholder project metadata, they will all have the same version.

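-- packages.name and version are "bare" columns here: SQLite fills them
-- from an arbitrary row of each group, so read them as one sample package.
SELECT
    maintainers.name,
    packages.name,
    version,
    COUNT(*) as cnt_prj,
    COUNT(DISTINCT version) as cnt_ver
FROM      packages, maintainers
WHERE     packages.name = maintainers.package_name
GROUP BY  maintainers.name
HAVING    cnt_ver = 1
ORDER BY  cnt_prj DESC
LIMIT     20;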
maintainer | package | version | projects
wix-ci | artifactory-check | 0.0.1 | 1539
alexanderkjall | abyss-airflow-reprocessor | 0.0.1 | 243
doxops | data-dags | 0.0.1 | 53
akarmakar | nvidia-cudf-cu11 | 0.0.1.dev5 | 48
shadowwalker2718 | audiolm | 0.0.1.dev0 | 41
tanium-security | macmiller-common | 0.0.dev1 | 29
wxpay_sec_team | autogencase | 0.0.1 | 29
squadrone | algorand-wallet-client | 0.0.0 | 28
GHGSat | gfa-ghg-hres | 0.1.1 | 24
aws-solutions-konstruk-support | aws-solutions-konstruk-aws-apigateway-dynamodb | 0.8.1 | 24
coalgo | coalg | 0.0.0 | 24
elula-ai | elulalib | 0.0.0 | 23
felya152 | felya-1-1 | 0.1.0 | 23
girder-robot | girder | 3.1.24 | 22
deeznuts1337 | cloudsec | 0.0.0 | 19
mapsme | omim-airmaps | 10.3.0rc2 | 19
souljaboy | eai | 0.1 | 19
hashemshaiban | aladrisy | 0.0.1 | 18
stastnypremysl | pycom-artifactory-automation | 0.0.1 | 18
edtb | testwizard-android-set-top-box | 3.7.0 | 17

  • What I hadn’t noticed about wix-ci before is that all the packages were released in one go, between 2021-02-11 and 2021-02-14, and haven’t been touched since. When I checked the content of the packages, they were all empty, without any code inside. Busted!
  • alexanderkjall, also known as Alexander Kjäll, holds 244 packages with the description “PyPi package created by Schibsted’s Product & Application Security team”. Yet another example of “to prevent squatting, let’s squat first”. The names include schlearn (which sounds like sklearn), s3-helpers, christian, ip-library, datadog-linter, etc.
  • doxops is yet another company squatting their private names.
  • akarmakar squats package names for nvidia, like nvidia-raft-dask-cu116. If you try to install any of these, you’ll get an installation failure telling you to use NVIDIA Python Package Index. This is similar to other cases of “safety squatting” but at least this time it serves a purpose for public project users, not just employees of a single company.
  • shadowwalker2718 is the first instance of name squatting on the list done not by a big company. All the names they hold are the names of the real ML projects that you find on GitHub but which don’t provide a PyPI distribution. They squatted chatdoctor for ChatDoctor, controlnet for ControlNet, autogpt for AutoGPT, etc. Most of the registered projects have the description copied from the real project and even some dependencies but no code inside.

I checked more users from the list. Lots and lots of squatters. Some are companies squatting their internal names, some are individuals holding nice names for sale.
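
The same-version heuristic can also be wrapped in a small helper so it is easy to rerun with other thresholds. This is a sketch under assumptions, not the exact query used above: the function name uniform_version_maintainers and its min_packages parameter are invented for illustration, and MIN(p.name) replaces the bare package column so that the sample package is deterministic.

```python
import sqlite3

QUERY = """
SELECT   m.name                    AS maintainer,
         COUNT(*)                  AS cnt_prj,
         COUNT(DISTINCT p.version) AS cnt_ver,
         MIN(p.name)               AS sample_package
FROM     packages AS p
JOIN     maintainers AS m ON m.package_name = p.name
GROUP BY m.name
HAVING   (COUNT(DISTINCT p.version) BETWEEN ? AND ?)
         AND COUNT(*) >= ?
ORDER BY cnt_prj DESC, m.name
"""


def uniform_version_maintainers(conn, min_versions=1, max_versions=1,
                                min_packages=2):
    """Maintainers whose packages share only a few distinct version numbers.

    min_packages is a hypothetical extra filter that drops accounts with
    too few packages for the version pattern to mean anything.
    """
    params = (min_versions, max_versions, min_packages)
    return conn.execute(QUERY, params).fetchall()
```

With the defaults it returns maintainers whose packages all share one version; passing min_versions=2 and max_versions=5 widens the net to accounts with a handful of distinct versions.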

Finding more squatters

We can tweak the query above to show maintainers whose packages span between 2 and 5 distinct version numbers. Some squatters might slightly change the version number or re-release a package with new fake content.

SELECT
    maintainers.name,
    packages.name,
    version,
    COUNT(*) as cnt_prj,
    COUNT(DISTINCT version) as cnt_ver
FROM     packages, maintainers
WHERE    packages.name = maintainers.package_name
GROUP BY maintainers.name
HAVING   cnt_ver BETWEEN 2 AND 5
ORDER BY cnt_prj DESC
LIMIT    20;
maintainer | package | version | cnt_prj | cnt_ver
thejcannon | botocore-a-la-carte | 1.31.73 | 360 | 3
stale.pettersen.schibsted | apikeycheck | 0.0.1 | 224 | 2
anon_ssregistrar | addr-match | 0.0.0 | 218 | 3
noteed | openerp-account | 7.0.406 | 206 | 2
pokoli | proteus | 7.0.0 | 193 | 5
DnA_DGAT_Chapter | abcdefg | 0.0.0 | 160 | 2
wangc | lab-b | 1.0 | 119 | 3
tcw | an | 0.0.4 | 114 | 5
sifer | aaaaa | 1.0.1 | 98 | 2
aws-solutions-constructs-team | aws-solutions-constructs-aws-alb-fargate | 2.45.0 | 86 | 3
takealot | ab-test-client | 0.0.1rc0 | 82 | 4
kafkaservices | audit-friday | 0.1 | 74 | 4
yinsuo.mys | haas-python-ads1xx5 | 0.0.8 | 74 | 3
Pinkyy | aisi-od-training | 0.0.1rc1 | 72 | 3
se2862890720 | ci-connector | 0.0.47 | 57 | 2
mdazam1942 | car-connector-framework | 4.0.1 | 56 | 5
rieder | amuse | 2023.10.0 | 50 | 5
cloudwright | cloudwright-airtable | 0.0.0.post1 | 49 | 2
doerlbh | aikido | 0.0.0 | 49 | 5
elad_pt | adios2 | 0.0.1 | 48 | 5

Another interesting query counts, for each maintainer, the packages whose latest version is one of a few typical placeholder version numbers:

SELECT
    maintainers.name,
    packages.name,
    version,
    COUNT(*) as cnt
FROM     packages, maintainers
WHERE    packages.name = maintainers.package_name AND version IN ('0.0.0', '0.0.1', '0.1.0', '1.0.0')
GROUP BY maintainers.name
ORDER BY cnt DESC
LIMIT    20;
maintainer | package | version | cnt
wix-ci | artifactory-check | 0.0.1 | 1539
alexjxd | alibabacloud-acm20200206 | 1.0.0 | 415
platform-kiwi | alertlib | 0.0.0 | 269
alexanderkjall | abyss-airflow-reprocessor | 0.0.1 | 243
stale.pettersen.schibsted | apikeycheck | 0.0.1 | 223
anon_ssregistrar | addr-match | 0.0.0 | 218
airbyte-engineering | airbyte-source-gong | 0.1.0 | 186
DnA_DGAT_Chapter | abcdefg | 0.0.0 | 160
pycopy-lib | pycopy-aifc | 0.0.0 | 151
micropython-lib | micropython-aifc | 0.0.0 | 138
workiva | admin-frugal | 0.0.0 | 123
openstackci | act | 0.0.1 | 112
microsoft | archai | 1.0.0 | 107
sifer | bbbb | 1.0.0 | 96
datakund_test | allmovie-scraper | 1.0.0 | 91
azure-sdk | azure-agrifood-nspkg | 1.0.0 | 75
heyWFeng | decrypt4pdf | 0.0.1 | 63
mvinyard2 | adata-query | 0.0.1 | 60
doxops | data-dags | 0.0.1 | 53
abhishek4273 | monk-colab | 0.0.1 | 50

This method gives quite a few false positives (legit people who release lots of one-off packages) but still finds some interesting cases.
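
One possible way to make some of these false positives easier to spot is to look at the share of placeholder versions per maintainer rather than the raw count. The sketch below is an assumption on my side and not a query from the analysis above; placeholder_share and min_packages are hypothetical names. It will not filter out people who genuinely publish many one-off packages, but it does separate accounts like microsoft (107 placeholder-version packages out of 671 in the tables above) from accounts where every single package has a placeholder version.

```python
import sqlite3

# The same version list as in the query above.
PLACEHOLDER_VERSIONS = ("0.0.0", "0.0.1", "0.1.0", "1.0.0")


def placeholder_share(conn, min_packages=1):
    """Return (maintainer, total, placeholders, share) rows.

    share is the fraction of a maintainer's packages whose latest
    version is one of PLACEHOLDER_VERSIONS.
    """
    marks = ", ".join("?" for _ in PLACEHOLDER_VERSIONS)
    query = f"""
    SELECT   m.name,
             COUNT(*) AS total,
             SUM(CASE WHEN p.version IN ({marks}) THEN 1 ELSE 0 END)
                 AS placeholders
    FROM     packages AS p
    JOIN     maintainers AS m ON m.package_name = p.name
    GROUP BY m.name
    HAVING   COUNT(*) >= ?
    ORDER BY placeholders DESC, m.name
    """
    rows = conn.execute(query, (*PLACEHOLDER_VERSIONS, min_packages)).fetchall()
    return [(name, total, ph, ph / total) for name, total, ph in rows]
```

A maintainer with a share close to 1.0 and many packages is a much stronger squatting candidate than one with a large absolute count but a small share.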

Putting it all together

So, how many squatters have we found? Combining all the methods above and manually removing false positives:

Companies:

  1. airbyte-engineering (Airbyte)
  2. akarmakar (Nvidia)
  3. alexanderkjall (Schibsted)
  4. alexjxd (Alibaba)
  5. doxops (Dox)
  6. elad_pt (Cycode)
  7. Pinkyy (SBB)
  8. platform-kiwi (Kiwi)
  9. wix-ci (Wix)
  10. workiva (Workiva)
  11. yandex-bot (Yandex)

Individual squatters:

  1. anon_ssregistrar
  2. datakund_test
  3. DnA_DGAT_Chapter
  4. doerlbh
  5. eywalker
  6. kafkaservices
  7. kislyuk
  8. mvinyard2
  9. rebelliondefense
  10. se2862890720
  11. shadowwalker2718
  12. sifer
  13. takealot
  14. tcw
  15. wangc

With a better dataset, we could have better heuristics. Maybe, one day, I’ll go and find packages with only one small release with almost no code inside. Or a bunch of packages reserved in one go.
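
The “reserved in one go” idea can be roughly approximated even with the current dump by using the first_uploaded_at column. The following is only a sketch under assumptions: burst_registrations, min_packages, and max_days are hypothetical names, and it is unclear whether first_uploaded_at describes the first release of a project or only its latest one (in the textdistance row above it almost equals last_uploaded_at even though the version is 4.6.0, which hints at the latter).

```python
import sqlite3

QUERY = """
SELECT   m.name,
         COUNT(*)                 AS cnt_prj,
         MIN(p.first_uploaded_at) AS first_seen,
         MAX(p.first_uploaded_at) AS last_seen
FROM     packages AS p
JOIN     maintainers AS m ON m.package_name = p.name
WHERE    p.first_uploaded_at IS NOT NULL
GROUP BY m.name
HAVING   COUNT(*) >= ?
         AND julianday(MAX(p.first_uploaded_at))
             - julianday(MIN(p.first_uploaded_at)) <= ?
ORDER BY cnt_prj DESC, m.name
"""


def burst_registrations(conn, min_packages=10, max_days=7):
    """Maintainers whose packages were all uploaded within max_days days.

    Timestamps are expected as ISO 8601 text, as in the sample row above,
    so MIN/MAX on the strings follow chronological order.
    """
    return conn.execute(QUERY, (min_packages, max_days)).fetchall()
```

On the real data, an account like wix-ci, whose packages were all released between 2021-02-11 and 2021-02-14, is the kind of pattern this query is meant to surface.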

Questions to think about

  1. Should name squatting be allowed? Should the PyPI team care?
  2. Should we do something?
  3. Should we allow private companies to reserve names from their internal registry “for security reasons”?
  4. Should all package names be namespaced to the author, like on GitHub or Docker Hub?
  5. Should we limit the number of packages per user? Should we tell Microsoft to go and maintain their own PyPI instance?