PyPI download metrics seem unrealistic

I put the package on PyPI for the first time about two months ago and have pushed a few updates since then. This week I noticed the download count and was surprised to see the package had been downloaded hundreds of times. Over the next few days I was even more surprised to see the count sometimes increase by hundreds per day, even though this is a niche package for statistical testing. In particular, older versions of the package continue to be downloaded, sometimes at higher rates than the newest version.

What's going on here?

Is there an error in how PyPI counts downloads, or are there lots of crawlers out there scraping open-source code (like mine)?

+63
python pypi web-crawler
Mar 10
5 answers

This is an old question at this point, but I noticed the same thing about a package I have on PyPI and looked into it further. It turns out PyPI keeps fairly detailed download statistics, including (apparently slightly anonymized) user agents. From those it was obvious that most of the agents downloading my package were things like "z3c.pypimirror/1.0.15.1" and "pep381client/1.5". (PEP 381 describes the mirroring infrastructure for PyPI.)

I wrote a quick script to tally everything up, first including all of them and then excluding the most obvious bots, and it turns out that literally 99% of the download activity for my package was caused by mirrors: 14,335 total downloads, versus 146 with the bots filtered out. And that only filters out the very obvious ones, so it is probably still an overestimate.
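The tallying script is not shown in the answer; a minimal sketch of the idea might look like the following. The record format and the `MIRROR_AGENTS` prefixes are assumptions for illustration (the two mirror clients named above, plus the per-agent counts, are made up to reproduce the totals quoted):

```python
# Hypothetical sketch: tally download records by user agent and
# separate out well-known mirroring clients.
from collections import Counter

# User-agent prefixes treated as mirror/bot traffic (assumed list).
MIRROR_AGENTS = ("z3c.pypimirror", "pep381client")

def split_counts(records):
    """records: iterable of (user_agent, download_count) pairs.
    Returns (total_downloads, downloads_excluding_mirrors)."""
    total = Counter()
    humans = Counter()
    for agent, count in records:
        total[agent] += count
        if not agent.startswith(MIRROR_AGENTS):
            humans[agent] += count
    return sum(total.values()), sum(humans.values())

# Illustrative data, chosen to match the totals mentioned above.
records = [
    ("z3c.pypimirror/1.0.15.1", 9000),
    ("pep381client/1.5", 5189),
    ("pip/1.2.1", 100),
    ("Python-urllib/2.7", 46),
]
print(split_counts(records))  # → (14335, 146)
```

Filtering on user-agent prefixes only catches clients that identify themselves honestly, which is why the filtered figure is still likely an overestimate of human downloads.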

It seems that the main reason PyPI needs mirrors is the fact that it has mirrors.

+72
Feb 06 '13 at 10:05

Starting from the wrap-up of Cairnarvon's answer:

"It seems that the main reason PyPI needs mirrors is the fact that it has mirrors."

I would amend this slightly:

It may be more that the way PyPI actually works means it has to be mirrored, and that this may add a bit (or two :-) on top of the real traffic.

At the moment, I think you MUST interact with the main index to find out what needs updating in your repository. The state is not simply accessible through timestamps on some public folder hierarchy. Too bad, since that takes rsync out of the equation. The good news is that you CAN talk to the index through JSON, OAuth, XML-RPC or HTTP interfaces.

For XML-RPC:

```
$ python
>>> import xmlrpclib
>>> import pprint
>>> client = xmlrpclib.ServerProxy('http://pypi.python.org/pypi')
>>> client.package_releases('PartitionSets')
['0.1.1']
```

For JSON, for example:

```
$ curl https://pypi.python.org/pypi/PartitionSets/0.1.1/json
```
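The JSON returned by that endpoint is easy to consume from Python as well. A minimal sketch, where the sample payload is a hand-trimmed stand-in for the real response rather than data fetched live (in practice you would download it with `urllib`/`curl` first):

```python
import json

# Trimmed stand-in for the JSON that the endpoint above returns;
# the real response carries many more fields (downloads, urls, ...).
raw = '''{
  "info": {"name": "PartitionSets", "version": "0.1.1"},
  "urls": [{"filename": "PartitionSets-0.1.1.tar.gz"}]
}'''

data = json.loads(raw)
print(data["info"]["name"], data["info"]["version"])
# → PartitionSets 0.1.1
```

A mirroring client does essentially this per package: compare the advertised release metadata against its local copy and fetch whatever changed.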

With approx. 30,000 packages hosted [1], some of which are downloaded 50,000 to 300,000 times per week [2] (e.g. distribute, pip, requests, paramiko, lxml, boto, redis and others), you really need mirrors, at least in terms of availability. Just imagine what a user does when pip install NeedThisPackage fails: wait? Company-wide PyPI mirrors are also quite common, acting as proxies for otherwise unreachable networks. Finally, don't forget the wonderful multi-version checking enabled by virtualenv and friends. These are all, IMO, legitimate and potentially wonderful uses of packages ...

In the end, you never know what an agent really does with a downloaded package: do N users really use it, or do they just throw it away and re-download it next time ... and ultimately, IMHO, package authors should care more about the number and nature of uses than about the raw number of potential users ;-)

Refs: [1] numbers from https://pypi.python.org/pypi (29,303 packages) and [2] weekly numbers from http://pypi-ranking.info/week (retrieved 2013-03-23).

+11
Mar 23 '13 at 7:47

You should also consider that virtualenv is becoming more and more popular. If your package is something like a core library that people use in many of their projects, they will typically download it several times.

Consider one user with 5 projects that use your package, each living in its own virtualenv. When pip resolves the requirements, your package has already been downloaded 5 times. These projects might then be set up on different machines, such as work, home and laptop computers; on top of that, a web application may have a staging and a live server. Summing up, you end up with many downloads from a single person.
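The multiplication in that scenario can be made concrete with a quick back-of-the-envelope sketch (the per-user counts are the illustrative assumptions from the paragraph above, not measured data):

```python
# Rough arithmetic: how one person turns into many downloads.
projects = 5          # independent projects using the package
machines = 3          # work, home, laptop: each project installed on each
web_servers = 2       # staging + live server for a web application

downloads_per_user = projects * machines + web_servers
print(downloads_per_user)  # → 17
```

So a single engaged user can plausibly account for a dozen or more downloads before any mirrors or CI systems are even involved.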

Just a thought ... maybe your package is simply good. ;)

+10
Mar 10

Hypothesis: CI tools such as Travis CI and AppVeyor also contribute. They may mean that every commit/push triggers a fresh build environment that installs everything in requirements.txt, package downloads included.

+2
Apr 14 '16 at 13:23

The numbers on PyPI-Stats.com seem reasonable.

0
Mar 13 '17 at 10:22
