Research Report for pds/skeleton

Introduction and Summary

There is at least some level of interest across the PHP ecosystem in how packages are, or ought to be, organized. Here are a handful of pages regarding that topic:

Thus, the motivation is to examine existing PHP packages and find commonalities between them, so as to discern what they each use for directory and file names.

This research started with the idea that there was a common need for the following kinds of files in a package:

On examining project packages, and after the public review period, some additional common directories and files were noted:

After collection and analysis, the conclusion was that the following names should be used in the standard:

bin/              # command-line files
config/           # configuration files
docs/             # documentation files
public/           # web files
resources/        # other resource files
src/              # PHP source files
tests/            # test files
CHANGELOG(.*)     # change notes
CONTRIBUTING(.*)  # contribution guidelines
LICENSE(.*)       # licensing information
README(.*)        # read-me-first file

Of the collected packages, 69% appear to already comply with these naming standards.

Methodology

Collection

  1. Get the list of all packages on Packagist (cf. results/list.json).

  2. Parse the list to find all vendors with more than 3 packages; having more than 3 implies a minimum level of experience and practice with building and publishing packages.

  3. For each of those, fetch the package JSON files from Packagist ... <https://packagist.org/p/{$VENDOR}/{$PACKAGE}.json> ... and retain them.

  4. For each of the fetched package JSON files ...

    • skip the package if it is marked as "abandoned";
    • otherwise, find its first "source" entry;
    • and skip the entry if it is not hosted at GitHub (this is to minimize tooling necessary to analyze repositories without downloading them).

  5. For all of the fetched non-abandoned packages hosted at GitHub, scrape the GitHub page for the default branch, to retain the top-level files and directories in the repository.
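Steps 2 and 4 of the collection can be sketched in Python as follows. This is an illustration only, not the tooling used for the research. The function names are hypothetical, the package list is assumed to be a flat sequence of "vendor/package" strings, and each package's metadata is assumed to be a mapping of version to a dictionary that may carry "abandoned" and "source" keys.

```python
from collections import defaultdict


def vendors_with_more_than(package_names, minimum=3):
    """Group "vendor/package" names by vendor; keep vendors with more than `minimum` packages."""
    by_vendor = defaultdict(list)
    for name in package_names:
        vendor, _, package = name.partition("/")
        by_vendor[vendor].append(package)
    return {vendor: pkgs for vendor, pkgs in by_vendor.items() if len(pkgs) > minimum}


def github_source_url(versions):
    """Return the first source URL if the package is usable, else None.

    `versions` has an assumed shape:
    {version: {"abandoned": ..., "source": {"url": ...}}}
    """
    # Skip the package if any version is marked as abandoned.
    if any(meta.get("abandoned") for meta in versions.values()):
        return None
    # Otherwise, find the first "source" entry.
    for meta in versions.values():
        source = meta.get("source")
        if source:
            url = source.get("url", "")
            # Skip the entry if it is not hosted at GitHub.
            return url if "github.com" in url else None
    return None
```

A vendor with exactly three packages is excluded, since the criterion is "more than 3" rather than "3 or more".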

Whereas the Packagist list.json file indicates 71746 packages total from 7682 vendors (each with more than 3 packages), there were some cases where downloading didn't work:

Taking into account unavailable package JSON files, non-GitHub hosts, and missing source repositories, the sample ended up being 65617 packages. That is, 6129 packages were not retrievable during the collection process, for an attrition rate of 8.5%.

For comparison, the list.json file indicates a nominal total of 120247 packages on Packagist (this includes abandoned and missing packages). Thus, the collection process brought in 54.6% of the nominal total number of packages on Packagist.
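The attrition and coverage figures follow directly from the three package counts quoted in this section. A short check, with variable names chosen only for this illustration:

```python
# Counts quoted in the report.
listed = 71746     # packages from vendors with more than 3 packages
sampled = 65617    # packages actually collected
nominal = 120247   # nominal total of packages on Packagist

lost = listed - sampled
attrition = lost / listed * 100     # share of listed packages not retrieved
coverage = sampled / nominal * 100  # share of all of Packagist in the sample

print(f"{lost} packages lost, attrition {attrition:.1f}%, coverage {coverage:.1f}%")
# prints "6129 packages lost, attrition 8.5%, coverage 54.6%"
```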

Analysis

First Pass

This gives us an idea of what all the different top-level names are in the downloaded packages, and how often they are used:

Results:

It turns out there are 6082 unique top-level directories, and 30826 unique top-level file names.
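A tally of this kind can be sketched with a pair of counters. The input shape here is an assumption made for the example: a mapping of package name to its top-level entries, with directories marked by a trailing slash. The function name is likewise hypothetical.

```python
from collections import Counter


def count_top_level_names(packages):
    """Count how many packages use each top-level directory and file name."""
    dirs, files = Counter(), Counter()
    for entries in packages.values():
        for entry in entries:
            # Assumed convention: a trailing slash marks a directory.
            if entry.endswith("/"):
                dirs[entry] += 1
            else:
                files[entry] += 1
    return dirs, files
```

The number of unique names is then `len(dirs)` and `len(files)`, and `dirs.most_common()` lists the names by how often they are used.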

Second Pass

This groups the directories and files by their presumed intent, rather than by their name:

Results:

For this, the collations into categories were necessarily "by hand," as there was no automated way to do so. The categories were for the initial expectations:

On inspection of the results, this pass netted one highly-used directory not in the original expectation:

Looking past the files that are apparently tool-specific (composer.json, phpunit.xml, etc.), this pass also netted some highly-used files not in the original expectation:

The frequency of the occurrence of these elements would seem to warrant inclusion in the analysis.

Third Pass

This brings the unexpected directories and files into the groupings:

Results:

The resulting set of categories was:

Fourth Pass

Now that a set of categories is in place, pick an appropriate directory or file name for each category. It seems reasonable that the name be the same as the most frequently occurring name within the category, resulting in:

src/
tests/
assets/
docs/
config/
bin/
README.md
LICENSE
CHANGELOG.md
CONTRIBUTING.md
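The selection rule (take the most frequently occurring name within each category) can be sketched as below. The function name, the category labels, and the name-to-category mapping are hypothetical. In the research the collation into categories was done by hand.

```python
from collections import Counter, defaultdict


def most_frequent_per_category(name_counts, category_of):
    """Pick the most-used name in each category.

    name_counts: {name: number of packages using it}
    category_of: {name: category}, a hand-made mapping (hypothetical here)
    """
    grouped = defaultdict(Counter)
    for name, count in name_counts.items():
        category = category_of.get(name)
        if category is not None:  # names outside any category are ignored
            grouped[category][name] += count
    return {cat: names.most_common(1)[0][0] for cat, names in grouped.items()}
```

Names that belong to no category, such as tool-specific files, simply drop out of the result.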

However, some of these may not be appropriate:

As such:

Fifth Pass

This pass re-runs the analysis to incorporate feedback from the public review period:

Results:

That gave the following set of categories:

The most common name for a directory of resource files is Resources/. However, for consistency with the other directory names, this report recommends the lower-case form (which is the second-most frequent use).

Conclusion

Recommendation

Given the above collection and analysis, these names should be used for these purposes:

bin/              # command-line files
config/           # configuration files
docs/             # documentation files
public/           # web files
resources/        # other resource files
src/              # PHP source code files
tests/            # test files
CHANGELOG(.*)     # change notes
CONTRIBUTING(.*)  # contribution guidelines
LICENSE(.*)       # licensing information
README(.*)        # read-me-first file

Since not all packages may need all these categories, they need not be required to be present in a package. However, if a package does provide directories or files of these categories, they should use the names listed.

Current Compliance

Of the 65617 packages in the sample, 45191 (69%) of them appear compliant with the above recommendation.

Results: compliance.txt

This does not mean that all the apparently compliant packages use all the directories and all the files named in the conclusion. Rather, it means that when directories and files for the related purpose are present in the package, they use the names indicated above.

For example, a package is apparently compliant when a directory for executable files is provided with the name bin/, and not cli/ (or something else). If no such directory is provided under any recognizable name, the package is still apparently compliant, since not all packages may provide all the kinds of directories and files named above.
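The compliance rule described here can be sketched as a check for non-recommended names. This is a minimal illustration, not the script behind compliance.txt. The function name is hypothetical, and the table of alternates holds only the two alternate names mentioned in this report (cli/ and Resources/), whereas the research collated many more.

```python
# Recommended name -> other names serving the same purpose.
# Only the alternates mentioned in the report are listed; a real check
# would need the full hand-collated categories.
ALTERNATES = {
    "bin/": {"cli/"},
    "resources/": {"Resources/"},
}


def is_apparently_compliant(top_level_names, alternates=ALTERNATES):
    """True unless the package uses a non-recommended name for a known purpose."""
    present = set(top_level_names)
    for recommended, others in alternates.items():
        if present & others:
            # A directory for this purpose exists under another name.
            return False
    # Missing categories do not count against the package.
    return True
```

For example, `is_apparently_compliant(["src/", "bin/"])` and `is_apparently_compliant(["src/"])` both hold, while a package listing `cli/` does not pass.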

Addendum

After completing the primary research, collection against all of Packagist (110212 packages after attrition) revealed that the above analysis holds true across the ecosystem, not just for vendors of more than three packages.

Results:

Of all the packages collected, 78725 (71%) of them appear compliant with the above recommendation.

Results: