PackageAnalyzer.jl

The main functionality of the package is the analyze function:

julia> using PackageAnalyzer

julia> analyze("Flux")
PackageV1 Flux:
  * repo: https://github.com/FluxML/Flux.jl.git
  * uuid: 587475ba-b771-5e3f-ad9e-33799f191a9c
  * version: 0.13.6
  * is reachable: true
  * tree hash: 76ca02c7c0cb7b8337f7d2d0eadb46ed03c1e843
  * Julia code in `src`: 5299 lines
  * Julia code in `test`: 3030 lines (36.4% of `test` + `src`)
  * documentation in `docs`: 1856 lines (25.9% of `docs` + `src`)
  * documentation in README: 14 lines
  * has license(s) in file: MIT
    * filename: LICENSE.md
    * OSI approved: true
  * has `docs/make.jl`: true
  * has `test/runtests.jl`: true
  * has continuous integration: true
    * GitHub Actions
    * Buildkite

The argument is a string, which can be the name of a package, a local path or a URL.

NOTE: the Git repository of the package may be cloned, in order to inspect its content.
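
As a rough illustration of the three kinds of string input described above, the sketch below calls analyze once per form. The choice of the Example package and its repository URL is an assumption made for illustration; the name and URL forms need network access and an installed registry.

```julia
using PackageAnalyzer

# 1. A package name, looked up in the installed registries.
by_name = analyze("Example")

# 2. A repository URL (the repository may be cloned to inspect it).
by_url = analyze("https://github.com/JuliaLang/Example.jl")

# 3. A local path; here, the source directory of an already-loaded package.
by_path = analyze(pkgdir(PackageAnalyzer))

# Every call returns a PackageV1, so the same fields are available on each result.
for pkg in (by_name, by_url, by_path)
    println(pkg.name, " => reachable: ", pkg.reachable)
end
```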

You can also pass the output of find_package, which is used under the hood to look up package names in any installed registries. find_package also allows one to specify a package by UUID.

julia> analyze(find_package("JuMP"; version=v"1"))
PackageV1 JuMP:
  * repo: https://github.com/jump-dev/JuMP.jl.git
  * uuid: 4076af6c-e467-56ae-b986-b466b2749572
  * version: 1.0.0
  * is reachable: true
  * tree hash: 936e7ebf6c84f0c0202b83bb22461f4ebc5c9969
  * Julia code in `src`: 16906 lines
  * Julia code in `test`: 12777 lines (43.0% of `test` + `src`)
  * documentation in `docs`: 15978 lines (48.6% of `docs` + `src`)
  * documentation in README: 79 lines
  * has license(s) in file: MPL-2.0
    * filename: LICENSE.md
    * OSI approved: true
  * has `docs/make.jl`: true
  * has `test/runtests.jl`: true
  * has continuous integration: true
    * GitHub Actions
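
The UUID lookup mentioned above might look like the following sketch. It assumes that find_package accepts a UUID object from the UUIDs standard library in place of the name; the UUID itself is the one printed for JuMP in the output above.

```julia
using PackageAnalyzer
using UUIDs: UUID

# UUID of JuMP, copied from the `uuid` line of the output above.
jump_uuid = UUID("4076af6c-e467-56ae-b986-b466b2749572")

# Look the package up by UUID in the installed registries, requesting version 1.0.0.
source = find_package(jump_uuid; version=v"1")

pkg = analyze(source)
println(pkg.name, " ", pkg.version)
```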

Additionally, you can pass in the module itself:

julia> using PackageAnalyzer

julia> analyze(PackageAnalyzer)
PackageV1 PackageAnalyzer:
  * repo:
  * uuid: e713c705-17e4-4cec-abe0-95bf5bf3e10c
  * version: nothing
  * is reachable: true
  * tree hash: 7bfd2ab7049d92809eb18eed1b0548c7e07ec150
  * Julia code in `src`: 912 lines
  * Julia code in `test`: 276 lines (23.2% of `test` + `src`)
  * documentation in `docs`: 263 lines (22.4% of `docs` + `src`)
  * documentation in README: 44 lines
  * has license(s) in file: MIT
    * filename: LICENSE
    * OSI approved: true
  * has `docs/make.jl`: true
  * has `test/runtests.jl`: true
  * has continuous integration: true
    * GitHub Actions

You can also analyze the source code of a package directly by passing its path to analyze, for example with the pkgdir function:

julia> using PackageAnalyzer, DataFrames

julia> analyze(pkgdir(DataFrames))
PackageV1 DataFrames:
  * repo:
  * uuid: a93c6f00-e57d-5684-b7b6-d8193f3e46c0
  * version: 0.0.0
  * is reachable: true
  * tree hash: db2a9cb664fcea7836da4b414c3278d71dd602d2
  * Julia code in `src`: 15628 lines
  * Julia code in `test`: 21089 lines (57.4% of `test` + `src`)
  * documentation in `docs`: 6270 lines (28.6% of `docs` + `src`)
  * documentation in README: 21 lines
  * has license(s) in file: MIT
    * filename: LICENSE.md
    * OSI approved: true
  * has `docs/make.jl`: true
  * has `test/runtests.jl`: true
  * has continuous integration: true
    * GitHub Actions

You can pass the keyword argument root to specify a directory to store downloaded code.
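
A minimal sketch of the root keyword follows. The directory name is hypothetical and the Example package is used only for illustration; any writable directory should work.

```julia
using PackageAnalyzer

# Hypothetical download location; created if it does not exist yet.
download_dir = mkpath(joinpath(tempdir(), "package-analyzer-downloads"))

# Code fetched for the analysis is stored under `download_dir`.
pkg = analyze("Example"; root=download_dir)

# List whatever was written under the chosen root.
foreach(println, readdir(download_dir))
```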

The PackageV1 struct

The analyze function returns objects of type PackageV1, which has the following fields:

struct PackageV1
    name::String # name of the package
    uuid::UUID # uuid of the package
    repo::String # URL of the repository
    subdir::String # subdirectory of the package in the repo
    reachable::Bool # can the repository be cloned?
    docs::Bool # does it have documentation?
    runtests::Bool # does it have the test/runtests.jl file?
    github_actions::Bool # does it use GitHub Actions?
    travis::Bool # does it use Travis CI?
    appveyor::Bool # does it use AppVeyor?
    cirrus::Bool # does it use Cirrus CI?
    circle::Bool # does it use Circle CI?
    drone::Bool # does it use Drone CI?
    buildkite::Bool # does it use Buildkite?
    azure_pipelines::Bool # does it use Azure Pipelines?
    gitlab_pipeline::Bool # does it use GitLab Pipeline?
    license_files::Vector{LicenseV1} # a table of all possible license files
    licenses_in_project::Vector{String} # any licenses in the `license` key of the Project.toml
    lines_of_code::Vector{LinesOfCodeV2} # table of lines of code
    contributors::Vector{ContributorsV1} # table of contributor data
    version::Union{String, Missing} # the version number, if a release was analyzed
    tree_hash::String # the tree hash of the code that was analyzed
end

where:

  • LicenseV1 contains fields license_filename::String, licenses_found::Vector{String}, license_file_percent_covered::Float64,
  • LinesOfCodeV2 contains fields directory::String, language::Symbol, sublanguage::Union{Nothing, Symbol}, files::Int, code::Int, comments::Int, blanks::Int,
  • and ContributorsV1 contains fields login::Union{String,Missing}, id::Union{Int,Missing}, name::Union{String,Missing}, type::String, contributions::Int.

Adding additional fields to PackageV1 is not considered breaking, and may occur in feature releases of PackageAnalyzer.jl.

Removing or altering the meaning of existing fields is considered breaking and will only occur in major releases of PackageAnalyzer.jl.
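
Because the continuous integration flags are plain Bool fields, they are easy to query programmatically. The sketch below collects the names of the CI fields that are set. To stay runnable without network access it uses a NamedTuple stand-in that carries only the CI fields listed above (with the values shown for Flux); the helper ci_services is hypothetical and not part of PackageAnalyzer.jl. A real PackageV1 should work the same way, since only property access is used.

```julia
# The CI-related Bool fields of PackageV1, as listed in the struct above.
const CI_FIELDS = (:github_actions, :travis, :appveyor, :cirrus, :circle,
                   :drone, :buildkite, :azure_pipelines, :gitlab_pipeline)

# Hypothetical helper: return the CI fields that are `true` for `pkg`.
ci_services(pkg) = [field for field in CI_FIELDS if getproperty(pkg, field)]

# Stand-in for an analyzed package, mirroring the Flux output shown earlier
# (GitHub Actions and Buildkite enabled, everything else disabled).
flux_like = (github_actions=true, travis=false, appveyor=false, cirrus=false,
             circle=false, drone=false, buildkite=true,
             azure_pipelines=false, gitlab_pipeline=false)

println(ci_services(flux_like))
```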

Analyzing multiple packages

To run the analysis for multiple packages you can either use broadcasting

analyze.(pkg_entries)

or use the function analyze_packages(pkg_entries) which runs the analysis with multiple threads. Here, pkg_entries may be any valid input to analyze.
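
A small sketch of the multithreaded variant is shown below. The package names are arbitrary examples, and the entries are built with find_package so that they are valid inputs to analyze. Julia must be started with more than one thread (for example with julia -t 4) for the threads to be of any use.

```julia
using PackageAnalyzer

# Arbitrary example packages, resolved through the installed registries.
pkg_entries = find_package.(["Example", "JSON", "Compat"])

println("threads available: ", Threads.nthreads())

# Analyze all entries; the work is spread over the available threads.
results = analyze_packages(pkg_entries)

for pkg in results
    println(pkg.name, ": has test/runtests.jl = ", pkg.runtests)
end
```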

You can use the function find_packages to find all packages in a given registry:

julia> result = find_packages(; registry=general_registry());

julia> summary(result)
"7213-element Vector{PkgSource}"

Do not abuse this function!

WARNING: Cloning all the repos in General will take more than 20 GB of disk space and can take up to a few hours to complete.

You can also use find_packages_in_manifest to look up packages and their versions from a Manifest.toml. Besides handling release dependencies, this should also correctly handle dev'd dependencies and non-released Pkg.add'd dependencies. The helper analyze_manifest is provided as a convenience for composing find_packages_in_manifest and analyze_packages.
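
The composition described above can be sketched as follows. The manifest path is an assumption: it points at a file named Manifest.toml next to the active project, which may not exist or may be named differently in your environment.

```julia
using PackageAnalyzer

# Assumed location: a Manifest.toml next to the currently active Project.toml.
manifest = joinpath(dirname(Base.active_project()), "Manifest.toml")

# Step 1: resolve every entry of the manifest to a package source.
sources = find_packages_in_manifest(manifest)

# Step 2: analyze all of them.
results = analyze_packages(sources)

# The convenience helper is described as combining both steps:
# results = analyze_manifest(manifest)

println(length(results), " packages analyzed")
```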

License information

The license_files field of the PackageV1 object is a Tables.jl row table containing much more detailed information about any or all files containing licenses, identified by licensecheck via LicenseCheck.jl. For example, RandomProjectionTree.jl is dual licensed under both Apache-2.0 and the MIT license, and provides two separate license files. Interestingly, the README is also identified as containing an Apache-2.0 license; I've filed an issue to see if this is intentional.

julia> using PackageAnalyzer, DataFrames

julia> result = analyze("RandomProjectionTree");

julia> DataFrame(result.license_files)
3×3 DataFrame
 Row │ license_filename  licenses_found  license_file_percent_covered
     │ String            Vector{String}  Float64
─────┼────────────────────────────────────────────────────────────────
   1 │ LICENSE-APACHE    ["Apache-2.0"]                     100.0
   2 │ LICENSE-MIT       ["MIT"]                            100.0
   3 │ README.md         ["Apache-2.0"]                       6.34921

Most packages contain a single file containing a license, and so have a single entry in the table.
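
One way to separate dedicated license files from incidental matches such as the README is to filter on license_file_percent_covered. The sketch below reuses the three rows shown above as NamedTuples so that it runs without network access; the 90% threshold is an arbitrary assumption. The same property access should work on result.license_files.

```julia
# Rows copied from the RandomProjectionTree table above.
license_files = [
    (license_filename="LICENSE-APACHE", licenses_found=["Apache-2.0"], license_file_percent_covered=100.0),
    (license_filename="LICENSE-MIT", licenses_found=["MIT"], license_file_percent_covered=100.0),
    (license_filename="README.md", licenses_found=["Apache-2.0"], license_file_percent_covered=6.34921),
]

# Arbitrary threshold: keep files that consist almost entirely of license text.
dedicated = filter(row -> row.license_file_percent_covered >= 90, license_files)

# Collect the distinct license identifiers found in those files.
licenses = unique(reduce(vcat, [row.licenses_found for row in dedicated]))

println([row.license_filename for row in dedicated])
println(licenses)
```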

Lines of code

The lines_of_code field of the PackageV1 object is a Tables.jl row table containing much more detailed information about the lines of code count (thanks to tokei) and can e.g. be passed to a DataFrame for further analysis.

julia> using PackageAnalyzer, DataFrames

julia> result = analyze(pkgdir(DataFrames));

julia> DataFrame(result.lines_of_code)
15×7 DataFrame
 Row │ directory        language  sublanguage  files  code   comments  blanks
     │ String           Symbol    Union…       Int64  Int64  Int64     Int64
─────┼────────────────────────────────────────────────────────────────────────
   1 │ test             Julia                     29  17512       359    2264
   2 │ src              Julia                     31  15809       885    1253
   3 │ benchmarks       Julia                      4    245        30      50
   4 │ benchmarks       Shell                      2     15         0       0
   5 │ docs             Julia                      1     45         6       5
   6 │ docs             TOML                       1     11         0       1
   7 │ docs             Markdown                  16      0      3782     662
   8 │ docs             Markdown  Julia            4     30         3       4
   9 │ docs             Markdown  Python           1     13         0       1
  10 │ docs             Markdown  R                1      6         0       0
  11 │ Project.toml     TOML                       1     51         0       4
  12 │ README.md        Markdown                   1      0        21      10
  13 │ NEWS.md          Markdown                   1      0       267      47
  14 │ LICENSE.md       Markdown                   1      0        22       1
  15 │ CONTRIBUTING.md  Markdown                   1      0       138      20
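
The table can also be processed without DataFrames. As a sketch, the code below recomputes a test-to-source ratio of the kind printed in the package summary, using four rows copied from the table above as NamedTuples. The helper julia_code is hypothetical, and counting only top-level Julia rows (no sublanguage) is an assumption about how such a ratio might be defined.

```julia
# A subset of the rows shown above, with the LinesOfCodeV2 field names.
rows = [
    (directory="test", language=:Julia, sublanguage=nothing, files=29, code=17512, comments=359, blanks=2264),
    (directory="src", language=:Julia, sublanguage=nothing, files=31, code=15809, comments=885, blanks=1253),
    (directory="docs", language=:Julia, sublanguage=nothing, files=1, code=45, comments=6, blanks=5),
    (directory="docs", language=:Markdown, sublanguage=:Julia, files=4, code=30, comments=3, blanks=4),
]

# Hypothetical helper: lines of top-level Julia code in one directory.
function julia_code(rows, dir)
    return sum(r.code for r in rows
               if r.directory == dir && r.language === :Julia && r.sublanguage === nothing;
               init=0)
end

test_lines = julia_code(rows, "test")
src_lines = julia_code(rows, "src")

# Share of test code relative to test + src, in percent.
test_percent = round(100 * test_lines / (test_lines + src_lines); digits=1)
println("test: ", test_lines, " lines (", test_percent, "% of test + src)")
```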

Contributors to the repository

If the package repository is hosted on GitHub and you can use GitHub authentication, the list of contributors is added to the contributors field of the PackageV1 object. This table includes the GitHub username ("login") and the GitHub ID ("id") for contributors identified as GitHub "users", the "name" for contributors identified as "Anonymous", and the number of contributions each contributor made to the repository. This is the data returned by the GitHub API. Some people may have part of their contributions attributed to an anonymous user (possibly more than one!) and the rest associated with their GitHub username.

julia> using PackageAnalyzer, DataFrames

julia> result = analyze("DataFrames");

julia> df = DataFrame(result.contributors);

julia> sort!(df, :contributions, rev=true)
189×5 DataFrame
 Row │ login                id        name           type       contributions
     │ String?              Int64?    String?        String     Int64
─────┼────────────────────────────────────────────────────────────────────────
   1 │ johnmyleswhite          22064  missing        User                 431
   2 │ bkamins               6187170  missing        User                 412
   3 │ powerdistribution     5247292  missing        User                 232
   4 │ nalimilan             1120448  missing        User                 223
   5 │ garborg               2823840  missing        User                 173
   6 │ quinnj                2896623  missing        User                 104
   7 │ simonster              470884  missing        User                  87
   8 │ missing               missing  Harlan Harris  Anonymous             67
   9 │ cjprybol              3497642  missing        User                  50
  10 │ alyst                  348591  missing        User                  48
  11 │ dmbates                371258  missing        User                  47
  12 │ tshort                 636420  missing        User                  39
  13 │ doobwa                  79467  missing        User                  32
  14 │ HarlanH                130809  missing        User                  32
  15 │ kmsquire               223250  missing        User                  30
  ⋮  │          ⋮              ⋮            ⋮            ⋮            ⋮
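
Since login and id are missing for anonymous contributors, code that consumes this table has to handle both cases. The sketch below uses four rows copied from the output above as NamedTuples; the helpers display_name and the totals dictionary are hypothetical illustrations, not part of PackageAnalyzer.jl.

```julia
# Rows copied from the DataFrames contributor table above.
contributors = [
    (login="johnmyleswhite", id=22064, name=missing, type="User", contributions=431),
    (login="bkamins", id=6187170, name=missing, type="User", contributions=412),
    (login=missing, id=missing, name="Harlan Harris", type="Anonymous", contributions=67),
    (login="HarlanH", id=130809, name=missing, type="User", contributions=32),
]

# Hypothetical helper: users are identified by login, anonymous contributors by name.
display_name(c) = c.type == "User" ? c.login : c.name

# Total contributions per contributor type.
totals = Dict{String,Int}()
for c in contributors
    totals[c.type] = get(totals, c.type, 0) + c.contributions
end

println(display_name.(contributors))
println(totals)
```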

GitHub authentication

If you have a GitHub Personal Access Token, you can obtain some extra information about packages whose repository is hosted on GitHub (e.g. the list of contributors). If you store the token in an environment variable called GITHUB_TOKEN or GITHUB_AUTH, it is used automatically whenever possible. Otherwise, you can generate a GitHub authentication with the PackageAnalyzer.github_auth function and pass it to the functions that accept the auth::GitHub.Authorization keyword argument.
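
A sketch of passing the authentication explicitly is given below. It assumes that PackageAnalyzer.github_auth accepts the token as a string argument, and the environment variable name MY_GITHUB_PAT is hypothetical; check the function's docstring before relying on this.

```julia
using PackageAnalyzer

# Hypothetical variable name; used here because the token is not stored in
# GITHUB_TOKEN or GITHUB_AUTH, which would be picked up automatically.
token = get(ENV, "MY_GITHUB_PAT", "")

# Assumption: `github_auth` takes the token string and returns a GitHub.Authorization.
auth = PackageAnalyzer.github_auth(token)

# Pass the authentication through the `auth` keyword argument.
pkg = analyze("DataFrames"; auth=auth)
println(length(pkg.contributors), " contributors found")
```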