[research] deps2vec #100

@r0mainK

This issue replaces #65, which ran for too long to remain effective.

Context

Theory

Two papers at MSR 2019 have shown that dependency graphs can be used to extract information. Namely:

Business

As detailed in the ML e-book, there are multiple business use-cases related to the dependency graph:

  • Grouping projects by dependency similarity, either in a 2D projection or by answering nearest-neighbour queries.
  • Finding frequent dependency sets, the same way as retailers determine frequent product baskets.
  • Finding competing dependencies.
  • Suggesting new dependencies.
  • Recommending alternatives.
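
As a rough illustration of the "frequent dependency sets" use-case, here is a minimal sketch that counts dependency pairs co-occurring across projects, basket-style. The `projects` mapping, the `frequent_pairs` helper and the `min_support` threshold are hypothetical names and toy data, not something taken from PGA or the e-book.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: project name -> set of dependencies it imports.
projects = {
    "proj_a": {"numpy", "pandas", "scipy"},
    "proj_b": {"numpy", "pandas"},
    "proj_c": {"numpy", "requests"},
    "proj_d": {"flask", "requests"},
}


def frequent_pairs(projects, min_support):
    """Return dependency pairs that co-occur in at least `min_support` projects."""
    counts = Counter()
    for deps in projects.values():
        # Sort so that each unordered pair is always counted under the same key.
        counts.update(combinations(sorted(deps), 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}


pairs = frequent_pairs(projects, min_support=2)
```

Counting pairs only is the simplest case. Larger itemsets would probably call for a proper frequent-itemset algorithm rather than brute-force enumeration.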

It is also quite possible that we could leverage the dependency graph to group similar developers and to assess expertise in libraries or ecosystems.

Objective

We want to use the dependency graph of PGA to create embeddings for as many libraries as possible, and explore how we could use these to address the business use-cases described above.

Checklist

  • Extract the dependencies from the ClickHouse DB. We will do this at some level of granularity, but package/class-level imports are expected to simply introduce noise and to be a pain to extract. We will also ignore aliases.
  • Postprocess the dataset
    • Normalize local imports.
    • Link dependencies with different names, keeping in mind that we want to separate versions when possible
    • Convert dataset to a sparse representation.
  • Create the embeddings with Swivel.
    • Per language
    • Globally
  • Evaluate the quality of the embeddings.
    • Qualitatively
    • With an SGNS-like model, comparing against random.
    • Robustness with k-NN (see the previous issue).
  • Explore how to create sensible project/developer representations from the embeddings, and test on orgs.
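
Two of the steps above can be sketched on toy data: building a sparse co-occurrence representation (Swivel factorizes a co-occurrence matrix), and pooling dependency embeddings into a project representation. This is only a sketch under assumptions. The `projects` data, the `project_vector` helper and the choice of mean pooling are hypothetical, and the real pipeline would read from the extracted dataset and write whatever input format the Swivel implementation expects.

```python
from collections import Counter
from itertools import permutations

# Hypothetical toy data: project -> dependencies (after normalization).
projects = {
    "proj_a": ["numpy", "pandas"],
    "proj_b": ["numpy", "pandas", "requests"],
    "proj_c": ["requests", "flask"],
}

# Assign each dependency a stable integer id.
vocab = sorted({dep for deps in projects.values() for dep in deps})
index = {dep: i for i, dep in enumerate(vocab)}

# Sparse symmetric co-occurrence counts in COO-like form: (row, col) -> count.
# Only non-zero entries are stored; the diagonal is left empty.
cooc = Counter()
for deps in projects.values():
    ids = sorted({index[d] for d in deps})
    cooc.update(permutations(ids, 2))


def project_vector(deps, embeddings):
    """Mean of the embeddings of the known dependencies (an assumed pooling choice)."""
    vectors = [embeddings[d] for d in deps if d in embeddings]
    if not vectors:
        return None  # No known dependency, so no representation.
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]
```

Mean pooling is just the simplest baseline for a project or developer vector. Weighted schemes (for example by import frequency) are an obvious alternative to explore.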
