

A Toolkit for Generating Code Knowledge Graphs

Example Use Cases

Recommendation engine for developers

CodeBreaker is a coding assistant built on top of Graph4Code to help data scientists write code. It suggests the most plausible next coding step, finds relevant StackOverflow posts based purely on the users’ code, and lets users see what sorts of models other people have constructed for data flows similar to their own. CodeBreaker uses the Language Server Protocol (LSP) to integrate with any IDE that supports it. For a detailed description of this use case, see the demo paper. A video of this use case is also available here.

Enforcing best practices

Many best practices for API frameworks can be encoded into query templates over data flow and control flow. Here we give three such examples for data science code, along with queries which can be templatized.
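To make the idea of a query template concrete, the following is a minimal sketch over a toy, hand-written data-flow graph. It is not Graph4Code's query language or API: the graph contents, the function names `flows_to` and `check_template`, and the example practice (training data should pass through `train_test_split` before reaching `fit`) are all assumptions made for illustration.

```python
from collections import deque

# Toy data-flow graph (hypothetical): each key is a call, and its value
# lists the calls that consume its output.
FLOW = {
    "pandas.read_csv": ["sklearn.model_selection.train_test_split"],
    "sklearn.model_selection.train_test_split": [
        "sklearn.svm.SVC.fit",
        "sklearn.svm.SVC.predict",
    ],
    "sklearn.svm.SVC": ["sklearn.svm.SVC.fit"],
    "sklearn.svm.SVC.fit": ["sklearn.svm.SVC.predict"],
}


def flows_to(graph, source, sink):
    """Return True if a data-flow path leads from `source` to `sink`."""
    seen = {source}
    queue = deque([source])
    while queue:  # breadth-first search over data-flow edges
        node = queue.popleft()
        if node == sink:
            return True
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def check_template(graph, required_source, sink):
    """Template with two slots: does data from `required_source` reach `sink`?"""
    return flows_to(graph, required_source, sink)


# Instantiate the template for one hypothetical best practice.
ok = check_template(
    FLOW,
    "sklearn.model_selection.train_test_split",
    "sklearn.svm.SVC.fit",
)
print("fit receives split data:", ok)
```

The same template can be reused for other source and sink pairs by changing only the two slot values, which is the sense in which a query is "templatized" here.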

Debugging with StackOverflow

A common use of sites such as StackOverflow is to search for posts related to an issue with a developer’s code, often a crash.
In this use case, we show an example of searching StackOverflow using the code context in the following figure, based on the highlighted code locations found with dataflow to the fit call.

Such a search on Graph4Code produces the StackOverflow result shown above, based on links with the coding context, specifically the train_test_split and fit calls, as one might expect. Suppose we had given SVC a very large dataset and the fit call had run into memory issues; we could augment the query to look for posts that mention “memory issue”, in addition to taking the code context shown in the above figure into consideration. The figure below shows the first result returned by such a query over the knowledge graph. This hit is ranked highest because it matches both the code context in the motivating figure, highlighted with green ellipses, and the terms “memory issue” in the text. Interestingly, despite its irrelevant title, the answer is a valid one for the problem.

A text search on StackOverflow with sklearn, SVC and memory issues as terms does not return this answer in the top 10 results. We show below the second result, which is the first result returned by a text search on StackOverflow. Note that our system ranks this lower because the coding context does not match the result as closely.
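The ranking behavior described in this use case can be illustrated with a minimal sketch that scores posts by how many calls from the code context they are linked to, plus how many query terms appear in their text. This is an assumed toy scoring scheme, not Graph4Code's actual ranking; the posts, their ids, and the functions `score` and `rank` are hypothetical.

```python
# Hypothetical posts: `calls` are the code identifiers a post is linked to,
# `text` is its prose. Neither post is real StackOverflow data.
POSTS = [
    {
        "id": "post_b",
        "calls": {"SVC.fit"},
        "text": "sklearn SVC memory issue on large input",
    },
    {
        "id": "post_a",
        "calls": {"train_test_split", "SVC.fit"},
        "text": "Unrelated title, but the answer covers a memory issue.",
    },
]

CONTEXT = {"train_test_split", "SVC.fit"}  # calls found via data flow
TERMS = ["memory issue"]                   # extra text terms in the query


def score(post, context_calls, terms):
    """Count matched context calls plus matched text terms."""
    code_hits = len(context_calls & post["calls"])
    text = post["text"].lower()
    text_hits = sum(1 for term in terms if term.lower() in text)
    return code_hits + text_hits


def rank(posts, context_calls, terms):
    """Return posts ordered from best to worst match."""
    return sorted(
        posts,
        key=lambda p: score(p, context_calls, terms),
        reverse=True,
    )


for post in rank(POSTS, CONTEXT, TERMS):
    print(post["id"], score(post, CONTEXT, TERMS))
```

In this toy data both posts match the text terms, so the post linked to more of the code context ranks first, mirroring why a pure text search can order the same results differently.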

Learning from big code

There has been an explosion of work on mining large open-domain repositories for a wide variety of tasks (see here). We sketch a couple of examples of how Graph4Code can be used in this context.