GraphGen4Code

A Toolkit for Generating Code Knowledge Graphs

Knowledge graphs have proven extremely useful in powering diverse applications in semantic search and natural language understanding. In this work, we present GraphGen4Code, a toolkit to build code knowledge graphs that can similarly power applications such as program search, code understanding, bug detection, and code automation. GraphGen4Code uses generic techniques to capture code semantics, with the key nodes in the graph representing classes, functions, and methods. Edges indicate function usage (e.g., how data flows through function calls, as derived from program analysis of real code) and documentation about functions (e.g., code documentation, usage documentation, or forum discussions such as StackOverflow). Our toolkit uses named graphs in RDF to model one graph per program, or can output graphs as JSON. We show the scalability of the toolkit by applying it to 1.3 million Python files drawn from GitHub, 2,300 Python modules, and 47 million forum posts. This results in an integrated code graph with over 2 billion triples. We make both the toolkit for building such graphs and the sample extraction of the 2-billion-triple graph publicly available to the community.

Table of Contents

  1. Sample Graphs Generated by GraphGen4Code
  2. GraphGen4Code Pipeline
  3. Schema
  4. Create your own graph
  5. Example Queries
  6. Example Use Cases
  7. Publications

Sample Graphs Generated by GraphGen4Code

1.3 Million Python Programs from GitHub

To demonstrate GraphGen4Code’s scalability, we build graphs for 1.3 million Python programs on GitHub (where a program is a single Python script), each analyzed into its own separate graph. We also use the toolkit to link library calls to documentation and forum discussions by identifying the most commonly used modules in code and connecting their classes, methods, and functions to relevant documentation or posts. For forum posts, we used information retrieval techniques to connect each post to its relevant methods or classes. We performed this linking for 257K classes, 5.8M methods, and 278K functions, and processed 47M posts from StackOverflow and StackExchange. This shows the feasibility of using the GraphGen4Code toolkit to build large-scale knowledge graphs for code that capture code semantics as well as natural language artifacts about code.
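The idea of connecting forum posts to functions with information retrieval can be illustrated with a toy ranking. This is a minimal sketch, not the toolkit's actual linking method: it assumes posts are plain strings keyed by hypothetical ids, and it scores each post by a simple TF-IDF-weighted overlap with the tokens of a qualified function name. The names `tokenize` and `rank_posts` are made up for this illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (also splits dotted names)."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def rank_posts(qualified_name, posts):
    """Rank posts for one function by TF-IDF-weighted overlap with its name tokens.

    `posts` maps a (hypothetical) post id to the post text.
    Returns (post_id, score) pairs with score > 0, best match first.
    """
    query = set(tokenize(qualified_name))
    docs = {pid: Counter(tokenize(text)) for pid, text in posts.items()}
    n = len(docs)
    # Document frequency: how many posts contain each query token.
    df = {t: sum(1 for counts in docs.values() if t in counts) for t in query}
    scored = []
    for pid, counts in docs.items():
        # Only tokens present in the post contribute, so df[t] >= 1 here.
        score = sum(counts[t] * math.log(1 + n / df[t]) for t in query if counts[t])
        if score > 0:
            scored.append((pid, score))
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Illustrative data only; these are not real forum posts.
example_posts = {
    "p1": "How do I use pandas read_csv with a custom separator?",
    "p2": "Sorting a list in Python",
    "p3": "pandas DataFrame merge is slow",
}
for post_id, score in rank_posts("pandas.read_csv", example_posts):
    print(post_id, round(score, 3))
```

At the scale reported above (47M posts), a dedicated search index would be a more realistic choice than this in-memory loop; the sketch only shows the shape of the matching problem.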

All graph files are available here.

To load and query this data, please follow the instructions here. We also provide scripts for creating a Docker image with the graph database ready to use.

ETH 150k Python Dataset

We also used GraphGen4Code to produce graphs for the ETH 150k Python Dataset, which was collected from GitHub. The ETH 150k dataset has been used to train models for code recommendation, type inference, program repair, and similar tasks. We provide graph data for this dataset in both JSON and RDF N-Quads formats.
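Because each program is modeled as its own named graph, an N-Quads file can be summarized by grouping statements on their fourth term, the graph label. The following is a rough sketch using only the Python standard library; it assumes well-formed N-Quads with one statement per line, and the IRIs in the sample are hypothetical placeholders, not identifiers from the published data. For real workloads, an RDF library or a graph database is the safer choice.

```python
import re
from collections import Counter

# One N-Quads term: an IRI, a blank node, or a literal with an optional
# datatype or language tag. Simplified for illustration.
TERM = re.compile(
    r'<[^>]*>'
    r'|_:\S+'
    r'|"(?:[^"\\]|\\.)*"(?:\^\^<[^>]*>|@[A-Za-z0-9-]+)?'
)


def count_quads_per_graph(lines):
    """Count statements per named graph; the key None is the default graph."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        terms = TERM.findall(line)
        if len(terms) == 4:
            counts[terms[3].strip("<>")] += 1
        elif len(terms) == 3:
            counts[None] += 1
        else:
            raise ValueError(f"unexpected statement: {line!r}")
    return counts


# Hypothetical sample: two programs, each with its own named graph.
SAMPLE = '''
<http://example.org/s1> <http://example.org/p> <http://example.org/o1> <http://example.org/graph/a.py> .
<http://example.org/s2> <http://example.org/p> "read a <csv> file"@en <http://example.org/graph/a.py> .
<http://example.org/s3> <http://example.org/p> "x" <http://example.org/graph/b.py> .
'''

for graph, n in sorted(count_quads_per_graph(SAMPLE.splitlines()).items()):
    print(graph, n)
```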

GraphGen4Code Pipeline

The figure below shows the overall pipeline of steps followed by GraphGen4Code to generate large-scale code knowledge graphs.



Schema

The following shows an example code snippet as well as a high-level overview of the information GraphGen4Code generates from code analysis, StackOverflow, and docstrings. We provide a random sample of each data source in RDF format here.

Code Snippet Example



Dataflow graph for the running example



StackOverflow Graph Example



Docstrings Graph Example



Publications

@article{abdelaziz2020codebreaker,
  title={A Demonstration of CodeBreaker: A Machine Interpretable Knowledge Graph for Code},
  author={Abdelaziz, Ibrahim and Srinivas, Kavitha and Dolby, Julian and McCusker, James P},
  journal={International Semantic Web Conference (ISWC) (Demonstration Track)},
  year={2020}
}

@article{abdelaziz2020graph4code,
  title={Graph4Code: A Machine Interpretable Knowledge Graph for Code},
  author={Abdelaziz, Ibrahim and Dolby, Julian and McCusker, James P and Srinivas, Kavitha},
  journal={arXiv preprint arXiv:2002.09440},
  year={2020}
}