by kannthu on 7/1/2024, 1:51:12 PM with 1 comments
*TL;DR*
Over a month ago I posted about a really hard problem that I "accidentally" solved (https://news.ycombinator.com/item?id=40460084).
The problem is resolving cross-file references across multiple programming languages. With that solved, I can now generate a graph representation of a codebase.
*Why would you need a graph representation of the codebase?*
- To understand how code references other code
- To track how data is passed around
I generated references for the repo https://github.com/dj-stripe/dj-stripe. Here is a gist: https://gist.githubusercontent.com/kannthu/6e1bdd2781d2e0a6ded30844d61f089e/raw/f1fa4bc0f34891834ce13ac256eec12f6cc671e1/dj-stripe-references.json
The gist is a big JSON blob that contains definitions from the repository.
Definitions are:
- top-level functions
- classes
- methods and public properties
- top-level variables
- exports
Each definition contains:
- Snippet, path, and range within the file
- "references" - a list of places where the definition is used
- "expressions" - a list of resolved references (variables, functions, and classes) that are used within the body of the definition
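To make the structure above concrete, here is a minimal sketch of how such definitions could be turned into a graph. It is an illustration under assumptions, not the actual schema of the gist: the "name" and "path" keys, the shape of the entries inside "expressions", and the sample definitions themselves are all hypothetical. Only the "references" and "expressions" field names come from the description above.

```python
from collections import defaultdict

# Hypothetical sample data. Key names other than "references" and
# "expressions" are assumptions, and these definitions are made up.
definitions = [
    {"name": "BaseModel", "path": "app/base.py",
     "references": [], "expressions": []},
    {"name": "Invoice", "path": "app/models.py",
     "references": [], "expressions": [{"name": "BaseModel"}]},
    {"name": "sync_invoice", "path": "app/sync.py",
     "references": [], "expressions": [{"name": "Invoice"}]},
]


def build_graph(defs):
    """Build two adjacency maps from each definition's "expressions" list."""
    uses = defaultdict(set)     # definition name -> names it uses in its body
    used_by = defaultdict(set)  # definition name -> names whose bodies use it
    for d in defs:
        for expr in d.get("expressions", []):
            uses[d["name"]].add(expr["name"])
            used_by[expr["name"]].add(d["name"])
    return uses, used_by


def transitive_dependencies(uses, start):
    """Collect everything reachable from `start` by following "uses" edges."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for dep in uses.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


uses, used_by = build_graph(definitions)
deps = transitive_dependencies(uses, "sync_invoice")
```

With the sample data, `deps` contains both `Invoice` and `BaseModel`, which is the kind of set you might pull in as context when working on `sync_invoice`.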
*How can this data be useful?*
If you are building code generation, code intelligence, or code review products, your product needs to understand the codebase across many programming languages at once. The more accurate the context you feed to an LLM, the better the output you will get, and building this in-house is really expensive and resource-consuming.
Let me know if this is interesting to any of you.