# Parser and code understanding engine
By Benjamin Carter, Dylan Johnson, Hunter Jenkins, and Jacob McAuley Penney
The project consists of two parts:

- A training and prediction engine --> the CoreEngine repository (this repository)
- An interactive UI that utilizes CoreEngine --> the Art-UI repository
See the video at `docs/demo.mp4` for a demonstration of how this works.

This README covers the training and prediction engine: the CoreEngine.
There are two sides to the CoreEngine.
- Training models, both LLM models and Random Forest models.
- A prediction API, which allows the models to be queried to classify issues. The prediction API is what outside projects (such as the UI) call.
The program takes a configuration JSON file as input (see `docs/repo-extractor/user/configuration_opts.md` for more information).

The JSON file specifies the repository to extract, as well as the range of issues/PRs to extract. In training, the CoreEngine extracts all the PRs specified in the JSON file and trains either a Random Forest model, a single GPT model, or a combined binary-classification GPT model; the model type is specified in the JSON file. After training, it saves the model at the location given in the JSON configuration.
Training was performed on 11 repositories; see the dataset document for the full list.
1. Make sure Poetry is installed (see the Poetry documentation for instructions).
2. Run `poetry install` -- this sets up the virtual environment.
3. Run `poetry run python3 -m spacy download en_core_web_md` to download the language file.
4. Create a GitHub Personal Access Token to use for downloading issues from GitHub and save it in a file.
5. Set up a configuration file for training like the sketch below (see the pre-filled example configuration for defaults).
6. Set an environment variable in a `.env` file with `OPENAI_API_KEY` set to an OpenAI key.
7. Place the GitHub key in a file located at the `auth_path` specified in the `config.json` (default: `input/mp_auth.txt`).
8. Run `poetry run python3 main.py path/to/config.json`, where the JSON file is the configuration set up in step 5. This will download, analyze, and train the model. It stores the results in a cache, preventing repeated calls.
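For reference, a training configuration might look roughly like the sketch below. Apart from `auth_path`, the field names here are illustrative assumptions rather than the actual schema; consult `docs/repo-extractor/user/configuration_opts.md` and the pre-filled example configuration for the real option names.

```json
{
    "repo": "owner/repository",
    "auth_path": "input/mp_auth.txt",
    "issue_range": [1, 500],
    "model_type": "gpt-combined",
    "model_output_path": "output/gpt_combined_model.pkl"
}
```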
If you want to see the AI calls in real time, run `tail -n 100 -f output/ai_log.csv`.
The run produces several artifacts in the `output` directory:

- `ai_log.csv`: the log of all AI calls made.
- `main.db`: the main SQLite file that manages all the run artifacts. Deleting this file will prompt complete environment regeneration.
- `ai_cache_results.db`: a persistent backup database. Deleting it is not recommended, as doing so will result in replayed calls to OpenAI.
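To inspect the log after the fact rather than streaming it with `tail`, a minimal sketch is shown below. It assumes only that `ai_log.csv` is an ordinary comma-separated file; the exact column layout is whatever CoreEngine records for each call.

```python
import csv

# Print the ten most recent AI calls from the log.
# Assumes only that output/ai_log.csv is ordinary CSV; the column
# layout is whatever CoreEngine writes per OpenAI call.
with open("output/ai_log.csv", newline="") as f:
    rows = list(csv.reader(f))

for row in rows[-10:]:
    print(row)
```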
> ⚠️ **Warning**
>
> SAVE `ai_result_backup.db` in the `output` directory, as this keeps track of AI artifacts. Deleting this file can result in having to redo OpenAI calls, costing money!
> ℹ️ **Info**
>
> If you want to restart the analysis from a clean state, delete ONLY the `main.db` file in the `output` directory. You should rarely have to delete `main.db`, except when switching repositories; `main.db` caches all extracted data to prevent re-downloading.
After a model is trained, it can be used to predict an issue's domains.
Import `CoreEngine.src.external.External_Model_Interface` and `CoreEngine.src.database_manager.DatabaseManager`. These can then be used to query the models.
Set up the database connection:
```python
import CoreEngine  # assumes the CoreEngine package is importable from your working directory

# Defaults are set to the ones in the CoreEngine basic configuration, `config_example.json`.
db = CoreEngine.DatabaseManager(
    dbfile="./output/main.db",
    cachefile="./ai_result_backup.db",
    label_file="./data/subdomain_labels.json",
)
```
Then create a model interface for each trained model:

```python
# Combined binary-classification GPT model.
external_gpt_combined = CoreEngine.External_Model_Interface(
    openai_key,
    db,
    "./output/gpt_combined_model.pkl",
    "./data/domain_labels.json",
    "./data/subdomain_labels.json",
    "./data/formatted_domain_labels.json",
    f"example cache key-{repo_name}",
    "./output/response_cache/",
)

# Single GPT model.
external_gpt = CoreEngine.External_Model_Interface(
    openai_key,
    db,
    "./output/gpt_model.pkl",
    "./data/domain_labels.json",
    "./data/subdomain_labels.json",
    "./data/formatted_domain_labels.json",
    f"example cache key-{repo_name}",
    "./output/response_cache/",
)

# Random Forest (RF) model. The model type is detected automatically from the model file.
external_rf = CoreEngine.External_Model_Interface(
    openai_key,
    db,
    "./output/rf_model.pkl",
    "./data/domain_labels.json",
    "./data/subdomain_labels.json",
    "./data/formatted_domain_labels.json",
    f"example cache key-{repo_name}",
    "./output/response_cache/",
)
```

Now, to predict an issue, this is all that is needed (note that the issue is of type `CoreEngine.src.Issue`):
```python
issue = CoreEngine.Issue(
    1,
    "Database connection fails when power goes off.",
    """Hey, I noticed that when I unplug my computer, the database server on my computer stops working. This is definitely an issue.""",
)

print(external_rf.predict_issue(issue))
```

See a full example in `example_external.py`.
Note that `gpt_combined` outputs in a slightly different format: a dictionary whose keys are the domains and whose values are lists of the subdomains.
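For illustration only, the two return shapes might look like the sketch below. The actual return types and labels depend on the trained models and your label files, so treat these values as assumptions.

```python
# Hypothetical outputs: the actual labels come from your label files.
print(external_rf.predict_issue(issue))
# e.g. ["Database", "Networking"]             (flat list of domains, assumed)

print(external_gpt_combined.predict_issue(issue))
# e.g. {"Database": ["Connection Handling"]}  (domain mapped to its subdomains)
```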