# Parser and code understanding engine
By Benjamin Carter, Dylan Johnson, Hunter Jenkins, and Jacob McAuley Penney
The project consists of two parts:

- A training and prediction engine --> the CoreEngine repository (this repository)
- An interactive UI that utilizes CoreEngine --> the Art-UI repository
See the video at `docs/demo.mp4` for a demonstration of how this works.

This README covers the training and prediction engine: the CoreEngine.
There are two sides to the CoreEngine.
- Training models, both LLM models and Random Forest models.
- A prediction API, which allows the models to be queried to classify issues. The prediction API is what outside projects (such as the UI) call.
The program takes a configuration JSON file as input (see `docs/repo-extractor/user/configuration_opts.md` for more information).

The JSON file specifies the repository to extract, as well as the range of issues/PRs to extract. In training, the CoreEngine extracts all the PRs specified in the JSON file and trains either a Random Forest model, a single GPT model, or a combined binary-classification GPT model; the model type is specified in the JSON file. After training, it saves the model at the location given in the JSON configuration.
Training was performed on 11 repositories; see the dataset document for the full list.
1. Make sure Poetry is installed (see the Poetry documentation for instructions).
2. Run `poetry install` -- this sets up the virtual environment.
3. Run `poetry run python3 -m spacy download en_core_web_md` to download the language file.
4. Create a GitHub Personal Access Token to use for downloading issues from GitHub and save it in a file.
5. Set up a configuration file for training like the sketch below (see the pre-filled example configuration for defaults).
6. Set an environment variable in a `.env` file with `OPENAI_API_KEY` set to an OpenAI key.
7. Place the GitHub key in a file located at the `auth_path` specified in the `config.json` (default: `input/mp_auth.txt`).
8. Run `poetry run python3 main.py path/to/config.json`, where the JSON file is the configuration set up in step 5. This will download, analyze, and train the model. It stores the results in a cache, preventing repeated calls.
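For reference, a training configuration might look roughly like the sketch below. Apart from `auth_path`, the field names here are illustrative assumptions rather than the actual schema; consult `docs/repo-extractor/user/configuration_opts.md` and the pre-filled example configuration for the real option names.

```json
{
    "repo": "owner/repository",
    "auth_path": "input/mp_auth.txt",
    "issue_range": [1, 500],
    "model_type": "gpt-combined",
    "model_output_path": "output/gpt_combined_model.pkl"
}
```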
If you want to see the AI calls in real time, run `tail -n 100 -f output/ai_log.csv`.
The run produces several artifacts in the `output` directory:

- `ai_log.csv`: the log of all AI calls made.
- `main.db`: the main SQLite file that manages all the run artifacts. Deleting this file will prompt complete environment regeneration.
- `ai_cache_results.db`: a persistent backup database. Deleting it is not recommended, as doing so will result in replayed calls to OpenAI.
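To inspect the log after the fact rather than streaming it with `tail`, a minimal sketch is shown below. It assumes only that `ai_log.csv` is an ordinary comma-separated file; the exact column layout is whatever CoreEngine records for each call.

```python
import csv

# Print the ten most recent AI calls from the log.
# Assumes only that output/ai_log.csv is ordinary CSV; the column
# layout is whatever CoreEngine writes per OpenAI call.
with open("output/ai_log.csv", newline="") as f:
    rows = list(csv.reader(f))

for row in rows[-10:]:
    print(row)
```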
> ⚠️ **Warning**
>
> SAVE `ai_result_backup.db` in the `output` directory, as this keeps track of AI artifacts. Deleting this file can result in having to redo OpenAI calls, costing money!
> ℹ️ **Info**
>
> If you want to restart the analysis from a clean state, delete ONLY the `main.db` file in the `output` directory. You should rarely have to delete `main.db`, except when switching repositories; `main.db` caches all extracted data to prevent re-downloading.
After a model is trained, it can be used to predict an issue's domains.
Import `CoreEngine.src.external.External_Model_Interface` and `CoreEngine.src.database_manager.DatabaseManager`. These can then be used to query the models.
Set up the database connection:
```python
import CoreEngine  # assumes the CoreEngine package is importable from your working directory

# Defaults are set to the ones in the CoreEngine basic configuration, `config_example.json`.
db = CoreEngine.DatabaseManager(
    dbfile="./output/main.db",
    cachefile="./ai_result_backup.db",
    label_file="./data/subdomain_labels.json",
)
```
Then create a model interface for each trained model:

```python
# Combined binary-classification GPT model.
external_gpt_combined = CoreEngine.External_Model_Interface(
    openai_key,
    db,
    "./output/gpt_combined_model.pkl",
    "./data/domain_labels.json",
    "./data/subdomain_labels.json",
    "./data/formatted_domain_labels.json",
    f"example cache key-{repo_name}",
    "./output/response_cache/",
)

# Single GPT model.
external_gpt = CoreEngine.External_Model_Interface(
    openai_key,
    db,
    "./output/gpt_model.pkl",
    "./data/domain_labels.json",
    "./data/subdomain_labels.json",
    "./data/formatted_domain_labels.json",
    f"example cache key-{repo_name}",
    "./output/response_cache/",
)

# Random Forest (RF) model. The model type is detected automatically from the model file.
external_rf = CoreEngine.External_Model_Interface(
    openai_key,
    db,
    "./output/rf_model.pkl",
    "./data/domain_labels.json",
    "./data/subdomain_labels.json",
    "./data/formatted_domain_labels.json",
    f"example cache key-{repo_name}",
    "./output/response_cache/",
)
```

Now, to predict an issue, this is all that is needed (note that the issue is of type `CoreEngine.src.Issue`):
```python
issue = CoreEngine.Issue(
    1,
    "Database connection fails when power goes off.",
    """Hey, I noticed that when I unplug my computer, the database server on my computer stops working. This is definitely an issue.""",
)

print(external_rf.predict_issue(issue))
```

See a full example in `example_external.py`.
Note that `gpt_combined` outputs in a slightly different format: a dictionary whose keys are the domains and whose values are lists of the subdomains.
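For illustration only, the two return shapes might look like the sketch below. The actual return types and labels depend on the trained models and your label files, so treat these values as assumptions.

```python
# Hypothetical outputs: the actual labels come from your label files.
print(external_rf.predict_issue(issue))
# e.g. ["Database", "Networking"]             (flat list of domains, assumed)

print(external_gpt_combined.predict_issue(issue))
# e.g. {"Database": ["Connection Handling"]}  (domain mapped to its subdomains)
```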