Shivan Mukherjee: sm5155 Gurnoor Virdi: gsv2110
NeuraScript is a language designed to simplify the integration of linear regression models into applications. The purpose of this project is to create a scanner, parser, and code generator for NeuraScript: the scanner tokenizes source code according to the finite state automata rules discussed in class, and the language compiles down to Python so that beginners can implement simple linear regression models.
Currently, "Linear Regression Model" is the only model type implemented. We hope to add additional models, such as a "Random Forest Model", in future iterations of this project.
Gurnoor and I had a great time delivering this project and expressing our creativity throughout. We faced many roadblocks along the way, but the TAs, reviewing class materials, and general research online provided us with a lot of clarity. One example was incorporating Python dependencies into our own language: we wanted to be able to use libraries like matplotlib to deliver graphs easily, and we figured out how to resolve those dependencies so that we could deliver high-quality statistics on regression data. We deviated a bit from our original proposal of supporting all Python functionality because we recognized it would be a largely manual effort to get there. What we have now, we believe, is a good representation of the capabilities of this language and how, given more time, NeuraScript could eventually become a fully fleshed-out tool for any beginner developer.
- code_generator.py - generates an executable Python file from our AST
- New tests test1.ns through test6.ns: four tests showing functionality and two general error tests
- Demo Video Link: https://youtu.be/xI_YuKkdv7U
- Develop an algorithm to process the AST and output lower-level code
- Given the input AST, our algorithm outputs the corresponding executable Python code. We do this by walking each node of the AST and mapping it to the Python boilerplate code we emit. Node types include saving a model, classifying a model, inputting data, general operators, etc.
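As a minimal sketch of this node-by-node mapping (the node shapes and names below are simplified assumptions for illustration, not the exact classes in code_generator.py):

```python
# Minimal sketch of AST-to-Python code generation.
# Node shapes ("Declaration", "FunctionCall", ...) are simplified
# assumptions, not the exact structures in code_generator.py.

def emit(node):
    """Recursively map an AST node to Python source code."""
    kind = node["type"]
    if kind == "Program":
        return "\n".join(emit(child) for child in node["body"])
    if kind == "Declaration":          # e.g.  data d1 := "data.csv"
        return f'{node["name"]} = {emit(node["value"])}'
    if kind == "FunctionCall":         # e.g.  print(d1)
        args = ", ".join(emit(a) for a in node["args"])
        return f'{node["name"]}({args})'
    if kind == "Identifier":
        return node["name"]
    if kind == "Literal":
        return repr(node["value"])
    raise ValueError(f"Unknown node type: {kind}")

ast = {"type": "Program", "body": [
    {"type": "Declaration", "name": "d1",
     "value": {"type": "Literal", "value": "data.csv"}},
    {"type": "FunctionCall", "name": "print",
     "args": [{"type": "Identifier", "name": "d1"}]},
]}
print(emit(ast))
```

Each node type dispatches to a small emitter, so adding a new NeuraScript construct only requires one new branch.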
- Develop a pipeline that executes the generated code to produce the output
- Our shell scripts pipeline the entire process to produce the full output for each of our test scripts. The executable for any NeuraScript (.ns) file is a Python program; after all, this is a language built on top of Python. We convert our high-level NeuraScript code into Python-readable code and execute it through the script pipelines from there. In this phase we also resolve Python dependencies for users.
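The same scan/parse/generate/execute flow can be sketched in Python (the stage bodies below are illustrative stand-ins for src/scanner.py, src/parser.py, and src/code_generator.py, not the real implementations):

```python
# Sketch of the scan -> parse -> generate -> execute pipeline.
# The stage bodies are stand-ins; in the project these stages live in
# src/scanner.py, src/parser.py, and src/code_generator.py, and the
# shell scripts glue them together.
import io
from contextlib import redirect_stdout

def scan(source):       # stand-in for src/scanner.py
    return source.split()

def parse(tokens):      # stand-in for src/parser.py
    return {"type": "Program", "tokens": tokens}

def generate(ast):      # stand-in for src/code_generator.py
    return 'print("model loaded")'

def run_pipeline(ns_source):
    """Scan, parse, generate Python code, then execute it and capture output."""
    code = generate(parse(scan(ns_source)))
    buf = io.StringIO()
    with redirect_stdout(buf):
        exec(code, {})  # the shell scripts instead run the generated .py file
    return buf.getvalue().strip()

print(run_pipeline('load model := "Linear Regression Model"'))
```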
- Make sure you have Python 3 installed.
- Run the scanner with the shell script: ./run_scanner.sh
- Or run each stage directly:
python3 src/scanner.py tests/test1.ns
python3 src/parser.py tests/test1.ns
python3 src/code_generator.py tests/test1.ns
Expected AST Output for test1.ns: Loading a linear regression model and its data.
Program
  Declaration: model :=
    Literal: "Linear Regression Model"
  Declaration: d1 :=
    Literal: "data.csv"
  Declaration: l1 :=
    Literal: "labels.csv"
  FunctionCall: print
    Identifier: d1
  FunctionCall: print
    Identifier: l1
Explanation:
This test demonstrates:
- Model loading and data declaration
- Use of 'print' Statement
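As an illustration, the generated Python for a script like test1.ns might resemble the sketch below. The exact boilerplate code_generator.py emits may differ; here a placeholder string stands in for the scikit-learn LinearRegression instance the real output would create, so the sketch runs without scikit-learn installed:

```python
# Hypothetical generated output for test1.ns (illustrative only; the
# real boilerplate emitted by code_generator.py may differ in detail).
# In the real generated file, `model` would be an instance of
# sklearn.linear_model.LinearRegression.
model = "Linear Regression Model"   # load model := "Linear Regression Model"
d1 = "data.csv"                     # data d1 := "data.csv"
l1 = "labels.csv"                   # data l1 := "labels.csv"
print(d1)                           # print(d1)
print(l1)                           # print(l1)
```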
We now move on to a more intricate example. This script loads a model, splits its data, and then prints the output.
Expected AST Output for test2.ns: Loading a linear regression model and its data.
Program
  Declaration: model :=
    Literal: "Linear Regression Model"
  Declaration: d1 :=
    Literal: "data.csv"
  Declaration: l1 :=
    Literal: "labels.csv"
  FunctionCall: save
    Identifier: model
  Assignment: split :=
    Literal: "0.4"
  FunctionCall: print
    Identifier: x_train
  FunctionCall: print
    Identifier: x_test
  FunctionCall: print
    Identifier: y_train
  FunctionCall: print
    Identifier: y_test
Explanation:
This test demonstrates:
- Model loading, splitting, and then outputting split dataset
- Use of 'print' Statement
Our main deliverable for this project: end-to-end training of a machine learning model in seven simple lines of code. Given CSV files as input, the code will scan, parse, and generate an output Python file that can be executed with dependencies included.
Expected AST Output for test3.ns: Loading a linear regression model and its data.
Program
  Declaration: model :=
    Literal: "Linear Regression Model"
  Declaration: d1 :=
    Literal: "data.csv"
  Declaration: l1 :=
    Literal: "labels.csv"
  FunctionCall: save
    Identifier: model
  Assignment: split :=
    Literal: "0.4"
  FunctionCall: train
    Identifier: model
  FunctionCall: predict
    Identifier: model
  FunctionCall: plot
    Identifier: predictions
Explanation:
This test demonstrates:
- Model loading, splitting, training, saving, and then evaluating the model on test data.
- Matplotlib integration on the Python side, with a full graphical plot window appearing.
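To illustrate what the generated end-to-end pipeline computes, here is a dependency-free sketch: the real generated code uses scikit-learn for the model and matplotlib for the plot (reading data.csv/labels.csv), while this stand-in fits a 1-D least-squares line by hand on made-up data:

```python
# Dependency-free sketch of the end-to-end pipeline for test3.ns.
# The real generated code uses scikit-learn and matplotlib; this
# stand-in fits y = a*x + b by ordinary least squares on toy data.

def train(xs, ys):
    """Fit y = a*x + b by ordinary least squares (1-D case)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    return a, mean_y - a * mean_x

def predict(model, xs):
    a, b = model
    return [a * x + b for x in xs]

# split := "0.4"  ->  60% train / 40% test, as in test3.ns
data, labels = [1, 2, 3, 4, 5], [2, 4, 6, 8, 10]
cut = int(len(data) * (1 - 0.4))
x_train, x_test = data[:cut], data[cut:]
y_train, y_test = labels[:cut], labels[cut:]

model = train(x_train, y_train)        # train(model)
predictions = predict(model, x_test)   # predict(model)
print(predictions)                     # plot(predictions) in the real output
```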
This script depicts our error-handling capabilities. If you load an incorrect model type, it will result in an error: the parser correctly generates a parse tree, but code generation will throw an error.
This script depicts a syntax error. It fails before the code generation phase.
Given a saved model, deliver predictions on new test data.
Expected AST Output for test6.ns:
Program
  Assignment: model :=
    FunctionCall: load
      Literal: "model.pkl"
  Declaration: d1 :=
    Literal: "data.csv"
  Declaration: l1 :=
    Literal: "labels.csv"
  Assignment: split :=
    Literal: "0.5"
  FunctionCall: predict
    Identifier: model
  FunctionCall: plot
    Identifier: predictions
Explanation:
This test demonstrates:
- Loading a saved model and delivering predictions
Demo Video Link: https://youtu.be/f6I6PhQrUgg
Below, I have outlined the context-free grammar (CFG) for NeuraScript.
- Keywords: load, classify, train, foreach, save, file, process_email, predict, read_file, split, in, output, data, using, by
- Operators: :=, +, -, *, /, ==, !=, <, >, <=, >=, =
- Symbols: {, }, (, ), [, ], ,, :, .
- Identifiers: Sequences of letters, digits, or underscores starting with a letter or underscore.
- String Literals: "some text" (strings enclosed in double quotes).
- Numbers: Integers or decimal values.
- <Program>: The entire script or file content.
- <Statement>: A single line of code that may be a declaration, assignment, function call, or loop.
- <Declaration>: A command for declaring or loading models, files, or data.
- <Expression>: Expressions, which can include identifiers, function calls, or operations.
- <Loop>: Control flow structure to repeat actions over an iterable.
- Program:
<Program> ::= <StatementList>
- Statement List:
<StatementList> ::= <Statement> NEWLINE <StatementList> | <Statement> NEWLINE
- Statement:
<Statement> ::= <Declaration> | <Assignment> | <FunctionCall> | <Loop> | <OutputStatement>
- Declaration:
<Declaration> ::= "load" IDENTIFIER <AssignOp> <Expression> | "data" IDENTIFIER <AssignOp> <Expression> | "file" IDENTIFIER <AssignOp> <Expression>
- Assignment:
<Assignment> ::= IDENTIFIER <AssignOp> <Expression>
- Function Call:
<FunctionCall> ::= IDENTIFIER "(" <ArgumentList> ")" ["using" IDENTIFIER]
- Loop:
<Loop> ::= "foreach" IDENTIFIER "in" IDENTIFIER ":" INDENT <StatementList> DEDENT
- Output Statement:
<OutputStatement> ::= "output" <Expression>
- AssignOp:
<AssignOp> ::= ":="
- Argument List:
<ArgumentList> ::= <Expression> "," <ArgumentList> | <KeywordArgument> | <Expression> | ε
- Keyword Argument:
<KeywordArgument> ::= IDENTIFIER "=" <Expression>
- Expression:
<Expression> ::= <Term> "using" IDENTIFIER | <Term>
- Term:
<Term> ::= STRINGLITERAL | NUMBER | IDENTIFIER | <FunctionCall> | <ListLiteral>
- Literal:
<Literal> ::= NUMBER | STRINGLITERAL
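As a sketch of how a production like <Declaration> maps to a recursive-descent parse function (token handling, the expression rule, and the return shape are simplified assumptions relative to src/parser.py):

```python
# Illustrative recursive-descent parse function for the <Declaration>
# production; simplified relative to the real implementation in
# src/parser.py (expressions are reduced to a single token here).

def parse_declaration(tokens, pos=0):
    """<Declaration> ::= ("load" | "data" | "file") IDENTIFIER ":=" <Expression>"""
    if tokens[pos] not in ("load", "data", "file"):
        raise SyntaxError(f"Expected declaration keyword, got {tokens[pos]!r}")
    keyword, name = tokens[pos], tokens[pos + 1]
    if tokens[pos + 2] != ":=":
        raise SyntaxError(f"Expected ':=', got {tokens[pos + 2]!r}")
    value = tokens[pos + 3]            # simplified <Expression>
    node = {"type": "Declaration", "keyword": keyword,
            "name": name, "value": value}
    return node, pos + 4               # AST node plus next token position

node, _ = parse_declaration(["load", "model", ":=", '"Linear Regression Model"'])
print(node["name"])   # model
```

Feeding it the erroneous `==` token from the sample input above would raise the same kind of "Expected ':='" syntax error the parser reports.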
Sample Input:
load model := "classificationModel"
data input == ["email1.txt", "email2.txt"]
output prediction
Expected Output:
Syntax error on line 2: Expected ':=' or ':', got '=='
Explanation:
This input uses == instead of := or : for the data declaration, triggering a syntax error and demonstrating the parser's error-handling capability.
Deliverables for Homework 1
- Lexical Grammar: The lexical grammar outlined in Homework 1 defines five token types: KEYWORD, IDENTIFIER, OPERATOR, LITERAL, SYMBOL.
These token types are identified by custom finite automata implemented in src/tokens.py.
NeuraScript tokens:
- KEYWORDS: load, train, classify, foreach, save, file, process_email, etc. (including Python-specific methods like split in our example; more will be added soon).
- OPERATORS: :=, +, -, *, /, ==, etc.
- SYMBOLS: {, }, (, ), [, ], ,, :, .
- Identifiers: User-defined names that start with a letter or underscore, followed by letters, numbers, or underscores.
- Literals: String literals enclosed in quotes ("" or '') and numeric literals (e.g., 123).
Additionally, the scanner supports standard Python identifiers, literals, and comments.
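For illustration, the token categories above could be recognized with a small regex-based tokenizer. This is a simplified sketch, not the FSM-based approach that src/scanner.py actually uses:

```python
# Illustrative regex-based tokenizer emitting <Token Type, Token Value>
# pairs; simplified relative to the FSM implementation in src/scanner.py.
import re

KEYWORDS = {"load", "train", "classify", "foreach", "save", "file",
            "process_email", "predict", "read_file", "split", "in",
            "output", "data", "using", "by"}

TOKEN_RE = re.compile(r'''
    (?P<LITERAL>"[^"]*"|\d+(?:\.\d+)?)      # strings and numbers
  | (?P<WORD>[A-Za-z_]\w*)                  # keywords / identifiers
  | (?P<OPERATOR>:=|==|!=|<=|>=|[+\-*/<>=])
  | (?P<SYMBOL>[{}()\[\],:.])
''', re.VERBOSE)

def tokenize(line):
    tokens = []
    for m in TOKEN_RE.finditer(line):
        kind, value = m.lastgroup, m.group()
        if kind == "WORD":   # split WORD into KEYWORD vs. IDENTIFIER
            kind = "KEYWORD" if value in KEYWORDS else "IDENTIFIER"
        tokens.append((kind, value))
    return tokens

print(tokenize('load model := "Linear Regression Model"'))
```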
- Scanner and State Transitions: The scanner is implemented in src/scanner.py, with state transitions handled by finite automata in src/automaton.py. The scanner outputs tokens in the form <Token Type, Token Value>, which the HW1 programming assignment outlined as the desired output. The finite automata are described in detail below in subsection B, Finite Automata, and are written in automaton.py.
a. Tokenization: The scanner outputs tokens in the following format: <Token Type, Token Value>. This applies both to Python code and NeuraScript constructs.
b. Finite Automata: Each token type is processed using a finite state machine (FSM). For example, the FSM for keywords processes each character of a keyword such as load until it matches a known keyword.
The finite automata are implemented in automaton.py. Specific functions handle state transitions for different token types:
Keywords: The is_keyword() function (found in tokens.py) processes potential keywords by comparing tokens to the list of known keywords.
Identifiers: The is_identifier() function ensures that identifiers start with a letter or underscore and are composed of valid characters.
Operators and Symbols: The is_operator() and is_symbol() functions match tokens against the respective lists of valid operators and symbols.
For a detailed example of how these state transitions are implemented, refer to: process_token() in automaton.py, lines 5-35
This function is responsible for processing each token, determining its type, and handling the state transitions accordingly.
FSM Handling in Code:
Keywords FSM: The is_keyword() function in tokens.py checks if the token matches any of the known keywords, accepting on a match.
Identifiers FSM: The is_identifier() function in tokens.py verifies if the token starts with an alphabetical character or underscore, accepting on a match.
Operators and Symbols FSM: The is_operator() and is_symbol() functions in tokens.py recognize operators and symbols respectively, and accept on a match.
Literals FSM: The is_literal() function identifies whether the token is a numeric value or a quoted string, accepting on a match.
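The identifier FSM described above can be sketched as a two-state machine (states and naming simplified relative to the actual implementation in automaton.py and tokens.py):

```python
# Sketch of the identifier FSM: a START state for the first character
# and a BODY state for the rest. Simplified relative to tokens.py.

def is_identifier(token):
    """Accept tokens that start with a letter or underscore and continue
    with letters, digits, or underscores."""
    state = "START"
    for ch in token:
        if state == "START":
            if ch.isalpha() or ch == "_":
                state = "BODY"    # valid first character -> body state
            else:
                return False      # reject: bad first character
        else:  # state == "BODY"
            if not (ch.isalnum() or ch == "_"):
                return False      # reject: invalid character
    return state == "BODY"        # accept only if at least one char consumed

print(is_identifier("x_train"))   # True
print(is_identifier("4ever"))     # False
```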
c. Error Handling: Lexical errors are handled by src/error_handler.py. Errors such as invalid tokens or unrecognized characters trigger an error message but allow the scanner to continue processing.
- Sample Input Programs: Five sample programs are included in the tests/ folder. These programs demonstrate various NeuraScript constructs and also include Python code.
- Shell Script: A run_scanner.sh shell script is provided to automate the execution of the scanner. It runs the scanner on a sample .ns file, demonstrating how to tokenize a NeuraScript program.
- README: This README provides details on our code. It explains the lexical grammar, FSM implementation, error handling, and examples. It also includes LaTeX diagrams of the automaton.