3,842 changes: 3,842 additions & 0 deletions NER - Deep Learning techniques. RNN and Transformer Model.html

Large diffs are not rendered by default.

304 changes: 304 additions & 0 deletions NER - Deep Learning techniques. RNN and Transformer Model.qmd
@@ -0,0 +1,304 @@
---
title: "NER - Deep Learning techniques. RNN and Transformer Model"
format:
html:
embed-resources: true
---

# 1. Introduction

Named Entity Recognition (NER) is a crucial task in Natural Language Processing (NLP) that involves identifying and extracting named entities from unstructured text. NER presents challenges due to the inherent complexity and ambiguity of natural language. Still, it is essential for various NLP applications, including information retrieval, question answering, and machine translation. Deep learning techniques have demonstrated exceptional performance in NER tasks in recent years. Deep learning, a subfield of machine learning, uses artificial neural networks to learn from data and make predictions.

The Transformer model, introduced in 2017, has become a fundamental architecture in modern NLP and is widely used for NER tasks. It is known for its self-attention mechanism, which allows the model to weigh the importance of each word in the input sequence. Additionally, other deep learning architectures such as Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), and Bidirectional Encoder Representations from Transformers (BERT) have been successfully applied to NER.

In this literature review, we will explore and compare the effectiveness of RNNs and the Transformer model for Named Entity Recognition (NER) tasks.

## 1.1 Recurrent Neural Networks (RNN)

A Recurrent Neural Network (RNN) is a deep learning architecture designed for sequential or time-series data. RNNs are often used in ordinal problems, where the output is not continuous but discrete and ordered, and in temporal problems, where data are collected over time and the order of the observations matters. Such problems include speech recognition, language translation, and other natural language processing (NLP) tasks. RNNs make use of training data by taking prior inputs and using them to influence the current inputs and outputs; using prior inputs in this way is what creates the RNN's memory, and it is one of the features that distinguishes RNNs from other deep learning methods. RNNs are also distinguished by the sharing of parameters across each layer of the network.

A popular type of RNN architecture known as Long Short-Term Memory (LSTM) has been used in named entity recognition. LSTMs have individual units in the hidden layers of the network, each of which has three gates: an input gate that controls the information entering a memory cell, a forget gate that controls how much historical information passes through from the previous state, and an output gate that controls how much information is passed on to the next step. Because RNNs allow information to persist across multiple steps, the network can capture dependencies and context. This architecture makes RNNs a useful tool for NER, and they have been shown to outperform earlier NER systems.
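To make the three gates concrete, here is a minimal NumPy sketch of a single LSTM step. It only illustrates the gate equations, not the model trained later in this report; the weight shapes and random inputs are assumptions chosen for the example.

```{python}
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, and b hold the stacked parameters for the
    input (i), forget (f), and output (o) gates plus the candidate cell (g)."""
    h = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b          # shape (4 * hidden,)
    i = sigmoid(z[0:h])                   # input gate: how much new information enters the memory cell
    f = sigmoid(z[h:2*h])                 # forget gate: how much of the previous cell state is kept
    o = sigmoid(z[2*h:3*h])               # output gate: how much of the cell state is passed on
    g = np.tanh(z[3*h:4*h])               # candidate values for the memory cell
    c_t = f * c_prev + i * g              # new cell state (the network's "memory")
    h_t = o * np.tanh(c_t)                # new hidden state, carried to the next time step
    return h_t, c_t

# Tiny illustrative example: a 3-dimensional input and a 2-dimensional hidden state
rng = np.random.default_rng(0)
x_t, h_prev, c_prev = rng.normal(size=3), np.zeros(2), np.zeros(2)
W, U, b = rng.normal(size=(8, 3)), rng.normal(size=(8, 2)), np.zeros(8)
h_t, c_t = lstm_step(x_t, h_prev, c_prev, W, U, b)
print(h_t, c_t)
```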

## 1.2 Transformer Model

The Transformer model is a groundbreaking neural network architecture introduced in the paper "Attention Is All You Need" [@vaswani2017attention]. The model is designed for sequence-to-sequence tasks, such as machine translation, and is known for its ability to process input sequences in parallel rather than sequentially. This parallel processing makes the Transformer model highly efficient and scalable. One of the key innovations of the Transformer model is the self-attention mechanism. Self-attention allows the model to weigh the importance of different words in the input sequence relative to each other when making predictions. The model uses multi-head attention, which means it can attend to several aspects of the input simultaneously. This ability to capture complex dependencies and relationships between words contributes to the model's strong performance.

The Transformer architecture consists of an encoder and a decoder, each composed of multiple layers of self-attention and feed-forward neural networks. The encoder processes the input sequence while the decoder generates the output sequence. The encoder and decoder are connected by attention mechanisms that allow the decoder to focus on different parts of the input sequence as it generates the output. Given its effectiveness and efficiency in handling sequence data, the Transformer model has become the foundation for many subsequent natural language processing (NLP) models and architectures.
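As a rough sketch of the encoder side of this architecture, one encoder layer (multi-head self-attention plus a feed-forward network) and a stack of six can be built with PyTorch's built-in modules. This block is illustrative only and is not evaluated in this report, since PyTorch is not otherwise used here; the dimensions follow the original paper.

```{python}
#| eval: false
import torch
import torch.nn as nn

# One encoder layer = multi-head self-attention followed by a position-wise feed-forward network
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048, batch_first=True)
# The full encoder stacks several identical layers (6 in the original paper)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

# A batch of 2 sequences, each 10 tokens long, already embedded into 512 dimensions
dummy_embeddings = torch.randn(2, 10, 512)
encoded = encoder(dummy_embeddings)
print(encoded.shape)  # torch.Size([2, 10, 512]) -- one contextualized vector per token
```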

## 1.3 Limitations

***\[In progress - Pending RNN Method\]***

Transformers are powerful neural network models commonly used in natural language processing tasks. However, they face two fundamental limitations. First, their large number of parameters results in high computational complexity, necessitating specialized hardware and substantial resources, especially for deep models with long input sequences. Second, Transformers exhibit quadratic time complexity with respect to input length because the self-attention mechanism calculates attention scores between all pairs of tokens. This quadratic complexity limits the ability of Transformers to handle very long sequences efficiently, and researchers are actively exploring methods to mitigate these limitations and improve their scalability [@keles2022computational].
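A quick back-of-the-envelope calculation illustrates the quadratic growth: doubling the sequence length quadruples the number of attention scores a single head must compute and store. The sequence lengths below are arbitrary examples.

```{python}
# The self-attention score matrix has one entry per pair of tokens,
# so its size grows quadratically with the sequence length n.
for n in [128, 512, 2048, 8192]:
    scores = n * n                      # attention scores for one head
    megabytes = scores * 4 / 1e6        # memory at 4 bytes per float32 score
    print(f"n = {n:5d}  ->  {scores:>12,} scores  (~{megabytes:,.1f} MB per head)")
```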

# 2. Methods

***\[In progress - Pending RNN Method\]***

## 2.1 RNN

## 2.2 Transformer Model

The Transformer model has a unique feature called the attention mechanism. It helps the model make intelligent predictions by focusing on the most relevant parts of the input. For example, when predicting the next word in 'I have a pet \_\_\_,' the attention mechanism considers the context (the word 'pet') to make a better guess, such as 'dog' or 'cat.' It's like the model knows which words deserve extra attention! Let's look at two key formulations of the attention mechanism: the Scaled Dot-Product Attention formula and the Single (Masked) Self- or Cross-Attention Head formula [@phuong2022formal].

1. Scaled Dot-Product Attention formula\
\
$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$\

**Key Components:**

- Q: Query matrix (requests for information).

- K: Key matrix (used to calculate relevance).

- V: Value matrix (actual content to retrieve).

- d_k: Dimension of the query/key vectors.

- Softmax (\...): Function to normalize attention scores.

In this formula, the attention mechanism allows the model to "pay attention" to specific parts of the input data. The attention scores are calculated from the queries (Q) and keys (K) and then normalized with softmax to produce attention weights. The output is a weighted sum of the value vectors (V), where the weights represent the importance of each value for each query. This mechanism is popular in tasks like machine translation, where the model must focus on the relevant words.
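The formula can be written in a few lines of NumPy. The sketch below is only a toy illustration of the shapes involved; the query, key, and value matrices are random stand-ins, not data from this project.

```{python}
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # relevance of every key to every query
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                                        # weighted sum of the value vectors

# Toy example: 3 queries attending over 4 key/value pairs, each 8-dimensional
rng = np.random.default_rng(42)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 8): one output vector per query
```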

2. The Single (Masked) Self- or Cross-Attention Head formula\
\
$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top + \text{Mask}}{\sqrt{d_k}}\right)V$

**Key Components:**

- Q: Query matrix (requests for information).

- K: Key matrix (used to calculate relevance).

- V: Value matrix (actual content to retrieve).

- d_k: Dimension of the query/key vectors.

- Mask: Matrix to prevent attention to specific tokens (e.g., future tokens).

- Softmax (\...): Function to normalize attention scores.

In this formula, attention again uses the queries (Q) and keys (K) to calculate attention scores. An optional mask is added to the scores to prevent the model from attending to specific tokens (for example, future tokens in language modeling). The scores are normalized with softmax to produce attention weights, and the output is a weighted sum of the value vectors (V), where the weights represent the importance of each value for each query. Masking is useful for sequence-to-sequence tasks and autoregressive models.
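Masking fits naturally into the same sketch: large negative values are added to the scores of the positions that should be ignored, so they receive (near-)zero weight after the softmax. The causal mask below, which hides future tokens, is again only an illustrative example following the formula above.

```{python}
import numpy as np

def masked_attention(Q, K, V, mask):
    """Compute softmax((Q K^T + Mask) / sqrt(d_k)) V, as in the formula above."""
    d_k = Q.shape[-1]
    scores = (Q @ K.T + mask) / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Causal (autoregressive) mask: position i may not attend to any position j > i
n, d = 4, 8
mask = np.triu(np.full((n, n), -1e9), k=1)   # large negative values above the diagonal
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(n, d))          # self-attention: the same sequence supplies Q, K, and V
print(masked_attention(Q, K, V, mask).shape)  # (4, 8)
```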

# 3. Data Analysis and Results

## 3.1 Dataset Description

I will use the Corona2 dataset from Kaggle to perform Named Entity Recognition and identify medical conditions, medicine names, and pathogens. The dataset was manually tagged for training. I will apply the Transformer model for NER using the Python programming language. The dataset contains 31 observations and 4 attributes, which are described in the data definition below.

**Data Definition**

| Column Name | Type    | Description                                          |
|-------------|---------|------------------------------------------------------|
| Text        | string  | Sentence containing the labeled entities              |
| Starts      | integer | Position where the label starts                       |
| Ends        | integer | Position where the label ends                         |
| Labels      | string  | The label (Medical Condition, Medicine, or Pathogen)  |

- Labels:
  - Medical condition names (examples: influenza, headache, malaria)
  - Medicine names (examples: aspirin, penicillin, ribavirin, methotrexate)
  - Pathogens (examples: coronavirus, Zika virus, cyanobacteria, E. coli)


## 3.2 Data Preparation

The following Python code installs the required libraries.

```{python}
# Install the required libraries (quietly)
!pip install -q nbclient
!pip install -q requests
!pip install -q pandas
!pip install -q nbformat
!pip install -q plotly
```

Download the data from GitHub.

```{python}
import requests

# Access the raw JSON file hosted on GitHub
url = 'https://raw.githubusercontent.com/jsanc223/datasetCorona2/main/Corona2.json'
# HTTP GET request to the raw URL
response = requests.get(url)
# Check whether the request was successful (status code 200)
if response.status_code == 200:
    # Parse the JSON data from the response
    data = response.json()
    print('Success, the JSON data was stored in the data variable')
else:
    print('Failed to fetch JSON data:', response.status_code)
```

Parse the JSON data into a list of dictionaries for easier manipulation.

```{python}
# Each example becomes a dictionary with the raw text and a list of (start, end, label) tuples
training_data = []
for example in data['examples']:
    temp_dict = {}
    temp_dict['text'] = example['content']
    temp_dict['entities'] = []
    for annotation in example['annotations']:
        start = annotation['start']
        end = annotation['end']
        label = annotation['tag_name'].upper()
        temp_dict['entities'].append((start, end, label))
    training_data.append(temp_dict)
```
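A quick look at the first parsed example confirms the structure (the exact text and spans depend on the downloaded file):

```{python}
# Inspect the first parsed example: raw text plus its (start, end, label) spans
first = training_data[0]
print(first['text'][:80], '...')
print(first['entities'][:3])
```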

Convert the data from dictionaries to a pandas DataFrame, with one row per entity annotation (text, start, end, and label).

```{python}
import pandas as pd

# Initialize empty lists to store the data for the DataFrame
texts = []
starts = []
ends = []
labels = []

# Iterate through training_data to extract the individual entity annotations
for example in training_data:
    text = example['text']
    for entity in example['entities']:
        start, end, label = entity
        # Append data to the lists
        texts.append(text)
        starts.append(start)
        ends.append(end)
        labels.append(label)

# Create a DataFrame from the lists
df = pd.DataFrame({'text': texts, 'start': starts, 'end': ends, 'label': labels})
df.head(5)
```

## 3.3 Data Statistics

1. Our analysis begins with examining the frequency of each label in the dataset. The dataset contains 134 instances of 'Medical Condition', 94 instances of 'Medicine', and 67 instances of 'Pathogen'. This data provides an overview of label distribution.

```{python}

import plotly.express as px

# Count the occurrences of each label
label_counts = df['label'].value_counts()

# Create a DataFrame with labels and their respective counts
df_counts = pd.DataFrame({'label': label_counts.index, 'count': label_counts.values})

# Plot the frequency of each entity label using a bar plot in Plotly
fig = px.bar(df_counts, x='label', y='count', text='count', color='label',
             color_discrete_sequence=px.colors.qualitative.Plotly,
             title='Frequency of Entity Labels')

# Display the count labels inside the bars
fig.update_traces(textposition='inside')

# Update axis titles
fig.update_layout(xaxis_title='Entity Label', yaxis_title='Frequency')

fig.show()

```

2. The chart below displays the distribution of biomedical entity labels in the dataset. 'Medical Condition' is the most prevalent at 45.4%, followed by 'Medicine' at 31.9% and 'Pathogen' at 22.7%. The chart provides insight into the dataset composition and label prevalence.

```{python}

import plotly.express as px

# Get the counts of each unique label
label_counts = df['label'].value_counts()
# Plot a pie chart using Plotly
fig = px.pie(label_counts, values=label_counts.values, names=label_counts.index, title='Proportion of Entity Labels', hole=0.3)
fig.update_traces(textinfo='percent+label', textfont_size=12)
fig.show()

```

3. The histogram below visualizes the start positions of the entity labels in the text. It reveals that most entities occur within the first five hundred positions.

```{python}

import plotly.express as px

# Plot a histogram of entity start positions using Plotly
fig = px.histogram(df, x='start', nbins=30, title='Histogram of Entity Start Positions')
fig.update_layout(xaxis_title='Entity Start Position', yaxis_title='Frequency')
fig.show()
```

4. The box plot below shows the start and end positions of entity labels, with the majority located below the five hundredth position.

```{python}

import plotly.express as px

# Create box plots for 'start' and 'end' columns using Plotly
fig = px.box(df, y=['start', 'end'], points='all', title='Box Plots of Start and End Entity Positions')
fig.update_layout(yaxis_title='Value', xaxis_title='Column')
fig.show()

```

# 4. Transformer Model Results

***\[Pending\]***

# 5. Recurrent Neural Networks Results

***\[Pending\]***

# 6. Conclusion

***\[Pending\]***

# 7. References

\[In Progress\]

\@misc{phuong2022formal,
  title={Formal Algorithms for Transformers},
  author={Mary Phuong and Marcus Hutter},
  year={2022},
  eprint={2207.09238},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}

\@misc{keles2022computational,
  title={On The Computational Complexity of Self-Attention},
  author={Feyza Duman Keles and Pruthuvi Mahesakya Wijewardena and Chinmay Hegde},
  year={2022},
  eprint={2209.04881},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}

\@misc{vaswani2017attention,
  title={Attention Is All You Need},
  author={Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N. Gomez and Lukasz Kaiser and Illia Polosukhin},
  year={2017},
  eprint={1706.03762},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
24 changes: 5 additions & 19 deletions README.md
@@ -1,21 +1,7 @@
# Instructions
# Name Entity Recognition (NER) using Deep Learning Techniques

Create a GitHub page for the project
## Group Members:
#### 1. Sara Ericson
#### 2. Andrew Campbell
#### 3. Jorge Sanchez

## GitHub:
#### 1. Create a GitHub account and Sign in
#### 2. Go to https://github.com/acohenstat/STA6257_Project and fork (create a copy to your GitHub)
![fork](fork.png)
#### 3. Change the name of the repo to *STA6257_Project_NameofGroup*
#### 4. Go to *Settings* -> *Pages* -> under *Branch* -> select *main*
#### 5. Wait for a few seconds and refresh the page. You see the link of the page.

## RStudio:
#### 1. Go to RStudio
#### 2. Create a Version Control Project and Clone the repo.
#### 3. Commit and push to see changes on the website, and voilà!

More information:
- [GitHub](https://happygitwithr.com/index.html)
- [Video1 RStudio connection to GitHub](https://www.youtube.com/watch?v=MdmnE3AnkQE)
- [Video2 RStudio connection to GitHub](https://www.youtube.com/watch?v=jN6tvgt3GK8)
Binary file added images/annotated1.png
Binary file added images/annotated2.png
Binary file added images/frequency_labels-01.png
Binary file added images/start_end_Positions-01.png
Binary file added images/transformer Architecture.png