Enron-Email-Graph-System

Overview

This Java application parses and analyzes the Enron email dataset to reveal patterns in the email communication. The EnronEmailParser class contains methods for traversing the Enron email dataset, checking the validity of email files, parsing email addresses, and constructing a graph representation of the email communication. It also includes methods for identifying "connectors" (individuals who connect different parts of the network) and teams (groups of individuals who communicate frequently with each other).

The program uses Depth-First Search (DFS) to traverse the graph. It also lets users enter an email address and see how many messages that individual has sent and received, as well as the number of individuals in the same team.

You can download the Enron dataset from https://www.cs.cmu.edu/~./enron/enron_mail_20150507.tar.gz. The dataset is large (1.7GB) and is provided as a compressed tar.gz file. On a Unix or macOS system, you can extract it by running the following command in a Unix shell or the macOS Terminal:
tar -xvzf enron_mail_20150507.tar.gz

Main idea of the graph

The graph is represented as an adjacency matrix, which records for every pair of vertices whether they are connected. An edge is created between two vertices whenever they exchange a message, so each edge represents a communication link between them. The graph is undirected: an edge does not record which vertex sent the message and which received it. The graph is also unweighted: an edge carries no additional information such as message frequency or intensity.
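As a rough illustration of this representation, here is a minimal sketch of an undirected, unweighted adjacency matrix keyed by email address. The class name AdjacencyMatrixSketch, its methods, and the fixed capacity are assumptions made for this example; it is not the code in GraphAdjacencyMatrix.java.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the repository's GraphAdjacencyMatrix class.
class AdjacencyMatrixSketch {
    // Maps each email address to its row/column index in the matrix.
    private final Map<String, Integer> index = new HashMap<>();
    private final boolean[][] adjacent;

    AdjacencyMatrixSketch(int maxVertices) {
        adjacent = new boolean[maxVertices][maxVertices];
    }

    // Assigns the next free index to an address the first time it is seen.
    private int indexOf(String address) {
        Integer i = index.get(address);
        if (i == null) {
            if (index.size() == adjacent.length) {
                throw new IllegalStateException("matrix is full");
            }
            i = index.size();
            index.put(address, i);
        }
        return i;
    }

    // Undirected: both (a, b) and (b, a) are set.
    // Unweighted: repeated messages between the same pair change nothing.
    void addEdge(String a, String b) {
        int i = indexOf(a);
        int j = indexOf(b);
        if (i == j) {
            return; // assumption: mail sent to oneself creates no edge
        }
        adjacent[i][j] = true;
        adjacent[j][i] = true;
    }

    boolean hasEdge(String a, String b) {
        Integer i = index.get(a);
        Integer j = index.get(b);
        return i != null && j != null && adjacent[i][j];
    }

    public static void main(String[] args) {
        AdjacencyMatrixSketch g = new AdjacencyMatrixSketch(3);
        g.addEdge("a@enron.com", "b@enron.com");
        System.out.println(g.hasEdge("b@enron.com", "a@enron.com")); // true
        System.out.println(g.hasEdge("a@enron.com", "c@enron.com")); // false
    }
}
```

A matrix needs memory proportional to the square of the number of vertices, so a sketch like this only stays practical if the cleaned address set is reasonably small.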

Data cleaning to check validity of email addresses

A regular expression is used to find email addresses, so that only valid addresses are considered.
Single quotes are removed from addresses, since they can otherwise cause issues or conflicts.
If an address starts with ".", the leading dot is removed.
Only email addresses that end with "Enron.com" are considered.
Email addresses containing "#", "<", "_", "..", or "/o" are excluded.
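The rules above could be combined roughly as follows. This is a sketch under assumptions: the class AddressCleanerSketch, the regular expression, the order of the steps, and the case-insensitive comparison with "enron.com" are all made up for illustration, since the project's actual pattern is not shown here.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical sketch of the cleaning rules; not the code in EnronEmailParser.
class AddressCleanerSketch {
    // Assumed shape of a valid address; the project's real regex may differ.
    private static final Pattern SHAPE =
            Pattern.compile("^[A-Za-z0-9.%+-]+@[A-Za-z0-9.-]+$");

    // Substrings that cause an address to be excluded.
    private static final String[] REJECTED = {"#", "<", "_", "..", "/o"};

    // Returns the cleaned address, or Optional.empty() if it is rejected.
    static Optional<String> clean(String raw) {
        String s = raw.trim().replace("'", ""); // drop single quotes
        if (s.startsWith(".")) {
            s = s.substring(1); // drop a leading dot
        }
        s = s.toLowerCase(Locale.ROOT); // assumption: addresses compare case-insensitively
        for (String bad : REJECTED) {
            if (s.contains(bad)) {
                return Optional.empty();
            }
        }
        if (!s.endsWith("enron.com") || !SHAPE.matcher(s).matches()) {
            return Optional.empty();
        }
        return Optional.of(s);
    }

    public static void main(String[] args) {
        System.out.println(clean("'.Jane.Doe@Enron.com'")); // Optional[jane.doe@enron.com]
        System.out.println(clean("a_b@enron.com"));         // Optional.empty
    }
}
```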

Data cleaning to check validity of mail files

The file must not be empty.
The file's header must contain the two main attributes, "To:" and "From:".
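A check along these lines might look like the sketch below. The class and method names are hypothetical, and it assumes that the header is everything before the first blank line and that a header attribute starts at the beginning of a line.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical sketch of the file validity check.
class MailFileCheckSketch {
    // True if both "To:" and "From:" appear before the first blank line.
    static boolean hasRequiredHeaders(List<String> lines) {
        boolean to = false;
        boolean from = false;
        for (String line : lines) {
            if (line.isEmpty()) {
                break; // assumption: a blank line ends the header
            }
            if (line.startsWith("To:")) {
                to = true;
            }
            if (line.startsWith("From:")) {
                from = true;
            }
        }
        return to && from;
    }

    static boolean isValidMailFile(Path file) throws IOException {
        if (Files.size(file) == 0) {
            return false; // empty file
        }
        // ISO-8859-1 maps every byte to a character, so unusual bytes
        // in a message cannot cause a decoding error.
        return hasRequiredHeaders(Files.readAllLines(file, StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) throws IOException {
        // Usage: java MailFileCheckSketch <path to one mail file>
        System.out.println(isValidMailFile(Paths.get(args[0])));
    }
}
```

Because the check uses startsWith, a line such as "X-To:" does not count as a "To:" attribute.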

Finding unique email addresses

A regex is used to do some cleaning up and to search for valid email addresses first.
A Set is then used to find the unique email addresses in the attributes 'to', 'from', 'cc' and 'bcc'.
Some email files are a special case in which the 'to' addresses span multiple lines, so my connectors count is higher.
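The sketch below shows one way to collect unique addresses with a Set while also picking up 'to' addresses that continue on later lines. It assumes that continuation lines start with a space or a tab, which is how long header lines are folded in standard internet mail. The class name, the extraction regex, and the lower-casing are assumptions for this example only.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; the project's own parsing code may work differently.
class UniqueAddressSketch {
    // Assumed extraction pattern: local part, "@", domain ending in a letter-only label.
    private static final Pattern ADDRESS =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    static Set<String> collect(List<String> headerLines) {
        Set<String> unique = new LinkedHashSet<>(); // a Set drops duplicates
        boolean inAddressField = false;
        for (String line : headerLines) {
            if (line.isEmpty()) {
                break; // end of header
            }
            boolean continuation = line.startsWith(" ") || line.startsWith("\t");
            if (!continuation) {
                // A new attribute starts here; decide whether it holds addresses.
                String lower = line.toLowerCase(Locale.ROOT);
                inAddressField = lower.startsWith("to:") || lower.startsWith("from:")
                        || lower.startsWith("cc:") || lower.startsWith("bcc:");
            }
            if (!inAddressField) {
                continue;
            }
            Matcher m = ADDRESS.matcher(line);
            while (m.find()) {
                unique.add(m.group().toLowerCase(Locale.ROOT));
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        List<String> header = java.util.Arrays.asList(
                "From: a@enron.com",
                "To: b@enron.com, c@enron.com,",
                "\td@enron.com",
                "Cc: B@enron.com");
        System.out.println(collect(header)); // [a@enron.com, b@enron.com, c@enron.com, d@enron.com]
    }
}
```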

Finding connectors

The main calculations are in the methods getConnectors() and dfs().
A depth-first search algorithm is used to find the connectors between the various senders and receivers.
The dfs() method performs a DFS traversal of the graph and identifies the nodes that are connectors, i.e., nodes whose removal would increase the number of connected components in the graph.
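Nodes with this property are usually called articulation points, and a common way to find them in a single DFS is to track discovery times and low-link values. The sketch below shows that standard technique on a plain boolean adjacency matrix. It is not the repository's getConnectors() or dfs(), and the class name ConnectorSketch and its fields are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of articulation-point detection with one DFS.
class ConnectorSketch {
    private final boolean[][] adj;
    private final int[] disc;           // discovery time, 0 means unvisited
    private final int[] low;            // earliest discovery time reachable from the subtree
    private final boolean[] isConnector;
    private int clock = 0;

    ConnectorSketch(boolean[][] adj) {
        this.adj = adj;
        int n = adj.length;
        disc = new int[n];
        low = new int[n];
        isConnector = new boolean[n];
    }

    // Returns the indices of all connectors in ascending order.
    List<Integer> connectors() {
        for (int v = 0; v < adj.length; v++) {
            if (disc[v] == 0) {
                dfs(v, -1); // start a new DFS tree in each component
            }
        }
        List<Integer> result = new ArrayList<>();
        for (int v = 0; v < adj.length; v++) {
            if (isConnector[v]) {
                result.add(v);
            }
        }
        return result;
    }

    private void dfs(int v, int parent) {
        disc[v] = low[v] = ++clock;
        int children = 0;
        for (int w = 0; w < adj.length; w++) {
            if (!adj[v][w] || w == parent) {
                continue;
            }
            if (disc[w] == 0) {
                children++;
                dfs(w, v);
                low[v] = Math.min(low[v], low[w]);
                // No back edge from w's subtree climbs above v,
                // so removing v would cut that subtree off.
                if (parent != -1 && low[w] >= disc[v]) {
                    isConnector[v] = true;
                }
            } else {
                low[v] = Math.min(low[v], disc[w]); // back edge
            }
        }
        // The root of a DFS tree is a connector only if it has several subtrees.
        if (parent == -1 && children > 1) {
            isConnector[v] = true;
        }
    }

    public static void main(String[] args) {
        // Path 0 - 1 - 2: removing vertex 1 disconnects 0 from 2.
        boolean[][] m = new boolean[3][3];
        m[0][1] = m[1][0] = true;
        m[1][2] = m[2][1] = true;
        System.out.println(new ConnectorSketch(m).connectors()); // [1]
    }
}
```

The recursion depth can reach the number of vertices, so a very large graph may need a bigger thread stack or an iterative rewrite.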

Finding teams

The main calculations are in the methods getTeams() and island_dfs().
Depth-first search is used to find teams of individuals.
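This README does not spell out how a team is delimited. One plausible reading, suggested by the name island_dfs(), is that a team is a connected component ("island") of the graph. Under that assumption, a minimal sketch could label every vertex with a team id as shown below. The class TeamSketch and its methods are hypothetical and not taken from the project.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

// Hypothetical sketch: teams as connected components (an assumption).
class TeamSketch {
    // Returns team[v], the id of the team containing vertex v; ids start at 0.
    static int[] teams(boolean[][] adj) {
        int[] team = new int[adj.length];
        Arrays.fill(team, -1); // -1 means not yet assigned
        int nextId = 0;
        for (int v = 0; v < adj.length; v++) {
            if (team[v] == -1) {
                islandDfs(adj, v, nextId++, team);
            }
        }
        return team;
    }

    // Labels everything reachable from start, using an explicit stack
    // instead of recursion so large graphs cannot overflow the call stack.
    private static void islandDfs(boolean[][] adj, int start, int id, int[] team) {
        Deque<Integer> stack = new ArrayDeque<>();
        team[start] = id;
        stack.push(start);
        while (!stack.isEmpty()) {
            int v = stack.pop();
            for (int w = 0; w < adj.length; w++) {
                if (adj[v][w] && team[w] == -1) {
                    team[w] = id;
                    stack.push(w);
                }
            }
        }
    }

    // Number of individuals in the same team as v, including v.
    static int teamSize(int[] team, int v) {
        int count = 0;
        for (int id : team) {
            if (id == team[v]) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Vertices 0 and 1 exchange mail; vertex 2 talks to nobody.
        boolean[][] m = new boolean[3][3];
        m[0][1] = m[1][0] = true;
        System.out.println(Arrays.toString(teams(m))); // [0, 0, 1]
    }
}
```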
