Ethereum-Analysis

The objective of this coursework was to apply the methods taught in the first part of the Big Data Processing module to analyse the full set of transactions that took place on the Ethereum network, from the first transactions in August 2015 up to June 2019. I developed a number of Map/Reduce and Spark programs to carry out the required computations, and submitted a report with my findings and a description of how I obtained them.

DATASET OVERVIEW

Ethereum is a blockchain-based distributed computing platform on which users may exchange currency (Ether), provide or purchase services (smart contracts), and mint their own coinage (tokens), among other applications. The Ethereum network is fully decentralised, relying on public-key cryptography, peer-to-peer networking, and proof-of-work to process and verify transactions.

Whilst you would normally need a client such as Geth to access the Ethereum blockchain, recent tools, notably Ethereum-ETL, allow all blocks and transactions to be scraped and dumped to CSVs for bulk processing. These dumps are uploaded daily to a repository on Google BigQuery, which served as the dataset for this coursework.

A subset of the data available on BigQuery is provided in the HDFS folder /data/ethereum. The blocks, contracts and transactions tables have been pulled down and stripped of unneeded fields to reduce their size. A set of scams run on the Ethereum network, both active and inactive, has also been downloaded via EtherScamDB and is available on HDFS at /data/ethereum/scams.json.

ASSIGNMENT

Write a set of Map/Reduce (or Spark) jobs that process the given input and generate the data required to answer the following questions:

PART A. TIME ANALYSIS

Create a bar plot showing the number of transactions occurring every month between the start and end of the dataset.

Create a bar plot showing the average transaction value in each month between the start and end of the dataset.

Note: As the dataset spans multiple years and you are aggregating together all transactions in the same month, make sure to include the year in your analysis.

Note: Once the raw results have been produced with Hadoop/Spark, you may create your bar plots in any software of your choice (Excel, Python, R, etc.).
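As a starting point, here is a minimal PySpark sketch for the monthly counts and averages. The column indices used for value and block_timestamp are assumptions about the stripped transactions schema and should be checked against the actual data:

```python
# Part A sketch: monthly transaction count and average value (RDD API).
# Assumes CSV lines with value at index 3 and block_timestamp (Unix
# seconds) at index 6 -- adjust to the actual stripped schema.
import time
from pyspark import SparkContext

sc = SparkContext(appName="part-a-time-analysis")

def parse(line):
    try:
        fields = line.split(',')
        value = int(fields[3])
        ts = int(fields[6])
        month = time.strftime('%Y-%m', time.gmtime(ts))  # year included
        return [(month, (1, value))]
    except (ValueError, IndexError):
        return []  # skip the header and malformed lines

monthly = (sc.textFile('/data/ethereum/transactions')
             .flatMap(parse)
             .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])))

counts = monthly.mapValues(lambda cv: cv[0])
averages = monthly.mapValues(lambda cv: float(cv[1]) / cv[0])

for month, n in sorted(counts.collect()):
    print(month, n)
```

The collected (month, count) and (month, average) pairs can then be exported to the plotting tool of your choice.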

PART B. TOP TEN MOST POPULAR SERVICES

Evaluate the top 10 smart contracts by total Ether received. An outline of the subtasks required to extract this information is provided below, focusing on an MRJob-based approach. This is, however, only one possibility; several other approaches are equally viable.

JOB 1 - INITIAL AGGREGATION To work out which services are the most popular, you will first have to aggregate transactions to see how much Ether each address within the user space has received. You will want to aggregate the value field for each address in the to_address field. This is similar to the word count that we saw in Lab 1 and Lab 2; a sketch is given below.
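A minimal MRJob sketch of this aggregation; the column indices are assumptions about the stripped transactions schema and should be verified against the data:

```python
# Job 1: sum the value received by each to_address, word-count style.
# Assumes to_address at column index 2 and value at index 3.
from mrjob.job import MRJob

class InitialAggregation(MRJob):

    def mapper(self, _, line):
        try:
            fields = line.split(',')
            to_address = fields[2]
            value = int(fields[3])
            if to_address:        # contract-creation txs have no to_address
                yield to_address, value
        except (ValueError, IndexError):
            pass                  # skip the header and malformed lines

    def combiner(self, address, values):
        yield address, sum(values)

    def reducer(self, address, values):
        yield address, sum(values)

if __name__ == '__main__':
    InitialAggregation.run()
```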

JOB 2 - JOINING TRANSACTIONS/CONTRACTS AND FILTERING Once you have obtained this aggregate of the transactions, the next step is to perform a repartition join between this aggregate and contracts. You will want to join the to_address field from the output of Job 1 with the address field of contracts.

Secondly, in the reducer, if the address for a given aggregate from Job 1 is not present within contracts, it should be filtered out, as it is a user address and not a smart contract. A sketch of both steps follows.
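A sketch of the repartition join, assuming Job 1 wrote MRJob's default tab-separated output and that contracts lines are CSV with the address as the first field (both are assumptions about the schemas):

```python
# Job 2: repartition join of the Job 1 aggregate with contracts,
# keeping only addresses that appear in contracts.
from mrjob.job import MRJob

class JoinAndFilter(MRJob):

    def mapper(self, _, line):
        try:
            if '\t' in line:                       # Job 1 aggregate record
                address, total = line.split('\t')
                yield address.strip('" '), ('agg', int(total))
            else:                                  # contracts record
                address = line.split(',')[0]
                yield address, ('contract', None)
        except (ValueError, IndexError):
            pass

    def reducer(self, address, records):
        total, is_contract = None, False
        for tag, payload in records:
            if tag == 'agg':
                total = payload
            else:
                is_contract = True
        # user addresses (no matching contract record) are dropped here
        if is_contract and total is not None:
            yield address, total

if __name__ == '__main__':
    JoinAndFilter.run()
```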

JOB 3 - TOP TEN Finally, the third job will take as input the now-filtered address aggregates and sort these via a top-ten reducer, utilising what you have learned from Lab 4; see the sketch below.
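One common pattern, sketched here, routes every aggregate to a single reducer under one shared key; a combiner keeps only local top tens so the shuffled data stays small. The input format again assumes MRJob's default tab-separated output from Job 2:

```python
# Job 3: global top ten by total Ether received.
import heapq
from mrjob.job import MRJob

class TopTen(MRJob):

    def mapper(self, _, line):
        try:
            address, total = line.split('\t')
            yield None, (int(total), address.strip('" '))
        except ValueError:
            pass

    def combiner(self, _, pairs):
        for pair in heapq.nlargest(10, pairs):
            yield None, pair

    def reducer(self, _, pairs):
        for total, address in heapq.nlargest(10, pairs):
            yield address, total

if __name__ == '__main__':
    TopTen.run()
```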

PART C. TOP TEN MOST ACTIVE MINERS

Evaluate the top 10 miners by the size of the blocks mined. This is simpler as it does not require a join. You will first have to aggregate blocks to see how much block data each miner has produced. You will want to aggregate the size field for each address in the miner field. This is similar to the word count that we saw in Lab 1 and Lab 2. You can then collect the per-miner totals and sort them to obtain the most active miners, as in the sketch below.
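A two-step MRJob sketch covering both the aggregation and the top ten; the column indices are assumptions about the stripped blocks schema and should be verified against the data:

```python
# Part C: sum block size per miner, then take the global top ten.
# Assumes miner at column index 2 and size at index 4 of the blocks CSV.
import heapq
from mrjob.job import MRJob
from mrjob.step import MRStep

class TopMiners(MRJob):

    def mapper(self, _, line):
        try:
            fields = line.split(',')
            yield fields[2], int(fields[4])   # (miner, size)
        except (ValueError, IndexError):
            pass

    def reducer_sum(self, miner, sizes):
        yield None, (sum(sizes), miner)       # one key -> one reducer

    def reducer_top(self, _, pairs):
        for size, miner in heapq.nlargest(10, pairs):
            yield miner, size

    def steps(self):
        return [MRStep(mapper=self.mapper, reducer=self.reducer_sum),
                MRStep(reducer=self.reducer_top)]

if __name__ == '__main__':
    TopMiners.run()
```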

PART D. DATA EXPLORATION

Comparative Evaluation Reimplement Part B in Spark (if your original was MRJob, or vice versa). How does it run in comparison? Keep in mind that to get representative results you will have to run the job multiple times and report median/average results. Can you explain the reason for these results? Which framework seems more appropriate for this task? (10/50)
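For reference, a compact PySpark sketch of the Part B pipeline, which replaces the three separate MapReduce passes with a single job; the column indices are the same assumptions as above:

```python
# Part B in one Spark job: aggregate, join with contracts, take top 10.
from pyspark import SparkContext

sc = SparkContext(appName="part-d-top-services")

def tx(line):
    try:
        f = line.split(',')
        return [(f[2], int(f[3]))]           # (to_address, value)
    except (ValueError, IndexError):
        return []

def contract(line):
    try:
        return [(line.split(',')[0], None)]  # (address, marker)
    except IndexError:
        return []

totals = (sc.textFile('/data/ethereum/transactions')
            .flatMap(tx)
            .reduceByKey(lambda a, b: a + b))

contracts = sc.textFile('/data/ethereum/contracts').flatMap(contract)

# inner join keeps only smart-contract addresses, mirroring Job 2
top10 = (totals.join(contracts)
               .map(lambda kv: (kv[0], kv[1][0]))
               .takeOrdered(10, key=lambda kv: -kv[1]))

for address, total in top10:
    print(address, total)
```

Keeping the intermediate RDDs in memory rather than writing each job's output to HDFS is the main reason one would expect Spark to be faster here, which is the kind of observation the comparative evaluation is after.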
