diff --git a/README.md b/README.md
index 0e38ddb..ac4cf1e 100644
--- a/README.md
+++ b/README.md
@@ -3,12 +3,124 @@ CUDA Stream Compaction
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Yu Sun
+* [LinkedIn](https://www.linkedin.com/in/yusun3/)
+* Tested on: Windows 10, i7-6700HQ CPU @ 2.60GHz × 8, GeForce GTX 960M/PCIe/SSE2, 7.7GB memory (Personal Laptop)
-### (TODO: Your README)
+## Introduction
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+In this project, stream compaction is implemented with a traditional CPU approach, a naive CUDA scan, and a more work-efficient algorithm that performs the scan in place. This project can serve as a building block for many useful applications, such as a path tracer.
+
+Stream compaction usually involves two steps: a scan followed by a scatter. Given an array and some condition, we first build a boolean array indicating whether each element meets the condition, then perform an exclusive scan on it. The scan result gives, for every element that satisfies the condition, its index in the compacted output array.
+
+Below is a visualization of how scan and compaction work:
+![](img/stream_scan.png)
+
+For the CPU implementation, scan is simply a for loop that iterates over the elements and accumulates the output.
+![](img/cpu.png)
+
+For the naive GPU implementation, scan ping-pongs between two buffers, since the algorithm would otherwise require in-place updates of the array.
+![](img/naive.png)
+![](img/naive_al.png)
+
+For the work-efficient GPU implementation, scan is performed with an up-sweep followed by a down-sweep, which reduces the amount of computation significantly.
+![](img/up.png)
+![](img/down.png)
+I tried to optimize the work-efficient algorithm by launching each iteration with the grid dimension that best fits the number of active threads, but without shared memory it is still slower than the naive implementation.
+
+For performance comparison, the built-in scan from the Thrust library is also included.
+
+I also implemented the scans with shared memory, replacing the host-side loop of kernel launches with a single kernel, to see whether this speeds up the computation.
+
+
+## Performance Analysis @ BlockSize = 128
+The performance of these algorithms, with and without shared memory, is shown in the diagrams below.
+![](img/pw2.png)
+![](img/npt.png)
+
+As the diagrams show, when the array size is small there is little performance gain from using the GPU. It is also interesting that the work-efficient scan does not actually outperform the naive scan when shared memory is not used.
+
+My explanation for the first observation is that a small array simply does not expose much parallelism: the reads from global memory are expensive, and there are not enough threads in flight to hide that latency. It is not obvious from my plots, but based on this reasoning and on the shared-memory results, I would predict that as the array size grows, the memory latency is increasingly hidden by parallel computation and the GPU implementations become faster.
+
+The second observation shows that, if not carefully implemented, a smarter algorithm can actually perform worse.
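+To make the comparison concrete, here is a condensed version of the naive scan as implemented in `naive.cu` (memory management, error checking, and the final inclusive-to-exclusive shift omitted). One kernel is launched per level, ping-ponging between two buffers:
+
+```cuda
+// Level d: every element at index >= 2^d adds in the element 2^d positions to its left.
+__global__ void kernNaiveScan(int N, int *odata, int *idata, int d) {
+    int index = threadIdx.x + (blockIdx.x * blockDim.x);
+    if (index < N) {
+        if (index >= (1 << d)) odata[index] = idata[index - (1 << d)] + idata[index];
+        else                   odata[index] = idata[index];
+    }
+}
+
+// Host side: all n threads are launched at every one of the ilog2ceil(n) levels.
+for (int d = 0; d < ilog2ceil(n); d++) {
+    kernNaiveScan<<<fullBlockPerGrid, blockSize>>>(n, dev_out, dev_in, d);
+    std::swap(dev_in, dev_out);   // output of this pass becomes input of the next
+}
+```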
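+By contrast, the work-efficient up-sweep (condensed from `efficient.cu`; the down-sweep mirrors it with a swap-and-add step) halves the number of active threads at every pass:
+
+```cuda
+// Up-sweep level d: only N >> (d + 1) threads do useful work,
+// where N = 1 << ilog2ceil(n) is the input zero-padded to a power of two.
+__global__ void kernUpSweep(int N, int *data, int d) {
+    int index = threadIdx.x + (blockIdx.x * blockDim.x);
+    if (index < (N >> (d + 1))) {
+        int idx = index << (d + 1);
+        data[idx + (1 << (d + 1)) - 1] += data[idx + (1 << d) - 1];
+    }
+}
+
+// Host side: shrink the grid every pass so that idle blocks are not even launched.
+for (int d = 0; d < ilog2ceil(n); d++) {
+    dim3 grid(((1 << (ilog2ceil(n) - d - 1)) + blockSize - 1) / blockSize);
+    kernUpSweep<<<grid, blockSize>>>(N, dev_out, d);
+}
+```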
+There are many branches and idle threads in the work-efficient kernels (see the sketches above), so the compute capacity of each block is not fully used. Even after eliminating most of the idle threads by shrinking the grid dimension at each iteration, it is still slower than the naive scan. Reworking the indexing so that active threads access consecutive memory addresses (a strided access pattern) should resolve this, but I did not have enough time to implement it.
+
+Also, as the plots show, the shared-memory implementations generally perform better than the ones without shared memory. The scan reuses intermediate results many times, and keeping them in shared memory avoids a large number of global memory reads. However, one needs to be careful not to exhaust the resources on the GPU; for example, my current shared-memory implementation only works when the array size does not exceed the block size.
+
+
+## Output from the Test Program
+
+```
+****************
+** SCAN TESTS **
+****************
+    [ 2 32 25 35 9 44 38 24 8 2 35 4 24 ... 30 0 ]
+==== cpu scan, power-of-two ====
+   elapsed time: 0.000395ms (std::chrono Measured)
+    [ 0 2 34 59 94 103 147 185 209 217 219 254 258 ... 694 724 ]
+==== cpu scan, non-power-of-two ====
+   elapsed time: 0ms (std::chrono Measured)
+    [ 0 2 34 59 94 103 147 185 209 217 219 254 258 ... 609 636 ]
+    passed
+==== naive scan, power-of-two ====
+   elapsed time: 0.02448ms (CUDA Measured)
+    passed
+==== naive scan, non-power-of-two ====
+   elapsed time: 0.023744ms (CUDA Measured)
+    passed
+==== work-efficient scan, power-of-two ====
+   elapsed time: 0.125024ms (CUDA Measured)
+    [ 0 2 34 59 94 103 147 185 209 217 219 254 258 ... 694 724 ]
+    passed
+==== work-efficient scan, non-power-of-two ====
+   elapsed time: 0.098976ms (CUDA Measured)
+    passed
+==== thrust scan, power-of-two ====
+   elapsed time: 0.0272ms (CUDA Measured)
+    passed
+==== thrust scan, non-power-of-two ====
+   elapsed time: 0.011232ms (CUDA Measured)
+    passed
+==== naive scan with shared memory, power of two ====
+   elapsed time: 0.01696ms (CUDA Measured)
+    passed
+==== naive scan with shared memory, non-power-of-two ====
+   elapsed time: 0.008ms (CUDA Measured)
+    passed
+==== efficient scan with shared memory, power of two ====
+   elapsed time: 0ms (CUDA Measured)
+    passed
+==== efficient scan with shared memory, non-power-of-two ====
+   elapsed time: 0ms (CUDA Measured)
+    passed
+
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+    [ 1 1 1 3 1 3 3 1 1 2 3 1 1 ... 3 0 ]
+==== cpu compact without scan, power-of-two ====
+   elapsed time: 0ms (std::chrono Measured)
+    [ 1 1 1 3 1 3 3 1 1 2 3 1 1 ... 1 3 ]
+    passed
+==== cpu compact without scan, non-power-of-two ====
+   elapsed time: 0ms (std::chrono Measured)
+    [ 1 1 1 3 1 3 3 1 1 2 3 1 1 ... 3 1 ]
+    passed
+==== cpu compact with scan ====
+   elapsed time: 0.000791ms (std::chrono Measured)
+    [ 1 1 1 3 1 3 3 1 1 2 3 1 1 ... 1 3 ]
+    passed
+==== work-efficient compact, power-of-two ====
+   elapsed time: 2.36675ms (CUDA Measured)
+    passed
+==== work-efficient compact, non-power-of-two ====
+   elapsed time: 1.8215ms (CUDA Measured)
+    [ 1 1 1 3 1 3 3 1 1 2 3 1 1 ... 3 1 ]
+    passed
+Press any key to continue . . .
+``` diff --git a/img/cpu.png b/img/cpu.png new file mode 100644 index 0000000..6fdc327 Binary files /dev/null and b/img/cpu.png differ diff --git a/img/down.png b/img/down.png new file mode 100644 index 0000000..59225fc Binary files /dev/null and b/img/down.png differ diff --git a/img/naive.png b/img/naive.png new file mode 100644 index 0000000..7ff8a57 Binary files /dev/null and b/img/naive.png differ diff --git a/img/naive_al.png b/img/naive_al.png new file mode 100644 index 0000000..dd8f0c6 Binary files /dev/null and b/img/naive_al.png differ diff --git a/img/npt.png b/img/npt.png new file mode 100644 index 0000000..943c60b Binary files /dev/null and b/img/npt.png differ diff --git a/img/pw2.png b/img/pw2.png new file mode 100644 index 0000000..5215f93 Binary files /dev/null and b/img/pw2.png differ diff --git a/img/stream_scan.png b/img/stream_scan.png new file mode 100644 index 0000000..88e9f2a Binary files /dev/null and b/img/stream_scan.png differ diff --git a/img/up.png b/img/up.png new file mode 100644 index 0000000..f4a82c2 Binary files /dev/null and b/img/up.png differ diff --git a/src/main.cpp b/src/main.cpp index 1850161..0e70211 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -11,9 +11,11 @@ #include #include #include +#include +#include #include "testing_helpers.hpp" - -const int SIZE = 1 << 8; // feel free to change the size of array +// If you want to increase the size of array, make sure also increase BlockSize in algorithm implemented with shared memory +const int SIZE = 1 << 7; const int NPOT = SIZE - 3; // Non-Power-Of-Two int *a = new int[SIZE]; int *b = new int[SIZE]; @@ -71,7 +73,7 @@ int main(int argc, char* argv[]) { printDesc("work-efficient scan, power-of-two"); StreamCompaction::Efficient::scan(SIZE, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); printCmpResult(SIZE, b, c); zeroArray(SIZE, c); @@ -95,6 +97,35 @@ int main(int argc, char* argv[]) { //printArray(NPOT, c, true); printCmpResult(NPOT, b, c); + zeroArray(SIZE, c); + printDesc("naive scan with shared memory, power of two"); + StreamCompaction::NaiveSM::scan(SIZE, c, a); + printElapsedTime(StreamCompaction::NaiveSM::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + // printArray(SIZE, c, true); + printCmpResult(SIZE, b, c); + + zeroArray(SIZE, c); + printDesc("naive scan with shared memory, non-power-of-two"); + StreamCompaction::NaiveSM::scan(NPOT, c, a); + printElapsedTime(StreamCompaction::NaiveSM::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + // printArray(NPOT, c, true); + printCmpResult(NPOT, b, c); + + zeroArray(SIZE, c); + printDesc("efficient scan with shared memory, power of two"); + StreamCompaction::NaiveSM::scan(SIZE, c, a); + printElapsedTime(StreamCompaction::EfficientSM::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + // printArray(SIZE, c, true); + printCmpResult(SIZE, b, c); + + zeroArray(SIZE, c); + printDesc("efficient scan with shared memory, non-power-of-two"); + StreamCompaction::NaiveSM::scan(NPOT, c, a); + printElapsedTime(StreamCompaction::EfficientSM::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + // printArray(NPOT, c, true); + printCmpResult(NPOT, b, c); + + printf("\n"); printf("*****************************\n"); printf("** STREAM COMPACTION TESTS **\n"); @@ -144,7 +175,7 @@ int main(int argc, char* argv[]) { printDesc("work-efficient compact, 
non-power-of-two"); count = StreamCompaction::Efficient::compact(NPOT, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(count, c, true); + printArray(count, c, true); printCmpLenResult(count, expectedNPOT, b, c); system("pause"); // stop Win32 console from closing on exit diff --git a/stream_compaction/CMakeLists.txt b/stream_compaction/CMakeLists.txt index cdbef77..5f71362 100644 --- a/stream_compaction/CMakeLists.txt +++ b/stream_compaction/CMakeLists.txt @@ -9,6 +9,10 @@ set(SOURCE_FILES "efficient.cu" "thrust.h" "thrust.cu" + "naive_sm.cu" + "naive_sm.h" + "efficient_sm.cu" + "efficient_sm.h" ) cuda_add_library(stream_compaction diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu index 8fc0211..3e21054 100644 --- a/stream_compaction/common.cu +++ b/stream_compaction/common.cu @@ -23,7 +23,10 @@ namespace StreamCompaction { * which map to 0 will be removed, and elements which map to 1 will be kept. */ __global__ void kernMapToBoolean(int n, int *bools, const int *idata) { - // TODO + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index < n){ + bools[index] = (idata[index] != 0); + } } /** @@ -32,7 +35,10 @@ namespace StreamCompaction { */ __global__ void kernScatter(int n, int *odata, const int *idata, const int *bools, const int *indices) { - // TODO + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index < n && bools[index]){ + odata[indices[index]] = idata[index]; + } } } diff --git a/stream_compaction/common.h b/stream_compaction/common.h index 99a1b04..236de71 100644 --- a/stream_compaction/common.h +++ b/stream_compaction/common.h @@ -10,6 +10,8 @@ #include #include +#define blockSize 128 + #define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__) #define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__) diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu index 05ce667..b9d2291 100644 --- a/stream_compaction/cpu.cu +++ b/stream_compaction/cpu.cu @@ -1,15 +1,15 @@ #include #include "cpu.h" -#include "common.h" +#include "common.h" namespace StreamCompaction { namespace CPU { - using StreamCompaction::Common::PerformanceTimer; - PerformanceTimer& timer() - { - static PerformanceTimer timer; - return timer; + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer& timer() + { + static PerformanceTimer timer; + return timer; } /** @@ -19,7 +19,10 @@ namespace StreamCompaction { */ void scan(int n, int *odata, const int *idata) { timer().startCpuTimer(); - // TODO + if (n == 0 || idata == NULL || odata == NULL) return; + odata[0] = 0; + for (int i = 1; i < n; i++) odata[i] = odata[i - 1] + idata[i - 1]; + timer().endCpuTimer(); } @@ -29,10 +32,17 @@ namespace StreamCompaction { * @returns the number of elements remaining after compaction. */ int compactWithoutScan(int n, int *odata, const int *idata) { - timer().startCpuTimer(); - // TODO + if (n == 0 || idata == NULL || odata == NULL) return 0; + timer().startCpuTimer(); + int counter = 0; + for (int i = 0; i < n; i++){ + if (idata[i] != 0){ + odata[counter] = idata[i]; + counter ++; + } + } timer().endCpuTimer(); - return -1; + return counter; } /** @@ -41,10 +51,24 @@ namespace StreamCompaction { * @returns the number of elements remaining after compaction. 
*/ int compactWithScan(int n, int *odata, const int *idata) { - timer().startCpuTimer(); - // TODO + if (n == 0 || idata == NULL || odata == NULL) return 0; + timer().startCpuTimer(); + int* position = new int[n]; + for (int i = 0; i < n; i++){ + odata[i] = (idata[i] != 0); + } + position[0] = 0; + for (int i = 1; i < n; i++) position[i] = position[i - 1] + odata[i - 1]; + + int k = 0; + for (int i = 0; i < n; i++){ + if (idata[i] != 0){ + k++; + odata[position[i]] = idata[i]; + } + } timer().endCpuTimer(); - return -1; + return k; } } } diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu index 36c5ef2..66757c8 100644 --- a/stream_compaction/efficient.cu +++ b/stream_compaction/efficient.cu @@ -2,23 +2,89 @@ #include #include "common.h" #include "efficient.h" +#include namespace StreamCompaction { namespace Efficient { - using StreamCompaction::Common::PerformanceTimer; - PerformanceTimer& timer() - { - static PerformanceTimer timer; - return timer; + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer& timer() { + static PerformanceTimer timer; + return timer; } + __global__ void kernUpSweep(int N, int *odata, int d){ + int index = threadIdx.x + (blockIdx.x * blockDim.x); + // blocksize is changing + if (index < (N >> (d + 1)) ){ + int idx = index << (d + 1); + odata[idx + (1 << (d + 1)) - 1] += odata[idx + (1 << d) - 1]; + } + } + + + __global__ void kernDownSweep(int N, int *odata, int d){ + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index < (N >> (d + 1)) ) { + int idx = index << (d + 1); + int tmp = odata[idx + (1 << d) - 1]; + odata[idx + (1 << d) - 1] = odata[idx + (1 << (d + 1)) - 1]; + odata[idx + (1 << (d + 1)) - 1] += tmp; + } + + } + + /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ void scan(int n, int *odata, const int *idata) { - timer().startGpuTimer(); - // TODO - timer().endGpuTimer(); + // when n is not power of two, need to allocate more space to zero pad + int d = ilog2ceil(n); + int N = 1 << d; + int timer_started = 0; + + dim3 fullBlockPerGrid; + int* dev_out; + + cudaMalloc((void**)&dev_out, sizeof(int) * N); + checkCUDAError("cudaMalloc dev_out failed"); + + cudaMemset(dev_out, 0, sizeof(int) * N); + checkCUDAError("cuda Memset failed"); + + cudaMemcpy(dev_out, idata, sizeof(int) * n, cudaMemcpyHostToDevice); + checkCUDAError("cudaMemcpyHostToDevice failed"); + + try + { + timer().startGpuTimer(); + } + catch(...) + { + // timer already started + timer_started = 1; + } + + // without shared memory, the algorithm needs to be called for d times + for (int i = 0; i < d; i++){ + fullBlockPerGrid = ((1 << (d - i - 1)) + blockSize - 1) / blockSize; + kernUpSweep<<>>(N, dev_out, i); + checkCUDAError("kernUpSweep failed"); + } + + cudaMemset(dev_out + N - 1, 0, sizeof(int)); + for (int i = d - 1; i >= 0; i--){ + fullBlockPerGrid = ((1 << (d - i - 1)) + blockSize - 1) / blockSize; + kernDownSweep<<>>(N, dev_out, i); + checkCUDAError("kernDownpSweep failed"); + } + + if (!timer_started) timer().endGpuTimer(); + + cudaMemcpy(odata, dev_out, sizeof(int) * n, cudaMemcpyDeviceToHost); + checkCUDAError("cudaMemcpyDeviceToHost failed"); + + cudaFree(dev_out); } /** @@ -31,10 +97,52 @@ namespace StreamCompaction { * @returns The number of elements remaining after compaction. 
*/ int compact(int n, int *odata, const int *idata) { + + dim3 fullBlockPerGrid((n + blockSize - 1) / blockSize); + int* bools, *indices, *dev_in, *dev_out; + int num_element; + + cudaMalloc((void**)&bools, sizeof(int) * n); + checkCUDAError("cudaMalloc bools failed"); + cudaMalloc((void**)&indices, sizeof(int) * n); + checkCUDAError("cudaMalloc indices failed"); + cudaMalloc((void**)&dev_out, sizeof(int) * n); + checkCUDAError("cudaMalloc dev_out failed"); + cudaMalloc((void**)&dev_in, sizeof(int) * n); + checkCUDAError("cudaMalloc dev_in failed"); + + // lots of memcpy... + + cudaMemcpy(dev_in, idata, sizeof(int) * n, cudaMemcpyHostToDevice); + checkCUDAError("cudaMemcpyHostToDevice failed"); + timer().startGpuTimer(); - // TODO + StreamCompaction::Common:: kernMapToBoolean<<>>(n, bools, dev_in); + checkCUDAError("kernMapToBoolean failed"); + + cudaMemcpy(odata, bools, sizeof(int) * n, cudaMemcpyDeviceToHost); + num_element = odata[n - 1]; + checkCUDAError("cudaMemcpyDeviceToHost failed"); + + scan(n, odata, odata); + num_element += odata[n - 1]; + + cudaMemcpy(indices, odata, sizeof(int) * n, cudaMemcpyHostToDevice); + checkCUDAError("cudaMemcpyHostToDevice failed"); + + StreamCompaction::Common::kernScatter<<>>(n, dev_out, dev_in, bools, indices); + timer().endGpuTimer(); - return -1; + + cudaMemcpy(odata, dev_out, sizeof(int) * n, cudaMemcpyDeviceToHost); + checkCUDAError("cudaMemcpyDeviceToHost failed"); + + cudaFree(bools); + cudaFree(indices); + cudaFree(dev_in); + cudaFree(dev_out); + + return num_element; } } } diff --git a/stream_compaction/efficient_sm.cu b/stream_compaction/efficient_sm.cu new file mode 100644 index 0000000..25d8469 --- /dev/null +++ b/stream_compaction/efficient_sm.cu @@ -0,0 +1,84 @@ +#include +#include +#include "common.h" +#include "efficient_sm.h" + +namespace StreamCompaction { + namespace EfficientSM { + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer& timer() + { + static PerformanceTimer timer; + return timer; + } + + + __global__ void kernEfficientScan(int N, int *odata, int *idata){ + extern __shared__ int tmp[]; + int index = threadIdx.x; + if (index >= N) return; + + int offset = 1; + tmp[2 * index] = idata[2 * index]; + tmp[2 * index + 1] = idata[2 * index + 1]; + // up sweep + for (int d = (N >> 1); d > 0; d >>= 1){ + __syncthreads(); + if (index < d) tmp[offset * (2 * index + 2) - 1] += tmp[offset * (2 * index + 1) - 1]; + offset <<= 1; + } + // clear last digit + if (index == 0) tmp[N - 1] = 0; + // down sweep + for (int d = 1; d < N; d <<= 1){ + offset >>= 1; + __syncthreads(); + if (index < d){ + int t = tmp[offset * (2 * index + 1) - 1]; + tmp[offset * (2 * index + 1) - 1] = tmp[offset * (2 * index + 2) - 1]; + tmp[offset * (2 * index + 2) - 1] += t; + } + } + __syncthreads(); + + odata[2 * index] = tmp[2 * index]; + odata[2 * index + 1] = tmp[2 * index + 1]; + } + + /** + * Performs prefix-sum (aka scan) on idata, storing the result into odata. 
+ */ + void scan(int n, int *odata, const int *idata) { + int N = 1 << ilog2ceil(n); + dim3 fullBlockPerGrid((N + blockSize - 1) / blockSize); + int* dev_in, *dev_out; + + cudaMalloc((void**) &dev_in, N * sizeof(int)); + checkCUDAError("cudaMalloc dev_in failed"); + + cudaMalloc((void**) &dev_out, N * sizeof(int)); + checkCUDAError("cudaMalloc dev_out failed"); + + cudaMemset(dev_out, 0, sizeof(int) * N); + checkCUDAError("cuda Memset failed"); + + cudaMemcpy(dev_in, idata, n * sizeof(int), cudaMemcpyHostToDevice); + checkCUDAError("cudaMemcpy HostToDevice failed"); + + timer().startGpuTimer(); + + kernEfficientScan <<< fullBlockPerGrid, blockSize, 2 * N * sizeof(int) >>> (N, dev_out, dev_in); + checkCUDAError("kernNaiveScan dev_in failed"); + + + timer().endGpuTimer(); + + cudaMemcpy(odata, dev_out, n * sizeof(int), cudaMemcpyDeviceToHost); + checkCUDAError("cudaMemcpy DeviceToHost failed"); + + cudaFree(dev_in); + cudaFree(dev_out); + + } + } +} diff --git a/stream_compaction/efficient_sm.h b/stream_compaction/efficient_sm.h new file mode 100644 index 0000000..f67c58e --- /dev/null +++ b/stream_compaction/efficient_sm.h @@ -0,0 +1,11 @@ +#pragma once + +#include "common.h" + +namespace StreamCompaction { + namespace EfficientSM { + StreamCompaction::Common::PerformanceTimer& timer(); + + void scan(int n, int *odata, const int *idata); + } +} \ No newline at end of file diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu index 9218f8e..aef301f 100644 --- a/stream_compaction/naive.cu +++ b/stream_compaction/naive.cu @@ -5,21 +5,69 @@ namespace StreamCompaction { namespace Naive { - using StreamCompaction::Common::PerformanceTimer; - PerformanceTimer& timer() - { - static PerformanceTimer timer; - return timer; + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer& timer() + { + static PerformanceTimer timer; + return timer; + } + + __global__ void kernNaiveScan(int N, int *odata, int *idata, int d){ + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index < N){ + if (index >= (1 << d)) odata[index] = idata[index - (1 << d)] + idata[index]; + else odata[index] = idata[index]; + } + } + + // couldn't figure out a way to exclusive scan at once + __global__ void kernInclusiveToExclusive(int N, int *odata, int *idata){ + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index < N){ + if (index == 0) odata[index] = 0; + else odata[index] = idata[index - 1]; + } } - // TODO: __global__ /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. 
*/ void scan(int n, int *odata, const int *idata) { + dim3 fullBlockPerGrid((n + blockSize - 1) / blockSize); + int* dev_in, *dev_out; + // int out = 0; + + cudaMalloc((void**) &dev_in, n * sizeof(int)); + checkCUDAError("cudaMalloc dev_in failed"); + + cudaMalloc((void**) &dev_out, n * sizeof(int)); + checkCUDAError("cudaMalloc dev_out failed"); + + cudaMemcpy(dev_in, idata, n * sizeof(int), cudaMemcpyHostToDevice); + checkCUDAError("cudaMemcpy HostToDevice failed"); + + // cudaMemcpy(dev_out, idata, n * sizeof(int), cudaMemcpyHostToDevice); + // checkCUDAError("cudaMemcpy HostToDevice failed"); + timer().startGpuTimer(); - // TODO + // we don't need to allocate more mem space since the algorithm is never accessing space > n + for (int d = 0; d < ilog2ceil(n); d++) { + // ping-pong the buffer for 'inplace' matrix manipulation + kernNaiveScan <<< fullBlockPerGrid, blockSize >>> (n, dev_out, dev_in, d); + checkCUDAError("kernNaiveScan dev_in failed"); + std::swap(dev_in, dev_out); + } + // result now in dev_in + kernInclusiveToExclusive<<>> (n, dev_out, dev_in); + timer().endGpuTimer(); + + cudaMemcpy(odata, dev_out, n * sizeof(int), cudaMemcpyDeviceToHost); + checkCUDAError("cudaMemcpy DeviceToHost failed"); + + cudaFree(dev_in); + cudaFree(dev_out); + } } } diff --git a/stream_compaction/naive_sm.cu b/stream_compaction/naive_sm.cu new file mode 100644 index 0000000..28e59cc --- /dev/null +++ b/stream_compaction/naive_sm.cu @@ -0,0 +1,68 @@ +#include +#include +#include "common.h" +#include "naive_sm.h" + +namespace StreamCompaction { + namespace NaiveSM { + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer& timer() + { + static PerformanceTimer timer; + return timer; + } + + // naive scan implemented with shared memory + + __global__ void kernNaiveScan(int N, int *odata, int *idata){ + extern __shared__ int tmp[]; + int pout = 0; + int pin = 1; + int index = threadIdx.x; + if (index >= N) return; + tmp[index] = index > 0 ? idata[index - 1]: 0; + __syncthreads(); + for (int offset = 1; offset < N; offset *= 2){ + pout = 1 - pout; + pin = 1 - pin; + // the suedo code on gems 3 contains error + if (index >= offset) tmp[pout * N + index] = tmp[pin * N + index - offset] + tmp[pin * N + index]; + else tmp[pout * N + index ] = tmp[pin * N + index]; + __syncthreads(); + } + + odata[index] = tmp[pout * N + index]; + } + + /** + * Performs prefix-sum (aka scan) on idata, storing the result into odata. 
+ */ + void scan(int n, int *odata, const int *idata) { + dim3 fullBlockPerGrid((n + blockSize - 1) / blockSize); + int* dev_in, *dev_out; + + cudaMalloc((void**) &dev_in, n * sizeof(int)); + checkCUDAError("cudaMalloc dev_in failed"); + + cudaMalloc((void**) &dev_out, n * sizeof(int)); + checkCUDAError("cudaMalloc dev_out failed"); + + cudaMemcpy(dev_in, idata, n * sizeof(int), cudaMemcpyHostToDevice); + checkCUDAError("cudaMemcpy HostToDevice failed"); + + timer().startGpuTimer(); + + kernNaiveScan <<< fullBlockPerGrid, blockSize, 2 * n * sizeof(int) >>> (n, dev_out, dev_in); + checkCUDAError("kernNaiveScan dev_in failed"); + + timer().endGpuTimer(); + + cudaMemcpy(odata, dev_out, n * sizeof(int), cudaMemcpyDeviceToHost); + checkCUDAError("cudaMemcpy DeviceToHost failed"); + + cudaFree(dev_in); + cudaFree(dev_out); + + } + } +} diff --git a/stream_compaction/naive_sm.h b/stream_compaction/naive_sm.h new file mode 100644 index 0000000..ba6efe7 --- /dev/null +++ b/stream_compaction/naive_sm.h @@ -0,0 +1,11 @@ +#pragma once + +#include "common.h" + +namespace StreamCompaction { + namespace NaiveSM { + StreamCompaction::Common::PerformanceTimer& timer(); + + void scan(int n, int *odata, const int *idata); + } +} diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu index 36b732d..7a670e8 100644 --- a/stream_compaction/thrust.cu +++ b/stream_compaction/thrust.cu @@ -8,21 +8,25 @@ namespace StreamCompaction { namespace Thrust { - using StreamCompaction::Common::PerformanceTimer; - PerformanceTimer& timer() - { - static PerformanceTimer timer; - return timer; + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer& timer() + { + static PerformanceTimer timer; + return timer; } /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ void scan(int n, int *odata, const int *idata) { + + thrust::device_vector thrust_odata(odata, odata + n); + thrust::device_vector thrust_idata(idata, idata + n); + timer().startGpuTimer(); - // TODO use `thrust::exclusive_scan` - // example: for device_vectors dv_in and dv_out: - // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin()); + thrust::exclusive_scan(thrust_idata.begin(), thrust_idata.end(), thrust_odata.begin()); timer().endGpuTimer(); + + thrust::copy(thrust_odata.begin(), thrust_odata.end(), odata); } } }