1 change: 1 addition & 0 deletions 2013-09-07.md
Empty Template
31 changes: 31 additions & 0 deletions Week 10.md
<h3>Weekly Reflections for the week 11/3-11/9</h3>

<h4>Requirements for Data Cleanup</h4>

Sorry for submitting the weekly reflection a little late. I hope it is not too late.

The ultimate goal of data cleanup is to give analyzers an easy way to access the information in whatever data structure they prefer. The procedure for generating such a data structure can be very detail-oriented, and naturally we want to provide a general interface for anyone who wants to use it.

Sometimes, taking care of missing data is not the end of the story. The original data might also contain lots of useless information, or even noise, which needs further cleanup. It would be great to let analyzers selectively choose data based on their own criteria. Since the criteria are decided by the analyzer rather than the data curator, they cannot be hard-coded into the data-cleanup code. This is one of the roadblocks before we can move further.

My solution is to let the analyzer pass a function that decides whether a record should be kept when they call the tool, provided by the data curator, that generates the data structure.

The following pseudocode is based on R.

```r
# The data curator's code
flexible_data_generator = function(original_data, select_function = NA) {
  export_data = original_data
  # Apply the analyzer's filter only when one is supplied
  if (is.function(select_function)) {
    keep = select_function(year = original_data$year, mag = original_data$mag, ...)
    export_data = original_data[keep, ]
  }

  # generate ppx data based on export_data
  ...
}
```

```r
# The analyzer's code
myselect = function(year, mag, ...) {...}
flexible_data_generator(clean_data, myselect)
```

The analyzer does not need to implement their own select_function, but they can supply one if they want to apply certain rules to the original data.
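
To make the pattern concrete, here is a minimal runnable sketch with a toy data frame. The column names year and mag come from the pseudocode above; the data values and the filtering rule inside myselect are purely illustrative assumptions.

```r
# Toy data standing in for the curated dataset (illustrative values only)
clean_data = data.frame(year = c(1985, 1992, 2001, 2007),
                        mag  = c(2.1, 4.5, 3.8, 5.2))

# A curator tool that simply returns the (optionally filtered) data
flexible_data_generator = function(original_data, select_function = NA) {
  export_data = original_data
  if (is.function(select_function)) {
    keep = select_function(year = original_data$year, mag = original_data$mag)
    export_data = original_data[keep, ]
  }
  export_data
}

# An analyzer-supplied rule: keep only recent, large events (assumed criteria)
myselect = function(year, mag) { year >= 1990 & mag >= 4.0 }

flexible_data_generator(clean_data, myselect)   # returns the 1992 and 2007 rows
```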
8 changes: 8 additions & 0 deletions Week 11.md
<h3>Weekly Reflections for the week 11/10-11/16</h3>

<h4>Reflection on "Python for Data Analysis"</h4>

"Python for Data Analysis" is the book recommended in class. As a new learner, I found it is very helpful. The brief introduction to Python's essentials reveals lots of basic ideas. As an object-oriented languange, it is important to understand both what does a data type is trying to modeling as well as related method. The brief review section has plenty of examples to reveal the what does each basic Python data type modeling, and those small samples makes thing easier to understand the purpose and function of each method.

However, the data types brought in by pandas, numpy and scipy are more complicated. For example, it is not difficult to imagine what a DataFrame models, but the purpose and function of its methods are not as obvious. Abbreviated names are widely used, and those names can be ambiguous. Furthermore, the examples are so big that it is impossible to review the results of some methods manually. I'm hoping to find some smaller examples in the following chapters.

7 changes: 7 additions & 0 deletions Week 12.md
<h3>Weekly Reflections for the week 11/17-11/23</h3>

<h4>Importance of Comments</h4>

In order to contribute more to the whole project, I have now joined the visualizer team "visualheart.task8". We need to enhance the code provided by the analyzers and make the visualization better, which means I need to review other people's code and revise it. It has come to my attention that sometimes the code is easy to run but not easy to read. In other words, it is not well documented.

The whole idea of this project is a reproducible study, which means anyone is allowed to review your code and data and try to reproduce whatever you did. As long as someone might revisit your code later, it is important to keep the code well documented, with enough comments to help them understand the idea. As a matter of fact, the author himself also benefits, since in most cases he needs to revisit the code from time to time for bug fixes or enhancements. So it is always worthwhile to comment even a trick that looked straightforward to the author at the time the code was written, unless it is common sense for everyone. Comments are a very important part of the maintainability of the code.
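
As a small illustration (a made-up snippet of my own, not taken from the project code), the same line reads very differently with and without a comment:

```r
v = c(-1, 2, 0, 5)

# Uncommented, this line relies on a trick the reader must decode:
n_pos = sum(v > 0)

# Commented, the same line documents itself:
# v > 0 gives a logical vector, and sum() counts each TRUE as 1,
# so this counts how many elements of v are positive (here, 2).
n_pos = sum(v > 0)
```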
30 changes: 30 additions & 0 deletions Week of Thanksgiving.md
<h3>Weekly Reflections for the week 11/24-11/30</h3>

<h4>Keep the Code Clean and Scalable</h4>

I've been working on etas-training.R this week. The Quaker team did a pretty good job on it: it runs smoothly out of the box, and the final plot is colorful and carries lots of information. There wasn't much I could do to improve the plot, so I focused on improving the code itself, making it neater and faster.

The first thing I noticed was that the original code generated several large vectors but never used them, and some other large vectors served only as intermediate results for those unused ones. I made the code shorter and faster by removing the code related to these variables, which didn't affect the final results.

The second thing I noticed was that the original code sourced Luen's code, which contains lots of information. However, the variables and functions defined in Luen's code were not used by etas-training, so I cleared all variables from the current workspace and reran the etas-training code without sourcing Luen's code. The final results remained unchanged.
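
A minimal sketch of that check (only etas-training.R is named in the project; any file name for Luen's code would be hypothetical):

```r
# Start from an empty workspace so nothing defined by Luen's code can leak in
rm(list = ls())

# Run etas-training on its own, without sourcing Luen's code first
source("etas-training.R")

# Inspect which objects were actually created
ls()
```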

The third thing I noticed was that, in the original code, vectors are first declared as placeholders and then filled in a for loop. This approach works, but a more idiomatic approach in R is to build the vector with the sapply function, as the small example below shows.
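
Here is a generic before-and-after illustration of that pattern (a toy computation, not taken from etas-training.R):

```r
# Placeholder-and-loop style: declare first, fill later
squares = numeric(10)
for (i in 1:10) {
  squares[i] = i^2
}

# Equivalent sapply style: build the vector in one expression
squares = sapply(1:10, function(i) i^2)
```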

Finally, I went through the code step by step and tried to improve its performance. I think reproducibility does not only mean sharing the code and dataset with other people; it also means the code should be scalable enough to handle different datasets. I noticed that it was possible to improve the code that calculates the intermediate variables w58.list and w58.dist:

```r
for (KK in 1:length(w58.list)) {
  w58.list[KK] = min((times[1 + KK] - times[1:KK]) / (5.8^mags[1:KK]))
}
for (KK in 1:length(timelist)) {
  w58.dist[KK] = min((timelist[KK] - times[1:n.events[KK]]) / (5.8^mags[1:n.events[KK]]))
}
```

The code evaluated the power series 5.8^n inside the loop, so the calculation was redone on every iteration, and the results were thrown away as soon as the expression was evaluated, only to be recalculated for the next value of KK. Take w58.list as an example: KK ranges from 1 to N, so only N distinct powers are ever needed, but iteration KK evaluates KK of them, for a total of 1 + 2 + ... + N, about N*N/2 evaluations. Since exponentiation is a fairly expensive operation that slows the processing, I moved the power-series generation in front of the for loop. Below is the new code.

```r
# Precompute the power series once, outside the loop
power_mags = 5.8^mags

w58.list = sapply(1:n.training, function(x) {
  min((times[1 + x] - times[1:x]) / power_mags[1:x])
})
w58.dist = sapply(1:length(timelist), function(x) {
  min((timelist[x] - times[1:n.events[x]]) / power_mags[1:n.events[x]])
})
```

By combining all of these efforts, the improved code I checked in runs almost twice as fast as the original.
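
A minimal sketch of how such a speedup can be measured in R, assuming the old and new computations are wrapped into functions old_w58() and new_w58() (hypothetical names, not from the checked-in code):

```r
# Time each version and compare elapsed wall-clock seconds
t_old = system.time(old_w58())["elapsed"]
t_new = system.time(new_w58())["elapsed"]
cat("speedup:", t_old / t_new, "x\n")
```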