From 4aa3f3aa097a785ccb18bebc30af440a826e0a3e Mon Sep 17 00:00:00 2001
From: qi-zhang
Date: Wed, 11 Sep 2013 22:45:57 -0700
Subject: [PATCH 01/19] For 1st week.

---
 2013-09-07.md | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/2013-09-07.md b/2013-09-07.md
index e69de29..e18a077 100644
--- a/2013-09-07.md
+++ b/2013-09-07.md
@@ -0,0 +1,11 @@
+Reflection for the week 9/1 to 9/7.
+
+It was pretty challenging to get SSH access to the Ubuntu server running in my VirtualBox.
+
+By default there is only one network adapter, “Adapter 1”, and it is attached to “NAT”. That configuration makes it impossible to reach the box from the host by its IP address, let alone establish an SSH session.
+
+I then re-configured the adapter to “Bridged Adapter” and was able to SSH into the box. I soon found that was not a good idea, since the IP address of the box kept changing every time I restarted it.
+
+Finally I set “Adapter 1” back to “NAT”, enabled “Adapter 2”, and attached it to “Host-Only Adapter”. With help from the server guide on Ubuntu's website (https://help.ubuntu.com/13.04/serverguide/network-configuration.html), I finally got the box to bring up “Adapter 2”, and now I can establish SSH access to the box.
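+
+Concretely, what the guide walks through amounts to giving the host-only interface a static address in /etc/network/interfaces. A minimal sketch, assuming VirtualBox exposed the second adapter as eth1 and that the host-only network uses its default 192.168.56.x subnet:
+
+    # /etc/network/interfaces: static address for the host-only adapter
+    auto eth1
+    iface eth1 inet static
+        address 192.168.56.10
+        netmask 255.255.255.0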
+
+It was pretty challenging for me since this was the first time I had learned these networking terms. But I'm so glad I did it.

From aba31eeafe43e1c3662b191234c42b4d2ea79fb5 Mon Sep 17 00:00:00 2001
From: qi-zhang
Date: Wed, 11 Sep 2013 22:49:58 -0700
Subject: [PATCH 02/19] Update 2013-09-07.md

---
 2013-09-07.md | 12 +-----------
 1 file changed, 1 insertion(+), 11 deletions(-)

diff --git a/2013-09-07.md b/2013-09-07.md
index e18a077..831a521 100644
--- a/2013-09-07.md
+++ b/2013-09-07.md
@@ -1,11 +1 @@
-Reflection for the week 9/1 to 9/7.
-
-It was pretty challenging to get SSH access to the Ubuntu server running in my VirtualBox.
-
-By default there is only one network adapter, “Adapter 1”, and it is attached to “NAT”. That configuration makes it impossible to reach the box from the host by its IP address, let alone establish an SSH session.
-
-I then re-configured the adapter to “Bridged Adapter” and was able to SSH into the box. I soon found that was not a good idea, since the IP address of the box kept changing every time I restarted it.
-
-Finally I set “Adapter 1” back to “NAT”, enabled “Adapter 2”, and attached it to “Host-Only Adapter”. With help from the server guide on Ubuntu's website (https://help.ubuntu.com/13.04/serverguide/network-configuration.html), I finally got the box to bring up “Adapter 2”, and now I can establish SSH access to the box.
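-
-Concretely, what the guide walks through amounts to giving the host-only interface a static address in /etc/network/interfaces. A minimal sketch, assuming VirtualBox exposed the second adapter as eth1 and that the host-only network uses its default 192.168.56.x subnet:
-
-    # /etc/network/interfaces: static address for the host-only adapter
-    auto eth1
-    iface eth1 inet static
-        address 192.168.56.10
-        netmask 255.255.255.0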
-
-It was pretty challenging for me since this was the first time I had learned these networking terms. But I'm so glad I did it.
+Empty Template

From cfff69f2796dfd9d9dab896972b34abdbc4a7716 Mon Sep 17 00:00:00 2001
From: qi-zhang
Date: Sat, 19 Oct 2013 19:43:03 -0700
Subject: [PATCH 03/19] Create week7

---
 week7 | 12 ++++++++++++
 1 file changed, 12 insertions(+)
 create mode 100644 week7

diff --git a/week7 b/week7
new file mode 100644
index 0000000..635ad14
--- /dev/null
+++ b/week7
@@ -0,0 +1,12 @@
+Weekly Reflections for the week 10/13-10/19
+
+Working with JSON in Python
+
+JSON is a format for hierarchical data files. A JSON file reads like a flexible combination of Python's dict and list objects. Hierarchical means the same field name can be reused across the file as long as it is nested under a different parent, which is why searching a hierarchical data file with regular expressions is not a good idea. To find a particular piece of information, it is necessary to first figure out the structure of the JSON file.
+
+In terms of file format, a JSON file is similar to a dict object in Python, but its values can be flexible combinations of dict and list objects. Naturally, we can leverage those two types to parse a JSON file.
+
+The first step is to parse the JSON text with json.loads, a function defined in the json module; it returns the corresponding Python object (a dict for a JSON object, a list for a JSON array). For example,
+
+    detail=json.loads(urllib.urlopen("http://earthquake.usgs.gov/product/nearby-cities/ci11380834/us/1382197630296/nearby-cities.json").read())
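+
+As a sketch of what comes after parsing, the result can be walked with ordinary list and dict operations. This assumes the feed is a JSON array of city records with "distance", "direction" and "name" fields, which nothing above guarantees:
+
+    import json
+    import urllib
+
+    url="http://earthquake.usgs.gov/product/nearby-cities/ci11380834/us/1382197630296/nearby-cities.json"
+    detail=json.loads(urllib.urlopen(url).read())
+
+    # the top-level JSON value here is an array, so detail is a list;
+    # each element is a dict describing one nearby city
+    for city in detail:
+        print city["distance"], city["direction"], city["name"]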

From ad4654b2dd592c15068b90f4b57eee3b49ec0d7a Mon Sep 17 00:00:00 2001
From: qi-zhang
Date: Sat, 19 Oct 2013 20:23:43 -0700
Subject: [PATCH 04/19] Delete week7

---
 week7 | 12 ------------
 1 file changed, 12 deletions(-)
 delete mode 100644 week7

diff --git a/week7 b/week7
deleted file mode 100644
index 635ad14..0000000
--- a/week7
+++ /dev/null
@@ -1,12 +0,0 @@
-Weekly Reflections for the week 10/13-10/19
-
-Working with JSON in Python
-
-JSON is a format for hierarchical data files. A JSON file reads like a flexible combination of Python's dict and list objects. Hierarchical means the same field name can be reused across the file as long as it is nested under a different parent, which is why searching a hierarchical data file with regular expressions is not a good idea. To find a particular piece of information, it is necessary to first figure out the structure of the JSON file.
-
-In terms of file format, a JSON file is similar to a dict object in Python, but its values can be flexible combinations of dict and list objects. Naturally, we can leverage those two types to parse a JSON file.
-
-The first step is to parse the JSON text with json.loads, a function defined in the json module; it returns the corresponding Python object (a dict for a JSON object, a list for a JSON array). For example,
-
-    detail=json.loads(urllib.urlopen("http://earthquake.usgs.gov/product/nearby-cities/ci11380834/us/1382197630296/nearby-cities.json").read())
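-
-As a sketch of what comes after parsing, the result can be walked with ordinary list and dict operations. This assumes the feed is a JSON array of city records with "distance", "direction" and "name" fields, which nothing above guarantees:
-
-    import json
-    import urllib
-
-    url="http://earthquake.usgs.gov/product/nearby-cities/ci11380834/us/1382197630296/nearby-cities.json"
-    detail=json.loads(urllib.urlopen(url).read())
-
-    # the top-level JSON value here is an array, so detail is a list;
-    # each element is a dict describing one nearby city
-    for city in detail:
-        print city["distance"], city["direction"], city["name"]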

From 77c0c1d1298fcd9b1ad552fa4cda04ba1bc964de Mon Sep 17 00:00:00 2001
From: qi-zhang
Date: Mon, 11 Nov 2013 14:45:50 -0800
Subject: [PATCH 05/19] Create Week 10

---
 Week 10 | 4 ++++
 1 file changed, 4 insertions(+)
 create mode 100644 Week 10

diff --git a/Week 10 b/Week 10
new file mode 100644
index 0000000..348a81e
--- /dev/null
+++ b/Week 10
@@ -0,0 +1,4 @@
+Weekly Reflections for the week 11/3-11/9
+
+Data Clean Up
+

From b178cc998b78f0912ebf12f760e5f9492644852a Mon Sep 17 00:00:00 2001
From: qi-zhang
Date: Mon, 11 Nov 2013 14:46:44 -0800
Subject: [PATCH 06/19] Delete Week 10

---
 Week 10 | 4 ----
 1 file changed, 4 deletions(-)
 delete mode 100644 Week 10

diff --git a/Week 10 b/Week 10
deleted file mode 100644
index 348a81e..0000000
--- a/Week 10
+++ /dev/null
@@ -1,4 +0,0 @@
-Weekly Reflections for the week 11/3-11/9
-
-Data Clean Up
-

From dc2f30787432f147ff167e534f718cb3cad4f5c6 Mon Sep 17 00:00:00 2001
From: qi-zhang
Date: Mon, 11 Nov 2013 14:47:01 -0800
Subject: [PATCH 07/19] Create Week 10.md

---
 Week 10.md | 4 ++++
 1 file changed, 4 insertions(+)
 create mode 100644 Week 10.md

diff --git a/Week 10.md b/Week 10.md
new file mode 100644
index 0000000..348a81e
--- /dev/null
+++ b/Week 10.md
@@ -0,0 +1,4 @@
+Weekly Reflections for the week 11/3-11/9
+
+Data Clean Up
+

From 281b92d09f0b86d702d2e0727c9adc43a18052c7 Mon Sep 17 00:00:00 2001
From: qi-zhang
Date: Mon, 11 Nov 2013 16:25:00 -0800
Subject: [PATCH 08/19] Update Week 10.md

---
 Week 10.md | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/Week 10.md b/Week 10.md
index 348a81e..ab66790 100644
--- a/Week 10.md
+++ b/Week 10.md
@@ -1,4 +1,25 @@
 Weekly Reflections for the week 11/3-11/9
 
-Data Clean Up
+Requirements for Data Cleaning Up
 
+The ultimate goal of data cleanup is to give the analyzer an easy way to access the information in whatever data structure they prefer. Generating such a data structure can be very detail-oriented, and naturally we want to provide a general way in for anyone who wants to use it.
+
+Sometimes taking care of missing data is not the end of the story. The original data may also contain lots of useless information, even noise, which needs further cleanup. It would be great to make it possible for analyzers to select data based on their own criteria. Since the criteria are decided by the analyzer rather than by the data curator, they cannot be hard-coded into the cleanup code. This is one of the roadblocks before we can move further.
+
+My solution is to allow the analyzer to pass a function that decides whether a record should be kept when they call the tool, provided by the data curator, that generates the data structure.
+
+The following pseudocode is based on R.
+
+    # The data curator's code
+    flexible_data_generator=function(original_data, select_function=NA) {
+      export_data = original_data
+      if(is.function(select_function)) {
+        # let the analyzer's function decide which records to keep
+        keep=select_function(year=original_data$year, mag=original_data$mag, ...)
+        export_data = original_data[ keep, ]
+      }
+
+      # generate ppx data based on export_data
+      ...
+    }
+
+There is no need for the analyzer to implement their own select_function, but they can if they want to apply certain rules to the original data.

From f69dc8df47583f192150f9f8fc5ca7d7df3251a1 Mon Sep 17 00:00:00 2001
From: qi-zhang
Date: Mon, 11 Nov 2013 20:52:13 -0800
Subject: [PATCH 09/19] Update Week 10.md

---
 Week 10.md | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/Week 10.md b/Week 10.md
index ab66790..1066ac9 100644
--- a/Week 10.md
+++ b/Week 10.md
@@ -2,6 +2,8 @@
 Requirements for Data Cleaning Up
 
+Sorry for being a little late with this weekly reflection. I hope it is not too late.
+
 The ultimate goal of data cleanup is to give the analyzer an easy way to access the information in whatever data structure they prefer. Generating such a data structure can be very detail-oriented, and naturally we want to provide a general way in for anyone who wants to use it.
 
 Sometimes taking care of missing data is not the end of the story. The original data may also contain lots of useless information, even noise, which needs further cleanup. It would be great to make it possible for analyzers to select data based on their own criteria. Since the criteria are decided by the analyzer rather than by the data curator, they cannot be hard-coded into the cleanup code. This is one of the roadblocks before we can move further.
 
 My solution is to allow the analyzer to pass a function that decides whether a record should be kept when they call the tool, provided by the data curator, that generates the data structure.
 
 The following pseudocode is based on R.
@@ -22,4 +24,8 @@
     }
 
+    # The analyzer's code
+    myselect=function(year, mag, ...) {...}
+    flexible_data_generator(clean_data, myselect)
+
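+As a concrete illustration (made up for this note, with invented cutoffs), a filter that keeps only post-1960 events of magnitude 4.0 or larger could be passed in like this:
+
+    # hypothetical example of an analyzer-supplied filter
+    keep_big_recent=function(year, mag, ...) {
+      year >= 1960 & mag >= 4.0
+    }
+    flexible_data_generator(clean_data, keep_big_recent)
+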
 There is no need for the analyzer to implement their own select_function, but they can if they want to apply certain rules to the original data.

From 58e79c8d3906b582973757b1fdc7a9bcf548e7ca Mon Sep 17 00:00:00 2001
From: qi-zhang
Date: Sat, 16 Nov 2013 20:26:15 -0800
Subject: [PATCH 10/19] Create Week 11.md

---
 Week 11.md | 8 ++++++++
 1 file changed, 8 insertions(+)
 create mode 100644 Week 11.md

diff --git a/Week 11.md b/Week 11.md
new file mode 100644
index 0000000..371c8bf
--- /dev/null
+++ b/Week 11.md
@@ -0,0 +1,8 @@
+Weekly Reflections for the week 11/10-11/16
+
+Make "Python for Data Analysis" Better
+
+"Python for Data Analysis" is a book for new learners. As a new learner, I found the brief introduction to Python's essentials really helpful. Since Python is an object-oriented language, it is important to understand both what a data type models and what its methods are for. The brief review section has plenty of examples revealing what each basic Python data type models, and those small samples make it easier to understand the purpose and function of each method.
+
+However, the data types brought in by pandas, NumPy and SciPy are more complicated. For example, it is not difficult to imagine what a DataFrame models, but the purpose and function of its methods are not as obvious. Abbreviated names are used widely, and those names can be ambiguous. Furthermore, the examples are so big that it is impossible to review the results of some methods manually. I believe examples of an appropriate size would definitely make this book even better.

From 5f9456ff2e5d5727851594a7d2012a863030f1aa Mon Sep 17 00:00:00 2001
From: qi-zhang
Date: Sat, 16 Nov 2013 20:31:11 -0800
Subject: [PATCH 11/19] Update Week 11.md

---
 Week 11.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/Week 11.md b/Week 11.md
index 371c8bf..006b9d5 100644
--- a/Week 11.md
+++ b/Week 11.md
@@ -1,8 +1,8 @@
 Weekly Reflections for the week 11/10-11/16
 
-Make "Python for Data Analysis" Better
+Reflection on "Python for Data Analysis"
 
-"Python for Data Analysis" is a book for new learners. As a new learner, I found the brief introduction to Python's essentials really helpful. Since Python is an object-oriented language, it is important to understand both what a data type models and what its methods are for. The brief review section has plenty of examples revealing what each basic Python data type models, and those small samples make it easier to understand the purpose and function of each method.
+"Python for Data Analysis" is the book recommended in class. As a new learner, I found it very helpful. The brief introduction to Python's essentials covers lots of the basic ideas. Since Python is an object-oriented language, it is important to understand both what a data type models and what its methods are for. The brief review section has plenty of examples revealing what each basic Python data type models, and those small samples make it easier to understand the purpose and function of each method.
 
-However, the data types brought in by pandas, NumPy and SciPy are more complicated. For example, it is not difficult to imagine what a DataFrame models, but the purpose and function of its methods are not as obvious. Abbreviated names are used widely, and those names can be ambiguous. Furthermore, the examples are so big that it is impossible to review the results of some methods manually. I believe examples of an appropriate size would definitely make this book even better.
+However, the data types brought in by pandas, NumPy and SciPy are more complicated. For example, it is not difficult to imagine what a DataFrame models, but the purpose and function of its methods are not as obvious. Abbreviated names are used widely, and those names can be ambiguous. Furthermore, the examples are so big that it is impossible to review the results of some methods manually. I'm hoping to find some smaller examples in the following chapters.
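+
+To make the point concrete, this is roughly the size of example I have in mind (my own toy, not one from the book):
+
+    # small enough that the output can be checked by hand
+    import pandas as pd
+    df = pd.DataFrame({"mag": [5.8, 6.1, 4.9], "depth": [10.0, 7.5, 12.2]})
+    print df.describe()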

From bce6abaabac51d5f5196f2198e3ba20605fad34f Mon Sep 17 00:00:00 2001
From: qi-zhang
Date: Sat, 23 Nov 2013 20:35:53 -0800
Subject: [PATCH 12/19] Create Week 12.md

---
 Week 12.md | 7 +++++++
 1 file changed, 7 insertions(+)
 create mode 100644 Week 12.md

diff --git a/Week 12.md b/Week 12.md
new file mode 100644
index 0000000..ffcbaeb
--- /dev/null
+++ b/Week 12.md
@@ -0,0 +1,7 @@
+Weekly Reflections for the week 11/17-11/23
+
+Importance of Comments
+
+In order to contribute more to the whole project, I have now joined the visualizers' team "visualheart.task8". We need to enhance code provided by the analyzers and make the visualization better, which means I need to review code from other people and revise it. It has come to my attention that the code is sometimes easy to run but not easy to read. In other words, it is not well documented.
+
+The whole idea of this project is a reproducible study, which means anyone is allowed to review your code and data and try to reproduce whatever you did. As long as someone might revisit your code later, it is important to keep the code well documented, with enough comments to help them understand the idea. As a matter of fact, the author himself benefits too, since in most cases he also needs to revisit the code from time to time for bug fixes or enhancements. So it is always worth leaving a comment even for a trick that seemed straightforward at the time the code was written, unless it is common sense for everyone. Comments are a very important part of the maintainability of the code.

From 6b69aa045d6d4fb6ef0046381db5f0b47d0c4a5d Mon Sep 17 00:00:00 2001
From: qi-zhang
Date: Sat, 30 Nov 2013 15:02:57 -0800
Subject: [PATCH 13/19] Create Week of the Thanksgiving

---
 Week of the Thanksgiving | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)
 create mode 100644 Week of the Thanksgiving

diff --git a/Week of the Thanksgiving b/Week of the Thanksgiving
new file mode 100644
index 0000000..4917ea3
--- /dev/null
+++ b/Week of the Thanksgiving
@@ -0,0 +1,22 @@
+Weekly Reflections for the week 11/24-11/30
+
+Keep the Code Clean And Scalable
+
+I was working on etas-training.R this week. The Quaker team did a pretty good job on it: it runs smoothly out of the box, and the final plot is colorful and carries lots of information. There isn't much I can do to improve the plot, so I focused on improving the code itself by making it cleaner and faster.
+
+The 1st thing I noticed is that the original code generates several large vectors but never uses them. Also, some other large vectors serve only as intermediate results for those unused vectors. I can definitely make the code shorter, cleaner and faster by removing any code related to these variables, with no impact on the final results.
+
+The 2nd thing I noticed is that the original code sources Luen's code, which contains lots of definitions. However, nothing defined in Luen's code is referred to in the following part, which implies the code may work fine without sourcing it. So I cleared all variables in the current workspace and reran the etas-training code without sourcing Luen's code. It turned out that there was no change to the final result.
+
+The 3rd thing I noticed is that in the original code, vectors are declared as placeholders first and assigned later in a for loop. This approach works, but the better approach in R is to do it with the sapply function.
+
+Finally, I went through the code step by step. I noticed the code calculating the intermediate variables w58.list and w58.dist:
+
+    for(KK in 1:length(w58.list)){
+      w58.list[KK]=min((times[1+KK]-times[1:(KK)])/(5.8^mags[1:(KK)]))}
+    for(KK in 1:length(timelist)){
+      w58.dist[KK]=min((timelist[KK]-times[1:n.events[KK]])/(5.8^mags[1:n.events[KK]]))}
+
+The code needs the power series 5.8^n inside the loop, so it does the calculation every time, and the results are abandoned immediately after the expression is evaluated.
+All of these power values are then recalculated for the next value of KK. Apparently this can be improved. Take w58.list as an example: KK ranges from 1 to N, so only a power series of N items is needed, yet the code in the for loop evaluates the power about N*N/2 times. Considering that exponentiation is a pretty expensive operation, there is notable room for performance improvement here.
+
+Combining all of these efforts, I checked in an improved version that runs pretty much 2 times faster than the original code.

From ef15d7a96f6641f504928edc9517fa5604ee95ec Mon Sep 17 00:00:00 2001
From: qi-zhang
Date: Sat, 30 Nov 2013 15:04:01 -0800
Subject: [PATCH 14/19] Delete Week of the Thanksgiving

---
 Week of the Thanksgiving | 22 ----------------------
 1 file changed, 22 deletions(-)
 delete mode 100644 Week of the Thanksgiving

diff --git a/Week of the Thanksgiving b/Week of the Thanksgiving
deleted file mode 100644
index 4917ea3..0000000
--- a/Week of the Thanksgiving
+++ /dev/null
@@ -1,22 +0,0 @@
-Weekly Reflections for the week 11/24-11/30
-
-Keep the Code Clean And Scalable
-
-I was working on etas-training.R this week. The Quaker team did a pretty good job on it: it runs smoothly out of the box, and the final plot is colorful and carries lots of information. There isn't much I can do to improve the plot, so I focused on improving the code itself by making it cleaner and faster.
-
-The 1st thing I noticed is that the original code generates several large vectors but never uses them. Also, some other large vectors serve only as intermediate results for those unused vectors. I can definitely make the code shorter, cleaner and faster by removing any code related to these variables, with no impact on the final results.
-
-The 2nd thing I noticed is that the original code sources Luen's code, which contains lots of definitions. However, nothing defined in Luen's code is referred to in the following part, which implies the code may work fine without sourcing it. So I cleared all variables in the current workspace and reran the etas-training code without sourcing Luen's code. It turned out that there was no change to the final result.
-
-The 3rd thing I noticed is that in the original code, vectors are declared as placeholders first and assigned later in a for loop. This approach works, but the better approach in R is to do it with the sapply function.
-
-Finally, I went through the code step by step. I noticed the code calculating the intermediate variables w58.list and w58.dist:
-
-    for(KK in 1:length(w58.list)){
-      w58.list[KK]=min((times[1+KK]-times[1:(KK)])/(5.8^mags[1:(KK)]))}
-    for(KK in 1:length(timelist)){
-      w58.dist[KK]=min((timelist[KK]-times[1:n.events[KK]])/(5.8^mags[1:n.events[KK]]))}
-
-The code needs the power series 5.8^n inside the loop, so it does the calculation every time, and the results are abandoned immediately after the expression is evaluated.
-All of these power values are then recalculated for the next value of KK. Apparently this can be improved. Take w58.list as an example: KK ranges from 1 to N, so only a power series of N items is needed, yet the code in the for loop evaluates the power about N*N/2 times. Considering that exponentiation is a pretty expensive operation, there is notable room for performance improvement here.
-
-Combining all of these efforts, I checked in an improved version that runs pretty much 2 times faster than the original code.

From be02d44ff11fb0aedbaabdad5d57e29e1f20fbe4 Mon Sep 17 00:00:00 2001
From: qi-zhang
Date: Sat, 30 Nov 2013 15:04:34 -0800
Subject: [PATCH 15/19] Create Week of Thanksgiving.md

---
 Week of Thanksgiving.md | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)
 create mode 100644 Week of Thanksgiving.md

diff --git a/Week of Thanksgiving.md b/Week of Thanksgiving.md
new file mode 100644
index 0000000..4917ea3
--- /dev/null
+++ b/Week of Thanksgiving.md
@@ -0,0 +1,22 @@
+Weekly Reflections for the week 11/24-11/30
+
+Keep the Code Clean And Scalable
+
+I was working on etas-training.R this week. The Quaker team did a pretty good job on it: it runs smoothly out of the box, and the final plot is colorful and carries lots of information. There isn't much I can do to improve the plot, so I focused on improving the code itself by making it cleaner and faster.
+
+The 1st thing I noticed is that the original code generates several large vectors but never uses them. Also, some other large vectors serve only as intermediate results for those unused vectors. I can definitely make the code shorter, cleaner and faster by removing any code related to these variables, with no impact on the final results.
+
+The 2nd thing I noticed is that the original code sources Luen's code, which contains lots of definitions. However, nothing defined in Luen's code is referred to in the following part, which implies the code may work fine without sourcing it. So I cleared all variables in the current workspace and reran the etas-training code without sourcing Luen's code. It turned out that there was no change to the final result.
+
+The 3rd thing I noticed is that in the original code, vectors are declared as placeholders first and assigned later in a for loop. This approach works, but the better approach in R is to do it with the sapply function.
+
+Finally, I went through the code step by step. I noticed the code calculating the intermediate variables w58.list and w58.dist:
+
+    for(KK in 1:length(w58.list)){
+      w58.list[KK]=min((times[1+KK]-times[1:(KK)])/(5.8^mags[1:(KK)]))}
+    for(KK in 1:length(timelist)){
+      w58.dist[KK]=min((timelist[KK]-times[1:n.events[KK]])/(5.8^mags[1:n.events[KK]]))}
+
+The code needs the power series 5.8^n inside the loop, so it does the calculation every time, and the results are abandoned immediately after the expression is evaluated.
+All of these power values are then recalculated for the next value of KK. Apparently this can be improved. Take w58.list as an example: KK ranges from 1 to N, so only a power series of N items is needed, yet the code in the for loop evaluates the power about N*N/2 times. Considering that exponentiation is a pretty expensive operation, there is notable room for performance improvement here.
+
+Combining all of these efforts, I checked in an improved version that runs pretty much 2 times faster than the original code.

From e1df30187087264c1cd86948b1a34dbff83e253e Mon Sep 17 00:00:00 2001
From: qi-zhang
Date: Sat, 30 Nov 2013 15:55:48 -0800
Subject: [PATCH 16/19] Update Week of Thanksgiving.md

---
 Week of Thanksgiving.md | 22 +++++++++++++++-------
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/Week of Thanksgiving.md b/Week of Thanksgiving.md
index 4917ea3..becbdf3 100644
--- a/Week of Thanksgiving.md
+++ b/Week of Thanksgiving.md
@@ -10,13 +10,21 @@
-Finally, I went through the code step by step. I noticed the code calculating the intermediate variables w58.list and w58.dist:
+Finally, I went through the code step by step and tried to improve the performance of the code.
+To me, reproducibility does not only mean sharing the code and the dataset with other people; it also means the code should be scalable enough to handle a different dataset. I soon noticed that it was possible to improve the code calculating the intermediate variables w58.list and w58.dist:
 
     for(KK in 1:length(w58.list)){
       w58.list[KK]=min((times[1+KK]-times[1:(KK)])/(5.8^mags[1:(KK)]))}
     for(KK in 1:length(timelist)){
       w58.dist[KK]=min((timelist[KK]-times[1:n.events[KK]])/(5.8^mags[1:n.events[KK]]))}
 
 The code needs the power series 5.8^n inside the loop, so it does the calculation every time, and the results are abandoned immediately after the expression is evaluated.
-All of these power values are then recalculated for the next value of KK. Apparently this can be improved. Take w58.list as an example: KK ranges from 1 to N, so only a power series of N items is needed, yet the code in the for loop evaluates the power about N*N/2 times. Considering that exponentiation is a pretty expensive operation, there is notable room for performance improvement here.
+All of these power values are then recalculated for the next value of KK. Take w58.list as an example: KK ranges from 1 to N, so only a power series of N items is needed, yet the code in the for loop evaluates the power about N*N/2 times. Considering that exponentiation is a pretty expensive operation that slows the processing down, there is notable room for improvement here. I modified this part by generating the power series before the loop and referring to the pre-generated series inside the loop instead of calculating it on the fly. Here is the new code:
+
+    # generate the power series once, up front
+    power_mags=5.8^mags
+    w58.list=sapply(c(1:n.training), function(x) {
+      min((times[1+x]-times[1:x])/(power_mags[1:x]))
+    })
+    w58.dist=sapply(c(1:length(timelist)), function(x) {
+      min((timelist[x]-times[1:n.events[x]])/(power_mags[1:n.events[x]]))
+    })
 
-Combining all of these efforts, I checked in an improved version that runs pretty much 2 times faster than the original code.
+By combining all of these efforts, the improved version I checked in runs pretty much 2 times faster than the original code.

From 36b24541e29a6575fefb541b88b2e83c6f3c1470 Mon Sep 17 00:00:00 2001
From: qi-zhang
Date: Sat, 30 Nov 2013 16:40:46 -0800
Subject: [PATCH 17/19] Update Week of Thanksgiving.md

---
 Week of Thanksgiving.md | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/Week of Thanksgiving.md b/Week of Thanksgiving.md
index becbdf3..530be1d 100644
--- a/Week of Thanksgiving.md
+++ b/Week of Thanksgiving.md
@@ -2,22 +2,22 @@
 Keep the Code Clean And Scalable
 
-I was working on etas-training.R this week. The Quaker team did a pretty good job on it: it runs smoothly out of the box, and the final plot is colorful and carries lots of information. There isn't much I can do to improve the plot, so I focused on improving the code itself by making it cleaner and faster.
+I've been working on etas-training.R this week. The Quaker team did a pretty good job on it: it runs smoothly out of the box, and the final plot is colorful and carries lots of information. There wasn't much I could do to improve the plot, so I focused on improving the code itself by making it neater and faster.
 
-The 1st thing I noticed is that the original code generates several large vectors but never uses them. Also, some other large vectors serve only as intermediate results for those unused vectors. I can definitely make the code shorter, cleaner and faster by removing any code related to these variables, with no impact on the final results.
+The 1st thing I noticed was that the original code generated several large vectors but never used them. Also, some other large vectors served only as intermediate results for those unused vectors, so I made the code shorter and faster by removing the code related to these variables, which didn't affect the final results.
 
-The 2nd thing I noticed is that the original code sources Luen's code, which contains lots of definitions. However, nothing defined in Luen's code is referred to in the following part, which implies the code may work fine without sourcing it. So I cleared all variables in the current workspace and reran the etas-training code without sourcing Luen's code. It turned out that there was no change to the final result.
+The 2nd thing I noticed was that the original code sourced Luen's code, which contains lots of definitions. However, the variables and functions defined in Luen's code were not used by etas-training, so I cleared all variables in the current workspace and reran the etas-training code without sourcing Luen's code. It turned out that the final results remained unchanged.
 
-The 3rd thing I noticed is that in the original code, vectors are declared as placeholders first and assigned later in a for loop. This approach works, but the better approach in R is to do it with the sapply function.
+The 3rd thing I noticed was that in the original code, vectors were declared as placeholders first and assigned later in a for loop. This approach works, but the better approach in R is to do it with the sapply function.
 
-Finally, I went through the code step by step and tried to improve the performance of the code.
-To me, reproducibility does not only mean sharing the code and the dataset with other people; it also means the code should be scalable enough to handle a different dataset. I soon noticed that it was possible to improve the code calculating the intermediate variables w58.list and w58.dist:
+Finally, I went through the code step by step and tried to improve the performance of the code. I think reproducibility does not only mean sharing the code and the dataset with other people; it also means the code is scalable enough to handle a different dataset.
+I noticed that it was possible to improve the code calculating the intermediate variables w58.list and w58.dist:
 
     for(KK in 1:length(w58.list)){
      w58.list[KK]=min((times[1+KK]-times[1:(KK)])/(5.8^mags[1:(KK)]))}
    for(KK in 1:length(timelist)){
      w58.dist[KK]=min((timelist[KK]-times[1:n.events[KK]])/(5.8^mags[1:n.events[KK]]))}
 
-The code needs the power series 5.8^n inside the loop, so it does the calculation every time, and the results are abandoned immediately after the expression is evaluated.
-All of these power values are then recalculated for the next value of KK. Take w58.list as an example: KK ranges from 1 to N, so only a power series of N items is needed, yet the code in the for loop evaluates the power about N*N/2 times. Considering that exponentiation is a pretty expensive operation that slows the processing down, there is notable room for improvement here. I modified this part by generating the power series before the loop and referring to the pre-generated series inside the loop instead of calculating it on the fly. Here is the new code:
+The code needed the power series 5.8^n inside the loop, so it did the calculation every time, and the results were abandoned immediately after the expression was evaluated.
+All of these power values were then recalculated for the next value of KK. Take w58.list as an example: KK ranges from 1 to N, so only a power series of N items is ever needed, yet in the code the power was evaluated about N*N/2 times. Considering that exponentiation is a pretty expensive operation that was slowing the processing down, I modified this part by generating the power series before the loop. Below is the new code:
 
     # generate the power series once, up front
     power_mags=5.8^mags
     w58.list=sapply(c(1:n.training), function(x) {
       min((times[1+x]-times[1:x])/(power_mags[1:x]))
     })
     w58.dist=sapply(c(1:length(timelist)), function(x) {
       min((timelist[x]-times[1:n.events[x]])/(power_mags[1:n.events[x]]))
     })
 
-By combining all of these efforts, the improved version I checked in runs pretty much 2 times faster than the original code.
+By combining all of these efforts, the improved code I checked in runs almost 2 times faster than the original code.

From 038489cafdbceb1416dbcacaec0b8d93a16ef9fd Mon Sep 17 00:00:00 2001
From: qi-zhang
Date: Sat, 30 Nov 2013 16:45:45 -0800
Subject: [PATCH 18/19] Update 2013-09-07.md

---
 2013-09-07.md | 31 ++++++++++++++++++++++++++++++-
 1 file changed, 30 insertions(+), 1 deletion(-)

diff --git a/2013-09-07.md b/2013-09-07.md
index 831a521..530be1d 100644
--- a/2013-09-07.md
+++ b/2013-09-07.md
@@ -1 +1,30 @@
-Empty Template
+Weekly Reflections for the week 11/24-11/30
+
+Keep the Code Clean And Scalable
+
+I've been working on etas-training.R this week. The Quaker team did a pretty good job on it: it runs smoothly out of the box, and the final plot is colorful and carries lots of information. There wasn't much I could do to improve the plot, so I focused on improving the code itself by making it neater and faster.
+
+The 1st thing I noticed was that the original code generated several large vectors but never used them. Also, some other large vectors served only as intermediate results for those unused vectors, so I made the code shorter and faster by removing the code related to these variables, which didn't affect the final results.
+
+The 2nd thing I noticed was that the original code sourced Luen's code, which contains lots of definitions. However, the variables and functions defined in Luen's code were not used by etas-training, so I cleared all variables in the current workspace and reran the etas-training code without sourcing Luen's code. It turned out that the final results remained unchanged.
+
+The 3rd thing I noticed was that in the original code, vectors were declared as placeholders first and assigned later in a for loop. This approach works, but the better approach in R is to do it with the sapply function.
+
+Finally, I went through the code step by step and tried to improve the performance of the code. I think reproducibility does not only mean sharing the code and the dataset with other people; it also means the code is scalable enough to handle a different dataset.
+I noticed that it was possible to improve the code calculating the intermediate variables w58.list and w58.dist:
+
+    for(KK in 1:length(w58.list)){
+      w58.list[KK]=min((times[1+KK]-times[1:(KK)])/(5.8^mags[1:(KK)]))}
+    for(KK in 1:length(timelist)){
+      w58.dist[KK]=min((timelist[KK]-times[1:n.events[KK]])/(5.8^mags[1:n.events[KK]]))}
+
+The code needed the power series 5.8^n inside the loop, so it did the calculation every time, and the results were abandoned immediately after the expression was evaluated.
+All of these power values were then recalculated for the next value of KK. Take w58.list as an example: KK ranges from 1 to N, so only a power series of N items is ever needed, yet in the code the power was evaluated about N*N/2 times. Considering that exponentiation is a pretty expensive operation that was slowing the processing down, I modified this part by generating the power series before the loop. Below is the new code:
+
+    # generate the power series once, up front
+    power_mags=5.8^mags
+    w58.list=sapply(c(1:n.training), function(x) {
+      min((times[1+x]-times[1:x])/(power_mags[1:x]))
+    })
+    w58.dist=sapply(c(1:length(timelist)), function(x) {
+      min((timelist[x]-times[1:n.events[x]])/(power_mags[1:n.events[x]]))
+    })
+
+By combining all of these efforts, the improved code I checked in runs almost 2 times faster than the original code.

From ed35a97073d3590fdd898da1ffdc095e933ab28f Mon Sep 17 00:00:00 2001
From: qi-zhang
Date: Sat, 7 Dec 2013 15:14:42 -0800
Subject: [PATCH 19/19] Weekly Reflection of Week 14

---
 2013-09-07.md | 42 ++++++++++++++++++------------------------
 1 file changed, 18 insertions(+), 24 deletions(-)

diff --git a/2013-09-07.md b/2013-09-07.md
index 530be1d..900a41f 100644
--- a/2013-09-07.md
+++ b/2013-09-07.md
@@ -1,30 +1,24 @@
-Weekly Reflections for the week 11/24-11/30
+Weekly Reflections for the week 12/1-12/7
 
-Keep the Code Clean And Scalable
+Let's Make It Even Better!
 
+I've been making further improvements to etas-training.R this week. I focused on the etas.CI function, which is widely used in this code and takes up most of the running time. The definition of etas.CI was:
+
+    # conditional intensity at a single time, given the preceding events
+    etas.CI <- function(time,t.events,mag.events,m0,mu,K,alpha,c,p){
+      mu+sum(K*10^(alpha*(mag.events-m0))/(time-t.events+c)^p)
+    }
+
+Clearly the most expensive operations are the power calculations inside the sapply iterations. Considering that there are close to 7,000 magnitude observations consisting of only 248 distinct values, I decided to replace the magnitude-related power calculation with a table lookup. Given a dataset of N records, etas.CI needs to do the power calculation roughly N*N/2 times across records; on the other hand, the worst earthquake ever recorded was only about magnitude 9.00, so the modified version needs to fill no more than about 900 table entries to achieve the same purpose. To get better performance, after reviewing the code I also rewrote the whole function to make it vector-oriented. Here is the implementation of the new function:
+
+    etas.CI2 <- function(time_vec, events_idx, m0, mu, K, alpha, c, p) {
+      min_mag=min(c(mags,m0))
+      # lookup table of 10^(alpha*(mag-m0)) values, one slot per 0.01 of magnitude
+      mag_power=rep(0,(max(mags)-min_mag)*100+1)
+      for(i in unique(sort(mags[1:max(events_idx)]))-min_mag) {
+        # the +1.5 makes truncation to an integer index robust to floating-point error
+        mag_power[100*i+1.5]=10^(alpha*(i+min_mag-m0))
+      }
+      sapply(c(1:length(time_vec)), function(x) {
+        mu+K*sum(mag_power[100*(mags[1:events_idx[x]]-min_mag)+1.5]/
+                 (time_vec[x]-times[1:events_idx[x]]+c)^p)
+      })
+    }
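+
+Assuming the calling convention above, a vector of evaluation times plus, for each time, the index of the last preceding event, the whole conditional-intensity curve can now be computed in one call; timelist and n.events are the names used for those earlier in this code:
+
+    # hypothetical usage sketch
+    CI.curve=etas.CI2(timelist, n.events, m0, mu, K, alpha, c, p)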
-I've been working on etas-training.R this week. The Quaker team did a pretty good job on it: it runs smoothly out of the box, and the final plot is colorful and carries lots of information. There wasn't much I could do to improve the plot, so I focused on improving the code itself by making it neater and faster.
-
-The 1st thing I noticed was that the original code generated several large vectors but never used them. Also, some other large vectors served only as intermediate results for those unused vectors, so I made the code shorter and faster by removing the code related to these variables, which didn't affect the final results.
-
-The 2nd thing I noticed was that the original code sourced Luen's code, which contains lots of definitions. However, the variables and functions defined in Luen's code were not used by etas-training, so I cleared all variables in the current workspace and reran the etas-training code without sourcing Luen's code. It turned out that the final results remained unchanged.
-
-The 3rd thing I noticed was that in the original code, vectors were declared as placeholders first and assigned later in a for loop. This approach works, but the better approach in R is to do it with the sapply function.
-
-Finally, I went through the code step by step and tried to improve the performance of the code. I think reproducibility does not only mean sharing the code and the dataset with other people; it also means the code is scalable enough to handle a different dataset.
-I noticed that it was possible to improve the code calculating the intermediate variables w58.list and w58.dist:
-
-    for(KK in 1:length(w58.list)){
-      w58.list[KK]=min((times[1+KK]-times[1:(KK)])/(5.8^mags[1:(KK)]))}
-    for(KK in 1:length(timelist)){
-      w58.dist[KK]=min((timelist[KK]-times[1:n.events[KK]])/(5.8^mags[1:n.events[KK]]))}
-
-The code needed the power series 5.8^n inside the loop, so it did the calculation every time, and the results were abandoned immediately after the expression was evaluated.
-All of these power values were then recalculated for the next value of KK. Take w58.list as an example: KK ranges from 1 to N, so only a power series of N items is ever needed, yet in the code the power was evaluated about N*N/2 times. Considering that exponentiation is a pretty expensive operation that was slowing the processing down, I modified this part by generating the power series before the loop. Below is the new code:
-
-    # generate the power series once, up front
-    power_mags=5.8^mags
-    w58.list=sapply(c(1:n.training), function(x) {
-      min((times[1+x]-times[1:x])/(power_mags[1:x]))
-    })
-    w58.dist=sapply(c(1:length(timelist)), function(x) {
-      min((timelist[x]-times[1:n.events[x]])/(power_mags[1:n.events[x]]))
-    })
-
-By combining all of these efforts, the improved code I checked in runs almost 2 times faster than the original code.