From 4aa3f3aa097a785ccb18bebc30af440a826e0a3e Mon Sep 17 00:00:00 2001
From: qi-zhang
Date: Wed, 11 Sep 2013 22:45:57 -0700
Subject: [PATCH 01/11] For 1st week.

---
 2013-09-07.md | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/2013-09-07.md b/2013-09-07.md
index e69de29..e18a077 100644
--- a/2013-09-07.md
+++ b/2013-09-07.md
@@ -0,0 +1,11 @@
+Reflection for the week 9/1 to 9/7.
+
+It was pretty challenging to get ssh access to the Ubuntu server running in my VirtualBox VM.
+
+By default there is only one network adapter, “Adapter 1”, and it was attached to “NAT”. That configuration makes it impossible to reach the box at its IP address, let alone establish ssh access.
+
+I re-configured the adapter to attach to “Bridged Adapter” and was able to establish ssh access. I soon found that was not a good idea, though, since the box’s IP address kept changing every time I restarted it.
+
+Finally I set “Adapter 1” back to “NAT”, enabled “Adapter 2”, and attached it to “Host-only Adapter”. With help from the server guide on Ubuntu’s website (https://help.ubuntu.com/13.04/serverguide/network-configuration.html), I got the box to bring up “Adapter 2”, and now I can establish ssh access to it.
+
+It was pretty challenging for me since this was my first time learning some networking terms, but I’m glad I did it.

From aba31eeafe43e1c3662b191234c42b4d2ea79fb5 Mon Sep 17 00:00:00 2001
From: qi-zhang
Date: Wed, 11 Sep 2013 22:49:58 -0700
Subject: [PATCH 02/11] Update 2013-09-07.md

---
 2013-09-07.md | 12 +-----------
 1 file changed, 1 insertion(+), 11 deletions(-)

diff --git a/2013-09-07.md b/2013-09-07.md
index e18a077..831a521 100644
--- a/2013-09-07.md
+++ b/2013-09-07.md
@@ -1,11 +1 @@
-Reflection for the week 9/1 to 9/7.
-
-It was pretty challenging to get ssh access to the Ubuntu server running in my VirtualBox VM.
-
-By default there is only one network adapter, “Adapter 1”, and it was attached to “NAT”. That configuration makes it impossible to reach the box at its IP address, let alone establish ssh access.
-
-I re-configured the adapter to attach to “Bridged Adapter” and was able to establish ssh access. I soon found that was not a good idea, though, since the box’s IP address kept changing every time I restarted it.
-
-Finally I set “Adapter 1” back to “NAT”, enabled “Adapter 2”, and attached it to “Host-only Adapter”. With help from the server guide on Ubuntu’s website (https://help.ubuntu.com/13.04/serverguide/network-configuration.html), I got the box to bring up “Adapter 2”, and now I can establish ssh access to it.
-
-It was pretty challenging for me since this was my first time learning some networking terms, but I’m glad I did it.
+Empty Template

From cfff69f2796dfd9d9dab896972b34abdbc4a7716 Mon Sep 17 00:00:00 2001
From: qi-zhang
Date: Sat, 19 Oct 2013 19:43:03 -0700
Subject: [PATCH 03/11] Create week7

---
 week7 | 12 ++++++++++++
 1 file changed, 12 insertions(+)
 create mode 100644 week7

diff --git a/week7 b/week7
new file mode 100644
index 0000000..635ad14
--- /dev/null
+++ b/week7
@@ -0,0 +1,12 @@
+
+Weekly Reflections for the week 10/13-10/19
+
+Working with JSON in Python
+
+JSON is a format for hierarchical data. A JSON file reads like a flexible combination of Python's dict and list objects. Hierarchical means the same field name can be reused across the data set as long as it is nested at a different level, which is why it is not a good idea to search hierarchical data files for information with regular expressions. To find a particular piece of information, it is necessary to figure out the structure of the JSON file first.
+
+In terms of structure, a JSON file is similar to a dict object in Python, but its values can be any flexible combination of dict and list objects. Naturally, we can leverage dict and list operations to parse a JSON file.
+
+The first step is to parse the JSON text with json.loads, a function defined in the json module, which returns the corresponding Python object. For example,
+
+import json
+import urllib
+
+detail=json.loads(urllib.urlopen("http://earthquake.usgs.gov/product/nearby-cities/ci11380834/us/1382197630296/nearby-cities.json").read())

From ad4654b2dd592c15068b90f4b57eee3b49ec0d7a Mon Sep 17 00:00:00 2001
From: qi-zhang
Date: Sat, 19 Oct 2013 20:23:43 -0700
Subject: [PATCH 04/11] Delete week7

---
 week7 | 12 ------------
 1 file changed, 12 deletions(-)
 delete mode 100644 week7

diff --git a/week7 b/week7
deleted file mode 100644
index 635ad14..0000000
--- a/week7
+++ /dev/null
@@ -1,12 +0,0 @@
-
-Weekly Reflections for the week 10/13-10/19
-
-Working with JSON in Python
-
-JSON is a format for hierarchical data. A JSON file reads like a flexible combination of Python's dict and list objects. Hierarchical means the same field name can be reused across the data set as long as it is nested at a different level, which is why it is not a good idea to search hierarchical data files for information with regular expressions. To find a particular piece of information, it is necessary to figure out the structure of the JSON file first.
-
-In terms of structure, a JSON file is similar to a dict object in Python, but its values can be any flexible combination of dict and list objects. Naturally, we can leverage dict and list operations to parse a JSON file.
-
-The first step is to parse the JSON text with json.loads, a function defined in the json module, which returns the corresponding Python object. For example,
-
-import json
-import urllib
-
-detail=json.loads(urllib.urlopen("http://earthquake.usgs.gov/product/nearby-cities/ci11380834/us/1382197630296/nearby-cities.json").read())

From 77c0c1d1298fcd9b1ad552fa4cda04ba1bc964de Mon Sep 17 00:00:00 2001
From: qi-zhang
Date: Mon, 11 Nov 2013 14:45:50 -0800
Subject: [PATCH 05/11] Create Week 10

---
 Week 10 | 4 ++++
 1 file changed, 4 insertions(+)
 create mode 100644 Week 10

diff --git a/Week 10 b/Week 10
new file mode 100644
index 0000000..348a81e
--- /dev/null
+++ b/Week 10
@@ -0,0 +1,4 @@
+
+Weekly Reflections for the week 11/3-11/9
+
+Data Clean Up
+

From b178cc998b78f0912ebf12f760e5f9492644852a Mon Sep 17 00:00:00 2001
From: qi-zhang
Date: Mon, 11 Nov 2013 14:46:44 -0800
Subject: [PATCH 06/11] Delete Week 10

---
 Week 10 | 4 ----
 1 file changed, 4 deletions(-)
 delete mode 100644 Week 10

diff --git a/Week 10 b/Week 10
deleted file mode 100644
index 348a81e..0000000
--- a/Week 10
+++ /dev/null
@@ -1,4 +0,0 @@
-
-Weekly Reflections for the week 11/3-11/9
-
-Data Clean Up
-

From dc2f30787432f147ff167e534f718cb3cad4f5c6 Mon Sep 17 00:00:00 2001
From: qi-zhang
Date: Mon, 11 Nov 2013 14:47:01 -0800
Subject: [PATCH 07/11] Create Week 10.md

---
 Week 10.md | 4 ++++
 1 file changed, 4 insertions(+)
 create mode 100644 Week 10.md

diff --git a/Week 10.md b/Week 10.md
new file mode 100644
index 0000000..348a81e
--- /dev/null
+++ b/Week 10.md
@@ -0,0 +1,4 @@
+
+Weekly Reflections for the week 11/3-11/9
+
+Data Clean Up
+

From 281b92d09f0b86d702d2e0727c9adc43a18052c7 Mon Sep 17 00:00:00 2001
From: qi-zhang
Date: Mon, 11 Nov 2013 16:25:00 -0800
Subject: [PATCH 08/11] Update Week 10.md

---
 Week 10.md | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/Week 10.md b/Week 10.md
index 348a81e..ab66790 100644
--- a/Week 10.md
+++ b/Week 10.md
@@ -1,4 +1,25 @@
 
 Weekly Reflections for the week 11/3-11/9
 
-Data Clean Up
+Requirements for Data Cleaning up
+The ultimate goal of data clean-up is to give the analyzer an easy way to access the information in whatever data structure they prefer. The procedure for generating such a data structure can be very detail-oriented, and naturally we want to provide a general way for anyone who wants to use it.
+
+Sometimes taking care of missing data is not the end of the story. The original data might also contain lots of useless information, even noise, which needs further clean-up. It would be great to let analyzers selectively choose data based on their own criteria. Since the criteria are decided by the analyzer rather than the data curator, they cannot be hard-coded in the data clean-up code. This is one of the road blocks before we can move on further.
+
+My solution is to let the analyzer pass a function that decides whether each record should be kept when they call the curator-provided tool that generates the data structure.
+
+The following pseudocode is based on R.
+
+    # The data curator's code
+    flexible_data_generator=function(original_data, select_function=NA) {
+      export_data = original_data
+      if(is.function(select_function)) {
+        keep=select_function(year=original_data$year, mag=original_data$mag, ...)
+        export_data = original_data[ keep, ]
+      }
+
+      # generate ppx data based on export_data
+      ...
+    }
+
+There is no need for the analyzer to implement their own select_function, but they can if they want to apply certain rules to the original data.

From f69dc8df47583f192150f9f8fc5ca7d7df3251a1 Mon Sep 17 00:00:00 2001
From: qi-zhang
Date: Mon, 11 Nov 2013 20:52:13 -0800
Subject: [PATCH 09/11] Update Week 10.md

---
 Week 10.md | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/Week 10.md b/Week 10.md
index ab66790..1066ac9 100644
--- a/Week 10.md
+++ b/Week 10.md
@@ -2,6 +2,8 @@
 
 Requirements for Data Cleaning up
 
+Sorry for being a little late submitting this weekly reflection. I hope it is not too late.
+
 The ultimate goal of data clean-up is to give the analyzer an easy way to access the information in whatever data structure they prefer. The procedure for generating such a data structure can be very detail-oriented, and naturally we want to provide a general way for anyone who wants to use it.
@@ -22,4 +24,8 @@
       ...
     }
 
+    # The analyzer's code
+    myselect=function(year, mag, ...) {...}
+    flexible_data_generator(clean_data, myselect)
+
 There is no need for the analyzer to implement their own select_function, but they can if they want to apply certain rules to the original data.

From 09d5f1505d428d98ea05f95b9ae52de683d506ff Mon Sep 17 00:00:00 2001
From: qi-zhang
Date: Mon, 11 Nov 2013 20:54:34 -0800
Subject: [PATCH 10/11] Update 2013-09-07.md

---
 2013-09-07.md | 32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/2013-09-07.md b/2013-09-07.md
index 831a521..1066ac9 100644
--- a/2013-09-07.md
+++ b/2013-09-07.md
@@ -1 +1,31 @@
-Empty Template
+
+Weekly Reflections for the week 11/3-11/9
+
+Requirements for Data Cleaning up
+
+Sorry for being a little late submitting this weekly reflection. I hope it is not too late.
+
+The ultimate goal of data clean-up is to give the analyzer an easy way to access the information in whatever data structure they prefer. The procedure for generating such a data structure can be very detail-oriented, and naturally we want to provide a general way for anyone who wants to use it.
+
+Sometimes taking care of missing data is not the end of the story. The original data might also contain lots of useless information, even noise, which needs further clean-up. It would be great to let analyzers selectively choose data based on their own criteria. Since the criteria are decided by the analyzer rather than the data curator, they cannot be hard-coded in the data clean-up code. This is one of the road blocks before we can move on further.
+
+My solution is to let the analyzer pass a function that decides whether each record should be kept when they call the curator-provided tool that generates the data structure.
+
+The following pseudocode is based on R.
+
+    # The data curator's code
+    flexible_data_generator=function(original_data, select_function=NA) {
+      export_data = original_data
+      if(is.function(select_function)) {
+        keep=select_function(year=original_data$year, mag=original_data$mag, ...)
+        export_data = original_data[ keep, ]
+      }
+
+      # generate ppx data based on export_data
+      ...
+    }
+
+    # The analyzer's code
+    myselect=function(year, mag, ...) {...}
+    flexible_data_generator(clean_data, myselect)
+
+There is no need for the analyzer to implement their own select_function, but they can if they want to apply certain rules to the original data.
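The curator/analyzer split in the R pseudocode above can be sketched in Python as well. This is an illustrative transliteration, not the author's actual code: the names flexible_data_generator and myselect mirror the R sketch, and the year/mag record layout is invented for the demo.

```python
# Curator-side helper: keep only the records the analyzer's predicate accepts.
# With no predicate, the data passes through untouched.
def flexible_data_generator(original_data, select_function=None):
    export_data = original_data
    if callable(select_function):
        # The predicate sees each record's fields and returns True to keep it.
        export_data = [row for row in original_data if select_function(**row)]
    # ...generate the exported data structure based on export_data...
    return export_data

# Analyzer-side code: a custom selection rule, like myselect in the R sketch.
def myselect(year, mag, **rest):
    return year >= 2000 and mag >= 5.0

records = [
    {"year": 1998, "mag": 6.1},
    {"year": 2004, "mag": 4.2},
    {"year": 2011, "mag": 6.8},
]
print(flexible_data_generator(records, myselect))
# [{'year': 2011, 'mag': 6.8}]
```

As in the R version, the selection criteria live entirely in the analyzer's function, so the curator's code never needs to hard-code them.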
From 7099cbe65c17e219199bc6700c24e10b2dbe91d2 Mon Sep 17 00:00:00 2001
From: qi-zhang
Date: Mon, 11 Nov 2013 20:56:40 -0800
Subject: [PATCH 11/11] Delete 2013-09-07.md

---
 2013-09-07.md | 31 -------------------------------
 1 file changed, 31 deletions(-)
 delete mode 100644 2013-09-07.md

diff --git a/2013-09-07.md b/2013-09-07.md
deleted file mode 100644
index 1066ac9..0000000
--- a/2013-09-07.md
+++ /dev/null
@@ -1,31 +0,0 @@
-
-Weekly Reflections for the week 11/3-11/9
-
-Requirements for Data Cleaning up
-
-Sorry for being a little late submitting this weekly reflection. I hope it is not too late.
-
-The ultimate goal of data clean-up is to give the analyzer an easy way to access the information in whatever data structure they prefer. The procedure for generating such a data structure can be very detail-oriented, and naturally we want to provide a general way for anyone who wants to use it.
-
-Sometimes taking care of missing data is not the end of the story. The original data might also contain lots of useless information, even noise, which needs further clean-up. It would be great to let analyzers selectively choose data based on their own criteria. Since the criteria are decided by the analyzer rather than the data curator, they cannot be hard-coded in the data clean-up code. This is one of the road blocks before we can move on further.
-
-My solution is to let the analyzer pass a function that decides whether each record should be kept when they call the curator-provided tool that generates the data structure.
-
-The following pseudocode is based on R.
-
-    # The data curator's code
-    flexible_data_generator=function(original_data, select_function=NA) {
-      export_data = original_data
-      if(is.function(select_function)) {
-        keep=select_function(year=original_data$year, mag=original_data$mag, ...)
-        export_data = original_data[ keep, ]
-      }
-
-      # generate ppx data based on export_data
-      ...
-    }
-
-    # The analyzer's code
-    myselect=function(year, mag, ...) {...}
-    flexible_data_generator(clean_data, myselect)
-
-There is no need for the analyzer to implement their own select_function, but they can if they want to apply certain rules to the original data.
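Circling back to the week-7 notes on JSON: the figure-out-the-structure-then-index approach can be sketched without the network call. The feed layout below is a stand-in for the text urllib would fetch; the field names are invented for illustration, not the actual USGS nearby-cities schema.

```python
import json

# Stand-in for the downloaded feed text; a real feed is parsed the same way.
raw = '''
[
  {"distance": 12, "direction": "NNE", "name": "Somewhere, CA"},
  {"distance": 40, "direction": "W",   "name": "Elsewhere, CA"}
]
'''

# json.loads turns the text into nested dict/list objects...
cities = json.loads(raw)

# ...which are then navigated by index and key instead of regular expressions.
nearest = cities[0]
print(nearest["name"], nearest["distance"])
# Somewhere, CA 12
```

Note that here the top-level object is a list, not a dict, which is exactly why it pays to inspect the structure before writing the lookup code.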