From c996dfa072aeeec5d0a06addcc856df969b63f96 Mon Sep 17 00:00:00 2001
From: Justin Gosses
Date: Sun, 15 Sep 2019 18:41:29 -0500
Subject: [PATCH 01/28] Create perspectives.md

---
 perspectives.md | 6 ++++++
 1 file changed, 6 insertions(+)
 create mode 100644 perspectives.md

diff --git a/perspectives.md b/perspectives.md
new file mode 100644
index 0000000..c442b45
--- /dev/null
+++ b/perspectives.md
@@ -0,0 +1,6 @@
+Purpose of this document:
+This markdown file is to certain extent a proposal.
+
+After reading the initial file in this repository called "open-data-guidelines.md", I made some issues here and here.
+
+After further evaluation.

From d2f3ba6d7f7252a03c0acbed0cb9322552c7e501 Mon Sep 17 00:00:00 2001
From: Justin Gosses
Date: Sun, 15 Sep 2019 18:41:51 -0500
Subject: [PATCH 02/28] Update perspectives.md

---
 perspectives.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/perspectives.md b/perspectives.md
index c442b45..92998eb 100644
--- a/perspectives.md
+++ b/perspectives.md
@@ -1,6 +1,6 @@
 Purpose of this document:
 This markdown file is to certain extent a proposal.
 
-After reading the initial file in this repository called "open-data-guidelines.md", I made some issues here and here.
+After reading the initial file in this repository called "open-data-guidelines.md", I made some issues here and here.
 
 After further evaluation.

From e54d3448c9f7b0da78bd8a984e8663f8d20b0fc3 Mon Sep 17 00:00:00 2001
From: Justin Gosses
Date: Sun, 15 Sep 2019 20:33:06 -0500
Subject: [PATCH 03/28] Update perspectives.md

---
 perspectives.md | 49 ++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 46 insertions(+), 3 deletions(-)

diff --git a/perspectives.md b/perspectives.md
index 92998eb..24e547f 100644
--- a/perspectives.md
+++ b/perspectives.md
@@ -1,6 +1,49 @@
-Purpose of this document:
-This markdown file is to certain extent a proposal.
+### Purpose of this document:
+This markdown file is to certain extent a proposal for what to include in this repository.
+
 
+### Introduction
 After reading the initial file in this repository called "open-data-guidelines.md", I made some issues here and here.
 
-After further evaluation.
+After further evaluation, the thing that was bugging me was that the document was written directed at the supplier of a dataset. While that is the obvious place to start, it is worth thinking about a range of personas, multiple use cases, and different possible future states.
+
+## Brainstorm of Personas, Use-cases, Users, Evolutions of Requirements
+
+### Builders & Maintainers of the website that hosts and presents that data to end-users
+These would be the people with the ability to make changes to https://dataunderground.org/dataset.
+
+Some characteristics one could use to describe them that might matter would be:
+1. The amount of time they want to / can contribute.
+2. The degree to which they choose to accept edits, code, or other types of control from outside the initial group.
+3. Skills, particularly those that affect tech stack.
+
+### Users
+Highest level breakdown of users:
+1. Suppliers of the datasets
+2. Maintainers of the website (perhaps the most active users!)
+3. End-users consuming datasets
+
+#### A few personas of users:
+1. Curious about what data is available but no end goal, project or datasets in mind.
+ 1. a. Goes to first page and that's it if not interested there.
+ 1. b. Spends <4 minutes browsing and then bails if nothing in line with interests.
+ 1. c. Will read details of multiple datasets and maybe download 1 or 2 max.
+ 1. d. Will read details on every dataset up to first 50 and downloads 4 or 10 to explore in more depth.
+2. Geologist with high level of programming skills. Interested in seismic data related to automatic fault picking.
+
+#### Use-cases
+1. Geologists new to programming who want to try standard data science packages like Pandas but want to use geoscience data.
+2. Geologists new to programming who want to try standard data science packages like Pandas but want to use geoscience data.
+3. Geologists new to programming who want to try standard data science packages like Pandas but want to use geoscience data.
+4. Geologists new to programming who want to try standard data science packages like Pandas but want to use geoscience data.
+
+#### Suppliers of datasets
+
+
+#### Level of hosting
+
+#### Add-ons
+
+#### End-users
+
+#### Intermediate-users

From 86e6d775542dce7116567cdcd1c454436bd0a0ef Mon Sep 17 00:00:00 2001
From: Justin Gosses
Date: Sun, 15 Sep 2019 20:57:05 -0500
Subject: [PATCH 04/28] Update perspectives.md

---
 perspectives.md | 44 ++++++++++++++++++++++++++++++--------------
 1 file changed, 30 insertions(+), 14 deletions(-)

diff --git a/perspectives.md b/perspectives.md
index 24e547f..77059f9 100644
--- a/perspectives.md
+++ b/perspectives.md
@@ -25,25 +25,41 @@ Highest level breakdown of users:
 
 #### A few personas of users:
 1. Curious about what data is available but no end goal, project or datasets in mind.
- 1. a. Goes to first page and that's it if not interested there.
- 1. b. Spends <4 minutes browsing and then bails if nothing in line with interests.
- 1. c. Will read details of multiple datasets and maybe download 1 or 2 max.
- 1. d. Will read details on every dataset up to first 50 and downloads 4 or 10 to explore in more depth.
+ a. Goes to first page and that's it if not interested there.
+ b. Spends <4 minutes browsing and then bails if nothing in line with interests.
+ c. Will read details of multiple datasets and maybe download 1 or 2 max.
+ d. Will read details on every dataset up to first 50 and downloads 4 or 10 to explore in more depth.
 2. Geologist with high level of programming skills. Interested in seismic data related to automatic fault picking.
+3. Geologist with low level of programming skills. Interested in facies prediction. Wants to move example work from an open-dataset to be applied to their own internal well log data.
+4. Geophysicist who wants to try out unsupervised machine-learning package that he's seen used recently on geology data and is curious what open data could used for that sort of approach.
 
 #### Use-cases
-1. Geologists new to programming who want to try standard data science packages like Pandas but want to use geoscience data.
-2. Geologists new to programming who want to try standard data science packages like Pandas but want to use geoscience data.
-3. Geologists new to programming who want to try standard data science packages like Pandas but want to use geoscience data.
-4. Geologists new to programming who want to try standard data science packages like Pandas but want to use geoscience data.
+1. Demo datasets in code packages.
+2. Hackathon datasets.
+3. Datasets to go with open-source code. Users then make changes to the code but keep using the dataset.
+4. Various type of datascience: visualization, data exploration, sql practice, unsupervized learning, supervised learning, classification, regression prediction, etc.
 
-#### Suppliers of datasets
+#### Suppliers of datasets / who might have edit rights in some form?
+1. Owners of original datasets who also collected the data.
+2. People who have rights to make the dataset public but didn't collect it originally or prep it.
+3. People who found the dataset with an open-source license elsewhere and want to suggest it for inclusion on data-underground.
+4. People who want to add a correction to the already published dataset.
+5. People who want to supply some additional notes to the dataset (for example which logs have bad formatting and won't load easily via LASIO).
+6. People who want to add or link to code to load the dataset or work with the dataset in some way.
+7. People who found a paper that references the dataset but don't own either the dataset or the paper.
 
-
-#### Level of hosting
+#### Level of dataset hosting
+here = data-underground
+1. This dataset exists only here.
+2. This dataset is uploaded here but originally lived in this other place.
+3. This dataset is referenced here but still lives in its originally place here.
+4. This dataset is referenced here but has been automatically harvested from this site.
+4. This dataset is referenced here but to download it you have to go to this other site.
 
 #### Add-ons
+For:
+- Discoverability
+- programmatic creation of meta-data from descriptions
+- end-user creation of new meta-data
+- end-user created code to use with the dataset
 
+
-#### End-users
-
-#### Intermediate-users

From 0a37b62324fd98d851a775324732913c813970a3 Mon Sep 17 00:00:00 2001
From: Justin Gosses
Date: Sun, 15 Sep 2019 20:57:32 -0500
Subject: [PATCH 05/28] Update perspectives.md

---
 perspectives.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/perspectives.md b/perspectives.md
index 77059f9..6484e27 100644
--- a/perspectives.md
+++ b/perspectives.md
@@ -1,3 +1,5 @@
+# DRAFT
+
 ### Purpose of this document:
 This markdown file is to certain extent a proposal for what to include in this repository.
 

From 7799d38fcd344826e7c9805476eec64e15a99118 Mon Sep 17 00:00:00 2001
From: Justin Gosses
Date: Mon, 16 Sep 2019 21:55:03 -0500
Subject: [PATCH 06/28] Update perspectives.md

---
 perspectives.md | 43 +++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 41 insertions(+), 2 deletions(-)

diff --git a/perspectives.md b/perspectives.md
index 6484e27..f41a503 100644
--- a/perspectives.md
+++ b/perspectives.md
@@ -1,8 +1,47 @@
 # DRAFT
+## Introduction
 
-### Purpose of this document:
-This markdown file is to certain extent a proposal for what to include in this repository.
 
+#### What is this file?
+This markdown file is something akin to a slack comment in purpose, but way to long for that format, so I'm putting it here. I apologize for the rambling. This is in part an attempt to collect my thoughts.
+
+#### Why I'm writing this?
+Basically, the original document in this repo was written towards dataset suppliers, and while that's a worth goal, it bugged me a bit for reasons I initially had a hard time narrowing down. Eventually, I decided the things I felt were left out all hard to do with the fact that the focus was overwelmingly on the dataset.
+
+#### Open-data Sites & Constraints On Users
+A focus on the characteristics datasets should have makes sense, of course. However, I'm less bothered the characteristics of individual datasets and more by my experiences, or other peoples' experiences, of trying to work with open-datasets in aggregate. Searching through them, evaluating them, organizing them, and aggregating them is often very difficult due to constraints built in place early. Sometimes constraints occur, because certain metadata wasn't encouraged. Other times sites lack certain filtering capability. Other times aspects of the datasets are not programmatically accessible.
+
+These concerns are less obvious with only 17 datasets on https://dataunderground.org/dataset as of today. These problems appear more as the number of individual datasets grows greater than users' willingess to read through all of them.
+
+#### Background That Informs My Requirements/Needs
+First, I co-lead the houston data visualization meetup, which means I spend some time searching for good datasets people would enjoy visualizing during our datajams over the course of about 4 hours. There's a lot of potential open-data sites out there. Second, I help maintain data.nasa.gov, which has approximately 40,000 datasets and gets harvested into data.gov, which has almost a quarter of a million.
+
+#### Minimizing Time-To-Start is Maximizing Use Rate
+A hypothesis I have based on my own experiences and, to be completely honest, not based on any real data whatsoever, is that a lot of the most used open-data is just the easiest to use. A lot of datasets are hard to discover, and "discover" is often a more accurate word than "find" as a significant amount of use of open-data comes from people who didn't already know a dataset existed except through the open-data site or someone else who found it on the open-data site.
+
+To maximize the rate of discoverability, you need to make the amount of time to get there shorter. This requires the ability to sort and filter datasets in ways the correlate with user needs.
+
+#### Discoverability Problems That Occur With More Datasets on a Site
+
+##### A. How do you find datasets based on task?
+Some users won't care about the dataset content so much as they know it has labels and can be modeled as a time series problem. How do they find all the datasets that meet that definition?
+
+##### B. How do you find datasets based on data format or data structure?
+Some people will want LAS 2.0 well logs. Others will absolutely need well with well paths included, which LAS file formats won't have.
+
+##### C. Can you find all the versions of a dataset?
+If a user stumbles upon a preprocessed dataset like this one will they also see that there is an original dataset here"open-data-guidelines.md", I made some issues here and here.
From b977d4e55c7d1358bd44e661eabd6018c4f8dabe Mon Sep 17 00:00:00 2001
From: Justin Gosses
Date: Mon, 16 Sep 2019 22:30:21 -0500
Subject: [PATCH 07/28] Update perspectives.md

---
 perspectives.md | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/perspectives.md b/perspectives.md
index f41a503..d8585b8 100644
--- a/perspectives.md
+++ b/perspectives.md
@@ -38,7 +38,23 @@ Some users will be interested in calculating petrophysical variables in Python a
 #### Tags and Flexibility and Who Does the Work?
 The obvious solution to many of the discoverability problems above is tags. One of the problems with tags is that they're typically applied by humans with differing perspectives on what's important. Sometimes even with the same perspective on what's important, they might phrase things differently. What this often leads to is partial coverage of datasets that could fall under any individual tag being actually tagged with that keyword.
 
-A few different solutions exist to the tagging issues highlighted above. First, there can be a defined list of tags with certain mandatory key-value pairs that dataset suppliers have to fill out. Second, there can a heoric website manager, or assigned intern, who goes through and re-tags every dataset. Third, if lengthy enough text description of each item exists, automated text tagging of topics can augment human-generated tags. Fourth, the community that users the dataset can tag datasets as they evaluate & use them, potentially according to an ontology that develops over time and is maintained in a central location. For example, there might be a characteristic about well logs, like whether they have tops, that is deemed interestingly enough that it is flagged by the group as something all datasets with well logs should be re-tagged with. This requires an active community and may not be possible if the open-data site is harvesting hundreds or thousands of datasets from other open-data sites.
+A few different solutions exist to the tagging issues highlighted above.
+
+First, there can be a defined list of tags with certain mandatory key-value pairs that dataset suppliers have to fill out.
+
+Second, there can a heoric website manager, or assigned intern, who goes through and re-tags every dataset.
+
+Third, if lengthy enough text description of each item exists, automated text tagging of topics can augment human-generated tags.
+
+Fourth, the community that users the dataset can tag datasets as they evaluate & use them, potentially according to an ontology that develops over time and is maintained in a central location. For example, there might be a characteristic about well logs, like whether they have tops, that is deemed interestingly enough that it is flagged by the group as something all datasets with well logs should be re-tagged with. This requires an active community and may not be possible if the open-data site is harvesting hundreds or thousands of datasets from other open-data sites.
+
+Fifth, there isn't a good way to discovery datasets that meet a range of criteria except through a lot of work reading descriptions and downloading datasets.
+
+#### Comments on the Solutions Listed Above
+The first, second, third, and fitch options don't require anyone to have permissions to make edits to datasets except the dataset suppliers and maybe the website managers. The fourth, however, requires everyone to have edit access. This enables a lot of flexibility, especially going into the future after a dataset is initially uploaded, but has the downside of potentially scaring dataset suppliers.
+
+#### A Back-of-napkin ontology just for example sake
+
 
 ======================================================================
 ## THINGS I WROTE BEFORE WHEN I WAS TOO TIRED AND NEED TO REWRITE

From 7d0bbcc550a0285d659c028471ceb3094ad2db5c Mon Sep 17 00:00:00 2001
From: Justin Gosses
Date: Sat, 21 Sep 2019 21:17:56 -0500
Subject: [PATCH 08/28] Update perspectives.md

---
 perspectives.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/perspectives.md b/perspectives.md
index d8585b8..6241549 100644
--- a/perspectives.md
+++ b/perspectives.md
@@ -6,7 +6,7 @@
 This markdown file is something akin to a slack comment in purpose, but way to long for that format, so I'm putting it here. I apologize for the rambling. This is in part an attempt to collect my thoughts.
 
 #### Why I'm writing this?
-Basically, the original document in this repo was written towards dataset suppliers, and while that's a worth goal, it bugged me a bit for reasons I initially had a hard time narrowing down. Eventually, I decided the things I felt were left out all hard to do with the fact that the focus was overwelmingly on the dataset.
+The original document in this repo was written towards dataset suppliers, and while that of course makes sense, it bugged me a bit for reasons I initially had a hard time narrowing down. Eventually, I decided the things I felt were left out all hard to do with the fact that the focus was on the dataset which left out considering the site itself and the community around it.
 
 #### Open-data Sites & Constraints On Users
 A focus on the characteristics datasets should have makes sense, of course. However, I'm less bothered the characteristics of individual datasets and more by my experiences, or other peoples' experiences, of trying to work with open-datasets in aggregate. Searching through them, evaluating them, organizing them, and aggregating them is often very difficult due to constraints built in place early. Sometimes constraints occur, because certain metadata wasn't encouraged. Other times sites lack certain filtering capability. Other times aspects of the datasets are not programmatically accessible.

From 0d119d343f108cbe2216773e229c1109d68d1d91 Mon Sep 17 00:00:00 2001
From: Justin Gosses
Date: Sat, 21 Sep 2019 21:21:22 -0500
Subject: [PATCH 09/28] Update perspectives.md

---
 perspectives.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/perspectives.md b/perspectives.md
index 6241549..4db7bda 100644
--- a/perspectives.md
+++ b/perspectives.md
@@ -6,7 +6,9 @@
 This markdown file is something akin to a slack comment in purpose, but way to long for that format, so I'm putting it here. I apologize for the rambling. This is in part an attempt to collect my thoughts.
 
 #### Why I'm writing this?
-The original document in this repo was written towards dataset suppliers, and while that of course makes sense, it bugged me a bit for reasons I initially had a hard time narrowing down. Eventually, I decided the things I felt were left out all hard to do with the fact that the focus was on the dataset which left out considering the site itself and the community around it.
+After I reading the original document in this repo titled open-data-guidelines.md, I realized it bugged me a bit for reasons I had a hard time narrowing down. After giving it some thought, I realized I felt there were thigns left out due to the focus on the dataset.
+
+The focus on the dataset makes sense, of course, but I think you get to a better place but not just asking questions about what characteristics the dataset should have but also evaluating the site that hosts the dataset as well as the different types of members in the community around it.
 
 #### Open-data Sites & Constraints On Users
 A focus on the characteristics datasets should have makes sense, of course. However, I'm less bothered the characteristics of individual datasets and more by my experiences, or other peoples' experiences, of trying to work with open-datasets in aggregate. Searching through them, evaluating them, organizing them, and aggregating them is often very difficult due to constraints built in place early. Sometimes constraints occur, because certain metadata wasn't encouraged. Other times sites lack certain filtering capability. Other times aspects of the datasets are not programmatically accessible.

From f9401fea3fdd35a4a0d8cf94717a5aa60d9d05e8 Mon Sep 17 00:00:00 2001
From: Justin Gosses
Date: Sat, 21 Sep 2019 21:23:35 -0500
Subject: [PATCH 10/28] Update perspectives.md

---
 perspectives.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/perspectives.md b/perspectives.md
index 4db7bda..7495ae5 100644
--- a/perspectives.md
+++ b/perspectives.md
@@ -6,10 +6,12 @@
 This markdown file is something akin to a slack comment in purpose, but way to long for that format, so I'm putting it here. I apologize for the rambling. This is in part an attempt to collect my thoughts.
 
 #### Why I'm writing this?
-After I reading the original document in this repo titled open-data-guidelines.md, I realized it bugged me a bit for reasons I had a hard time narrowing down. After giving it some thought, I realized I felt there were thigns left out due to the focus on the dataset.
+After I reading the original document in this repo titled open-data-guidelines.md, I realized it bugged me a bit for reasons I had a hard time narrowing down. After giving it some thought, I realized I felt there were things left out due to the focus on the dataset.
 
 The focus on the dataset makes sense, of course, but I think you get to a better place but not just asking questions about what characteristics the dataset should have but also evaluating the site that hosts the dataset as well as the different types of members in the community around it.
 
+This document is an attempt to suggest other perspectives beyond, what characteristics should the dataset have, to consider when creating an open-data site for datasets geared to geoscience + coding.
+
 #### Open-data Sites & Constraints On Users
 A focus on the characteristics datasets should have makes sense, of course. However, I'm less bothered the characteristics of individual datasets and more by my experiences, or other peoples' experiences, of trying to work with open-datasets in aggregate. Searching through them, evaluating them, organizing them, and aggregating them is often very difficult due to constraints built in place early. Sometimes constraints occur, because certain metadata wasn't encouraged. Other times sites lack certain filtering capability. Other times aspects of the datasets are not programmatically accessible.
 

From 898a59af261f0986aaa0a4667c0b318414a399c6 Mon Sep 17 00:00:00 2001
From: Justin Gosses
Date: Sat, 21 Sep 2019 21:24:27 -0500
Subject: [PATCH 11/28] Update perspectives.md

---
 perspectives.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/perspectives.md b/perspectives.md
index 7495ae5..f56959c 100644
--- a/perspectives.md
+++ b/perspectives.md
@@ -8,7 +8,7 @@ This markdown file is something akin to a slack comment in purpose, but way to l
 #### Why I'm writing this?
 After I reading the original document in this repo titled open-data-guidelines.md, I realized it bugged me a bit for reasons I had a hard time narrowing down. After giving it some thought, I realized I felt there were things left out due to the focus on the dataset.
 
-The focus on the dataset makes sense, of course, but I think you get to a better place but not just asking questions about what characteristics the dataset should have but also evaluating the site that hosts the dataset as well as the different types of members in the community around it.
+The focus on the dataset makes sense, of course, but I think you get to a better place but not just asking questions about what characteristics the dataset should have but also considering the site that hosts the dataset as well as the different types of members in the community around it.
 
 This document is an attempt to suggest other perspectives beyond, what characteristics should the dataset have, to consider when creating an open-data site for datasets geared to geoscience + coding.
 

From 35da238019166631b5798197daced7d20d8912da Mon Sep 17 00:00:00 2001
From: Justin Gosses
Date: Sat, 21 Sep 2019 21:27:07 -0500
Subject: [PATCH 12/28] Update perspectives.md

---
 perspectives.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/perspectives.md b/perspectives.md
index f56959c..a42e73b 100644
--- a/perspectives.md
+++ b/perspectives.md
@@ -6,14 +6,14 @@
 This markdown file is something akin to a slack comment in purpose, but way to long for that format, so I'm putting it here. I apologize for the rambling. This is in part an attempt to collect my thoughts.
 
 #### Why I'm writing this?
-After I reading the original document in this repo titled open-data-guidelines.md, I realized it bugged me a bit for reasons I had a hard time narrowing down. After giving it some thought, I realized I felt there were things left out due to the focus on the dataset.
+After I reading the original document in this repo titled open-data-guidelines.md, it bugged me a little for reasons I had a hard time narrowing down. After giving it some thought, I realized I felt there were things left out due to the focus on the dataset.
 
-The focus on the dataset makes sense, of course, but I think you get to a better place but not just asking questions about what characteristics the dataset should have but also considering the site that hosts the dataset as well as the different types of members in the community around it.
+The focus on the dataset makes sense, of course, but I think you get to a better place but not just asking questions about what characteristics the dataset should have but also considering the site that hosts the dataset as well as the different types of members in the community around both the dataset and the site as a hole.
 
 This document is an attempt to suggest other perspectives beyond, what characteristics should the dataset have, to consider when creating an open-data site for datasets geared to geoscience + coding.
 
 #### Open-data Sites & Constraints On Users
-A focus on the characteristics datasets should have makes sense, of course. However, I'm less bothered the characteristics of individual datasets and more by my experiences, or other peoples' experiences, of trying to work with open-datasets in aggregate. Searching through them, evaluating them, organizing them, and aggregating them is often very difficult due to constraints built in place early. Sometimes constraints occur, because certain metadata wasn't encouraged. Other times sites lack certain filtering capability. Other times aspects of the datasets are not programmatically accessible.
+As stated above, a focus on the characteristics datasets should have makes sense, of course. However, I typically find myself less bothered by the characteristics of individual datasets and more by my experiences, or other peoples' experiences, of trying to work with open-datasets in aggregate. Searching through them, evaluating them, organizing them, and aggregating them is often very difficult due to constraints built in place early. Sometimes constraints occur, because certain metadata wasn't encouraged. Other times sites lack certain filtering capability. Other times aspects of the datasets are not programmatically accessible. I less often find myself working with a specific dataset and say "oh it would be great it this particular dataset had blank".
 
 These concerns are less obvious with only 17 datasets on https://dataunderground.org/dataset as of today. These problems appear more as the number of individual datasets grows greater than users' willingess to read through all of them.
 

From e3eaf30b50c8a86732377d6e929c86ac868685b9 Mon Sep 17 00:00:00 2001
From: Justin Gosses
Date: Sat, 21 Sep 2019 21:27:44 -0500
Subject: [PATCH 13/28] Update perspectives.md

---
 perspectives.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/perspectives.md b/perspectives.md
index a42e73b..b65e41d 100644
--- a/perspectives.md
+++ b/perspectives.md
@@ -10,7 +10,7 @@ After I reading the original
Date: Sat, 21 Sep 2019 21:28:43 -0500
Subject: [PATCH 14/28] Update perspectives.md

---
 perspectives.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/perspectives.md b/perspectives.md
index b65e41d..ab0430c 100644
--- a/perspectives.md
+++ b/perspectives.md
@@ -10,7 +10,7 @@ After I reading the original
Date: Sat, 21 Sep 2019 21:39:25 -0500
Subject: [PATCH 15/28] Update perspectives.md

---
 perspectives.md | 41 +++++++++++++++++++++--------------------
 1 file changed, 21 insertions(+), 20 deletions(-)

diff --git a/perspectives.md b/perspectives.md
index ab0430c..ef5a2a1 100644
--- a/perspectives.md
+++ b/perspectives.md
@@ -17,14 +17,22 @@ As stated above, a focus on the characteristics datasets should have makes sense
 
 These concerns are less obvious with only 17 datasets on https://dataunderground.org/dataset as of today. These problems appear more as the number of individual datasets grows greater than users' willingess to read through all of them.
 
+Additionally, these types of issues increase as the percent of datasets that can be aggregated into larger datasets increases. If all the datasets are completely separate or different in domain and,or format, than these issues are less of a problem.
+
 #### Background That Informs My Requirements/Needs
-First, I co-lead the houston data visualization meetup, which means I spend some time searching for good datasets people would enjoy visualizing during our datajams over the course of about 4 hours. There's a lot of potential open-data sites out there. Second, I help maintain data.nasa.gov, which has approximately 40,000 datasets and gets harvested into data.gov, which has almost a quarter of a million.
+First, I co-lead the houston data visualization meetup, which means I spend some time searching for good datasets people would enjoy visualizing during our datajams over the course of about 4 hours. There's a lot of potential open-data sites out there. Second, I help maintain data.nasa.gov, which has approximately ~40,000 datasets and gets harvested into data.gov, which has almost a quarter of a million.
 
 #### Minimizing Time-To-Start is Maximizing Use Rate
-A hypothesis I have based on my own experiences and, to be completely honest, not based on any real data whatsoever, is that a lot of the most used open-data is just the easiest to use. A lot of datasets are hard to discover, and "discover" is often a more accurate word than "find" as a significant amount of use of open-data comes from people who didn't already know a dataset existed except through the open-data site or someone else who found it on the open-data site.
+A hypothesis I have based on my own experiences and, to be completely honest, not based on any real data whatsoever, is that a lot of the most used open-data is just the easiest to use.
+
+This is what we see with data.nasa.gov. The most used datasets are typically small ones, in a standard format, that are harvested into sites with great user-interfaces making the evaluation and time to start very minimize.
+
+A lot of datasets are hard to discover, and "discover" is often a more accurate word than "find" as a significant amount of use of open-data comes from people who didn't already know a dataset existed except through the open-data site or someone else who found it on the open-data site.
 
 To maximize the rate of discoverability, you need to make the amount of time to get there shorter. This requires the ability to sort and filter datasets in ways the correlate with user needs.
 
+[STRONGLY HELD OPIONION] The search functionality of some open-data sites is more geared to finding datasets than discovering them, which impacts the user experience.
+
 #### Discoverability Problems That Occur With More Datasets on a Site
 
 ##### A. How do you find datasets based on task?
@@ -59,32 +67,25 @@ The first, second, third, and fitch options don't require anyone to have permiss
 
 #### A Back-of-napkin ontology just for example sake
 
+[NOTHING STARTED HERE]
 
-======================================================================
-## THINGS I WROTE BEFORE WHEN I WAS TOO TIRED AND NEED TO REWRITE
-
-### Introduction
-After reading the initial file in this repository called "open-data-guidelines.md", I made some issues here and here.
+## Brainstorming Users, Personas, Use-cases, Potential Addons, and Evolutions of Requirements
 
-After further evaluation, the thing that was bugging me was that the document was written directed at the supplier of a dataset. While that is the obvious place to start, it is worth thinking about a range of personas, multiple use cases, and different possible future states.
-
-## Brainstorm of Personas, Use-cases, Users, Evolutions of Requirements
+### Users
+Highest level breakdown of users:
+1. Suppliers of the datasets
+2. Maintainers of the website (perhaps the most active users!)
+3. End-users consuming datasets
 
-### Builders & Maintainers of the website that hosts and presents that data to end-users
+#### Builders & Maintainers of the website that hosts and presents that data to end-users
 These would be the people with the ability to make changes to https://dataunderground.org/dataset.
 
 Some characteristics one could use to describe them that might matter would be:
 1. The amount of time they want to / can contribute.
 2. The degree to which they choose to accept edits, code, or other types of control from outside the initial group.
-3. Skills, particularly those that affect tech stack.
-
-### Users
-Highest level breakdown of users:
-1. Suppliers of the datasets
-2. Maintainers of the website (perhaps the most active users!)
-3. End-users consuming datasets
+3. Skills, particularly those that affect tech stack and the time to do different dataset approval, organization, or cleaning tasks.
 
-#### A few personas of users:
+#### A few personas of end-users:
 1. Curious about what data is available but no end goal, project or datasets in mind.
 a. Goes to first page and that's it if not interested there.
 b. Spends <4 minutes browsing and then bails if nothing in line with interests.
@@ -94,7 +95,7 @@ Highest level breakdown of users:
 3. Geologist with low level of programming skills. Interested in facies prediction. Wants to move example work from an open-dataset to be applied to their own internal well log data.
 4. Geophysicist who wants to try out unsupervised machine-learning package that he's seen used recently on geology data and is curious what open data could used for that sort of approach.
 
-#### Use-cases
+#### End-users Use-cases
 1. Demo datasets in code packages.
 2. Hackathon datasets.
 3. Datasets to go with open-source code. Users then make changes to the code but keep using the dataset.
From fa7cb0858ccb707ecf53a5bdd5144b6cb25813ba Mon Sep 17 00:00:00 2001 From: Justin Gosses Date: Sat, 21 Sep 2019 21:42:14 -0500 Subject: [PATCH 16/28] Update perspectives.md --- perspectives.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/perspectives.md b/perspectives.md index ef5a2a1..a0bb890 100644 --- a/perspectives.md +++ b/perspectives.md @@ -52,20 +52,20 @@ The obvious solution to many of the discoverability problems above is tags. One A few different solutions exist to the tagging issues highlighted above. -First, there can be a defined list of tags with certain mandatory key-value pairs that dataset suppliers have to fill out. +1. First, there can be a defined list of tags with certain mandatory key-value pairs that dataset suppliers have to fill out. -Second, there can a heoric website manager, or assigned intern, who goes through and re-tags every dataset. +2. Second, there can a heoric website manager, or assigned intern, who goes through and re-tags every dataset. -Third, if lengthy enough text description of each item exists, automated text tagging of topics can augment human-generated tags. +3. Third, if lengthy enough text description of each item exists, automated text tagging of topics can augment human-generated tags. -Fourth, the community that users the dataset can tag datasets as they evaluate & use them, potentially according to an ontology that develops over time and is maintained in a central location. For example, there might be a characteristic about well logs, like whether they have tops, that is deemed interestingly enough that it is flagged by the group as something all datasets with well logs should be re-tagged with. This requires an active community and may not be possible if the open-data site is harvesting hundreds or thousands of datasets from other open-data sites. +4. 
Fourth, the community that users the dataset can tag datasets as they evaluate & use them, potentially according to an ontology that develops over time and is maintained in a central location. For example, there might be a characteristic about well logs, like whether they have tops, that is deemed interestingly enough that it is flagged by the group as something all datasets with well logs should be re-tagged with. This requires an active community and may not be possible if the open-data site is harvesting hundreds or thousands of datasets from other open-data sites. -Fifth, there isn't a good way to discovery datasets that meet a range of criteria except through a lot of work reading descriptions and downloading datasets. +5. Fifth, there isn't a good way to discovery datasets that meet a range of criteria except through a lot of work reading descriptions and downloading datasets. -#### Comments on the Solutions Listed Above -The first, second, third, and fitch options don't require anyone to have permissions to make edits to datasets except the dataset suppliers and maybe the website managers. The fourth, however, requires everyone to have edit access. This enables a lot of flexibility, especially going into the future after a dataset is initially uploaded, but has the downside of potentially scaring dataset suppliers. +##### Comments on the Solutions Listed Above +The first, second, third, and fifth options don't require anyone to have permissions to make edits to datasets except the dataset suppliers and maybe the website managers. The fourth, however, requires everyone to have edit access. This enables a lot of flexibility, especially going into the future after a dataset is initially uploaded, but has the downside of potentially scaring dataset suppliers. 
-#### A Back-of-napkin ontology just for example sake +##### A Back-of-napkin geo-coding tag ontology just for example sake [NOTHING STARTED HERE] From 88d34ce820ba0cd12512f25f566ebbb52d695de0 Mon Sep 17 00:00:00 2001 From: Justin Gosses Date: Sat, 21 Sep 2019 21:51:11 -0500 Subject: [PATCH 17/28] Update perspectives.md --- perspectives.md | 23 +++++++++++++---------- 1 file changed, 13 insertions(+), 10 deletions(-) diff --git a/perspectives.md b/perspectives.md index a0bb890..6867aaf 100644 --- a/perspectives.md +++ b/perspectives.md @@ -1,11 +1,9 @@ -# DRAFT - -## Introduction +#### DRAFT / Brain Dump #### What is this file? This markdown file is something akin to a slack comment in purpose, but way to long for that format, so I'm putting it here. I apologize for the rambling. This is in part an attempt to collect my thoughts. -#### Why I'm writing this? +#### Introduction / Why I'm writing this? After I reading the original document in this repo titled open-data-guidelines.md, it bugged me a little for reasons I had a hard time narrowing down. After giving it some thought, I realized I felt there were things left out due to the focus on the dataset. The focus on the dataset makes sense, of course, but I think you get to a better place but not just asking questions about what characteristics the dataset should have but also considering the site that hosts the dataset as well as the different types of members in the community around both the dataset and the site as a hole. @@ -110,13 +108,18 @@ Some characteristics one could use to describe them that might matter would be: 6. People who want to add or link to code to load the dataset or work with the dataset in some way. 7. People who found a paper that references the dataset but don't own either the dataset or the paper. -#### Level of dataset hosting +#### Levels of dataset hosting here = data-underground -1. This dataset exists only here. -2. 
This dataset is uploaded here but originally lived in this other place. -3. This dataset is referenced here but still lives in its originally place here. -4. This dataset is referenced here but has been automatically harvested from this site. -4. This dataset is referenced here but to download it you have to go to this other site. +1. This dataset was uploaded here originally and exists only here. +2. This dataset is uploaded here but also lives in its original location where it may or may not have changed since. +3. This dataset is programmatically harvested (uploaded) to this site and there are programmatic checks in place to re-harvest if it changes. +4. This dataset is manually uploaded here from another site but there is no telling if the original has changed elsewhere. +5. This dataset is not uploaded here but merely referenced with a direct download link. +6. This dataset is not uploaded here but merely referenced with a direct download link and a marker for it the dataset has changed since original placement. +7. This dataset is not uploaded here but merely referenced with a link to the original site. There is no direct download link to the actual file(s). + +#### Issues Related to levels of dataset hosting + #### Add-ons For: From 81bc81b912ca6d1e670e3e2ef2b440c91abaa6bb Mon Sep 17 00:00:00 2001 From: Justin Gosses Date: Sat, 21 Sep 2019 21:54:31 -0500 Subject: [PATCH 18/28] Update perspectives.md --- perspectives.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/perspectives.md b/perspectives.md index 6867aaf..b53f681 100644 --- a/perspectives.md +++ b/perspectives.md @@ -21,7 +21,7 @@ Additionally, these types of issues increase as the percent of datasets that can First, I co-lead the houston data visualization meetup, which means I spend some time searching for good datasets people would enjoy visualizing during our datajams over the course of about 4 hours. There's a lot of potential open-data sites out there. 
Second, I help maintain data.nasa.gov, which has approximately ~40,000 datasets and gets harvested into data.gov, which has almost a quarter of a million. #### Minimizing Time-To-Start is Maximizing Use Rate -A hypothesis I have based on my own experiences and, to be completely honest, not based on any real data whatsoever, is that a lot of the most used open-data is just the easiest to use. +A hypothesis I have based on my own experiences and, to be honest, relatively little real data is that a lot of the most used open-data is just the easiest to use. This is what we see with data.nasa.gov. The most used datasets are typically small ones, in a standard format, that are harvested into sites with great user-interfaces making the evaluation and time to start very minimize. From e4df767221fd191551be9662991fcbfb8bba3083 Mon Sep 17 00:00:00 2001 From: Justin Gosses Date: Sat, 21 Sep 2019 21:55:47 -0500 Subject: [PATCH 19/28] Update perspectives.md --- perspectives.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/perspectives.md b/perspectives.md index b53f681..e72ad7d 100644 --- a/perspectives.md +++ b/perspectives.md @@ -23,7 +23,7 @@ First, I co-lead the houston data visualization meetup, which means I spend some #### Minimizing Time-To-Start is Maximizing Use Rate A hypothesis I have based on my own experiences and, to be honest, relatively little real data is that a lot of the most used open-data is just the easiest to use. -This is what we see with data.nasa.gov. The most used datasets are typically small ones, in a standard format, that are harvested into sites with great user-interfaces making the evaluation and time to start very minimize. +This is what we see with data.nasa.gov. The most used datasets are typically small ones, only one file, in CSV format, that are harvested into sites with great user-interfaces, like kaggle and data.world, making the evaluation and time to start very minimize. 
A lot of datasets are hard to discover, and "discover" is often a more accurate word than "find" as a significant amount of use of open-data comes from people who didn't already know a dataset existed except through the open-data site or someone else who found it on the open-data site. From 42f14d17ca076f8398b3a779aedbe79fcdea7a75 Mon Sep 17 00:00:00 2001 From: Justin Gosses Date: Sat, 21 Sep 2019 21:58:17 -0500 Subject: [PATCH 20/28] Update perspectives.md --- perspectives.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/perspectives.md b/perspectives.md index e72ad7d..59dc510 100644 --- a/perspectives.md +++ b/perspectives.md @@ -115,7 +115,7 @@ here = data-underground 3. This dataset is programmatically harvested (uploaded) to this site and there are programmatic checks in place to re-harvest if it changes. 4. This dataset is manually uploaded here from another site but there is no telling if the original has changed elsewhere. 5. This dataset is not uploaded here but merely referenced with a direct download link. -6. This dataset is not uploaded here but merely referenced with a direct download link and a marker for it the dataset has changed since original placement. +6. This dataset is not uploaded here but merely referenced with a direct download link and a flag for if the dataset has changed since original placement. 7. This dataset is not uploaded here but merely referenced with a link to the original site. There is no direct download link to the actual file(s). 
#### Issues Related to levels of dataset hosting From ed6b4be3e8c803f57a50e42c4f4aa085b9d4c7a7 Mon Sep 17 00:00:00 2001 From: Justin Gosses Date: Sat, 21 Sep 2019 22:06:36 -0500 Subject: [PATCH 21/28] Update perspectives.md --- perspectives.md | 25 ++++++++++++++++--------- 1 file changed, 16 insertions(+), 9 deletions(-) diff --git a/perspectives.md b/perspectives.md index 59dc510..4fb152d 100644 --- a/perspectives.md +++ b/perspectives.md @@ -119,12 +119,19 @@ here = data-underground 7. This dataset is not uploaded here but merely referenced with a link to the original site. There is no direct download link to the actual file(s). #### Issues Related to levels of dataset hosting - - -#### Add-ons -For: -- Discoverability -- programmatic creation of meta-data from descriptions -- end-user creation of new meta-data -- end-user created code to use with the dataset - +- Different people access the dataset at different times and get different data without realizing it. +- Datasets at links are not longer at those links at some point. +- Should datasets only referenced be backed up somewhere? +- What tags are necessary to make these issues clear to end-users to prevent confusion? +- If direct download links are available if prevent programmatic download as well as programmatic download & description. + +#### Future or External Programmatic Add-ons +As data-undergound has an API, it is worth considering what external (or internal) parties could build with that API early, so that the necessary metadata are populated to a useful extent. + +Example Potential Add-ons: +- Visualizations to show summary descriptions of all the datasets in data-underground. +- Programmatic creation of new meta-data from dataset descriptions or existing metadata tags. +- Discoverability filter abilities that go beyond the original site's tools. +- Programmatic download and programmatic description of datasets resulting in additional metadata creation. 
+- Linking ability to related papers, code projects, starter code, etc. +- Things you haven't thought of yet but someone else has. From e4b259f0f3e7586707b155391a4cd022268206c8 Mon Sep 17 00:00:00 2001 From: Justin Gosses Date: Sat, 21 Sep 2019 22:09:55 -0500 Subject: [PATCH 22/28] Rename perspectives.md to Thoughts_on_Scaling_data-underground.md --- perspectives.md => Thoughts_on_Scaling_data-underground.md | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename perspectives.md => Thoughts_on_Scaling_data-underground.md (100%) diff --git a/perspectives.md b/Thoughts_on_Scaling_data-underground.md similarity index 100% rename from perspectives.md rename to Thoughts_on_Scaling_data-underground.md From 598d0fb4a4b62b4070e7664214bdcbd1e08f24dc Mon Sep 17 00:00:00 2001 From: Justin Gosses Date: Sat, 21 Sep 2019 22:12:00 -0500 Subject: [PATCH 23/28] Update Thoughts_on_Scaling_data-underground.md --- Thoughts_on_Scaling_data-underground.md | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/Thoughts_on_Scaling_data-underground.md b/Thoughts_on_Scaling_data-underground.md index 4fb152d..9cd3b0e 100644 --- a/Thoughts_on_Scaling_data-underground.md +++ b/Thoughts_on_Scaling_data-underground.md @@ -1,7 +1,11 @@ #### DRAFT / Brain Dump #### What is this file? -This markdown file is something akin to a slack comment in purpose, but way to long for that format, so I'm putting it here. I apologize for the rambling. This is in part an attempt to collect my thoughts. +This markdown file is something akin to a slack comment in purpose, but way to long for that format, so I'm putting it here. I apologize for the rambling. + +This document is not meant as a directive for data-undergound at all but more as a series of ideas from a single person to consider. + +- Justin Gosses #### Introduction / Why I'm writing this? 
After I reading the original document in this repo titled open-data-guidelines.md, it bugged me a little for reasons I had a hard time narrowing down. After giving it some thought, I realized I felt there were things left out due to the focus on the dataset. From d1de90118d436d028f466de760e7297b1563b54f Mon Sep 17 00:00:00 2001 From: Justin Gosses Date: Sat, 21 Sep 2019 22:12:21 -0500 Subject: [PATCH 24/28] Update Thoughts_on_Scaling_data-underground.md --- Thoughts_on_Scaling_data-underground.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/Thoughts_on_Scaling_data-underground.md b/Thoughts_on_Scaling_data-underground.md index 9cd3b0e..747f3f7 100644 --- a/Thoughts_on_Scaling_data-underground.md +++ b/Thoughts_on_Scaling_data-underground.md @@ -1,5 +1,3 @@ -#### DRAFT / Brain Dump - #### What is this file? This markdown file is something akin to a slack comment in purpose, but way to long for that format, so I'm putting it here. I apologize for the rambling. From 36a26a7eed2867aeca40593fcf220f295b71f8d0 Mon Sep 17 00:00:00 2001 From: Justin Gosses Date: Sat, 21 Sep 2019 22:16:08 -0500 Subject: [PATCH 25/28] Update Thoughts_on_Scaling_data-underground.md --- Thoughts_on_Scaling_data-underground.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Thoughts_on_Scaling_data-underground.md b/Thoughts_on_Scaling_data-underground.md index 747f3f7..0f90020 100644 --- a/Thoughts_on_Scaling_data-underground.md +++ b/Thoughts_on_Scaling_data-underground.md @@ -13,7 +13,7 @@ The focus on the dataset makes sense, of course, but I think you get to a better This document is an attempt to suggest other perspectives to consider when creating an open-data site for datasets geared to geoscience + coding beyond what characteristics should the dataset have. #### Open-data Sites & Constraints On Users -As stated above, a focus on the characteristics datasets should have makes sense, of course. 
However, I typically find myself less bothered by the characteristics of individual datasets and more by my experiences, or other peoples' experiences, of trying to work with open-datasets in aggregate. Searching through them, evaluating them, organizing them, and aggregating them is often very difficult due to constraints built in place early. Sometimes constraints occur, because certain metadata wasn't encouraged. Other times sites lack certain filtering capability. Other times aspects of the datasets are not programmatically accessible. I less often find myself working with a specific dataset and say "oh it would be great it this particular dataset had blank". +I typically find myself less bothered by the characteristics of individual datasets and more by my experiences, or other peoples' experiences, of trying to work with open-datasets in aggregate. Searching through them, evaluating them, organizing them, and aggregating them is often very difficult due to constraints built in place early. Sometimes constraints occur, because certain metadata wasn't encouraged. Other times sites lack certain filtering capability. Other times aspects of the datasets are not programmatically accessible. I less often find myself working with a specific dataset and say "oh it would be great it this particular dataset had blank". These concerns are less obvious with only 17 datasets on https://dataunderground.org/dataset as of today. These problems appear more as the number of individual datasets grows greater than users' willingess to read through all of them. 
From 0368824f5260c39de82b35a2a85fa0222b557d4a Mon Sep 17 00:00:00 2001 From: Justin Gosses Date: Sat, 21 Sep 2019 22:18:47 -0500 Subject: [PATCH 26/28] Update Thoughts_on_Scaling_data-underground.md --- Thoughts_on_Scaling_data-underground.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/Thoughts_on_Scaling_data-underground.md b/Thoughts_on_Scaling_data-underground.md index 0f90020..8687bb2 100644 --- a/Thoughts_on_Scaling_data-underground.md +++ b/Thoughts_on_Scaling_data-underground.md @@ -20,7 +20,7 @@ These concerns are less obvious with only 17 datasets on https://dataunderground Additionally, these types of issues increase as the percent of datasets that can be aggregated into larger datasets increases. If all the datasets are completely separate or different in domain and,or format, than these issues are less of a problem. #### Background That Informs My Requirements/Needs -First, I co-lead the houston data visualization meetup, which means I spend some time searching for good datasets people would enjoy visualizing during our datajams over the course of about 4 hours. There's a lot of potential open-data sites out there. Second, I help maintain data.nasa.gov, which has approximately ~40,000 datasets and gets harvested into data.gov, which has almost a quarter of a million. +First, I co-lead the houston data visualization meetup, which means I spend some time searching for good datasets people would enjoy visualizing during our datajams over the course of about 4 hours. Second, I help maintain data.nasa.gov, which has approximately ~40,000 datasets and gets harvested into data.gov, which has almost a quarter of a million. #### Minimizing Time-To-Start is Maximizing Use Rate A hypothesis I have based on my own experiences and, to be honest, relatively little real data is that a lot of the most used open-data is just the easiest to use. @@ -63,7 +63,7 @@ A few different solutions exist to the tagging issues highlighted above. 5. 
Fifth, there isn't a good way to discovery datasets that meet a range of criteria except through a lot of work reading descriptions and downloading datasets. ##### Comments on the Solutions Listed Above -The first, second, third, and fifth options don't require anyone to have permissions to make edits to datasets except the dataset suppliers and maybe the website managers. The fourth, however, requires everyone to have edit access. This enables a lot of flexibility, especially going into the future after a dataset is initially uploaded, but has the downside of potentially scaring dataset suppliers. +The first, second, third, and fifth options don't require anyone to have permissions to make edits to datasets except the dataset suppliers and maybe the website managers. The fourth, however, requires everyone to have edit access. This has the potential to scare dataset suppliers if data-underground is the primary site with the dataset. ##### A Back-of-napkin geo-coding tag ontology just for example sake From ea9760fa80564c941a0f7b6ed9e669bd6079d243 Mon Sep 17 00:00:00 2001 From: Justin Gosses Date: Sat, 21 Sep 2019 22:20:07 -0500 Subject: [PATCH 27/28] Update Thoughts_on_Scaling_data-underground.md --- Thoughts_on_Scaling_data-underground.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Thoughts_on_Scaling_data-underground.md b/Thoughts_on_Scaling_data-underground.md index 8687bb2..774a000 100644 --- a/Thoughts_on_Scaling_data-underground.md +++ b/Thoughts_on_Scaling_data-underground.md @@ -1,7 +1,7 @@ #### What is this file? This markdown file is something akin to a slack comment in purpose, but way to long for that format, so I'm putting it here. I apologize for the rambling. -This document is not meant as a directive for data-undergound at all but more as a series of ideas from a single person to consider. +This document is not meant as a directive for data-undergound at all but more as a series of ideas from a single person to consider. 
It is, to some extent, a brain dump. Sorry for the rambling. - Justin Gosses From be4b6454a8970294c1534c84765791479b893a9f Mon Sep 17 00:00:00 2001 From: Justin Gosses Date: Sat, 21 Sep 2019 22:22:42 -0500 Subject: [PATCH 28/28] Update Thoughts_on_Scaling_data-underground.md --- Thoughts_on_Scaling_data-underground.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/Thoughts_on_Scaling_data-underground.md b/Thoughts_on_Scaling_data-underground.md index 774a000..e1dd828 100644 --- a/Thoughts_on_Scaling_data-underground.md +++ b/Thoughts_on_Scaling_data-underground.md @@ -12,16 +12,16 @@ The focus on the dataset makes sense, of course, but I think you get to a better This document is an attempt to suggest other perspectives to consider when creating an open-data site for datasets geared to geoscience + coding beyond what characteristics should the dataset have. -#### Open-data Sites & Constraints On Users +#### Background That Informs My Requirements/Needs +First, I co-lead the houston data visualization meetup, which means I spend some time searching for good datasets people would enjoy visualizing during our datajams over the course of about 4 hours. Second, I help maintain data.nasa.gov, which has approximately ~40,000 datasets and gets harvested into data.gov, which has almost a quarter of a million. + +#### Problems With Scale I typically find myself less bothered by the characteristics of individual datasets and more by my experiences, or other peoples' experiences, of trying to work with open-datasets in aggregate. Searching through them, evaluating them, organizing them, and aggregating them is often very difficult due to constraints built in place early. Sometimes constraints occur, because certain metadata wasn't encouraged. Other times sites lack certain filtering capability. Other times aspects of the datasets are not programmatically accessible. 
I less often find myself working with a specific dataset and say "oh it would be great it this particular dataset had blank". These concerns are less obvious with only 17 datasets on https://dataunderground.org/dataset as of today. These problems appear more as the number of individual datasets grows greater than users' willingess to read through all of them. Additionally, these types of issues increase as the percent of datasets that can be aggregated into larger datasets increases. If all the datasets are completely separate or different in domain and,or format, than these issues are less of a problem. -#### Background That Informs My Requirements/Needs -First, I co-lead the houston data visualization meetup, which means I spend some time searching for good datasets people would enjoy visualizing during our datajams over the course of about 4 hours. Second, I help maintain data.nasa.gov, which has approximately ~40,000 datasets and gets harvested into data.gov, which has almost a quarter of a million. - #### Minimizing Time-To-Start is Maximizing Use Rate A hypothesis I have based on my own experiences and, to be honest, relatively little real data is that a lot of the most used open-data is just the easiest to use.
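
A minimal sketch of the metadata point made under "Problems With Scale" (constraints appear when certain metadata was not encouraged, or when a site lacks filtering), combined with the "defined list of tags with certain mandatory key-value pairs" option from the tagging discussion. Everything named here is an assumption for illustration: the `DatasetRecord` class, the keys in `REQUIRED_TAG_KEYS`, and the sample catalog entries are hypothetical and do not describe data-underground's actual schema or API.

```python
from dataclasses import dataclass, field

# Hypothetical mandatory tag keys; illustrative assumptions only,
# not an actual data-underground schema.
REQUIRED_TAG_KEYS = ("data_type", "file_format", "license")


@dataclass
class DatasetRecord:
    """A catalog entry carrying free-form key-value tags."""
    title: str
    tags: dict = field(default_factory=dict)


def missing_required_tags(record, required=REQUIRED_TAG_KEYS):
    """Return the required tag keys that are absent or empty on a record."""
    return [key for key in required if not record.tags.get(key)]


def filter_by_tags(records, **criteria):
    """Keep only the records whose tags match every key=value criterion."""
    return [
        record for record in records
        if all(record.tags.get(key) == value for key, value in criteria.items())
    ]


# Made-up sample entries, used only to exercise the two helpers.
catalog = [
    DatasetRecord(
        "Example well logs",
        {"data_type": "well_log", "file_format": "las", "license": "CC-BY-4.0"},
    ),
    DatasetRecord(
        "Example seismic volume",
        {"data_type": "seismic", "file_format": "segy"},  # no license tag
    ),
]

# Records a supplier or maintainer would still need to complete.
incomplete = {
    record.title: missing_required_tags(record)
    for record in catalog
    if missing_required_tags(record)
}

# Discovery by criteria instead of reading every description.
well_logs = filter_by_tags(catalog, data_type="well_log")
```

With only 17 datasets a check like this is hardly needed, but a filter can only use keys that were collected at upload time, which is why the choice of mandatory keys matters early.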