softwareunderground · JustinGOSSES · Sep 15, 2019 · Sep 15, 2019 · Sep 16, 2019 · Sep 16, 2019
diff --git a/Thoughts_on_Scaling_data-underground.md b/Thoughts_on_Scaling_data-underground.md
@@ -0,0 +1,139 @@
+#### What is this file?
+This markdown file is akin to a Slack comment in purpose, but way too long for that format, so I'm putting it here. I apologize for the rambling.
+
+<b>This document is not meant as a directive for data-underground at all</b>, but rather as a series of ideas from a single person to consider. It is, to some extent, a brain dump.
+
+- Justin Gosses
+
+#### Introduction / Why I'm writing this?
+After I read the original <a href="https://github.com/softwareunderground/data-underground/blob/master/open-data-guidelines.md">document</a> in this repo titled open-data-guidelines.md, it bugged me a little for reasons I had a hard time narrowing down. After giving it some thought, I realized I felt there were things left out due to the focus on the dataset.
+
+The focus on the dataset makes sense, of course, but I think you get to a better place by not just asking what characteristics the dataset should have, but also considering the site that hosts the dataset, as well as the different types of members in the community around both the dataset and the site as a whole.
+
+<i>This document is an attempt to suggest other perspectives to consider when creating an open-data site for datasets geared to geoscience + coding, beyond what characteristics the dataset should have.</i>
+
+#### Background That Informs My Requirements/Needs
+First, I co-lead the Houston Data Visualization Meetup, which means I spend some time searching for good datasets people would enjoy visualizing during our datajams over the course of about 4 hours. Second, I help maintain data.nasa.gov, which has approximately 40,000 datasets and gets harvested into <a href="https://catalog.data.gov/dataset">data.gov</a>, which has almost a quarter of a million.
+
+#### Problems With Scale
+I typically find myself less bothered by the characteristics of individual datasets and more by my experiences, or other people's experiences, of trying to work with open datasets in aggregate. Searching through them, evaluating them, organizing them, and aggregating them is often very difficult due to constraints put in place early. Sometimes the constraint is that certain metadata wasn't encouraged. Other times sites lack certain filtering capabilities. Other times aspects of the datasets are not programmatically accessible. I less often find myself working with a specific dataset and saying, "Oh, it would be great if this particular dataset had blank."
+
+These concerns are less obvious with only 17 datasets on https://dataunderground.org/dataset as of today. They become more apparent once the number of datasets exceeds users' willingness to read through all of them.
+
+Additionally, these types of issues increase as the percentage of datasets that can be aggregated into larger datasets increases. If all the datasets are completely separate or different in domain and/or format, then these issues are less of a problem.
+
+#### Minimizing Time-To-Start is Maximizing Use Rate
+A hypothesis I have, based on my own experiences and, to be honest, relatively little real data, is that a lot of the most used open data is simply the easiest to use.
+
+This is what we see with data.nasa.gov. The most used datasets are typically small ones, only one file, in CSV format, that are harvested into sites with great user interfaces, like Kaggle and data.world, which keeps both evaluation time and time-to-start minimal.
+
+A lot of datasets are hard to discover, and "discover" is often a more accurate word than "find" as a significant amount of use of open-data comes from people who didn't already know a dataset existed except through the open-data site or someone else who found it on the open-data site.  
+
+To maximize discoverability, you need to shorten the time it takes a user to get to a relevant dataset. This requires the ability to sort and filter datasets in ways that correlate with user needs.
+
+[STRONGLY HELD OPINION] The search functionality of some open-data sites is more geared to finding datasets than discovering them, which impacts the user experience.
+
+#### Discoverability Problems That Occur With More Datasets on a Site
+
+##### A. How do you find datasets based on task?
+Some users care less about a dataset's content than about knowing that it has labels and can be modeled as a time series problem. How do they find all the datasets that meet that definition?
+
+##### B. How do you find datasets based on data format or data structure?
+Some people will want LAS 2.0 well logs. Others will absolutely need wells with well paths included, which LAS files won't have.
+
+##### C. Can you find all the versions of a dataset?
+If a user stumbles upon a preprocessed dataset like <a href="https://github.com/JustinGOSSES/McMurray-Wabiskaw-preprocessed-datasets/blob/master/processed_datasets/mcmurray_facies_dataframe.h5.zip">this one</a>, will they also see that there is an original dataset <a href="https://dataunderground.org/dataset/athabasca">here</a>?
+
+##### D. How do you find datasets based on a minimum number of instances?
+Some users will be interested in calculating petrophysical variables in Python and want to know what curves are available. Others will want to do stratigraphic top prediction and need more than 800 wells at minimum. 
+
+#### Tags and Flexibility and Who Does the Work?
+The obvious solution to many of the discoverability problems above is tags. One of the problems with tags is that they're typically applied by humans with differing perspectives on what's important. Even people with the same perspective on what's important might phrase things differently. This often leads to partial coverage: only some of the datasets that could fall under a given tag are actually tagged with that keyword.
+
+A few different solutions exist to the tagging issues highlighted above. 
+
+1. First, there can be a defined list of tags with certain mandatory key-value pairs that dataset suppliers have to fill out. 
+
+2. Second, there can be a heroic website manager, or assigned intern, who goes through and re-tags every dataset.
+
+3. Third, if a lengthy enough text description of each dataset exists, automated text tagging of topics can augment human-generated tags.
+
+4. Fourth, the community that uses the datasets can tag them as they evaluate & use them, potentially according to an ontology that develops over time and is maintained in a central location. For example, there might be a characteristic of well logs, like whether they have tops, that is deemed interesting enough that the group flags it as something all datasets with well logs should be re-tagged with. This requires an active community and may not be possible if the open-data site is harvesting hundreds or thousands of datasets from other open-data sites.
+
+5. Fifth, accept that there isn't a good way to discover datasets that meet a range of criteria except through a lot of work reading descriptions and downloading datasets.
+
+##### Comments on the Solutions Listed Above
+The first, second, third, and fifth options don't require anyone to have permissions to make edits to datasets except the dataset suppliers and maybe the website managers. The fourth, however, requires everyone to have edit access. This has the potential to scare dataset suppliers if data-underground is the primary site with the dataset.
+
+##### A Back-of-Napkin Geo-Coding Tag Ontology, Just for Example's Sake
+
+[NOTHING STARTED HERE]
+
+## Brainstorming Users, Personas, Use-cases, Potential Addons, and Evolutions of Requirements
+
+### Users
+Highest level breakdown of users:
+1. Suppliers of the datasets
+2. Maintainers of the website (perhaps the most active users!)
+3. End-users consuming datasets
+
+#### Builders & Maintainers of the website that hosts and presents that data to end-users
+These would be the people with the ability to make changes to https://dataunderground.org/dataset.
+
+Some characteristics one could use to describe them that might matter would be:
+1. The amount of time they want to / can contribute.
+2. The degree to which they choose to accept edits, code, or other types of control from outside the initial group. 
+3. Skills, particularly those that affect tech stack and the time to do different dataset approval, organization, or cleaning tasks. 
+
+#### A few personas of end-users:
+1. Curious about what data is available but no end goal, project or datasets in mind. 
+  a. Goes to first page and that's it if not interested there.
+  b. Spends <4 minutes browsing and then bails if nothing in line with interests.
+  c. Will read details of multiple datasets and maybe download 1 or 2 max.
+  d. Will read details on every dataset up to the first 50 and download 4 to 10 to explore in more depth.
+2. Geologist with high level of programming skills. Interested in seismic data related to automatic fault picking.
+3. Geologist with low level of programming skills. Interested in facies prediction. Wants to move example work from an open-dataset to be applied to their own internal well log data.
+4. Geophysicist who wants to try out an unsupervised machine-learning package that he's seen used recently on geology data and is curious what open data could be used for that sort of approach.
+
+#### End-users Use-cases
+1. Demo datasets in code packages. 
+2. Hackathon datasets.
+3. Datasets to go with open-source code. Users then make changes to the code but keep using the dataset.
+4. Various types of data science: visualization, data exploration, SQL practice, unsupervised learning, supervised learning, classification, regression prediction, etc.
+
+#### Suppliers of datasets / who might have edit rights in some form?
+1. Owners of original datasets who also collected the data. 
+2. People who have rights to make the dataset public but didn't collect it originally or prep it. 
+3. People who found the dataset with an open-source license elsewhere and want to suggest it for inclusion on data-underground.
+4. People who want to add a correction to the already published dataset. 
+5. People who want to supply some additional notes to the dataset (for example which logs have bad formatting and won't load easily via LASIO). 
+6. People who want to add or link to code to load the dataset or work with the dataset in some way. 
+7. People who found a paper that references the dataset but don't own either the dataset or the paper.
+
+#### Levels of dataset hosting
+here = data-underground
+1. This dataset was uploaded here originally and exists only here. 
+2. This dataset is uploaded here but also lives in its original location where it may or may not have changed since.
+3. This dataset is programmatically harvested (uploaded) to this site and there are programmatic checks in place to re-harvest if it changes. 
+4. This dataset is manually uploaded here from another site but there is no telling if the original has changed elsewhere.
+5. This dataset is not uploaded here but merely referenced with a direct download link.
+6. This dataset is not uploaded here but merely referenced with a direct download link and a flag for if the dataset has changed since original placement.
+7. This dataset is not uploaded here but merely referenced with a link to the original site. There is no direct download link to the actual file(s).
+
+#### Issues Related to levels of dataset hosting
+- Different people access the dataset at different times and get different data without realizing it.
+- Datasets at links are no longer at those links at some point.
+- Should datasets only referenced be backed up somewhere?
+- What tags are necessary to make these issues clear to end-users to prevent confusion?
+- If direct download links are not available, that prevents programmatic download as well as programmatic description of the datasets.
+
+#### Future or External Programmatic Add-ons
+As data-underground has an API, it is worth considering early what external (or internal) parties could build with that API, so that the necessary metadata are populated to a useful extent.
+
+Example Potential Add-ons:
+- Visualizations to show summary descriptions of all the datasets in data-underground.
+- Programmatic creation of new metadata from dataset descriptions or existing metadata tags.
+- Discoverability filter abilities that go beyond the original site's tools.
+- Programmatic download and programmatic description of datasets resulting in additional metadata creation.
+- Linking ability to related papers, code projects, starter code, etc.
+- Things you haven't thought of yet but someone else has.
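
As a rough illustration of the "discoverability filter" add-on idea, and of discoverability questions A, B, and D, here is a minimal Python sketch of filtering dataset metadata by task, data format, and minimum number of instances. Everything in it is an assumption made for the example: the `DatasetRecord` fields, the `find_datasets` helper, and the three catalog entries are hypothetical and do not reflect the actual data-underground API or any real dataset. It only works if suppliers (or someone else) have populated those fields consistently, which is the point of the tagging discussion.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """Hypothetical metadata record; field names are assumptions, not the site's schema."""
    name: str
    data_format: str                          # e.g. "LAS 2.0", "CSV", "SEG-Y"
    instance_count: int                       # e.g. number of wells in the dataset
    tasks: set = field(default_factory=set)   # e.g. {"classification"}
    has_labels: bool = False


def find_datasets(records, task=None, data_format=None,
                  min_instances=0, require_labels=False):
    """Return the records that satisfy every criterion that was supplied."""
    matches = []
    for record in records:
        if task is not None and task not in record.tasks:
            continue
        if data_format is not None and record.data_format != data_format:
            continue
        if record.instance_count < min_instances:
            continue
        if require_labels and not record.has_labels:
            continue
        matches.append(record)
    return matches


# Made-up records for illustration only; they do not describe real datasets.
catalog = [
    DatasetRecord("example-well-logs-a", "LAS 2.0", 2000,
                  {"top prediction", "classification"}, True),
    DatasetRecord("example-well-logs-b", "LAS 2.0", 40,
                  {"petrophysics"}, False),
    DatasetRecord("example-seismic-c", "SEG-Y", 1,
                  {"fault picking"}, True),
]

# "I need LAS 2.0 well logs with more than 800 wells."
big_las = find_datasets(catalog, data_format="LAS 2.0", min_instances=800)
print([r.name for r in big_las])  # ['example-well-logs-a']
```

A filter like this is only as good as the metadata behind it: if `instance_count` or `tasks` is missing for most datasets, the user is back to reading descriptions and downloading files one at a time.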