CrocoLakeLoader is a Python package containing modules to interface with CrocoLake -- a database of oceanographic observations that is developed and maintained in the framework of the NSF-sponsored project CROCODILE (CESM Regional Ocean and Carbon cOnfigurator with Data assimilation and Embedding).
Its strength is to offer a uniform interface to a suite of oceanographic observations, all maintained in the same format (parquet). This gives the user the opportunity to retrieve and manipulate observations while remaining agnostic of their original data formats, avoiding the extra learning required to deal with multiple formats (netCDF, CSV, parquet, etc.) and to merge different sources.
On your terminal, run:
pip install crocolakeloader
CrocoLake is available in two versions, "PHY" and "BGC", which hold physical and biogeochemical data respectively. A full list of variable names is found in utils/params.py. In general, you'll want to use the "PHY" database if you're looking at only temperature and/or salinity; "BGC" otherwise. Often (but not always) temperature and salinity are present when and where BGC measurements are, but not vice versa.
To download the most recent version of CrocoLake PHY, from the repository root run
wget "https://whoi-my.sharepoint.com/:u:/g/personal/enrico_milanese_whoi_edu/EYd-6370NqtNskScY-6hytIByrMA5LEIUVONBgzop9IVog?e=ElcHa4&download=1" -O ./CrocoLake/CrocoLakePHY/CrocoLakePHY.zip
cd ./CrocoLake/CrocoLakePHY/
unzip CrocoLakePHY.zip
For CrocoLake BGC:
wget "https://whoi-my.sharepoint.com/:u:/g/personal/enrico_milanese_whoi_edu/EbjTk9CJgCZJlkvPmwI38NsBQQvUL6MXkTLBAPV5jZutVg?e=RPY9vP&download=1" -O ./CrocoLake/CrocoLakeBGC/CrocoLakeBGC.zip
cd ./CrocoLake/CrocoLakeBGC/
unzip CrocoLakeBGC.zip
You can set up your own paths, but if you followed the previous steps you should now have a folder structure that is already compatible with the example notebooks, like this (listing folders only):

|--- crocolakeloader
|    |--- CrocoLake/
|    |    |--- CrocoLakePHY/
|    |    |    |--- 1002_PHY_ARGO-QC-DEV_2025-02-15
|    |    |    |--- 1101_PHY_GLODAP_2024-12-11
|    |    |    |--- 1201_PHY_SPRAY_2024-12-11
|    |    |--- CrocoLakeBGC/
|    |    |    |--- 1002_BGC_ARGO-QC-DEV_2025-02-15
|    |    |    |--- 1101_BGC_GLODAP_2024-12-11
|    |--- notebooks/
|    |--- test/
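Before moving on, it can be handy to check that the expected folders are in place. The snippet below is not part of the package; it is a minimal sketch that assumes the structure above and simply reports which dataset folders are missing (folder names and dates may differ in newer releases):

```python
from pathlib import Path

# Hypothetical root path; adjust it to wherever you unzipped the archives.
db_rootpath = Path("./CrocoLake")

# Dataset folder names as in the listing above (versions/dates may change).
expected = {
    "CrocoLakePHY": [
        "1002_PHY_ARGO-QC-DEV_2025-02-15",
        "1101_PHY_GLODAP_2024-12-11",
        "1201_PHY_SPRAY_2024-12-11",
    ],
    "CrocoLakeBGC": [
        "1002_BGC_ARGO-QC-DEV_2025-02-15",
        "1101_BGC_GLODAP_2024-12-11",
    ],
}

# Collect any expected folder that does not exist on disk.
missing = [
    db_rootpath / db / sub
    for db, subs in expected.items()
    for sub in subs
    if not (db_rootpath / db / sub).is_dir()
]
print(f"{len(missing)} expected folders not found")
```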
What follows is a brief guide on how to load data from CrocoLake. More examples (including how to manipulate the data) are in the notebooks folder.
Before going ahead, remember to download CrocoLake if you haven't already.
The simplest way to load it into your working space is through the Loader class:
from crocolakeloader.loader import Loader
loader = Loader(
db_type="PHY", # specify "PHY" or "BGC" for physical or biogeochemical databases
db_rootpath="/path/to/my/CrocoLake"
)
ddf = loader.get_dataframe()
Loader() needs at minimum the database type ("PHY" or "BGC") and the path to the database. get_dataframe() returns a dask dataframe. If you're not familiar with dask, you can think of it as a wrapper for dealing with data larger than what your machine's memory can load. A dask dataframe behaves almost identically to a pandas dataframe, and if you do want a pandas dataframe, you can just do (but DON'T do it, yet):
df = ddf.compute()
Note that this will load into memory all the data that ddf references: our first simple example would load more data than most systems can handle, so let's see how we can apply some filters.
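If the lazy-evaluation idea behind this is unfamiliar, the following toy model (plain Python, not dask or CrocoLakeLoader) illustrates it: each step builds a recipe rather than data, and memory is only used at the final materialization step, which plays the role of .compute():

```python
# "ddf": a lazy source of a million rows; almost no memory used yet.
rows = ({"TEMP": t} for t in range(1_000_000))

# Still lazy, analogous to filtering a dask dataframe: nothing has run.
warm = (r for r in rows if r["TEMP"] >= 999_998)

# Analogous to ddf.compute(): only now do (the surviving) data enter memory.
result = list(warm)
print(result)  # [{'TEMP': 999998}, {'TEMP': 999999}]
```

This is why filtering before calling compute() matters: only the rows that survive the filters are ever materialized.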
If you want to load only some specific variables (see list here), you can pass a name list to Loader():
selected_variables = [
"LATITUDE",
"LONGITUDE",
"PRES",
"PSAL",
"TEMP"
]
loader = Loader(
selected_variables=selected_variables,
db_type="PHY",
db_rootpath="/path/to/my/CrocoLake"
)
ddf = loader.get_dataframe()
Similarly, you can also filter by data source (list here) with a list:
db_source = ["ARGO"]
loader = Loader(
selected_variables=selected_variables,
db_type="PHY",
db_list=db_source,
db_rootpath="/path/to/my/CrocoLake"
)
ddf = loader.get_dataframe()
Filtering by values (i.e. row-wise, e.g. to restrict the geographical coordinates or the time period) requires defining and applying a filter to the loader object:
filters = [
("LATITUDE",'>',5),
("LATITUDE",'<',30),
("LONGITUDE",'>',-90),
("LONGITUDE",'<',-30),
("TEMP",">=",-1e30),
("TEMP","<=",+1e30)
]
loader.set_filters(filters)
ddf = loader.get_dataframe()
Two notes on the filters:
- To discard invalid values (NaNs), request the variable to be inside a very large interval (e.g. between -1e30 and +1e30).
- The filters must be passed in the appropriate format (see the filters option here); it's easier done than explained, but basically a single list contains AND predicates, and outer, parallel lists are combined with OR predicates. In the example above, all conditions must be satisfied for a row to be kept in the dataframe. If we want to keep all the rows with valid temperature or pressure values in the region, we would do:
filters = [
[
("LATITUDE",'>',5),
("LATITUDE",'<',30),
("LONGITUDE",'>',-90),
("LONGITUDE",'<',-30),
("TEMP",">=",-1e30),
("TEMP","<=",+1e30)
],[
("LATITUDE",'>',5),
("LATITUDE",'<',30),
("LONGITUDE",'>',-90),
("LONGITUDE",'<',-30),
("PRES",">=",-1e30),
("PRES","<=",+1e30)
]
]
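The AND/OR semantics of this format can be sketched with a toy evaluator (plain Python, not part of CrocoLakeLoader; the filters are a shortened version of the example above, keeping only the latitude bounds and the validity checks): a row is kept if all predicates of any inner list hold.

```python
import operator

# Map the comparison strings used in the filters to Python operators.
OPS = {">": operator.gt, "<": operator.lt,
       ">=": operator.ge, "<=": operator.le, "==": operator.eq}

def keep_row(row, filters):
    # Outer lists are combined with OR; each inner list is an AND of predicates.
    return any(
        all(OPS[op](row[col], val) for (col, op, val) in group)
        for group in filters
    )

filters = [
    [("LATITUDE", ">", 5), ("LATITUDE", "<", 30),
     ("TEMP", ">=", -1e30), ("TEMP", "<=", 1e30)],
    [("LATITUDE", ">", 5), ("LATITUDE", "<", 30),
     ("PRES", ">=", -1e30), ("PRES", "<=", 1e30)],
]

# NaN fails every comparison, so this row fails the TEMP group,
# but its valid PRES value satisfies the second group: the row is kept.
row = {"LATITUDE": 10.0, "TEMP": float("nan"), "PRES": 200.0}
print(keep_row(row, filters))  # True
```

Note how the very-large-interval trick from the first bullet works here: NaN compares false against any bound, so a NaN value silently drops the row out of that predicate group.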
As of this release, CrocoLake includes all the Argo physical and biogeochemical data present in the GDAC, GLODAP's database, and QC-ed observations from Spray Gliders.
We are always working on including new sources, and the next candidates are the North Atlantic CPR Survey and the Oleander project.
If you are interested in a particular dataset to be added, get in touch!
If you are interested in how CrocoLake is generated, check out CrocoLake-Tools.