CrocoLakeLoader

CrocoLakeLoader is a Python package containing modules to interface with CrocoLake -- a database of oceanographic observations that is developed and maintained in the framework of the NSF-sponsored project CROCODILE (CESM Regional Ocean and Carbon cOnfigurator with Data assimilation and Embedding).

Its strength is a uniform interface to a suite of oceanographic observations, all maintained in the same format (parquet). This lets the user retrieve and manipulate observations while remaining agnostic of their original data format, avoiding the extra learning required to deal with multiple formats (netCDF, CSV, parquet, etc.) and to merge different sources.

Table of Contents

  1. Installation
  2. Loading data from CrocoLake
    1. Loading the data
      1. Filter variables
      2. Filter sources
      3. Filter by variables values
  3. Available sources
  4. CrocoLake-Tools

Installation

In your terminal, run:

pip install crocolakeloader

Download CrocoLake

CrocoLake is available in two versions, "PHY" and "BGC", which hold physical and biogeochemical data respectively. A full list of variable names is found in utils/params.py. In general, you'll want to use the "PHY" database if you're looking at only temperature and/or salinity; "BGC" otherwise. Often (but not always) temperature and salinity are present when and where BGC measurements are, but not vice versa.

To download the most recent version of CrocoLake PHY, run the following from the repository root:

wget "https://whoi-my.sharepoint.com/:u:/g/personal/enrico_milanese_whoi_edu/EYd-6370NqtNskScY-6hytIByrMA5LEIUVONBgzop9IVog?e=ElcHa4&download=1" -O ./CrocoLake/CrocoLakePHY/CrocoLakePHY.zip
cd ./CrocoLake/CrocoLakePHY/
unzip CrocoLakePHY.zip

For CrocoLake BGC:

wget "https://whoi-my.sharepoint.com/:u:/g/personal/enrico_milanese_whoi_edu/EbjTk9CJgCZJlkvPmwI38NsBQQvUL6MXkTLBAPV5jZutVg?e=RPY9vP&download=1" -O ./CrocoLake/CrocoLakeBGC/CrocoLakeBGC.zip
cd ./CrocoLake/CrocoLakeBGC/
unzip CrocoLakeBGC.zip

You can set up your own paths, but if you followed the previous steps you should now have a folder structure that is already compatible with the example notebooks, like this (listing folders only):

|--- crocolakeloader
|   |--- CrocoLake/
|   |   |--- CrocoLakePHY/
|   |   |   |--- 1002_PHY_ARGO-QC-DEV_2025-02-15
|   |   |   |--- 1101_PHY_GLODAP_2024-12-11
|   |   |   |--- 1201_PHY_SPRAY_2024-12-11
|   |   |--- CrocoLakeBGC/
|   |   |   |--- 1002_BGC_ARGO-QC-DEV_2025-02-15
|   |   |   |--- 1101_BGC_GLODAP_2024-12-11
|   |--- notebooks/
|   |--- test/
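If you want to confirm that the download and unzip steps produced the expected layout, a small check like the following may help. It is only a sketch: `list_crocolake_datasets` and its `crocolake_dir` argument are hypothetical names, not part of crocolakeloader, and the code assumes the folder naming shown in the listing (dataset folders that contain `_PHY_` or `_BGC_` in their name).

```python
from pathlib import Path


def list_crocolake_datasets(crocolake_dir, db_type):
    """Return the sorted names of dataset folders under <crocolake_dir>/CrocoLake<db_type>/.

    Hypothetical helper for illustration; it only inspects folder names.
    """
    if db_type not in ("PHY", "BGC"):
        raise ValueError("db_type must be 'PHY' or 'BGC'")
    db_dir = Path(crocolake_dir) / f"CrocoLake{db_type}"
    if not db_dir.is_dir():
        raise FileNotFoundError(f"Expected folder not found: {db_dir}")
    # Dataset folder names embed the database type, e.g. 1101_PHY_GLODAP_2024-12-11
    return sorted(
        p.name for p in db_dir.iterdir()
        if p.is_dir() and f"_{db_type}_" in p.name
    )
```

Calling `list_crocolake_datasets("./CrocoLake", "PHY")` from the repository root should then return the PHY dataset folder names, or raise `FileNotFoundError` if the archive was unzipped somewhere else.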

Loading data from CrocoLake

What follows is a brief guide on how to load data from CrocoLake. More examples (including how to manipulate the data) are in the notebooks folder.

Before going ahead, remember to download CrocoLake if you haven't already.

The simplest way to load the data into your workspace is through the Loader class:

from crocolakeloader.loader import Loader
loader = Loader.Loader(
    db_type="PHY",  # specify "PHY" or "BGC" for physical or biogeochemical databases
    db_rootpath="/path/to/my/CrocoLake"
)
ddf = loader.get_dataframe()

Loader() needs at minimum the database type ("PHY" or "BGC") and the path to the database. get_dataframe() returns a dask dataframe. If you're not familiar with dask, you can think of it as a wrapper for working with data that are larger than your machine's memory. A dask dataframe behaves almost identically to a pandas dataframe, and if you do want a pandas dataframe, you can get one with the following (but DON'T do it yet):

df = ddf.compute()

Note that this loads into memory all the data that ddf references. Our first simple example would load more data than most systems can handle, so let's see how to apply some filters.

Filter variables

If you want to load only specific variables (the full list is in utils/params.py), you can pass a list of names to Loader():

selected_variables = [
    "LATITUDE",
    "LONGITUDE",
    "PRES",
    "PSAL",
    "TEMP"
]

loader = Loader.Loader(
    selected_variables=selected_variables,
    db_type="PHY",
    db_rootpath="/path/to/my/CrocoLake"
)

ddf = loader.get_dataframe()

Filter sources

Similarly, you can filter by data source (see Available sources) by passing a list:

db_source = ["ARGO"]

loader = Loader.Loader(
    selected_variables=selected_variables,
    db_type="PHY",
    db_list=db_source,
    db_rootpath="/path/to/my/CrocoLake"
)

ddf = loader.get_dataframe()

Filter by variables values

Filtering by values (i.e. row-wise, e.g. to restrict the geographical coordinates or the time period) requires defining a filter and applying it to the loader object:

filters = [
    ("LATITUDE",'>',5),
    ("LATITUDE",'<',30),
    ("LONGITUDE",'>',-90),
    ("LONGITUDE",'<',-30),
    ("TEMP",">=",-1e30),
    ("TEMP","<=",+1e30)
]

loader.set_filters(filters)

ddf = loader.get_dataframe()

Two notes on the filters:

  • To discard invalid values (NaNs), require the variable to lie inside a very large interval (e.g. between -1e30 and +1e30).
  • The filters must be passed in the appropriate format (the list-of-tuples format used by the filters option of parquet readers such as dask's read_parquet). It's easier done than explained, but basically a single list contains AND predicates, and outer, parallel lists are combined with OR. In the example above, a row must satisfy all conditions to be kept in the dataframe. If we want to keep all the rows with valid temperature or pressure values in the region, we would do:

filters = [
    [
        ("LATITUDE",'>',5),
        ("LATITUDE",'<',30),
        ("LONGITUDE",'>',-90),
        ("LONGITUDE",'<',-30),
        ("TEMP",">=",-1e30),
        ("TEMP","<=",+1e30)
    ],[
        ("LATITUDE",'>',5),
        ("LATITUDE",'<',30),
        ("LONGITUDE",'>',-90),
        ("LONGITUDE",'<',-30),
        ("PRES",">=",-1e30),
        ("PRES","<=",+1e30)
    ]
]
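The AND/OR rule and the NaN trick can be illustrated in plain Python. The sketch below mimics the filter semantics on a single row stored as a dict; `row_passes` is a hypothetical helper, not part of crocolakeloader, and the real filtering is done by the parquet reader, not by code like this.

```python
import math
import operator

# Comparison operators that can appear in the filter tuples
OPS = {
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le,
    "==": operator.eq, "!=": operator.ne,
}


def row_passes(row, filters):
    """Evaluate filters on one row (a dict): AND within a list, OR across lists."""
    # A flat list of tuples is treated as a single AND group
    groups = filters if filters and isinstance(filters[0], list) else [filters]
    return any(
        all(OPS[op](row[name], value) for name, op, value in group)
        for group in groups
    )


valid_temp = [("TEMP", ">=", -1e30), ("TEMP", "<=", 1e30)]
valid_pres = [("PRES", ">=", -1e30), ("PRES", "<=", 1e30)]

row = {"TEMP": math.nan, "PRES": 100.0}
temp_only = row_passes(row, valid_temp)                    # NaN fails both range comparisons
temp_or_pres = row_passes(row, [valid_temp, valid_pres])   # the PRES group still passes
```

With these inputs, `temp_only` is False because a NaN temperature fails both range comparisons, while `temp_or_pres` is True because the pressure group is satisfied on its own.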

Available sources

As of this release, CrocoLake includes all the Argo physical and biogeochemical data present in the GDAC, GLODAP's database, and QC-ed observations from Spray Gliders.

We are always working on including new sources, and the next candidates are the North Atlantic CPR Survey and the Oleander project.

If you would like a particular dataset to be added, get in touch!

CrocoLake-Tools

If you are interested in how CrocoLake is generated, check out CrocoLake-Tools.
