CrocoLakeLoader is a Python package containing modules to interface with CrocoLake -- a database of oceanographic observations that is developed and maintained in the framework of the NSF-sponsored project CROCODILE (CESM Regional Ocean and Carbon cOnfigurator with Data assimilation and Embedding).
Its strength is to offer a uniform interface to a suite of oceanographic observations, all maintained in the same format (parquet). This gives the user the opportunity to retrieve and manipulate observations while remaining agnostic of their original data formats, avoiding the extra learning required to deal with multiple formats (netCDF, CSV, parquet, etc.) and to merge different sources.
On your terminal, run:
pip install crocolakeloader
CrocoLake is available in two versions, "PHY" and "BGC", which hold physical and biogeochemical data respectively. A full list of variable names is found in utils/params.py. In general, you'll want to use the "PHY" database if you're looking at only temperature and/or salinity; "BGC" otherwise. Often (but not always) temperature and salinity are present when and where BGC measurements are, but not vice versa.
To download the most recent version of CrocoLake PHY, from the repository root run
wget "https://whoi-my.sharepoint.com/:u:/g/personal/enrico_milanese_whoi_edu/EYd-6370NqtNskScY-6hytIByrMA5LEIUVONBgzop9IVog?e=ElcHa4&download=1" -O ./CrocoLake/CrocoLakePHY/CrocoLakePHY.zip
cd ./CrocoLake/CrocoLakePHY/
unzip CrocoLakePHY.zip
For CrocoLake BGC:
wget "https://whoi-my.sharepoint.com/:u:/g/personal/enrico_milanese_whoi_edu/EbjTk9CJgCZJlkvPmwI38NsBQQvUL6MXkTLBAPV5jZutVg?e=RPY9vP&download=1" -O ./CrocoLake/CrocoLakeBGC/CrocoLakeBGC.zip
cd ./CrocoLake/CrocoLakeBGC/
unzip CrocoLakeBGC.zip
You can set up your own paths, but if you followed the previous steps you should now have a folder structure that is already compatible with the example notebooks, like this (listing folders only):

|--- crocolakeloader
|    |--- CrocoLake/
|    |    |--- CrocoLakePHY/
|    |    |    |--- 1002_PHY_ARGO-QC-DEV_2025-02-15
|    |    |    |--- 1101_PHY_GLODAP_2024-12-11
|    |    |    |--- 1201_PHY_SPRAY_2024-12-11
|    |    |--- CrocoLakeBGC/
|    |    |    |--- 1002_BGC_ARGO-QC-DEV_2025-02-15
|    |    |    |--- 1101_BGC_GLODAP_2024-12-11
|    |--- notebooks/
|    |--- test/
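Before moving on, it can be handy to check that the expected folders are in place. The snippet below is not part of the package; it is a minimal sketch that assumes the structure above and simply reports which dataset folders are missing (folder names and dates may differ in newer releases):

```python
from pathlib import Path

# Hypothetical root path; adjust it to wherever you unzipped the archives.
db_rootpath = Path("./CrocoLake")

# Dataset folder names as in the listing above (versions/dates may change).
expected = {
    "CrocoLakePHY": [
        "1002_PHY_ARGO-QC-DEV_2025-02-15",
        "1101_PHY_GLODAP_2024-12-11",
        "1201_PHY_SPRAY_2024-12-11",
    ],
    "CrocoLakeBGC": [
        "1002_BGC_ARGO-QC-DEV_2025-02-15",
        "1101_BGC_GLODAP_2024-12-11",
    ],
}

# Collect any expected folder that does not exist on disk.
missing = [
    db_rootpath / db / sub
    for db, subs in expected.items()
    for sub in subs
    if not (db_rootpath / db / sub).is_dir()
]
print(f"{len(missing)} expected folders not found")
```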
What follows is a brief guide on how to load data from CrocoLake. More examples (including how to manipulate the data) are in the notebooks folder.
Before going ahead, remember to download CrocoLake if you haven't already.
The simplest way to load it into your working space is through the Loader class:
from crocolakeloader.loader import Loader
loader = Loader(
db_type="PHY", # specify "PHY" or "BGC" for physical or biogeochemical databases
db_rootpath="/path/to/my/CrocoLake"
)
ddf = loader.get_dataframe()
Loader() needs at minimum the database type ("PHY" or "BGC") and the path to the database. get_dataframe() returns a dask dataframe. If you're not familiar with dask, you can think of it as a wrapper for dealing with data larger than what your machine's memory can load. A dask dataframe behaves almost identically to a pandas dataframe, and if you do want a pandas dataframe, you can just do (but DON'T do it, yet):
df = ddf.compute()
Note that this will load into memory all the data that ddf references: our first simple example would load more data than most systems can handle, so let's see how we can apply some filters.
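If the lazy-evaluation idea behind this is unfamiliar, the following toy model (plain Python, not dask or CrocoLakeLoader) illustrates it: each step builds a recipe rather than data, and memory is only used at the final materialization step, which plays the role of .compute():

```python
# "ddf": a lazy source of a million rows; almost no memory used yet.
rows = ({"TEMP": t} for t in range(1_000_000))

# Still lazy, analogous to filtering a dask dataframe: nothing has run.
warm = (r for r in rows if r["TEMP"] >= 999_998)

# Analogous to ddf.compute(): only now do (the surviving) data enter memory.
result = list(warm)
print(result)  # [{'TEMP': 999998}, {'TEMP': 999999}]
```

This is why filtering before calling compute() matters: only the rows that survive the filters are ever materialized.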
If you want to load only some specific variables (see list here), you can pass a name list to Loader():
selected_variables = [
"LATITUDE",
"LONGITUDE",
"PRES",
"PSAL",
"TEMP"
]
loader = Loader(
selected_variables=selected_variables,
db_type="PHY",
db_rootpath="/path/to/my/CrocoLake"
)
ddf = loader.get_dataframe()
Similarly, you can also filter by data source (list here) with a list:
db_source = ["ARGO"]
loader = Loader(
selected_variables=selected_variables,
db_type="PHY",
db_list=db_source,
db_rootpath="/path/to/my/CrocoLake"
)
ddf = loader.get_dataframe()
Filtering by values (i.e. row-wise, e.g. to restrict the geographical coordinates or the time period) requires defining and applying a filter to the loader object:
filters = [
("LATITUDE",'>',5),
("LATITUDE",'<',30),
("LONGITUDE",'>',-90),
("LONGITUDE",'<',-30),
("TEMP",">=",-1e30),
("TEMP","<=",+1e30)
]
loader.set_filters(filters)
ddf = loader.get_dataframe()
Two notes on the filters:
- To discard invalid values (NaNs), request the variable to be inside a very large interval (e.g. between -1e30 and +1e30).
- The filters must be passed in the appropriate format (see the filters option here); it's easier done than explained, but basically a single list contains AND predicates, and outer, parallel lists are combined with OR predicates. In the example above, all conditions must be satisfied for a row to be kept in the dataframe. If we want to keep all the rows with valid temperature or pressure values in the region, we would do:
filters = [
[
("LATITUDE",'>',5),
("LATITUDE",'<',30),
("LONGITUDE",'>',-90),
("LONGITUDE",'<',-30),
("TEMP",">=",-1e30),
("TEMP","<=",+1e30)
],[
("LATITUDE",'>',5),
("LATITUDE",'<',30),
("LONGITUDE",'>',-90),
("LONGITUDE",'<',-30),
("PRES",">=",-1e30),
("PRES","<=",+1e30)
]
]
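The AND/OR semantics of this format can be sketched with a toy evaluator (plain Python, not part of CrocoLakeLoader; the filters are a shortened version of the example above, keeping only the latitude bounds and the validity checks): a row is kept if all predicates of any inner list hold.

```python
import operator

# Map the comparison strings used in the filters to Python operators.
OPS = {">": operator.gt, "<": operator.lt,
       ">=": operator.ge, "<=": operator.le, "==": operator.eq}

def keep_row(row, filters):
    # Outer lists are combined with OR; each inner list is an AND of predicates.
    return any(
        all(OPS[op](row[col], val) for (col, op, val) in group)
        for group in filters
    )

filters = [
    [("LATITUDE", ">", 5), ("LATITUDE", "<", 30),
     ("TEMP", ">=", -1e30), ("TEMP", "<=", 1e30)],
    [("LATITUDE", ">", 5), ("LATITUDE", "<", 30),
     ("PRES", ">=", -1e30), ("PRES", "<=", 1e30)],
]

# NaN fails every comparison, so this row fails the TEMP group,
# but its valid PRES value satisfies the second group: the row is kept.
row = {"LATITUDE": 10.0, "TEMP": float("nan"), "PRES": 200.0}
print(keep_row(row, filters))  # True
```

Note how the very-large-interval trick from the first bullet works here: NaN compares false against any bound, so a NaN value silently drops the row out of that predicate group.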
As of this release, CrocoLake includes all the Argo physical and biogeochemical data present in the GDAC, GLODAP's database, and QC-ed observations from Spray Gliders.
We are always working on including new sources, and the next candidates are the North Atlantic CPR Survey and the Oleander project.
If you are interested in a particular dataset to be added, get in touch!
If you are interested in how CrocoLake is generated, check out CrocoLake-Tools.