13 changes: 13 additions & 0 deletions doc/release-notes/11254-croissant-builtin.md
@@ -0,0 +1,13 @@
## Croissant Support Is Now Built In

Croissant is a metadata export format for machine learning datasets that, until this release, was optional and implemented as an external exporter. The code has been merged into the main Dataverse code base, which means the Croissant format is automatically available in your installation of Dataverse, alongside older formats like Dublin Core and DDI. If you were using the external Croissant exporter, the merged code is equivalent to version 0.1.6. Croissant bugs and feature requests should now be filed against the main Dataverse repo (https://github.com/IQSS/dataverse) and the old repo (https://github.com/gdcc/exporter-croissant) should be considered retired.

As described in the [Discoverability](https://dataverse-guide--12130.org.readthedocs.build/en/12130/admin/discoverability.html#id6) section of the Admin Guide, Croissant is inserted into the "head" of the HTML of dataset landing pages, as requested by the [Google Dataset Search](https://datasetsearch.research.google.com) team so that their tool can filter by datasets that support Croissant. In previous versions of Dataverse, when Croissant was optional and hadn't been enabled, we used the older "Schema.org JSON-LD" format in the "head". If you'd like to keep this behavior, you can use the feature flag [dataverse.legacy.schemaorg-in-html-head](https://dataverse-guide--12130.org.readthedocs.build/en/12130/installation/config.html#dataverse.legacy.schemaorg-in-html-head).

We are aware that the amount of data in the "head" of the HTML can grow quite large for both Croissant and Schema.org JSON-LD. This is especially true of Croissant which exposes variable-level information. We plan to address this in https://github.com/IQSS/dataverse/issues/12123 . We also plan to support Croissant 1.1 in the future and are tracking this at https://github.com/IQSS/dataverse/issues/12014 .

See also #11254 and #12130.

## New Settings

- dataverse.legacy.schemaorg-in-html-head
17 changes: 9 additions & 8 deletions doc/sphinx-guides/source/admin/discoverability.rst
@@ -30,21 +30,22 @@ The HTML source of a dataset landing page includes "DC" (Dublin Core) ``<meta>``
<meta name="DC.type" content="Dataset"
<meta name="DC.title" content="..."

.. _schema.org-head:
.. _croissant-head:

Schema.org JSON-LD/Croissant Metadata
+++++++++++++++++++++++++++++++++++++
Croissant Metadata in the ``<head>`` of Dataset Landing Pages
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

The ``<head>`` of the HTML source of a dataset landing page includes Schema.org JSON-LD metadata like this::
`Croissant <https://github.com/mlcommons/croissant>`_ is a metadata format for machine learning datasets.

In Dataverse, the ``<head>`` of the HTML source of a dataset landing page includes Croissant metadata like this::

<script type="application/ld+json">{"@context":"http://schema.org","@type":"Dataset","@id":"https://doi.org/...
<script type="application/ld+json">{"@context":..."cr":"http://mlcommons.org/croissant/"...

If you enable the Croissant metadata export format (see :ref:`external-exporters`) the ``<head>`` will show Croissant metadata instead. It looks similar, but you should see ``"cr": "http://mlcommons.org/croissant/"`` in the output.
This is the same Croissant file you can download from a dataset landing page by clicking "Metadata" then "Export Metadata" (see :ref:`metadata-export-formats`) or retrieve through the API (see ``croissant`` at :ref:`export-dataset-metadata-api`).

For backward compatibility, if you enable Croissant, the older Schema.org JSON-LD format (``schema.org`` in the API) will still be available from both the web interface (see :ref:`metadata-export-formats`) and the API (see :ref:`export-dataset-metadata-api`).
We include Croissant in the ``<head>`` because it's `recommended <https://github.com/mlcommons/croissant/issues/530#issuecomment-1964227662>`_ by Google for `Google Dataset Search <https://datasetsearch.research.google.com>`_, where they offer a filter to narrow results to only datasets with support for Croissant.

The Dataverse team has been working with Google on both formats. Google has `indicated <https://github.com/mlcommons/croissant/issues/530#issuecomment-1964227662>`_ that for `Google Dataset Search <https://datasetsearch.research.google.com>`_ (the main reason we started adding this extra metadata in the ``<head>`` of dataset pages), Croissant is the successor to the older format.
Before Croissant was invented, Google recommended a different format that Dataverse refers to as "Schema.org JSON-LD" in the user interface (and ``schema.org`` in the API). If you prefer to put that older format in the ``<head>``, which was the behavior in older versions of Dataverse, see :ref:`dataverse.legacy.schemaorg-in-html-head`.
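
The choice between the two formats described above can be sketched as follows. This is a minimal illustration, not the Dataverse implementation: the class ``HeadFormatSketch``, its methods, and the use of a plain ``Map`` in place of the real settings lookup are all assumptions made for the example. Only the setting name and the ``croissant`` and ``schema.org`` format names come from the guides.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: models the setting lookup with a plain Map.
class HeadFormatSketch {
    static final String LEGACY_FLAG = "dataverse.legacy.schemaorg-in-html-head";

    // An unset flag is treated as false, so Croissant is the default.
    static boolean useLegacyFormatInHead(Map<String, String> settings) {
        return Optional.ofNullable(settings.get(LEGACY_FLAG))
                .map(Boolean::parseBoolean)
                .orElse(false);
    }

    // Returns the API name of the format that would go in the <head>.
    static String formatForHead(Map<String, String> settings) {
        return useLegacyFormatInHead(settings) ? "schema.org" : "croissant";
    }

    public static void main(String[] args) {
        System.out.println(formatForHead(Map.of()));
        System.out.println(formatForHead(Map.of(LEGACY_FLAG, "true")));
    }
}
```

With no setting present the sketch selects ``croissant``; only an explicit ``true`` switches the ``<head>`` back to ``schema.org``.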

.. _discovery-sign-posting:

3 changes: 2 additions & 1 deletion doc/sphinx-guides/source/api/native-api.rst
@@ -2006,6 +2006,7 @@ Available Dataset Metadata Exporters

The following dataset metadata exporters ship with Dataverse:

- ``croissant``
- ``Datacite``
- ``dataverse_json``
- ``dcterms``
@@ -2034,7 +2035,7 @@ Please note that the ``schema.org`` format has changed in backwards-incompatible

Both forms are valid according to Google's Structured Data Testing Tool at https://search.google.com/structured-data/testing-tool . Schema.org JSON-LD is an evolving standard that permits a great deal of flexibility. For example, https://schema.org/docs/gs.html#schemaorg_expected indicates that even when objects are expected, it's ok to just use text. As with all metadata export formats, we will try to keep the Schema.org JSON-LD format backward-compatible to make integrations more stable, despite the flexibility that's afforded by the standard.

The standard has further evolved into a format called Croissant. For details, see :ref:`schema.org-head` in the Admin Guide.
The standard has further evolved into a format called Croissant. For details, see :ref:`croissant-head` in the Admin Guide.

The ``schema.org`` format changed after Dataverse 6.4 as well. Previously its content type was "application/json" but now it is "application/ld+json".

15 changes: 14 additions & 1 deletion doc/sphinx-guides/source/developers/coding-style.rst
@@ -13,6 +13,8 @@ Java
Formatting Code
~~~~~~~~~~~~~~~

How to format Java code is being discussed on `Zulip <https://dataverse.zulipchat.com/#narrow/channel/379673-dev/topic/code.20formatting.20.28Spotless.2C.20Checkstyle.2C.20etc.2E.29/near/432974039>`_ and the `dev mailing list <https://groups.google.com/g/dataverse-dev/c/y2Jpk3szTf8/m/NhTJvXblAgAJ>`_.

Tabs vs. Spaces
^^^^^^^^^^^^^^^

@@ -59,10 +61,21 @@ Place curly braces according to the style below, which is an example you can see
}
}

Format Code with Spotless
^^^^^^^^^^^^^^^^^^^^^^^^^

In some of our libraries we've had success formatting code with `Spotless <https://github.com/diffplug/spotless>`_. See https://github.com/gdcc/xoai/issues/35 for an early discussion.

We've added Spotless to the main repo but have limited it to certain files. If you'd like to use Spotless on files you're editing, add them to the ``<includes>`` list of the Spotless plugin configuration in ``pom.xml``.

To run Spotless on your code:

``mvn spotless:apply``

Format Code You Changed with Netbeans
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

IQSS has standardized on Netbeans. It is much appreciated when you format your code (but only the code you touched!) using the out-of-the-box Netbeans configuration. If you have created an entirely new Java class, you can just click Source -> Format. If you are adjusting code in an existing class, highlight the code you changed and then click Source -> Format. Keeping the "diff" in your pull requests small makes them easier to code review.
For a long time IQSS standardized on Netbeans. For files not included in the Spotless config mentioned above, it is much appreciated when you format your code (but only the code you touched!) using the out-of-the-box Netbeans configuration. If you have created an entirely new Java class, you can just click Source -> Format. If you are adjusting code in an existing class, highlight the code you changed and then click Source -> Format. Keeping the "diff" in your pull requests small makes them easier to code review.

Checking Your Formatting With Checkstyle
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
3 changes: 1 addition & 2 deletions doc/sphinx-guides/source/installation/advanced.rst
@@ -136,9 +136,8 @@ Use the :ref:`dataverse.spi.exporters.directory` configuration option to specify
Inventory of External Exporters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For a list of external exporters, see the README at https://github.com/gdcc/dataverse-exporters. To highlight a few:
For a list of external exporters, see the README at https://github.com/gdcc/dataverse-exporters. For example:

- Croissant
- RO-Crate

Developing New Exporters
10 changes: 9 additions & 1 deletion doc/sphinx-guides/source/installation/config.rst
@@ -3851,6 +3851,15 @@ Example: ``dataverse.api.mdc.min-delay-ms=100`` (enforces a minimum 100ms delay

Can also be set via any `supported MicroProfile Config API source`_, e.g. the environment variable ``DATAVERSE_API_MDC_MIN_DELAY_MS``.

.. _dataverse.legacy.schemaorg-in-html-head:

dataverse.legacy.schemaorg-in-html-head
+++++++++++++++++++++++++++++++++++++++

Set ``dataverse.legacy.schemaorg-in-html-head=true`` to put the legacy format (Schema.org JSON-LD) instead of Croissant in the ``<head>`` of dataset landing pages. Defaults to ``false``. See :ref:`croissant-head`.

Can also be set via any `supported MicroProfile Config API source`_, e.g. the environment variable ``DATAVERSE_LEGACY_SCHEMAORG_IN_HTML_HEAD``.
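
The environment variable name above follows from the property name. As a rough sketch, assuming the usual MicroProfile Config convention of replacing each non-alphanumeric character with an underscore and then uppercasing, the mapping looks like this. The class ``EnvVarNameSketch`` and its method are hypothetical helpers for illustration, not part of Dataverse or MicroProfile Config.

```java
import java.util.Locale;

// Hypothetical helper showing how a property name maps to an environment variable name.
class EnvVarNameSketch {
    static String toEnvVarName(String propertyName) {
        // Replace every character that is not a letter or digit with "_", then uppercase.
        return propertyName.replaceAll("[^A-Za-z0-9]", "_").toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Prints DATAVERSE_LEGACY_SCHEMAORG_IN_HTML_HEAD
        System.out.println(toEnvVarName("dataverse.legacy.schemaorg-in-html-head"));
    }
}
```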

.. dataverse.ldn

Linked Data Notifications (LDN) Allowed Hosts
@@ -4033,7 +4042,6 @@ Only contact DataCite to update a DOI after checking to see if DataCite has outd




.. _:ApplicationServerSettings:

Application Server Settings
4 changes: 2 additions & 2 deletions doc/sphinx-guides/source/user/dataset-management.rst
@@ -28,6 +28,7 @@ Supported Metadata Export Formats

Once a dataset has been published, its metadata can be exported in a variety of other metadata standards and formats, which help make datasets more :doc:`discoverable </admin/discoverability>` and usable in other systems, such as other data repositories. On each dataset page's metadata tab, the following exports are available:

- Croissant
- Dublin Core
- DDI (Data Documentation Initiative Codebook 2.5)
- DDI HTML Codebook (A more human-readable, HTML version of the DDI Codebook 2.5 metadata export)
@@ -37,9 +38,8 @@ Once a dataset has been published, its metadata can be exported in a variety of
- OpenAIRE
- Schema.org JSON-LD

Additional formats can be enabled. See :ref:`inventory-of-external-exporters` in the Installation Guide. To highlight a few:
Additional formats can be enabled. See :ref:`inventory-of-external-exporters` in the Installation Guide. For example:

- Croissant
- RO-Crate

Each of these metadata exports contains the metadata of the most recently published version of the dataset.
24 changes: 24 additions & 0 deletions pom.xml
@@ -1116,6 +1116,30 @@
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>com.diffplug.spotless</groupId>
                <artifactId>spotless-maven-plugin</artifactId>
                <version>3.2.1</version>
                <configuration>
                    <java>
                        <includes>
                            <include>src/main/java/edu/harvard/iq/dataverse/export/CroissantExporter.java</include>
                            <include>src/test/java/edu/harvard/iq/dataverse/export/CroissantExporterTest.java</include>
                        </includes>
                        <importOrder>
                            <wildcardsLast>false</wildcardsLast>
                        </importOrder>
                        <removeUnusedImports>
                            <engine>google-java-format</engine>
                        </removeUnusedImports>
                        <googleJavaFormat>
                            <version>1.17.0</version>
                            <style>AOSP</style>
                            <reflowLongStrings>true</reflowLongStrings>
                        </googleJavaFormat>
                    </java>
                </configuration>
            </plugin>
        </plugins>
    </build>
    <profiles>
4 changes: 4 additions & 0 deletions src/main/java/edu/harvard/iq/dataverse/DatasetPage.java
@@ -1485,6 +1485,10 @@ public boolean canSeeCurationStatus() {
        }
    }

    public boolean isUseLegacyFormatInHead() {
        return JvmSettings.SCHEMAORG_IN_HTML_HEAD.lookupOptional(Boolean.class).orElse(false);
    }

    /*
     * 4.2.1 optimization.
     * HOWEVER, this doesn't appear to be saving us anything!