
Add Croissant to Signposting "describedby" output #10542

@pdurbin


Today @siacus and I were talking about how dataset landing pages can become heavy when the machine-readable JSON we put in the <head> (Schema.org JSON-LD or Croissant) gets large. In a real-life dataset with 25K files, the Croissant file can be 7.1 MB.

We talked about putting a link to the Croissant file in our Signposting output, like we do for Schema.org JSON-LD. Basically, robots could request just the headers (e.g. with curl --head) and receive a link to the Croissant file, rather than the entire payload, which can be large.
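A minimal sketch of what a robot could do with a headers-only request, assuming the links arrive in a standard HTTP `Link` header. The function names `find_links` and `fetch_link_header` are hypothetical, and the `cite-as` entry in the sample header is illustrative only; the `describedby` URL is the one quoted later in this issue.

```python
import re
import urllib.request

# One link-value: <target> followed by zero or more ;name=value parameters.
_LINK_VALUE = re.compile(
    r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*(?:"[^"]*"|[^\s";,]+))*)'
)
# One parameter: ;name="quoted value" or ;name=bare-value
_PARAM = re.compile(r';\s*([\w-]+)\s*=\s*(?:"([^"]*)"|([^\s";,]+))')


def find_links(link_header, rel):
    """Return the targets of all links in a Link header that carry the given rel."""
    targets = []
    for target, raw_params in _LINK_VALUE.findall(link_header):
        params = {
            name.lower(): quoted or bare
            for name, quoted, bare in _PARAM.findall(raw_params)
        }
        # A rel parameter may hold several space-separated relation types.
        if rel in params.get("rel", "").split():
            targets.append(target)
    return targets


def fetch_link_header(url):
    """Send a HEAD request (headers only, no page body) and return the Link header(s)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return ", ".join(response.headers.get_all("Link") or [])


# Illustrative header value; only the describedby URL comes from this issue.
sample = (
    '<https://doi.org/10.7910/DVN/TJCLKP>;rel="cite-as", '
    "<https://dataverse.harvard.edu/api/datasets/export"
    "?exporter=schema.org&persistentId=doi:10.7910/DVN/TJCLKP>"
    ';rel="describedby"'
)
print(find_links(sample, "describedby"))
```

The point of the sketch is that the client never downloads the landing page body: it reads the header, then fetches only the metadata export it actually wants.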

Unfortunately, people suffering from heavy dataset pages won't get relief until the large content is removed from the <head> of the page, but putting the link in Signposting gives machines an option for the future if the world wants to move in that direction. We already suggested Signposting to the Croissant/Google Dataset Search team at mlcommons/croissant#530 (comment)

In our Signposting output, we already include a link for downloading Schema.org JSON-LD data via API. For example:

<https://dataverse.harvard.edu/api/datasets/export?exporter=schema.org&persistentId=doi:10.7910/DVN/TJCLKP>;rel="describedby"

The Signposting spec seems to allow multiple "describedby" values, but if we prefer to keep a single "describedby" value, we could consider swapping out schema.org for croissant when it's available, like we do for the <head> tag.
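To make the two options concrete, here is a rough sketch, not the actual `SignpostingResources.java` logic. It assumes the Croissant exporter is addressed as `exporter=croissant` on the same export endpoint; the function names and the `croissant_available` and `allow_multiple` flags are hypothetical.

```python
from urllib.parse import urlencode


def export_url(base_url, exporter, persistent_id):
    """Build a metadata export URL in the same shape as the example above."""
    # Keep ':' and '/' unescaped so the DOI reads as it does in the existing output.
    query = urlencode(
        {"exporter": exporter, "persistentId": persistent_id}, safe=":/"
    )
    return f"{base_url}/api/datasets/export?{query}"


def describedby_links(base_url, persistent_id, croissant_available, allow_multiple):
    """Return the describedby portion of a Link header value.

    allow_multiple=True  -> emit both schema.org and croissant links
    allow_multiple=False -> swap schema.org out for croissant when available
    """
    exporters = ["schema.org"]
    if croissant_available:
        # "croissant" as the exporter name is an assumption for this sketch.
        exporters = ["schema.org", "croissant"] if allow_multiple else ["croissant"]
    return ", ".join(
        f'<{export_url(base_url, exporter, persistent_id)}>;rel="describedby"'
        for exporter in exporters
    )


base = "https://dataverse.harvard.edu"
pid = "doi:10.7910/DVN/TJCLKP"
print(describedby_links(base, pid, croissant_available=True, allow_multiple=True))
print(describedby_links(base, pid, croissant_available=True, allow_multiple=False))
```

Whether clients handle more than one "describedby" link gracefully is exactly the question to raise with the Signposting community before picking between the two branches.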

I don't think this is a lot of work. A 3 is probably enough but I'll give it a 10 for reviewing the Signposting spec and talking to that community, if need be, about multiple "describedby" values. The file to edit is SignpostingResources.java as seen in PR #8981.

See also this issue we opened with the Croissant team where we asked for guidance on large Croissant files:

Related issues and PRs:

Metadata


Labels: FY25 Sprint 10 (2024-11-06 - 2024-11-20); FY25 Sprint 11 (2024-11-20 - 2024-12-04); Size: 10 (a percentage of a sprint; 7 hours)
