GeoRic Dataset for Image Captioning

Version 1.0

Sources

The dataset contains photographs, captions and geographic metadata extracted from the Geograph project website.

Dataset

Description

The dataset consists of 29,038 images, each with its caption and the coordinates of the image location (latitude and longitude). The captions are naturally produced and contain extensive geographic referencing (mentions of geographic entities in relation to the photograph location). An example entry in the dataset:

UUID	URL	Caption	Latitude	Longitude	Split
709ca41e-c1f3-4ba5-96e2-c49b58f9df8f	https://www.geograph.org.uk/photo/3079623	Farmland to the west of Burnham Market	52.93659	0.70376	train

Files

georic_dataset_v1.0.csv: the GeoRic dataset
georic_images_v1.0.zip: an archived folder with the images from the GeoRic dataset
- Every image in this folder is extracted from the Geograph project website. All rights to the photographs belong to the original copyright owners.
  - The URLs of the original photographs are provided in the GeoRic dataset.
- Every image is linked to an entry in the GeoRic dataset by its name:
  - georic_images/{UUID}.jpg corresponds to the dataset entry whose UUID is {UUID}
  - E.g. georic_images/709ca41e-c1f3-4ba5-96e2-c49b58f9df8f.jpg corresponds to the example dataset entry above.
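
The file naming convention above can be sketched in Python. This is a minimal illustration, not part of the dataset distribution: the helper names `read_entries` and `image_path` are hypothetical, the column names are taken from the example entry, and the comma delimiter is an assumption (pass `delimiter="\t"` if the file turns out to be tab-separated).

```python
import csv
import io
from pathlib import Path


def read_entries(stream, delimiter=","):
    """Yield one dict per dataset row, keyed by the header names.

    Assumption: the header row matches the example entry
    (UUID, URL, Caption, Latitude, Longitude, Split).
    """
    reader = csv.DictReader(stream, delimiter=delimiter)
    for row in reader:
        # Coordinates are read as text; convert them to floats.
        row["Latitude"] = float(row["Latitude"])
        row["Longitude"] = float(row["Longitude"])
        yield row


def image_path(uuid, images_dir="georic_images"):
    """Return the expected image location for the entry with this UUID."""
    return Path(images_dir) / f"{uuid}.jpg"


# The example entry from the Description section, written as comma-separated text.
sample = io.StringIO(
    "UUID,URL,Caption,Latitude,Longitude,Split\n"
    "709ca41e-c1f3-4ba5-96e2-c49b58f9df8f,"
    "https://www.geograph.org.uk/photo/3079623,"
    "Farmland to the west of Burnham Market,52.93659,0.70376,train\n"
)
entries = list(read_entries(sample))

# Select one split and resolve the image file for each of its entries.
train = [entry for entry in entries if entry["Split"] == "train"]
for entry in train:
    print(image_path(entry["UUID"]).as_posix(), "|", entry["Caption"])
```

To read the real file, replace the in-memory sample with `open("georic_dataset_v1.0.csv", newline="", encoding="utf-8")`; the encoding is also an assumption.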

License

The dataset is licensed for reuse under the Creative Commons Attribution-ShareAlike 4.0 International License.

Disclaimer

The GeoRic dataset is created for research purposes only. We claim no ownership of the photographs in the dataset; the rights to the photographs and all the related materials belong to the original copyright owners.

Citing GeoRic dataset

@inproceedings{nikiforova2020geo,
  title={Geo-Aware Image Caption Generation},
  author={Nikiforova, Sofia and Deoskar, Tejaswini and Paperno, Denis and Winter, Yoad},
  booktitle={Proceedings of the 28th International Conference on Computational Linguistics},
  pages={3143--3156},
  year={2020}
}

Dataset statistics

Basic statistics

	Number of captions	Length in tokens	Number of vocabulary words	Number of geo-entities
Total	29,038	289,028	229,429	59,599
Average (per caption)		9.95	7.9	2.05
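
The per-caption averages follow from the totals in the table. The short check below is only a sketch; it assumes that the token total is the sum of the vocabulary-word and geo-entity columns, which the figures are consistent with. The variable names are illustrative.

```python
# Totals copied from the "Basic statistics" table.
num_captions = 29_038
total_tokens = 289_028
vocabulary_tokens = 229_429
geo_entity_tokens = 59_599

# Assumed reading of the table: the two right-hand columns partition the tokens.
columns_add_up = vocabulary_tokens + geo_entity_tokens == total_tokens

# Per-caption averages, rounded to two decimals as in the table.
avg_tokens = round(total_tokens / num_captions, 2)
avg_vocabulary = round(vocabulary_tokens / num_captions, 2)
avg_geo_entities = round(geo_entity_tokens / num_captions, 2)

print(columns_add_up, avg_tokens, avg_vocabulary, avg_geo_entities)
```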

Spatial prepositions statistics

The number of captions that contain a given spatial preposition.

	Number of captions
Near	9602
In	7090
Along	2359
Across	1776
North of	2233
South of	2058
East of	1267
West of	1374
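
A count of this kind could be computed along the following lines. This is a rough sketch using case-insensitive whole-word matching; the matching procedure used to produce the table above is not stated, so this code is an assumption and may not reproduce the exact figures. The function name `count_captions_with_prepositions` is hypothetical.

```python
import re
from collections import Counter

# The spatial prepositions listed in the table.
PREPOSITIONS = [
    "near", "in", "along", "across",
    "north of", "south of", "east of", "west of",
]

# Word boundaries keep "in" from matching inside longer words such as "inn".
PATTERNS = {
    prep: re.compile(r"\b" + re.escape(prep) + r"\b", re.IGNORECASE)
    for prep in PREPOSITIONS
}


def count_captions_with_prepositions(captions):
    """Count, for each preposition, the captions that contain it at least once."""
    counts = Counter()
    for caption in captions:
        for prep, pattern in PATTERNS.items():
            if pattern.search(caption):
                counts[prep] += 1
    return counts


# The example caption from the Description section.
counts = count_captions_with_prepositions(
    ["Farmland to the west of Burnham Market"]
)
print(dict(counts))
```

A caption is counted once per preposition regardless of how many times the preposition occurs in it, which matches the table's definition of "the number of captions that contain a given spatial preposition".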
