# Bulk importer #94

## Conversation
> - ...
> To easily enable search and files integration, we believe these two objects should extend from records like many other modules do (collections, requests, etc.).
>
> The other two objects involved will be the *Resources* and the *Serializers*.
We might want to change this name to avoid confusion with resources and services. Any suggestions?
What do you think about "record type" instead of "resource"? I think we used this name elsewhere too.
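For illustration only, here is a minimal sketch of what "extending from records" could look like. The base class shown is the generic record API from invenio-records; the actual base used by other modules may differ, and the class names are placeholders rather than part of this RFC.

```python
# Hypothetical sketch only: the importer objects could subclass the same kind of
# record base class that other modules (collections, requests, etc.) build on.
# Class names below are placeholders, not the final design.
from invenio_records.api import Record


class ImporterTask(Record):
    """A bulk-import run, stored as a record so it is searchable and can hold files."""


class ImporterRecordType(Record):
    """Describes how rows of a given record type are transformed and validated."""
```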
> - Determine if a group of records gets new DOIs (minted during the process).
> - Update many records at once.
> - Delete many records at once.
> - See the status of past and current uploads.
How long would the retention period be for keeping previous imports?
We are still discussing that internally. Our initial approach would be to keep the import tasks indefinitely and "just" delete the attached files of successfully created/updated records after 3 months (probably configurable).
We also considered an "archive" option to help clean up the interface.
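As a rough illustration of what "probably configurable" could mean, here is a hypothetical pair of configuration values; neither the variable names nor the defaults have been decided.

```python
# Hypothetical configuration sketch; variable names and defaults are assumptions
# drawn from the retention behaviour described above, not the final design.

# Keep finished import tasks indefinitely (no automatic cleanup of the task records).
IMPORTER_TASK_RETENTION_DAYS = None

# Remove the files attached to successfully created/updated records after ~3 months.
IMPORTER_ATTACHED_FILES_RETENTION_DAYS = 90
```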
> GUI-based bulk importing and editing of records and files is a widely desired, highly useful feature, which will help to make the platform appealing to a much broader base of institutional users.
>
> The proposed feature is a beta version of a bulk importer for metadata (in CSV format) and associated files.
Are any other formats also planned?
Initially, we are working only with CSV, but we are designing the tool so anyone can add their preferred format.
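To make "anyone can add their preferred format" concrete, here is a minimal sketch of a pluggable reader interface; the class names and method signature are assumptions, not the actual design.

```python
# Illustrative sketch of a pluggable format reader; names and signatures are
# assumptions. A new format would be supported by adding another reader class.
import csv
from abc import ABC, abstractmethod
from typing import Iterator


class ImportReader(ABC):
    """Turns an uploaded metadata file into one dict per record."""

    @abstractmethod
    def iter_entries(self, stream) -> Iterator[dict]:
        """Yield one metadata dict per record in the file."""


class CSVReader(ImportReader):
    """Default CSV reader; rows are streamed rather than loaded all at once."""

    def iter_entries(self, stream) -> Iterator[dict]:
        yield from csv.DictReader(stream)
```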
> ## Unresolved questions
>
> - Metadata file re-upload to correct errors
> - How do we set a file for preview?
Is there any limit on the import size or the maximum number of files? It might be quite troublesome to handle big imports.
Since I wrote this, we have shown it to potential users, and this has come up on several occasions.
I think we will have the same "limitations" as the current deposit form. To mitigate this, we are considering allowing users to enter known URIs into the files column. Imagine a shared location the service would have access to, such as a bucket on AWS/GCP, or simply a URL that can be fetched.
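A small sketch of how the files column could be interpreted under that idea; the URI schemes handled and the function name are assumptions, not a committed design.

```python
# Hypothetical sketch: a files cell could hold either a plain filename (staged
# upload) or a known URI pointing at a shared location the service can read.
from urllib.parse import urlparse


def classify_file_source(cell_value: str) -> str:
    """Decide where a file referenced in the metadata file should come from."""
    scheme = urlparse(cell_value.strip()).scheme
    if scheme in ("http", "https"):
        return "remote-url"      # fetch directly over HTTP(S)
    if scheme in ("s3", "gs"):
        return "shared-bucket"   # e.g. a bucket the service has access to
    return "staged-upload"       # plain filename uploaded with the import
```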
The problem might affect both the CSV upload and the associated file upload. The CSV/MARCXML file itself will be problematic if it contains a lot of records, for two reasons: the size of the CSV itself, and the time the underlying task will take to process it, depending on how much memory the Celery worker has available.
You're definitely right. We did consider the problem with the Celery workers' memory. Unfortunately, I don't have a solution other than making the process as memory-efficient as possible.
For example, we plan to start a task for each record (row) inside the input file and let that task do the transformation and validation so we will process one record at a time.
I think adding an "artificial" limit to the number of records at this point might not make a lot of sense. Once we have the process in place, and knowing where it can "break", we can load test it and set an informative limit.
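As a sketch of the per-row fan-out described above (one task per record, streaming the input file), assuming Celery is used; the task names, arguments, and the transform/validate steps are placeholders, not the actual implementation.

```python
# Sketch of per-row processing: the parent task streams the CSV and enqueues one
# task per row, so only one record is held in memory at a time per worker.
import csv

from celery import shared_task


@shared_task
def run_import(import_id: str, csv_path: str):
    """Stream the CSV and enqueue one task per row, keeping memory use flat."""
    with open(csv_path, newline="") as fp:
        for row_number, row in enumerate(csv.DictReader(fp), start=1):
            import_row.delay(import_id, row_number, row)


@shared_task
def import_row(import_id: str, row_number: int, row: dict):
    """Transform and validate a single record; a failure only affects this row."""
    # The real task would run the transformation, validation and record creation
    # here, and report the row's status back to the import task.
    ...
```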
> As an administrator user, I want to ...
> - Upload many records (with their files) at once into my instance.
> - Select in which communities the records are published.
> - Determine if a group of records gets new DOIs (minted during the process).
Is this use case already reflected in the mockup?
Yes, on the third screen, there is a checkbox section. For now, it has two options: mint DOIs and publish.
Currently, InvenioRDM's only affordance for the bulk creation, import, and/or editing of records and files requires direct CLI-command-driven engagement with records and files APIs.
GUI-based bulk importing and editing of records and files is a widely desired, highly useful feature, which will help to make the platform appealing to a much broader base of institutional users.
The proposed feature is a beta version of a bulk importer for metadata (in CSV format) and associated files.