-
Notifications
You must be signed in to change notification settings - Fork 10
feat: Add MODS metadata fields to complement Dublin Core in ArticleOAIIndex #1213
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds 25 MODS (Metadata Object Description Schema) fields to the ArticleOAIIndex to complement the existing Dublin Core metadata with more structured bibliographic information. The additions enable advanced bibliometric analysis, better interoperability with institutional repositories, and compliance with funding agency mandates.
Key Changes:
- Added 25 new MODS field declarations organized into 7 semantic groups (name, subject, relatedItem, part, identifier, originInfo, accessCondition)
- Implemented 25 corresponding
prepare_*methods to extract and format data from Article model relationships - Added 2 private helper methods for formatting affiliation and location data
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
article/search_indexes.py
Outdated
| if (pub_history.organization.acronym and | ||
| pub_history.organization.acronym.strip()): | ||
| pub_name += f" ({pub_history.organization.acronym.strip()})" | ||
|
|
||
| # Fallback para estrutura Institution legada | ||
| elif (pub_history.institution and | ||
| pub_history.institution.institution and | ||
| pub_history.institution.institution.institution_identification and | ||
| pub_history.institution.institution.institution_identification.name): | ||
|
|
||
| inst_id = pub_history.institution.institution.institution_identification | ||
| pub_name = inst_id.name.strip() | ||
|
|
||
| if inst_id.acronym and inst_id.acronym.strip(): | ||
| pub_name += f" ({inst_id.acronym.strip()})" |
Copilot
AI
Nov 12, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Code duplication in acronym handling. Lines 1458-1460 and 1471-1472 have identical logic for appending acronyms in parentheses. Consider extracting to a helper method:
def _format_name_with_acronym(self, name, acronym):
if not name:
return None
result = name.strip()
if acronym and acronym.strip():
result += f" ({acronym.strip()})"
return result
article/search_indexes.py
Outdated
| if (hasattr(obj, 'pub_date_month') and obj.pub_date_month and | ||
| str(obj.pub_date_month).strip()): | ||
| date_parts.append(str(obj.pub_date_month).zfill(2)) | ||
|
|
||
| if (hasattr(obj, 'pub_date_day') and obj.pub_date_day and | ||
| str(obj.pub_date_day).strip()): | ||
| date_parts.append(str(obj.pub_date_day).zfill(2)) |
Copilot
AI
Nov 12, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential issue with date formatting: The code uses zfill(2) to pad month and day values (lines 1575, 1579), but doesn't validate that the values are numeric or within valid ranges. If pub_date_month or pub_date_day contain non-numeric strings, zfill will still work but produce invalid dates like "2023-ab-15". Consider adding validation or using proper date formatting.
| if (hasattr(obj, 'pub_date_month') and obj.pub_date_month and | |
| str(obj.pub_date_month).strip()): | |
| date_parts.append(str(obj.pub_date_month).zfill(2)) | |
| if (hasattr(obj, 'pub_date_day') and obj.pub_date_day and | |
| str(obj.pub_date_day).strip()): | |
| date_parts.append(str(obj.pub_date_day).zfill(2)) | |
| # Validate and add month | |
| if (hasattr(obj, 'pub_date_month') and obj.pub_date_month and | |
| str(obj.pub_date_month).strip()): | |
| try: | |
| month_int = int(obj.pub_date_month) | |
| if 1 <= month_int <= 12: | |
| date_parts.append(f"{month_int:02d}") | |
| # Validate and add day only if month is valid | |
| if (hasattr(obj, 'pub_date_day') and obj.pub_date_day and | |
| str(obj.pub_date_day).strip()): | |
| try: | |
| day_int = int(obj.pub_date_day) | |
| if 1 <= day_int <= 31: | |
| date_parts.append(f"{day_int:02d}") | |
| except (ValueError, TypeError): | |
| pass | |
| except (ValueError, TypeError): | |
| pass |
article/search_indexes.py
Outdated
| ) | ||
|
|
||
| # MODS - originInfo | ||
| # Editora(s) |
Copilot
AI
Nov 12, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spelling error: "Editora(s)" should use English "Publisher(s)" to maintain consistency with the codebase (all other comments are in English except Portuguese JUSTIFICATIVA sections).
| # Editora(s) | |
| # Publisher(s) |
article/search_indexes.py
Outdated
| researchers = obj.researchers.prefetch_related( | ||
| 'researcheraka_set__researcher_identifier' | ||
| ).all() |
Copilot
AI
Nov 12, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The prefetch_related path 'researcheraka_set__researcher_identifier' may cause N+1 queries if researcher_identifier has foreign keys. Consider using select_related for the researcher_identifier relationship as well: prefetch_related(Prefetch('researcheraka_set', queryset=ResearcherAKA.objects.select_related('researcher_identifier')))
article/search_indexes.py
Outdated
| orcid = aka.researcher_identifier.identifier.strip() | ||
| if orcid and orcid not in orcids: | ||
| orcids.append(orcid) |
Copilot
AI
Nov 12, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The logic checks if orcid and orcid not in orcids before appending, which should prevent duplicates. However, this check happens inside nested loops (researchers -> aka), and the same ORCID from the same researcher could still be checked multiple times unnecessarily. The current implementation is correct but not optimal - consider collecting into a set first, then converting to list at the end for better performance with large datasets.
article/search_indexes.py
Outdated
| location = institution.location | ||
|
|
||
| # Cidade | ||
| if location.city and hasattr(location.city, 'name') and location.city.name: | ||
| parts.append(location.city.name.strip()) | ||
|
|
||
| # Estado | ||
| if location.state: | ||
| state_text = location.state.acronym or location.state.name | ||
| if state_text: | ||
| parts.append(state_text.strip()) | ||
|
|
||
| # País | ||
| if location.country and location.country.name: | ||
| parts.append(location.country.name.strip()) | ||
|
|
||
| except (AttributeError, TypeError): | ||
| # Fallback para string simples | ||
| return str(affiliation) if affiliation else None | ||
|
|
||
| return " - ".join(parts) if parts else None | ||
|
|
||
| def _format_place_location(self, location): | ||
| """ | ||
| Formata localização em string: "Cidade, Estado, País" | ||
| Args: | ||
| location: objeto Location | ||
| Returns: | ||
| str: "São Paulo, SP, Brasil" | ||
| """ | ||
| if not location: | ||
| return None | ||
|
|
||
| parts = [] | ||
|
|
||
| try: | ||
| # Cidade | ||
| if location.city and hasattr(location.city, 'name') and location.city.name: | ||
| parts.append(location.city.name.strip()) | ||
|
|
||
| # Estado (preferir sigla) | ||
| if location.state: | ||
| state_text = location.state.acronym or location.state.name | ||
| if state_text: | ||
| parts.append(state_text.strip()) | ||
|
|
||
| # País | ||
| if location.country and location.country.name: | ||
| parts.append(location.country.name.strip()) | ||
|
|
||
| except (AttributeError, TypeError): | ||
| return None | ||
|
|
||
| return ", ".join(parts) if parts else None | ||
|
|
Copilot
AI
Nov 12, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Code duplication between _format_affiliation and _format_place_location. Both methods have similar logic for formatting location information (city, state, country). The main difference is the separator (- vs , ) and the inclusion of institution info in _format_affiliation. Consider extracting the common location formatting logic into a separate helper method to reduce duplication.
| location = institution.location | |
| # Cidade | |
| if location.city and hasattr(location.city, 'name') and location.city.name: | |
| parts.append(location.city.name.strip()) | |
| # Estado | |
| if location.state: | |
| state_text = location.state.acronym or location.state.name | |
| if state_text: | |
| parts.append(state_text.strip()) | |
| # País | |
| if location.country and location.country.name: | |
| parts.append(location.country.name.strip()) | |
| except (AttributeError, TypeError): | |
| # Fallback para string simples | |
| return str(affiliation) if affiliation else None | |
| return " - ".join(parts) if parts else None | |
| def _format_place_location(self, location): | |
| """ | |
| Formata localização em string: "Cidade, Estado, País" | |
| Args: | |
| location: objeto Location | |
| Returns: | |
| str: "São Paulo, SP, Brasil" | |
| """ | |
| if not location: | |
| return None | |
| parts = [] | |
| try: | |
| # Cidade | |
| if location.city and hasattr(location.city, 'name') and location.city.name: | |
| parts.append(location.city.name.strip()) | |
| # Estado (preferir sigla) | |
| if location.state: | |
| state_text = location.state.acronym or location.state.name | |
| if state_text: | |
| parts.append(state_text.strip()) | |
| # País | |
| if location.country and location.country.name: | |
| parts.append(location.country.name.strip()) | |
| except (AttributeError, TypeError): | |
| return None | |
| return ", ".join(parts) if parts else None | |
| location_str = self._format_location_parts(institution.location, separator=" - ") | |
| if location_str: | |
| parts.append(location_str) | |
| except (AttributeError, TypeError): | |
| # Fallback para string simples | |
| return str(affiliation) if affiliation else None | |
| return " - ".join(parts) if parts else None | |
| def _format_location_parts(self, location, separator=", "): | |
| """ | |
| Helper to format location as "City, State, Country" or with custom separator. | |
| Returns None if location is not valid. | |
| """ | |
| if not location: | |
| return None | |
| parts = [] | |
| try: | |
| # Cidade | |
| if location.city and hasattr(location.city, 'name') and location.city.name: | |
| parts.append(location.city.name.strip()) | |
| # Estado (preferir sigla) | |
| if location.state: | |
| state_text = location.state.acronym or location.state.name | |
| if state_text: | |
| parts.append(state_text.strip()) | |
| # País | |
| if location.country and location.country.name: | |
| parts.append(location.country.name.strip()) | |
| except (AttributeError, TypeError): | |
| return None | |
| return separator.join(parts) if parts else None | |
| def _format_place_location(self, location): | |
| """ | |
| Formata localização em string: "Cidade, Estado, País" | |
| Args: | |
| location: objeto Location | |
| Returns: | |
| str: "São Paulo, SP, Brasil" | |
| """ | |
| return self._format_location_parts(location, separator=", ") |
article/search_indexes.py
Outdated
|
|
||
| def prepare_mods_part_elocation(self, obj): | ||
| """ | ||
| elocation-id (para artigos sem paginação tradicional) |
Copilot
AI
Nov 12, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Inconsistent capitalization in comment: "elocation-id" starts with lowercase while all other field type comments in this section use title case or proper capitalization (e.g., "Página inicial do artigo", "ISSN da versão impressa"). Consider using "Elocation-id (para artigos sem paginação tradicional)" for consistency.
| elocation-id (para artigos sem paginação tradicional) | |
| Elocation-id (para artigos sem paginação tradicional) |
article/search_indexes.py
Outdated
| Exemplo: "e370704" | ||
| """ | ||
| # Só retorna elocation se NÃO houver paginação tradicional | ||
| if not obj.first_page and obj.elocation_id: |
Copilot
AI
Nov 12, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic for elocation_id returns None if first_page exists, but doesn't check if first_page is actually a valid/non-empty value. If first_page is an empty string or whitespace-only, the method will return None even though there's a valid elocation_id. Consider checking: if not obj.first_page or not str(obj.first_page).strip() instead of just if not obj.first_page.
| if not obj.first_page and obj.elocation_id: | |
| if (not obj.first_page or not str(obj.first_page).strip()) and obj.elocation_id: |
article/search_indexes.py
Outdated
| if not (obj.journal and hasattr(obj.journal, 'publisher_history') and | ||
| obj.journal.publisher_history.exists()): |
Copilot
AI
Nov 12, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Complex nested condition is hard to read. The condition spans lines 1444-1445 and checks three levels deep. Consider extracting to a helper variable for better readability:
has_publisher_history = (obj.journal and
hasattr(obj.journal, 'publisher_history') and
obj.journal.publisher_history.exists())
if not has_publisher_history:
return publishers| if not (obj.journal and hasattr(obj.journal, 'publisher_history') and | |
| obj.journal.publisher_history.exists()): | |
| has_publisher_history = ( | |
| obj.journal and hasattr(obj.journal, 'publisher_history') and obj.journal.publisher_history.exists() | |
| ) | |
| if not has_publisher_history: |
article/search_indexes.py
Outdated
| if pub_name and pub_name not in publishers: | ||
| publishers.append(pub_name) |
Copilot
AI
Nov 12, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Missing strip() call on comparison. Line 1474 checks if pub_name and pub_name not in publishers but doesn't call .strip() on pub_name before the comparison, while it was stripped when building it (lines 1456, 1469). If pub_name somehow contains only whitespace after stripping during construction (which would make it falsy), this could theoretically pass the if pub_name check. However, since the strings are already stripped during construction, this is likely safe but inconsistent with the defensive pattern used elsewhere.
| if pub_name and pub_name not in publishers: | |
| publishers.append(pub_name) | |
| if pub_name: | |
| pub_name = pub_name.strip() | |
| if pub_name and pub_name not in publishers: | |
| publishers.append(pub_name) |
- Extrai métodos auxiliares para eliminar duplicação de código - Adiciona validação de datas com verificação de ranges (mês 1-12, dia 1-31) - Otimiza queries de ORCID com Prefetch e select_related - Usa set para coleta de ORCIDs (O(1) vs O(n) em lookups) - Remove verificações hasattr redundantes em blocos try-except - Adiciona variáveis intermediárias para condições complexas - Valida valores vazios/whitespace antes de comparações
article/search_indexes.py
Outdated
| from haystack import indexes | ||
| from legendarium.formatter import descriptive_format | ||
| from django.db.models import Prefetch | ||
| from researcher.models import ResearcherAKA |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Rossi-Luciano não vai precisar de ResearcherAKA
article/search_indexes.py
Outdated
| ) | ||
| ).all() | ||
|
|
||
| for researcher in researchers: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Rossi-Luciano consulte o modelo Article. Veja como é a relação m2m com Researcher. Consulte o modelo Researcher. Veja que ele tem property orcid:
Line 77 in b48e0d7
| def orcid(self): |
article/search_indexes.py
Outdated
| EXEMPLO XML (MODS 3.7): | ||
| <name type="personal"> | ||
| <namePart>Silva, João</namePart> | ||
| <nameIdentifier type="orcid" authority="orcid"> | ||
| 0000-0001-2345-6789 | ||
| </nameIdentifier> | ||
| <role> | ||
| <roleTerm type="text" authority="marcrelator">author</roleTerm> | ||
| </role> | ||
| </name> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Rossi-Luciano como a partir de uma lista de orcid será possível associar cada orcid ao autor correspondente corretamente?
article/search_indexes.py
Outdated
| thematic_area = thematic_area_journal.thematic_area | ||
|
|
||
| # Usar hierarquia: level2 > level1 > level0 | ||
| topic_text = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Rossi-Luciano qual o motivo deste critério de escolha das áreas? Por que não foi usado todos os níveis?
article/search_indexes.py
Outdated
| if obj.issue and obj.issue.volume: | ||
| return str(obj.issue.volume).strip() | ||
| return None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Rossi-Luciano troque por:
try:
return str(obj.issue.volume).strip()
except (AttributeError, TypeError, ValueError):
return None
article/search_indexes.py
Outdated
| if obj.issue and obj.issue.number: | ||
| return str(obj.issue.number).strip() | ||
| return None | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Rossi-Luciano veja comentário anterior
article/search_indexes.py
Outdated
| str: Designação do suplemento ou None | ||
| Exemplo: "suppl.1" | ||
| """ | ||
| if obj.issue and obj.issue.supplement: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Rossi-Luciano comentário anterior
article/search_indexes.py
Outdated
| has_publisher_history = ( | ||
| obj.journal | ||
| and hasattr(obj.journal, "publisher_history") | ||
| and obj.journal.publisher_history.exists() | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Rossi-Luciano elimine os códigos que testem o .exists() porque isso é uma consulta. Eliminando isso, as consultas ao banco de dados caem pela metade.
article/search_indexes.py
Outdated
|
|
||
| parts = [] | ||
|
|
||
| try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Rossi-Luciano observe que nos próprios modelos este tipo de questão já está resolvida:
Line 554 in b48e0d7
| def formatted_location(self): |
robertatakenaka
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Rossi-Luciano Por favor, fiz alguns comentários que devem se estender a outros casos não comentados. Fazer as correções.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
article/search_indexes.py
Outdated
| if official.issn_print and official.issn_print not in issns: | ||
| issns.append(official.issn_print) | ||
|
|
||
| if official.issn_electronic and official.electronic not in issns: |
Copilot
AI
Nov 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a typo in the attribute name. Line 1038 uses official.electronic but should be official.issn_electronic to check for duplicates correctly. This will cause the condition to fail since electronic is not a valid attribute.
| if official.issn_electronic and official.electronic not in issns: | |
| if official.issn_electronic and official.issn_electronic not in issns: |
article/search_indexes.py
Outdated
| """...""" | ||
| if obj.researchers: | ||
| researchers = obj.researchers.select_related("person_name").filter( | ||
| person_name__isnull=False | ||
| ) | ||
| return set([str(researcher.person_name) for researcher in researchers]) | ||
| return list(set([str(researcher.person_name) for researcher in researchers])) | ||
| return [] | ||
|
|
||
| def prepare_collab(self, obj): | ||
| """This is the instituional author.""" | ||
| if obj.collab: | ||
| return set([collab.collab for collab in obj.collab.all()]) | ||
| return list(set([collab.collab for collab in obj.collab.all()])) | ||
| return [] | ||
|
|
||
| def prepare_kw(self, obj): | ||
| """The keywords of the article.""" | ||
| if obj.keywords: | ||
| return set([keyword.text for keyword in obj.keywords.all()]) | ||
| return list(set([keyword.text for keyword in obj.keywords.all()])) | ||
| return [] | ||
|
|
||
| def prepare_description(self, obj): | ||
| """The abstracts of the articles | ||
| This is a property that filter by article ``DocumentAbstract.objects.filter(article=self)`` | ||
| """ | ||
| """The abstracts of the articles...""" |
Copilot
AI
Nov 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The docstrings for several methods have been reduced to just "..." which loses important documentation. Consider keeping the original docstrings or providing meaningful descriptions. For example, prepare_creator previously explained the select_related optimization, and prepare_description explained the data source.
article/search_indexes.py
Outdated
| def prepare_mods_subject_area(self, obj): | ||
| """...""" | ||
| areas = [] | ||
|
|
||
| # Padronizado: .exists() direto | ||
| if obj.journal and obj.journal.subject.exists(): | ||
| for subject_area in obj.journal.subject.all(): | ||
| if subject_area.value and subject_area.value.strip(): | ||
| areas.append(subject_area.value.strip()) | ||
|
|
||
| return areas | ||
|
|
||
| def prepare_mods_subject_wos(self, obj): | ||
| """...""" | ||
| wos_categories = [] | ||
|
|
||
| # Padronizado: .exists() direto | ||
| if obj.journal and obj.journal.wos_area.exists(): | ||
| for wos_category in obj.journal.wos_area.all(): | ||
| if wos_category.value and wos_category.value.strip(): | ||
| wos_categories.append(wos_category.value.strip()) | ||
|
|
||
| return wos_categories |
Copilot
AI
Nov 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The docstrings for prepare_mods_subject_area and prepare_mods_subject_wos are abbreviated to "..." while other similar methods have comprehensive documentation. For consistency and maintainability, these methods should have full docstrings explaining their purpose, data sources, and MODS mapping like the other methods in this section.
article/search_indexes.py
Outdated
| return ["com_%s" % col for col in obj.collections] | ||
| try: | ||
| return ["com_%s" % col.acron3 for col in obj.collections if col.acron3] | ||
| except Exception: |
Copilot
AI
Nov 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Using a bare except Exception: can hide unexpected errors and make debugging difficult. Consider catching specific exceptions like AttributeError or TypeError that might occur when accessing col.acron3. This makes the error handling more explicit and maintainable.
| except Exception: | |
| except (AttributeError, TypeError): |
O que esse PR faz?
Adiciona 25 campos MODS (Metadata Object Description Schema) ao
ArticleOAIIndex, complementando Dublin Core com metadados bibliográficos estruturados.Campos adicionados (7 grupos):
Benefícios:
Onde a revisão poderia começar?
Arquivo:
article/search_indexes.pyOrdem sugerida:
metadata.mods.*prepare_mods_name_*(linhas ~200-350) - extração de ORCIDs e afiliaçõesprepare_mods_subject_*(linhas ~350-450) - integração com taxonomiasComo este poderia ser testado manualmente?
1. Reindexar Solr:
2. Verificar campos no Solr:
3. Testar queries MODS:
Algum cenário de contexto que queira dar?
NA
Quais são tickets relevantes?
NA
Referências
Documentação MODS:
Elementos específicos:
Padrões relacionados: