Extension tags converted to URIs during parsing #5

@glamberson

Bug Description

When parsing GEDCOM 7 files with extension tags declared in SCHMA, the parser converts the extension tag names to their URI values, making it impossible to identify which extension tag was used.

Steps to Reproduce

import gedcom7

test_gedcom = """0 HEAD
1 GEDC
2 VERS 7.0
1 SCHMA
2 TAG _TEST https://example.com/test
0 @I1@ INDI
1 NAME John /Doe/
1 _TEST Some test data
0 TRLR"""

structures = list(gedcom7.loads(test_gedcom))

for struct in structures:
    if struct.tag == "INDI":
        for child in struct.children:
            print(f"Tag: {child.tag}")

Expected Behavior

The child tag should be _TEST (the actual tag used in the file).

Actual Behavior

The child tag is https://example.com/test (the URI from the SCHMA declaration).

Impact

This makes it impossible to:

  1. Identify which extension tag was actually used in the file
  2. Process extension tags differently based on their tag name
  3. Export GEDCOM 7 files with the same extension tags that were imported

Additional Information

  • Version: gedcom7 0.4.0
  • Python: 3.13.5
  • This behavior is particularly problematic when multiple extension tags map to the same URI for different purposes
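To illustrate the last point, here is a minimal sketch of why the tag cannot be recovered from the URI alone. The `schema` dict and the `tags_for_uri` helper are hypothetical, written only for this example; they are not part of gedcom7, and the second tag `_OTHER` is an invented declaration.

```python
# Hypothetical SCHMA declarations: two extension tags bound to the same URI.
schema = {
    "_TEST": "https://example.com/test",
    "_OTHER": "https://example.com/test",
}


def tags_for_uri(schema, uri):
    """Return every extension tag declared for the given URI, sorted."""
    return sorted(tag for tag, declared in schema.items() if declared == uri)


# Once the parser has replaced the tag with its URI, this reverse lookup
# is all a caller has left, and it yields more than one candidate.
candidates = tags_for_uri(schema, "https://example.com/test")
print(candidates)
```

With more than one candidate, an exporter has no way to know which tag the original file used.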

Suggested Fix

The parser should preserve the original tag name, perhaps storing both the tag and its URI:

  • tag property: The actual tag from the file (e.g., "_TEST")
  • uri or type_id property: The URI from SCHMA (e.g., "https://example.com/test")

This would keep the URI available to code that already relies on it, while allowing access to the original tag name.
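A rough sketch of what such a structure could look like follows. Everything here is an assumption made for illustration: `Structure`, `parse_schema`, and `make_structure` are hypothetical names, not the gedcom7 API, and the SCHMA scan is a simplification that only handles `2 TAG` lines directly under `1 SCHMA`.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Structure:
    """Hypothetical stand-in for a parsed structure (not the gedcom7 class)."""
    tag: str                    # tag exactly as written in the file
    text: str = ""
    uri: Optional[str] = None   # URI from SCHMA, if the tag was declared there
    children: list = field(default_factory=list)


def parse_schema(lines):
    """Collect `2 TAG <tag> <uri>` declarations under `1 SCHMA` into a dict."""
    schema = {}
    in_schema = False
    for line in lines:
        parts = line.split(" ", 3)
        level = parts[0]
        tag = parts[1] if len(parts) > 1 else ""
        if level in ("0", "1"):
            # Any level 0 or level 1 line opens or closes the SCHMA block.
            in_schema = level == "1" and tag == "SCHMA"
        elif in_schema and level == "2" and tag == "TAG" and len(parts) == 4:
            schema[parts[2]] = parts[3]
    return schema


def make_structure(tag, text, schema):
    """Keep the original tag and attach the declared URI, if there is one."""
    return Structure(tag=tag, text=text, uri=schema.get(tag))


header = """0 HEAD
1 GEDC
2 VERS 7.0
1 SCHMA
2 TAG _TEST https://example.com/test"""

schema = parse_schema(header.splitlines())
child = make_structure("_TEST", "Some test data", schema)
print(child.tag, child.uri)
```

With this shape, `child.tag` stays `_TEST` for export and tag-based dispatch, `child.uri` carries the SCHMA value, and standard tags such as `NAME` simply get `uri=None`.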
