GEDCOM 7.0 is a breaking change from GEDCOM 5.5.1, so 5.5.1 files cannot be parsed as-is as if they were 7.0 files. This project is a zero-dependency, public-domain ANSI-C implementation of a 5.5.1 to 7.0 converter. C was chosen because it has very few features, so the code should be easy to port to other languages, and because many other languages can call C code natively.
Current status:
- Single-pass operations
  - Detect character encodings, as documented in ELF Serialisation.
  - Convert to UTF-8
  - Normalize line whitespace, including stripping leading spaces
  - Remove `CONC`
  - Normalize case of tags
  - Limit character set of cross-reference identifiers
  - Fix `@` usage
  - Convert `LANG` payloads to BCP 47 tags, using FHISO's mapping
  - Convert `DATE`
    - replace date_phrase with `PHRASE` structure
    - replace calendar escapes with calendar tags
    - change `BC` and `B.C.` to `BCE` and remove if found in unsupported calendars
    - replace dual years with single years and `PHRASE`s
    - replace just-year dual years in unqualified date with `BET`/`AND`
  - Convert `AGE`
    - change age words to canonical forms (stillborn as `0y`, child as `< 8y`, infant as `< 1y`) with `PHRASE`s
    - Normalize spacing in `AGE` payloads
    - add missing `y`
  - Convert `MEDI.FORM` payloads to media types
  - (deferred) Convert `INDI.NAME`
    - (deferred) replace `/surname/` with name part
    - (deferred) combine payload and parts
    - (deferred) convert `_RUFNAME` to `RUFNAME`
  - (deferred) Convert `PLAC` structures to `PLACE` records and `WHERE` pointers thereto
  - Enumerated values
    - Normalize case
    - Convert user-text to `PHRASE`s
  - change `SOUR` with text payload into pointer to `SOUR` with `NOTE`
  - change `NOTE` record or with pointer payload into `SNOTE`
  - change `OBJE` with no payload to pointer to new `OBJE` record
  - Convert `FONE` and `ROMN` to `TRAN` and their `TYPE`s to BCP-47 `LANG`s
  - tag renaming, including
    - `EMAI`, `_EMAIL` → `EMAIL`
    - `FORM.TYPE` → `FORM.MEDI`
    - (deferred) `_SDATE` → `SDATE`; `_SDATE` is also used as an "accessed at" date for web resources by some applications, so this change is not universally correct
    - `_UID` → `UID`
    - `_ASSO` → `ASSO`
    - `_CRE`, `_CREAT` → `CREA`
    - `_DATE` → `DATE`
    - other?
  - `ASSO.RELA` → `ASSO.ROLE` (changing payload OTHER + PHRASE)
  - change `RFN`, `RIN`, and `AFN` to `EXID`
  - change `_FSFTID`, `_APID` to `EXID`
  - remove `SUBN`, `HEAD.FILE`, `HEAD.CHAR`
    - (deferred) `HEAD.PLAC` was originally on this list, but has been deferred to a later version
  - change `FILE` payloads into URLs
    - Windows-style `\` becomes `/`
    - Windows drive letter `C:\WINDOWS` becomes `file:///c:/WINDOWS`
    - POSIX-style `/User/foo` becomes `file:///User/foo`
  - update the `GEDC.VERS` to `7.0`
  - (extra) change string-valued `INDI.ALIA` into `NAME` with `TYPE` `AKA`
  - (5.5) change base64-encoded `OBJE` into GEDZIP
  - Change any illegal tag `XYZ` into `_EXT_XYZ`
- two-pass operations
  - use heuristic to change some pointer-`NOTE` to nested-`NOTE` instead of `SNOTE`
  - add `SCHMA` for all used known extensions
    - add URIs (or standard tags) for all extensions from https://wiki-de.genealogy.net/GEDCOM/_Nutzerdef-Tag and http://www.gencom.org.nz/GEDCOM_tags.html
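
As an illustration of one step in the list above, the three `FILE` payload rules (backslash to slash, Windows drive letter, POSIX absolute path) could be sketched roughly as follows. This is a minimal sketch, not the converter's actual code: the function name `file_payload_to_url` is hypothetical, the drive letter is lowercased only because the example in the list shows `c:`, and no percent-encoding of special characters is attempted.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (an assumption, not the converter's real function).
 * Returns a newly malloc'd string the caller must free, or NULL if
 * allocation fails. */
char *file_payload_to_url(const char *path) {
    size_t len = strlen(path);
    /* The "file:///" prefix adds at most 8 bytes; +1 for the NUL. */
    char *out = malloc(len + 9);
    if (!out) return NULL;

    const char *src = path;
    char *dst = out;

    if (len >= 3 && isalpha((unsigned char)path[0]) && path[1] == ':'
        && (path[2] == '\\' || path[2] == '/')) {
        /* Windows drive letter: C:\WINDOWS -> file:///c:/WINDOWS */
        dst += sprintf(dst, "file:///%c:", tolower((unsigned char)path[0]));
        src += 2; /* skip the drive letter and colon already written */
    } else if (path[0] == '/') {
        /* POSIX absolute path: /User/foo -> file:///User/foo */
        dst += sprintf(dst, "file://");
    }

    /* Copy the remainder, turning Windows-style \ into /. */
    for (; *src; src++) *dst++ = (*src == '\\') ? '/' : *src;
    *dst = '\0';
    return out;
}
```

Under these assumptions, a relative payload such as `photos\img.jpg` only has its separators changed and gets no `file://` prefix.
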
Edit `Makefile` as needed; likely changes include
- Change from `CC := clang` to your C compiler
- If on Windows, change the target from `ged5to7` to `ged5to7.exe`

Then run `make`.
To instead build using Visual Studio, simply open the `c-converter.sln` file with Visual Studio and build the solution normally.
To run, execute the resulting `ged5to7`.
Run `ged5to7 --help` for a list of command-line options.
The code is designed to be thread-safe (no mutable globals or static locals) though threading has not yet been added.
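
The "no mutable globals or static locals" pattern generally means that every piece of mutable state lives in a struct that the caller owns and passes in explicitly. The sketch below shows the idea only; `ConvertState`, `state_init`, and `state_note_line` are hypothetical names and do not correspond to the converter's actual types or functions.

```c
#include <stddef.h>

/* Hypothetical per-conversion state (an assumption for illustration).
 * In a non-reentrant design these fields would be globals or static locals. */
typedef struct {
    size_t lines_seen; /* number of GEDCOM lines processed so far */
    int    max_level;  /* deepest level number seen, -1 if none yet */
} ConvertState;

/* Reset a caller-owned state object before starting a conversion. */
void state_init(ConvertState *st) {
    st->lines_seen = 0;
    st->max_level = -1;
}

/* Record one line's level; all mutable state is reached through st. */
void state_note_line(ConvertState *st, int level) {
    st->lines_seen++;
    if (level > st->max_level) st->max_level = level;
}
```

Because each conversion carries its own `ConvertState`, two conversions running on separate threads would not share any mutable data in this sketch.
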
The code is currently a first draft, written by someone who does not usually write large code bases for others to read.
It has inconsistent naming (e.g., `ged_destroy_event` vs `changePayloadToDynamic`),
some shortcuts (e.g., some structs are allocated as `long`s and cast to struct),
inconsistent style (e.g., three different ways to emit locally-created `GedEvent`s),
etc.
In some places some energy was spent making it efficient; in other places it is definitely not as efficient as it easily could be.
Overall, the code needs a major refactor before it will be easy to read.