Welcome to GED4PY’s documentation!

GEDCOM parser for Python

Implementation of the GEDCOM parser in Python

Features

  • Parsing of GEDCOM files as defined by 5.5.1 version of GEDCOM standard

  • Supported file encodings are UTF-8 (with or without BOM), ASCII or ANSEL

  • Designed to parse large files efficiently

  • Supports Python 3.6+

Credits

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.

Installation

Stable release

To install GEDCOM parser for Python, run this command in your terminal:

$ pip install ged4py

This is the preferred method to install GEDCOM parser for Python, as it will always install the most recent stable release.

If you don’t have pip installed, this Python installation guide can guide you through the process.

From sources

The sources for GEDCOM parser for Python can be downloaded from the GitHub repo.

You can either clone the public repository:

$ git clone https://github.com/andy-z/ged4py

Or download the tarball:

$ curl -OL https://github.com/andy-z/ged4py/tarball/master

Once you have a copy of the source, you can install it with:

$ python setup.py install

Usage

Currently ged4py supports parsing of existing GEDCOM files; there is no support for (re-)generating GEDCOM data. The main interface for parsing is the ged4py.parser.GedcomReader class. To create a parser instance, pass a file with GEDCOM data as the single required parameter; this can be either a file name or a Python file object. If a file object is passed, the file has to be opened in binary mode and it has to support the seek() and tell() methods. Example of instantiating a parser:

from ged4py import GedcomReader

path = "/path/to/file.gedcom"
with GedcomReader(path) as parser:
    # GedcomReader provides context support
    ...

or using in-memory buffer as a file (could be useful for testing):

import io
from ged4py import GedcomReader

data = b"..."                 # make some binary data here
with io.BytesIO(data) as file:
    parser = GedcomReader(file)
    ...
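
The file-object requirements above (binary mode, working seek() and tell()) can be checked before handing a file to the parser. The helper below is a stdlib-only sketch; the name looks_like_usable_gedcom_file is hypothetical and is not part of ged4py, and it does not show how ged4py itself validates its input.

```python
import io


def looks_like_usable_gedcom_file(fileobj):
    """Return True if fileobj is binary and supports seek()/tell().

    Hypothetical helper for illustration only; not a ged4py function.
    """
    # Text-mode files derive from io.TextIOBase; the parser needs bytes.
    if isinstance(fileobj, io.TextIOBase):
        return False
    # seekable() reports whether seek() and tell() can be used.
    return bool(getattr(fileobj, "seekable", lambda: False)())


binary_ok = looks_like_usable_gedcom_file(io.BytesIO(b"0 HEAD\n0 TRLR\n"))
text_ok = looks_like_usable_gedcom_file(io.StringIO("0 HEAD\n0 TRLR\n"))
print(binary_ok, text_ok)
```

A file opened with open(path, "rb") passes this check, while one opened in the default text mode does not.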

In most cases the parser should be able to determine the input file encoding from the file itself, provided the data in the file follows the GEDCOM specification. In other cases the parser may need external help; if you know the file encoding, you can provide it as an argument to the parser:

parser = GedcomReader(path, encoding="utf-8")

Any encoding supported by Python codecs module can be used as an argument. In addition, this package registers two additional encodings from the ansel package:

ansel

American National Standard for Extended Latin Alphabet Coded Character Set for Bibliographic Use (ANSEL)

gedcom

GEDCOM extensions for ANSEL

By default the parser raises an exception if it encounters errors while decoding data in a file. To override this behavior, specify a different error policy, following the same pattern as the standard codecs.decode() function, e.g.:

parser = GedcomReader(path, encoding="utf-8", errors='replace')
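
For reference, this is how the standard error policies behave in codecs.decode() itself. The snippet uses only the standard library and a made-up byte string; it illustrates the policy names, not ged4py internals.

```python
import codecs

raw = b"caf\xe9"  # Latin-1 bytes, not valid UTF-8

# "strict" is the default policy and raises on undecodable input.
try:
    codecs.decode(raw, "utf-8", "strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# "replace" substitutes U+FFFD for the undecodable byte.
replaced = codecs.decode(raw, "utf-8", "replace")

# "ignore" silently drops the undecodable byte.
ignored = codecs.decode(raw, "utf-8", "ignore")

print(strict_failed, repr(replaced), repr(ignored))
```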

The main mode of operation for the parser is iterating over records in a file sequentially. GEDCOM records are organized in hierarchical structures, and the ged4py parser facilitates access to these hierarchies by grouping records in tree-like structures. Instead of providing an iterator over every record in a file, the parser iterates over top-level (level 0) records, and for each level-0 record it returns a record structure which includes the nested records below level 0.

The main method of the parser is records0(), which returns an iterator over all level-0 records. The method takes an optional tag name argument; without it, the iterator returns all level-0 records (starting with “HEAD” and ending with “TRLR”). If a tag is given, only the records with a matching tag are returned:

with GedcomReader(path) as parser:
    # iterate over all INDI records
    for record in parser.records0("INDI"):
        ...

Records returned by iterator are instances of class ged4py.model.Record or one of its few sub-classes. Each record instance has a small set of attributes:

  • level - record level, 0 for top-level records

  • xref_id - record reference ID, may be None

  • tag - record tag name

  • value - record value, can be None, string, or value of some other type depending on record type

  • sub_records - list of subordinate records (direct sub-records of this record); it is easier to access items in this list using the methods described below.

If, for example, GEDCOM file contains sequence of records like this:

0 @ID12345@ INDI
1 NAME John /Smith/
1 BIRT
2 DATE 1 JAN 1901
2 PLAC Some place
1 FAMC @ID45623@
1 FAMS @ID7612@

then the record object returned from iterator will have these attributes:

  • level is 0 (true for all records returned by records0()),

  • xref_id - “@ID12345@”,

  • tag - “INDI”,

  • value - None,

  • sub_records - list of Record instances corresponding to “NAME”, “BIRT”, “FAMC”, and “FAMS” tags (but not “DATE” or “PLAC”, records for these tags will be in sub_records of “BIRT” record).
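
The grouping described above can be sketched in a few lines of plain Python. This is an illustrative stand-in, not ged4py's parser: the Node class and the parse_line and group_records functions are hypothetical names, and values are kept as plain strings.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a record; not ged4py's Record class."""
    level: int
    xref_id: Optional[str]
    tag: str
    value: Optional[str]
    sub_records: List["Node"] = field(default_factory=list)


def parse_line(line):
    """Split 'LEVEL [@XREF@] TAG [VALUE]' into a Node."""
    tokens = line.strip().split(" ")
    level = int(tokens.pop(0))
    xref_id = None
    # An xref_id, when present, sits between the level and the tag.
    if len(tokens) > 1 and tokens[0].startswith("@") and tokens[0].endswith("@"):
        xref_id = tokens.pop(0)
    tag = tokens.pop(0)
    value = " ".join(tokens) if tokens else None
    return Node(level, xref_id, tag, value)


def group_records(lines):
    """Return level-0 nodes with deeper records nested under them."""
    top_level = []
    stack = []  # stack[i] is the most recent node seen at level i
    for line in lines:
        if not line.strip():
            continue
        node = parse_line(line)
        # Nodes at this level or deeper are finished; drop them.
        del stack[node.level:]
        if stack:
            stack[-1].sub_records.append(node)
        else:
            top_level.append(node)
        stack.append(node)
    return top_level


GEDCOM_TEXT = """\
0 @ID12345@ INDI
1 NAME John /Smith/
1 BIRT
2 DATE 1 JAN 1901
2 PLAC Some place
1 FAMC @ID45623@
1 FAMS @ID7612@
"""

records = group_records(GEDCOM_TEXT.splitlines())
indi = records[0]
print(indi.xref_id, indi.tag, [sub.tag for sub in indi.sub_records])
```

The stack keeps one open record per level, so each new line attaches to the closest preceding record with a smaller level number.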

The Record class has a few convenience methods:

  • sub_tags() - return all direct subordinate records with a given tag name, list of records is returned, possibly empty.

  • sub_tag() - return subordinate record with a given tag name (or tag “path”), if there is more than one record with matching tag then first one is returned, without match None is returned.

  • sub_tag_value() - return value of subordinate record with a given tag name (or tag “path”), or None if record is not found or its value is None.

With the example records from above one can do record.sub_tag("BIRT/DATE") on level-0 record to retrieve a Record instance corresponding to level-2 “DATE” record, or alternatively use record.sub_tag_value("BIRT/DATE") to retrieve the value attribute of the same record.
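
The tag “path” lookup can be pictured as a walk down the tree, taking the first matching sub-record at each step. The sketch below uses a hypothetical Rec class and hypothetical find_by_path and value_by_path helpers with plain string values; it mirrors the behavior described above but is not ged4py's implementation.

```python
class Rec:
    """Hypothetical minimal record; not ged4py's Record class."""

    def __init__(self, tag, value=None, sub_records=()):
        self.tag = tag
        self.value = value
        self.sub_records = list(sub_records)


def find_by_path(record, path):
    """Follow a 'TAG/TAG' path, taking the first match at each step."""
    current = record
    for tag in path.split("/"):
        matches = [sub for sub in current.sub_records if sub.tag == tag]
        if not matches:
            return None
        current = matches[0]
    return current


def value_by_path(record, path):
    """Return the value at the end of the path, or None if not found."""
    found = find_by_path(record, path)
    return None if found is None else found.value


indi = Rec("INDI", sub_records=[
    Rec("NAME", "John /Smith/"),
    Rec("BIRT", sub_records=[
        Rec("DATE", "1 JAN 1901"),
        Rec("PLAC", "Some place"),
    ]),
])

print(value_by_path(indi, "BIRT/DATE"))
print(find_by_path(indi, "DEAT/DATE"))
```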

There are a few specialized sub-classes of Record, each corresponding to a specific record tag:

  • NAME records generate ged4py.model.NameRec instances, this class knows how to split name representation into name components (first, last, maiden) and has attributes for accessing those.

  • DATE records generate ged4py.model.Date instances, the value attribute of this class is converted into ged4py.date.DateValue instance.

  • INDI records are represented by ged4py.model.Individual class.

  • “pointer” records whose value has special GEDCOM <POINTER> syntax (@xref_id@) are represented by ged4py.model.Pointer class. This class has special property ref which returns referenced record. Methods sub_tag() and sub_tag_value() have keyword argument follow which can be set to True to allow automatic dereferencing of the pointer records.
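
Pointer dereferencing amounts to looking up the @xref_id@ value in an index of records. The sketch below shows the idea with a hypothetical Rec class, a plain dictionary as the index, and a hypothetical first_sub_record helper whose follow flag plays the role described above; none of these names come from ged4py.

```python
class Rec:
    """Hypothetical minimal record; not ged4py's Record or Pointer class."""

    def __init__(self, tag, value=None, xref_id=None, sub_records=()):
        self.tag = tag
        self.value = value
        self.xref_id = xref_id
        self.sub_records = list(sub_records)


def is_pointer(value):
    """True for values shaped like the GEDCOM pointer syntax @xref_id@."""
    return (
        isinstance(value, str)
        and len(value) > 2
        and value.startswith("@")
        and value.endswith("@")
    )


def first_sub_record(record, tag, index, *, follow):
    """Return the first sub-record with the tag, optionally dereferenced."""
    for sub in record.sub_records:
        if sub.tag == tag:
            if follow and is_pointer(sub.value):
                # Look the target up by its xref_id; None if it is missing.
                return index.get(sub.value)
            return sub
    return None


husband = Rec("INDI", xref_id="@I1@", sub_records=[Rec("NAME", "John /Smith/")])
family = Rec("FAM", xref_id="@F1@", sub_records=[Rec("HUSB", "@I1@")])
index = {husband.xref_id: husband}

followed = first_sub_record(family, "HUSB", index, follow=True)
raw = first_sub_record(family, "HUSB", index, follow=False)
print(followed.tag, raw.value)
```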

ged4py API

ged4py

Top-level package for GEDCOM parser for Python.

Technical information

Character encoding

GEDCOM originally provided very little support for non-Latin alphabets. To support Latin-based characters beyond the ASCII set, GEDCOM used the ANSEL 8-bit encoding, which added a number of diacritical marks (modifiers) and a few commonly used non-ASCII characters. Support for non-Latin characters was added in later versions of the GEDCOM standard: version 5.3 added wording for UNICODE support (mostly broken) and draft 5.5.1 improved the situation by declaring UTF-8 as a supported UNICODE encoding. Several systems producing GEDCOM output today seem to have converged on UTF-8.

The encoding of GEDCOM file is determined by the content of the file itself, in particular by the CHAR record in the header (which is a required record), e.g.:

0 HEAD
  1 SOUR PAF
    2 VERS 2.1
  1 DEST ANSTFILE
  1 CHAR ANSEL

The GEDCOM standard seems to imply that the character set specified in the CHAR record applies to everything after that record and until the TRLR record (the last record in the file). My interpretation of that statement is that all header records before and including CHAR should be encoded with the default ANSEL encoding. This may be a source of incompatibilities: I can imagine that software encoding its output in e.g. UTF-8 can decide to encode all header records in the same UTF-8, which can cause errors if decoded using ANSEL.

An additional source of concern is the BOM (byte order mark) that some applications (or many on Windows) tend to add to files encoded with UTF-8 (or UTF-16). Presence of a BOM usually implies that the whole content of the file should be decoded using UTF-8/-16. This contradicts the assumption that the initial part of the GEDCOM header is encoded in ANSEL.

Ged4py tries to make a best guess as to how it should decode input data, and it uses simple algorithm to determine that:

  • if file starts with BOM record then ged4py reads the whole file using UTF-8 or UTF-16 encoding, if the CHAR record specifies something other than UTF-8/-16 the exception is raised;

  • otherwise if file starts with regular “0” and “ ” ASCII characters, the header is read using ANSEL encoding until the CHAR record is met; after that, reading switches to the encoding specified in that record;

  • decoding errors are handled according to the mode specified when opening the GEDCOM file, which can be one of the standard error handling schemes defined in the codecs module. This scheme applies to both the header (before the CHAR record) and regular content.
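
The decision steps above can be sketched with the standard library alone. This is a simplified illustration and not ged4py's code: guess_encoding and find_char_value are hypothetical names, Latin-1 stands in for ANSEL (which the standard library does not provide), the whole input is decoded rather than just the header, and accepting “UNICODE” as a CHAR value alongside a BOM is an assumption of the sketch.

```python
import codecs

# Byte order marks and the encoding each one implies.
BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16"),
    (codecs.BOM_UTF16_BE, "UTF-16"),
]

# CHAR values accepted alongside a BOM (assumption of this sketch).
UNICODE_CHAR_VALUES = {"UTF-8", "UTF-16", "UNICODE"}


def find_char_value(text):
    """Return the value of the first level-1 CHAR record, or None."""
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) >= 3 and tokens[0] == "1" and tokens[1] == "CHAR":
            return tokens[2].upper()
    return None


def guess_encoding(data):
    """Guess an encoding name for raw GEDCOM bytes."""
    bom_encoding = None
    for bom, name in BOMS:
        if data.startswith(bom):
            bom_encoding = name
            break

    if bom_encoding is None:
        # No BOM: read the header byte-for-byte (Latin-1 as ANSEL stand-in)
        # and trust the CHAR record, falling back to ANSEL.
        declared = find_char_value(data.decode("latin-1"))
        return declared or "ANSEL"

    # With a BOM the whole file is Unicode; the "utf-8-sig" and "utf-16"
    # codecs both consume the BOM while decoding.
    codec = "utf-8-sig" if bom_encoding == "UTF-8" else "utf-16"
    declared = find_char_value(data.decode(codec))
    if declared is not None and declared not in UNICODE_CHAR_VALUES:
        raise ValueError(f"BOM implies {bom_encoding} but CHAR says {declared}")
    return bom_encoding


plain = b"0 HEAD\n1 SOUR PAF\n1 CHAR ANSEL\n0 TRLR\n"
with_bom = codecs.BOM_UTF8 + b"0 HEAD\n1 CHAR UTF-8\n0 TRLR\n"
print(guess_encoding(plain), guess_encoding(with_bom))
```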

See also Tamura Jones’ excellent article summarizing many varieties of illegal encodings that may be present in GEDCOM files.

Name representation

The GEDCOM NAME record defines a structured format for representing names, but applications are not required to fill in that structural information and can instead present the name as the value part of the NAME record in a “custom of culture” representation. The only requirement for that representation is that the surname should be delimited by slash characters, e.g.:

0 @I1@ INDI
  1 NAME John /Smith/            -- given name and surname
0 @I2@ INDI
  1 NAME Joanne                  -- without surname
0 @I3@ INDI
  1 NAME /Иванов/ Иван Ив.       -- surname and given name
0 @I4@ INDI
  1 NAME Sir John /Ivanoff/ Jr.  -- with prefix/suffix
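
Because the surname is the only delimited part, a NAME value can be split with a small regular expression. The split_name function below is a hypothetical sketch and not ged4py's name parser; it joins whatever sits before and after the slashes into one “given” string.

```python
import re

# The surname is whatever sits between the first pair of slashes.
_SURNAME_RE = re.compile(r"/([^/]*)/")


def split_name(value):
    """Split a NAME value into (given, surname)."""
    match = _SURNAME_RE.search(value)
    if match is None:
        # No slashes at all: the whole value is treated as the given name.
        return value.strip(), ""
    surname = match.group(1).strip()
    before = value[:match.start()].strip()
    after = value[match.end():].strip()
    given = " ".join(part for part in (before, after) if part)
    return given, surname


for name in ("John /Smith/", "Joanne", "/Иванов/ Иван Ив.", "Sir John /Ivanoff/ Jr."):
    print(split_name(name))
```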

Potentially an individual can have more than one NAME record; these can be distinguished by a TYPE record, which can be an arbitrary string, as GEDCOM does not define standard or allowed types. Types could be used, for example, to specify a maiden name or names in previous marriages, e.g.:

0 @I1@ INDI
  1 NAME Жанна /Иванова/
  1 NAME Jeanne /d'Arc/
    2 TYPE maiden

A couple of applications that I know of do not use TYPE records for maiden name representation; instead they chose different ways to encode names. Here is how individual applications encode names.

Agelong Tree (Genery)

Agelong Tree produces a single NAME record per individual; I don’t think it is possible to make it create more than one NAME record. Given name and surname are encoded as the value in the NAME record, and the given name also appears in a GIVN sub-record:

1 NAME Given Name /Surname/
  2 GIVN Given Name

If person has a maiden name then it is encoded as additional surname enclosed in parentheses, also SURN sub-record specifies maiden name:

1 NAME Given Name /Surname (Maiden)/
  2 GIVN Given Name
  2 SURN Maiden

Additionally Agelong tends to represent missing parts of names in GEDCOM file with question mark (?).

Agelong can also store name suffix and prefix, they are not included into NAME record value but stored as NPFX and NSFX sub-records:

1 NAME Given Name /Surname/
  2 NPFX Dr.
  2 GIVN Given Name
  2 NSFX Jr.

MyHeritage

MyHeritage Family Tree Builder can generate more than one NAME record, but I could not find a way to specify the TYPE of the created NAME records; likely all NAME records are created without TYPE, which is not too useful.

Given name and surname are encoded as the value in the NAME record and they also appear in GIVN and SURN sub-records:

1 NAME Given Name /Surname/
  2 GIVN Given Name
  2 SURN Surname

If name of the person after marriage is different from birth/maiden name (apparently in MyHeritage this can only happen for female individuals) then married name is stored in a non-standard sub-record with _MARNM tag:

1 NAME Given Name /Maiden/
  2 GIVN Given Name
  2 SURN Maiden
  2 _MARNM Married

MyHeritage can also store name suffix and prefix, and also nickname in corresponding sub-records, they are not rendered in NAME record value:

1 NAME Given Name /Surname/
  2 NPFX Dr.
  2 GIVN Given Name
  2 SURN Surname
  2 NSFX Jr.
  2 NICK Professore

MyHeritage can also store a few name pieces in NAME sub-records using non-standard tags such as _AKA, _RNAME (for religious name), _FORMERNAME, etc.

ged4py behavior

ged4py tries to determine individual name pieces from all information in GEDCOM records. Because the interpretation of that information depends on the application which produced the GEDCOM file, ged4py also has to determine the application name. The application name (a.k.a. GEDCOM “dialect”) is determined from the file header and is stored in the dialect property of the GedcomReader class (one of the DIALECT_* constants defined in the ged4py.model module). In general, naming of individuals can be overly complicated; ged4py tries to build a simpler model of person naming by determining four pieces of each individual’s name:

  • given name, in some cultures it can include middle (or father) name

  • first name, ged4py just uses first word (before space) of given name

  • last name, for married females who changed their name in marriage ged4py assumes this to be a married name

  • maiden name, only applies to married females who changed their name in marriage

Here is the algorithm that ged4py uses for extracting these pieces:

  • for Agelong dialect:

    • only NAME record value is used, sub-records are ignored

    • maiden name is determined from parenthesized portion of surname

    • last name is everything except maiden name in surname

    • given name is value without surname, collects everything before and after slashes in NAME value

  • for MyHeritage dialect:

    • if _MARNM sub-record is present then it is used as last name and everything between slashes in NAME value is used as maiden name

    • otherwise everything between slashes is used as last name, maiden name is empty

    • given name is NAME value without slashes and stuff between slashes

  • for other cases (“default” dialect):

    • if there is NAME record with TYPE sub-record equal ‘maiden’ then use surname from that record value as maiden name

    • if there is more than one NAME record choose one without TYPE sub-record as “primary” name, or use first NAME record; last name comes from primary NAME value between slashes, first name is the rest of value.
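
The Agelong and MyHeritage rules above can be sketched as two small functions. All names here (split_slashes, agelong_names, myheritage_names, and the dictionary keys) are hypothetical and chosen for illustration; this is not ged4py's NameRec implementation, and the _MARNM value is passed in as a plain argument rather than read from a sub-record.

```python
import re


def split_slashes(value):
    """Return (given, between_slashes) for a NAME value."""
    match = re.search(r"/([^/]*)/", value)
    if match is None:
        return " ".join(value.split()), ""
    # Given name collects everything before and after the slashes.
    outside = (value[:match.start()] + " " + value[match.end():]).split()
    return " ".join(outside), match.group(1).strip()


def first_word(given):
    """First name: the first word of the given name, if any."""
    words = given.split()
    return words[0] if words else ""


def agelong_names(value):
    """Agelong rule: maiden name is the parenthesized part of the surname."""
    given, surname = split_slashes(value)
    match = re.search(r"\(([^)]*)\)", surname)
    if match is None:
        last, maiden = surname, ""
    else:
        maiden = match.group(1).strip()
        last = (surname[:match.start()] + surname[match.end():]).strip()
    return {"given": given, "first": first_word(given), "last": last, "maiden": maiden}


def myheritage_names(value, marnm=None):
    """MyHeritage rule: _MARNM, when present, is the last name."""
    given, surname = split_slashes(value)
    if marnm:
        last, maiden = marnm, surname
    else:
        last, maiden = surname, ""
    return {"given": given, "first": first_word(given), "last": last, "maiden": maiden}


print(agelong_names("Given Name /Surname (Maiden)/"))
print(myheritage_names("Given Name /Maiden/", marnm="Married"))
```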

Examples

This page collects several simple code examples which use ged4py.

Example 1

Trivial example of opening the file, iterating over INDI records (which produces Individual instances) and printing basic information for each person. format() method is used to produce printable representation of a name, though this is only one of possible ways to format names. Method sub_tag_value() is used to access the values of subordinate tags of the record, it can follow many levels of tags.

import sys
from ged4py.parser import GedcomReader

# open GEDCOM file
with GedcomReader(sys.argv[1]) as parser:
    # iterate over each INDI record in a file
    for i, indi in enumerate(parser.records0("INDI")):
        # Print a name (one of many possible representations)
        print(f"{i}: {indi.name.format()}")

        father = indi.father
        if father: 
            print(f"    father: {father.name.format()}")

        mother = indi.mother
        if mother: 
            print(f"    mother: {mother.name.format()}")

        # Get _value_ of the BIRT/DATE tag
        birth_date = indi.sub_tag_value("BIRT/DATE")
        if birth_date:
            print(f"    birth date: {birth_date}")

        # Get _value_ of the BIRT/PLAC tag
        birth_place = indi.sub_tag_value("BIRT/PLAC")
        if birth_place:
            print(f"    birth place: {birth_place}")

Example 2

This example iterates over FAM records in the file which represent family structure. FAM records do not have special record type so they produce generic Record instances. This example shows the use of sub_tag() method which can dereference pointer records contained in FAM records to retrieve corresponding INDI records.

import sys
from ged4py.parser import GedcomReader

with GedcomReader(sys.argv[1]) as parser:
    # iterate over each FAM record in a file
    for i, fam in enumerate(parser.records0("FAM")):

        print(f"{i}:")

        # Get records for spouses, FAM record contains pointers to INDI
        # records but sub_tag knows how to follow the pointers and return
        # the referenced records instead.
        husband, wife = fam.sub_tag("HUSB"), fam.sub_tag("WIFE")
        if husband: 
            print(f"    husband: {husband.name.format()}")
        if wife: 
            print(f"    wife: {wife.name.format()}")

        # Get _value_ of the MARR/DATE tag
        marr_date = fam.sub_tag_value("MARR/DATE")
        if marr_date:
            print(f"    marriage date: {marr_date}")

        # access all CHIL records, sub_tags method returns list (possibly empty)
        children = fam.sub_tags("CHIL")
        for child in children:
            # print name and date of birth
            print(f"    child: {child.name.format()}")
            birth_date = child.sub_tag_value("BIRT/DATE")
            if birth_date:
                print(f"        birth date: {birth_date}")

Example 3

This example shows how to specialize date formatting. Date representation in different calendars is a very complicated topic and ged4py cannot solve it in any general way. Instead it gives clients an option to specialize date handling in whatever way clients prefer. This is done by implementing DateValueVisitor interface and passing a visitor instance to ged4py.date.DateValue.accept() method. For completeness one also has to implement CalendarDateVisitor to format or do anything else to the instances of CalendarDate, this is not shown in the example.

import sys
from ged4py.parser import GedcomReader
from ged4py.date import DateValueVisitor


class DateFormatter(DateValueVisitor):
    """Visitor class that produces string representation of dates.
    """
    def visitSimple(self, date):
        return f"{date.date}"

    def visitPeriod(self, date):
        return f"from {date.date1} to {date.date2}"

    def visitFrom(self, date):
        return f"from {date.date}"

    def visitTo(self, date):
        return f"to {date.date}"

    def visitRange(self, date):
        return f"between {date.date1} and {date.date2}"

    def visitBefore(self, date):
        return f"before {date.date}"

    def visitAfter(self, date):
        return f"after {date.date}"

    def visitAbout(self, date):
        return f"about {date.date}"

    def visitCalculated(self, date):
        return f"calculated {date.date}"

    def visitEstimated(self, date):
        return f"estimated {date.date}"

    def visitInterpreted(self, date):
        return f"interpreted {date.date} ({date.phrase})"

    def visitPhrase(self, date):
        return f"({date.phrase})"


format_visitor = DateFormatter()

with GedcomReader(sys.argv[1]) as parser:
    # iterate over each INDI record in a file
    for i, indi in enumerate(parser.records0("INDI")):
        print(f"{i}: {indi.name.format()}")

        # get all possible event types and print their dates,
        # full list of events is longer, this is only an example
        events = indi.sub_tags("BIRT", "CHR", "DEAT", "BURI", "ADOP", "EVEN")
        for event in events:
            date = event.sub_tag_value("DATE")
            # Some event types like generic EVEN can define TYPE tag
            event_type = event.sub_tag_value("TYPE")
            # pass a visitor to format the date
            if date:
                date_str = date.accept(format_visitor)
            else:
                date_str = "N/A"
            print(f"    event: {event.tag} date: {date_str} type: {event_type}")

Contributing

Contributions are welcome, and they are greatly appreciated! Every little bit helps, and credit will always be given.

You can contribute in many ways:

Types of Contributions

Report Bugs

Report bugs at https://github.com/andy-z/ged4py/issues.

If you are reporting a bug, please include:

  • Your operating system name and version.

  • Any details about your local setup that might be helpful in troubleshooting.

  • Detailed steps to reproduce the bug.

Fix Bugs

Look through the GitHub issues for bugs. Anything tagged with “bug” and “help wanted” is open to whoever wants to implement it.

Implement Features

Look through the GitHub issues for features. Anything tagged with “enhancement” and “help wanted” is open to whoever wants to implement it.

Write Documentation

GEDCOM parser for Python could always use more documentation, whether as part of the official GEDCOM parser for Python docs, in docstrings, or even on the web in blog posts, articles, and such.

Submit Feedback

The best way to send feedback is to file an issue at https://github.com/andy-z/ged4py/issues.

If you are proposing a feature:

  • Explain in detail how it would work.

  • Keep the scope as narrow as possible, to make it easier to implement.

  • Remember that this is a volunteer-driven project, and that contributions are welcome :)

Get Started!

Ready to contribute? Here’s how to set up ged4py for local development.

  1. Fork the ged4py repo on GitHub.

  2. Clone your fork locally:

    $ git clone git@github.com:your_name_here/ged4py.git
    
  3. Install your local copy into a virtualenv. Assuming you have virtualenvwrapper installed, this is how you set up your fork for local development:

    $ mkvirtualenv ged4py
    $ cd ged4py/
    $ python setup.py develop
    
  4. Create a branch for local development:

    $ git checkout -b name-of-your-bugfix-or-feature
    

    Now you can make your changes locally.

  5. When you’re done making changes, check that your changes pass flake8 and the tests, including testing other Python versions with tox:

    $ flake8 ged4py tests
    $ python setup.py test  # or: py.test
    $ tox
    

    To get flake8 and tox, just pip install them into your virtualenv.

  6. Commit your changes and push your branch to GitHub:

    $ git add .
    $ git commit -m "Your detailed description of your changes."
    $ git push origin name-of-your-bugfix-or-feature
    
  7. Submit a pull request through the GitHub website.

Pull Request Guidelines

Before you submit a pull request, check that it meets these guidelines:

  1. The pull request should include tests.

  2. If the pull request adds functionality, the docs should be updated. Put your new functionality into a function with a docstring, and add the feature to the list in README.rst.

  3. The pull request should work for Python 3.6+. Check https://travis-ci.org/andy-z/ged4py/pull_requests and make sure that the tests pass for all supported Python versions.

Tips

To run a subset of tests:

$ python -m unittest tests.test_ged4py

Credits

Development Lead

Contributors

History

0.4.4 (2021-05-01)

  • Add Python3.9 to tox, github test, and classifiers.

0.4.3 (2021-04-30)

  • Extend behavior of Record.sub_tags() method.

0.4.2 (2021-04-09)

  • Fix crash in sub_tag() with broken files

0.4.1 (2021-04-08)

  • Improve handling of invalid dates

0.4.0 (2020-10-09)

  • Python3 goodies, use enum classes for enums

0.3.2 (2020-10-04)

  • Use numpydoc style for docstrings, add extension to Sphinx

  • Drop Python2 compatibility code

0.3.1 (2020-09-28)

  • Use github actions instead of Travis CI

0.3.0 (2020-09-28)

  • Drop Python2 support

  • Python3 supported versions are 3.6 - 3.8

0.2.4 (2020-08-30)

  • Extend dialect detection for new genery.com SOUR format

0.2.3 (2020-08-29)

  • Disable hashing for Record types

  • Add hash method for DateValue and CalendarDate classes

  • Improve ordering of DateValue instances

0.2.2 (2020-08-16)

  • Fix parsing of DATE records with leading blanks

0.2.1 (2020-08-15)

  • Extend documentation with examples

  • Extend docstrings for few classes

0.2.0 (2020-07-05)

  • Improve support for GEDCOM date types

0.1.13 (2020-04-15)

  • Add support for MacOS line breaks (single CR character)

0.1.12 (2020-03-01)

  • Add support for a bunch of illegal encodings (thanks @Tuisto59 for report).

0.1.11 (2019-01-06)

  • Improve support for ANSEL encoded documents that use combining characters.

0.1.10 (2018-10-17)

  • Add protection for empty DATE fields.

0.1.9 (2018-05-17)

  • Improve exception messages, convert bytes to string

0.1.8 (2018-05-16)

  • Add simple integrity checks to parser

0.1.7 (2018-04-23)

  • Fix for DateValue comparison, few small improvements

0.1.6 (2018-04-02)

  • Improve handling of non-standard dates, any date string that cannot be parsed according to GEDCOM syntax is assumed to be a “Date phrase”

0.1.5 (2018-03-25)

  • Fix for exception due to empty NAME record

0.1.4 (2018-01-31)

  • Improve name parsing for ALTREE dialect

0.1.3 (2018-01-16)

  • improve Py3 compatibility

0.1.2 (2017-11-26)

  • Get rid of name formatting options, too complicated for this package.

  • Describe name parsing for different dialects.

0.1.1 (2017-11-20)

  • Fix for missing modules.

0.1.0 (2017-07-17)

  • First release on PyPI.
