Welcome to GED4PY’s documentation!
GEDCOM parser for Python
Implementation of the GEDCOM parser in Python
Free software: MIT license
Documentation: https://ged4py.readthedocs.io.
Features
Parsing of GEDCOM files as defined by version 5.5.1 of the GEDCOM standard
Supported file encodings are UTF-8 (with or without BOM), ASCII, or ANSEL
Designed to parse large files efficiently
Supports Python 3.6+
Credits
This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.
Installation
Stable release
To install GEDCOM parser for Python, run this command in your terminal:
$ pip install ged4py
This is the preferred method to install GEDCOM parser for Python, as it will always install the most recent stable release.
If you don’t have pip installed, this Python installation guide can guide you through the process.
From sources
The sources for GEDCOM parser for Python can be downloaded from the GitHub repo.
You can either clone the public repository:
$ git clone https://github.com/andy-z/ged4py.git
Or download the tarball:
$ curl -OL https://github.com/andy-z/ged4py/tarball/master
Once you have a copy of the source, you can install it with:
$ python setup.py install
Usage
Currently ged4py supports parsing of existing GEDCOM files; there is no support for (re-)generating GEDCOM data. The main interface for parsing is the ged4py.parser.GedcomReader class. To create a parser instance, pass the file with GEDCOM data as the single required parameter; this can be either a file name or a Python file object. If a file object is passed, the file has to be open in binary mode and has to support the seek() and tell() methods. Example of instantiating a parser:
from ged4py import GedcomReader

path = "/path/to/file.gedcom"
with GedcomReader(path) as parser:
    # GedcomReader provides context support
    ...
or using an in-memory buffer as a file (could be useful for testing):
import io
from ged4py import GedcomReader

data = b"..."  # make some binary data here
with io.BytesIO(data) as file:
    parser = GedcomReader(file)
    ...
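The requirements on file objects (binary mode, seek() and tell()) can be checked with the standard library alone. The sketch below is not part of ged4py; is_usable_gedcom_stream is a hypothetical helper name used only for illustration, and ged4py may validate its input differently.

```python
import io


def is_usable_gedcom_stream(stream):
    """Return True if `stream` is binary and supports seek()/tell().

    Hypothetical helper for illustration; not a ged4py function.
    """
    if isinstance(stream, io.TextIOBase):
        return False  # text-mode streams yield str, not bytes
    try:
        return stream.seekable() and isinstance(stream.tell(), int)
    except (AttributeError, OSError, ValueError):
        return False


binary_ok = is_usable_gedcom_stream(io.BytesIO(b"0 HEAD\n0 TRLR\n"))
text_ok = is_usable_gedcom_stream(io.StringIO("0 HEAD\n0 TRLR\n"))
print(binary_ok, text_ok)
```

An io.BytesIO buffer passes the check, while an io.StringIO buffer (text mode) does not.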
In most cases the parser should be able to determine the input file encoding from the file itself, provided the data follows the GEDCOM specification. In other cases the parser may need external help: if you know the file encoding, you can provide it as an argument to the parser:
parser = GedcomReader(path, encoding="utf-8")
Any encoding supported by the Python codecs module can be used as an argument. In addition, this package registers two encodings from the ansel package:
ansel | American National Standard for Extended Latin Alphabet Coded Character Set for Bibliographic Use (ANSEL)
gedcom | GEDCOM extensions for ANSEL
By default the parser raises an exception if it encounters errors while decoding data in a file. To override this behavior one can specify a different error policy, following the same pattern as the standard codecs.decode() function, e.g.:
parser = GedcomReader(path, encoding="utf-8", errors='replace')
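To see what the standard error policies do, here is a small standard-library illustration that does not involve ged4py at all. The sample bytes are made up for the demonstration; they contain a Latin-1 byte that is not valid UTF-8.

```python
import codecs

raw = b"1 NAME Caf\xe9"  # 0xE9 is Latin-1 "e acute", invalid as UTF-8 here

# "strict" is the default policy: undecodable bytes raise an exception.
try:
    codecs.decode(raw, "utf-8", "strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# "replace" substitutes U+FFFD (the replacement character).
replaced = codecs.decode(raw, "utf-8", "replace")

# "ignore" silently drops the undecodable bytes.
ignored = codecs.decode(raw, "utf-8", "ignore")

print(strict_failed, repr(replaced), repr(ignored))
```

The same policy names ("strict", "replace", "ignore", and the others defined by the codecs module) are the values one would pass as the errors argument.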
The main mode of operation for the parser is iterating over records in a file sequentially. GEDCOM records are organized in hierarchical structures, and the ged4py parser facilitates access to these hierarchies by grouping records into tree-like structures. Instead of providing an iterator over every record in a file, the parser iterates over top-level (level 0) records, and for each level-0 record it returns a record structure which includes the nested records below level 0.
The main method of the parser is records0(), which returns an iterator over all level-0 records. The method takes an optional argument for a tag name; without the argument all level-0 records are returned by the iterator (starting with “HEAD” and ending with “TRLR”). If a tag is given then only the records with a matching tag are returned:
with GedcomReader(path) as parser:
    # iterate over all INDI records
    for record in parser.records0("INDI"):
        ...
Records returned by the iterator are instances of the ged4py.model.Record class or one of its few sub-classes. Each record instance has a small set of attributes:
level - record level, 0 for top-level records
xref_id - record reference ID, may be None
tag - record tag name
value - record value, can be None, a string, or a value of some other type depending on record type
sub_records - list of subordinate records (direct sub-records of this record); it is easier to access items in this list using the methods described below.
If, for example, GEDCOM file contains sequence of records like this:
0 @ID12345@ INDI
1 NAME John /Smith/
1 BIRT
2 DATE 1 JAN 1901
2 PLAC Some place
1 FAMC @ID45623@
1 FAMS @ID7612@
then the record object returned from the iterator will have these attributes:
level is 0 (true for all records returned by records0()),
xref_id - “@ID12345@”,
tag - “INDI”,
value - None,
sub_records - list of Record instances corresponding to “NAME”, “BIRT”, “FAMC”, and “FAMS” tags (but not “DATE” or “PLAC”; records for these tags will be in sub_records of the “BIRT” record).
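The grouping of flat GEDCOM lines into nested records can be sketched with the standard library. This is a simplified illustration, not ged4py's parser: SimpleRecord, parse_line, and build_tree are hypothetical names, and the code assumes well-formed input in which levels never jump by more than one.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SimpleRecord:
    """Illustrative stand-in for a parsed record (not ged4py's Record)."""
    level: int
    xref_id: Optional[str]
    tag: str
    value: Optional[str]
    sub_records: List["SimpleRecord"] = field(default_factory=list)


def parse_line(line):
    """Split one GEDCOM line into (level, xref_id, tag, value)."""
    parts = line.strip().split(" ", 2)
    level = int(parts[0])
    if parts[1].startswith("@"):
        # "0 @ID@ TAG [value]": the second token is the record's own ID
        rest = parts[2].split(" ", 1)
        return level, parts[1], rest[0], rest[1] if len(rest) > 1 else None
    return level, None, parts[1], parts[2] if len(parts) > 2 else None


def build_tree(lines):
    """Group flat lines into level-0 records with nested sub_records."""
    top_level = []
    stack = []  # stack[i] is the most recent record seen at level i
    for line in lines:
        record = SimpleRecord(*parse_line(line))
        del stack[record.level:]  # forget records at this level or deeper
        if record.level == 0:
            top_level.append(record)
        else:
            stack[-1].sub_records.append(record)
        stack.append(record)
    return top_level


lines = [
    "0 @ID12345@ INDI",
    "1 NAME John /Smith/",
    "1 BIRT",
    "2 DATE 1 JAN 1901",
    "2 PLAC Some place",
    "1 FAMC @ID45623@",
    "1 FAMS @ID7612@",
]
(indi,) = build_tree(lines)
top_tags = [r.tag for r in indi.sub_records]
birt_tags = [r.tag for r in indi.sub_records[1].sub_records]
print(indi.xref_id, top_tags, birt_tags)
```

The level-0 record ends up with four direct sub-records, and “DATE” and “PLAC” appear only under “BIRT”.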
The Record class has a few convenience methods:
sub_tags() - return all direct subordinate records with a given tag name; a list of records is returned, possibly empty.
sub_tag() - return the subordinate record with a given tag name (or tag “path”); if there is more than one record with a matching tag then the first one is returned, and without a match None is returned.
sub_tag_value() - return the value of the subordinate record with a given tag name (or tag “path”), or None if the record is not found or its value is None.
With the example records from above one can call record.sub_tag("BIRT/DATE") on the level-0 record to retrieve a Record instance corresponding to the level-2 “DATE” record, or alternatively use record.sub_tag_value("BIRT/DATE") to retrieve the value attribute of the same record.
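How a tag “path” lookup of this kind might work is sketched below on a hand-built tree. The Node class and the two functions are hypothetical stand-ins written for this illustration; they mirror the described behavior (first match wins, None when nothing matches) but are not ged4py's code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Illustrative record stand-in; not ged4py's Record class."""
    tag: str
    value: Optional[str] = None
    sub_records: List["Node"] = field(default_factory=list)


def sub_tag(record, path):
    """Follow a "TAG/TAG" path, taking the first match at each step."""
    for tag in path.split("/"):
        record = next((r for r in record.sub_records if r.tag == tag), None)
        if record is None:
            return None  # some step of the path has no matching record
    return record


def sub_tag_value(record, path):
    """Return the value at the end of the path, or None if it is missing."""
    found = sub_tag(record, path)
    return None if found is None else found.value


indi = Node("INDI", sub_records=[
    Node("NAME", "John /Smith/"),
    Node("BIRT", sub_records=[
        Node("DATE", "1 JAN 1901"),
        Node("PLAC", "Some place"),
    ]),
])

birth_date = sub_tag_value(indi, "BIRT/DATE")
missing = sub_tag(indi, "DEAT/DATE")
print(birth_date, missing)
```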
There are a few specialized sub-classes of Record, each corresponding to a specific record tag:
NAME records generate ged4py.model.NameRec instances; this class knows how to split the name representation into name components (first, last, maiden) and has attributes for accessing those.
DATE records generate ged4py.model.Date instances; the value attribute of this class is converted into a ged4py.date.DateValue instance.
INDI records are represented by the ged4py.model.Individual class.
“pointer” records whose value has the special GEDCOM <POINTER> syntax (@xref_id@) are represented by the ged4py.model.Pointer class. This class has a special property ref which returns the referenced record. Methods sub_tag() and sub_tag_value() have a keyword argument follow which can be set to True to allow automatic dereferencing of the pointer records.
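The idea of a pointer record with a ref property and a follow option can be illustrated with a toy model. Everything below (Rec, PointerRec, the registry dictionary, and this sub_tag function) is hypothetical and written for illustration; in particular, the default value of follow in this sketch is an arbitrary choice and says nothing about ged4py's default.

```python
class Rec:
    """Plain record stand-in (hypothetical, for illustration only)."""

    def __init__(self, tag, value=None, sub_records=()):
        self.tag = tag
        self.value = value
        self.sub_records = list(sub_records)


class PointerRec(Rec):
    """Record whose value is an "@xref_id@" reference into `registry`."""

    def __init__(self, tag, value, registry):
        super().__init__(tag, value)
        self._registry = registry

    @property
    def ref(self):
        # Resolve lazily so the target may be registered after the pointer
        return self._registry.get(self.value)


def sub_tag(record, tag, follow=False):
    """Return the first sub-record with `tag`, optionally dereferenced."""
    for sub in record.sub_records:
        if sub.tag == tag:
            if follow and isinstance(sub, PointerRec):
                return sub.ref
            return sub
    return None


registry = {}
husband = Rec("INDI", sub_records=[Rec("NAME", "John /Smith/")])
registry["@I1@"] = husband
family = Rec("FAM", sub_records=[PointerRec("HUSB", "@I1@", registry)])

pointer = sub_tag(family, "HUSB")              # the pointer record itself
target = sub_tag(family, "HUSB", follow=True)  # the referenced INDI record
print(pointer.value, target.tag)
```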
Technical information
Character encoding
GEDCOM originally provided very little support for non-Latin alphabets. To support Latin-based characters beyond the ASCII set, GEDCOM used the ANSEL 8-bit encoding, which added a number of diacritical marks (modifiers) and a few commonly used non-ASCII characters. Support for non-Latin characters was added in later versions of the GEDCOM standard: version 5.3 added wording for UNICODE support (mostly broken) and draft 5.5.1 improved the situation by declaring UTF-8 as the supported UNICODE encoding. Several systems producing GEDCOM output today seem to have converged on UTF-8.
The encoding of a GEDCOM file is determined by the content of the file itself, in particular by the CHAR record in the header (which is a required record), e.g.:
0 HEAD
1 SOUR PAF
2 VERS 2.1
1 DEST ANSTFILE
1 CHAR ANSEL
The GEDCOM standard seems to imply that the character set specified in the CHAR record applies to everything after that record and until the TRLR record (the last record in the file). My interpretation of that statement is that all header records before and including CHAR should be encoded with the default ANSEL encoding. This may be a source of incompatibilities: I can imagine that software encoding its output in e.g. UTF-8 can decide to encode all header records in the same UTF-8, which can cause errors if decoded using ANSEL.
An additional source of concern is the BOM record that some applications (many of them on Windows) tend to add to files encoded with UTF-8 (or UTF-16). The presence of a BOM usually implies that the whole content of the file should be decoded using UTF-8/-16. This contradicts the assumption that the initial part of the GEDCOM header is encoded in ANSEL.
Ged4py tries to make a best guess as to how it should decode input data, and it uses a simple algorithm to determine that:
if the file starts with a BOM record then ged4py reads the whole file using UTF-8 or UTF-16 encoding; if the CHAR record specifies something other than UTF-8/-16 an exception is raised;
otherwise, if the file starts with the regular “0” and “ ” ASCII characters, the header is read using ANSEL encoding until the CHAR record is met, after which reading switches to the encoding specified in that record;
decoding errors are handled according to the mode specified when opening the GEDCOM file, which can be one of the standard error handling schemes defined in the codecs module. This scheme applies to both the header (before the CHAR record) and regular content.
See also Tamura Jones’ excellent article summarizing many varieties of illegal encodings that may be present in GEDCOM files.
Name representation
The GEDCOM NAME record defines a structured format for representing names, but applications are not required to fill in that structural information and can instead present the name as the value part of the NAME record in a “custom of culture” representation. The only requirement for that representation is that the surname should be delimited by slash characters, e.g.:
0 @I1@ INDI
1 NAME John /Smith/ -- given name and surname
0 @I2@ INDI
1 NAME Joanne -- without surname
0 @I3@ INDI
1 NAME /Иванов/ Иван Ив. -- surname and given name
0 @I4@ INDI
1 NAME Sir John /Ivanoff/ Jr. -- with prefix/suffix
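Splitting such a value on the slash delimiters is straightforward. The sketch below uses a hypothetical split_name function (not a ged4py API) and makes one simplifying choice of its own: text before and after the surname is joined into a single “given” string.

```python
def split_name(value):
    """Split a NAME value into (given, surname) using the slash delimiters.

    Text before and after the slashes is joined to form the given part.
    Hypothetical helper for illustration only.
    """
    before, slash, rest = value.partition("/")
    if not slash:
        return value.strip(), ""  # no surname delimiters at all
    surname, _, after = rest.partition("/")
    given = " ".join(part for part in (before.strip(), after.strip()) if part)
    return given, surname.strip()


examples = [
    "John /Smith/",
    "Joanne",
    "/Иванов/ Иван Ив.",
    "Sir John /Ivanoff/ Jr.",
]
results = [split_name(value) for value in examples]
for value, (given, surname) in zip(examples, results):
    print(f"{value!r}: given={given!r} surname={surname!r}")
```

Note that this cannot tell a prefix such as “Sir” from a given name; that distinction needs the structured sub-records or application-specific knowledge.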
Potentially an individual can have more than one NAME record, distinguished by a TYPE record which can be an arbitrary string; GEDCOM does not define standard or allowed types. Types could be used, for example, to specify a maiden name or names in previous marriages, e.g.:
0 @I1@ INDI
1 NAME Жанна /Иванова/
1 NAME Jeanne /d'Arc/
2 TYPE maiden
A couple of applications that I know of do not use TYPE records for maiden name representation; instead they chose different ways to encode names. Here is how individual applications encode names.
Agelong Tree (Genery)
Agelong Tree produces a single NAME record per individual; I don’t think it is possible to make it create more than one NAME record. Given name and surname are encoded as the value in the NAME record, and the given name also appears in the GIVN sub-record:
1 NAME Given Name /Surname/
2 GIVN Given Name
If a person has a maiden name then it is encoded as an additional surname enclosed in parentheses, and the SURN sub-record also specifies the maiden name:
1 NAME Given Name /Surname (Maiden)/
2 GIVN Given Name
2 SURN Maiden
Additionally Agelong tends to represent missing parts of names in the GEDCOM file with a question mark (?).
Agelong can also store name suffix and prefix; they are not included in the NAME record value but stored as NPFX and NSFX sub-records:
1 NAME Given Name /Surname/
2 NPFX Dr.
2 GIVN Given Name
2 NSFX Jr.
MyHeritage
MyHeritage Family Tree Builder can generate more than one NAME record, but I could not find a way to specify the TYPE of the created NAME records; likely all NAME records are created without TYPE, which is not too useful.
Given name and surname are encoded as the value in the NAME record and they also appear in GIVN and SURN sub-records:
1 NAME Given Name /Surname/
2 GIVN Given Name
2 SURN Surname
If the name of the person after marriage is different from the birth/maiden name (apparently in MyHeritage this can only happen for female individuals) then the married name is stored in a non-standard sub-record with the _MARNM tag:
1 NAME Given Name /Maiden/
2 GIVN Given Name
2 SURN Maiden
2 _MARNM Married
MyHeritage can also store name suffix and prefix, and also nickname in corresponding sub-records, they are not rendered in NAME record value:
1 NAME Given Name /Surname/
2 NPFX Dr.
2 GIVN Given Name
2 SURN Surname
2 NSFX Jr.
2 NICK Professore
MyHeritage can also store a few name pieces in NAME sub-records using non-standard tags such as _AKA, _RNAME (for religious name), _FORMERNAME, etc.
ged4py behavior
ged4py tries to determine individual name pieces from all the information in GEDCOM records. Because interpretation of the information depends on the application which produced the GEDCOM file, ged4py also has to determine the application name. The application name (a.k.a. GEDCOM “dialect”) is determined from the file header and is stored in the dialect property of the GedcomReader class (one of the DIALECT_* constants defined in the ged4py.model module). In general, naming of individuals can be overly complicated; ged4py tries to build a simpler model of person naming by determining four pieces of each individual’s name:
given name, in some cultures it can include middle (or father) name
first name, ged4py just uses first word (before space) of given name
last name, for married females who changed their name in marriage ged4py assumes this to be a married name
maiden name, only applies to married females who changed their name in marriage
Here is the algorithm that ged4py uses for extracting these pieces:
for Agelong dialect:
    only NAME record value is used, sub-records are ignored
    maiden name is determined from the parenthesized portion of the surname
    last name is everything except the maiden name in the surname
    given name is the value without the surname; it collects everything before and after the slashes in the NAME value
for MyHeritage dialect:
    if a _MARNM sub-record is present then it is used as the last name and everything between slashes in the NAME value is used as the maiden name
    otherwise everything between slashes is used as the last name, and the maiden name is empty
    given name is the NAME value without the slashes and the text between them
for other cases (“default” dialect):
    if there is a NAME record with a TYPE sub-record equal to ‘maiden’ then the surname from that record value is used as the maiden name
    if there is more than one NAME record, choose the one without a TYPE sub-record as the “primary” name, or use the first NAME record; last name comes from the primary NAME value between slashes, first name is the rest of the value.
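The Agelong and MyHeritage rules above can be sketched as a single function operating on one NAME value. This is only an approximation for illustration: name_pieces, the dialect strings, and the married parameter (standing in for the value of a _MARNM sub-record) are all hypothetical, and the multi-record TYPE handling of the “default” dialect is left out.

```python
import re

# "Surname (Maiden)" -> groups ("Surname", "Maiden")
_MAIDEN = re.compile(r"^(.*?)\s*\((.*)\)\s*$")


def _split(value):
    """Return (given, surname) from a NAME value with /surname/ delimiters."""
    before, slash, rest = value.partition("/")
    if not slash:
        return value.strip(), ""
    surname, _, after = rest.partition("/")
    given = " ".join(p for p in (before.strip(), after.strip()) if p)
    return given, surname.strip()


def name_pieces(value, dialect="default", married=None):
    """Return (given, first, last, maiden) for one NAME value.

    `dialect` is "agelong", "myheritage" or "default"; `married` stands in
    for the value of a _MARNM sub-record. Hypothetical helper, not ged4py.
    """
    given, surname = _split(value)
    last, maiden = surname, None
    if dialect == "agelong":
        match = _MAIDEN.match(surname)
        if match:
            last, maiden = match.group(1), match.group(2)
    elif dialect == "myheritage" and married:
        last, maiden = married, surname
    first = given.split(" ", 1)[0] if given else ""
    return given, first, last, maiden


agelong = name_pieces("Given Name /Surname (Maiden)/", "agelong")
myheritage = name_pieces("Given Name /Maiden/", "myheritage", married="Married")
default = name_pieces("John /Smith/")
print(agelong, myheritage, default, sep="\n")
```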
Examples
This page collects several simple code examples which use ged4py.
Example 1
Trivial example of opening the file, iterating over INDI records (which produces Individual instances) and printing basic information for each person. The format() method is used to produce a printable representation of a name, though this is only one of the possible ways to format names. The sub_tag_value() method is used to access the values of subordinate tags of the record; it can follow many levels of tags.
import sys
from ged4py.parser import GedcomReader

# open GEDCOM file
with GedcomReader(sys.argv[1]) as parser:
    # iterate over each INDI record in a file
    for i, indi in enumerate(parser.records0("INDI")):
        # Print a name (one of many possible representations)
        print(f"{i}: {indi.name.format()}")

        father = indi.father
        if father:
            print(f" father: {father.name.format()}")

        mother = indi.mother
        if mother:
            print(f" mother: {mother.name.format()}")

        # Get _value_ of the BIRT/DATE tag
        birth_date = indi.sub_tag_value("BIRT/DATE")
        if birth_date:
            print(f" birth date: {birth_date}")

        # Get _value_ of the BIRT/PLAC tag
        birth_place = indi.sub_tag_value("BIRT/PLAC")
        if birth_place:
            print(f" birth place: {birth_place}")
Example 2
This example iterates over FAM records in the file, which represent family structure. FAM records do not have a special record type so they produce generic Record instances. This example shows the use of the sub_tag() method, which can dereference pointer records contained in FAM records to retrieve the corresponding INDI records.
import sys
from ged4py.parser import GedcomReader

with GedcomReader(sys.argv[1]) as parser:
    # iterate over each FAM record in a file
    for i, fam in enumerate(parser.records0("FAM")):
        print(f"{i}:")

        # Get records for spouses, FAM record contains pointers to INDI
        # records but sub_tag knows how to follow the pointers and return
        # the referenced records instead.
        husband, wife = fam.sub_tag("HUSB"), fam.sub_tag("WIFE")
        if husband:
            print(f" husband: {husband.name.format()}")
        if wife:
            print(f" wife: {wife.name.format()}")

        # Get _value_ of the MARR/DATE tag
        marr_date = fam.sub_tag_value("MARR/DATE")
        if marr_date:
            print(f" marriage date: {marr_date}")

        # access all CHIL records, sub_tags method returns list (possibly empty)
        children = fam.sub_tags("CHIL")
        for child in children:
            # print name and date of birth
            print(f" child: {child.name.format()}")
            birth_date = child.sub_tag_value("BIRT/DATE")
            if birth_date:
                print(f" birth date: {birth_date}")
Example 3
This example shows how to specialize date formatting. Date representation in different calendars is a very complicated topic and ged4py cannot solve it in any general way. Instead it gives clients an option to specialize date handling in whatever way they prefer. This is done by implementing the DateValueVisitor interface and passing a visitor instance to the ged4py.date.DateValue.accept() method. For completeness one also has to implement CalendarDateVisitor to format or do anything else to the instances of CalendarDate; this is not shown in the example.
import sys
from ged4py.parser import GedcomReader
from ged4py.date import DateValueVisitor


class DateFormatter(DateValueVisitor):
    """Visitor class that produces string representation of dates.
    """
    def visitSimple(self, date):
        return f"{date.date}"

    def visitPeriod(self, date):
        return f"from {date.date1} to {date.date2}"

    def visitFrom(self, date):
        return f"from {date.date}"

    def visitTo(self, date):
        return f"to {date.date}"

    def visitRange(self, date):
        return f"between {date.date1} and {date.date2}"

    def visitBefore(self, date):
        return f"before {date.date}"

    def visitAfter(self, date):
        return f"after {date.date}"

    def visitAbout(self, date):
        return f"about {date.date}"

    def visitCalculated(self, date):
        return f"calculated {date.date}"

    def visitEstimated(self, date):
        return f"estimated {date.date}"

    def visitInterpreted(self, date):
        return f"interpreted {date.date} ({date.phrase})"

    def visitPhrase(self, date):
        return f"({date.phrase})"


format_visitor = DateFormatter()

with GedcomReader(sys.argv[1]) as parser:
    # iterate over each INDI record in a file
    for i, indi in enumerate(parser.records0("INDI")):
        print(f"{i}: {indi.name.format()}")

        # get all possible event types and print their dates,
        # full list of events is longer, this is only an example
        events = indi.sub_tags("BIRT", "CHR", "DEAT", "BURI", "ADOP", "EVEN")
        for event in events:
            date = event.sub_tag_value("DATE")
            # Some event types like generic EVEN can define TYPE tag
            event_type = event.sub_tag_value("TYPE")
            # pass a visitor to format the date
            if date:
                date_str = date.accept(format_visitor)
            else:
                date_str = "N/A"
            print(f" event: {event.tag} date: {date_str} type: {event_type}")
Contributing
Contributions are welcome, and they are greatly appreciated! Every little bit helps, and credit will always be given.
You can contribute in many ways:
Types of Contributions
Report Bugs
Report bugs at https://github.com/andy-z/ged4py/issues.
If you are reporting a bug, please include:
Your operating system name and version.
Any details about your local setup that might be helpful in troubleshooting.
Detailed steps to reproduce the bug.
Fix Bugs
Look through the GitHub issues for bugs. Anything tagged with “bug” and “help wanted” is open to whoever wants to implement it.
Implement Features
Look through the GitHub issues for features. Anything tagged with “enhancement” and “help wanted” is open to whoever wants to implement it.
Write Documentation
GEDCOM parser for Python could always use more documentation, whether as part of the official GEDCOM parser for Python docs, in docstrings, or even on the web in blog posts, articles, and such.
Submit Feedback
The best way to send feedback is to file an issue at https://github.com/andy-z/ged4py/issues.
If you are proposing a feature:
Explain in detail how it would work.
Keep the scope as narrow as possible, to make it easier to implement.
Remember that this is a volunteer-driven project, and that contributions are welcome :)
Get Started!
Ready to contribute? Here’s how to set up ged4py for local development.
Fork the ged4py repo on GitHub.
Clone your fork locally:
$ git clone git@github.com:your_name_here/ged4py.git
Install your local copy into a virtualenv. Assuming you have virtualenvwrapper installed, this is how you set up your fork for local development:
$ mkvirtualenv ged4py
$ cd ged4py/
$ python setup.py develop
Create a branch for local development:
$ git checkout -b name-of-your-bugfix-or-feature
Now you can make your changes locally.
When you’re done making changes, check that your changes pass flake8 and the tests, including testing other Python versions with tox:
$ flake8 ged4py tests
$ python setup.py test or py.test
$ tox
To get flake8 and tox, just pip install them into your virtualenv.
Commit your changes and push your branch to GitHub:
$ git add .
$ git commit -m "Your detailed description of your changes."
$ git push origin name-of-your-bugfix-or-feature
Submit a pull request through the GitHub website.
Pull Request Guidelines
Before you submit a pull request, check that it meets these guidelines:
The pull request should include tests.
If the pull request adds functionality, the docs should be updated. Put your new functionality into a function with a docstring, and add the feature to the list in README.rst.
The pull request should work for Python 3.6+. Check https://travis-ci.org/andy-z/ged4py/pull_requests and make sure that the tests pass for all supported Python versions.
History
0.4.4 (2021-05-01)
Add Python3.9 to tox, github test, and classifiers.
0.4.3 (2021-04-30)
Extend behavior of Record.sub_tags() method.
0.4.2 (2021-04-09)
Fix crash in sub_tag() with broken files
0.4.1 (2021-04-08)
Improve handling of invalid dates
0.4.0 (2020-10-09)
Python3 goodies, use enum classes for enums
0.3.2 (2020-10-04)
Use numpydoc style for docstrings, add extension to Sphinx
Drop Python2 compatibility code
0.3.1 (2020-09-28)
Use github actions instead of Travis CI
0.3.0 (2020-09-28)
Drop Python2 support
Python3 supported versions are 3.6 - 3.8
0.2.4 (2020-08-30)
Extend dialect detection for new genery.com SOUR format
0.2.3 (2020-08-29)
Disable hashing for Record types
Add hash method for DateValue and CalendarDate classes
Improve ordering of DateValue instances
0.2.2 (2020-08-16)
Fix parsing of DATE records with leading blanks
0.2.1 (2020-08-15)
Extend documentation with examples
Extend docstrings for few classes
0.2.0 (2020-07-05)
Improve support for GEDCOM date types
0.1.13 (2020-04-15)
Add support for MacOS line breaks (single CR character)
0.1.12 (2020-03-01)
Add support for a bunch of illegal encodings (thanks @Tuisto59 for report).
0.1.11 (2019-01-06)
Improve support for ANSEL encoded documents that use combining characters.
0.1.10 (2018-10-17)
Add protection for empty DATE fields.
0.1.9 (2018-05-17)
Improve exception messages, convert bytes to string
0.1.8 (2018-05-16)
Add simple integrity checks to parser
0.1.7 (2018-04-23)
Fix for DateValue comparison, few small improvements
0.1.6 (2018-04-02)
Improve handling of non-standard dates, any date string that cannot be parsed according to GEDCOM syntax is assumed to be a “Date phrase”
0.1.5 (2018-03-25)
Fix for exception due to empty NAME record
0.1.4 (2018-01-31)
Improve name parsing for ALTREE dialect
0.1.3 (2018-01-16)
Improve Py3 compatibility
0.1.2 (2017-11-26)
Get rid of name formatting options, too complicated for this package.
Describe name parsing for different dialects.
0.1.1 (2017-11-20)
Fix for missing modules.
0.1.0 (2017-07-17)
First release on PyPI.