ged4py.parser

Module containing methods for parsing GEDCOM files.

Functions

guess_codec(file[, errors, require_char, warn])

Look at file contents and guess its correct encoding.

Classes

GedcomLine(level, xref_id, tag, value, offset)

Class representing single line in a GEDCOM file.

GedcomReader(file[, encoding, errors, …])

Main interface for reading GEDCOM files.

Exceptions

CodecError

Class for exceptions raised for codec-related errors.

IntegrityError

Class for exceptions raised for structural errors, e.g. when record level nesting is inconsistent.

ParserError

Class for exceptions raised for parsing errors.

class ged4py.parser.GedcomReader(file, encoding=None, errors='strict', require_char=False)[source]

Bases: object

Main interface for reading GEDCOM files.

Parameters
file

File name or file object opened in binary mode; the file must be seekable.

encoding : str, optional

If None (default), the file is analyzed with guess_codec() to determine the correct codec. Otherwise the file is opened with the specified codec.

errors : str, optional

Controls error handling during string decoding; accepts the same values as the standard codecs.decode function.

require_char : bool, optional

If True, an exception is raised when no CHAR record is found in the header. If False and CHAR is not in the header, the codec determined from the BOM, or “gedcom”, is used.

Notes

An instance of this class reads and parses a single GEDCOM file. Records in the file are transformed into instances of the types defined in the ged4py.model module: either ged4py.model.Record or one of its subclasses. The main way to access the data is to iterate over level-0 records, optionally restricted by tag name, using GedcomReader.records0(). Most commonly, the top-level loop of code that reads a GEDCOM file looks like this:

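from ged4py.parser import GedcomReader

path = "family.ged"  # hypothetical file name
with GedcomReader(path) as parser:
    # iterate over each INDI record in the file
    for record in parser.records0("INDI"):
        # do something with the record or navigate to other linked records
        print(record.xref_id, record.name.format())

Attributes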
dialect

File dialect as one of ged4py.model.Dialect enums.

header

Header record (ged4py.model.Record).

index0

List of level=0 record positions and tag names (list[(int, str)]).

xref0

Dictionary which maps xref_id to level=0 record position and tag name (dict[str, (int, str)]).

Methods

GedcomLines(offset)

Generator method for gedcom lines.

read_record(offset)

Read next complete record from a file starting at given position.

records0([tag])

Iterator over level=0 records with given tag.

property index0

List of level=0 record positions and tag names (list[(int, str)]).

property xref0

Dictionary which maps xref_id to level=0 record position and tag name (dict[str, (int, str)]).

property header

Header record (ged4py.model.Record).

property dialect

File dialect as one of ged4py.model.Dialect enums.
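To make the shape of index0 and xref0 concrete, the following is a minimal sketch of how such indexes could be built from already-parsed lines. The function build_indexes and its (offset, level, xref_id, tag) input tuples are hypothetical names chosen for illustration; this is not the library's implementation.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical input shape: (offset, level, xref_id, tag) for each line.
ParsedLine = Tuple[int, int, Optional[str], str]


def build_indexes(
    lines: Iterable[ParsedLine],
) -> Tuple[List[Tuple[int, str]], Dict[str, Tuple[int, str]]]:
    """Return (index0, xref0) in the shapes documented for GedcomReader."""
    index0: List[Tuple[int, str]] = []
    xref0: Dict[str, Tuple[int, str]] = {}
    for offset, level, xref_id, tag in lines:
        if level != 0:
            continue  # only level-0 records are indexed
        index0.append((offset, tag))
        if xref_id:
            # records without a reference ID (e.g. HEAD) appear only in index0
            xref0[xref_id] = (offset, tag)
    return index0, xref0


sample = [
    (0, 0, None, "HEAD"),
    (7, 1, None, "CHAR"),
    (20, 0, "@I1@", "INDI"),
    (32, 1, None, "NAME"),
    (50, 0, "@F1@", "FAM"),
    (62, 0, None, "TRLR"),
]
index0, xref0 = build_indexes(sample)
# positions of all INDI records, as a tag-restricted iteration would need them
indi_positions = [pos for pos, tag in index0 if tag == "INDI"]
```

With indexes of this shape, a lookup by reference ID is a single dictionary access, and iteration restricted to one tag is a filter over the position list.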

GedcomLines(offset)[source]

Generator method for gedcom lines.

Parameters
offset : int

Position in the file to start reading.

Yields
line : GedcomLine

An object representing one line of a GEDCOM file.

Raises
ParserError

Raised if lines have incorrect syntax.

Notes

The GEDCOM line grammar is defined in Chapter 1 of the GEDCOM standard. A line consists of the level number, an optional reference ID, the tag name, and an optional value, separated by spaces. Chapter 1 is pure grammar; it does not assign any semantics to tags or levels. Consequently, this method performs no operations on the lines other than returning them in file order.

This method iterates over the lines of the input file, starting at offset, and converts each line into a GedcomLine instance. It is an implementation detail used by other methods; most clients will not need to call it.
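The line grammar described above can be illustrated with a short, self-contained sketch. The Line tuple, the regular expression, and parse_line are hypothetical stand-ins that mirror the documented GedcomLine fields; the actual parser may differ, and this sketch assumes single-space separators and ASCII tags and reference IDs.

```python
import re
from typing import NamedTuple, Optional


class Line(NamedTuple):
    """Hypothetical stand-in with the same fields as GedcomLine."""
    level: int
    xref_id: Optional[str]
    tag: str
    value: Optional[bytes]
    offset: int


# level, optional @XREF@, tag, optional value; single-space separators assumed
_LINE_RE = re.compile(
    rb"^\s*(?P<level>\d+)"
    rb"(?: (?P<xref>@[^@]+@))?"
    rb" (?P<tag>\w+)"
    rb"(?: (?P<value>.*))?$",
    re.DOTALL,
)


def parse_line(raw: bytes, offset: int = 0) -> Line:
    """Split one raw line into its grammar-level parts, without semantics."""
    match = _LINE_RE.match(raw.rstrip(b"\r\n"))
    if match is None:
        raise ValueError(f"bad GEDCOM line at offset {offset}: {raw!r}")
    xref = match.group("xref")
    return Line(
        level=int(match.group("level")),
        xref_id=xref.decode("ascii") if xref else None,
        tag=match.group("tag").decode("ascii"),
        value=match.group("value"),  # kept as bytes, like GedcomLine.value
        offset=offset,
    )
```

The value is left undecoded on purpose: at the grammar level the encoding is not yet applied, which matches the bytes type documented for GedcomLine.value.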

records0(tag=None)[source]

Iterator over level=0 records with given tag.

This is the main method of this class. Clients access data in GEDCOM files by iterating over level=0 records and then navigating to sub-records using the methods of the Record class.

Parameters
tag : str, optional

If tag is None (default) then return all level=0 records, otherwise return level=0 records with the given tag.

Yields
record : Record

Instances of Record or its subclasses.
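As an illustration of iterating without a tag restriction, the sketch below tallies level-0 records by tag. The helper count_by_tag is a hypothetical name; it relies only on the documented records0() method and the tag attribute of the yielded records, and it accepts any object that provides them.

```python
from collections import Counter


def count_by_tag(reader) -> Counter:
    """Count level-0 records per tag.

    `reader` is expected to be an open GedcomReader (or any object with a
    compatible records0() method); calling records0() with no argument
    yields every level-0 record.
    """
    return Counter(record.tag for record in reader.records0())
```

A typical call would pass the reader obtained from `with GedcomReader(path) as reader:` and print the resulting counts.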

read_record(offset)[source]

Read next complete record from a file starting at given position.

Reads the record at the given position and all its sub-records. Reading stops at EOF or at the next record with the same or a higher (numerically smaller) level number. The file position after this method returns is not specified; re-position the file if you want to read other records.

This is mostly for internal use; regular clients don’t need to use it.

Parameters
offset : int

Position in the file to start reading.

Returns
record : Record or None

model.Record instance or None if offset points past EOF.

Raises
ParserError

Raised if offset does not point to the beginning of a record, or for any parsing errors.
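The stopping rule described for read_record() can be sketched on plain (level, tag) pairs. The function record_lines is a hypothetical illustration of that rule only; it does not build Record objects or touch a file.

```python
from typing import Iterable, List, Tuple


def record_lines(lines: Iterable[Tuple[int, str]]) -> List[Tuple[int, str]]:
    """Collect the first line plus all lines nested below it."""
    iterator = iter(lines)
    try:
        first = next(iterator)
    except StopIteration:
        return []  # nothing to read, comparable to an offset past EOF
    collected = [first]
    for level, tag in iterator:
        if level <= first[0]:
            break  # next record at the same or a higher (smaller) level
        collected.append((level, tag))
    return collected


lines = [
    (0, "INDI"), (1, "NAME"), (2, "GIVN"), (1, "BIRT"),
    (0, "FAM"), (1, "HUSB"),
]
whole_indi = record_lines(lines)      # stops before (0, "FAM")
name_only = record_lines(lines[1:])   # stops before (1, "BIRT")
```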

exception ged4py.parser.ParserError[source]

Bases: Exception

Class for exceptions raised for parsing errors.

exception ged4py.parser.CodecError[source]

Bases: ged4py.parser.ParserError

Class for exceptions raised for codec-related errors.

exception ged4py.parser.IntegrityError[source]

Bases: Exception

Class for exceptions raised for structural errors, e.g. when record level nesting is inconsistent.

ged4py.parser.guess_codec(file, errors='strict', require_char=False, warn=True)[source]

Look at file contents and guess its correct encoding.

The file must be open in binary mode and positioned at offset 0. If a BOM (byte order mark) is present, the file is assumed to be UTF-8 or UTF-16 encoded. The GEDCOM header is then searched for a CHAR record and the encoding name is extracted from it; if a BOM is present, the CHAR record must match the BOM-defined encoding.

Parameters
file

File object, must be open in binary mode.

errors : str, optional

Controls error handling during string decoding; accepts the same values as the standard codecs.decode function.

require_char : bool, optional

If True, an exception is raised when no CHAR record is found in the header. If False and CHAR is not in the header, the codec determined from the BOM, or “gedcom”, is returned.

warn : bool, optional

If True (default), error/warning messages are generated for illegal encodings.

Returns
codec_name : str

The name of the codec in this file.

bom_size : int

Size of the BOM record, 0 if no BOM record.

Raises
CodecError

Raised if codec name in file is unknown or when codec name in file contradicts codec determined from BOM.

UnicodeDecodeError

Raised if codec fails to decode input lines and errors is set to “strict” (default).
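The BOM step of the detection can be sketched with the standard codecs module. The helper sniff_bom is a hypothetical name and covers only the UTF-8 and UTF-16 marks mentioned above; it does not read the CHAR record or perform the consistency check that guess_codec() is documented to do.

```python
import codecs
from typing import Optional, Tuple

# BOMs considered in this sketch, paired with the codec each one implies
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def sniff_bom(head: bytes) -> Tuple[Optional[str], int]:
    """Return (codec_name, bom_size) for the first bytes of a file.

    codec_name is None and bom_size is 0 when no BOM is present.
    """
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name, len(bom)
    return None, 0
```

The returned size tells a caller how many bytes to skip before the first GEDCOM line, which is the role of bom_size in the return value documented above.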

class ged4py.parser.GedcomLine(level: int, xref_id: str, tag: str, value: bytes, offset: int)[source]

Bases: tuple

Class representing single line in a GEDCOM file.

Note

Mostly for internal use by the parser; most clients do not need to know about this class.

Attributes
level : int

Alias for field number 0

xref_id : str, possibly empty or None

Alias for field number 1

tag : str, required, non-empty

Alias for field number 2

value : bytes, possibly empty or None

Alias for field number 3

offset : int

Alias for field number 4

Methods

count(value, /)

Return number of occurrences of value.

index(value[, start, stop])

Return first index of value.

property level

Record level number (int)

property xref_id

Reference for this record (str or None)

property tag

Tag name (str)

property value

Record value (bytes)

property offset

Record offset in a file (int)