ged4py.parser
Module containing methods for parsing GEDCOM files.
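Functions
guess_codec(file[, errors, require_char, warn]) | Look at file contents and guess its correct encoding.
Classes
GedcomLine(level, xref_id, tag, value, offset) | Class representing single line in a GEDCOM file.
GedcomReader(file[, encoding, errors, require_char]) | Main interface for reading GEDCOM files.
Exceptions
CodecError | Class for exceptions raised for codec-related errors.
IntegrityError | Class for exceptions raised for structural errors, e.g. when record level nesting is inconsistent.
ParserError | Class for exceptions raised for parsing errors.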
-
class ged4py.parser.GedcomReader(file, encoding=None, errors='strict', require_char=False)
Bases: object
Main interface for reading GEDCOM files.
- Parameters
- file
File name or file object open in binary mode, file must be seekable.
- encoding
str, optional. If None (default), the file is analyzed using guess_codec() to determine the correct codec. Otherwise the file is opened using the specified codec.
- errors
str, optional. Controls error handling during string decoding; accepts the same values as the standard codecs.decode function.
- require_char
bool, optional. If True, an exception is raised when the CHAR record is not found in the header. If False and CHAR is not in the header, the codec determined from the BOM, or “gedcom”, is used.
Notes
An instance of this class is used to read and parse a single GEDCOM file. Records in the GEDCOM file are transformed into instances of types defined in the ged4py.model module, either the ged4py.model.Record class or one of its sub-classes. The main way to access the data in the file is to iterate over level-0 records, optionally restricted by tag name, using GedcomReader.records0(). Most commonly, the top-level loop of code that reads a GEDCOM file will look like this:

    from ged4py.parser import GedcomReader

    # path is the name of the GEDCOM file to read
    with GedcomReader(path) as parser:
        # iterate over each INDI record in a file
        for record in parser.records0("INDI"):
            # do something with the record or navigate to other linked records
            pass
- Attributes
dialect | File dialect as one of ged4py.model.Dialect enums.
header | Header record (ged4py.model.Record).
index0 | List of level=0 record positions and tag names (list[(int, str)]).
xref0 | Dictionary which maps xref_id to level=0 record position and tag name (dict[str, (int, str)]).
Methods
GedcomLines(offset) | Generator method for GEDCOM lines.
read_record(offset) | Read next complete record from a file starting at given position.
records0([tag]) | Iterator over level=0 records with given tag.
-
property index0
List of level=0 record positions and tag names (list[(int, str)]).
-
property xref0
Dictionary which maps xref_id to level=0 record position and tag name (dict[str, (int, str)]).
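The two index properties can be combined with read_record() to jump straight to a record. The following is a minimal sketch, not part of ged4py: the helper names tally_tags and read_by_xref are hypothetical, and it assumes only what is documented here, namely that index0 holds (position, tag) pairs, that xref0 maps an xref_id to such a pair, and that read_record() accepts a position. The exact form of the xref_id keys (for example whether they include the surrounding @ characters) is an assumption to verify against a real file.

```python
from collections import Counter


def tally_tags(index0):
    """Count level=0 records per tag from a list of (position, tag) pairs."""
    return Counter(tag for _position, tag in index0)


def read_by_xref(reader, xref_id):
    """Return the record registered under xref_id, or None if it is unknown.

    `reader` is any object exposing `xref0` and `read_record(offset)` as
    described for GedcomReader; this helper itself is hypothetical.
    """
    entry = reader.xref0.get(xref_id)
    if entry is None:
        return None
    position, _tag = entry
    return reader.read_record(position)
```

With an open GedcomReader named parser, the calls would be tally_tags(parser.index0) and read_by_xref(parser, some_id), where some_id is a key taken from parser.xref0.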
-
property header
Header record (ged4py.model.Record).
-
property dialect
File dialect as one of ged4py.model.Dialect enums.
-
GedcomLines(offset)
Generator method for GEDCOM lines.
- Parameters
- offset
int Position in the file to start reading.
- Yields
- line
GedcomLine An object representing one line of GEDCOM file.
- Raises
- ParserError
Raised if lines have incorrect syntax.
Notes
The GEDCOM line grammar is defined in Chapter 1 of the GEDCOM standard. A line consists of the level number, an optional reference ID, the tag name, and an optional value, separated by spaces. Chapter 1 is purely about grammar; it does not assign any semantics to tags or levels. Consequently, this method does not perform any operations on the lines other than returning them in their order in the file.
This method iterates over all lines in the input file and converts each line into a GedcomLine instance. It is an implementation detail used by other methods; most clients will not need to use it.
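The line grammar described above (level, optional reference ID, tag, optional value) can be illustrated with a small regular expression. This is a simplified sketch and not the parser used by ged4py: ParsedLine and parse_line are hypothetical names, the sketch works on already-decoded text rather than bytes, it does not track file offsets, and it assumes reference IDs have the form @...@.

```python
import re
from typing import NamedTuple, Optional


class ParsedLine(NamedTuple):
    """Hypothetical stand-in for one parsed GEDCOM line."""
    level: int
    xref_id: Optional[str]
    tag: str
    value: Optional[str]


_LINE_RE = re.compile(
    r"^\s*(?P<level>\d+)"        # level number
    r"(?: (?P<xref>@[^@]+@))?"   # optional reference ID, assumed to be @...@
    r" (?P<tag>\w+)"             # tag name
    r"(?: (?P<value>.*))?$"      # optional value, everything after the tag
)


def parse_line(text: str) -> ParsedLine:
    """Split one line of text into level, reference ID, tag and value."""
    match = _LINE_RE.match(text.rstrip("\r\n"))
    if match is None:
        raise ValueError(f"not a valid GEDCOM line: {text!r}")
    return ParsedLine(
        int(match["level"]), match["xref"], match["tag"], match["value"]
    )
```

For example, parse_line("0 @I1@ INDI") gives level 0, reference ID "@I1@", tag "INDI" and no value, while parse_line("1 NAME John /Smith/") gives level 1, no reference ID, tag "NAME" and value "John /Smith/".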
-
records0(tag=None)
Iterator over level=0 records with given tag.
This is the main method of this class. Clients access data in GEDCOM files by iterating over level=0 records and then navigating to sub-records using the methods of the Record class.
-
read_record(offset)
Read next complete record from a file starting at given position.
Reads the record at the given position and all its sub-records. Reading stops at EOF or at the next record whose level number is the same or numerically smaller, that is, at the same or a higher level of the hierarchy. The file position after this method returns is not specified; re-position the file if you want to read other records.
This is mostly for internal use, regular clients don’t need to use it.
- Parameters
- offset
int Position in the file to start reading.
- Returns
- record
Record or None. A model.Record instance, or None if offset points past EOF.
- Raises
- ParserError
Raised if offset does not point to the beginning of a record or for any parsing errors.
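The stopping rule of read_record() (collect the starting line and everything nested below it, stop at the next line whose level is the same or smaller) can be sketched independently of the library. The function take_record below is a hypothetical illustration operating on plain (level, text) pairs; it is not how ged4py builds Record objects.

```python
from typing import Iterable, List, Tuple

Line = Tuple[int, str]  # (level number, remainder of the line)


def take_record(lines: Iterable[Line]) -> List[Line]:
    """Collect the first line and all following lines nested below it."""
    record: List[Line] = []
    start_level = None
    for level, text in lines:
        if start_level is None:
            # The first line defines the level of the record being read.
            start_level = level
        elif level <= start_level:
            # Same or numerically smaller level: the next record begins here.
            break
        record.append((level, text))
    return record
```

Given the lines (0, "@I1@ INDI"), (1, "NAME John"), (2, "GIVN John"), (1, "SEX M"), (0, "@I2@ INDI"), the function returns the first four pairs; started at (1, "NAME John") it returns only that line and its level-2 child.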
-
exception ged4py.parser.ParserError
Bases: Exception
Class for exceptions raised for parsing errors.
-
exception ged4py.parser.CodecError
Bases: ged4py.parser.ParserError
Class for exceptions raised for codec-related errors.
-
exception ged4py.parser.IntegrityError
Bases: Exception
Class for exceptions raised for structural errors, e.g. when record level nesting is inconsistent.
-
ged4py.parser.guess_codec(file, errors='strict', require_char=False, warn=True)
Look at file contents and guess its correct encoding.
The file must be open in binary mode and positioned at offset 0. If a BOM is present, the file is assumed to be UTF-8 or UTF-16 encoded. The GEDCOM header is searched for a CHAR record and the encoding name is extracted from it; if a BOM is present, the CHAR record must match the BOM-defined encoding.
- Parameters
- file
File object, must be open in binary mode.
- errors
str, optional. Controls error handling during string decoding; accepts the same values as the standard codecs.decode function.
- require_char
bool, optional. If True, an exception is raised when the CHAR record is not found in the header. If False and CHAR is not in the header, the codec determined from the BOM, or “gedcom”, is returned.
- warn
bool, optional. If True (default), generate error/warning messages for illegal encodings.
- Returns
- codec_name
str The name of the codec in this file.
- bom_size
int Size of the BOM record, 0 if no BOM record.
- Raises
- CodecError
Raised if codec name in file is unknown or when codec name in file contradicts codec determined from BOM.
- UnicodeDecodeError
Raised if the codec fails to decode input lines and errors is set to “strict” (default).
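The BOM step of the detection can be sketched with the constants from the standard codecs module. This is only an illustration of that first step under stated assumptions: sniff_bom is a hypothetical helper, the codec names it returns are chosen for the example and may differ from what guess_codec() reports, and it does not look at the CHAR record at all.

```python
import codecs
import io

# BOM signatures for the encodings mentioned above; the codec names on the
# right are assumptions made for this sketch.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def sniff_bom(file):
    """Return (codec_name, bom_size); (None, 0) if the file has no BOM.

    `file` must be open in binary mode and positioned at offset 0. The
    position is restored to 0 before returning.
    """
    head = file.read(4)
    file.seek(0)
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name, len(bom)
    return None, 0


# Usage with an in-memory file standing in for a real GEDCOM file.
sample = io.BytesIO(codecs.BOM_UTF8 + b"0 HEAD\r\n1 CHAR UTF-8\r\n")
codec_name, bom_size = sniff_bom(sample)
```

The returned pair mirrors the shape of the guess_codec() return value (codec name and BOM size), which makes it easy to skip the BOM bytes before decoding the remaining lines.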
-
class ged4py.parser.GedcomLine(level: int, xref_id: Optional[str], tag: str, value: bytes, offset: int)
Bases: tuple
Class representing single line in a GEDCOM file.
Note
Mostly for internal use by parser, most clients do not need to know about this class.
- Attributes
Methods
count(value, /) | Return number of occurrences of value.
index(value[, start, stop]) | Return first index of value.
-
property level
Record level number (int)
-
property xref_id
Reference for this record (str or None)
-
property tag
Tag name (str)
-
property value
Record value (bytes)
-
property offset
Record offset in a file (int)