Scanner

The Scanner class and related functions perform basic parsing of RFC 822-style header fields, splitting formatted input up into sequences of (name, value) pairs without any further validation or transformation.

Each pair returned by a scanner method or function represents an individual header field. The first element (the header field name) is the substring up to but not including the first whitespace-padded colon (or other delimiter specified by separator_regex) in the first source line of the header field. The second element (the header field value) is a single string formed by concatenating one or more lines, starting with the substring after that first delimiter in the first source line. Leading whitespace on lines after the first is preserved. The ending of each line is converted to "\n" (one is added if the input line has no line ending), except that the last line of the field value has its trailing line ending (if any) removed.
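The assembly rule for a single field can be illustrated with a minimal standard-library sketch. This is not the library's implementation: assemble_field is a hypothetical helper that takes the raw lines of one header field, assumes the default separator, and applies the name and value rules as described here.

```python
import re

# Default name-value separator, as documented for separator_regex.
SEPARATOR = re.compile(r"[ \t]*:[ \t]*")
# A single trailing CR LF, CR, or LF at the very end of a line.
LINE_ENDING = re.compile(r"(?:\r\n|\r|\n)\Z")


def assemble_field(lines):
    """Hypothetical helper: turn one field's raw lines into (name, value)."""
    # Drop each line's ending; "\n" is reinserted between lines below,
    # so the last line ends up with no trailing line ending.
    stripped = [LINE_ENDING.sub("", line) for line in lines]
    match = SEPARATOR.search(stripped[0])
    if match is None:
        raise ValueError("first line has no name-value separator")
    name = stripped[0][: match.start()]
    first_value_line = stripped[0][match.end():]
    # Continuation lines keep their leading whitespace.
    return name, "\n".join([first_value_line, *stripped[1:]])


print(assemble_field(["Subject: Hello\r\n", "  folded world\n"]))
```

In this sketch the CR LF after "Hello" becomes "\n", the two leading spaces of the continuation line survive, and the final "\n" is dropped.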

Note

“Line ending” here means a CR, LF, or CR LF sequence. Unicode line separators are not treated as line endings and are not trimmed or converted to "\n".
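To illustrate the distinction (a standard-library sketch, not the library's own code): splitting on only CR, LF, and CR LF leaves a Unicode line separator such as U+2028 inside the line, whereas Python's str.splitlines() splits there as well.

```python
import re

# Split only on CR LF, CR, or LF (CR LF is listed first so it counts once).
ASCII_LINE_ENDING = re.compile(r"\r\n|\r|\n")

text = "a\rb\r\nc\u2028d\ne"

# Only the three ASCII line endings act as boundaries here.
ascii_split = ASCII_LINE_ENDING.split(text)
# str.splitlines() additionally treats U+2028 as a line boundary.
unicode_split = text.splitlines()

print(ascii_split)
print(unicode_split)
```

The first list has four items, with "c\u2028d" kept as one line; the second has five.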

Scanner Class

class headerparser.Scanner(data: str | Iterable[str], *, separator_regex: str | Pattern[str] | None = re.compile('[ \\t]*:[ \\t]*'), skip_leading_newlines: bool | None = False)

Added in version 0.5.0.

A class for scanning text for RFC 822-style header fields. Each method processes some portion of the input yet unscanned; the scan(), scan_stanzas(), and get_unscanned() methods process the entirety of the remaining input, while the scan_next_stanza() method only processes up through the first blank line.

Parameters:

data – The text to scan. This may be a string, a text-file-like object, or an iterable of lines. If it is a string, it will be broken into lines on CR, LF, and CR LF boundaries.
separator_regex – A regex (as a str or compiled regex object) defining the name-value separator; defaults to [ \t]*:[ \t]*. When the regex is found in a line, everything before the matched substring becomes the field name, and everything after becomes the first line of the field value. Note that whitespace around the separator is trimmed from the name and value only if the regex itself matches that whitespace.
skip_leading_newlines (bool) – If True, blank lines at the beginning of the input will be discarded. If False, a blank line at the beginning of the input marks the end of an empty header section.
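The effect of the separator's whitespace handling can be seen in a small standard-library sketch. split_on_separator is a hypothetical helper written for illustration, not part of headerparser; it only mirrors the rule that the text before the match becomes the name and the text after it becomes the first line of the value.

```python
import re


def split_on_separator(line, separator_regex=r"[ \t]*:[ \t]*"):
    """Hypothetical helper: split one line at the first separator match."""
    match = re.search(separator_regex, line)
    if match is None:
        return None
    return line[: match.start()], line[match.end():]


# Default separator: whitespace around the colon is part of the match.
print(split_on_separator("Key : value"))
# A bare "=" does not match the surrounding spaces, so they are kept.
print(split_on_separator("Key = value", "="))
# Including the whitespace in the regex trims it again.
print(split_on_separator("Key = value", r"[ \t]*=[ \t]*"))
```

Under these assumptions the second call yields a name with a trailing space and a value with a leading space, which is why a custom separator usually needs to match the padding itself.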

get_unscanned() → str

Return all of the input that has not yet been processed. After calling this method, calling any method again on the same Scanner instance will raise ScannerEOFError.

Raises:

ScannerEOFError – if all of the input has already been consumed

scan() → Iterator[tuple[str | None, str]]

Scan the remaining input for RFC 822-style header fields and return a generator of (name, value) pairs for each header field encountered, plus a (None, body) pair representing the body (if any) after the header section.

All lines after the first blank line are concatenated & yielded as-is in a (None, body) pair. (Note that body lines which do not end with a line terminator will not have one appended.) If there is no empty line in the input, then no body pair is yielded. If the empty line is the last line in the input, the body will be the empty string. If the empty line is the first line in the input and the skip_leading_newlines option is false (the default), then all other lines will be treated as part of the body and will not be scanned for header fields.
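The body rules can be sketched with the standard library alone. split_header_body is a hypothetical helper, not the library's code; it assumes skip_leading_newlines is false and treats a line as blank when nothing remains after removing its line ending.

```python
import re

# One line per match, with its ending (CR LF, CR, or LF) kept attached;
# the final line may have no ending at all.
LINE = re.compile(r"[^\r\n]*(?:\r\n|\r|\n)|[^\r\n]+")


def split_header_body(text):
    """Hypothetical helper: return (header_lines, body); body is None
    when the input contains no blank line."""
    lines = LINE.findall(text)
    for index, line in enumerate(lines):
        if line.rstrip("\r\n") == "":
            # Everything after the first blank line is the body, verbatim.
            return lines[:index], "".join(lines[index + 1:])
    return lines, None


print(split_header_body("A: 1\n\nbody\nmore"))  # body without a final newline
print(split_header_body("A: 1\n"))              # no blank line: no body
print(split_header_body("A: 1\n\n"))            # blank line last: empty body
print(split_header_body("\nA: 1\n"))            # blank line first: all body
```

The four calls correspond, in order, to the four cases described above.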

Raises:

ScannerError – if the header section is malformed
ScannerEOFError – if all of the input has already been consumed

scan_next_stanza() → Iterator[tuple[str, str]]

Scan the remaining input for RFC 822-style header fields and return a generator of (name, value) pairs for each header field in the input. Input processing stops as soon as a blank line is encountered. (If skip_leading_newlines is true, the function only stops on a blank line after a non-blank line.)
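The stopping rule can be illustrated with a standard-library sketch over an iterator of lines. take_stanza is a hypothetical helper, not the library's implementation; it returns raw lines rather than (name, value) pairs and only models where consumption stops.

```python
def take_stanza(line_iter, skip_leading_newlines=False):
    """Hypothetical helper: consume lines up to and including the first
    blank line and return the non-blank lines seen before it."""
    stanza = []
    for line in line_iter:
        if line.rstrip("\r\n") == "":
            if skip_leading_newlines and not stanza:
                # Leading blank lines are discarded, not treated as a stop.
                continue
            break
        stanza.append(line)
    return stanza


lines = ["\n", "A: 1\n", "\n", "B: 2\n"]

it = iter(lines)
print(take_stanza(it, skip_leading_newlines=True), list(it))

it = iter(lines)
print(take_stanza(it), list(it))
```

With skip_leading_newlines true, the sketch returns the "A: 1" line and leaves "B: 2" unconsumed; with it false, the leading blank line ends an empty stanza and the other three lines remain.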

Raises:

ScannerError – if the header section is malformed
ScannerEOFError – if all of the input has already been consumed

scan_stanzas() → Iterator[list[tuple[str, str]]]

Scan the remaining input for zero or more stanzas of RFC 822-style header fields and return a generator of lists of (name, value) pairs, where each list represents a stanza of header fields in the input.

The stanzas are terminated by blank lines. Consecutive blank lines between stanzas are treated as a single blank line. Blank lines at the end of the input are discarded without creating a new stanza.
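These grouping rules can be sketched with itertools.groupby. group_stanzas and is_blank are hypothetical helpers, not the library's code; the sketch groups raw lines rather than parsed fields and assumes the input does not begin with a blank line.

```python
from itertools import groupby


def is_blank(line):
    """A line is blank when nothing remains after removing its ending."""
    return line.rstrip("\r\n") == ""


def group_stanzas(lines):
    """Hypothetical helper: group lines into stanzas split on blank lines."""
    # groupby puts each run of consecutive blank lines into one group, so
    # dropping the blank groups collapses repeated separators and discards
    # trailing blank lines without producing empty stanzas.
    return [list(group) for blank, group in groupby(lines, key=is_blank) if not blank]


lines = ["A: 1\n", "B: 2\n", "\n", "\n", "C: 3\n", "\n", "\n"]
print(group_stanzas(lines))
```

Here the two blank lines between the stanzas act as one separator, and the two at the end do not create a third stanza.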

Raises:

ScannerError – if the header section is malformed
ScannerEOFError – if all of the input has already been consumed

Functions

headerparser.scan(data: str | Iterable[str], *, separator_regex: str | Pattern[str] | None = None, skip_leading_newlines: bool = False) → Iterator[tuple[str | None, str]]

Added in version 0.4.0.

Scan a string, text-file-like object, or iterable of lines for RFC 822-style header fields and return a generator of (name, value) pairs for each header field in the input, plus a (None, body) pair representing the body (if any) after the header section.

If data is a string, it will be broken into lines on CR, LF, and CR LF boundaries.

All lines after the first blank line are concatenated & yielded as-is in a (None, body) pair. (Note that body lines which do not end with a line terminator will not have one appended.) If there is no empty line in data, then no body pair is yielded. If the empty line is the last line in data, the body will be the empty string. If the empty line is the first line in data and the skip_leading_newlines option is false (the default), then all other lines will be treated as part of the body and will not be scanned for header fields.

Changed in version 0.5.0: data can now be a string.

Parameters:

data – a string, text-file-like object, or iterable of strings representing lines of input
kwargs – Passed to the Scanner constructor

Return type:

generator of pairs of strings

Raises:

ScannerError – if the header section is malformed

headerparser.scan_stanzas(data: str | Iterable[str], *, separator_regex: str | Pattern[str] | None = None, skip_leading_newlines: bool = False) → Iterator[list[tuple[str, str]]]

Added in version 0.4.0.

Scan a string, text-file-like object, or iterable of lines for zero or more stanzas of RFC 822-style header fields and return a generator of lists of (name, value) pairs, where each list represents a stanza of header fields in the input.

If data is a string, it will be broken into lines on CR, LF, and CR LF boundaries.

The stanzas are terminated by blank lines. Consecutive blank lines between stanzas are treated as a single blank line. Blank lines at the end of the input are discarded without creating a new stanza.

Changed in version 0.5.0: data can now be a string.

Parameters:

data – a string, text-file-like object, or iterable of strings representing lines of input
kwargs – Passed to the Scanner constructor

Return type:

generator of lists of pairs of strings

Raises:

ScannerError – if the header section is malformed

Deprecated Functions

headerparser.scan_string(s: str, *, separator_regex: str | Pattern[str] | None = None, skip_leading_newlines: bool = False) → Iterator[tuple[str | None, str]]

Scan a string for RFC 822-style header fields and return a generator of (name, value) pairs for each header field in the input, plus a (None, body) pair representing the body (if any) after the header section.

See scan() for more information on the exact behavior of the scanner.

Deprecated since version 0.5.0: Use scan() instead.

Parameters:

s – a string which will be broken into lines on CR, LF, and CR LF boundaries and passed to scan()
kwargs – Passed to the Scanner constructor

Return type:

generator of pairs of strings

Raises:

ScannerError – if the header section is malformed

headerparser.scan_stanzas_string(s: str, *, separator_regex: str | Pattern[str] | None = None, skip_leading_newlines: bool = False) → Iterator[list[tuple[str, str]]]

Added in version 0.4.0.

Scan a string for zero or more stanzas of RFC 822-style header fields and return a generator of lists of (name, value) pairs, where each list represents a stanza of header fields in the input.

The stanzas are terminated by blank lines. Consecutive blank lines between stanzas are treated as a single blank line. Blank lines at the end of the input are discarded without creating a new stanza.

Deprecated since version 0.5.0: Use scan_stanzas() instead.

Parameters:

s – a string which will be broken into lines on CR, LF, and CR LF boundaries and passed to scan_stanzas()
kwargs – Passed to the Scanner constructor

Return type:

generator of lists of pairs of strings

Raises:

ScannerError – if the header section is malformed

headerparser.scan_next_stanza(iterator: Iterator[str], *, separator_regex: str | Pattern[str] | None = None, skip_leading_newlines: bool = False) → Iterator[tuple[str, str]]

Added in version 0.4.0.

Scan a text-file-like object or iterator of lines for RFC 822-style header fields and return a generator of (name, value) pairs for each header field in the input. Input processing stops as soon as a blank line is encountered, leaving the rest of the iterator unconsumed. (If skip_leading_newlines is true, the function only stops on a blank line after a non-blank line.)

Deprecated since version 0.5.0: Use Scanner.scan_next_stanza() instead.

Parameters:

iterator – a text-file-like object or iterator of strings representing lines of input
kwargs – Passed to the Scanner constructor

Return type:

generator of pairs of strings

Raises:

ScannerError – if the header section is malformed

headerparser.scan_next_stanza_string(s: str, *, separator_regex: str | Pattern[str] | None = None, skip_leading_newlines: bool = False) → tuple[list[tuple[str, str]], str]

Added in version 0.4.0.

Scan a string for RFC 822-style header fields and return a pair (fields, extra), where fields is a list of (name, value) pairs for each header field in the input up to the first blank line and extra is everything after the first blank line. (If skip_leading_newlines is true, the dividing point is instead the first blank line after a non-blank line.) If there is no appropriate blank line in the input, extra is the empty string.

Deprecated since version 0.5.0: Use Scanner.scan_next_stanza() instead.

Parameters:

s – a string to scan
kwargs – Passed to the Scanner constructor

Return type:

pair of a list of pairs of strings and a string

Raises:

ScannerError – if the header section is malformed