Scanner
The Scanner class and related functions perform basic parsing of RFC
822-style header fields, splitting formatted input up into sequences of
(name, value) pairs without any further validation or transformation.
Each pair returned by a scanner method or function represents an individual
header field. The first element (the header field name) is the substring up to
but not including the first whitespace-padded colon (or other delimiter
specified by separator_regex) in the first source line of the header field.
The second element (the header field value) is a single string, the
concatenation of one or more lines, starting with the substring after the first
colon in the first source line, with leading whitespace on lines after the
first preserved; the ending of each line is converted to "\n" (added if
there is no line ending in the actual input), and the last line of the field
value has its trailing line ending (if any) removed.
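The rules above can be sketched with the standard library alone. This is an illustrative model, not the library's implementation: the split_field helper and its sample input are hypothetical, and only the default separator pattern is taken from this page.

```python
import re

# Default separator described on this page: a colon with optional
# surrounding spaces/tabs.
SEPARATOR = re.compile(r"[ \t]*:[ \t]*")


def split_field(lines):
    """Hypothetical helper: build a (name, value) pair from the source
    lines of one header field (first line plus continuation lines)."""
    first, *rest = lines
    m = SEPARATOR.search(first)
    if m is None:
        raise ValueError(f"no separator found in {first!r}")
    # Everything before the matched separator is the field name.
    name = first[: m.start()]
    # Strip each line's own ending, then join with "\n" so that the last
    # line of the value carries no trailing line ending. Leading
    # whitespace on continuation lines is left untouched.
    parts = [first[m.end():]] + rest
    value = "\n".join(p.rstrip("\r\n") for p in parts)
    return name, value


pair = split_field(["Subject : Hello,\n", "  world\r\n"])
print(pair)
```

Here the CR LF ending of the continuation line is dropped because it is the last line, while the break between the two lines becomes a single "\n".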
Note
“Line ending” here means a CR, LF, or CR LF sequence. Unicode line
separators are not treated as line endings and are not trimmed or converted
to "\n".
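To make the distinction concrete, the following stdlib-only sketch contrasts splitting on CR, LF, and CR LF with Python's str.splitlines(), which also breaks on Unicode separators such as U+2028. The LINE_ENDING pattern and the sample text are assumptions made for illustration; they are not taken from the library.

```python
import re

# Only CR, LF, and CR LF count as line endings in the sense used here.
LINE_ENDING = re.compile(r"\r\n|\r|\n")

text = "Foo: a\u2028b\r\nBar: c\rBaz: d\n"

# Splitting on the three ASCII endings keeps U+2028 inside the first line.
# (The filter only drops the empty string left after the final "\n".)
ascii_lines = [ln for ln in LINE_ENDING.split(text) if ln]

# str.splitlines() additionally treats U+2028 as a line boundary.
unicode_lines = text.splitlines()

print(ascii_lines)
print(unicode_lines)
```

The first list has three entries and the second has four, which is why a plain splitlines() call does not reproduce the behavior described in the note.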
Scanner Class
- class headerparser.Scanner(data: str | Iterable[str], *, separator_regex: str | Pattern[str] | None = re.compile('[ \\t]*:[ \\t]*'), skip_leading_newlines: bool | None = False)
Added in version 0.5.0.
A class for scanning text for RFC 822-style header fields. Each method processes some portion of the input yet unscanned; the scan(), scan_stanzas(), and get_unscanned() methods process the entirety of the remaining input, while the scan_next_stanza() method only processes up through the first blank line.
- Parameters:
data – The text to scan. This may be a string, a text-file-like object, or an iterable of lines. If it is a string, it will be broken into lines on CR, LF, and CR LF boundaries.
separator_regex – A regex (as a str or compiled regex object) defining the name-value separator; defaults to [ \t]*:[ \t]*. When the regex is found in a line, everything before the matched substring becomes the field name, and everything after becomes the first line of the field value. Note that the regex must match any surrounding whitespace in order for it to be trimmed from the key & value.
skip_leading_newlines (bool) – If True, blank lines at the beginning of the input will be discarded. If False, a blank line at the beginning of the input marks the end of an empty header section.
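The remark about whitespace in separator_regex is easiest to see side by side. The sketch below uses only the re module; the "=" delimiter, the split_on helper, and the sample line are assumptions chosen for illustration and say nothing about how the class is implemented internally.

```python
import re

line = "Key = value"

# Matches the delimiter together with any surrounding spaces/tabs.
padded = re.compile(r"[ \t]*=[ \t]*")
# Matches only the delimiter itself.
bare = re.compile(r"=")


def split_on(regex, text):
    """Hypothetical helper: split text at the first match of regex."""
    m = regex.search(text)
    return text[: m.start()], text[m.end():]


trimmed = split_on(padded, line)
untrimmed = split_on(bare, line)

print(trimmed)
print(untrimmed)
```

With the padded pattern the pair is clean; with the bare pattern the space before and after the delimiter stays attached to the key and value. A pattern of either kind would be supplied through the separator_regex argument.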
- get_unscanned() → str
Return all of the input that has not yet been processed. After calling this method, calling any method again on the same Scanner instance will raise ScannerEOFError.
- Raises:
ScannerEOFError – if all of the input has already been consumed
- scan() → Iterator[Tuple[str | None, str]]
Scan the remaining input for RFC 822-style header fields and return a generator of (name, value) pairs for each header field encountered, plus a (None, body) pair representing the body (if any) after the header section.
All lines after the first blank line are concatenated & yielded as-is in a (None, body) pair. (Note that body lines which do not end with a line terminator will not have one appended.) If there is no empty line in the input, then no body pair is yielded. If the empty line is the last line in the input, the body will be the empty string. If the empty line is the first line in the input and the skip_leading_newlines option is false (the default), then all other lines will be treated as part of the body and will not be scanned for header fields.
- Raises:
ScannerError – if the header section is malformed
ScannerEOFError – if all of the input has already been consumed
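The three body cases described for scan() can be modeled in a few lines. This is a simplified sketch under stated assumptions: split_body is a hypothetical helper, it works on a list of lines, it assumes skip_leading_newlines is false, and it does not parse the header lines themselves.

```python
def split_body(lines):
    """Hypothetical helper: return (header_lines, body), where body is
    None when the input contains no blank line."""
    for i, line in enumerate(lines):
        if line.rstrip("\r\n") == "":
            # Body lines are concatenated as-is; no line ending is
            # appended to a final line that lacks one.
            return lines[:i], "".join(lines[i + 1:])
    return lines, None


# No blank line: no body at all.
no_body = split_body(["Foo: bar\n"])
# Blank line is the last line: the body is the empty string.
empty_body = split_body(["Foo: bar\n", "\n"])
# Blank line is the first line: everything else is body.
all_body = split_body(["\n", "Foo: bar\n", "tail"])

print(no_body, empty_body, all_body, sep="\n")
```

In the last case the line that looks like a header field ends up in the body unparsed, matching the description above.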
- scan_next_stanza() → Iterator[tuple[str, str]]
Scan the remaining input for RFC 822-style header fields and return a generator of (name, value) pairs for each header field in the input. Input processing stops as soon as a blank line is encountered. (If skip_leading_newlines is true, the function only stops on a blank line after a non-blank line.)
- Raises:
ScannerError – if the header section is malformed
ScannerEOFError – if all of the input has already been consumed
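The stopping rule, including the effect of skip_leading_newlines, can be illustrated with a plain iterator. The next_stanza_lines function below is a hypothetical model that returns raw lines rather than parsed (name, value) pairs; it is not the method's actual code.

```python
def next_stanza_lines(lines, skip_leading_newlines=False):
    """Hypothetical model: consume lines from an iterator up through the
    first terminating blank line and return the non-blank lines seen."""
    stanza = []
    for line in lines:
        if line.rstrip("\r\n") == "":
            if skip_leading_newlines and not stanza:
                # Leading blank lines are discarded rather than
                # ending the stanza.
                continue
            break
        stanza.append(line)
    return stanza


source = ["\n", "A: 1\n", "\n", "B: 2\n"]

it = iter(source)
first = next_stanza_lines(it)
rest = list(it)

it2 = iter(source)
skipped = next_stanza_lines(it2, skip_leading_newlines=True)
rest2 = list(it2)

print(first, rest)
print(skipped, rest2)
```

Without skipping, the leading blank line ends an empty stanza and everything after it remains unread; with skipping, the first real stanza is returned and only the input after its terminating blank line remains.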
- scan_stanzas() → Iterator[list[tuple[str, str]]]
Scan the remaining input for zero or more stanzas of RFC 822-style header fields and return a generator of lists of (name, value) pairs, where each list represents a stanza of header fields in the input.
The stanzas are terminated by blank lines. Consecutive blank lines between stanzas are treated as a single blank line. Blank lines at the end of the input are discarded without creating a new stanza.
- Raises:
ScannerError – if the header section is malformed
ScannerEOFError – if all of the input has already been consumed
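The grouping rules for stanzas (runs of blank lines collapse, trailing blank lines add nothing) can be sketched as follows. group_stanzas is a hypothetical helper that groups raw lines instead of parsed pairs, and it deliberately ignores the leading-blank-line case, which is governed by skip_leading_newlines.

```python
def group_stanzas(lines):
    """Hypothetical model: group lines into stanzas separated by one or
    more blank lines."""
    stanzas, current = [], []
    for line in lines:
        if line.rstrip("\r\n") == "":
            # A blank line closes the current stanza; further blank
            # lines in the same run have nothing left to close.
            if current:
                stanzas.append(current)
                current = []
        else:
            current.append(line)
    if current:
        stanzas.append(current)
    return stanzas


stanzas = group_stanzas(
    ["A: 1\n", "B: 2\n", "\n", "\n", "C: 3\n", "\n", "\n"]
)
print(stanzas)
```

The two blank lines in the middle act as one separator, and the two at the end do not produce an empty third stanza.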
Functions
- headerparser.scan(data: str | Iterable[str], *, separator_regex: str | Pattern[str] | None = None, skip_leading_newlines: bool = False) → Iterator[Tuple[str | None, str]]
Added in version 0.4.0.
Scan a string, text-file-like object, or iterable of lines for RFC 822-style header fields and return a generator of (name, value) pairs for each header field in the input, plus a (None, body) pair representing the body (if any) after the header section.
If data is a string, it will be broken into lines on CR, LF, and CR LF boundaries.
All lines after the first blank line are concatenated & yielded as-is in a (None, body) pair. (Note that body lines which do not end with a line terminator will not have one appended.) If there is no empty line in data, then no body pair is yielded. If the empty line is the last line in data, the body will be the empty string. If the empty line is the first line in data and the skip_leading_newlines option is false (the default), then all other lines will be treated as part of the body and will not be scanned for header fields.
Changed in version 0.5.0: data can now be a string.
- Parameters:
data – a string, text-file-like object, or iterable of strings representing lines of input
kwargs – Passed to the Scanner constructor
- Return type:
generator of pairs of strings
- Raises:
ScannerError – if the header section is malformed
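Accepting either a string or an iterable of lines amounts to normalizing the input into a stream of lines first. The sketch below shows one way that could look with the standard library; iter_lines and its regex are assumptions for illustration and are not claimed to match how the function handles data internally.

```python
import re
from typing import Iterable, Iterator, Union

# A line is either text followed by CR LF, CR, or LF, or a final run of
# text with no line ending.
_LINE = re.compile(r"[^\r\n]*(?:\r\n|\r|\n)|[^\r\n]+")


def iter_lines(data: Union[str, Iterable[str]]) -> Iterator[str]:
    """Hypothetical helper: yield lines (endings kept) from a string or
    from an iterable that already produces lines."""
    if isinstance(data, str):
        yield from _LINE.findall(data)
    else:
        yield from data


from_string = list(iter_lines("a: 1\r\nb: 2\rc: 3"))
from_list = list(iter_lines(["a: 1\n", "b: 2\n"]))

print(from_string)
print(from_list)
```

A string is split on all three ending styles, while an iterable is passed through unchanged, so open text files and lists of lines are handled the same way.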
- headerparser.scan_stanzas(data: str | Iterable[str], *, separator_regex: str | Pattern[str] | None = None, skip_leading_newlines: bool = False) → Iterator[list[tuple[str, str]]]
Added in version 0.4.0.
Scan a string, text-file-like object, or iterable of lines for zero or more stanzas of RFC 822-style header fields and return a generator of lists of (name, value) pairs, where each list represents a stanza of header fields in the input.
If data is a string, it will be broken into lines on CR, LF, and CR LF boundaries.
The stanzas are terminated by blank lines. Consecutive blank lines between stanzas are treated as a single blank line. Blank lines at the end of the input are discarded without creating a new stanza.
Changed in version 0.5.0: data can now be a string.
- Parameters:
data – a string, text-file-like object, or iterable of strings representing lines of input
kwargs – Passed to the Scanner constructor
- Return type:
generator of lists of pairs of strings
- Raises:
ScannerError – if the header section is malformed
Deprecated Functions
- headerparser.scan_string(s: str, *, separator_regex: str | Pattern[str] | None = None, skip_leading_newlines: bool = False) → Iterator[Tuple[str | None, str]]
Scan a string for RFC 822-style header fields and return a generator of (name, value) pairs for each header field in the input, plus a (None, body) pair representing the body (if any) after the header section.
See scan() for more information on the exact behavior of the scanner.
Deprecated since version 0.5.0: Use scan() instead.
- Parameters:
- Return type:
generator of pairs of strings
- Raises:
ScannerError – if the header section is malformed
- headerparser.scan_stanzas_string(s: str, *, separator_regex: str | Pattern[str] | None = None, skip_leading_newlines: bool = False) → Iterator[list[tuple[str, str]]]
Added in version 0.4.0.
Scan a string for zero or more stanzas of RFC 822-style header fields and return a generator of lists of (name, value) pairs, where each list represents a stanza of header fields in the input.
The stanzas are terminated by blank lines. Consecutive blank lines between stanzas are treated as a single blank line. Blank lines at the end of the input are discarded without creating a new stanza.
Deprecated since version 0.5.0: Use scan_stanzas() instead.
- Parameters:
s – a string which will be broken into lines on CR, LF, and CR LF boundaries and passed to scan_stanzas()
kwargs – Passed to the Scanner constructor
- Return type:
generator of lists of pairs of strings
- Raises:
ScannerError – if the header section is malformed
- headerparser.scan_next_stanza(iterator: Iterator[str], *, separator_regex: str | Pattern[str] | None = None, skip_leading_newlines: bool = False) → Iterator[tuple[str, str]]
Added in version 0.4.0.
Scan a text-file-like object or iterator of lines for RFC 822-style header fields and return a generator of (name, value) pairs for each header field in the input. Input processing stops as soon as a blank line is encountered, leaving the rest of the iterator unconsumed. (If skip_leading_newlines is true, the function only stops on a blank line after a non-blank line.)
Deprecated since version 0.5.0: Use Scanner.scan_next_stanza() instead.
- Parameters:
iterator – a text-file-like object or iterator of strings representing lines of input
kwargs – Passed to the Scanner constructor
- Return type:
generator of pairs of strings
- Raises:
ScannerError – if the header section is malformed
- headerparser.scan_next_stanza_string(s: str, *, separator_regex: str | Pattern[str] | None = None, skip_leading_newlines: bool = False) → tuple[list[tuple[str, str]], str]
Added in version 0.4.0.
Scan a string for RFC 822-style header fields and return a pair (fields, extra), where fields is a list of (name, value) pairs for each header field in the input up to the first blank line and extra is everything after the first blank line. (If skip_leading_newlines is true, the dividing point is instead the first blank line after a non-blank line.) If there is no appropriate blank line in the input, extra is the empty string.
Deprecated since version 0.5.0: Use Scanner.scan_next_stanza() instead.
- Parameters:
s – a string to scan
kwargs – Passed to the Scanner constructor
- Return type:
pair of a list of pairs of strings and a string
- Raises:
ScannerError – if the header section is malformed