Transformers documentation
Response Parsing
Response Parsing
It is increasingly common for chat models to generate structured outputs, rather than just a single reply string. For example, a reasoning model might emit a chain of thought containing its reasoning trace, while a tool calling model might emit function names and arguments.
The problem with structured outputs, though, is that LLMs outputs are not inherently structured. LLM APIs usually
accept and return message dicts, with keys like role and content and thinking, but internally, LLMs actually
just continue a single sequence of tokens. We use a glue layer to connect the user-facing API to the actual token
stream of the model. To turn inputs into a token stream, we use chat_templates, which are covered in other
documents. This document is about the other half of that glue layer: Response templates, the system for turning the
generated tokens output by the model back into a structured response dict.
In many ways, response templates perform the inverse operation to chat templates. With chat templates, you feed in a list of messages, and you get tokens ready to input to the model. With response templates, you feed in the raw model output tokens, and you get a structured message. Like chat templates, response templates allow users to ignore the messy details of what specific formats and control tokens a model expects, and use a universal API of message dicts that works with any model.
The best way to understand response templates is to see them in action. The main entry point is the parse_response() method, which accepts either a single sequence or a batch:
from transformers import AutoModelForCausalLM, AutoTokenizer
checkpoint = "HuggingFaceTB/SmolLM3-3B"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, dtype="auto", device_map="auto")
messages = [{"role": "user", "content": "Summarize the end of the Cold War, very briefly."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")["input_ids"].to(model.device)
outputs = model.generate(input_ids, max_new_tokens=1024)[0, input_ids.shape[1]:]
out_text = tokenizer.decode(outputs)
print(tokenizer.parse_response(out_text, prefix=input_ids[0]))
# Outputs a structured dict: {"role": "assistant", "thinking": "...", "content": "..."}When a tokenizer has a response_template, the parse_response method will cleanly turn an output message into a
structured dict, ready to append to the chat. Note that we need to pass the prefix (the prompt tokens) to this method as well. This is because many chat templates start
messages or open thinking blocks before letting the model begin its response, and so our parser needs to see the
prompt to understand the message. All of the prefix before the final turn is discarded; we only parse one message
at a time. We just need the prefix to ensure we’re seeing the entire final message, and not miss any prefilled
fields!
If the tokenizer has no response template set, parse_response will raise an error. We’re working on adding
templates to more models as quickly as we can!
Streaming response parsing
In the above example, we parse the model response all at once after generation has finished. Often, though, we may want to parse partial messages as they are generated, especially in user-facing apps where we don’t just want to display a static page for a minute or two until the model is finished.
When you want streaming parsing, call tokenizer.get_response_parser(), which returns a ResponseParser.
As with parse_response, pass the chat prompt as prefix= so the parser knows about any parts of the message that
were prefilled by the chat template. The returned object is a
stateful parser that you can feed text into as the model generates it:
parser = tokenizer.get_response_parser(prefix=input_ids[0])
for event in parser.initial_events:
render(event) # Display the partial message to the user however you want to
for chunk in model_output:
for event in parser.feed(chunk):
render(event)
message, final_events = parser.finalize()
for event in final_events:
render(event)The parser will emit events as text from the generation process is fed in. This indicates which region is currently being generated. When
the region is complete, it will be emitted in a separate event with the fully parsed content. At the end of generation,
the finalize() method flushes any remaining text and emits any final events, as well as the complete message dict.
Note that although parse_response can accept batches, streamed parsing is always single-sequence: each ResponseParser tracks the state of one generation.
If you want to stream multiple generations at once, create one ResponseParser per sequence.
Streaming events
Each streamed parsing event is a dict with a type key. There are three kinds:
| Type | Description | Contents |
|---|---|---|
region_open | Indicates that the model has started a new region, such as content or thinking. | field (str): the field name. |
region_chunk | A chunk of text for the current region. | field (str): the field name. text (str): the new chunk. dirty (bool): True if the chunk is raw text that needs parsing. |
region_close | Indicates that a region has finished, and that key is now finalized. | field (str): the field name. value (any): the fully parsed value for the region |
region_chunk events are emitted for every region as bytes arrive, so a streaming UI can render progress
even for structured regions. For text-like regions (text, int, float, bool) chunks are flagged
dirty=False: each chunk is already part of the final value (modulo trailing whitespace stripped at
close). For structured regions, like JSON-format tool calls, chunks are flagged dirty=True. This means
text is the raw, un-parsed body; it’s safe to display incrementally, but the parsed value (a dict,
list, etc.) only arrives in the matching region_close event. Either way, the finalized value of a
region is always carried by region_close, so consumers that don’t care about intermediate rendering
can simply ignore region_chunk events.
If the chat prefix wrote anything into the message (e.g. the template opened a thinking block, or an
assistant prefill started a response before handing off to the model), the parser exposes those events as
parser.initial_events, a list you can replay into your renderer before feeding any model output. Regions
that were opened and closed inside the prefix produce a full region_open / region_chunk / region_close
sequence and their parsed value lands in the output dict, exactly as if the model itself had written them.
A typical event stream might look like this:
{"type": "region_open", "field": "thinking"}
{"type": "region_chunk", "field": "thinking", "text": "I should ", "dirty": False}
{"type": "region_chunk", "field": "thinking", "text": "greet the user", "dirty": False}
{"type": "region_close", "field": "thinking", "value": "I should greet the user"}
{"type": "region_open", "field": "tool_calls"}
{"type": "region_chunk", "field": "tool_calls", "text": '{"name": "greet_user", ', "dirty": True}
{"type": "region_chunk", "field": "tool_calls", "text": '"arguments": {"greeting": "Hi!"}}', "dirty": True}
{"type": "region_close", "field": "tool_calls", "value": {"type": "function", "function": {"name": "greet_user", "arguments": {"greeting": "Hi!"}}}}Note how thinking is emitted with dirty=False, because fields like thinking and content are usually just raw
text. This means you can treat the chunks as valid “partial output”. However, tool_calls is flagged as dirty because
the raw text needs significant cleanup - tool calls often need to be parsed as JSON or another format and then
restructured to generate the final tool call dict. As a result, the final output for these regions often looks very,
very different from the raw text. This final parsing will only happen when region_close is reached. It’s
up to you what you want to do with the dirty chunks until then - you can display them as-is to show the user the
“raw” output, or you can simply wait until you have something clean to display.
This concludes most of what you need to know to use response templates. The rest of this document is focused on the internals of the parsing system and how to write response templates. This is mostly relevant for developers and model authors. Most people can safely stop here!
Advanced: Writing a response template
The best way to understand how to write a response template is to pick a concrete example. Here’s what a raw
reply from SmolLM might look like:
<think>
I should greet the user
</think>
<tool_call>{"name": "greet_user", "arguments": {"greeting": "Hi!"}}</tool_call>When we parse this output in the standard message dict format, it should look like this:
{
"role": "assistant",
"thinking": "I should greet the user",
"tool_calls": [
{"type": "function", "function": {"name": "greet_user", "arguments": {"greeting": "Hi!"}}}
]
}And here’s the template that parses it. Don’t be intimidated - a lot of it is fairly self-explanatory!
{
"defaults": {"role": "assistant"},
"start_anchor": "<|im_start|>assistant\n",
"fields": {
"thinking": {"open": "<think>", "close": "</think>", "content": "text"},
"tool_calls": {
"open": "<tool_call>",
"close": "</tool_call>",
"repeats": True,
"content": "json",
"transform": {"type": "function", "function": "{content}"},
},
"content": {
"close": "<|im_end|>",
"content": "text",
},
},
}Essentially, the template defines fields and delimiters. Each field corresponds to a key in the
output dict. Fields also include information for parsing the text inside their delimiters. There’s one subtlety: The
content field has no open, because in SmolLM (and several other models), it’s not marked by a special token. Instead,
content is stored in the space after the other regions, but before the end of the sequence. In our template, we
represent this as an implicit / leftover field that picks up any text not claimed by another region.
In addition to fields, the template supports two top-level keys:
defaults(optional) — A dict of values pre-populated in the output (e.g.{"role": "assistant"}). Keys here are always retained in the parsed output, even if no field wrote to them; other keys are dropped when their field captured nothing.start_anchor(str) /start_anchor_pattern(str regex) — Marks where the current assistant message begins inside a chat prompt. When you passprefix=toparse_responseorget_response_parser, the parser right-truncates the prefix past the last occurrence of this anchor before processing it, so earlier turns in a multi-turn conversation don’t pollute the current message’s state. The anchor is applied only to theprefix, never to the response/generation you parse — some formats legitimately re-emit it mid-message (gpt-oss harmony output opens every channel with<|start|>assistant), so stripping the response past the anchor would drop the model’s own reasoning and tool calls. This is why the generation alone is never enough to guard against history bleed: pass the prompt asprefix=. For ChatML-style models the anchor is typically"<|im_start|>assistant\n". Exactly one ofstart_anchororstart_anchor_patternmust be set.
For example, given this multi-turn prefix (note the two assistant turns):
<|im_start|>user Hi<|im_end|> <|im_start|>assistant Hello!<|im_end|> <|im_start|>user Again?<|im_end|> <|im_start|>assistant
the parser truncates everything up to the last <|im_start|>assistant\n, discarding the earlier
"Hello!" turn. The rule is that everything but the final assistant turn is always dropped.
As with chat templates, response templates are stored as tokenizer attributes and saved with the tokenizer. Unlike
chat templates, we save them inside tokenizer_config.json and not as a separate file, because their format fits
naturally in JSON, unlike a chat template Jinja script.
tokenizer.response_template = template
tokenizer.save_pretrained(...) # Written as a key in tokenizer_config.jsonAdvanced: Field API Reference
Each field supports several keys. We can divide these into two types. First, there are the keys that define how the field should be captured:
| Key | Type | Purpose |
|---|---|---|
open | str or list[str] | Literal string that opens this region. A list of strings means “match any of these”. |
open_pattern | str (regex) | Regex alternative to open. Named groups become capture variables available to transform. |
close | str or list[str] | Literal string (or list of strings) that closes this region. Omit to run to end-of-stream. |
close_pattern | str (regex) | Regex alternative to close. Named groups become capture variables available to transform. |
repeats | bool | If true, the field is a list and each match appends. Default false. |
optional | bool | If false and the region never matches, we raise an error. Default true. |
A field should have either open or open_pattern, but not both, and the same is true for close and close_pattern.
A field may omit close/close_pattern entirely, in which case the region stays open until the end of the
generated text. This is useful for a final field that runs to the end of the message.
A field with neither open nor open_pattern is the implicit field: it’s active whenever no explicit
region is open, so it captures leftover text. At most one field can be implicit. This is most often used when content
does not have special token tags, it’s just written as plaintext after the other fields.
In addition to opening and closing delimiters, you can also specify repeats, which indicates that the field is a list
and the delimiters can match multiple times. This is most common for parallel tool calling, when a model emits
multiple tool calls simultaneously:
'<tool_call>{"name": "a", ...}</tool_call><tool_call>{"name": "b", ...}</tool_call>'
# Returns `"tool_calls": [{... "a" ...}, {... "b" ...}]` in a template with repeats: trueFinally, you can specify optional: false for fields that must be present. If such a field is missing,
we raise an error instead of just returning a message dict without it.
The end of generation will close and finalize any open regions, even if their closing delimiter was not seen.
Parsing the content of a field
Once we define how to capture a field, we also need to specify how to parse the raw text inside that capture. There are four keys that control this:
| Key | Type | Purpose |
|---|---|---|
content | str | The content type inside this region. Defaults to "text". Each type has its own parser. |
content_args | dict | Arguments to be passed to the content parser for this region. |
transform | dict/list | Optional post-parse template that reshapes the parsed body (see Transform). |
transform_each | bool | If true, the parsed content must be a list and transform is applied per-element. |
The first (and most important) key is content. This indicates the content type
of the field, which determines the parser that will be used to convert the raw text captured in the field to the final output.
content_args are used to configure the parser, and allow us to support various format quirks without needing custom code.
We’ll take a look at each type of parser and its arguments in turn.
Basic types
text, int, float and bool are the basic types. These content types all just strip whitespace and then do a simple
type conversion if required. They do not have any content_args, except for text which supports the arg strip,
which strips whitespace from the start and end of the captured text, and defaults to true.
field = {"count": {"open": "<n>", "close": "</n>", "content": "int"}}
input = "<n> 42 </n>"
# Returns: {"count": 42}json
The json parser parses the captured text as JSON. It’s the workhorse for tool-call arguments and
anything else with nested structure. It accepts a handful of optional content_args to handle the
various ways models mangle JSON in the wild:
unquoted_keys(bool, defaultfalse): Enable when key names are raw rather than quoted (e.g.{name: "foo"}). Useful for models that emit Javascript-style object literals rather than strict JSON.string_delims(list of[open, close]pairs, optional): for models that wrap string values in custom delimiters instead of"...". Each pair gives an opening and closing marker.allow_non_json(bool, defaultfalse): if parsing fails, return the stripped raw text instead of raising. Useful as a fallback for fields where the model usually emits JSON but occasionally drops to plain text.
unquoted_keys and string_delims both exist to handle models that emit non-standard,
almost-but-not-quite-JSON output, so you should only need them for a handful of models.
field = {"args": {"open": "<args>", "close": "</args>", "content": "json", "content_args": {"unquoted_keys": True}}}
input = '<args>{city: "London"}</args>'
# Returns: {"args": {"city": "London"}}xml-inline
The xml-inline parser is for regions made up of a flat sequence of XML-ish tags, where each tag
becomes one entry in a dict. It’s most often used inside a tool_calls field for models that emit
each argument as its own tag rather than as a JSON blob:
tag_pattern(str, required): regex matching a single tag. Must contain named groupskey(the resulting dict key) andvalue(the raw text that becomes the dict value).value_parser(dict, optional): nested content parser applied to each capturedvalue. A dict withname(the parser, e.g."json","int") and optionalargs(itscontent_args). If omitted, values stay as raw strings.merge_duplicates(bool, defaultfalse): when the same key appears multiple times, collect the values into a list instead of letting later matches overwrite earlier ones.
For example, Qwen3 emits each tool-call argument as its own <parameter> tag, and we parse it
like this:
"tool_calls": {
"open_pattern": r"<tool_call>\s*<function=(?P<name>\w+)>",
"close": "</tool_call>",
"repeats": True,
"content": "xml-inline",
"content_args": {
"tag_pattern": r"<parameter=(?P<key>\w+)>\s*(?P<value>.*?)\s*</parameter>",
"value_parser": {"name": "json", "args": {"allow_non_json": True}},
},
"transform": {"type": "function", "function": {"name": "{name}", "arguments": "{content}"}},
}Note the nested value_parser: each parameter value is itself run through the json parser (with
allow_non_json so plain strings still pass through). Feeding the tool_calls field above this input:
input = "<tool_call><function=get_weather><parameter=city>London</parameter><parameter=units>celsius</parameter></function></tool_call>"
# Returns: {"tool_calls": [{"type": "function", "function": {"name": "get_weather", "arguments": {"city": "London", "units": "celsius"}}}]}kv-lines
The kv-lines parser handles line-delimited key: value pairs (think YAML-ish metadata or .env
files). Each line becomes one entry in the resulting dict. All arguments are optional:
line_sep(str, default"\n"): separator between pairs.kv_sep(str, default":"): separator between a key and its value inside a single line. Only the first occurrence is used as the split point, so values may themselves contain the separator.strip(bool, defaulttrue): strip surrounding whitespace from each key and value.value_parser(dict, optional): nested content parser applied to each value, in the same{"name": ..., "args": ...}format as forxml-inline. If omitted, values stay as raw strings.
Lines that are empty or do not contain kv_sep are silently skipped, so stray blank lines in the
captured region are tolerated.
field = {"metadata": {"open": "<meta>", "close": "</meta>", "content": "kv-lines"}}
input = "<meta>name: alice\nage: 30</meta>"
# Returns: {"metadata": {"name": "alice", "age": "30"}}Note age keeps "30" as a string; add a value_parser of {"name": "int"} to parse it to 30.
Transform
For most fields, the transform key is unnecessary. It’s used when the parsed body needs to be reshaped into the final
output, or when information from the delimiters has to be merged into the result. It most commonly appears in
tool_calls fields, as these often have complex structure.
transform is a template: a dict (or list) that describes the output shape, where any string of the
form "{name}" is replaced with the corresponding value. Values can be accessed from content (the parsed
body of this region) and any named groups captured by open_pattern / close_pattern. A very common use-case is to wrap a tool
call dict in an outer dict with a function key, as these are part of our standard tool call format:
"tool_calls": {
"open": "<tool_call>",
"close": "</tool_call>",
"repeats": True,
"content": "json",
"transform": {"type": "function", "function": "{content}"},
},So this raw output:
<tool_call>{"name": "greet_user", "arguments": {"greeting": "Hi!"}}</tool_call>becomes (note repeats: True makes tool_calls a list):
[{"type": "function", "function": {"name": "greet_user", "arguments": {"greeting": "Hi!"}}}]A whole-string placeholder like "{content}" returns the looked-up value with its type preserved — so above, the
parsed JSON dict slots in directly as the value of function. A placeholder must be the entire string: mixing
text and placeholders ("abc {name} def") is not permitted. They’re not f-strings!
transform is quite versatile, which becomes necessary when the model output has a wildly different format
to our standard API. GPT-OSS is a good example - it embeds the function name in the channel header rather than in
the JSON body, so we have to capture it with a named group in open_pattern and merge it with content inside the
transform. All named groups in open_pattern and close_pattern become available as variables alongside content:
"tool_calls": {
"open_pattern": r"<\|channel\|>commentary to=functions\.(?P<name>\w+).*?<\|message\|>",
"close": "<|call|>",
"repeats": True,
"content": "json",
"transform": {"type": "function", "function": {"name": "{name}", "arguments": "{content}"}},
},The function name lives in the channel header, not the JSON body, so this:
<|channel|>commentary to=functions.get_current_weather <|constrain|>json<|message|>{"location": "San Francisco, CA"}<|call|>becomes:
[{"type": "function", "function": {"name": "get_current_weather", "arguments": {"location": "San Francisco, CA"}}}]Sometimes a field’s parsed content is itself a list of records and you want to reshape each one. The Cohere
template is a good example: It emits all tool calls inside a single JSON array, so we set transform_each: True to
apply the transform per element. Each array element’s keys are unpacked into the template scope, so
"{tool_name}" looks up tool_name in the current element:
"tool_calls": {
"open": "<|START_ACTION|>",
"close": "<|END_ACTION|>",
"content": "json",
"transform_each": True,
"transform": {"type": "function", "function": {"name": "{tool_name}", "arguments": "{parameters}"}},
},This will convert an output like this:
[
{"tool_name": "greet_user", "parameters": {"greeting": "Hi!"}},
{"tool_name": "search", "parameters": {"query": "weather tomorrow"}}
]Into an output like this, which fits our standard API:
[
{"type": "function", "function": {"name": "greet_user", "arguments": {"greeting": "Hi!"}}},
{"type": "function", "function": {"name": "search", "arguments": {"query": "weather tomorrow"}}}
]The transform_each flag is only needed when content is already a list; for the more common case where each
match contributes one element (and repeats: True accumulates them), then the transform will apply to each element
by default.
Framework developers: Regex portability
open_pattern, close_pattern, and start_anchor_pattern are regex strings. For most users, and even for most
model authors, this shouldn’t be a problem, but if you are a developer writing an implementation of response parsing
in another language, you should be aware of our implementation details. This section is dedicated to everyone
who had to implement an entire Jinja parser to get non-Python chat templating to work - we hope that if you
follow the simple guidelines below, then response templates should be much less painful:
- We use Python’s
regexmodule for all regexes used in chat parsing. Since all Python3 strings are unicode, all of our regex matches are unicode-aware. This particularly affects common characters like\w. Make sure you set the relevant unicode flags in your engine. - We compile all regexes with
re.DOTALLenabled andre.MULTILINEdisabled, so.matches\nbut^and$only match the start/end of the whole input, not line breaks. - We use
(?P<name>...)syntax for named groups. Other regex implementations have very different named capture group syntax, so you may need to search for this pattern in regexes and rewrite it to match your local implementation. - We use partial regex matching to decide when we can emit data. This is necessary because the end of a region may consist of multiple tokens,
and we don’t want to emit that end-of-region delimiter, so we hold tokens back until we’re sure that they’re inside the region and not part of the boundary.
This means that whenever a region end regex has a partial match, we hold back data until the regex either matches or fails. If your regex engine
doesn’t support partial matching, you can still implement response templates, but you may need to find another solution to this issue. One simple
approach is just to hold back these regions / emit them as
dirtyuntil you definitively see the ending delimiter. - Your regex engine may (rarely) not support lookarounds like
(?!...). Although these aren’t commonly used in response templates, they can appear and we do support them! You might need to either throw an error in those cases, or manually extract the lookarounds and enforce them in your code when the regex engine finds a possible match. - Other advanced features like backreferences, atomic groups, possessive quantifiers, recursion and so on are generally not used in response templates. We’ll try to dissuade model authors from using them, so you can hopefully safely ignore them.