Transformers documentation

Response Parsing

You are viewing the main version, which requires installation from source. If you'd like a regular pip install, check out the latest stable version (v5.12.0).

It is increasingly common for chat models to generate structured outputs, rather than just a single reply string. For example, a reasoning model might emit a chain of thought containing its reasoning trace, while a tool calling model might emit function names and arguments.

The problem with structured outputs, though, is that LLM outputs are not inherently structured. LLM APIs usually accept and return message dicts, with keys like role, content and thinking, but internally, LLMs just continue a single sequence of tokens. We use a glue layer to connect the user-facing API to the actual token stream of the model. To turn inputs into a token stream, we use chat templates, which are covered in other documents. This document is about the other half of that glue layer: response templates, the system for turning the tokens generated by the model back into a structured response dict.

In many ways, response templates perform the inverse operation to chat templates. With chat templates, you feed in a list of messages, and you get tokens ready to input to the model. With response templates, you feed in the raw model output tokens, and you get a structured message. Like chat templates, response templates allow users to ignore the messy details of what specific formats and control tokens a model expects, and use a universal API of message dicts that works with any model.

The best way to understand response templates is to see them in action. The main entry point is the parse_response() method, which accepts either a single sequence or a batch:

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM3-3B"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Summarize the end of the Cold War, very briefly."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")["input_ids"].to(model.device)
outputs = model.generate(input_ids, max_new_tokens=1024)[0, input_ids.shape[1]:]
out_text = tokenizer.decode(outputs)
print(tokenizer.parse_response(out_text, prefix=input_ids[0]))
# Outputs a structured dict: {"role": "assistant", "thinking": "...", "content": "..."}

When a tokenizer has a response_template, the parse_response method will cleanly turn an output message into a structured dict, ready to append to the chat. Note that we need to pass the prefix (the prompt tokens) to this method as well. This is because many chat templates start messages or open thinking blocks before letting the model begin its response, and so our parser needs to see the prompt to understand the message. All of the prefix before the final turn is discarded; we only parse one message at a time. We just need the prefix to ensure we’re seeing the entire final message and don’t miss any prefilled fields!

If the tokenizer has no response template set, parse_response will raise an error. We’re working on adding templates to more models as quickly as we can!

Streaming response parsing

In the above example, we parse the model response all at once after generation has finished. Often, though, we may want to parse partial messages as they are generated, especially in user-facing apps where we don’t just want to display a static page for a minute or two until the model is finished.

When you want streaming parsing, call tokenizer.get_response_parser(), which returns a ResponseParser. As with parse_response, pass the chat prompt as prefix= so the parser knows about any parts of the message that were prefilled by the chat template. The returned object is a stateful parser that you can feed text into as the model generates it:

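parser = tokenizer.get_response_parser(prefix=input_ids[0])
# Replay anything the chat template prefilled into the message
for event in parser.initial_events:
    render(event)  # render() is a placeholder: display the partial message however you want to
# model_output is a placeholder for an iterable of text chunks produced during generation
for chunk in model_output:
    for event in parser.feed(chunk):
        render(event)
# Flush any remaining text and collect the complete message dict
message, final_events = parser.finalize()
for event in final_events:
    render(event)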

The parser emits events as text from the generation process is fed in. These events indicate which region is currently being generated. When a region is complete, it is emitted in a separate event with the fully parsed content. At the end of generation, the finalize() method flushes any remaining text and returns any final events, as well as the complete message dict.

Note that although parse_response can accept batches, streamed parsing is always single-sequence: each ResponseParser tracks the state of one generation. If you want to stream multiple generations at once, create one ResponseParser per sequence.

Streaming events

Each streamed parsing event is a dict with a type key. There are three kinds:

| Type | Description | Contents |
| --- | --- | --- |
| region_open | Indicates that the model has started a new region, such as content or thinking. | field (str): the field name. |
| region_chunk | A chunk of text for the current region. | field (str): the field name. text (str): the new chunk. dirty (bool): True if the chunk is raw text that needs parsing. |
| region_close | Indicates that a region has finished, and that key is now finalized. | field (str): the field name. value (any): the fully parsed value for the region. |

region_chunk events are emitted for every region as bytes arrive, so a streaming UI can render progress even for structured regions. For text-like regions (text, int, float, bool) chunks are flagged dirty=False: each chunk is already part of the final value (modulo trailing whitespace stripped at close). For structured regions, like JSON-format tool calls, chunks are flagged dirty=True. This means text is the raw, un-parsed body; it’s safe to display incrementally, but the parsed value (a dict, list, etc.) only arrives in the matching region_close event. Either way, the finalized value of a region is always carried by region_close, so consumers that don’t care about intermediate rendering can simply ignore region_chunk events.

If the chat prefix wrote anything into the message (e.g. the template opened a thinking block, or an assistant prefill started a response before handing off to the model), the parser exposes those events as parser.initial_events, a list you can replay into your renderer before feeding any model output. Regions that were opened and closed inside the prefix produce a full region_open / region_chunk / region_close sequence and their parsed value lands in the output dict, exactly as if the model itself had written them.

A typical event stream might look like this:

{"type": "region_open",  "field": "thinking"}
{"type": "region_chunk", "field": "thinking", "text": "I should ", "dirty": False}
{"type": "region_chunk", "field": "thinking", "text": "greet the user", "dirty": False}
{"type": "region_close", "field": "thinking", "value": "I should greet the user"}
{"type": "region_open",  "field": "tool_calls"}
{"type": "region_chunk", "field": "tool_calls", "text": '{"name": "greet_user", ', "dirty": True}
{"type": "region_chunk", "field": "tool_calls", "text": '"arguments": {"greeting": "Hi!"}}', "dirty": True}
{"type": "region_close", "field": "tool_calls", "value": {"type": "function", "function": {"name": "greet_user", "arguments": {"greeting": "Hi!"}}}}

Note how thinking is emitted with dirty=False, because fields like thinking and content are usually just raw text. This means you can treat the chunks as valid “partial output”. However, tool_calls is flagged as dirty because the raw text needs significant cleanup - tool calls often need to be parsed as JSON or another format and then restructured to generate the final tool call dict. As a result, the final output for these regions often looks very, very different from the raw text. This final parsing will only happen when region_close is reached. It’s up to you what you want to do with the dirty chunks until then - you can display them as-is to show the user the “raw” output, or you can simply wait until you have something clean to display.
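
As a rough illustration of the handling described above, here is a minimal consumer sketch. It relies only on the event dict shapes documented in this section; the EventRenderer class and its attribute names are hypothetical and not part of Transformers. It shows clean chunks as they arrive and hides dirty chunks until the matching region_close.

```python
class EventRenderer:
    """Hypothetical consumer: shows clean chunks live, waits for region_close on dirty ones."""

    def __init__(self):
        self.display = {}  # field -> what the user currently sees
        self.message = {}  # field -> finalized value taken from region_close

    def handle(self, event):
        field = event["field"]
        if event["type"] == "region_open":
            self.display[field] = ""
        elif event["type"] == "region_chunk":
            # Clean chunks are already part of the final value, so append them directly.
            # Dirty chunks are raw text; this sketch chooses to hide them.
            if not event["dirty"]:
                self.display[field] = self.display.get(field, "") + event["text"]
        elif event["type"] == "region_close":
            # The finalized value arrives here for clean and dirty regions alike.
            self.message[field] = event["value"]
            self.display[field] = event["value"]


events = [
    {"type": "region_open", "field": "thinking"},
    {"type": "region_chunk", "field": "thinking", "text": "I should ", "dirty": False},
    {"type": "region_chunk", "field": "thinking", "text": "greet the user", "dirty": False},
    {"type": "region_close", "field": "thinking", "value": "I should greet the user"},
    {"type": "region_open", "field": "tool_calls"},
    {"type": "region_chunk", "field": "tool_calls", "text": '{"name": "greet_user", ', "dirty": True},
    {"type": "region_chunk", "field": "tool_calls", "text": '"arguments": {"greeting": "Hi!"}}', "dirty": True},
    {"type": "region_close", "field": "tool_calls",
     "value": {"type": "function", "function": {"name": "greet_user", "arguments": {"greeting": "Hi!"}}}},
]

renderer = EventRenderer()
for event in events:
    renderer.handle(event)
print(renderer.message["thinking"])
```

This sketch keeps only the most recent region_close value per field, so for a repeating field it would overwrite earlier entries. The complete message dict returned by finalize() remains the authoritative result.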

This concludes most of what you need to know to use response templates. The rest of this document is focused on the internals of the parsing system and how to write response templates. This is mostly relevant for developers and model authors. Most people can safely stop here!

Advanced: Writing a response template

The best way to understand how to write a response template is to pick a concrete example. Here’s what a raw reply from SmolLM might look like:

<think>
I should greet the user
</think>

<tool_call>{"name": "greet_user", "arguments": {"greeting": "Hi!"}}</tool_call>

When we parse this output in the standard message dict format, it should look like this:

{
    "role": "assistant",
    "thinking": "I should greet the user",
    "tool_calls": [
        {"type": "function", "function": {"name": "greet_user", "arguments": {"greeting": "Hi!"}}}
    ]
}

And here’s the template that parses it. Don’t be intimidated - a lot of it is fairly self-explanatory!

{
    "defaults": {"role": "assistant"},
    "start_anchor": "<|im_start|>assistant\n",
    "fields": {
        "thinking": {"open": "<think>", "close": "</think>", "content": "text"},
        "tool_calls": {
            "open": "<tool_call>",
            "close": "</tool_call>",
            "repeats": True,
            "content": "json",
            "transform": {"type": "function", "function": "{content}"},
        },
        "content": {
            "close": "<|im_end|>",
            "content": "text",
        },
    },
}

Essentially, the template defines fields and delimiters. Each field corresponds to a key in the output dict. Fields also include information for parsing the text inside their delimiters. There’s one subtlety: The content field has no open, because in SmolLM (and several other models), it’s not marked by a special token. Instead, content is stored in the space after the other regions, but before the end of the sequence. In our template, we represent this as an implicit / leftover field that picks up any text not claimed by another region.

In addition to fields, the template supports two top-level keys:

  • defaults (optional): A dict of values pre-populated in the output (e.g. {"role": "assistant"}). Keys here are always retained in the parsed output, even if no field wrote to them; other keys are dropped when their field captured nothing.
  • start_anchor (str) / start_anchor_pattern (str regex): Marks where the current assistant message begins inside a chat prompt. When you pass prefix= to parse_response or get_response_parser, the parser discards everything in the prefix up to and including the last occurrence of this anchor before processing it, so earlier turns in a multi-turn conversation don’t pollute the current message’s state. The anchor is applied only to the prefix, never to the response/generation you parse. Some formats legitimately re-emit it mid-message (gpt-oss harmony output opens every channel with <|start|>assistant), so stripping the response past the anchor would drop the model’s own reasoning and tool calls. This is why the generation alone is never enough to guard against history bleed: pass the prompt as prefix=. For ChatML-style models the anchor is typically "<|im_start|>assistant\n". Exactly one of start_anchor or start_anchor_pattern must be set.

For example, given this multi-turn prefix (note the two assistant turns):

<|im_start|>user
Hi<|im_end|>
<|im_start|>assistant
Hello!<|im_end|>
<|im_start|>user
Again?<|im_end|>
<|im_start|>assistant

the parser discards everything up to and including the last <|im_start|>assistant\n, which drops the earlier "Hello!" turn. The rule is that everything but the final assistant turn is always dropped.
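
The anchor rule can be sketched in a few lines of plain Python. This is only an illustration of the behaviour described above for a literal start_anchor, not the library's implementation; strip_history is a hypothetical name, and returning the prefix unchanged when no anchor is found is an assumption.

```python
def strip_history(prefix: str, start_anchor: str) -> str:
    """Keep only the text that follows the last occurrence of start_anchor."""
    index = prefix.rfind(start_anchor)
    if index == -1:
        # Assumption for this sketch: leave the prefix untouched if the anchor is absent.
        return prefix
    return prefix[index + len(start_anchor):]


prefix = (
    "<|im_start|>user\nHi<|im_end|>\n"
    "<|im_start|>assistant\nHello!<|im_end|>\n"
    "<|im_start|>user\nAgain?<|im_end|>\n"
    "<|im_start|>assistant\n"
)
anchor = "<|im_start|>assistant\n"

# Nothing was prefilled, so the current message starts out empty.
remaining = strip_history(prefix, anchor)

# If the template had opened a thinking block, it would survive the truncation.
prefilled = strip_history(prefix + "<think>\n", anchor)
print(repr(remaining), repr(prefilled))
```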

As with chat templates, response templates are stored as tokenizer attributes and saved with the tokenizer. Unlike chat templates, we save them inside tokenizer_config.json and not as a separate file, because their format fits naturally in JSON, unlike a chat template Jinja script.

tokenizer.response_template = template
tokenizer.save_pretrained(...)  # Written as a key in tokenizer_config.json

Advanced: Field API Reference

Each field supports several keys. We can divide these into two types. First, there are the keys that define how the field should be captured:

| Key | Type | Purpose |
| --- | --- | --- |
| open | str or list[str] | Literal string that opens this region. A list of strings means “match any of these”. |
| open_pattern | str (regex) | Regex alternative to open. Named groups become capture variables available to transform. |
| close | str or list[str] | Literal string (or list of strings) that closes this region. Omit to run to end-of-stream. |
| close_pattern | str (regex) | Regex alternative to close. Named groups become capture variables available to transform. |
| repeats | bool | If true, the field is a list and each match appends. Default false. |
| optional | bool | If false and the region never matches, we raise an error. Default true. |

A field should have either open or open_pattern, but not both, and the same is true for close and close_pattern.

A field may omit close/close_pattern entirely, in which case the region stays open until the end of the generated text. This is useful for a final field that runs to the end of the message.

A field with neither open nor open_pattern is the implicit field: it’s active whenever no explicit region is open, so it captures leftover text. At most one field can be implicit. This is most often used when content does not have special token tags and is just written as plaintext after the other fields.

In addition to opening and closing delimiters, you can also specify repeats, which indicates that the field is a list and the delimiters can match multiple times. This is most common for parallel tool calling, when a model emits multiple tool calls simultaneously:

'<tool_call>{"name": "a", ...}</tool_call><tool_call>{"name": "b", ...}</tool_call>'
# Returns `"tool_calls": [{... "a" ...}, {... "b" ...}]` in a template with repeats: true

Finally, you can specify optional: false for fields that must be present. If such a field is missing, we raise an error instead of just returning a message dict without it.

The end of generation will close and finalize any open regions, even if their closing delimiter was not seen.
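
To make the capture rules concrete, here is a simplified sketch for literal open and close delimiters. It is not the library's parser: the capture function is hypothetical, it ignores regex delimiters and the implicit field, and taking only the first match when repeats is false is an assumption.

```python
def capture(text, open_delim, close_delim, repeats=False):
    """Collect the raw body of each open/close region found in text."""
    bodies = []
    position = 0
    while True:
        start = text.find(open_delim, position)
        if start == -1:
            break
        body_start = start + len(open_delim)
        end = text.find(close_delim, body_start)
        if end == -1:
            # End of generation closes the region even though its delimiter never appeared.
            bodies.append(text[body_start:])
            break
        bodies.append(text[body_start:end])
        position = end + len(close_delim)
        if not repeats:
            break  # assumption: a non-repeating field keeps its first match only
    if repeats:
        return bodies
    return bodies[0] if bodies else None


output = '<tool_call>{"name": "a"}</tool_call><tool_call>{"name": "b"}</tool_call>'
calls = capture(output, "<tool_call>", "</tool_call>", repeats=True)
unfinished = capture("<think>still reasoning", "<think>", "</think>")
missing = capture("plain reply", "<think>", "</think>")
print(calls, unfinished, missing)
```

A field with optional: false would turn the missing case into an error instead of returning nothing.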

Parsing the content of a field

Once we define how to capture a field, we also need to specify how to parse the raw text inside that capture. There are four keys that control this:

| Key | Type | Purpose |
| --- | --- | --- |
| content | str | The content type inside this region. Defaults to "text". Each type has its own parser. |
| content_args | dict | Arguments to be passed to the content parser for this region. |
| transform | dict/list | Optional post-parse template that reshapes the parsed body (see Transform). |
| transform_each | bool | If true, the parsed content must be a list and transform is applied per-element. |

The first (and most important) key is content. This indicates the content type of the field, which determines the parser that will be used to convert the raw text captured in the field to the final output. content_args are used to configure the parser, and allow us to support various format quirks without needing custom code. We’ll take a look at each type of parser and its arguments in turn.

Basic types

text, int, float and bool are the basic types. Each one strips whitespace and then applies a simple type conversion if required. None of them takes content_args, except text, which supports strip (default true) to control whether whitespace is stripped from the start and end of the captured text.

field = {"count": {"open": "<n>", "close": "</n>", "content": "int"}}
input = "<n> 42 </n>"
# Returns: {"count": 42}

json

The json parser parses the captured text as JSON. It’s the workhorse for tool-call arguments and anything else with nested structure. It accepts a handful of optional content_args to handle the various ways models mangle JSON in the wild:

  • unquoted_keys (bool, default false): Enable when key names are raw rather than quoted (e.g. {name: "foo"}). Useful for models that emit JavaScript-style object literals rather than strict JSON.
  • string_delims (list of [open, close] pairs, optional): for models that wrap string values in custom delimiters instead of "...". Each pair gives an opening and closing marker.
  • allow_non_json (bool, default false): if parsing fails, return the stripped raw text instead of raising. Useful as a fallback for fields where the model usually emits JSON but occasionally drops to plain text.

unquoted_keys and string_delims both exist to handle models that emit non-standard, almost-but-not-quite-JSON output, so you should only need them for a handful of models.

field = {"args": {"open": "<args>", "close": "</args>", "content": "json", "content_args": {"unquoted_keys": True}}}
input = '<args>{city: "London"}</args>'
# Returns: {"args": {"city": "London"}}
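
For intuition, the effect of unquoted_keys and allow_non_json can be approximated with the standard library. This is a deliberately naive sketch and not the parser Transformers uses: parse_jsonish is a hypothetical name, and the regex would also rewrite matching text inside string values.

```python
import json
import re


def parse_jsonish(text, unquoted_keys=False, allow_non_json=False):
    """Parse text as JSON, optionally tolerating bare keys and plain-text fallbacks."""
    candidate = text.strip()
    if unquoted_keys:
        # Naive repair: quote bare identifiers that follow "{" or "," and precede ":".
        candidate = re.sub(r'([{,]\s*)([A-Za-z_]\w*)(\s*:)', r'\1"\2"\3', candidate)
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        if allow_non_json:
            return text.strip()  # fall back to the stripped raw text
        raise


args = parse_jsonish('{city: "London"}', unquoted_keys=True)
nested = parse_jsonish('{a: {b: 1}, c: [1, 2]}', unquoted_keys=True)
fallback = parse_jsonish(" London ", allow_non_json=True)
print(args, nested, fallback)
```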

xml-inline

The xml-inline parser is for regions made up of a flat sequence of XML-ish tags, where each tag becomes one entry in a dict. It’s most often used inside a tool_calls field for models that emit each argument as its own tag rather than as a JSON blob. It takes the following content_args:

  • tag_pattern (str, required): regex matching a single tag. Must contain named groups key (the resulting dict key) and value (the raw text that becomes the dict value).
  • value_parser (dict, optional): nested content parser applied to each captured value. A dict with name (the parser, e.g. "json", "int") and optional args (its content_args). If omitted, values stay as raw strings.
  • merge_duplicates (bool, default false): when the same key appears multiple times, collect the values into a list instead of letting later matches overwrite earlier ones.

For example, Qwen3 emits each tool-call argument as its own <parameter> tag, and we parse it like this:

"tool_calls": {
    "open_pattern": r"<tool_call>\s*<function=(?P<name>\w+)>",
    "close": "</tool_call>",
    "repeats": True,
    "content": "xml-inline",
    "content_args": {
        "tag_pattern": r"<parameter=(?P<key>\w+)>\s*(?P<value>.*?)\s*</parameter>",
        "value_parser": {"name": "json", "args": {"allow_non_json": True}},
    },
    "transform": {"type": "function", "function": {"name": "{name}", "arguments": "{content}"}},
}

Note the nested value_parser: each parameter value is itself run through the json parser (with allow_non_json so plain strings still pass through). Feeding the input below to the tool_calls field above produces the result shown in the comment:

input = "<tool_call><function=get_weather><parameter=city>London</parameter><parameter=units>celsius</parameter></function></tool_call>"
# Returns: {"tool_calls": [{"type": "function", "function": {"name": "get_weather", "arguments": {"city": "London", "units": "celsius"}}}]}
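
The same result can be reproduced by hand with Python's re and json modules, which may help when debugging a tag_pattern. The sketch below reuses the patterns from the field definition above, but parse_tool_calls and json_or_text are hypothetical helpers and the transform is hard-coded rather than interpreted from a template.

```python
import json
import re

OPEN_PATTERN = r"<tool_call>\s*<function=(?P<name>\w+)>"
CLOSE = "</tool_call>"
TAG_PATTERN = r"<parameter=(?P<key>\w+)>\s*(?P<value>.*?)\s*</parameter>"


def json_or_text(raw):
    """Stand-in for a json value_parser with allow_non_json enabled."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return raw.strip()


def parse_tool_calls(text):
    calls = []
    for opened in re.finditer(OPEN_PATTERN, text, flags=re.DOTALL):
        end = text.find(CLOSE, opened.end())
        # An unclosed region runs to the end of the text.
        body = text[opened.end():] if end == -1 else text[opened.end():end]
        arguments = {
            tag.group("key"): json_or_text(tag.group("value"))
            for tag in re.finditer(TAG_PATTERN, body, flags=re.DOTALL)
        }
        # Hard-coded equivalent of the transform: name from the opener, arguments from the body.
        calls.append({"type": "function", "function": {"name": opened.group("name"), "arguments": arguments}})
    return calls


raw = (
    "<tool_call><function=get_weather><parameter=city>London</parameter>"
    "<parameter=days>3</parameter></function></tool_call>"
)
parsed = parse_tool_calls(raw)
print(parsed)
```

Because each value goes through the JSON fallback, London stays a string while 3 becomes an integer.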

kv-lines

The kv-lines parser handles line-delimited key: value pairs (think YAML-ish metadata or .env files). Each line becomes one entry in the resulting dict. All arguments are optional:

  • line_sep (str, default "\n"): separator between pairs.
  • kv_sep (str, default ":"): separator between a key and its value inside a single line. Only the first occurrence is used as the split point, so values may themselves contain the separator.
  • strip (bool, default true): strip surrounding whitespace from each key and value.
  • value_parser (dict, optional): nested content parser applied to each value, in the same {"name": ..., "args": ...} format as for xml-inline. If omitted, values stay as raw strings.

Lines that are empty or do not contain kv_sep are silently skipped, so stray blank lines in the captured region are tolerated.

field = {"metadata": {"open": "<meta>", "close": "</meta>", "content": "kv-lines"}}
input = "<meta>name: alice\nage: 30</meta>"
# Returns: {"metadata": {"name": "alice", "age": "30"}}

Note that age keeps "30" as a string; add a value_parser of {"name": "int"} to parse it to 30.
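
The kv-lines rules are simple enough to sketch directly. The function below is a hypothetical re-implementation for illustration only; in particular, it takes value_parser as a plain Python callable, whereas the real content_args use the {"name": ..., "args": ...} dict form.

```python
def parse_kv_lines(text, line_sep="\n", kv_sep=":", strip=True, value_parser=None):
    """Turn separator-delimited key/value lines into a dict."""
    result = {}
    for line in text.split(line_sep):
        if kv_sep not in line:
            continue  # empty lines and lines without the separator are skipped
        # Split on the first separator only, so values may contain it.
        key, value = line.split(kv_sep, 1)
        if strip:
            key, value = key.strip(), value.strip()
        result[key] = value_parser(value) if value_parser else value
    return result


metadata = parse_kv_lines("name: alice\nage: 30")
typed = parse_kv_lines("age: 30", value_parser=int)
messy = parse_kv_lines("url: http://example.com\n\nstray line")
print(metadata, typed, messy)
```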

Transform

For most fields, the transform key is unnecessary. It’s used when the parsed body needs to be reshaped into the final output, or when information from the delimiters has to be merged into the result. It most commonly appears in tool_calls fields, as these often have complex structure.

transform is a template: a dict (or list) that describes the output shape, where any string of the form "{name}" is replaced with the corresponding value. Values can be accessed from content (the parsed body of this region) and any named groups captured by open_pattern / close_pattern. A very common use-case is to wrap a tool call dict in an outer dict with a function key, as these are part of our standard tool call format:

"tool_calls": {
    "open": "<tool_call>",
    "close": "</tool_call>",
    "repeats": True,
    "content": "json",
    "transform": {"type": "function", "function": "{content}"},
},

So this raw output:

<tool_call>{"name": "greet_user", "arguments": {"greeting": "Hi!"}}</tool_call>

becomes (note repeats: True makes tool_calls a list):

[{"type": "function", "function": {"name": "greet_user", "arguments": {"greeting": "Hi!"}}}]

A whole-string placeholder like "{content}" returns the looked-up value with its type preserved — so above, the parsed JSON dict slots in directly as the value of function. A placeholder must be the entire string: mixing text and placeholders ("abc {name} def") is not permitted. They’re not f-strings!
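
A minimal sketch of whole-string placeholder substitution, assuming the rules described above. apply_transform is a hypothetical helper rather than the library's function, and raising ValueError on mixed strings is an assumption about how the restriction might be enforced.

```python
import re

PLACEHOLDER = re.compile(r"\{(\w+)\}")


def apply_transform(template, scope):
    """Recursively replace whole-string "{name}" placeholders with values from scope."""
    if isinstance(template, dict):
        return {key: apply_transform(value, scope) for key, value in template.items()}
    if isinstance(template, list):
        return [apply_transform(item, scope) for item in template]
    if isinstance(template, str):
        match = PLACEHOLDER.fullmatch(template)
        if match:
            return scope[match.group(1)]  # the value keeps its type (dict, list, int, ...)
        if PLACEHOLDER.search(template):
            raise ValueError(f"placeholder must be the entire string: {template!r}")
    return template


template = {"type": "function", "function": "{content}"}
content = {"name": "greet_user", "arguments": {"greeting": "Hi!"}}
shaped = apply_transform(template, {"content": content})
print(shaped)
```

Note that the dict stored under function is the parsed object itself, not a string rendering of it.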

transform is quite versatile, which becomes necessary when the model output has a wildly different format to our standard API. GPT-OSS is a good example - it embeds the function name in the channel header rather than in the JSON body, so we have to capture it with a named group in open_pattern and merge it with content inside the transform. All named groups in open_pattern and close_pattern become available as variables alongside content:

"tool_calls": {
    "open_pattern": r"<\|channel\|>commentary to=functions\.(?P<name>\w+).*?<\|message\|>",
    "close": "<|call|>",
    "repeats": True,
    "content": "json",
    "transform": {"type": "function", "function": {"name": "{name}", "arguments": "{content}"}},
},

The function name lives in the channel header, not the JSON body, so this:

<|channel|>commentary to=functions.get_current_weather <|constrain|>json<|message|>{"location": "San Francisco, CA"}<|call|>

becomes:

[{"type": "function", "function": {"name": "get_current_weather", "arguments": {"location": "San Francisco, CA"}}}]
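
To check an open_pattern like this one outside the parser, you can run it with Python's re module. The sketch below mirrors the GPT-OSS example by hand: it uses re.DOTALL as described in the portability notes, and it builds the final dict directly instead of interpreting the transform template.

```python
import json
import re

OPEN_PATTERN = r"<\|channel\|>commentary to=functions\.(?P<name>\w+).*?<\|message\|>"
CLOSE = "<|call|>"

raw = (
    "<|channel|>commentary to=functions.get_current_weather "
    '<|constrain|>json<|message|>{"location": "San Francisco, CA"}<|call|>'
)

opened = re.search(OPEN_PATTERN, raw, flags=re.DOTALL)
body_end = raw.index(CLOSE, opened.end())

# Named groups from the opener sit alongside the parsed body in one scope.
scope = {"content": json.loads(raw[opened.end():body_end]), **opened.groupdict()}

tool_call = {"type": "function", "function": {"name": scope["name"], "arguments": scope["content"]}}
print(tool_call)
```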

Sometimes a field’s parsed content is itself a list of records and you want to reshape each one. The Cohere template is a good example: It emits all tool calls inside a single JSON array, so we set transform_each: True to apply the transform per element. Each array element’s keys are unpacked into the template scope, so "{tool_name}" looks up tool_name in the current element:

"tool_calls": {
    "open": "<|START_ACTION|>",
    "close": "<|END_ACTION|>",
    "content": "json",
    "transform_each": True,
    "transform": {"type": "function", "function": {"name": "{tool_name}", "arguments": "{parameters}"}},
},

This will convert an output like this:

[
    {"tool_name": "greet_user", "parameters": {"greeting": "Hi!"}},
    {"tool_name": "search", "parameters": {"query": "weather tomorrow"}}
]

Into an output like this, which fits our standard API:

[
    {"type": "function", "function": {"name": "greet_user", "arguments": {"greeting": "Hi!"}}},
    {"type": "function", "function": {"name": "search", "arguments": {"query": "weather tomorrow"}}}
]

The transform_each flag is only needed when content is already a list. In the more common case, where each match contributes one element and repeats: True accumulates them, the transform is applied to each element by default.
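
The per-element scoping can be sketched as follows. Both fill and transform_each are hypothetical helpers written for this illustration; the sketch assumes every list element is a dict and that only that element's keys are visible to the template.

```python
def fill(template, scope):
    """Replace whole-string "{key}" placeholders in a nested dict template."""
    if isinstance(template, dict):
        return {key: fill(value, scope) for key, value in template.items()}
    if isinstance(template, str) and template.startswith("{") and template.endswith("}"):
        return scope[template[1:-1]]
    return template


def transform_each(parsed_list, template):
    # Each element's keys become the scope for one application of the template.
    return [fill(template, dict(element)) for element in parsed_list]


template = {"type": "function", "function": {"name": "{tool_name}", "arguments": "{parameters}"}}
parsed = [
    {"tool_name": "greet_user", "parameters": {"greeting": "Hi!"}},
    {"tool_name": "search", "parameters": {"query": "weather tomorrow"}},
]
tool_calls = transform_each(parsed, template)
print(tool_calls)
```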

Framework developers: Regex portability

open_pattern, close_pattern, and start_anchor_pattern are regex strings. For most users, and even for most model authors, this shouldn’t be a problem, but if you are a developer writing an implementation of response parsing in another language, you should be aware of our implementation details. This section is dedicated to everyone who had to implement an entire Jinja parser to get non-Python chat templating to work - we hope that if you follow the simple guidelines below, then response templates should be much less painful:

  • We use Python’s regex module for all regexes used in chat parsing. Since all Python 3 strings are Unicode, all of our regex matches are Unicode-aware. This particularly affects common character classes like \w. Make sure you set the relevant Unicode flags in your engine.
  • We compile all regexes with re.DOTALL enabled and re.MULTILINE disabled, so . matches \n but ^ and $ only match the start/end of the whole input, not line breaks.
  • We use (?P<name>...) syntax for named groups. Other regex implementations have very different named capture group syntax, so you may need to search for this pattern in regexes and rewrite it to match your local implementation.
  • We use partial regex matching to decide when we can emit data. This is necessary because the end of a region may consist of multiple tokens, and we don’t want to emit that end-of-region delimiter, so we hold tokens back until we’re sure that they’re inside the region and not part of the boundary. This means that whenever a region end regex has a partial match, we hold back data until the regex either matches or fails. If your regex engine doesn’t support partial matching, you can still implement response templates, but you may need to find another solution to this issue. One simple approach is just to hold back these regions / emit them as dirty until you definitively see the ending delimiter.
  • Your regex engine may (rarely) not support lookarounds like (?!...). Although these aren’t commonly used in response templates, they can appear and we do support them! You might need to either throw an error in those cases, or manually extract the lookarounds and enforce them in your code when the regex engine finds a possible match.
  • Other advanced features like backreferences, atomic groups, possessive quantifiers, recursion and so on are generally not used in response templates. We’ll try to dissuade model authors from using them, so you can hopefully safely ignore them.
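
For engines without partial matching, the hold-back idea is easiest to see with a literal close delimiter: keep back the longest tail of the buffer that could still be the start of the delimiter. The sketch below is one possible approach under that simplifying assumption, not the implementation used in Transformers; split_safe and stream_region are hypothetical names.

```python
def split_safe(buffer: str, close: str):
    """Split buffer into (safe_to_emit, held_back) for a literal close delimiter."""
    max_len = min(len(buffer), len(close) - 1)
    for length in range(max_len, 0, -1):
        if close.startswith(buffer[-length:]):
            # The tail might be the beginning of the delimiter, so keep it back.
            return buffer[:-length], buffer[-length:]
    return buffer, ""


def stream_region(chunks, close):
    """Yield-free sketch: return the emitted pieces and whether the region closed."""
    emitted, buffer = [], ""
    for chunk in chunks:
        buffer += chunk
        index = buffer.find(close)
        if index != -1:
            if buffer[:index]:
                emitted.append(buffer[:index])
            return emitted, True
        safe, buffer = split_safe(buffer, close)
        if safe:
            emitted.append(safe)
    if buffer:
        emitted.append(buffer)  # end of generation flushes whatever was held back
    return emitted, False


pieces, closed = stream_region(["I should ", "greet the user</th", "ink>ignored"], "</think>")
print(pieces, closed)
```

The partial delimiter "</th" is never emitted, because it is held back until the next chunk completes the match. Regex close patterns need the engine's partial matching or the dirty-region fallback mentioned above.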