Variation 38: "How I used the logit_bias parameter to combat em dashes in the API and had to suppress 106 tokens: my insights and sample code for a 'No Dash' output comparison"
Eliminating Em Dashes in AI Text Generation: A Practical Approach Using logit_bias
Dealing with undesirable punctuation or symbols in AI-generated responses can be a persistent challenge, especially when it comes to em dashes. Recently, I explored a method for suppressing em dashes in OpenAI's GPT models by leveraging the logit_bias parameter in the API, and I'd like to share the insights and techniques I developed. They may be useful for anyone aiming for cleaner, dash-free output.
The Challenge with Em Dashes
Despite multiple attempts using custom instructions and context tweaking, preventing GPT-4 from inserting em dashes proved difficult. These punctuation marks seem deeply ingrained in its language patterns, often reappearing even when explicitly discouraged. Recognizing that a single character can map to many different token IDs depending on the characters around it, I examined the tokenization process to find a more targeted solution.
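To see why targeting one token is not enough, it helps to look at how the em dash tokenizes in context. Here is a minimal sketch using the tiktoken library and the cl100k_base encoding; both are my assumptions, since the tooling is not named above:

```python
# A quick look at how the em dash tokenizes in context, using the
# tiktoken library (an assumption; the post names no specific tool).
import tiktoken

# cl100k_base is the encoding used by GPT-4-era chat models.
enc = tiktoken.get_encoding("cl100k_base")

for sample in ["—", "word—word", " — ", "sentence—"]:
    print(f"{sample!r:>14} -> {enc.encode(sample)}")
```

The same character can land inside different token IDs depending on what surrounds it, which is exactly why a single biased ID keeps leaking dashes.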
Using logit_bias for Token Suppression
The logit_bias parameter allows explicit influence over token probabilities during response generation. Setting a token's bias to -100 effectively bans it from the output. My strategy involved identifying all token IDs associated with em dashes and related punctuation, then applying a strong negative bias to each.
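In the Chat Completions API, logit_bias is a map from token IDs (as strings) to a bias between -100 and 100. Here is a minimal sketch using the OpenAI Python SDK; the token IDs are placeholders, not the real em dash IDs:

```python
# Minimal sketch of banning tokens via logit_bias, assuming the current
# OpenAI Python SDK; the token IDs below are placeholders only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A bias of -100 effectively bans a token from being sampled.
banned_tokens = {"2345": -100, "7110": -100}  # placeholder IDs

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Write one sentence about autumn."}],
    logit_bias=banned_tokens,
)
print(response.choices[0].message.content)
```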
Iterative Token Suppression
The process was iterative:
- Initially, I targeted the tokens that corresponded directly to '—' (the em dash).
- Next, I expanded the bias to include tokens with the '—' character attached to other characters, covering various contexts and combinations.
- Subsequently, I found that GPT occasionally switches to en dashes (–) or hyphens (-) in place of em dashes, so I included those as well.
- Finally, after setting negative biases on 106 tokens, the model's propensity to use any form of dash was effectively nullified.
For example (a sketch of this token discovery follows the list):
- Starting with just three tokens directly representing '—'.
- Increasing to 40 tokens, including any token containing '—'.
- Expanding further to 62 tokens to cover en dashes.
- Culminating in 106 tokens to also suppress hyphens used as em dash substitutes.
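One way to reconstruct this discovery step is to scan the tokenizer's vocabulary for every token whose bytes contain a dash. This is my own sketch, again assuming tiktoken's cl100k_base encoding; exact counts will differ by tokenizer:

```python
# Scan the vocabulary for tokens containing the em dash or en dash.
# A reconstruction of the discovery process described above, assuming
# tiktoken's cl100k_base encoding; exact counts vary by tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# UTF-8 byte sequences to look for inside each token's raw bytes.
dash_bytes = ["—".encode("utf-8"), "–".encode("utf-8")]

banned = {}
for token_id in range(enc.n_vocab):
    try:
        token = enc.decode_single_token_bytes(token_id)
    except KeyError:
        continue  # a few IDs in the range are unassigned
    if any(d in token for d in dash_bytes):
        banned[str(token_id)] = -100

print(f"Found {len(banned)} dash-bearing tokens to suppress")
```

Note that extending the scan to the plain hyphen "-" matches far more tokens than the 106 reported above, so that final step presumably involved hand-picking only the hyphen tokens that appear as em dash substitutes.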
Impact on Response Quality
Surprisingly, this aggressive suppression did not significantly harm the overall quality or coherence of responses. Tests across different models showed that outputs simply avoided dashes entirely, favoring more straightforward phrasing.
Sample Evaluations
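For a side-by-side check, a small harness along these lines can run the same prompt with and without the bias map and count the dashes that survive; the prompt, model name, and placeholder IDs are my own choices:

```python
# Sketch of a "No Dash" comparison harness, assuming the OpenAI Python SDK.
# The prompt, model name, and placeholder token IDs are illustrative only.
from openai import OpenAI

client = OpenAI()
PROMPT = "Describe the benefits of morning walks in two sentences."
banned = {"2345": -100}  # placeholder; use the IDs from the vocabulary scan

def generate(logit_bias=None):
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": PROMPT}],
        logit_bias=logit_bias or {},
    )
    return response.choices[0].message.content

# Generate a baseline and a suppressed version, then count dash characters.
for name, bias in [("baseline", None), ("no_dash", banned)]:
    text = generate(bias)
    dash_count = sum(text.count(c) for c in "—–-")
    print(f"[{name}] dashes={dash_count}\n{text}\n")
```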


