Remember not to confuse interface with implementation. GPT's interface is a single stream of tokens - so if you want it to see before and after context, that just means you have to encode them into a single stream.
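concretely, "encode them into a single stream" just means gluing the before/after context together with sentinel markers, something like this (the sentinel strings here are made up; every model defines its own special tokens):

```python
# before/after context linearized into one token stream; <PRE>/<SUF>/<MID>
# are placeholder sentinel strings, not any particular model's actual tokens
before = "def fibonacci(n):\n    a, b = 0, 1\n"
after = "\n    return a\n"
prompt = f"<PRE>{before}<SUF>{after}<MID>"
# the model then generates the missing middle until it emits an end-of-infill marker
```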
this comment is downvoted but it is correct. while fill-in-the-middle is less obvious in the decoder-only paradigm, it is still possible. one example is Code Llama (https://ai.meta.com/blog/code-llama-large-language-model-cod...): it is a variant of Llama 2 (GPT-style decoder-only) but it supports infilling.
> [W]e split training documents at the character level into a prefix, a middle part[,] and a suffix with the splitting locations sampled independently from a uniform distribution over the document length. We apply this transformation with a probability of 0.9 and to documents that are not cut across multiple model contexts only. We randomly format half of the splits in the prefix-suffix-middle (PSM) format and the other half in the compatible suffix-prefix-middle (SPM) format described in Bavarian et al. (2022, App. D). We extend Llama 2’s tokenizer with four special tokens that mark the beginning of the prefix, the middle part or the suffix, and the end of the infilling span
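for concreteness, the transformation they describe is roughly this sketch (sentinel strings are placeholders for the four special tokens; treat the SPM branch as my reading of the "compatible" layout from Bavarian et al., App. D):

```python
import random

# placeholder sentinel strings; Code Llama adds four special tokenizer
# tokens for the same purpose
PRE, SUF, MID, EOT = "<PRE>", "<SUF>", "<MID>", "<EOT>"

def fim_transform(doc: str, fim_rate: float = 0.9) -> str:
    """Split a document at two uniformly sampled character positions and
    emit it in PSM or SPM order, roughly as the quoted passage describes."""
    if random.random() > fim_rate:
        return doc  # ~10% of documents stay as ordinary left-to-right text
    i, j = sorted(random.randrange(len(doc) + 1) for _ in range(2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    if random.random() < 0.5:
        # prefix-suffix-middle (PSM)
        return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}{EOT}"
    # "compatible" suffix-prefix-middle (SPM): sentinels keep the PSM order
    # but the suffix is presented before the prefix
    return f"{PRE}{SUF}{suffix}{MID}{prefix}{middle}{EOT}"
```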
yes, but Code Llama also found the PSM format inferior to the SPM format, presumably because those hard cuts lose context. the "real" fill-in-the-middle of BERT is, i think, more likely to model language well than the "faux" fill-in-the-middle of flinging prefixes and suffixes around.
where is that reported? in Table 14 I see PSM performing much better than SPM. I also see a note about the SPM performance which attributes the degradation to tokenizer edge cases:
> As an example, our model would complete the string 'enu' with 'emrate' instead of 'merate' which shows awareness of the logical situation of the code but incomplete understanding of how tokens map to character-level spelling.
that doesn't really feel like a failure of language modeling to me
> Note, however, that the results in random span infilling are significantly worse in suffix-prefix-middle (SPM) format than in prefix-suffix-middle (PSM) format as it would require token healing (Microsoft, 2023),
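for anyone who hasn't run into it: "token healing" just means backing off over the last prompt token and constraining the next generated token to start with the text you removed, so the model isn't forced to continue from an unnatural token boundary like 'enu'. toy sketch with a made-up vocabulary (real implementations, e.g. Microsoft's guidance library, do this against the actual tokenizer and logits):

```python
# made-up toy vocabulary; ids and strings are purely illustrative
TOY_VOCAB = {0: "print(", 1: "enu", 2: "enumerate", 3: "emrate", 4: "merate"}

def token_heal(prompt_ids: list[int], vocab: dict[int, str]):
    """Drop the last prompt token and return (shortened prompt, allowed next
    token ids), where an allowed token must start with the dropped text."""
    dropped = vocab[prompt_ids[-1]]
    allowed = [tid for tid, text in vocab.items() if text.startswith(dropped)]
    return prompt_ids[:-1], allowed

# prompt ends in the token "enu"; without healing the model can only append
# something after that boundary (which is how you end up with "emrate"),
# with healing it may instead pick "enumerate" directly
print(token_heal([0, 1], TOY_VOCAB))  # -> ([0], [1, 2])
```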
yeah, I hear you that the decoder-only infilling approach is 'weird' -- I just don't know if I agree that it's manifestly worse at language understanding / performance than the BERT approach