I built and trained a BERT on my gaming laptop (3070 RTX) to ~94% of BERT-base's performance in ~17 hours* (BERT-base was trained on 4 TPUs for 4 days). This notebook goes over the whole process, from implementing and training a tokenizer, to pretraining, to finetuning. One feature that makes this BERT different from most (though not unique) is the use of relative position embeddings.
Edit - for anyone unsure about what "BERT" is or its relevance, it's a transformer based natural language model just like GPT. However, where GPT is used to generate text, BERT is used to generate embeddings for input text that you can then use for predictive models (e.g. sentiment prediction), and that process is also demonstrated in the notebook.
*Edit 2 - The 17 hours are pretraining only, not including the time to train the tokenizer, or finetuning.
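For readers curious about the relative position embeddings mentioned above, here is a minimal sketch (not the notebook's exact code; the `pos_emb_radius` / `pos_emb_size` names are borrowed from a line quoted later in this thread, the rest is an assumed setup) of how a table of relative attention indices can be built:

```
import torch

# Sketch: build a [context_size, context_size] table where entry [i, j] is the
# index of the relative position embedding used when token i attends to token j,
# with distances clamped to a fixed radius.
context_size = 8                    # sequence length (small, for illustration)
pos_emb_radius = 4                  # how far left/right distinct positions are kept
pos_emb_size = 2 * pos_emb_radius   # number of distinct relative embeddings

# rel[i, j] = j - i: how far token j sits from token i
rel = torch.arange(context_size)[None, :] - torch.arange(context_size)[:, None]

# clamp to [-radius, radius - 1], then wrap negatives into valid embedding indices
att_idxs = torch.clamp(rel, -pos_emb_radius, pos_emb_radius - 1) % pos_emb_size
print(att_idxs)  # each row indexes into the relative position embedding table
```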
> it's a transformer based natural language model just like GPT
It's an encoder-decoder model, whereas GPT is decoder-only. That feels like a pretty big difference, though in practice I honestly still don't have a strong grasp of how encoder-decoder falls short of decoder-only when it comes to text generation. I get that BERT was designed for translation, but why can't we scale it up and use it for text generation just the same?
BERT is encoder only and was designed for classification and natural language inference problems. The original Transformer was encoder-decoder and was designed for translation.
BERT can't be used in an autoregressive way because it doesn't output a new token, it simply generates embeddings from the existing tokens (you get one for each input token).
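As a rough, toy-scale illustration of that interface difference (this is not the notebook's model; the encoder and head below are just stand-ins): a BERT-style encoder returns one embedding per input token, while generating text additionally requires a head that turns a hidden state into next-token logits (plus a causal mask, which BERT doesn't use).

```
import torch
import torch.nn as nn

vocab_size, hidden, seq_len = 1000, 64, 10
tokens = torch.randint(0, vocab_size, (1, seq_len))

embed = nn.Embedding(vocab_size, hidden)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True),
    num_layers=2,
)

# BERT-style output: one contextual embedding per input token, no new token
embeddings = encoder(embed(tokens))                     # shape [1, seq_len, hidden]

# GPT-style output: project a hidden state to vocabulary logits and pick the
# next token (a real GPT would also use a causal attention mask)
lm_head = nn.Linear(hidden, vocab_size)
next_token = lm_head(embeddings[:, -1]).argmax(dim=-1)  # shape [1]
print(embeddings.shape, next_token)
```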
> where GPT is used to generate text, BERT is used to generate embeddings for input text that you can then use for predictive models (e.g. sentiment prediction)
Ok, but isn't text generation more general? E.g. you could ask it to predict the sentiment of a sentence and write the result as a sentence?
Yeah, my explanation was definitely a lossy summary. You can do similar things with GPT, but BERT is bidirectional, so for a given token it can take into account both the tokens before and after it. GPT would only take into account the tokens before it. Looking both ways can be helpful. Another comment in this thread explains the same thing (maybe more clearly).
Yeah, he's glossing over some things, but with good reason.
Might be more accurate to say BERT is a discriminative model, while GPT is a generative model. BERT was trained using the masked-language-model (MLM) process, which is different from the decoder-only process used for the first GPT. Sentiment prediction is just one particular thing BERT is capable of. There are many more capabilities, but GPT has sort of steered the industry towards generative models.
https://huggingface.co/docs/transformers/main/tasks/masked_l...
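For reference, a minimal sketch of that masked-language-model objective, following the usual BERT recipe of selecting ~15% of tokens and replacing them 80/10/10 with [MASK] / a random token / the original (the function and token IDs below are illustrative, not the notebook's code):

```
import torch

def mask_for_mlm(tokens, mask_token_id, vocab_size, mask_prob=0.15):
    """Return (inputs, labels) for the MLM objective: labels are -100
    (ignored by cross-entropy) everywhere except at the positions the
    model must predict."""
    tokens = tokens.clone()
    labels = torch.full_like(tokens, -100)

    # choose ~15% of positions to predict
    selected = torch.rand(tokens.shape) < mask_prob
    labels[selected] = tokens[selected]

    # of those: 80% -> [MASK], 10% -> random token, 10% -> left unchanged
    r = torch.rand(tokens.shape)
    tokens[selected & (r < 0.8)] = mask_token_id
    rand_pos = selected & (r >= 0.8) & (r < 0.9)
    tokens[rand_pos] = torch.randint(0, vocab_size, tokens.shape)[rand_pos]
    return tokens, labels

inputs, labels = mask_for_mlm(torch.randint(5, 1000, (2, 16)), mask_token_id=4, vocab_size=1000)
```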
GPT and BERT were actually the first major models published after Google's "Attention Is All You Need" paper.
Haha fair question. I didn't make any special changes, I just left the lid open and put the laptop in a ventilated spot. I'm actually in the tropics, so I guess Lenovo scores some points here (the laptop is a Legion 5 Pro).
If you want to really cook your lap, try running Gentoo! Emerging (compiling) Firefox, glibc, gcc, LibreOffice and a few of their friends will soon show you how good the cooling is.
A few years back cough I upgraded gcc from 3 to 4 and emerged system and then world. That was over 1200 packages. It took about a week. That was in the days when I used a Windows wifi driver and some unholy magic to get a connection. I parked the laptop on a table with the lid open and two metal rods lifting it up 6" for airflow. I left the nearby window open a bit too.
Going off on a tangent: I used to use Gentoo in the past. I suspect, if you use one of the common processors, compiling your own binaries doesn't really give you any performance benefits, does it?
(I have to admit I stopped using Gentoo mostly because it encouraged me to endlessly fiddle with my system, and it would invariably end up broken somehow. That's entirely my fault, and not Gentoo's. I switched to Archlinux as my distribution of choice, and I manage to hold myself back enough not to destroy my installation.)
Quite a bit of performance. Generally, Linux distros (and really all general-purpose OSes) need to limit the CPU features they use to a minimum common baseline. Does your machine support AVX-512 instructions? The compiler won't use them, because they're not available everywhere the software will run. By compiling yourself, you can specialize the build for the features of your machine.
Beyond that, the big win, even over performance, is customization. The most secure code, and the fastest code, is the code that isn't there at all. Do you really need your entire system to support LDAP authentication? Maybe... What about your local mail daemon? Do you need that? Because cron does, and since your mail daemon also has MySQL support built in, installing cron gets you the MySQL libraries.
I don't use it anymore because of the overhead, but there are a lot of performance and security benefits to be had there.
> Realistically, though, most software that can benefit from specialized instructions already detects their availability at runtime and uses those, even if the code was compiled with -march=x86-64.
It's all behind the submission link! I've set it up so that you can run it start to end, if you want. The only thing I'm not 100% sure about is resource requirements - I have an 8GB GPU and 32GB of RAM, it could be that if you have less than that you'd run into out of memory errors. Those would be fairly straightforward to fix, though (honestly I'd be happy to help if someone runs into this).
The biggest distinction in architecture between BERT and GPT is that BERT looks both ways from a given token. This helps give context to a token. This is what made BERT great at the time because the surrounding text, before and after, could change the meaning of the token we are at. You could essentially fill in the middle, or rather correct what's in the middle after it's been said. I believe this is why Apple is using it for iOS 17's auto-correct.
GPT predicts the next word by only looking back at what it has seen so far. In other words, it's autoregressive.
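A quick way to see that difference is in the attention masks (toy sizes; not the notebook's code): a bidirectional model lets every position attend to every other position, while an autoregressive model masks out future positions.

```
import torch

seq_len = 5

# BERT-style bidirectional mask: every token may attend to every other token
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

# GPT-style causal mask: token i may only attend to positions <= i
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

print(causal_mask.int())
# tensor([[1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1]])
```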
Remember not to confuse interface with implementation. GPT's interface is a single stream of tokens - so if you want it to see before and after context, that just means you have to encode them into a single stream.
This comment is downvoted, but it is correct. While fill-in-the-middle is less obvious in the decoder-only paradigm, it is still possible. One example is Code Llama https://ai.meta.com/blog/code-llama-large-language-model-cod... It is a variant of Llama 2 (GPT-style, decoder-only), but it supports infilling.
> [W]e split training documents at the character level into a prefix, a middle part[,] and a suffix with the splitting locations sampled independently from a uniform distribution over the document length. We apply this transformation with a probability of 0.9 and to documents that are not cut across multiple model contexts only. We randomly format half of the splits in the prefix-suffix-middle (PSM) format and the other half in the compatible suffix-prefix-middle (SPM) format described in Bavarian et al. (2022, App. D). We extend Llama 2’s tokenizer with four special tokens that mark the beginning of the prefix, the middle part or the suffix, and the end of the infilling span
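As a rough sketch of that transformation (the sentinel token names below are placeholders, not Code Llama's actual special tokens, and real training operates on token IDs rather than strings):

```
import random

def to_psm(document, prefix_tok="<PRE>", suffix_tok="<SUF>",
           middle_tok="<MID>", end_tok="<EOT>"):
    """Split a document at two random character positions and emit a
    prefix-suffix-middle (PSM) training string, as described in the quote:
    the model sees the prefix and suffix, then learns to generate the middle."""
    i, j = sorted(random.sample(range(len(document)), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    return f"{prefix_tok}{prefix}{suffix_tok}{suffix}{middle_tok}{middle}{end_tok}"

print(to_psm("def add(a, b):\n    return a + b\n"))
```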
Yes, but Code Llama also found that the PSM format was inferior to the SPM format, presumably because those hard cuts lose context. The "real" fill-in-the-middle of BERT is, I think, more likely to model language well compared to the "faux" fill-in-the-middle of flinging prefixes and suffixes around.
Where is that reported? In Table 14 I see PSM performing much better than SPM. I also see a note about the SPM performance which attributes the degradation to tokenizer edge cases:
> As an example, our model would complete the string 'enu' with 'emrate' instead of 'merate' which shows awareness of the logical situation of the code but incomplete understanding of how tokens map to character-level spelling.
That doesn't really feel like a failure of language modeling to me.
> Note, however, that the results in random span infilling are significantly worse in suffix-prefix-middle (SPM) format than in prefix-suffix-middle (PSM) format as it would require token healing (Microsoft, 2023)
Yeah, I hear you that the decoder-only infilling approach is 'weird' -- I just don't know if I agree that it's manifestly worse at language understanding / performance than the BERT approach.
Not defining "BERT" is a little weird, especially since something like MLM is explained. A good rule for writing, which I learned from The Economist, is to explain what something is the first time you mention it. It does lead to some funny explanations sometimes. One of my favorites, again from The Economist, is "HSBC, a bank". It sort of let them know that even if they see themselves as being big and important, they are "just a bank".
No. There are at least two kinds of costs. First, it takes time to search 'adjacent' domains. Second, by reducing your available acronyms/initialisms, you make it harder to map your architecture name onto those letters.
It is fun to think of some of the alternative BERT names that "could have been", such as BIDET = BIDirectional Encoder representations from Transformers.
This is a fantastic notebook, thanks very much for sharing.
"If you want to run the full notebook on a full size model, expect training the tokenizer to take ~15 hours, pretraining with the MLM objective to take ~17 hours (on a 3070 RTX, adjust expectations for your own system), and finetuning to take about an hour"
I wonder how hard it would be to modify this code to run on a 64GB M2 Mac.
It's frustrating how much potential that platform has for this kind of thing (given the way the GPU shares memory with the CPU) that isn't yet harnessed because most of the ecosystem is built around NVIDIA and CUDA.
> I wonder how hard it would be to modify this code to run on a 64GB M2 Mac.
It isn't that hard; I was able to run it on an M1.
The changes are:
- Remove or modify multiprocessing; it doesn't work on a Mac the same way as in the code.
- Replace `device = "cuda"` with `device = "mps"`.
- In the line `att_idxs = (torch.clamp(torch.arange(context_size)[None, :] - torch.arange(context_size)[:, None], -pos_emb_radius, pos_emb_radius-1) % pos_emb_size).to("cuda")`, replace `"cuda"` with `"mps"`.
- In `optim.AdamW`, remove `fused=True` - we can't use it without CUDA.
- Replace
```
with autocast(device_type='cuda', dtype=torch.float16):
    _, loss = mlm_head(bert(batch_data_torch_xs[mb_start_idx:mb_end_idx]), batch_data_torch_ys[mb_start_idx:mb_end_idx])
```
with simply `_, loss = mlm_head(bert(batch_data_torch_xs[mb_start_idx:mb_end_idx]), batch_data_torch_ys[mb_start_idx:mb_end_idx])`.
- Replace `scaler.scale(corrected_loss).backward()` with `corrected_loss.backward()`.
- Replace
```
scaler.unscale_(optimizer)
scaler.step(optimizer)
scaler.update()
```
with `optimizer.step()` (see the consolidated sketch below).
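For anyone who wants to see the two variants side by side, here is a self-contained toy version of that training step (a stand-in linear model and random tensors, not the notebook's `bert` / `mlm_head`): the CUDA branch mirrors the original fused-AdamW + autocast + GradScaler step, and the fallback branch is the plain step described in the list above.

```
import torch
import torch.nn as nn

# Toy stand-ins just to make both variants runnable end to end
model = nn.Linear(16, 16)
x, y = torch.randn(4, 16), torch.randn(4, 16)
loss_fn = nn.MSELoss()

if torch.cuda.is_available():
    # Original (CUDA) path: fused AdamW + fp16 autocast + GradScaler
    device = "cuda"
    model = model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, fused=True)
    scaler = torch.cuda.amp.GradScaler()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = loss_fn(model(x.to(device)), y.to(device))
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)
    scaler.step(optimizer)
    scaler.update()
else:
    # MPS/CPU path: plain fp32 step, no scaler, no fused optimizer
    device = "mps" if torch.backends.mps.is_available() else "cpu"
    model = model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss = loss_fn(model(x.to(device)), y.to(device))
    loss.backward()
    optimizer.step()
```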
> It's frustrating how much potential that platform has for this kind of thing (given the way the GPU shares memory with the CPU) that isn't yet harnessed because most of the ecosystem is built around NVIDIA and CUDA.
I'm sure it's frustrating from a consumer perspective, but it should be no surprise why Nvidia won here. CUDA shipped unified memory addressing ten years before the M1 hit shelves. On top of that, their architecture and OS support is top-notch, you can ship your CUDA code on anything from a $250 Jetson to a $300,000 DGX system, and their hardware is relatively ubiquitous.
The frustrating thing is how companies like Apple and Nvidia insist on being each other's enemies. Only consumers feel the pain when researchers discover cool stuff like this and want to share.
You can add an Nvidia card to basically any kind of hardware, desktop or server, or, most importantly, rent one in the cloud. You must buy a Mac to use an M1/M2. Given the cost of some cards it could make sense, but then everybody using that software will have to buy a Mac too.
A lot of people already have M1s or M2s, and they would otherwise have to pay for access to an Nvidia card. I think that's the disconnect. It's more about making use of what a lot of people (and let's not forget, a lot of developers) already have.
I've personally got an 8GB M1 Macbook as my work development machine, and while I'm having a lot of fun with llama.cpp it does feel somewhat disconnected from the bulk of the ML ecosystem.
This is awesome. I've wanted to pre-train a language model in Brazilian Portuguese for a long time, and I just got a 24GB 3090. This is perfect timing. Thank you for sharing your notebook.
Nice! You'd get a long way just by swapping out the Wikipedia dataset being pulled in. Though you might still need to accommodate some special characters as well; I don't know how important those are in Brazilian Portuguese.
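If anyone wants to try that swap, it could be as little as pointing the data-loading step at a Portuguese dump via the Hugging Face `datasets` library - assuming a Portuguese config is available for the `wikimedia/wikipedia` dataset (the config name below may need adjusting):

```
from datasets import load_dataset

# Sketch: pull a Portuguese Wikipedia dump instead of the English one.
# The exact dataset/config name may differ; wikimedia/wikipedia hosts
# per-language dumps with configs such as "20231101.pt".
wiki_pt = load_dataset("wikimedia/wikipedia", "20231101.pt", split="train")
print(wiki_pt[0]["text"][:200])
```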
I've been thinking about doing something like this, but inference only and written in Rust. I thought it might be a good excuse to learn SIMD, but it seemed like a big undertaking, so I've been putting it aside for a long time.
Thanks for sharing. One thing that I really like is that your code is very clean, well formatted, and commented. That's very rare to see when people share their Jupyter notebooks.