Show HN: Next-token prediction in JavaScript (github.com/bennyschmidt)
88 points by bschmidt1 6 months ago | 102 comments
What inspired this project was watching this amazing video by 3Blue1Brown called "But what is a GPT?" on YouTube (https://www.youtube.com/watch?v=wjZofJX0v4M - I highly recommend watching it). I added it to the repo for reference.

When it clicked in my head that "knowing a fact" is nearly synonymous with predicting a word (or series of words), I wanted to put it to the test, because it seemed so simple. I chose JavaScript because I can exploit the way it structures objects to aid in the modeling of language.

For example:

"I want to be at the beach", "I will do it later", "I want to know the answer", ...

becomes:

  {
    I: { 
      want: { 
        to: { 
          be: { ... }, 
          know: { ... }
        }
      },
      will: { ... }
    },
    ...
  }
in JavaScript. You can exploit the language's fast object lookups to find known sentences this way, rather than recursively searching raw text - the conventional approach, which would take forever or not work at all considering there are several full books loaded in by default (and it could support many more).
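
Here's the rough idea as an illustrative sketch (not the library's actual code):

  const sentences = [
    'I want to be at the beach',
    'I will do it later',
    'I want to know the answer'
  ];

  const model = {};

  // Fold every sentence into the nested object, word by word
  for (const sentence of sentences) {
    let node = model;

    for (const word of sentence.toLowerCase().split(/\s+/)) {
      node[word] = node[word] || {};
      node = node[word];
    }
  }

  // A lookup is just successive property accesses on the object
  const lookup = cursor =>
    cursor.toLowerCase().split(/\s+/).reduce((node, word) => node?.[word], model);

  console.log(Object.keys(lookup('I want to') || {})); // [ 'be', 'know' ]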

Accompanying research taught me what "tokens" and "embeddings" are, what is meant by "training", and most of the rest - though I'm still learning the jargon. I wrote a script that iterates over every single word of every single book to rank how likely that word is to appear next, given a cursor, and extended that to rank entire phrases. The base decoder started out what I'll call "token-agnostic" - it didn't care whether you were looking for the next letter... word... pixel... it's the same logic. But actually it's not quite, and it soon evolved into a text (language) model. I have plans to get into image generation next (next-pixel prediction) using this. Overall the concepts are similar, but there are differences primarily around extraction and formatting.

Goals of the project:

- Demystify LLMs for people, show that it's just regular code that does normal stuff

- Actually make a pretty good LLM in JavaScript, with a version at least capable of running in a browser tab




I might be misinterpreting the blurb above, but isn't this a JavaScript implementation of a Markov Chain [1]?

[1] https://towardsdatascience.com/text-generation-with-markov-c...


Exactly my thoughts - this isn't really an LLM. I understand it's just a personal project, but it's probably good to know LLMs are not just lookup tables.

Also, I'm not sure I understand the explanation of why JS objects are so great for this; you could do the same thing with a Python dict, for example, or the equivalent mapping type in any programming language.

This is also missing an attention mechanism, which is important to understand even for a toy LLM, as it's part of what makes the responses so relevant to the context rather than just predicting which word usually comes next after another word (or sequence of words).


> you could do the same thing with a python dict for example or equivalent mapping types in any programming language.

Yes, any language with a hashmap of sorts with O(1) lookups should be equivalent.


Yes the time complexity is how it can all be held in memory without crashing (because it isn't really all held in memory, like it would be with text) but that's only part of it. You could not choose "any programming language" for it because not all of them are good at lookups, and not all languages have callback functionality or the ability to natively enqueue and fork code flow. It would take forever to do this and probably not work in most languages.


The library never claims to be an LLM - it's a next token prediction library that you can use to create LLMs.

Python - one nice thing is all the "keys" retain their primitive type, whereas in JS they all turn into strings. If Python wasn't so slow compared to JS that would matter a lot and I might use it, but the speed comparison isn't even close.


You must be joking, the title of this post literally says "build fast LLMs from scratch" but the code is neither an LLM, nor particularly fast.

Python dicts are actually different from JS objects: Python uses a hashmap behind the scenes, while most fast JS VMs apply heuristics to decide whether to use a hashmap or to lay the object out as a static struct with a known offset for each key. This explains it better than I can: https://v8.dev/blog/fast-properties

Nevertheless, for your specific use case I would be really surprised if there was any significant difference between the performance of Python and JS - hashmaps are fast enough for this. Plus, with JS objects you might be trading insertion performance for access performance in some cases, as there is overhead in creating a V8-style "fast property" object.


Not joking. You seem to be confused in thinking LLMs are built with other LLMs, but that's not the case. Otherwise why would you say "it says 'build fast LLMs from scratch' but the code is not an LLM" ?? Why would the code of a library to build LLMs need to also be an LLM?

Getting off-topic but Python is incredibly slow at lookups (and most things) compared to JavaScript and it isn't even close, not all dynamic languages are the same. This is pretty widely known and a quick Google search yields plenty of benchmarks and articles! Give it a try.

Python is used in AI/ML for its libraries (convenience), not because it's a fast language. There are 3 main reasons: 1) the time complexity of data structures is lower in JS - that's the primary exploit at play here; 2) V8 compiles to machine code in fewer steps than Python; and 3) process forking - the concept that functions can run in parallel in the same thread.

Thanks for your comment!


Ok, not to belittle your work or anything, but I think you are either not using the words correctly, or trolling us here.

If we used your "non-LLM" library to build an LLM, it wouldn't be "from scratch" would it? Therefore it is natural to assume that you meant that your code is supposed to be "an LLM written from scratch".


Not trolling, though I feel the same way about you with this post haha.

To answer your question, this library doesn't provide something like a `chat` method and doesn't provide anything like Stable Diffusion's `img2img` API either - that's not what it is - it's a library for predicting the next token (letter, word, pixel, etc.) based on data you train it on.

The most typical use cases for this model are: Auto-completion, auto-correct, spell check, search, etc.

However, if you watched the incredible video I shared in the original post you'd know that this completion concept can be applied to other interfaces: Chat, image generation, audio analysis/generation, etc. and in fact it is applied to LLMs like GPT.

This library doesn't get into any chat-specific methods like `ask` (question & answer), it just completes token sequences. If your goal is to create a ChatGPT clone, you would have to add methods like `chat`, `ask`, etc. and provide code for ongoing conversations - probably using an NLP library to parse queries into cursors like I'm trying in another project, or use "MULTI-HEADED ATTENTION BLOCKS" (lmfao) if you fancy that. Or if you wanted to create a Stable Diffusion clone, you'd have to provide those image-related methods needed. As far as the meaning of "from scratch" - I mean compared to using Ollama etc. to run local models or using OpenAI - just trying to enable you to build whatever model suits your data and use case.


I feel you must be trolling now by how confidently incorrect you are, but in case you are not:

> Time complexity of data structures is lower in JS

What data structures? All of them? You know you can implement your own data structures if you need them to be optimal

> V8 compiles to machine code in less steps than Python

The "steps" or time it takes to compile has no bearing on runtime performance. v8 is generally faster on micro-benchmarks, but in python you spend most of your time calling out to libraries that are heavily optimised, the javascript library ecosystem is a complete joke for ML/AI compared to python's – for example there is nothing that can compare to numpy in js.

> Process forking - the concept that functions can run in parallel in the same thread

This is an OS feature and has nothing to do with JS or Python. I have to point out, though, that when you fork a process you are creating a new thread.

This will be my last comment, it is clear to me now that you posted this thread to stroke your own ego and have no interest in actually learning anything. Good luck with your GPT-killer :)


Not joking, not trolling, thought it was widely known that JavaScript is generally significantly faster than Python. Haven't had this debate in 15 years, but let me explain:

> What data structures?

Everything in JavaScript is an object, even Arrays. So even an Array lookup is O(1) in JavaScript - not true in Python where it has to search the Array. Only if you created key/val pairs in Python (via list/hash) could you exploit the O(1) lookup for accessing data, but I can't use Python list for a large model like an LLM without hitting memory errors (see: https://stackoverflow.com/questions/5537618/memory-errors-an...)

> Python is run-time

Both languages are dynamic (not compiled) lol, what are you trying to say here? The point is that the number of steps it takes to go from high-level dynamic code (scripts) to machine-readable instructions is 1 step in JS, but 2 steps in Python, that's literally why it's slower to RUN than JavaScript. Literally runs slower as in number of executions.

> Multi-process is an OS feature that has nothing to do with JS or Python

Not true in the slightest. It's a language feature. I'll use my favorite word of the day and say it's an "exploit" more than a feature, when you can run what would be blocking IO in parallel. Python on the other hand is "single flow" executing statements one-at-a-time.

What a toxic comment, I said thanks to everyone else but not you. I retract my thank you! I hope you learned something today at least. This made me LOL: "JS uses some heuristics to decide whether to use a hashmap or to make the object as a static struct" neither hashmap nor struct exists in JS, just funny. There's ES6 Map, but that is really just an Object, not the other way around lol


> Everything in JavaScript is an object, even Arrays. So even an Array lookup is O(1) in JavaScript - not true in Python where it has to search the Array.

Huh? Array and list lookups are O(1) in Python too. Who told you that? Searching is different, and it's O(n) in both JS and Python. Can you really believe two mainstream programming languages can have such a drastic difference in time complexity?

Also, not my comment, but:

> "JS uses some heuristics to decide whether to use a hashmap or to make the object as a static struct" neither hashmap nor struct exists in JS, just funny.

I'm pretty sure they meant that JavaScript internally uses either a hashmap or a struct (you can do something close in Python using __slots__) to represent an object. Those are standard data structure names, even though JavaScript doesn't use the same terminology. Python doesn't call them hashmaps either.


My bad, it's on the insertion side: you can O(1) insert into the middle of array-likes like Set in JS (which I use in this lib), whereas I think Python is O(n) for all its array-likes. I think you can O(1) append to an array though.

But there are at least a few other reasons Python is generally slower than JavaScript. It's popular in ML because it was popular in math, and it was popular in math because it's easy to teach in schools. People use Python for its ML libraries, and now its ML communities, not because it's a fast language - it isn't really. And it has no browser implementation, so there are a few reasons I didn't choose Python.

Reading it again I think I understand what the other poster meant now, but they also said "multi-process has nothing to do with the language" when it does have a lot to do with scaling Node especially to handle concurrency. I did a little demo here a while ago to show what I mean, check out the browser tab example: https://github.com/bennyschmidt/simple-node-multiprocess

Thanks for your comments, I hate language vs language debates! :D


> you can O(1) insert into the middle of Array-likes like Set in JS

Sets are supposed to retain the insertion order. I don't think you can insert into the middle, let alone in O(1) time (as they are hashmaps, not linked lists).

> I hate language vs language debates!

I wasn't trying to compare JS to Python. My aim was to clarify a misunderstanding re: data structures used.


In the case of `Set` and `Map` they're keys though. It's sorted in that you can reference `data[1]` and get that element, but I don't know the keys are necessarily sorted.

This guy says "V8 team explains how some memory optimization was done on its hashtable implementation for Map, Set, WeakSet, and WeakMap: Optimizing hash tables: hiding the hash code

Based on 1 and 2: V8's Set and Map's get & set & add & has time complexity practically is O(1)."

https://stackoverflow.com/questions/33611509/es6-map-and-set...


I wouldn't bother arguing with this guy, he deliberately misquoted me on every point to make it seem like what he's saying is correct

Indeed, I was talking about the internals of V8, and even linked a blog post explaining it. V8 is written in C++, so it doesn't use JavaScript data structures anyway.


Haha dude you think Python is fast, and said multi-process has nothing to do with JavaScript, so of course I'm going to write off anything you think at this point because you know nothing about dynamic languages.

I missed the V8 bit because I was replying to a lot of people - your overall point is still terrible because JavaScript is a lot faster than Python. Think about the larger point you're making.

Btw sometimes you say "it's just an OS feature" when convenient but you don't say "it's just an interpreter feature" (regarding V8) - I wonder at what point in the debate that will come out. That's why I hate these kinds of debates, they're all signaling and changing subjects and losing sight of the point being made.

Whenever I'm talking about time complexity, I'm in a terrible conversation that I can't wait to get out of. The bottom line is everybody in the world except you knows Python is slow as hell, and for more than 1 reason.


> but it’s probably good to know LLMs are not just lookup tables.

But they are almost literally lookup tables, actually!

The thing is they don't perform the lookup on a single word as a key, but on the entire context, which is what makes them special. But besides that it's “just” a lookup table.


A deep attention network is absolutely not a lookup table


An attention head is quite literally a lookup table!


Multiple lookup tables


Bit of an oversimplification though, no one would look at Postgres and go “it’s just a couple lookup tables put together”


As an implementation? No.

For fundamental understanding of the logical model? That its one big lookup table with a particular form of key nesting is... actually a pretty good model.


That's exactly what DB indexes are though ;)


Exactly right. The others here are confusing LLM with "chat bot", and also they seem to be confusing token prediction with LLM. I have a feeling the mainstream won't really get it until a ChatGPT clone is online and ready to use lol and still it will be "This isn't a true chat bot, this is actually a Markovian language interface!"


It's a next token prediction library. "Markov chain" basically means finite state machine and is a looser concept. If you want to call the token prediction methodology Markovian you can - sounds cool! The implementation another person linked here ranking words would also qualify any LLM as using Markovian dynamics, but what is the point of calling it something so abstract? More accurately, it's literally a language model:

  {
    I: { 
      want: { 
        to: { 
          be: { ... }, 
          know: { ... }
        }
      },
      will: { ... }
    },
    ...
  }
Every word of every sentence is modeled and ranked, and there are methods to perform operations on it. If you added a lot more words and phrases to the model, it would be a "large" language model. It also supports non-words though, so it's more accurately a "next token prediction library" that can be used to create language models.


The language is not really related to this; people have been doing this type of thing in Prolog and Lisp since before the 90s. Actually, with predicates to represent facts - which you learn in your first Prolog class - you can do what you did but in any order: not just following the hashmap from start to end, but creating any sentence starting from any point without effort. (Note also that the language doesn't matter there either; you can do this in any language, it's just that for Prolog this is the first thing you will learn, as it's a fundamental strength of the language.)

I recommend reading up on symbolic AI, or better, just a general AI book: I was raised on Norvig (90s) but have no idea how up to date it still is.


Yep it can work for "next-pixel prediction" too where the token is a pixel. If I have enough samples of arrangements of pixels, it's not fundamentally different than samples of arrangements of words. I'll check out that book - thanks!


Please change the title; it seems very misleading. It has nothing to do with LLMs.

https://github.com/bennyschmidt/next-token-prediction/blob/d...


Ok, we've taken the reference to LLMs out of the title now. (Submitted title was "Show HN: Next-token prediction in JavaScript – build fast LLMs from scratch")


Thank you!


It's a next token prediction library, not an LLM. The title says "Build LLMs from scratch" and that's exactly what you can do with it :) Keep in mind an LLM is not "ChatGPT" though ChatGPT uses an LLM.


> "Build LLMs from scratch" and that's exactly what you can do with it

Can you pls explain what you just said as if I were a 5-year-old?


Instead of using someone else's language library pre-trained on a bunch of text, you can create your own language library trained on your own text.

Use cases include: Auto-completion, auto-correct, spell checking, search/lookup, conversation simulation (chatbot), and more.

"Trained" means that every word of every sentence of every book loaded in has an embedded score representing how likely it is to appear following a user-provided cursor. The more text you have trained on (see /training/embeddings/) the more scored words it has available, so the more robust and accurate it can be with language.

What would make your implementation a "Large" Language Model is if it has many, many, many words and phrases modeled and ranked, not just a small amount.

Chat Bots:

For something like a realistic conversation simulator (AKA a chat bot) you usually need a Large Language Model (LLM) so that any user-provided cursor yields relevant responses. To filter out the noise of less relevant responses, a good LLM-based chatbot will also use parts-of-speech and sentiment analysis ("attention") to match to the most relevant one.

Example cursor: "Paris is". You might get "in France" or "really nice" as equally ranked predictions; what would make a chat-specific model choose one or the other is analyzing the input - for example, if the question was "Where is Paris?" the chat model would be biased toward answers that mention locations, and respond with "in France".

User: "Where is Paris?"

Convert to cursor: "Paris is"

Get completions: "in France", "really nice", "a first name", etc.

Analyze input: "Where is" refers to a location

Analyze completions: "in France" refers to a location

Answer: "Paris is in France"

Hope this helps!


Out of curiosity, why did you link the comment block of the `train` function? Can't figure out the significance of linking this or how it supports your claim.


He ain't gonna answer is he? Kinda crazy how karma/reputation works on these sites, you just get what you want and have to explain nothing


You've made something interesting.

Many people have already answered this here [1], so there's no point in me explaining it again.

Dropping some pointers that will be helpful for you:

  - https://karpathy.ai/zero-to-hero.html

    - Highly recommend this one https://www.youtube.com/watch?v=PaCmpygFfXo

  - https://github.com/karpathy/llm.c

  - https://ig.ft.com/generative-ai

1: ex: https://news.ycombinator.com/reply?id=40007096&goto=item%3Fi...


Thanks but I'm still curious why you linked to the transformer or why it's relevant to your claim of "has nothing to do with LLMs" which held enough weight to get the title of the post changed.


Or just keep dodging the question lol


Calling this an LLM and claiming:

> With enough training data and a good chat interface, this can be used instead of well-known decoder-only models like GPT, Mistral, etc.

Seems to be very misleading...


> enough training data

> good chat interface

I'm going to try to do that here: https://github.com/bennyschmidt/llimo

I totally understand your skepticism, and am also surprised it works so well as a sentence-finisher, even with hardly any data (a handful of books). Think about it like this:

If you had a file with billions of sentences:

"Paris is in France" "Apples grow on trees" "Gold is worth more than silver" "I know where to go!" etc...

Then you can complete virtually any sentence. Say the user enters "Paris" - you can easily find and return "is in France" by simply searching for "Paris" in the text then slicing out the rest of the sentence, "is in France" -- or just give the next word "is" for more of an auto-suggest feel.

But if there were 4 sentences starting with "Paris", it gets slightly more complex, now I have to rank them to know which one is the best suggestion:

"Paris is in France." "Paris is nice this time of year." "Paris Hilton liked my post!" "Paris was my favorite city overall."

In this case, "is" is still the best because it scores higher than "Hilton" and "was" - because it follows "Paris" more frequently. So in addition to the text file of billions of sentences, I need to make a ranking system to give a point score to every possible word in the file at every possible position it might be in.

To make it all faster (because a super massive text file is not feasible), the billions of sentences are not actually represented in a single text file, but as a deeply nested JavaScript object our computers are more optimized to traverse and lookup, along with the point values for each word.

At this point, _with enough sentences on file_ with ranked suggestions and fast lookups, you can complete almost anything a user could input. This is what I shared today.
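
Here's that ranking idea as a tiny illustrative sketch (not the library's actual code):

  const sentences = [
    'Paris is in France.',
    'Paris is nice this time of year.',
    'Paris Hilton liked my post!',
    'Paris was my favorite city overall.'
  ];

  // Count how often each word follows a given word
  const followers = {};

  for (const sentence of sentences) {
    const words = sentence.toLowerCase().replace(/[.!?]/g, '').split(/\s+/);

    for (let i = 0; i < words.length - 1; i++) {
      const current = words[i];
      const next = words[i + 1];

      followers[current] = followers[current] || {};
      followers[current][next] = (followers[current][next] || 0) + 1;
    }
  }

  // The best suggestion is the highest-scoring follower
  const predictNext = word =>
    Object.entries(followers[word] || {}).sort((a, b) => b[1] - a[1])[0]?.[0];

  console.log(predictNext('paris')); // 'is' (follows "paris" twice; the others once)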

> good chat interface

To put the sentence-finisher to use as a chat bot, all you need to do is convert the user's question or "prompt" into the beginning of a sentence. For example:

"Where is Paris?" -> "Paris is" and let the completer give you "in France". So the answer is: "Paris is in France."

Does this make sense?


This thing you are doing is called a "Markov chain", search it up and read about it.


It's a next token prediction library. "Markov chain" basically means finite state machine and is a looser concept. If you want to call the token prediction methodology Markovian you can - sounds cool! The implementation another person linked here ranking words would also qualify any LLM as using Markovian dynamics, but what is the point of calling it something so abstract?

More accurately, it's literally a language model:

  {
    I: { 
      want: { 
        to: { 
          be: { ... }, 
          know: { ... }
        }
      },
      will: { ... }
    },
    ...
  }
Every word of every sentence is modeled and ranked, and there are methods to perform operations on it. If you added a lot more words and phrases to the model, it would be a "large" language model. It also supports non-words though, so it's more accurately a "next token prediction library" that can be used to create language models.


Markov chain would be more sophisticated/advanced.

This is an unoptimised, naïve implementation of an n-gram language model - an idea from seven decades ago [0].

[0] "A Mathematical Theory of Communications" CE SHANNON, 1948


How many comments are you going to leave, what is it Day 3 for you?

Markov chain is equivalent to "state machine" and I can't believe the number of braindead people on this page who don't know this basic fact.

> "The Markov Property states that the probability of future states depends only on the present state."

> "A Markov chain is a type of Markov process that has either a discrete state space or a discrete index set (often representing time), but the precise definition of a Markov chain varies."

https://en.wikipedia.org/wiki/Markov_chain

^ You could have spent the last 3 days learning this basic fact instead of trolling Hacker News. Notice it has nothing to do with token prediction specifically. It's just a loose philosophical concept that means "finite state machine" (AKA deterministic/predictable sequence of states). The React library "XState" is said to implement a Markov chain. Think about what the value would be in saying "this library is trash! All it is is a Markov chain!" totally missing the point of what it does. GPT uses next-token prediction too - from sEvEn dEcAdEs aGo~ (probably more tbh, that's all you found?)

For the sake of your hilarious argument - the data structure I use to model language is not "either ngram or Markov chain" - Markov chains use ngrams in the form of unigrams, bigrams, and trigrams (or if an unknown number: "ngrams"). They're not concepts at odds lol. I hope you learned something here, but I doubt it.

Finally, the data structure the next-token-prediction lib uses is really none of those concepts, it's more accurately a "language model", it's not a state machine at all. One guy said "Markov" and people parroted them in Reddit fashion, and now I get to deal with the bottom-of-the-barrel (you). You really should educate yourself, it would do wonders.


You're conflating state machines with Markov chains. Markov chains are stochastic; the XState library is not meant for Markov chains - I doubt it has any support for state transitions from probability distributions.

Your library is an n-gram based model.


Day 4 :D

Re: "stochastic" flawless copy pasta from Google but a Markov chain is still an example of a (finite) state machine and is not itself an implementation of anything.

n-gram language models are an implementation though, and not a competing concept:

With a language model, you could talk about "a three word Markov chain" or you can simply say "a trigram". You can say "A Markov chain of variable length" or you can say "an ngram". That is all that is meant regarding those 2.

If a Markov assumption is that you can predict the next word based on knowing all the previous words, then a bigram assumption would be that you can predict the next word based on the previous 1 word. A trigram assumption is that you can predict a word with 2 previous words, because they're all 3 part of the same trigram.

More from Stanford on language models (LM):

> "Models that assign probabilities to sequences of words are called language models or LMs. In this chapter we introduce the simplest model that assigns probabilities to sentences and sequences of words, the n-gram."

> "Markov models are the class of probabilistic models that assume we can predict the probability of some future unit without looking too far into the past. We can generalize the bigram (which looks one word into the past) to the trigram (which looks two words into the past) and thus to the n-gram."

https://web.stanford.edu/~jurafsky/slp3/old_jan23/3.pdf
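
To make the bigram idea concrete, here's a tiny illustrative sketch (not the library's code) of estimating P(next | previous) from counts:

  const corpus = 'I want to be at the beach I want to know the answer'
    .toLowerCase()
    .split(/\s+/);

  // P(next | prev) ≈ count(prev next) / count(prev)
  const bigramProbability = (prev, next) => {
    let prevCount = 0;
    let pairCount = 0;

    for (let i = 0; i < corpus.length - 1; i++) {
      if (corpus[i] === prev) {
        prevCount++;
        if (corpus[i + 1] === next) pairCount++;
      }
    }

    return prevCount ? pairCount / prevCount : 0;
  };

  console.log(bigramProbability('want', 'to')); // 1 (every "want" is followed by "to")
  console.log(bigramProbability('to', 'be'));   // 0.5 ("to" is followed by "be" once, "know" once)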

Looking forward to your next comment!


LLMs can handle questions and answers they have never seen word by word before though.


I think this is the key. A system based on just counting the words in its database will give 0% for sentence continuations not in the database, but LLMs don't look up, they extrapolate, and can provide good sentence completions for sentences that are entirely new.


Two things: 1) This library will continue predicting words too for every word that exists in the model; it just makes less sense as you keep going. That's also true of any LLM. 2) An LLM is not synonymous with "chat bot".


Stick with it and eventually you'll invent your own subset of Prolog.


Nice that the topic gets you enthusiastic! If you want to learn about LLMs, I suggest you try educational material that doesn’t simplify it too much (i.e. not videos), before implementing something.

Personal recommendation: https://web.stanford.edu/~jurafsky/slp3/ Chapters 6-10 should give you a good understanding of these concepts, also from the mathematical point of view!


Amazing resource thank you!

Agree regarding videos for learning technical concepts - strangely, when things are over-simplified they can become harder to understand on a technical level, I guess because they're so abstracted. Example: The embeddings in the video I mentioned were always visualized spatially, whereas I simply think about them in terms of numerical point values. Like in the movie Hackers where they're traveling through 3D "cities" of "data" or whatever is going on. On one hand it simplifies the domain for a wider audience, yet it turns it into something else entirely!

Appreciate you sharing these.


wow Thanks!


This is an extremely inefficient trie. It's just a data structure, not an LLM.

It's used for autocomplete, but it's a bad way to represent large amounts of text because the fanout is insane. Markov chains would be an improvement on this.

https://en.wikipedia.org/wiki/Trie


For chat, there's basically a missing NLP layer - what pop culture calls "multi-headed attention blocks", but for next token prediction it's pretty good. Maybe the philosophical concept of a Markov chain abstractly describes this code, but a "language model" is more closely what it is.


I'm telling you this is literally a trie dude.


Yep, it's also "just a Markov chain", and "just a data structure", and "just lookups" but what I'm telling you is that it's really (fr fr) a next-token prediction library that you can use like this:

  // Predict the next word

  agent.getTokenPrediction('what');

  // Predict the next 5 words

  agent.getTokenSequencePrediction('what is', 5);

  // Complete the phrase

  agent.complete('hopefully');

  // Get a top k sample of completion predictions

  agent.getCompletions('The sun');
Hope this helps!


Yes, I know what it is and what you can use it for. I have built my own trie before and like I said this data structure is used for autocompletion. A Markov chain is different.

It's just not going to be able to scale up like an LLM because storing all the possibilities in a tree structure like this will take way too much memory.

You've stumbled upon something you think is new, when it's in fact really, really old and has nothing to do with LLMs.


Take it up with the Markov chain people on this page.

It would be too much memory if I kept the data in strings, or even if I used a Python list, but it's not when I use a JavaScript object. It's one of the main points I explained in the original post.

You and 2 or 3 others have said "has nothing to do with LLMs" when if you watched the 3Blue1Brown video you would know that GPT uses next-token prediction and it's fundamental to prediction.

I never said it was new, and of course have used and built auto-complete in the past, the NPC dialog in one of my games is structured this way. What's novel is crawling huge amounts of data and creating a much larger structure I knew was possible, and realizing at a certain size it can predict further out.

Some of you people here are impossible. One guy went as far as to say "Auto-complete is not possible using the concepts used in your project." when he can literally `npm i next-token-prediction` and try it himself - I even put an example project in the directory. It's really sad; I expect this level of in-total-denial pretentiousness on Reddit, but not HN - it should be more tinkery and evidence-driven here :) Take care.


I think this rather misleads people about what an LLM actually is and why they are so capable.


Yes adding a human interface layer (like ChatGPT) will make it more capable, but keep in mind LLM does not mean "chat bot".


Nice work! I did something in a similar vein, using JavaScript and Markov chains back in 2015 or so and ended up giving a fun talk about it at a coding bootcamp. [1]

I based it off a Ruby module called “Twitter Ebooks”, which I believe was based on a Python module called “horsey books”, which could create a bot that would make posts using Markov chains, given a corpus of text. [2]

(I’m not dropping the date to be a jerk, more to point out that Markov chains / n-grams are an interesting concept for experimenting with language models and have been around for a while.)

In my case, I fed a corpus of data (my Twitter history at the point, which was something like 20K tweets) and tried to create a Twitter bot to make intelligent sounding posts.

It worked out, sometimes. But, it was a lot of fun to figure out!

[1] “Building intelligent robots using Node.js that can conquer Twitter” - https://www.youtube.com/watch?v=rMmXdiUGsr4

[2] https://github.com/longears/horsey-books


Awesome video, and really cool. I loved the ATTCares convo haha. I first heard about Markov chains from a philosopher named Donald Hoffman, but this demo makes it more clear! Thanks for sharing


Man, watch the video from Karpathy and learn about prefix trees (tries).


Seems like an Elon Musk influencer puke but I liked the headlines of some of the videos. Unfortunately they're so incredibly boring I couldn't stand to watch another second. Thanks for sharing!



This is awesome, thanks. I've been messing with wink's NLP library (https://winkjs.org/wink-nlp/) to transform user queries and format responses so I can make a proper chat bot - will see what I can learn from these!



Very nice, thanks for sharing! I noticed their token sequence length is fixed:

  Constructing 7-gram LM ...
  Created trie with 21,502,513 nodes.
People on here will be happy to say that I do a similar thing, however my sequence length is dynamic because I also use a 2nd data structure - I'll use pretentious academic speak: I use a simple bigram LM (2-gram) for single next-word likeliness and separately a trie that models all words and phrases (so, n-gram). Not sure how many total nodes because sentence lengths vary in training data, but there are about 200,000 entry points (keys) so probably about 2-10 million total nodes in the default setup.

"Constructing 7-gram LM": They likely started with bigrams (what I use) which only tells you the next word based on 1 word given, and thought to increase accuracy by modeling out more words in a sequence, and eventually let the user (developer) pass in any amount they want to model (https://github.com/google-research/google-research/blob/5c87...). I thought of this too at first, but I actually got more accuracy (and speed) out of just keeping them as bigrams and making a totally separate structure that models out an n-gram of all phrases (e.g. could be a 24-token long sequence or 100+ tokens etc. I model it all) and if that phrase is found, then I just get the bigram assumption of the last token of the phrase. This works better when the training data is more diverse (for a very generic model), but theirs would probably outperform mine on accuracy when the training data has a lot of nearly identical sentences that only change wildly toward the end - I don't find this pattern in typical data though, maybe for certain coding and other tasks there are those patterns though. But because it's not dynamic and they make you provide that number, even a low number (any phrase longer than 2 words) - theirs will always have to do more lookup work than with simple bigrams and they're also limited by that fixed number as far as accuracy. I wonder how scalable that is - if I need to train on occasional ~100-word long sentences but also (and mostly) just ~3-word long sentences, I guess I set this to 100 and have a mostly "undefined" trie.

I also thought of the name "LMJS", theirs is "jslm" :) but I went with simply "next-token-prediction" because that's what it ultimately does as a library. I don't know what theirs is really designed for other than proving a concept. Most of their code files are actually comments and hypothetical scenarios.

I recently added a browser example showing simple autocomplete using my library: https://github.com/bennyschmidt/next-token-prediction/tree/m... (video)

And next I'm implementing 8-dimensional embeddings that are converted to normalized vectors between 0-1 to see if doing math on them does anything useful beyond similarity, right now they look like this:

  [nextFrequency, prevalence, specificity, length, firstLetter, lastLetter, firstVowel, lastVowel]
For example, here's the embedding for "dog":

  [0.6666666666666666, 0.6, 0.45714285714285713, 0.15, 0.12, 0.24, 0.56, 0.56]
where it used to just be 1D:

  { dog: 23 }
Up until now I've only effectively used nextFrequency to predict the next word, but with these other 7 metrics I'm able to store more info about each word and being vectors am able to store it efficiently, and do what all the smart people keep saying like dot product similarity index etc. The big boys are using like 1536D arrays etc. but with just 7 it still works, and it's weird that it works.
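
The similarity part is just a dot product over those normalized vectors - illustrative sketch, with made-up values for a second word:

  const embeddings = {
    // [nextFrequency, prevalence, specificity, length, firstLetter, lastLetter, firstVowel, lastVowel]
    dog: [0.67, 0.6, 0.46, 0.15, 0.12, 0.24, 0.56, 0.56],
    cat: [0.61, 0.55, 0.44, 0.15, 0.08, 0.8, 0.04, 0.04]  // made-up values
  };

  const dot = (a, b) => a.reduce((sum, value, i) => sum + value * b[i], 0);

  console.log(dot(embeddings.dog, embeddings.cat)); // higher score = more similar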

Thanks again for sharing!


Their library was actually made for Dasher: http://www.inference.org.uk/dasher/ - there was a web version being made (https://github.com/dasher-project/dasher-web - we hit a bottleneck with the graphics driving; note that in Dasher pretty much the entire tree is in dynamic view). This may help to understand the use case. Dasher is for people with disabilities who can't speak. It needs to be a personalised LM that trains on the fly and keeps track of new words/sentences. But in truth, utterances are usually small.

Don't get too knocked back by comments. A) If it works - it works. B) Your learning is as valuable as the outcome.

Oh have a look at https://imagineville.org/software/ for some other things that may be of interest..


The visual tool on their site is trippy! Very cool.

I appreciate the words of wisdom/motivation :)

Since my last comment the embedding vector is now 16 dimensions, but I'm not quite getting "King - Woman = Queen" from it yet. It actually does find similar words if I score word features, normalize the scores to a min/max spectrum between 0-1, then multiply 2 vectors (dot products) to get similarity. But in the middle of the experiment I was like "what is this for..?" so I just kinda stopped working on the embedding refactor until I have a real world use case.

I know for example text-embedding-ada-002 uses 1536 length arrays, and others use over 3k, so maybe at some point the usefulness of having very complex embeddings emerges.

For now, the original approach still seems superior for next token prediction: bigram frequency (how often a word follows another) just needs enough sentences modeled and scored.


This is what HN is all about. This probably isn't your next big LLM project, but it's a wonderfully simple example that shows what is happening underneath the hood. Do you plan on writing up something like a blog post to go with it?


It is a fun and interesting project, but it doesn't really show what is happening under the hood of LLMs since they work on a completely different principle.


> LLMs work on a completely different principle

News to me - can you tell me what principle you think an LLM works on?


LLMs are artificial neural networks. What you seem to be describing is a trie. You don't use tries at all when building neural networks; you use matrices, and typically algorithms like backpropagation to train the weights of the network.

Now, I haven't gone through your code, I'm relying mainly on your description, so what you've said may have confused me and be different from what you've done, but what you seem to be calling an 'embedding' is a trie storing a dataset of sentences. The actual embedding used in LLMs is a mapping from words (a word/token list is effectively a 1 dimensional space) into a much higher dimensional space that captures aspects of the meaning of the words being embedded (e.g. similar concepts may map to similar vectors in the space, or you might be able to do classic word arithmetic like king - man + woman = queen with the vectors that each word maps to).

Then the neural network works, ultimately resulting in a vector in that space, which we map into the next token.

This is a very different process to looking up a set of options in a trie, and I would expect it to outperform the trie method by an enormous amount on longer outputs, or where outputs that are not in the training data are needed.
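
To illustrate the word-arithmetic idea with made-up 3-D vectors (real embeddings have hundreds or thousands of dimensions):

  const vectors = {
    king:  [0.9, 0.8, 0.1],
    man:   [0.5, 0.1, 0.1],
    woman: [0.5, 0.1, 0.9],
    queen: [0.9, 0.8, 0.9]
  };

  const add = (a, b) => a.map((v, i) => v + b[i]);
  const subtract = (a, b) => a.map((v, i) => v - b[i]);
  const distance = (a, b) => Math.hypot(...a.map((v, i) => v - b[i]));

  // king - man + woman lands closest to queen in this space
  const result = add(subtract(vectors.king, vectors.man), vectors.woman);

  const nearest = Object.keys(vectors)
    .sort((a, b) => distance(vectors[a], result) - distance(vectors[b], result))[0];

  console.log(nearest); // 'queen'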


Sure, artificial neural network is a great analogy. But that leaves out what it literally is, which is a next-token prediction library that you can use like this:

  // Predict the next word

  agent.getTokenPrediction('what');

  // Predict the next 5 words

  agent.getTokenSequencePrediction('what is', 5);

  // Complete the phrase

  agent.complete('hopefully');

  // Get a top k sample of completion predictions

  agent.getCompletions('The sun');
And it has this ability to predict because it's based on a language model that turns mere sentences like: "I want to be at the beach", "I will do it later", "I want to know the answer" into a single model:

  {
    I: { 
      want: { 
        to: { 
          be: { ... }, 
          know: { ... }
        }
      },
      will: { ... }
    },
    ...
  }
This model - call it a prefix tree (trie), call it a Markov model or "chain", or "just lookups", anything you like - but more specifically than all of those, it's a language model.

Embeddings: Where my embeddings differ is that they're kept in a structure reflective of their source context. That's how I know what the meaning of the word is. OpenAI's embedding is like this:

  Text: "The dog"

  Embedding:
  
    {
      "object": "embedding",
      "index": 0,
      "embedding": [
        -0.006929283495992422,
        -0.005336422007530928,
        -4.547132266452536e-05,
        -0.024047505110502243,
        ...
      ],
    }
And those values, particularly those for "dog" will be - like you said - computable with others to find things like similarity, etc. And using that math, they can tell you what the next word will be - and probably a lot of other useful chat-specific things I haven't gotten into.

My method is also commonly used in machine learning, and would be TF-IDF vectorization except that I don't fully vectorize the semantics, because it's just based on frequency and token structure:

  {
    "the": {
      "dog": {
        "was": 834,
        "is": 721,
        "will": 36
        "has": 442
      }
    }
  }
   
This system predicts "was" as the next word because it usually is the next word after "dog" (in the source data). This library was built to ultimately provide completions, not have a conversation, so no doubt OpenAI's approach works better for chat.

I am however already making a chat model. Here's my approach if anyone cares: The completer already gives great completions and fast, but some of them make no sense to what was asked. The chat model I'm working on here (https://github.com/bennyschmidt/llimo/pull/1) can just get all completions and use parts-of-speech codes to match a completion to the cursor. I don't have this fully implemented yet, but you can get the idea in this PR. This is like an NLP layer specific to chat - has nothing to do with the next-token prediction in general, and there are no NLP libraries in `next-token-prediction` (the npm). The example I've been using to explain this is:

  User: "Where is Paris?"

  Convert to cursor: "Paris is"

  Get completions: "in France", "really nice", "a first name", etc.

  Analyze input: "Where is" refers to a location

  Analyze completions: "in France" refers to a location

  Answer: "Paris is in France"
I hope it makes sense! Thanks for your comment.

Edit: And yes, again, the library does not claim to "be an LLM" - it's not that at all, it's actually a next-token prediction library that has methods like `getTokenPrediction` etc. Though you are set up to build an LLM with this if you train enough data :) The library is not going to provide a `chat` function like GPT, not going to provide an img2img API like Stable Diffusion, that would be whatever chat-based or image-based LLM is being made with this library.


> Sure, artificial neural network is a great analogy. But that leaves out what it literally is

It's not an analogy - an artificial neural network is literally what an LLM is. A trie is not an LLM, because it's not a neural network. If someone asked how many parameters a trie loaded with many sentences has, that question wouldn't even make sense, because it doesn't model weights between neurons, unlike an LLM.

I think perhaps you're getting confused between the class of 'next token predicting algorithms' (which is a class with lots of different algorithms in it) and 'LLMs'. LLMs are a specific kind of next token predicting algorithm that work in a particular way. You can make a next token predictor out of a trie (as you have), but it is not an LLM.


Again, the library does not claim to "be an LLM" - it's not that at all, it's actually a next-token prediction library that has methods like `getTokenPrediction` etc. Though you are set up to build an LLM with this if you train enough data :) The library is not going to provide a `chat` function like GPT, not going to provide an img2img API like Stable Diffusion, that would be whatever chat-based or image-based LLM is being made with this library.

(reply was not visible before, so I appended as an Edit)


> Though you are set up to build an LLM with this if you train enough data

But you aren't. No amount of data loaded into a trie will turn the trie into a neural net.

> The library is not going to provide a `chat` function like GPT,

Providing a 'chat' function is nothing to do with what a LLM is. Here's the first sentence from the second paragraph of wikipedia on Large Language Models: "LLMs are artificial neural networks." When someone points out that you wouldn't use this kind of library to create an LLM, you generally say something about how you think they're complaining that it doesn't chat with you. It's not about chat, it's about the fact that the functionality you've written about has nothing to do with the way that LLMs work. LLMs are artificial neural networks. What you've written is a trie, not an artificial neural network. Artificial neural networks need fast matrix multiplication, not tries.


The entire library is not just a data structure, it just uses it to model language. You still haven't npm installed have you? If you spent 5 minutes with it, it would answer your questions. You use it like this:

`agent.getTokenPrediction('when')`

It's not "bro it's just a trie" like you keep saying no matter what I tell you or try to show you. "A trie" is not a JavaScript library you can install with methods like "getTokenPrediction" or "complete" that help you auto-complete sequences.

And since you love the neuron model so much, know that you can train on data that begets the same node/weight based "learning" that an ANN does, I just haven't gone all-in on the "neuron" analogy. You know - other people here want me to call it a Markov chain, not a trie at all, and don't want me to call the semantic score "embeddings" because they aren't stored as numbers, so one thing I've learned is this phrasing ("branding") matters more than functionality.

Your last paragraph - yeah we agree, and I've been trying to tell a lot of other people that "LLM !== chat bot". Providing a `chat` method doesn't make it equivalent to an LLM so I didn't provide anything related to chat. I have no idea what kind of AI model you might build - could be next-pixel prediction and you get into Stable Diffusion-like functionality, not chat at all.

"What you've written is a trie" I didn't write a trie, they're automatically generated during training. The library has nothing to do with that, but you keep saying it does. What I did write is a next-token prediction library for giving you the next likeliest letter, word, pixel, etc.

Hope this makes sense!


Hell yeah :) At first I thought a simple Show HN titled "Next token prediction in 100 lines of JavaScript" would be cool, but the project grew and now it's a few different modules and things going on.

A couple people have basically said "but what does it do?", "what value does it add?"

So I am planning a series of examples with videos for use cases like: Auto-completion, auto-correct, spell checking, search/lookup, and if I can get around to making an ImageDecoder, could get into "next-pixel prediction".

Beyond prediction, I want to make a YouTube video when I can show a working example of a really good chatbot that anyone can just npm install and use - it would be funny to give some of the open models a run for their money - but you're right, it's just a hobby project for now. Working on that in a separate project: https://github.com/bennyschmidt/llimo

Thanks for the comment and motivation to write about it!


> inb4 "just a Markov Model"

A Markov chain is a much looser concept than next token prediction, and would apply to any finite state machine where you can predict the next state. The React state management library XState is a Markov chain.

If you want to call a token-prediction library that trains on .txt files Markovian you can - sounds cool! But token prediction is more specific to what the library does.

Use cases include: Auto-completion, auto-correct, spell checking, search/lookup, conversation simulation (chatbot), and more.

I appreciate all the comments! Especially from those who watched the video and get it.


Sorry OP, you seem to have a wrong understanding of what an LLM is, does and how it's trained. And I know that an LLM is not a chatbot, no worries.

At most, your project is a simple language model, but definitely not a large language model.

After looking at your code, you also seem to have a wrong understanding of embeddings. Other people in this thread have already shared great resources that offer good explanations on these topics.

You seem to be very dismissive of Markov chains (or HMMs) while they've been used in NLP for years and produce really great results. Your project seems to use some concepts of Markov chains but more simplified. Here [1] is a random article that does next-token prediction using Markov chains. It's basically what your project does but in a more scalable and probabilistic way.

>Use cases include: Auto-completion, auto-correct, spell checking, search/lookup, conversation simulation (chatbot), and more.

These use cases are not possible using the concepts used in your project.

[1] https://bespoyasov.me/blog/text-generation-with-markov-chain...


Sorry you feel I haven't been grateful enough when people share resources and comments, I really do appreciate it - and I appreciate what you just shared here too. Thanks!

I see a lot of parallels between this exercise and what I did - their use of generators is exceptionally slick. I bet theirs is way slower though. Mine has 1 dep (lodash), just for its deep merge function. I didn't even use Wink NLP for text splitting and other tasks because it was 10x or 20x slower than my simpler stuff from Stack Overflow. I actually like your description of "Markov chains but simplified". But I really think the devx of `npm i simple-markov-chain` is worse than `npm i next-token-prediction`. I want people to actually use it for fast prediction.

AFAIK the only difference between a language model and a "large" one is the amount of data it's trained on. Because that translates into a much larger model.

Yeah this concept of embeddings is different than, say OpenAI's. They capture a broad spectrum of semantics, where mine only focuses on frequency and structure. However, I find that very often frequency and structure allows you to bypass semantic analysis and just get to the answer. The frequency method I'm using is I guess a bastardization of tf-IDF vectorization, let's call it "tf-IDF JSON.stringify!"

"These use cases are not possible using the concepts used in your project."

^ not sure what you mean by this one, feel free to try it yourself or watch the video demo


Go forth and ignore the naysayers. Take advice and criticism that can help you further your project. Godspeed!

If entire models come down to weights and big a** JSON, who's to say this isn't useful?


I really appreciate this, thanks! Sea of haters on HN lol some are just confused or perpetually angry. I took some of them too seriously, shouldn't have even replied, especially those mentioning "Markov chains" as if that narrows it down further than "next-token prediction" which is exactly what it is and what the npm is called.

Usefulness: My goal by next week is to share a decent AI chat bot like ChatGPT using this library. I'm working with dang right now over email - who is one of the most cool and helpful moderators of all time - to get it unflagged, or at the very least to un-ruin my account so I can actually share future progress.

Turns out people really hate any definition of LLM that doesn't involve a black box of magic.

Thanks!


The only thing it shows is that the author has a lot to learn.


Of course I do, who doesn't? Can you pick 1 thing I need to focus on most so I can get some value out of this comment? :D


You can't "Demystify LLMs for people" if you don't know how it works.

And it definitely isn't a Markov model (https://en.wikipedia.org/wiki/Hidden_Markov_model) like you implied it is.

As far as the "emergent" apparent intelligence of SoTA LLMs goes, it's not just counting the number of words that appear after a phrase. I don't think anyone has a good explanation of how it works so far, only that empirically a bunch of techniques combined together seems to do the trick, and LLMs do seem to acquire intelligence far beyond what a Markov model could acquire after the training process.

Therefore, see eg. https://www.reddit.com/r/LocalLLaMA/comments/1bgh9h4/the_tru...

A joke, but it's the truth as we know it.

> Of course I do, who doesn't?

Implying you (who apparently just watched a single video) are as ignorant as people who have read extensively on the subject (not me, but presumably many others here) isn't helpful. I mean, there's nothing stopping you from believing you know enough to explain what LLMs are to other people, but the GP is honestly telling you what you need to hear IMHO.


I never said or implied it's a Markov model, someone else did in this thread, and someone parroted them. A Markov chain is a much looser concept than next token prediction, and would apply to any finite state machine where you can predict the next state. The React state management library XState is a Markov chain. Has nothing to do with language at all.

I don't appreciate you calling me ignorant, but I think the source of your projection is this phrase: "I don't think anyone has a good explanation of how it works so far" which translates to you don't know the basics. I get it, makes you mad for some reason lol But maybe it's you that needs to read up on what a language model actually is - what next token prediction means, and how it applies to LLMs you know and love. It really is not rocket science :)

Thanks for your comment!


hnfong was referring to emergent abilities, not to "the basics" – the definition of emergent abilities as they apply to language models is that they are not present in smaller models, only in large ones.

That also implicitly defines that a model without "emergence" would not be large.

If you happen to know exactly why this happens, I'll definitely read your paper!


Prepare to be underwhelmed - no need for a paper for this one: The only emergent ability of an LLM is that the model is more robust when there are more samples. When the number of samples is in the trillions instead of thousands, there are a lot more complete concepts available for it to match to your query.

The so-called "emergent ability" of a chat-focused LLM is its accuracy, which is only possible with enough sample data, and also only possible if it's good at matching a query to the right samples, and wording a response in a way that's pleasing to the end user - something even the mainstream LLMs struggle to do well.

IME, pretty much only GPT-4 and Mistral are even that good. Most of them aren't that good at anything yet, and half the time don't follow what I ask at all. Being a very large model is only part of what makes it good.

The sad state of Hacker News is not how many people conflate "LLM" with "chatbot" but that an academic paper holds more weight than a library you can npm install right now and try. Seems we lost our way if abstract appeals to authority hold more weight than evidence before your eyes.

Thanks for your comment! I do appreciate the discussion.


Love doesn't always come in the form of flattery or praise, and if you took that as an assault or even an insult (which TBH it wasn't) instead of pointing out the limits of your knowledge, then that's a shame.

I mean, it's pointless to compare the size of our intellectual capabilities on the internet since we don't know each other, but perhaps I was wrong for a second that you might have actually wanted to "get some value" of the comment of "author has a lot to learn".


What is love? Baby, don't hurt me Don't hurt me no more

xD

Typically falsely accusing then calling a person ignorant is somewhat insulting! But now that you said it was in the name of love I'm tearing up (in a good way).

And you're right - what was I thinking? I'm in the presence of a literal genius here. THANK YOU for pointing out the "limits of my knowledge", sometimes it takes a really smart person like you to point it out, even though you got the wrong guy (Markov) and have yet to make a single point about LLMs or next-token prediction, beyond:

"Nobody knows how LLMs work" and a link to a Reddit meme. I can't be too far behind your advanced state of intellectual genius with zingers like these.

Finally, it may not seem like it, but I have derived value from your comments, just not any realizations you'd likely care to hear :)


^ This comment being downvoted, as innocent as it is, makes HN basically Reddit, the guy called me ignorant for sharing an auto-completion library trained on a lot of data, I deal with the immature comment humorously, kindly really, and it's downvoted lol whatever /thumbsdown Hacker News get rid of these people, promote tinkering and intelligence again


I would suggest first removing the reference to LLM in the description:

"build fast LLMs from scratch!"


Why? If I build a chat bot from it would it seem more like an LLM to you?


You're cargo culting "LLM".


You don't know what an LLM is - there's a difference.


You should start from chapter 1.


You've been coming back to this post for 3 days to leave more negative comments, how sad.



