ChatLZMA – text generation from data compression (pepijndevos.nl)
122 points by bschne on Aug 30, 2023 | 18 comments



This reminds me of an interesting experiment I did earlier this year with ChatGPT.

First, I came upon this reddit post [1] which describes being able to convert text into some ridiculous symbol soup that makes sense to ChatGPT.

Then, I considered the structure of my TypeScript type files, e.g. [2], which are pretty straightforward and uniform, all things considered.

Playing around with the reddit compression prompt, I realized it performed poorly just passing in my type structures. So I made a simple script which essentially turned my types into a story.

Given a type definition:

    type IUserProfile = {
        name: string;
        age: number;
    }
It's somewhat trivial to make a script to turn these into sentence structures, given the type is simple enough:

"IUserProfile contains: name which is a string; age which is a number; .... IUserProfiles contains: users which is an array of IUserProfile" and so on.

Passing this into the compression prompt was much more effective, and I ended up with a compressed version of my type system [3].

Regardless of the variability of the exercise, I can definitely say the prompt was able to generate some sensible components which more or less correctly implemented my type system when asked to, with some massaging. Not scalable, but interesting.

[1] https://www.reddit.com/r/ChatGPT/comments/12cvx9l/compressio...

[2] https://github.com/jcmccormick/wc/blob/c222aa577038fb55156b4...

[3] https://github.com/keybittech/wizapp/blob/f75e12dc3cc2da3a41...


I’m curious: did you actually run it through the tokenizer and see whether it used fewer tokens than the uncompressed version? I've seen a lot of people try these “compression” schemes, and token usage can actually be higher.


It's definitely fewer tokens, at least in my contrived case. Looking at the compressed text, I can make out what is what, and see that it's just minimizing words to their root parts.

TypeScript (22 tokens):

    export type IAssist = { id: string; prompt: string; promptResult: string[]; };
Story (26 tokens):

    IAssist contains: id which is a string; prompt which is a string; promptResult which is an array of strings.
Compressed (13 tokens):

    IAsst{id,prompt,promptR}
And again I'll just call this interesting, because is it really going to know promptResult is a string array in most cases? Definitely not unless it gets some help in the component description, maybe.
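For anyone who wants to check counts like these themselves, a quick sketch using OpenAI's tiktoken library (cl100k_base is the gpt-3.5/gpt-4 encoding; exact counts depend on which encoding you pick):

    import tiktoken  # pip install tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    samples = {
        "typescript": "export type IAssist = { id: string; prompt: string; promptResult: string[]; };",
        "story": "IAssist contains: id which is a string; prompt which is a string; promptResult which is an array of strings.",
        "compressed": "IAsst{id,prompt,promptR}",
    }
    for label, text in samples.items():
        print(label, len(enc.encode(text)))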


Lately I've been playing Disgaea PC, which, unlike a lot of games these days, has good text FAQs like

https://gamefaqs.gamespot.com/pc/183289-disgaea-pc/faqs/2623...

and it brought back a question I'd been mulling over for a while: how to extract facts from that sort of document. One notable thing is that certain named entities (say, "Cave of Ordeal") appear over and over throughout the document, and both attention-based and compression-based approaches can draw a line between those occurrences.


Actually a neural network is just that: data compressed with losses. A transformer makes multiple queries to a large, lossy, stochastically compressed database to determine the next token to generate. The PAQ archiver is famous for being just that: a neural network to predict the next symbol.


This paper from the University of Waterloo goes in a similar direction: "Text Classification: A Parameter-Free Classification Method with Compressors" https://aclanthology.org/2023.findings-acl.426.pdf

I'm also posting this short video because it corrects a few numbers that were miscalculated in the paper, and shows slightly less optimized, simpler implementations that still work well: https://youtu.be/jkdWzvMOPuo?si=K1VRtJ5BtqREa2mz. Source code for the video's implementations: https://github.com/Sentdex/Simple-kNN-Gzip
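The core of the paper's method is small enough to sketch: gzip's compressed length as a similarity measure (normalized compression distance) plus k-nearest-neighbors. A toy version with made-up training data:

    import gzip

    def clen(s: str) -> int:
        # compressed length in bytes
        return len(gzip.compress(s.encode()))

    def ncd(a: str, b: str) -> float:
        # normalized compression distance: how much does one text help compress the other?
        ca, cb, cab = clen(a), clen(b), clen(a + " " + b)
        return (cab - min(ca, cb)) / max(ca, cb)

    def classify(query: str, labeled, k: int = 3) -> str:
        # vote among the k labeled examples nearest to the query under NCD
        nearest = sorted(labeled, key=lambda ex: ncd(query, ex[0]))[:k]
        votes = [label for _, label in nearest]
        return max(set(votes), key=votes.count)

    train = [
        ("the team won the match in extra time", "sports"),
        ("stocks fell sharply after the report", "finance"),
        ("the striker scored twice on sunday", "sports"),
        ("the central bank raised interest rates", "finance"),
    ]
    print(classify("a goal in the final minute won it", train, k=3))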


The compressor idea is really clever, but wouldn't it be nice to have 100% direct control over everything?

This got me thinking about the possibility of building a series of simple context/token probability tables in SQLite and running the show that way. Assuming we don't require massive context windows, what would prevent this from working?

It's not like we need to touch every row in the database all at the same time or load everything into RAM. Prediction is just an iterative query over a basic table: you could have a simple key-value mapping from a context to the next most likely token for that context. All manner of normalization and database trickery is available for abuse here. Clearly a shitload of rows, but I've seen some 10TB+ databases still satisfy queries in seconds. You could even store additional statistics per token/context for online learning scenarios (i.e., query-time calculation of token probabilities). You could keep multiple tokenization schemes online at the same time and combine them with various weightings.

What would be more efficient/cheaper than this if we could make it fit? Wouldn't it be easier to iterate over basic tables of data and some SQL queries than to trip over Python ML toolchains and GPU drivers all day?
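A minimal sketch of that table, using Python's built-in sqlite3 and naive whitespace tokens as a stand-in for a real tokenizer:

    import random
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("""CREATE TABLE ngrams (
        context TEXT, token TEXT, count INTEGER,
        PRIMARY KEY (context, token))""")

    def train(text: str, order: int = 2):
        toks = text.split()
        for i in range(len(toks) - order):
            ctx, nxt = " ".join(toks[i:i + order]), toks[i + order]
            db.execute("""INSERT INTO ngrams VALUES (?, ?, 1)
                ON CONFLICT(context, token) DO UPDATE SET count = count + 1""",
                (ctx, nxt))

    def sample_next(ctx: str):
        rows = db.execute("SELECT token, count FROM ngrams WHERE context = ?",
                          (ctx,)).fetchall()
        if not rows:
            return None
        tokens, counts = zip(*rows)
        return random.choices(tokens, weights=counts)[0]

    train("the cat sat on the mat and the cat ran")
    print(sample_next("the cat"))  # 'sat' or 'ran', weighted by count

The upsert keeps counts current, so online learning is just more INSERTs; the hard part is smoothing over contexts you've never seen.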


Weighted Finite State Transducers in speech recognition: https://scholar.google.com/scholar?q=finite+state+transducer...

Modified Kneser-Ney smoothing: https://en.m.wikipedia.org/wiki/Kneser%E2%80%93Ney_smoothing....

We've been here before; neural LMs replaced that generation of models.
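For reference, the interpolated Kneser-Ney estimate for bigrams, the classic answer to the sparse-counts problem the parent's tables would hit (the "modified" variant uses several discounts d):

    P_{\mathrm{KN}}(w_i \mid w_{i-1})
      = \frac{\max\bigl(c(w_{i-1} w_i) - d,\ 0\bigr)}{c(w_{i-1})}
      + \frac{d\,\bigl|\{w : c(w_{i-1} w) > 0\}\bigr|}{c(w_{i-1})}\, P_{\mathrm{cont}}(w_i),
    \qquad
    P_{\mathrm{cont}}(w) = \frac{\bigl|\{w' : c(w' w) > 0\}\bigr|}{\bigl|\{(u, v) : c(u\,v) > 0\}\bigr|}

The continuation probability rewards words that appear after many distinct contexts, not just frequent words.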


And you can still combine them for tasks that require strict output control (e.g. alphanumeric sequence recognition, noisy keyword spotting, strict grammars, etc.).


GPT-4 [0] is actually very good with base64, to the point where it makes perfect sense of it.

I'd be interested in how well you could fine-tune 3.5 to use a different compression scheme.

[0] - https://platform.openai.com/playground/p/hfLUBCTE8RrRYPIRxEe...


Wasn't there some work a while back on training LLMs on compressed data?


Might you be thinking of this? https://news.ycombinator.com/item?id=36732430


A much more efficient implementation than mine, which is at https://github.com/Futrell/ziplm

Instead of sampling strings character-by-character, this one adds random bytes to the compressed text and then decodes.
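A minimal sketch of the random-bytes approach, assuming raw LZMA streams and a decoder that simply stops when the random tail stops making sense (not necessarily the exact scheme ChatLZMA uses):

    import lzma, os

    # raw LZMA2 stream so we control the framing; this filter chain is an assumption
    filters = [{"id": lzma.FILTER_LZMA2, "preset": 6}]
    corpus = b"the quick brown fox jumps over the lazy dog. " * 50
    stream = lzma.compress(corpus, format=lzma.FORMAT_RAW, filters=filters)

    # chop off the end of the valid stream, then splice in random bytes
    data = stream[:-8] + os.urandom(64)

    dec = lzma.LZMADecompressor(format=lzma.FORMAT_RAW, filters=filters)
    out = bytearray()
    for i in range(len(data)):
        try:
            out += dec.decompress(data[i:i + 1])
        except (lzma.LZMAError, EOFError):
            break  # the random tail stopped making sense to the decoder
    print(bytes(out[-120:]).decode("utf-8", errors="replace"))

Feeding one byte at a time keeps whatever partial output was produced before the decoder gives up.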


I immediately thought of your project. Thanks for explaining the difference!


A few days ago I explored adding beam search and other features to the Gzip Language Model: https://github.com/thomasahle/ziplm/

Turns out you can get a lot better text out of these simple models if you add some basic LLM features.
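For context, the basic trick these models share: the compressor is the language model, and a candidate continuation's score is how few bytes it adds to the compressed stream. A crude greedy sketch (the repo above does this properly, plus beam search):

    import gzip
    import string

    def cost(text: str) -> int:
        # bytes needed to gzip the text: the compressor as language model
        return len(gzip.compress(text.encode(), 9))

    def generate(prime: str, n: int = 40) -> str:
        alphabet = string.ascii_lowercase + " "
        text = prime
        for _ in range(n):
            # greedy: append whichever character compresses cheapest in context
            text += min(alphabet, key=lambda c: cost(text + c))
        return text

    print(generate("the cat sat on the mat. the cat "))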


> So, how do you build ChatGPT with data compression?

ChatGPT is already built on data compression: the training loss is cross-entropy, which means the explicit goal of training is to compress the training dataset into the fewest possible bits.
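Concretely, by the source-coding view a model q spends -log2 q(x_t | x_<t) bits on token x_t, so minimizing the average cross-entropy H_CE is the same as minimizing the compressed size of the corpus:

    \mathrm{bits}(\text{corpus}) \;=\; \sum_{t=1}^{N} -\log_2 q\bigl(x_t \mid x_{<t}\bigr) \;=\; N \cdot H_{\mathrm{CE}}

A lower training loss is, quite literally, a smaller compressed corpus.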


Reminds me of trying to read Gulliver's Travels.


You might be thinking of some other literary work; Gulliver's Travels isn't known for being particularly hard to read.

Myself, I was reminded of Finnegans Wake by James Joyce.



