I used this approach extensively over the past couple of months with GPT-4 and GPT-4o while building https://hotseatai.com. Two things that helped me:
1. Prompt with examples. I included an example image together with its transcription as part of the prompt, few-shot style. This cut down on GPT's mistakes and improved output accuracy (a rough sketch of the prompt layout follows after this list).
2. Confidence score. I extracted the embedded text from the PDF and compared the frequency of character triples (trigrams) in the source text and in GPT's output. If there was a significant difference (less than 90% overlap), I would log a warning. This helped detect cases where GPT omitted entire paragraphs of text.
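For (1), the prompt roughly interleaves an example page and its known-good transcription before the real page. Here's a minimal sketch of that idea, assuming the OpenAI Node SDK and gpt-4o; the prompt wording, model choice, and variable names are mine, not the exact setup used for Hotseat:

```typescript
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Hypothetical data URLs and transcription used as the few-shot example.
const exampleImageUrl = "data:image/png;base64,...";
const exampleTranscription = "# Example heading\n\nExample body text...";
const targetImageUrl = "data:image/png;base64,...";

async function transcribeWithExample(): Promise<string> {
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "system",
        content: "Transcribe the page image into markdown, preserving all text.",
      },
      // Few-shot example: an image followed by its known-good transcription.
      {
        role: "user",
        content: [
          { type: "text", text: "Example page:" },
          { type: "image_url", image_url: { url: exampleImageUrl } },
        ],
      },
      { role: "assistant", content: exampleTranscription },
      // The actual page we want transcribed.
      {
        role: "user",
        content: [
          { type: "text", text: "Transcribe this page:" },
          { type: "image_url", image_url: { url: targetImageUrl } },
        ],
      },
    ],
  });
  return response.choices[0].message.content ?? "";
}
```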
One option we've been testing is the `maintainFormat` mode. This tries to return the markdown in a consistent format by passing the output of a prior page in as additional context for the next page. Especially useful if you've got tables that span pages. The flow is pretty much:
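Roughly something like the loop below. This is my own sketch of the idea, not the library's actual code; the function names and prompt text are made up for illustration:

```typescript
// Sketch of the maintainFormat idea: each page is transcribed with the
// previous page's markdown passed along as extra context, so headings and
// tables that span page breaks keep a consistent format.
async function transcribePageWithContext(
  pageImageUrl: string,
  priorMarkdown: string | null
): Promise<string> {
  const prompt = priorMarkdown
    ? `Convert this page to markdown. The previous page ended with:\n\n${priorMarkdown}\n\nContinue using the same formatting.`
    : "Convert this page to markdown.";
  // A real implementation would send `prompt` plus the page image to the
  // vision model here and return its markdown output.
  console.debug(prompt);
  return `<markdown for ${pageImageUrl}>`;
}

async function transcribeDocument(pageImageUrls: string[]): Promise<string[]> {
  const pages: string[] = [];
  let prior: string | null = null;
  for (const url of pageImageUrls) {
    const markdown = await transcribePageWithContext(url, prior);
    pages.push(markdown);
    prior = markdown; // feed this page's output into the next page's prompt
  }
  return pages;
}
```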
I think so. I'd normalize the text first: lowercase it and remove all non-alphanumeric characters. E.g., for the phrase "What now?" I'd create these trigrams: wha, hat, atn, tno, now.
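In code that works out to something like the sketch below; the 90% threshold is from the comment above, but the function names and example strings are just illustrative:

```typescript
// Count character trigrams in a normalized piece of text.
function trigrams(text: string): Map<string, number> {
  const normalized = text.toLowerCase().replace(/[^a-z0-9]/g, "");
  const counts = new Map<string, number>();
  for (let i = 0; i + 3 <= normalized.length; i++) {
    const tri = normalized.slice(i, i + 3);
    counts.set(tri, (counts.get(tri) ?? 0) + 1);
  }
  return counts;
}

// Fraction of the source text's trigram occurrences also present in the output.
function trigramOverlap(source: string, output: string): number {
  const src = trigrams(source);
  const out = trigrams(output);
  let total = 0;
  let matched = 0;
  for (const [tri, count] of src) {
    total += count;
    matched += Math.min(count, out.get(tri) ?? 0);
  }
  return total === 0 ? 1 : matched / total;
}

// Warn when the transcription covers less than 90% of the PDF's embedded text.
const pdfText = "What now? ...";  // text extracted from the PDF
const gptOutput = "what now ..."; // GPT's transcription of the same page
if (trigramOverlap(pdfText, gptOutput) < 0.9) {
  console.warn("possible omitted paragraphs in the transcription");
}
```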
A different approach from vanilla OCR/parsing seems to be ColPali, which integrates a small, purpose-built vision model with ColBERT-style indexing for retrieval. So, if search is the intended use case, it can skip the whole OCR step entirely.
It detects if a message contains the "Final Answer" substring preceded by a specific emoji. The emoji is there to make the substring relatively unique.
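Concretely, the check can be as simple as the snippet below; the particular emoji here is a placeholder I picked, not necessarily the one the project uses:

```typescript
// The emoji prefix makes the marker unlikely to appear in ordinary prose.
const FINAL_ANSWER_MARKER = "✅ Final Answer"; // placeholder emoji

function isFinalAnswer(message: string): boolean {
  return message.includes(FINAL_ANSWER_MARKER);
}
```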
You're right that sections reference each other, and sometimes reference other regulations. By creating the "plan for the junior lawyer", the LLM can reference multiple related sections at the same time. In the second step of the example plan in the post there's a reference to "Articles 8-15", meaning eight articles that should be analyzed together.
The system is indeed limited in that it cannot reference other regulations. We've heard from users that this is a problem too.
One of the applications is ZK-Rollups [1], which allow developers to move heavy computation off a blockchain. The blockchain receives the results and only verifies proofs that those results are valid. This is especially useful on Ethereum because its computational throughput is pretty low.
There's also ZCash [2], which is a cryptocurrency that lets you make untraceable transactions. This is in stark contrast to Bitcoin or Ethereum, where transaction information is available publicly to everyone. They have a series of blog posts [3] on the math that actually makes it work under the hood.
We've been using https://github.com/electric-sql/electric for real-time sync for the past month or so and it's been great. Rather than making you think about CRDTs explicitly, Electric syncs an in-browser sqlite db (WASM-powered) with a central postgres instance. As a developer, you get local-first performance and real-time sync between users. It's also faster to ship an application this way, since you skip writing APIs and just use the database directly. The only downside is that Electric is immature and we often run into bugs, but as a startup we're willing to deal with that in exchange for shipping faster.
I've been wondering how well Electric's been working for people ever since I heard about it; good to hear that it's been useful for you.
Couple of questions:
- How big is the WASM blob that you need to ship for in-browser SQLite? Have you had any noticeable issues from shipping a large payload to the browser?
- What are you using to persist the SQLite database on clients? Have you been using the Origin Private File System?
Gotcha, interesting. 1.1 MB isn't too bad, especially with Cloudflare providing a local PoP. And if this is for Hocus, I'm guessing your frontend isn't used much on mobile devices with iffy connections.
That writeup on the different SQLite VFSes for in-browser use is helpful, thanks for linking it.
Every postgres migration goes through an Electric proxy, which converts it into a corresponding sqlite migration that can be applied later on the client. If a migration would somehow be breaking, you can also drop the client-side sqlite database and resync state from postgres.
We have run into queries that corrupted the database client-side, but fortunately that doesn't propagate into postgres itself. In that case we had to drop the client-side db and resync from a clean state.
The corruption was also caught by sqlite itself - it threw a "malformed disk image" error and stopped responding to any further queries.
SQLite had two bugs [1] where batch atomic writes would corrupt your DB if you used IndexedDB to back your VFS. They have been patched in SQLite, so rolling a new Electric release that pulls in the latest SQLite build should fix that.
Any idea on what the root cause of the sqlite corruption was? There's some discussion on the SQLite forums about corruption with wasm (I've encountered it myself on a personal project), but from what I understand no one has identified a cause yet.
There's a workaround - if a table has an "electric_user_id" column then a user with that id (based on their JWT) can only read rows which have the same id. It's basic but it works for us. https://electric-sql.com/docs/reference/roadmap#shapes
Oh yes, in this post I was not trying to. Hocus gives you a web interface that lets you spin up a dev env with a single click of a button. We also implemented a git-integrated CI system that prebuilds your dev env on new commits. It’s basically a self-hosted Gitpod or GitHub Codespaces.
Nix solves a different problem than Hocus. Nix lets you define a development environment, Hocus gives you a way to run it on a remote server. Right now we use Dockerfiles to let users define the packages they need in their dev env, but we would like to support Nix in the future too. Interestingly, you can use custom BuildKit syntax https://docs.docker.com/build/dockerfile/frontend/ to build Nix environments with Docker https://github.com/reproducible-containers/buildkit-nix, and that's probably what we will end up supporting.
I think Nix is relevant here, because being able to run software reproducibly across different machines is one of its major selling points. I particularly like that it doesn't rely on virtualization or containerization to do that. It's up to the user to decide how to isolate the runtime environment from the host, or whether they even should. Alternatively, tools building upon Nix can make that decision for them. Either way, it allows for a more flexible approach when you have to weigh the pros and cons of different isolation strategies. Development environments defined with Nix also tend to compose well, as a result of this design.
Other than making sure we release unused memory to the host, we didn't customize QEMU that much. We do have a cool layered storage solution though - basically a faster, VMM-independent alternative to QCOW2. It's called overlaybd and was created at Alibaba. That will probably be another blog post. https://github.com/containerd/overlaybd
We do, and we'd love to use it in the future. We've found that it's not ready for prime time yet and is missing some features. The biggest problem is that it doesn't support discard operations yet. Here's a short writeup we did about the VMMs we considered: https://github.com/hocus-dev/hocus/blob/main/rfd/0002-worksp...