How OpenElections uses LLMs

karel-3d · 2025-06-19T23:50:27 1750377027

I was thinking LLMs can be long-term regressive?

As the "proper solution" here is of course not using PDFs that are hard-to-parse, but force elections to have machine parseable outputs. And LLMs can "fix in place" stupid solutions.

That's not a hate on the author though. I needed to do some PDF parsing for bank statements before; but also; the proper long-term solution is force banks (by law or by public interest) to have parseable statements, not parse it!

Like putting LLMs to understand bad codebase will not fix the bad codebase, but will build on top of it.

oh well c'est la vie

dwillis · 2025-06-20T00:00:46 1750377646

Totally reasonable view, and one of our volunteers actually got the law in Kansas changed to mandate electronic publishing of statewide precinct results in a structured format! But finding legislative champions for this issue isn't easy.

ghghgfdfgh · 2025-06-20T07:47:19 1750405639

I’ve tried using LLM’s to do the same exact thing (turning precinct-level election results into a spreadsheet) and in my experience they worked rather poorly. Less accurate than traditional OCR, and considering how many fixes I had to make, altogether slower than manual entry. The resolution of the page made an outsized difference. It’s nice that you got it to work, but I am skeptical of it as a permanent solution.

Tangentially- I appreciate what OpenElections does- however, I wish there was a similar organization that did not limit themselves to officially certified results. There are already other organizations who collect precinct results post-2016, and using only official results basically limits you to 2008 and afterwards, but historical election results are the real intrigue. Not to mention that I have noticed many blatant errors in election results that have supposedly been “certified” by a state/county government. The precinct results Pennsylvania publishes, for example, are riddled with issues.

dwillis · 2025-06-20T12:05:51 1750421151

Skepticism is a necessary trait in this type of work, for sure. I will say that the performance has improved substantially in the past year, and there are still PDFs that require a lot of work.

We went with official precinct results for two main reasons: there are differences between election night and final results (some of them non-trivial) and to make the work more manageable. Agree that historical results are a real problem, and as a PA native I know only too well the errors that the state data contains, which is why we go county-by-county there.

model-15-DAV · 2025-06-20T00:07:17 1750378037

I think that we should encourage elections to _not_ be standardized. The problems among various polities in the USA have many different issues and should not be forced to conform to a specific way that elections should be done. This is a social problem and we should not cram it into a technical solution. Legibility of elections should be maintained at the local level, trying to make things legible at a national level is in my opinion unwanted. As much as I would like the data to be clean, people are not so clean. Even if they used slightly more structured formats than PDFs, the differences between polities must be maintained as long as they are different polities.

The way that OpenElections handles this, with 'sources' and 'data' directories I think is a good way to bridge the gap.

missingcolours · 2025-06-20T01:38:03 1750383483

Not being standardized is fine and even a positive (diversity of technology vendors is a security feature and increases confidence in elections). But producing machine readable outputs of some sort, instead of physical paper and PDFs, is clearly a positive as well.

mmcwilliams · 2025-06-21T11:57:56 1750507076

Physical paper outputs can be machine readable if they are made using fonts with that are more-accurately OCR readable.

shash · 2025-06-20T07:09:11 1750403351

How is it unwanted to have a standardized database of _results_? They're partly going to be used in a federal context, right?

We do this pretty decently in India - the results of pretty much every election run by the Election commission is updated on https://results.eci.gov.in/# and it's the same for the whole country.

okayishdefaults · 2025-06-20T00:14:04 1750378444

Just breaking down the thought a little, we truly can't say elections shouldn't have standards, right?

model-15-DAV · 2025-06-20T00:20:37 1750378837

Elections at the local level should be governed by the locality. I do not see the need for standards at a higher level, other than for democracy to be maintained in some fashion. External data reporting certainly need not be standardized at t̶h̶e̶ ̶l̶o̶c̶a̶l̶ [sic] a higher level.

IgorPartola · 2025-06-20T04:08:33 1750392513

I have had to do some bank statements to CSV conversions before and still do occasionally and https://tabula.technology/ has been invaluable for this.

In other news, any bank that does not produce a standard CSV file for their bank statements should be fined $1m per day until they do. It's ridiculous that this isn't the first option when you go to download them.

karel-3d · 2025-06-20T09:18:02 1750411082

I did it with one of the Go's PDF parsers (I think rsc has one, and ... some guy... forked it and added some features.. it's still kind of manual but worked great)

Normal_gaussian · 2025-06-20T02:00:23 1750384823

I'm not convinced.

I had Gemini convert a bunch of charity forms yesterday, and the deviation was significant and problematic. Rephrasing questions, inventing new questions, changing the emphasis; it might be performing a lot better for numerical data sets, but it's rare to have one without a meaningful textual component.

timschmidt · 2025-06-20T02:13:23 1750385603

I've seen similar. I wonder if traditional organizational solutions, like those employed by the US Military or IBM, might be applicable. Redundancy is one of their tools for achieving reliability from unreliable parts. Instead of asking a single LLM to perform the task, ask 10 different LLMs to perform the same task 10 different times and count them like votes.

Normal_gaussian · 2025-06-22T10:12:03 1750587123

Yeah, what I did to "solve" my issue was to use several models (4), then where there was any disagreement farm out to humans (2). 60% went to humans in the end.

I suspect if I'd done some corrective transformations before LLM scanning the success rate would have been higher, but the cost threshold of the project didn't warrant it.

latentpot · 2025-06-20T02:42:42 1750387362

Why complicate? One LLM works, another reflects and then a decision engine to review would be cheaper.

nojito · 2025-06-20T17:06:44 1750439204

Not sure I believe this.

I just quickly took a scanned document and the transcription looks good.

https://19january2021snapshot.epa.gov/sites/static/files/201...

https://g.co/gemini/share/d315b4047224

It even got the faded partial date stamp.

Normal_gaussian · 2025-06-22T10:09:28 1750586968

Well bully for you accusing people of lying.

Thats one of the best scanned documents I've seen in years. Most scanning now is via phone.

simonw · 2025-06-20T02:28:48 1750386528

Did you out as much work into it as Derek did? He spent a full hour with Gemini to process the longer document.

7moritz7 · 2025-06-20T08:05:13 1750406713

Use 2.5 Pro on ai studio, not the gemini app

Normal_gaussian · 2025-06-22T10:10:12 1750587012

I did. I was scanning about 400 forms.

dwillis · 2025-06-20T16:06:13 1750435573

That's what I did.

fasthands9 · 2025-06-19T21:06:41 1750367201

In college (about 15 years ago) I worked for a professor who was compiling precint level results for old elections. My job was just to request the info and then do manual data entry. It was abysmally slow.

This application seems very good - but still a bit amazing that lawmakers haven't just required that all data be uploaded via csv! Even if every csv was slightly different format, it would be way easier for everyone (LLM or not).

xp84 · 2025-06-19T21:47:13 1750369633

I could be wildly off-base, but I wonder if some of these systems are airgapped, and the only way the data comes off of the closed system is via printing, to avoid someone inserting a flash drive full of malware in the guise of "copying the CSV file." Obviously there are or should be technical ways to safely extract data in a digital format, but I can see a little value in the provable safety that airgapping gives you.

dwillis · 2025-06-19T22:10:48 1750371048

In some cases that's true, but for many jurisdictions the results systems are third-party vendor platforms, too.

arlort · 2025-06-20T09:34:50 1750412090

You could always just print a QR code as well if that's the issue

xp84 · 2025-06-21T06:21:22 1750486882

Sure, but it'd take a literal Act of Congress to force all these states to force all their independent vendors to do a thing, so good luck. And each vendor would probably charge about a million dollars to each state to do the work, in government contracting world. So, probably better to just use AI to OCR them.

wizzwizz4 · 2025-06-20T10:31:30 1750415490

Or print in the OCR-B font.

simonw · 2025-06-19T17:43:59 1750355039

This is such an excellent example of a responsible and thorough application of vision LLMs to a gnarly data entry problem.

polskibus · 2025-06-19T19:11:05 1750360265

It’s also an excellent example on how lack of forced machine-readable format for gov publishing is a PITA.

Mtinie · 2025-06-19T20:59:41 1750366781

If I was in power and wanted to continue said rule, I’d definitely discourage the adoption of any standardized formatting for election results.

Not, you know, for any nefarious purpose…but because what we’ve used forever was good enough for grandpappy, so it’s obviously good enough for us.

/cough

sitkack · 2025-06-19T19:28:26 1750361306

json to qr code would be a good start. PRIOR ART inb4 a troll.

o11c · 2025-06-20T01:48:35 1750384115

You know, not ignoring the percentage column would mean you can do math checks yourself.

antonkar · 2025-06-20T00:31:09 1750379469

Related: Interesting mockups to turn X/open source Bsky into direct democratic massive "prothetic" polls in each post.

And paid polls that the author claims will replace prediction markets:

https://x.com/MelonUsks/status/1929660387995115713

GardenLetter27 · 2025-06-19T20:07:39 1750363659

Why is the original source data not available anywhere digitally?

Since it's printed it is clearly already in a database somewhere. Why can't that just be made public too.

Seems bizarre to OCR printed documents (although I am aware of many companies doing this to parse invoices, etc.)

simonw · 2025-06-19T20:13:33 1750364013

Welcome to government data.

One key problem is that the US has tens of thousands of local governments, and each of them get to solve problems in their own way.

Digital literacy of the kind that understands why releasing a CSV file is more valuable than a PDF is rare enough that most of them won't have someone with that level of thinking in a decision making role.

codingdave · 2025-06-19T22:23:10 1750371790

> most of them won't have someone with that level of thinking

That is an unfair take on it. Come out to the midwest and talk to some of the clerks in the small townships and counties out here. They do know the value of improved data and tech. And they know that investing in better tech can result in a little less money in the bank, which results in less gas to plow the roads, less money to pay someone to mow the ditches, which means on more car wrecked by hitting a deer. So the question is often not about CSV vs. PDF. It is about overall budget to do all the things that matter to the people of their town. Tech sometime just doesn't make the cut.

Besides, elections tend to have their own tech provided by the county or state, so there is standardization and additional help on such critical processes.

People running the smallest of government entities in this country tend to have pretty good heads on their shoulders. They get voted out pdq when they don't.

simonw · 2025-06-19T22:27:20 1750372040

I'm not convinced by that argument. The data is clearly already in a spreadsheet of some sort already. I don't think "click export as CSV" v.s. "print out as paper and scan as PDF" is a cost decision.

This isn't meant as shade! I have full respect for people working in those roles. Knowing the difference between a CSV file and a PDF file - and understanding why there are people out there who curse the existence of PDFs and celebrate CSVs - is pretty arcane knowledge.

Also note that I blamed people in "a decision making role" - changing procedures requires buy-in from management. People in management roles are less likely to be thinking about CSVs v.s. PDFs than the people actually executing on the work.

As Derek pointed out in https://news.ycombinator.com/item?id=44320001#44322987 this may often be a vendor limitation - in which case there is a cost factor to consider, and the blame can also be shared between the vendor and the person who made the purchasing decision without understanding the difference between PDF and CSV export.

cwmoore · 2025-06-20T00:02:31 1750377751

Shade where the sunlight should fall. Let’s be honest. Then there’s less to remember.

bastawhiz · 2025-06-19T23:30:32 1750375832

> elections tend to have their own tech provided by the county or state, so there is standardization and additional help on such critical processes.

There's fifty states and almost 4000 counties in the US, not to mention territories. Even if it was only fifty different standards, that's still an overwhelming amount of work and exactly the problem you're replying about.

cwmoore · 2025-06-20T00:03:29 1750377809

Is it so impossible to expect compliance from government entities?

bastawhiz · 2025-06-20T00:43:17 1750380197

To get all the states and counties using the same standard? Very impossible. That's the very crux of the tenth amendment. We don't even have consistent traffic laws from state to state.

cwmoore · 2025-06-21T17:29:33 1750526973

No, by members of the constituencies under their overlapping jurisdictions.

nxrabl · 2025-06-19T18:13:54 1750356834

Very interesting! Is this the state of the art for accurate OCR of tabular PDFs, or is there other work in the space to compare against?

SnooSux · 2025-06-19T18:32:52 1750357972

There's lots of posts on HN for developments and companies doing OCR and Document Extraction. It's a classic CV problem but still has come a long way in the past couple years

dwillis · 2025-06-19T19:58:31 1750363111

Yeah, this is a very well-traveled road, but LLMs have made some big improvements. If you asked me (the guy who wrote the original piece linked above) what I'd use if accuracy alone was the goal, probably would be AWS Textract. But accuracy and structure? Gemini.

benob · 2025-06-19T19:29:54 1750361394

I wonder how difficult it would be to bias a model so that it subtly corrupts election results when performing OCR.

grues-dinner · 2025-06-20T08:14:21 1750407261

Sounds like an IOCCC challenge (but a much bigger haystack in which to hide the hack).

croemer · 2025-06-19T19:46:40 1750362400

Surely not hard but why?

bilbo0s · 2025-06-19T20:02:15 1750363335

Easier to steal elections?

Don't have to bother with gerrymandering, or slick legal ways to arrest people for voting with the wrong documents. Or just good old fashioned intimidation, like making the polling place the police station or the ICE detention facility.

It's just a lot smoother process when you can simply write some software to manipulate the count.

Who's gonna check?

(No, seriously, Who's gonna check? Because you also need to layoff everyone in that department once you're in power.)

simonw · 2025-06-19T20:15:24 1750364124

Corrupted OCR won't help you steal elections. The result counting is a different process, with well designed checks and safeguards.

The problem is that once the counts are done and have been reported a lot of places then print those results out on paper and then scan those papers into a PDF for anyone who asks for a copy!

dwillis · 2025-06-19T20:28:27 1750364907

Many jurisdictions do risk-limiting audits using the original ballots, so futzing with the results wouldn't necessarily make that easier. Also, cast vote records are public in many states - those are records of each ballot cast. So people can check.

philips · 2025-06-19T20:30:34 1750365034

I think you mean risk limiting, right?

dwillis · 2025-06-19T22:11:36 1750371096

Yes, thanks! Fixed.

bilbo0s · 2025-06-19T21:29:11 1750368551

Freudian Slip?

philips · 2025-06-19T20:30:09 1750365009

You may consider reading about risk limiting audits. https://www.voting.works/audits