Syncopate | NYC (Hybrid ~3d/week) | Full-time | Senior Full Stack Engineers / Focus on AI + Finance
Syncopate builds tools to help automate financial diligence and management of long-tail financial assets.
We've found product-market fit with ETL/analysis tools for niche financial data, starting with music rights, and we're looking to build out our capabilities across more Excel + PDF-based workflows.
What we're looking for: A full-stack engineer with experience building data-heavy applications. Experience with analytics databases like ClickHouse and data pipelining is a plus. TypeScript knowledge is required.
Big bonus points for:
1) High agency (previously a founder or built side-projects to completion)
2) Some knowledge of finance
3) Skill in Rust
It seems to me like chunking (or some higher-order version of it, like chunking into knowledge graphs) is the highest-leverage thing someone can work on right now if they're trying to improve the intelligence of AI systems like code completion, PDF understanding, etc. I’m surprised more people aren’t working on this.
Chunking is less important in the long-context era, with most people just pulling in the top 20 results (top-K). You obviously don’t want to butcher it, but you’ve got a lot of room for error.
We still want chunking in practice to avoid LLM confusion and undifferentiated embeddings, and to handle large datasets and large volumes at lower cost. Large context means we can now tolerate multi-paragraph or multi-page chunks, so it's more like chunking by coherent section.
In theory we can do entire chapter/book, but those other concerns come in, so I only see more niche tools or talk-to-your-PDF do that.
At the same time, embedding is often a significant cost in the above scenarios, so I'm curious about the overhead of semantic chunking.
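To make "chunk by coherent section" from the comment above a bit more concrete, here is a minimal sketch. It assumes sections are separated by blank lines and that a character budget is a good enough proxy for a token budget; `chunk_by_section` and `max_chars` are illustrative names, not from any particular library.

```python
def chunk_by_section(text: str, max_chars: int = 4000) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks.

    Assumption: a blank line marks a section/paragraph boundary.
    A single paragraph longer than max_chars is kept whole as its own
    chunk rather than being cut mid-sentence.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for para in paragraphs:
        # +2 accounts for the blank line used when rejoining paragraphs.
        added = len(para) + (2 if current else 0)
        if current and size + added > max_chars:
            # Budget exceeded: close the current chunk and start a new one.
            chunks.append("\n\n".join(current))
            current, size = [], 0
            added = len(para)
        current.append(para)
        size += added
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because boundaries only ever fall between paragraphs, each chunk stays readable on its own, at the cost of uneven chunk sizes.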
In our use case we have many gigabytes of PDFs that contain some qualitative data but also many pages of inline PDF tables. In an ideal world we’d be “compressing” those embedded tables into some text that says “there’s a table here with these columns, if you want to analyze it you can use this <tool>, but basically the table is talking about X, here are the relevant stats like mean, sum, cardinality.”
In the naive chunking approach, we would grab random sections of line items from these tables because they happen to reference some similar text to the search query, but there’s no guarantee the data pulled into context is complete.
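A rough sketch of the table "compression" idea described above, assuming the table has already been extracted from the PDF into a list of row dictionaries (the extraction itself is the hard part and is not shown). The function name, the `query_table` tool name, and the sample data are all hypothetical.

```python
def summarize_table(name: str, rows: list[dict], tool_name: str = "query_table") -> str:
    """Replace a table with a short text stub suitable for embedding.

    Numeric columns get sum and mean; every other column gets its
    cardinality (number of distinct values).
    """
    if not rows:
        return f"Table '{name}' is empty."
    columns = list(rows[0].keys())
    lines = [
        f"Table '{name}': {len(rows)} rows, columns: {', '.join(columns)}.",
        f"To analyze the full table, call the `{tool_name}` tool.",
    ]
    for col in columns:
        values = [row[col] for row in rows]
        numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )
        if numeric:
            total = sum(values)
            lines.append(f"- {col}: sum={total:.2f}, mean={total / len(values):.2f}")
        else:
            lines.append(f"- {col}: cardinality={len(set(values))}")
    return "\n".join(lines)


# Made-up example rows, standing in for an extracted royalty statement.
example_rows = [
    {"track": "A", "royalty": 100.0},
    {"track": "B", "royalty": 300.0},
    {"track": "A", "royalty": 200.0},
]
example_summary = summarize_table("royalties", example_rows)
```

The stub is what gets embedded and retrieved; the complete rows stay behind the tool, so the model never sees a partial slice of line items.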
Trueish - for orgs that can't use API models for regulatory or security reasons, or that just need really efficient high throughput models, setting up your own infra for long context models can still be pretty complicated and expensive. Careful chunking and thoughtful design of the RAG system often still matters a lot in that context.
It splits an input text into equal sized chunks using DFS and parallelization (rayon) to do so relatively quickly.
However, the goal for me is to use an LLM to split text by topic. I’m thinking I will implement it as an API SaaS service on top of it being OSS. Do you think it’s a viable business? You send a library of text, and receive a library of single-topic context chunks as output.
If I were to guess, most (adult) humans could not add two 3-digit numbers together with 100% accuracy. Maybe 99%? Computers can already do 100%, so we should probably be trying to figure out how to use language to extract the numbers from stuff and send them off to computers to do the calculations. Especially because in the real world most numbers that matter are not just two-digit addition.
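As a small illustration of that split between extraction and calculation: in the sketch below a regular expression stands in for the language-model extraction step (an assumption made so the example runs on its own), and `Decimal` does the arithmetic exactly. The function names and the sample text are hypothetical.

```python
import re
from decimal import Decimal

# Matches numbers such as 987.25 or 1,204.50 (commas as thousands separators).
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_numbers(text: str) -> list[Decimal]:
    """Extraction step: pull numeric literals out of free text."""
    return [Decimal(m.group().replace(",", "")) for m in NUMBER.finditer(text)]


def exact_total(text: str) -> Decimal:
    """Calculation step: deterministic, exact decimal addition."""
    return sum(extract_numbers(text), Decimal(0))


sample = "Invoice A: 1,204.50; invoice B: 987.25; invoice C: 12.75"
sample_total = exact_total(sample)
```

In a real system the extraction would be the model's job (for example via a tool call), while the sum would never be left to the model.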
It looks like Deepseek had a subdomain called "openai-us1.deepseek.com". What is a legitimate use-case for hosting an openai proxy(?) on your subdomain like this?
Not implying anything's off here, but it's interesting to me that this OpenAI entity is one of the few subdomains they have on their site
Could just be an OpenAI-compatible endpoint too. A lot of LLM tools use OpenAI compatible APIs, just like a lot of Object Storage tools use S3 compatible APIs.
Artists do not get paid per stream on Spotify + many other DSPs. The platform sums up all of the ad revenue and divides it pro rata among all of the streamed artists. So the fraudulent streams dilute the pie for legitimate streams.
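A toy calculation of the dilution effect described above. The pool size and stream counts are made up, and real DSP payout formulas have more moving parts; this only shows the pro rata mechanics.

```python
def pro_rata_payouts(revenue_pool: float, streams_by_artist: dict[str, int]) -> dict[str, float]:
    """Split a fixed revenue pool in proportion to each artist's stream count."""
    total_streams = sum(streams_by_artist.values())
    if total_streams == 0:
        return {artist: 0.0 for artist in streams_by_artist}
    return {
        artist: revenue_pool * streams / total_streams
        for artist, streams in streams_by_artist.items()
    }


# Hypothetical numbers: same pool, with and without fraudulent streams.
honest = pro_rata_payouts(1000.0, {"artist_a": 600, "artist_b": 400})
with_fraud = pro_rata_payouts(
    1000.0, {"artist_a": 600, "artist_b": 400, "fraudster": 1000}
)
```

In this example `artist_a` drops from 600.0 to 300.0 even though their own stream count did not change, because the pool is fixed and the fraudulent streams enlarge the denominator.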
The Monty Hall problem is a great example of something I’ve been educated into believing, rationalizing, whatever you want to call it…but I would still never claim I “understand it.” I think that’s maybe the source of disagreement here: there are many truly unintuitive outcomes of statistics that are not “understood” by most people in the most respectful sense of the word, even if we’ve been educated into knowing the formula, knowing how to come to the right answer, etc.
It’s like in chess: I know that the Sicilian is a good opening, that I’m supposed to play a6 in the Najdorf, but I absolutely do not “understand” the Najdorf, and I do think it’s fundamentally past the limit of most humans' understanding.
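For the Monty Hall problem mentioned above, a quick simulation at least lets you check the answer empirically, even if it does nothing for the intuition. This is a sketch of the standard three-door version, where the host always opens a goat door that is not the contestant's pick; the function name and parameters are illustrative.

```python
import random


def monty_hall_win_rate(switch: bool, trials: int = 100_000, seed: int = 0) -> float:
    """Estimate the win probability for the stay or switch strategy."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        car = rng.randrange(3)
        pick = rng.randrange(3)
        # The host opens a door that is neither the contestant's pick nor the car.
        opened = rng.choice([d for d in range(3) if d != pick and d != car])
        if switch:
            # Move to the one remaining closed door.
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += pick == car
    return wins / trials


switch_rate = monty_hall_win_rate(switch=True)
stay_rate = monty_hall_win_rate(switch=False)
```

Switching should come out near 2/3 and staying near 1/3, which matches the formula without making it feel any more obvious.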
can you run the whole task as a postgres transaction? like if i want to make an idempotent job by only updating some status to "complete" once the job finishes.
Not a Hatchet user, but this doesn’t sound like a Hatchet-specific question. Long running transactions could be problematic depending on the details. I handle idempotency by not holding a transaction and instead only upserting records in jobs and using the job record itself to get the status. For example, if you want to know if a PDF has had all of its pages OCR’d, look at all of the job records for the PDF and aggregate them by status. If they’re all complete you’re good to go.
No, the whole task doesn't execute as a postgres transaction. Transactions will update the status of a task (and higher-order concepts like workflows) and assign/unassign work to workers, but they're short-lived by design.
For some more detail -- to ensure we can't assign duplicate work, we track which workers are assigned to jobs by using the concept of a WorkerSemaphore, where each worker slot is backed by a row in the WorkerSemaphore table. When assigning tasks, we scan the WorkerSemaphore table and use `FOR UPDATE SKIP LOCKED` to skip any locked rows held by other assignment transactions. We also have a uniqueness constraint on the task id across all WorkerSemaphores to ensure that no more than 1 task can be acquired by a semaphore.
This is slightly different to the way most pg-backed queues work, where `FOR UPDATE SKIP LOCKED` is done on the task level, but this is because not every worker maintains its own connection to the database in Hatchet, so we use this pattern to assign tasks across multiple workers and route the task via gRPC to the correct worker after the transaction completes.
Long running transactions can easily lock up your database. I'd definitely avoid those. You're better off writing status records to the DB and using those to determine whether something is running, failing, etc.
From the article it seems like the content wasn’t entirely AI generated, it “digitally superimposed the faces of child actors onto nude bodies”, which makes this a lot worse than purely AI generated content
Why specifically? The child actors aren’t harmed, nor is anyone else as far as I can see. So what exactly makes this worse than say AI art or a painting?
It's not explicitly stated, but it also isn't ruled out, that the nude bodies were real photos of real children, in which case the deepfake element would be irrelevant.
Sure, that’s a reasonable reaction assuming it applies. But people seem to have a strong reaction even if it didn’t, which is more what I was trying to understand.
I could see using adult models for source material as a sign of mental illness and proactively institutionalizing someone for that is on the table, but prison IMO implies some kind of harm.
> But people seem to have a strong reaction even if it didn’t, which is more what I was trying to understand.
There are some images and concepts where humans generally have a strong disgust reflex; I'm not sure how much any of these various reflexes are innate vs. learned, but in either case overcoming them is likely to also have severe negative consequences.
A common trope I've seen in every debate about human sexual desires and what should be forbidden: there's always someone who treats it as a slippery slope. I don't think they've been right, but I can see the possibilities for how they might be.
But we may have to overcome them despite those possibilities, unfortunately, given how easy it now is to fake such material and how this is absolutely going to impact elections going forward, because people are going to use the tech to create fake images of politicians.
20-odd years ago I saw a Photoshopped picture where someone had put the faces of George W. Bush and Osama bin Laden onto a gay porn photo, I think it was to protest the American invasion of Afghanistan; more recently, we've already had a newsworthy generated image of Trump resisting arrest, and I'd be shocked if nobody's yet put him into an image with Stormy Daniels.
There's a possibly apocryphal but widely believed quote from Trump, "If Ivanka weren't my daughter, perhaps I'd be dating her", and at the (supposed) time of the quote Ivanka was 13 — and remember, this is believed because, to a first approximation, someone put some text on a photo and tweeted it — someone is almost certainly going to use AI to generate that scene.
As Trump has (despite the legal battles which needed a much higher standard for their evidence even to get started) a reasonable chance of winning the next election, there's a real chance of people seeing that generated image, taking it seriously, and repeating the Jan 6 2021 attempt to prevent the transfer of power. Only this time, without needing a charismatic leader to lead them.
And because that Ivanka quote isn't well-evidenced, yet is widely believed, that's also a problem that almost every democracy will have to face forever, no matter what you think of Trump himself.
> There's a possibly apocryphal but widely believed quote from Trump, "If Ivanka weren't my daughter, perhaps I'd be dating her"
Not even slightly apocryphal. You can see the clip on YouTube, e.g. [1]. But she was 25 at the time, not 13.
« When Trump was the star of the reality TV show “The Apprentice,” he appeared on the ABC talk show “The View” with his daughter in 2006 and said, “If Ivanka weren’t my daughter, perhaps I’d be dating her. Isn’t that terrible? How terrible? Is that terrible?” »
> And because that Ivanka quote isn't well-evidenced
I've heard a different quote that was allegedly made when she was 13 (I guess taking a proven quote and transposing the age to another unproven quote is an easy step, possibly even accidentally) but Snopes currently rates that as "unproven".
(Given the things he's actually on record as saying, it wouldn't surprise me though.)
By that argument, deep fake porn shouldn’t exist at all. But it does, so clearly that’s not the way these people see the world. Some faces are simply “better” than others, at least in so far as those with perverted pornographic desires are concerned.
To amend my prior comment, the law says that the face actors in fact are harmed as well, reputationally. Fair enough; I’m not here to play “defend the pedo”.
It exists because the face is what visibly ages on people, even if the body isn't easily distinguishable from a child's. Why is it called "child sexual abuse material" if no child was abused in its creation?
Re: reputational damage - OK, I agree if they distribute it, that's fair. But what if they don't distribute it and the police find it only after they take their hard drives?
You can reach out to me here https://www.linkedin.com/in/michael-markell-377b4221a/ or via email (michael at syncopate dot ai)
More about Syncopate (geared towards our music rights segment): https://syncopate.notion.site/