petercooper's comments (Hacker News)

I was hoping to link to this from somewhere technically relevant but having scoped out the Copyright, Designs and Patents Act 1988 S296 it actually seems unwise to do so as a British person. Meanwhile Starmer is proud of Britain's "freedom of speech" (yes, I know freedom of speech sensibly has limits, but the statute is overly broad in this case).


If you personally work on developing LLMs et al, know this: I will never work with you again, and I will remember which side you picked when the bubble bursts.

Does this statement have ramifications for sourcehut and the type of projects allowed there? Or is it merely personal opinion?


At the top of the article:

> This blog post is expressing personal experiences and opinions and doesn’t reflect any official policies of SourceHut.


Different platform but you can do that with this: https://replicate.com/recraft-ai/recraft-v3-svg (or on Recraft itself.)


Cool I’ll try it out. Thanks


Interesting that the Wikipedia entry says it was designed to be a “major tourist attraction.” I live in the north of England and this is the first time I’ve ever heard of it(!)


Apparently it'll use the equivalent of about 12,000 homes' worth of water, but have technology to recycle about 65% of its use, so ultimately roughly 4,000-5,000 homes' worth? https://eu.azcentral.com/story/opinion/op-ed/joannaallhands/... A lot, but about 10% of current household growth in Arizona, so they'll surely "find a way."


No direct recommendation for that use case, but one strategy I've heard being used and that works with complex documents (or where hallucinations are Very Bad™ - like invoice processing) is using multiple techniques and models at once in a quorum approach. For example, direct ingestion of PDFs into Gemini, OCR and ingestion of text, plus perhaps using another model like GPT. If they all agree on a fact, you're (probably) good. If not, it can be bumped up to human correction.
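The quorum idea above can be sketched in a few lines. This is a hypothetical illustration, not a real pipeline: the extractor results, field names, and `quorum_extract` helper are all made up, and in practice each dict would come from a different model or OCR path.

```python
# Sketch of a "quorum" check for document extraction: run several
# independent pipelines (e.g. direct PDF ingestion into one model,
# OCR + a text model, a second LLM) and only accept a field when
# enough of them agree; disagreements get escalated to a human.
from collections import Counter

def quorum_extract(extractions, min_agreement=2):
    """Merge per-field results from several pipelines.

    extractions: list of dicts, one per pipeline, mapping field -> value.
    Returns (accepted, disputed): accepted fields met the agreement
    threshold; disputed ones should be bumped up to human correction.
    """
    accepted, disputed = {}, {}
    fields = {f for ex in extractions for f in ex}
    for field in fields:
        votes = Counter(ex[field] for ex in extractions if field in ex)
        value, count = votes.most_common(1)[0]
        if count >= min_agreement:
            accepted[field] = value
        else:
            disputed[field] = dict(votes)
    return accepted, disputed

# Three hypothetical pipelines, one of which misreads the total:
results = [
    {"invoice_no": "INV-1001", "total": "128.40"},   # PDF -> vision model
    {"invoice_no": "INV-1001", "total": "128.40"},   # OCR -> text model
    {"invoice_no": "INV-1001", "total": "123.40"},   # second LLM
]
accepted, disputed = quorum_extract(results)
```

Here two of three pipelines agree on `total`, so it clears the quorum; if all three had differed, the field would land in `disputed` for review.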


I love the model of it being free to scan and see if you'd get any benefit, then paying for the actual results. I, too, am a packrat, ran it, and got 7GB to reclaim. Not quite worth the squeeze for me, but I appreciate it existing!


He’s talked about it on the podcast he was on. So many users would buy this, run it once, then save a few gigs and be done. So a subscription didn’t make a ton of sense.

After all, how many exact duplicate files do you accidentally create in a month?

There's a subscription or buy-forever option for people who think it would actually be quite useful to them. But for a ton of people, a one-time IAP that gives them a limited amount of time to use the program really does make a lot of sense.

And you can always rerun it for free to see if you have enough stuff worth paying for again.


For me the value in a dedup app like this isn't so much the space savings, since I just don't generate huge amounts of data, but the removal of the duplicated files themselves, some of which (or all of them in aggregate) may be large. There are some weird scenarios where this occurs, usually when reconciling a hard drive recovery with another copy of the files, or a messy download directory with an organized destination.

For example, I discovered my Time Machine backup had kicked out the oldest versions of files I didn't know it had a record of and thought I'd long since lost, but it destroyed the names of the directories and obfuscated the contents somewhat. Thousands of numerically named directories, some of which have files I may want to hang onto, but I don't know whether I already have them or where they are, since it's completely unstructured. Likewise, many of them may just contain a copy of the same build-system text file I can obvs toss away.


Am I really that old that I remember this being the default for most software about 10 years ago? Are people already so used to the subscription trap that they think this is a new model?


I grew up with shareware in the 90s that often adopted a similar model (though having to send $10 in the mail and wait a couple of weeks for a code or a disk to come back was a bit of a grind!), but yes, it's refreshing in the current era, where developers will even attempt to charge $10 a week for a basic coloring-in app on the iPad.


I also really like this pricing model.

I wish it were more obvious how to do it with other software. Often there's a learning curve in the way before you can see the value.


It's very refreshing compared to those "free trials" you have to remember to cancel. (Pro tip: use virtual credit cards you can lock for those, so if you forget to cancel, the charges are blocked.)

However, has anyone been able to find out from the website how much the license actually costs?


Doesn’t the Mac App Store listing list the IAP SKUs like it does on iOS?


It does. It's reasonably clear for this app but I wish they made it clearer for other apps where the IAP SKUs often have meaningless descriptions.


Why spend your personal time and effort for someone else with deeper pockets to automatically extract value from your work?

People releasing their code under MIT or BSD licenses might be able to give good answers to this.


Good answers like "it looks cool in my CV that big company XYZ uses my MIT licensed script"


It's extremely dishonest to compare someone voluntarily releasing their work under a permissive license with someone who is involuntarily having their content and effort stolen by an organization training an AI.


And I think it's erroneous to say having publicly disseminated content being read in an LLM training process is "stealing."

If I read a publicly distributed, copyrighted blog post of yours, learn something, then use that knowledge later on, did I steal your content?

If an author distributes something in public, the public is allowed to read it and learn from it, whether with their eyes, a screen reader, AI agent, or whatever. Any copyright violation occurs if they attempt to republish your content, not in the reading of it.

However, scraping illegally obtained non-public material - such as books an author is trying to sell or blog posts behind a paywall - could well be a violation unless access is obtained legally.


> And I think it's erroneous to say having publicly disseminated content being read in an LLM training process is "stealing."

It's clearly theft-adjacent. You're free to use "piracy" if you want, as long as it's clear that it's illegal and morally on the level of theft.

> If I read a publicly distributed, copyrighted blog post of yours, learn something, then use that knowledge later on, did I steal your content?

It's also extremely dishonest to compare AI to humans like this. AI are not people - morally, socially, biologically, or legally. What a human does with a piece of content is utterly irrelevant to the process of training an AI.

> If an author distributes something in public, the public is allowed to read it and learn from it, whether with their eyes, a screen reader, AI agent, or whatever.

Again - very dishonest to conflate a pre-trained AI agent (such as OpenAI's Operator) with the training process.

> Any copyright violation occurs if they attempt to republish your content, not in the reading of it.

OK, this is just factually incorrect. It is a violation of copyright law to make copies of copyrighted content, with very limited and case-by-case fair use exceptions - the claim that violation only happens in the republishing case is completely false.

This entire defense is a mix between deceptive and flat-out factually incorrect statements.


Your repeated use of the word 'dishonest' seems odd to me. I infer you think I'm making arguments disingenuously and without believing in them and/or manipulating the truth. I can reassure you this is not the case. I sincerely believe you are making your own arguments honestly also, and am engaging with them in that spirit.

> as long as it's clear that it's illegal and morally on the level of theft.

It's not clear. I do not consider training an LLM on publicly disseminated text to be "morally on the level of theft." Stealing my car, or even a pen off my desk, is a much more reprehensible action than slurping everything I've shared in public into an LLM training process, purely IMHO.

> It's also extremely dishonest to compare AI to humans like this. AI are not people - morally, socially, biologically, or legally. What a human does with a piece of content is utterly irrelevant to the process of training an AI.

People or corporations (which are usually treated as person-like) operate training processes and are morally and legally responsible for them. I believe training an LLM is "a human/corporation doing something" with a piece of content.

> Again - very dishonest to conflate a pre-trained AI agent (such as OpenAI's Operator) with the training process.

Again, I am being honest. Whether I let an AI agent read your blog post or whether I write a program to read it into an LLM fine tuning process seems immaterial to me. I am open to being convinced otherwise, of course.

> It is a violation of copyright law to make copies of copyrighted content, with very limited and case-by-case fair use exceptions

One of those exceptions (in many jurisdictions) is making temporary copies of data to use in a computational process. For example, browser caching, buffering, or transient storage during compression/decompression.

While many of the "pile" style of permanently stored and redistributed datasets are more than likely engaging in copyright violation, that's not inherent to the process of training an LLM, the topic of this thread. I believe that if copyright holders want to go after anyone and have success in doing so, they should go after those redistributing their content in such datasets, not those merely training LLMs which is not, in and of itself, violating any laws I can establish.


> It's not clear. I do not consider training an LLM on publicly disseminated text to be "morally on the level of theft." Stealing my car, or even a pen off my desk, is a much more reprehensible action than slurping everything I've shared in public into an LLM training process, purely IMHO.

The theft is that of effort, in the exact same (or a worse) sense as pirating media or stealing IP from a company.

It takes effort to write. That effort is being stolen by an LLM during the training process - the LLM cannot possibly exist without the work done by the authors who wrote content that it is being trained by, and the LLM can also be used to automate away those authors' ability to do work (and jobs) by replacing them. Which is worse - to have your car stolen (which is very bad, I'm not arguing that it isn't), or to lose your job, and not being able to afford anything?

Alternatively, if you believe that it's not bad to take someone's effort without their consent and without compensating them for it, then you shouldn't object to your employer withholding wages from you, or a client refusing to pay you, on the same principle.

> People or corporations (which are usually treated as person-like) operate training processes and are morally and legally responsible for them. I believe training an LLM is "a human/corporation doing something" with a piece of content.

That's not reasonable, and most people do not share your opinion (including the relevant group, which is the authors of the content being trained on). That's equivalent to saying that a human writing a program to perfectly reproduce a copyrighted work (e.g. print out the complete text of Harry Potter) is a human "doing something" with that copyrighted work (in the same class as reading Harry Potter).

> Whether I let an AI agent read your blog post or whether I write a program to read it into an LLM fine tuning process seems immaterial to me.

Those are categorically different. The vast majority of the population (again, including those writing the works that are being trained on without their consent) will agree that they are categorically different and incomparable, and they are logically, legally, and morally distinct.

> One of those exceptions (in many jurisdictions) is making temporary copies of data to use in a computational process. For example, browser caching, buffering, or transient storage during compression/decompression.

To use in specific computational processes for which you do not store the output because the output is subject to the same copyright laws. The implicit premise when you talk about training is that you're going to save the trained model, so this obviously doesn't apply, in the same sense that if you take a copyrighted work and transcode it, the transcoded output is subject to the exact same set of copyright laws as the original.

> not those merely training LLMs which is not, in and of itself, violating any laws I can establish.

That's the "law is morality" fallacy. Morally, this is clearly wrong, the point of the copyright system is to prevent exactly things like this from happening. The courts have not yet decided whether training an LLM is "copying" a copyrighted work, but if they do, then it's clearly illegal.


I appreciate your arguments and know they are in good faith. I think we would have an edifying debate in person!

I'm not going to reply to everything as I think our viewpoints are tricky to reconcile, since we find different things to be moral/immoral. That's fine, but it might not be productive. However, I acknowledge your position and know it reflects much popular sentiment; I cannot dispute that.

> if you believe that it's not bad to take someone's effort without their consent and without compensating them for it, then you shouldn't object to your employer withholding wages from you

I think this gets to the crux of our difference. Employment is an explicit contract that binds two parties to honor their obligations. If someone posts a blog post openly, busks in the street, or does some graffiti art, I don't think observers have any obligations beyond an implicit idea of "experience this in any way you like as long as it's legal". Whether you prefer 'legal' or 'moral' there, it brings us back to the problem that we disagree on the morality/legality of the core issue. Given the constraints of this venue, not to mention our time, I'm happy to recognize this difference and leave it unsettled.

> That's the "law is morality" fallacy. Morally, this is clearly wrong, the point of the copyright system is to prevent exactly things like this from happening. The courts have not yet decided whether training an LLM is "copying" a copyrighted work, but if they do, then it's clearly illegal.

If that should come to pass, I agree. However, your suggested fallacy then comes into play the other way around. Merely because a legal precedent may be set does not change my opinion that it is not immoral. That is a point on which we clearly differ and one I think would be fascinating to debate if only in a more appropriate venue as I may even be won over but have not been by any arguments so far.


If you look at other countries/regions that impose high tariffs, their companies continue to buy and use American technologies and absorb the cost (to their local customers' detriment).

I'd certainly enjoy the case studies of European enterprises jumping from full-scale Azure and AWS deployments to OVHcloud or Hetzner, though. That'd make for some interesting reading.


But what if they outright ban it, as the US was going to do with TikTok (for national security reasons)? This is the tech-services version of Nord Stream.


It's not really workable. The real-world impact of a TikTok ban, even if it outright stopped working on every American device overnight, is pretty minimal: people stop watching videos, and some influencers lose their jobs.

If my (Canadian) government decides to ban Azure in a year, my critical infrastructure company ignores it for 11 months because they figure it won't actually happen, and then goes to the government to tell them that if the ban actually goes through, our infrastructure stops working because we'd actually need a multi-year timeframe to migrate off of Azure.


Impossible, even in the current crazy atmosphere. An actual ban would mean an all-out commercial war and a very serious dent in globalization.



The EU hasn't even got a home-built social network with significant market reach, let alone the wherewithal to pull off ditching Microsoft and Google. It'd be nice to see that change, but there's surely some sort of blocker after 25 years of the Web being a mainstream technology.


They used to exist (e.g. Hyves, StudiVZ), but they were killed off by FAANG. However, there are still locally successful companies that could expand to the rest of Europe if US companies were dropped. Just speaking of The Netherlands: Bol.com is much more popular than Amazon, Marktplaats is more popular than eBay (which is pretty much non-existent here) and owned by a Nordic company, and iDEAL is much more popular for payments than PayPal, Stripe, etc. (and works far better). Such companies can fill the void.

Microsoft will be tough to replace. There are good alternatives, but retraining personnel, etc. will take years. Google, I am not sure. Their cloud services are replaceable. Search may be tougher, but the quality of Google Search has become so bad that it's often easier to ask an LLM.


Takeaway (thuisbezorgd) and Zalando are some pretty large players in the EU markets. Spotify of course.


Booking.com. Adyen. ASML. Messagebird. TomTom. To name a few from a tiny speck of land in Europe. It's not like we lack capabilities.


Wasn't Marktplaats bought out by eBay?

See also: https://mergr.com/transaction/ebay-acquires-marktplaats-bv


eBay sold Marktplaats in 2015: https://nl.m.wikipedia.org/wiki/Marktplaats.nl


Tuenti?


Tuenti was huge in Spain.


With social networks, or any EU startup, the problem is that you have to deal with different languages right from the start.

As a US startup with English only, you have access to 300m people right away.

There were country-specific social networks, but then all the cool kids were on FB, so everyone moved there.

The same with LinkedIn: our country-specific business social network finally closed down last year. For the first 3-5 years it was growing, then everyone moved to LinkedIn, so the network was a ghost town for 15 years. Someone kept it alive just in case, but it seems they've stopped wasting money.


I think the language problem will become less of an issue in the future due to (1) more (young) people living in cities and (2) all young people in cities speaking English, at least compared to previous generations, IMO. This could be my subjective view, based on Luxembourg, the Netherlands, and visiting other European cities.


Don't overestimate "young people speaking English"; especially with current demographics, you still need to reach the ones who are excluded by English, as there will be many more of them.

I do see opportunities with LLMs making all kinds of platforms language-agnostic: you should be able to write in your own language and read in your own language, even if the other person is from a different country and uses a different language.


Network effect is also hugely important.


Maybe the so-called social network is not something to reproduce. Who cares who runs them if they deteriorate sociality, generate addictive consumption of things detrimental to mental health, and favor extremist points of view?


And that's why we need to stop being dependent on the US: everything there is described in terms of "market share," and not in terms of usefulness, ethics, or independence.


Mastodon is German:

https://joinmastodon.org/about

(So is SAP, for that matter.)


There is currently an active effort to have the EU contribute towards funding https://freeourfeeds.com/ (to enable a distributed, global AT Proto network). Does the EU need the network to be home-grown, or does the valuation matter? I argue no: it is a utility, not a business to be captured and squeezed by investors or other potential controlling interests.

(as of this comment, Bluesky has ~32M users and counting)


They can fork phpbb. You didn’t really think these social networks are anything more than that?

We just need to see if phpbb can scale to a billion, and if not, why not.


Well, I'm all for the return of the classic forum experience!

The UK's largest "social" sites are pretty much forums (e.g. Mumsnet, The Student Room, DigitalSpy, MoneySavingExpert) and while they're good for their respective topics, they don't cover the Reddit/Facebook/Instagram use cases (they could be arguably considered on a par with individual sub-reddits).


> Well, I'm all for the return of the classic forum experience!

If you make each individual bulletin board receive broadcasts from a central server, you get the network effects of Facebook and Reddit. Individual boards can subscribe to the central server, keeping them connected to the hivemind, or not. Your community can remain isolated or throttled (say, only 30% of global updates get through). We do this manually here, where not all global posts get through (you'd be hard-pressed to push a Reddit post to the top here). It's the simplest way to federate using existing technology.

This model is already at play. X, Bluesky, Reddit, Truth Social, and Rumble are basically heavily funded private message boards with a large mindshare subscriber base.

Taking our message boards back is proving to be difficult, especially because trying to move the userbase off of it is the same as trying to move people off drugs.
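The broadcast-plus-throttle model described above can be sketched in a few lines. This is purely a toy illustration under the assumptions in that comment: the `Hub` and `Board` classes and their methods are hypothetical, not any existing federation protocol.

```python
# Toy sketch: a central hub fans global posts out to subscribed boards,
# and each board applies its own throttle so only a fraction of global
# updates reach its local community.
import random

class Hub:
    """Central server that re-broadcasts every post to subscribed boards."""
    def __init__(self):
        self.boards = []

    def subscribe(self, board):
        self.boards.append(board)

    def broadcast(self, post):
        for board in self.boards:
            board.receive(post)

class Board:
    """An individual bulletin board with a local throttle in [0.0, 1.0]."""
    def __init__(self, name, throttle=1.0, seed=None):
        self.name = name
        self.throttle = throttle      # fraction of global posts accepted
        self.rng = random.Random(seed)
        self.posts = []

    def receive(self, post):
        # Drop a share of global traffic so the community stays
        # as isolated or as connected as it chooses.
        if self.rng.random() < self.throttle:
            self.posts.append(post)

hub = Hub()
open_board = Board("open", throttle=1.0)      # fully in the hivemind
walled_board = Board("walled", throttle=0.0)  # fully isolated
hub.subscribe(open_board)
hub.subscribe(walled_board)
for i in range(10):
    hub.broadcast(f"global post {i}")
```

With the throttles at the extremes the behavior is deterministic: the open board keeps all ten posts, the walled one keeps none; a 0.3 throttle would let roughly 30% through.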


> If you make each individual bulletin board receive broadcasts from a central server

You're doing this with phpBB? Does it happen to be open-source somewhere?

It would be interesting to have a look; I think I like this opt-in partial federation / hivemind a bit. It would be even more interesting if it were possible to sync comments between such forums.

**

Developing forum software myself, Talkyard. Based in Europe (Sweden).

Started thinking even more about using some European cloud as an option. There's a Swedish hosting provider that looks interesting (I think).


> sync comments

I guess you could do syncing kind of like how CCing email is done. CC my home server and global server. This gives you agency to remain detached from the hivemind, and vice versa. This is not some idea out of left field, it's roughly my workflow between Reddit or HN or other sites. I manually do the filtering in my mind when I move through different channels.

Phpbb is open source, but I mostly brought it up to show that Facebook is just that, and nothing more. Forking Reddit will also give you a Facebook clone (and a Reddit clone).


I was wondering whether you're using a phpBB extension you've built yourself, and if it's on GitHub or somewhere (the extension), or... is it not a built-in feature?

Websearched for "phpBB federation" and "phpbb subscribe rss broadcasts", found this:

"Feed post bot: This extension enables you to read any RSS, ATOM or RDF feed. It looks for new items every half hour and post them to a specified forum." https://www.phpbb.com/community/viewtopic.php?f=456&t=241159...

Interesting way to use RSS.


It doesn’t exist. I was contemplating how to connect all existing message boards together via a central server(s), mimicking Reddit and FB news flow.


https://matrix.org/ is partly funded by the French government.


> We just need to see if phpbb can scale to a billion

No need for that, we are just half a billion in Europe.


PeerTube is made in France, Mastodon AFAIK in Germany.


Too many trade barriers, stifling rules and general hostility to growing tech companies for the EU to compete with US companies, and only looks to get more restrictive. I’d bet against the EU pulling it off unless there’s a big coordinated realignment of priorities.

