Copilot regurgitating Quake code, including sweary comments (twitter.com/mitsuhiko)
1280 points by bencollier49 on July 2, 2021 | 640 comments



So this makes it official... this post[0] and the comments on the announcement[1] concerned about licensing issues were absolutely correct... and this product has the possibility of getting you sued if you use it.

Unfortunately for GitHub, there's no turning back the clock. Even if they fix this, everyone who uses it has been put on notice that it copies code verbatim and enables copyright infringement.

Worse, there's no way to know if the segment it's writing for you is copyrighted... and no way for you to comply with license requirements.

Nice proof of concept... but who's going to touch this product now? It's a legal ticking time bomb.

0. https://news.ycombinator.com/item?id=27687450

1. https://news.ycombinator.com/item?id=27676266


Adding to this:

I run product security for a large enterprise, and I've already gotten the ball rolling on prohibiting copilot for all the reasons above.

It's too big a risk. I'd be shocked if GitHub could remedy the negative impressions minted in the last day or so. Even with other compensating controls around open source management, this flies right under the radar with a C-130's worth of adverse consequences.


Do you also block stack overflow and give guidance to never copy code from that website or elsewhere on the Internet? I'm legitimately curious - my org internally officially denounces the copying of stack overflow snippets. Thankfully for my role it's moot as I mostly work with an internal non-public language, for better or worse, and I have no idea how well that's followed elsewhere in the wider company.


Apples and oranges: Stack overflow snippets are explicitly granted under a permissive license, as long as you attribute.

https://stackoverflow.com/help/licensing

It appears that the code that copilot is using is created under a huge variety of licenses, making it risky.

On the other hand, a small snippet in a function that is derived from many existing pieces of other code may fall under fair use, even if it is not under an open source license of some sort.


Stack Overflow and Copilot are similar. Usage of both routinely violates licenses. Stack Overflow content is licensed under CC-BY-SA. Terms [1]:

* Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

* ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

In over a decade of software engineering, I've seen many reuses of Stack Overflow content, occasionally with links to underlying answers. All Stack Overflow content use I've seen would clearly fail the legal terms set out by the license.

I suspect Copilot usage will similarly fail a stringent interpretation of underlying licenses, and will similarly face essentially no enforcement.

[1] https://creativecommons.org/licenses/by-sa/4.0/


The difference here is that it's hard to sue a company for sporadic, difficult to track down usages of SO content written by their own engineers.

One can now trivially coerce copilot to regurgitate copyrighted content without attribution. Copilot's basic premise violates the CC-BY-SA terms, and this will continue until no party can demonstrate a viable method of extracting copyrighted code.

There is now a single party, backed by a company with a $2 trillion market cap, that can be sued for flagrant copyright violations.


Surely you would have to sue the people using the tool to produce verbatim copies of code, not the creator of the tool?


I would think it's more complicated when the tool is the thing spitting out the verbatim copies of code. Both the tool and the developer are independently distributing copyrighted code that neither of them have the rights to distribute.


Why? One could easily claim that if the tool is reproducing the contents of copyrighted works, it is a "distributor", subjecting the makers of the tool to much higher copyright infringement claims.


Let's differentiate legal risk by the party it affects:

* Companies with engineers using Copilot. Risk here is negligible, like that of copying Stack Overflow answers, or any code that isn't under a truly permissive license like CC0 [1]. Prohibiting use of Copilot in a company based on this risk has no merit.

* GitHub and Microsoft. Risk for them is higher yet worthwhile. Copilot is more like Stack Overflow than Napster. Affected copyright holders added their works to GitHub and agreed to their terms, so GitHub has a legal basis to show that content in Copilot. In terms of facilitating copyright infringement, far more violations occur by engineers manually searching and copying code on GitHub; lawsuits against GitHub due to that would be dismissed. Determining provenance is slightly harder in Copilot than in search, but GitHub could minimize risk to itself by noting in Copilot terms that users must review Copilot's suggestions for underlying license concerns. Engineers rarely will -- they routinely violate licenses of Stack Overflow and code copied from elsewhere -- but that shifts responsibility from GitHub, and legal risk to companies using Copilot remains negligible.

[1] https://creativecommons.org/share-your-work/public-domain/cc...


In addition to other licensing gotchas, a ton of SO snippets are copied wholesale from elsewhere—docs or blog posts. So it's pretty likely that the poster can't license them in the first place because they never checked the source's license requirements.


It just seems bizarre that this wasn’t flagged internally at Microsoft. They have tons of compliance staff.


Maybe we'll even get a sneak peek at Windows 11's source code. Time to start writing a Win32 API wrapper and see what the robot comes up with!


That's because Microsoft doesn't dare use this for production code (presumably).

They are 100% okay with letting their competitors get into legal hot water.


isn't it copilot's liability for "distributing" the copyrighted code?


It’s surely a bit of a liability grey area?


I'd bet they baked in the legal fees and are taking a calculated risk.


Except that CC-BY-SA is not a permissive license; the SA part is a form of copyleft. It's just that nobody enforces it. From the text [1]:

- "[I]f You Share Adapted Material You produce [..] The Adapter’s License You apply must be a Creative Commons license with the same License Elements, this version or later, or a BY-SA Compatible License."

- "Adapted Material means material [..] that is derived from or based upon the Licensed Material" (emphasis added)

- "Adapter's License means the license You apply to Your Copyright and Similar Rights in Your contributions to Adapted Material in accordance with the terms and conditions of this Public License."

- "You may not offer or impose any additional or different terms or conditions on, or apply any Effective Technological Measures to, Adapted Material that restrict exercise of the rights granted under the Adapter's License You apply."

A program that includes a code snippet is unquestionably a derived work in most cases. That means that if you include a Stack Overflow code snippet in your program, and fair use does not apply, then you have to license the entire program under the CC-BY-SA. Alternately, you can license it under the GPLv3, because the license has a specific exemption allowing you to relicense under the GPLv3.

For open source software under permissive licenses, it may actually be okay to consider the entire program as licensed under the CC-BY-SA, since permissive licenses are typically interpreted as allowing derived works to be licensed under different licenses; that's how GPL compatibility works. But you'd have to be careful you don't distribute the software in a way that applies any Effective Technological Measures, aka DRM. Such as via app stores, which often include DRM with no way for the app author to turn it off. (It may actually be better to relicense to the GPL, which 'only' prohibits adding additional terms and conditions, not the mere use of DRM. But people have claimed that the GPL also forbids app store distribution because the app store's terms and conditions count as additional restrictions.)

For proprietary software where you do typically want to impose "different terms or conditions", this is a dead end.

Note that copying extremely short snippets, or snippets which are essentially the only way to accomplish a task, may be considered fair use. But be careful; in Oracle v. Google, Google's accidental copying of 9 lines of utterly trivial code [2] was found to be neither fair use nor "de minimis", and thus infringing.

Going back to Stack Overflow, these kinds of surprising results are why Creative Commons itself does not recommend using its licenses for code. But Stack Overflow does so anyway. Good thing nobody ever enforces the license!

See also: https://opensource.stackexchange.com/questions/6777/can-i-us...

[1] https://creativecommons.org/licenses/by-sa/4.0/legalcode

[2] https://majadhondt.wordpress.com/2012/05/16/googles-9-lines/


Yes. In a past life, after researching the situation, we had to find and remove all the code copied from Stack Overflow into our codebase. I can’t fathom why SO won’t fix the license.

What makes it even worse is that if you try to do the right thing by crediting SO (the BY part), you're putting a red flag in the code showing that you should have known you have to share your code (the SA part).


> I can’t fathom why SO won’t fix the license.

They tried to relicense code snippets to MIT a while back, it was a big mess.


Who really copies stack overflow snippets verbatim? It's usually just easier to refer to it for help figuring out the right structure and then adapt it for your own needs. Usually it needs customization for your own application anyway (variables, class instances, etc).


I don't think I've ever copied code directly from any of the Stack* sites. I generally read all the answers (and comments) and then use what I learn to write my own (hopefully better) code specific to my needs.


Yeah my experience has always been "ohhh that solution makes sense" then I go write it myself

If nothing else this whole copilot thing is helping ease some chronic imposter syndrome


Ha! Well, I think a lot of people copy code from StackOverflow verbatim once at least - including me.

Of course it turned out the code I'd blindly inserted into my project contained a number of bugs. In one or two cases, quite serious ones. This, even though it was the accepted answer.

It was probably more effort to fix up the code I'd copy pasta'd than write it from scratch. Since then I've never copied and pasted from StackOverflow verbatim.


Yeah! I've uh, ... never copied a bit of code into my repo verbatim, right?

yeah right. I wish.

(Not saying every dev does this)


I've copied plenty of Microsoft sample code verbatim, because the Win32 API sucks and their samples usually get the error handling right.

But, I can't think of a single scenario where I've copied something from Stack Overflow. I'm searching for the idea of how to solve a problem, and typically the relevant code given is either too short to bother copying, or it's long and absolutely not consistent with how I want to write it.


"Too short to bother copying"? I copy single words of text to avoid typing and typos. I would never type out even a single line of code when I could paste and edit.


> "Too short to bother copying"? I copy single words of text to avoid typing and typos. I would never type out even a single line of code when I could paste and edit.

Very honest suggestion: learn how to touch type. You can still copy if needed, but your typed input will be much faster.


I'm somewhere between 45-75 wpm. But Ctrl+C Ctrl+V can type 300wpm!

Typing when you could paste is like having that Github Copilot put the right sentence right in front of you and you decide to type over it instead. Not only does it feel like wasted and robotic effort, typing everything leads to RSI.

I'm not sure why people disagree. Another symptom is that I insist on aliases for everything while others type out all the commands every time. Maybe I get distracted by the words when I type and lose my train of thought?


You need to highlight the correct text first and move the cursor to the correct location to paste it. I bet you can type 123123 several times faster than you can highlight that text in this comment and paste it into a reply.


Double click to select a word is fast, and then you are in per word selection mode.


Sure, move mouse to text, double click, ctrl-c, ctrl-v it’s still slower than touch typing one word.


That's fair use.


Same here. I copy boilerplate code for new projects etc. regularly. But I don't remember copying anything verbatim from SO. Function, argument and variable names rarely fit the scheme used in the particular project I'm working on at that moment and usually I do a better job at adapting the code thinking what I'm doing rather than just copy and paste and then wonder what went wrong.


I think I did a few times, usually for languages that I wasn't going to spend too much time with (so no benefit in figuring out how to do it from the answers) and for specific tasks.


Anything posted to Stack Overflow has a specific (Creative Commons IIRC) license associated with it. The same is not true of GitHub Copilot, and in fact their FAQ doesn’t specify a license at all, probably because they are technically unable to since it is trained on a wide variety of code from differing licenses (and code not written by a human is currently a grey area for copyright). The FAQ simply says to use it at your own risk.


Google (and most other big tech companies, I guess?) also explicitly prohibits employees from using Stack Overflow code snippets.


I tried Googling this and couldn't find it. I also don't want to believe it because it seems like the world suddenly turned into an apocalyptic hellscape with no place for developers like me. Do you have a source?


First, I work at Google, and its onboarding training explicitly mentions Stack Overflow as a forbidden example due to the CC-BY-SA license (SA is the problematic part). The following link is the official reference.

https://opensource.google/docs/thirdparty/licenses/#restrict...


I work at Google.

SO definitely comes up during copyright/IP training.

The basic idea is 'reading SO answers to learn how to solve a problem is fine, copying/transcribing the code is not'.

Google is quite paranoid re. copyright and licenses.


I don't have a source to link, but I've also been told this by someone who works at Google. Is copy-pasting stuff verbatim from SO really that much of a thing? I use SO plenty, but have never considered taking anything verbatim.


That's actually an attack vector: mirror SO using their open-sourced DB and inject malware into the suggestions, or change the text before it enters the clipboard. People blindly copy/pasting aren't going to notice.


Same here. I’ve directed our teams and infra managers that we must be able to block the use of copilot for our firm’s code.

I'd be very surprised if the other large enterprises that I have worked at aren't doing exactly the same thing. Too much legal risk, for practically no benefit.


No-one cares about this. People have no clue about licenses and just copy-paste whatever. If someone gets access to their code and sees all the violations, they're screwed anyway.


Ask your legal department about that. Sure, engineers don't care about licensing at all, but we are not the only players here.


Are legal departments in the habit of reviewing all code line by line? Seems like that would be cost prohibitive...


Obviously they aren't, but just as obviously, "the legal department didn't review this, therefore it's safe to assume it's legal" would not pass muster with said legal department. :) Kiro's comment ("if someone gets access to their code and sees all the violations they're screwed anyway") is probably technically accurate, even if in practice you're unlikely to get caught. As other people have noted elsewhere in the comments here, the Google v. Oracle case over Java definitely suggests that verbatim copying of just a few lines, even for trivial functions, is enough to get you in trouble if those lines aren't licensed in a way that lets you do that.


No, legal comes up to you and says stuff like:

"You guys aren't using any free software are you? Because you can't do that."

"You mean copying software source code without respecting the license, right? Because we absolutely respect all licenses fully."

"No I mean you can't use Free software! It's a clear management directive! What are you doing?!?"

"Is that an apple laptop you're using there? Ever had a look at the software licenses for it?"

Legal are generally idiotically ignorant about the real issues. Whose fault that is we can argue about.


This is absolutely not true. While some individuals might not care and might not always conform to their companies' policies, most companies have policies, and most employees are aware of and mindful of these policies.

It's absolutely the case that before using certain libraries, most engineers in large corporations will make sure they are allowed to use that library. And if they don't, they are doing their job very badly IMO.


This kind of sucks, honestly; copying and pasting without understanding has led to all sorts of issues in IT. Not to mention legal issues, as mentioned by another reply.


Not only this, but a huge amount of publicly available code is truly terrible and should never really be used other than as a point of reference or guidance.


I think that a proper coding assistant should help with not writing code (and I stress that it is "not writing code") - how to rearrange your code base for new requirements, for example.

Code not written does not have defects, does not need support and, as you point out, is not a liability.


Seems like the liability should also be on Copilot itself, as a derivative work.


The practical utility will outweigh the legal concerns. Engineers using this are going to be more productive and this is a competitive advantage that companies won't eschew.


If the legal concerns are well-known, then what you are describing might be viewed as criminal negligence (at worst) and or insufficient duty of care (at best). Such engineers should be held fully responsible and accountable for their actions.


It seems like the risk is somewhat exaggerated because even when people get bad autocomplete results, they mostly won’t use them.


That's optimistic. The people who would rely heavily on this sort of thing are going to be the worst at detecting what a "bad autocomplete result" would look like. But even if you are capable of judging that you've got a good one, it still doesn't inform you of the obvious potential licensing issues with any bit of code.

Surely somebody working on this project foresaw this problem…


If they get rid of licensed stuff it should be ok no? I really want to use this and seems inevitable that we'll need it just as google translate needs all of the books + sites + comments it can get a hold of.


Well... the whole training set is licensed, so you can't really get rid of it. I think that the technology they are using for this is just not ready.


Just retrain the model using properly licensed code? ("just" is doing a ton of heavy lifting, but let's be real, that's not impossibly hard)


There aren't many licenses that let you reuse code without including the same headers / licensing blurb. You're in public domain, non-copyleft territory. WTFPL etc.


There is no such thing as properly licensed code, because it is a function of what is legally acceptable for your company and what it intends to do with the work.


Unlicensed code just means “all rights reserved.” You’d need to limit it to permissively licensed code and make sure you comply with their requirements.


That would be a long list of restated license terms and attributions.


Which licenses would it be ok that the training material is licensed under, though? If it produces verbatim enough copies of eg. MIT licensed material, then attribution is required. Similar with many other open source-friendly licenses.

On the other hand, if only permissive licenses that also don't require attribution are used, well, then for a start, the available corpus is much smaller.


The overwhelming majority of code on GitHub, even code under permissive licenses, requires attribution of the original authors.


How would they do that?


Read the LICENSE file in each repo.


What guarantees it’s intact?


It doesn't need to be. If the license isn't positively exactly permissive then you can't use it.


Can you even trust that the License in a random repo is accurate and expresses the actual copyright of all the contained code?

I guess my point is, you can't be positive that even if you're following the license in a repo you forked that the repo owner hasn't already violated someone else's license, and now transitively, so have you.


> Can you even trust that the License in a random repo is accurate and expresses the actual copyright of all the contained code?

In fact, that seems to be exactly the problem shown in the tweet - someone copy-pasted the quake source and slapped a different license on it, and copilot blindly trusted the new license.


Is it still a legal concern if I'm just coding because I want to solve a problem and I'm not trying to use it to do business?


Yes: not all code on GitHub is licensed in a way that lets you use it at all. People focus on GPL as if that were the tough case; but, in addition to code (like mine) under AGPL (which you need to not use in a product that exposes similar functionality to end users) there is code that is merely published under "shared source" licenses (so you can look, but not touch) and even literally code that is stolen and leaked from the internals of companies--including Microsoft!... this code often gets taken down later, but it isn't always noticed and either way: it is now part of Copilot :/--that, if you use this mechanism, could end up in your codebase.


If you publish the code anywhere, potentially. You could be (unknowingly) violating the original license if the code was copied verbatim from another source.

How much of a concern this is depends heavily on what the original source was.


And the problem with copilot is that you have no way of knowing. If it changes even a little bit of the code, it's basically ungoogleable but still potentially in violation.


Distributing binaries to third parties is enough to trigger a license violation. For internal corporate tools, it would be less of an issue as "distribution" hasn't happened.


From the Copilot FAQ:

> The technical preview includes filters to block offensive words

And somehow their filters missed f*k? That doesn't give a lot of confidence in their ability to filter more nuanced text. Or maybe it only filters truly terrible offensive words like "master".


In my testing of Copilot, the content filters only work on input, not output.

Attempting to generate text from code containing "genocide" just has Copilot refuse to run. But you can still coerce Copilot to return offensive output given certain innocuous prompts.


Ahh, so it's the most pointless interpretation of the phrase "filters to block offensive words", where it is stopping the user from causing offense to the AI rather than the other way around.


I believe the concept is to stop users from prompting the AI to generate offensive stuff specifically, and then publishing the so-generated stream of offensive stuff as negative PR for GitHub, in the same way the generated stream of offensive stuff coming from Microsoft’s AI was a big PR disaster.


I suppose you're referring to the AI Twitter bot that initially was very lovely and that 4chan had turned into a nazi within a day. That was both very naive and hilarious.

https://spectrum.ieee.org/tech-talk/artificial-intelligence/...

The big difference in this case, however, is that this AI was constantly learning based on user input, which I do not think is the case for Copilot.


Copilot is indeed constantly learning based on user input (as detailed here [1]) but it seems to be more high-level ("did the user accept or deny suggestion XYZ" and potentially what changes you make to suggestions after accepting) versus just dumping everything directly back into the model a la Tay.

[1] https://docs.github.com/en/github/copilot/about-github-copil...


Maybe, but even if so, filtering the output would also prevent this.


Easily offended AI is exactly what the world needs


We have too many easily offended NPC's as is.


Yeah. If only we could cure all those delusional righties fantasizing about being the protagonist in their own personal RPG.

eyeroll


They probably don't want to repeat Microsoft's incident with Tay, though they seem to have created their own incident which dooms the product, if it wasn't doomed already.


Interesting how this continues to be an issue for GPT3 based projects.

A similar thing is happening in AI Dungeon, where certain words and phrases are banned to the point of suspending a user's account if they are used a certain number of times, yet it will happily output them when they are generated by GPT-3 itself, and then punish the user if they fail to remove the offending pieces of text before continuing.


Lol, how does that make any sense? I mean, all these word blacklists are always pretty stupid, but at least you can usually see the motivation behind them. But in this case I'm not even sure what they tried to achieve, this is absolutely pointless.


GitHub is owned by Microsoft, which has some relevant experience with publicly accessible AI (Tay): https://en.m.wikipedia.org/wiki/Tay_(bot)


Maybe Github just doesn't have many repos to control death factories and execution squads?


does it also censor "lesbian"?


Blocks offensive words, but doesn't block carefully crafted malware.


[flagged]


Changing master to main was something Github did when they were taking heat for their contract with ICE. It was a nice bit of misdirection that cost them nothing, achieved nothing and garnered praise in some quarters.

ICE, of course, runs an actual concentration camp which has a slightly more troublesome history than the word master.

Language policing is to racism what recycling is to global warming - an attempt to shift the focus away from elite responsibility for systemic issues to "personal responsibility" and forestall meaningful reform by placing emphasis on largely non-threatening symbolic gestures.


I get what you mean, but in a discussion about semantics it might be unhelpful to dilute the term "concentration camp", especially if prefixed with "actual" in italics. That is unless you actually mean that ICE camps serve the same purpose and are equivalent to nazi concentration camps.


Nazi “concentration camps” were not actual concentration camps (a thing which long predates the Nazi camps), they were extermination camps for which “concentration camp” was a minimizing euphemism.

US WWII “internment” and “relocation” centers were actual concentration camps (“relocation center” was itself a euphemism, but “internment” referred to a formal legal distinction impacting treaty obligations.)


Why is this downvoted... It's simply the truth.


There were only a handful of mass extermination camps. There were tens of thousands of concentration camps. https://encyclopedia.ushmm.org/content/en/article/nazi-camps....


Sure, but I don't know if I've ever heard anyone use the term "concentration camp" without qualifiers to refer to anything else than the nazi concentration camps (or something equivalent).

If someone says that something is "_literally_ a concentration camp" I think that most people will think of ovens and genocide.

Perhaps it's a regional thing, but that is how I interpreted it.


It's not so much a regional as a political thing. Want it to sound worse? Use concentration camp. Want it to sound better? Use internment camp (or in some cases, re-education facility).


Or “Reserve”.


Relevant to that, the US WWII internment camps were placed on reservation land, taken with disputedly adequate compensation for the use.


The Nazis ran what would more accurately be termed extermination camps.

Though what they did certainly bore a strong resemblance to the Boer war concentration camps/manzanar,etc. whose purpose was to "concentrate" people into one place rather than industrially slaughter them.


> I don't know if I've ever heard anyone use the term "concentration camp" without qualifiers to refer to anything else than the nazi concentration camps (or something equivalent).

Maybe it's just me, but I think it would have been more clear if you said internment camp if your intent was to refer to the broader context and not invoke a comparison to nazis.


Wikipedia redirects concentration camp to:

https://en.wikipedia.org/wiki/Internment

Where it also makes the point that the nazi camps were primarily extermination camps.

Maybe take it up with them and get back to me if you feel truly passionate about this issue.

>Maybe it's just me, but I think it would have been more clear

Gosh, it's awfully ironic that this sentence would happen in a thread about how language policing is used as a distraction from important issues.

Is it more important to you how people use the term concentration camp or the fact that ICE lock up children in internment/concentration/[ insert favorite word here ] camps?


> So, is it more important to you how people use the term concentration camp or the fact that ICE lock up children in internment/concentration/[ insert favorite word here ] camps?

Well, that escalated quickly.

I don't think I ever said anything for or against what ICE is doing, in fact I tried not to because the only thing I wanted to say was that when using the words "literally concentration camps" people might read that as "camps designed to kill people" since that is the way I've been taught it (in history classes) and heard it (in general use).

I don't even live in the US so I have no say in this in a democratic sense. If I did I'd be against the way migrants are treated and want more humane treatment, but I don't think that should be relevant to what I said.


Your primary worry was that somebody might read that sentence and believe that the US is gassing immigrants?

Seems unlikely.


You seem to think I have some political motive, I don't. I just saw a comment that from my perspective and historical education seemed to equate two things that I regard as different and said that it might be helpful to not conflate those. It seems like you did not intend to conflate them and it is a difference in what you and I read into the term "actual concentration camp".

From my perspective this conversation is as if someone said "working for XCompany is actual slavery" and I said "Perhaps don't use 'actual slavery' as a term for something that isn't that?"


Can’t speak for parent, but when I describe the immigration jails as “concentration camps” I do have a political motive. There was a political motive in calling them “immigration facilities” in the first place. I simply want to call them for what I believe they are. When describing the nazi camps I say “death camps”, as to not conflate the infinitely worse horror of the nazi camps.

There is usually a political motive behind what controversial things are called. There was an active push from the oil lobby to swap out the term for the climate disaster from "global warming" to the more innocent-sounding "climate change". Then recently some media companies made the political decision to start using the term "climate disaster" or "climate crisis".

> working for XCompany is actual slavery

I don't think this is equivalent (even though it sounds like it to your ears). The ICE facilities can accurately be described as concentration camps. The victims are kept against their will—i.e. imprisoned—in dedicated camps. This is an accurate term. Slavery only applies when you are forced to work for little or no salary. I.e. I often use the word actual slavery when referring to prison labor. This is a political decision on my part. And you are free to criticize this choice of word. And you would be right to say it diminishes the term when compared to the horrible chattel slavery in the Americas until the 19th century. But it is still an accurate term.


>You seem to think I have some political motive

I'm not really sure what your motive for trying to police my language was. All I know is that the reasons you have given me all seem rather unlikely.


Historians themselves call what ICE is doing a concentration camp. So your experience is very much localized.


It seems like a distinction without a difference; this article, for example, uses them interchangeably: https://www.commondreams.org/views/2019/06/21/brief-history-...


I've heard it used in place of internment camp. Though I honestly associate it with the Nazis too.


To be correct, both existed.

A camp like Ravensbrück was a concentration camp (for women) while Auschwitz-Birkenau was both a concentration and extermination camp.

https://upload.wikimedia.org/wikipedia/commons/b/be/WW2_Holo...


y'know it really seems like both purpose and outcome need to be closely examined here, if we're going to be emphasizing actual next to concentration camps.

what's the paradigm of a concentration camp? if we go straight for Auschwitz we'll get nowhere, how about the Boer concentration camps? Origin of the term after all.

What was the purpose? To concentrate the Boer population during a total war against them, so they couldn't supply and hide the belligerents.

What was the outcome? Tens of thousands of preventable deaths, mostly from disease. Success in the war, from the British perspective.

So, let me turn my spectacles to your example of, may I quote?

> an actual concentration camp

Which appears to be a migrant detention center. To put it succinctly, migrants who enter the country without filling out paperwork, and get caught, end up in one of these places for months-to-years while USG figures out what to do with them.

So a Boer concentration camp is filled by the British riding into a farmstead or town, kidnapping the women and children, and driving them out to a field and sticking them in a tent. A migrant detention center is filled when someone enters the United States without following the rules which govern that sort of behavior, and then gets caught.

Where is the war?

Where is the excess death?

Ah well. I'm out of time and patience to express my contempt for your abuse of language and disrespect for the real horrors which you cheapen with this kind of facile speech.

Enjoy the 4th of July.


Your vacuous argument about what is an _actual concentration camp_ is out of place. This wasn't a discussion about concentration camps, it was about github's attempted misdirection, and their facetious show of supporting inclusion, by eliminating the term "master".

https://news.ycombinator.com/item?id=26487854


> Where is the war?

Central America, where the US has a history of funding politically aligned factions in conflict, contributing to today's instabilities which drive people to our border.

> Where is the excess death?

Mostly Central America, but keeping people in crowded stressful conditions during a pandemic can probably account for a few more.

> To put it succinctly, migrants who enter the country without filling out paperwork, and get caught, end up in one of these places for months-to-years while USG figures out what to do with them.

The US first funds wars in these people's home countries, then refuses to let them in when they want to live in a safer country, then rounds them up and puts them in camps because they came in anyway. That sounds like a concentration camp to me, even if the death rate is lower than other instances of the same thing.

Pedantic attacks on word usage is fun, though. Let me try it:

> ICE, of course, runs an actual concentration camp which has a slightly more troublesome history than the word master.

The history of the word "master" includes the trans-Atlantic slave trade and the institution of slavery in the US, a well known and extremely long running atrocity. That doesn't seem less troublesome than current ICE activity.

But if you consider what is being said, instead of seizing on exact wording, you can see the points that the action of running these camps is more important than the choice of word used to describe a code repo, and that as a form of concentration camp, the camps are bad - not that slavery wasn't a big deal (my facetious straw man), and not that the US are the British and the detainees are the Boers.

A few responses seem to focus on two of the words chosen in that post, and ignore this point about misplaced focus on choice of words.


Is this an indirect way of saying that you support ICE?

Coz if so I'd really rather hear it straight rather than indirectly via an attempt to police my language.


Pretty sure they were being sarcastic. I also don't find your arguments persuasive in the slightest, and I find myself being skeptical of these recent moral outcries. I'm skeptical of their sincerity, and I don't buy it. "Master" has an etymological background far more diverse than the dichotomy with "slave". I can wholeheartedly say that I've not once thought to make that association. It's been a title for centuries. Master blacksmith, etc. (See https://en.wikipedia.org/wiki/Master for a list)

Another example of what seems like a fake moral outcry is "blackface". And I mean what it is being referred to as now, not the actual meaning. The racist ridicule by stereotyping ethnicity. That was "Blackface". Yet, for some reason, context doesn't matter anymore, and we end up removing episodes of Community because someone painted their face in a cosplay of a dark elf, in exact commentary of this.

There is significant systemic racism in the US that affects almost everything. In order to deal with those things, the very first thing would be to properly be able to identify racism. Context matters. Renaming "master" branches is not progress. Ostracising a kid for dressing up as Michael Jackson isn't it.

Whenever I see outrage over such things I cynically think that the person is probably white, and probably doing it for attention. One thing is for sure, it only serves to detract from the real issues.


Check out the recent Marc Rebillet stream with Flying Lotus and Reggie Watts. They absolutely destroy the bs around the use of the word master. I think both FL and RW will be quite representative of how African Americans (and the rest of the world) feel about this.


Do you have a timestamp? As enjoyable as it is to listen to each of them, the stream was mostly music and almost two hours long.


The next couple of minutes from here https://youtu.be/0J8G9qNT7gQ?t=3984


> For co-workers not familiar with the history of slavery in the United States, there is always a pause, and then some confusion about the changes. After explaining the historical context, 99% of people reply: "Oh, I understand. Thank you to explain."

Most people answer like this when they realize you are an unreasonable person who refuses to listen. Happens all the time, like "Oh, I understand (you are one of those). Thank you for explaining!", and they remember that they need to stop using this word when working with you.


By rolling your eyes you accept my terms and conditions.


Imagine 'explaining' the historical context to someone from say, Brazil.


The word master has many usages. One specific context (master/slave) is inappropriate, but that doesn't mean every other context is unusable now.

Github changing master->main was the epitome of virtue signaling. This literally does not affect black people at all, nor does it do -anything- to help with racial inequality in the US. It's actually quite patronizing and tone-deaf to think that instead of all the things -Microsoft- could be doing to help racial inequality, they're putting in as little effort as possible.

Congrats on granting power over words to unreasonable people who ignore things like context in language and common sense.


while the word 'master' can indeed be used in the sense of "master and slave", its use in git is more akin to the use of 'master' in "master record", and doesn't refer to 'ownership' in any way


Everyone has a line of how much they are willing to change their language, though. There will always come a point where someone will think some change is "silly", even though the old term may have upset some people. And almost every term has some sort of baggage associated with it.

There was a post going around somewhere of a college's earnest attempt to change some language (like avoiding "give it a shot" because of the association of "shot" with guns). Would renaming all the various things we call "triggers" be ok, so we don't upset victims of gun violence?

So the master->main change was the line for some people, not others.


As a matter of principle I don't think we should be moving towards ignoring any and all contexts of words. Granting this power of word banning to random arbiters is quite crazy. In this case, master was moreso changed because it -could- be deemed offensive, not that it -actually- is offensive by itself. Not one person that I've spoken to about it has actually cared.

Words having multiple usages is not really a novel concept. If we ban words based on them potentially being offensive, we'll end up with no words at all as people move onto using different words, and so forth.

It is not silly to have pushback when someone wants to grant themselves power over language usage. Dropping usage of a word should have a strong, tenable argument and larger community support than 0.00000001% of people caring.


> Everyone has a line of how much they are willing to change their language, though.

But that line is constantly moving though. People are forced to adapt, or they are ostracised socially and economically.

If prestigious organisations, people and institutions decide "master/slave" is an immoral thing to say, I have no choice. Eventually I'll need to fall in line or my livelihood will be at risk.


Why is there no push-back against using the word "slave", which originates from the word "Slav" due to the enslavement of Slavic people?

By analogy, you are basically using the word African to mean "a person in possession of someone else".

https://www.etymonline.com/word/slave

@edit The fact that people downvote this highlights that the whole issue is just virtue signaling.


> After explaining the historical context, 99% of people reply: "Oh, I understand. Thank you to explain."

A similar percentage then think to themselves, privately, "well that's pretty stupid."


And in my historical context, the power fists that your ideology uses were used, in the past 100 years, by a regime that murdered millions.


I don't work in USA and I don't intend to. Your history of slavery is none of my concern, especially when I'm just trying to do my work.

The word 'master' is useful for me, and I don't believe for a nanosecond that anyone, American or not, is ACTUALLY offended by it. I believe that some people (mostly affluent white Americans) are searching for things that they think they SHOULD be offended by.


From the GPLv2 licensed code:

https://github.com/id-Software/Quake-III-Arena/blob/master/c...

Copilot repeats it almost word for word, including comments, and adds an MIT-like license at the top.
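
For reference, the snippet in question is the famous fast inverse square root function. Roughly as it appears in the linked q_math.c (reproduced from memory, minus the original's platform-specific preprocessor guards, so formatting may differ slightly), sweary comment and all:

    float Q_rsqrt( float number )
    {
        long i;
        float x2, y;
        const float threehalfs = 1.5F;

        x2 = number * 0.5F;
        y  = number;
        i  = * ( long * ) &y;                       // evil floating point bit level hacking
        i  = 0x5f3759df - ( i >> 1 );               // what the fuck?
        y  = * ( long * ) &i;
        y  = y * ( threehalfs - ( x2 * y * y ) );   // 1st iteration
    //  y  = y * ( threehalfs - ( x2 * y * y ) );   // 2nd iteration, this can be removed

        return y;
    }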


Actually, the indentation of the first comment and the lack of preprocessor directives show it's not copied from this code directly but from Wikipedia (https://en.wikipedia.org/wiki/Fast_inverse_square_root#Overv...). So it could be that the Quake source code is not part of the training set but the Wikipedia version is.


While I strongly doubt they would use Wikipedia as a training set, has anyone done a search of GitHub code to see if other projects have copied-and-pasted that function from Wikipedia into their more-permissive codebases?


Almost 2000 results for one of the comment lines. I'm not going to read through those or check the licenses, but I think it's safe to say that block of code exists in many GitHub code bases, and it's likely many of those have permissive licenses. Given how famous it is (for a block of code) it's not unexpected.

https://github.com/search?q=%22evil+floating+point+bit+level...

A question that popped into my head is: if the machine sees the same exact block of code hundreds of times, does that suggest to it that it's more acceptable to regurgitate the entire thing verbatim? Not that this incident is totally 100% ok, but if it was doing this with code that existed in only a single repo that would be much more concerning.


> if the machine sees the same exact block of code hundreds of times, does that suggest to it that it's more acceptable to regurgitate the entire thing verbatim?

From a copyright standpoint, quite possibly. This is called the "Scènes à faire" doctrine. If there are some things that have to be there in a roughly standard form to do a standard job, that applies.

[1] https://en.wikipedia.org/wiki/Sc%C3%A8nes_%C3%A0_faire


This would need to first be tested in court; apparently Microsoft is happy generating thousands (or millions) of violations, knowing most programmers don't enforce their copyright.


It is probably based off GPT-3 with a layer on top trained for programming specifically, like what is done with AI dungeon.


Wait until people on the toxic orange site find out what has happened to AI Dungeon.


I'm out of the loop.



I don't get it, that seems like standard fare for an R-rated movie? And then it seems like some complained because they decided to start editing it down to a PG-13 movie?


Essentially, from my understanding: there was a data leak they never commented on; they instituted a poorly made content filter without saying anything; the filter frequently has false positives and negatives; someone discovered they trained the game using content the filter was designed to block, meaning the AI itself would frequently output filter-triggering stuff; more people found out their private unpublished stories were being read by third parties after a job ad, and the stories were posted on 4chan; people recognized stories they wrote that had triggered the filter among those posted; and then they started instituting no-warning bans.

I might have missed something, but that's the gist of it.


Also, before and while all this was going on, the quality of the AI's output has been steadily dropping to the point where NovelAI.net now generates what's in many ways better writing.

That's GPT-J-6B, to be clear. A 6-billion-parameter model is producing better output than a 300 billion parameter model, because of what I can only assume to be sheer incompetence on AI Dungeon's part. I've also used the raw GPT-3 API, and it does better at writing than either. In other words: Doing nothing would have been better than whatever they've been doing.


It's pre-trained, partially, on Wikipedia. GPT-2 did this sort of thing all the time: it's native to the architecture to surface examples from the fine-tuning training set by default.


It’s GPT though and the GPT models were trained on data from Wikipedia


This exact code is all over GitHub, >1k hits

https://github.com/search?q=%22i++%3D+%2A+%28+long+%2A+%29+%...


Then they all copied from the same source, and what's more, they are all derivative of a GPL work (and thus they should be GPLed themselves)


That will make a great defense in a copyright court.

"Your honor, I would like to plead not guilty, on the basis that I just robbed that bank because I saw that everyone was robbing banks in the next city."

...on the other hand, that was the exact defense tried for the Capitol rioters. So I don't know anything anymore.


I guess this confirms John Carmack to be an AI


Apparently Carmack was not the original author, the origin I believe is SGI somewhere in the deep dark 90s.


It was an optimization for a fluid simulation originally.


I get why some people were saying it made them a better programmer. Of course it did, it's copy-pasting Carmack code.


It's up to the end user to accept the suggestions.


Good luck checking every code line for license violations


There will be a VSCode extension for that.


It's impossible to automate checking for code license violations.

If you and I write the exact same 10 lines of code, we both have independent and valid copyrights to it. Unlike patents, independent derivation of the same code _is_ a defense for copyright.

If I write 10 lines of code, publish it as GPL (but don't sign a CLA / am not assigning it to an employer), and then re-use it in an MIT codebase, I can do that because I retained copyright, and as the copyright holder I can offer the code under multiple incompatible licenses.

There's no way for a machine to detect independent derivation vs copying, no way for the machine to know who the original copyright holder was in all cases, and whether I have permission from them to use it under another license (i.e. if I email the copyright holder and they say 'yeah, sure, use it under non-gpl', it suddenly becomes legal again)...

It's not a problem computers can solve 100% correctly.


I should have added /s to highlight that I was being sarcastic. Sorry.


Same trust issue


It's people for your lawyers to blame, all the way down!

/s


It's the same problem as with self-driving cars: who gets sued? The company that provides the service/car, or the programmer/driver? I think the latter.


SaaS idea: code linter, but for licenses.


That's one of blackduck's offerings: https://www.synopsys.com/software-integrity/open-source-soft...

At a previous job we had an audit from them; it seemed to not be too accurate, but probably good enough for companies to cover their asses legally.


Extend Fossology: https://www.fossology.org/


And it's up to the end user to evaluate the tool that makes the suggestions.


This is true but doesn't change the problem that copilot itself is potentially distributing unlicensed copyrighted material. This isn't necessarily a problem for you as a developer though.


As someone who does code reviews, the thought that the developer didn't write the code submitted to be merged would never have crossed my mind.


And it is completely impossible for the user to do so.

So, the tool is worthless if you want to use it legally.


Doubtful.

You can be almost certain it’s being widely used or will be widely used shortly.

The conversations around copilot are eerily similar to the conversations around the first autocomplete tools


It's more like a writer using an autocomplete tool to write the first chapter to their novel.


As someone who gets paid to write code (nominally) and has also written a few novels, I don't agree with this characterization. From what I've seen of Copilot, it's more like having a text editor generate your next sentence or paragraph^[1]. The idea (as I see it) is that you might use it to generate some prose "boilerplate", e.g. environmental descriptions, and hack up the results until you're satisfied.

It's content generation at a fragmentary level where each "copied" chunk does not form a substantive whole in the greater body of the new work. Even if you were training it on other authors' works rather than just your own, as long as it wasn't copying distinctive sentences wholesale, I think there's a strong argument for it falling under fair use--if it's even detectable.

On the other hand, if it regurgitated somebody else's paragraph wholesale, I don't think that would be fair use. Somewhere in-between is where it gets fuzzy, and really interesting; it's also where internet commenters seem to prefer flipping over the board and storming out convinced they're right to exploring the issues with a curious and impartial mind. I see way too much unreasoned outrage and hyperbolic misrepresentation of the Copilot tool in these threads, and it's honestly kind of embarrassing.

As far as this analogy goes, it's worth noting that the structure of a computer program doesn't map onto the structure of a piece of fiction (or any work of prose) in a straightforward way. Since so much of code is boilerplate, I would (speculatively, in the copyright law sense) actually give more leeway to Copilot in terms of absolute length of copied chunks than I would for a prose autocompleter. For instance, X program may be licensed under the GPL, but that doesn't mean X's copyright holder(s) can sue somebody else because their program happened to have an identical expression of some RPC boilerplate or whatever. It would be like me suing another author because their work included some of the same words that mine did.

^[1] At least one tool like this (using GPT-3) has been posted on HN. At this point in time I wouldn't use it, but I have to admit that it was sort of cool.


> ^[1] At least one tool like this (using GPT-3) has been posted on HN. At this point in time I wouldn't use it, but I have to admit that it was sort of cool.

Have a poke at novelai.net if you get a chance.

It's... not very smart. It's pretty decent at wordcrafting, though, and as an amateur writer I find it invaluable for busting writer's block. Probably if you spend all day writing fiction you'll find ways around that, but for me the solution has become "Ask the AI to try".

It'll either produce a reasonable continuation, or something I can look at and see why it's wrong. Either is better than a blank page.


In case you're interested, this is the post I was thinking about: https://news.ycombinator.com/item?id=27032828

The application itself is called "Sudowrite". I guess there are probably a bunch of them at this point.


That does not seem like a response to what I just said?

I said that it is impossible for the user to check that the code copilot gives is OK, license-wise, and therefore, they can not be sure that it is legally OK to include in any project.


Another fascinating one: an "About me" page generated by Copilot links to a real person's GitHub and Twitter accounts!

https://twitter.com/kylpeacock/status/1410749018183933952


That's bonkers. And the beauty of it is that now someone could realistically do a GDPR Erasure request on the Neural Net. I do hope that they're able to reverse data out.


Since the information is encoded in model weights, I doubt that erasure is even possible. Only post-retrieval filtering would be an option.

It only goes to show that opaque black-box models have no place in the industry. The networks leak information left and right, because it's way too easy to just crawl the web and throw terabytes of unfiltered data at the training process.


If this system includes personal information that cannot be removed, corrected or controlled, it's probably a gross violation of all European and some American privacy laws.

Designing a system that you cannot control does not grant you legal immunity for whatever the system does. As Github operates inside the EU, personal information this system contains MUST be deleteable, correctable and retrievable, or it's simply illegal.


I think the fact that there's no way to delete the data in question without throwing away the entire model is a feature...

The strategic goal of a GDPR erasure request would be to force GitHub to nuke this thing from orbit.


> Only post-retrieval filtering would be an option.

And illegal, if the original information remains.

I assume that there must be a process for altering the training data set and rerunning the entire thing.


The problem is that the information is in an opaque encoding that nobody can reverse engineer today. So it's impossible to prove that a certain subset of data has been removed from the model.

Say, you have a model that repeats certain PII when prompted in a way that I figure out. I show you the prompt, you retrain the model to give a different, non-offensive answer. But now I go and alter the prompt and the same PII reappears. What now?


Yes, but the compute costs required for training are probably in the range of hundreds of thousands of usd to potentially millions of usd. Not to mention potentially months of training time.


Good thing “compliance is really expensive” isn’t a valid legal defense.


I think copilot is solving the wrong problem. A future of programming where we're higher up the abstraction tree is absolutely something I want to see. I am taking advantage of that right now -- I'm a decently good programmer, in the sense that I can write useful, robust, reliable software, but I'm pretty high up the stack, working in languages like Java or even higher up the stack that free me from worrying about the fine details of memory allocation or the particular architecture of the hardware my code is running on.

Copilot is NOT a shift up the abstraction tree. Over the last few years, though, I've realized that the concept of typing is. Typed programming is becoming more popular and prominent beyond just traditional "typed" languages -- see TypeScript in JS land, Sorbet in Ruby, type hinting in Python, etc. This is where I can see the future of programming being realized. An expressive type system lets you encode valid data and even valid logic so that the "building blocks" of your program are now bigger and more abstract and reliable. Declarative "parse don't validate"[1] is where we're eventually headed, IMO.

An AI that can help us to both _create_ new, useful types, and then help us _choose_ the best type, would be super helpful. I believe that's beyond the current abilities of AI, but can imagine that in the future. And that would be amazing, as it would then truly be moving us up the abstraction tree in the same way that, for instance, garbage collection has done.

[1] https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...
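
To make "parse, don't validate" concrete, here's a minimal Python sketch (all names are hypothetical, not from any particular library): you parse untrusted input once, at the boundary, into a type that can only hold valid data, and everything downstream takes that type instead of a raw string.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class EmailAddress:
        """A value that, by construction, holds a plausible email address."""
        local: str
        domain: str

    def parse_email(raw: str) -> EmailAddress:
        # Parse once at the boundary; downstream code takes EmailAddress,
        # not str, so it never has to re-validate.
        local, sep, domain = raw.partition("@")
        if not sep or not local or "." not in domain:
            raise ValueError(f"not an email address: {raw!r}")
        return EmailAddress(local, domain)

    def send_welcome(addr: EmailAddress) -> None:
        # No validation here: the type says it already happened.
        print(f"sending welcome mail to {addr.local}@{addr.domain}")

The check itself is deliberately naive; the point is only where the checking lives (in the type), not how thorough it is.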


A taller abstraction tree makes tradeoffs of specialization: the deeper the abstractions, the more one has to understand when the abstractions break or when one chooses to use them in novel ways.

This is something I'm interested in regarding this approach... When it works as intended, it's basically shortening the loop in the dev's brain from idea to code-on-screen without adding an abstraction layer that someone has to understand in the future to interpret the code. The result is lower density, so it might take longer to read... Except what we know about linguistics suggests there's a balance between density and redundancy for interpreting information (i.e. the bottleneck may not be consuming characters, but fitting the consumed data into a usable mental model).

I think the jury's out on whether something like this or the approach of dozens of DSLs and problem-domain-shifting abstractions will ultimately result in either more robust or more quickly-written code.

But on the topic of types, I'm right there with you, and I think a copilot for a dense type forest (i.e. something that sees you writing a {name: string; address: string} struct and says "Do you want to use MailerInfo here?") would be pretty snazzy.
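
A toy sketch of how such a suggestion could work (this assumes nothing about how Copilot is actually built; MailerInfo and the registry are hypothetical): compare the shape of the struct being typed against the shapes of types already in the project.

    from typing import Dict, Optional

    # Hypothetical registry of project types, each described by its
    # field-name -> field-type shape.
    KNOWN_TYPES: Dict[str, Dict[str, str]] = {
        "MailerInfo": {"name": "str", "address": "str"},
        "GeoPoint": {"lat": "float", "lon": "float"},
    }

    def suggest_existing_type(fields: Dict[str, str]) -> Optional[str]:
        """Return the name of a known type with the same shape, if any."""
        for type_name, shape in KNOWN_TYPES.items():
            if shape == fields:
                return type_name
        return None

    # The editor sees you typing {name: string; address: string} and asks:
    print(suggest_existing_type({"name": "str", "address": "str"}))  # MailerInfo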


Yeah, but generating tons of stupid verbose code that nobody will be able to read and understand is more fun. Also, your superiors will be sure you are a valuable worker if you write more code.


I may be over-reading, but I think this kind of example not only demonstrates the pragmatic legal issues, but also the fundamental weaknesses of a solely text-oriented approach to suggesting code. It doesn't really seem to have a representation of the problem being solved, or the relationship between things it generates and such a goal. This is not surprising in a tool which claims to work at least a little for almost all languages (i.e. which isn't built around any firm concept of the language's semantics).

I'd be much more excited by (and less unnerved by) a tool which brought program synthesis into our IDEs, with at least a partial description of intended behavior, especially if searching within larger program spaces could be improved with ML. E.g. here's an academic tool from last year which I would love to see productionized. https://www.youtube.com/watch?v=QF9KtSwtiQQ


I think it’s pretty clear that program synthesis good enough to replace programmers requires AGI.

This solely text based approach is simply “easy” to do, and that’s why we see it. I think it’s cool and results are intriguing but the approach is fundamentally weak and IMO breakthroughs are needed to truly solve the problem of program synthesis.


There's a few decades worth of work on program synthesis and it works very well. You don't need AGI.

You need either a) a complete specification of the target program in a formal language (other than the target language) or b) an incomplete specification in the form of positive and negative examples of the inputs and outputs of the target program, and maybe some form of extra inductive bias to direct the search for a correct program [edit: the latter setting is more often known as program induction].

In the last few years the biggest, splashiest result in program synthesis was the work behind FlashFill, from Gulwani et al: one-shot program learning, and that's one shot, from a single example, not with a model pretrained on millions of examples. It works with lots of hand-crafted DSLs that try to capture the most common use-cases, a kind of programming common sense that, e.g. tells the synthesiser that if the input is "Mr. John Smith" and the output is "Mr" then if the input is "Ms Jane Brown" the output should be "Ms". It works really, really well but you didn't hear about it because it's not deep learning and so it's not as overhyped.
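
As a loose illustration of programming by example (emphatically not FlashFill itself, which uses far richer DSLs and ranking), here's a toy Python synthesizer that just searches a tiny hand-written DSL for a program consistent with a single input/output pair:

    # Toy DSL: each "program" is a named string transformation.
    DSL = {
        "prefix_before_space": lambda s: s.split(" ")[0],
        "prefix_before_dot": lambda s: s.split(".")[0],
        "first_char": lambda s: s[0],
        "upper": lambda s: s.upper(),
    }

    def synthesize(example_in, example_out):
        """Return the names of all DSL programs consistent with the example."""
        return [name for name, prog in DSL.items() if prog(example_in) == example_out]

    # One example narrows down the intended transformation...
    programs = synthesize("Mr. John Smith", "Mr")
    print(programs)  # ['prefix_before_dot'] ('prefix_before_space' gives 'Mr.')

    # ...which can then be applied to new inputs.
    print(DSL[programs[0]]("Ms. Jane Brown"))  # Ms

The hard parts that FlashFill actually solves -- a DSL expressive enough for real spreadsheet tasks, and ranking when many programs fit the examples -- are exactly what this toy leaves out.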

Copilot tries to circumvent the need for "programming common sense" by combining the spectacular ability of neural nets to interpolate between their training data with billions of examples of code snippets, in order to overcome their also spectacular inability to extrapolate. Can language models learned with neural nets replace the work of hand-crafting DSLs with the work of collecting and labelling petabytes of data? We'll have to wait and see. There are also many approaches that don't rely on hand-crafted DSLs, and also work really, really well (true one-shot learning of recursive programs without an example of the base case and the synthesis terminates) but those generally only work for uncommon programming languages like Prolog or Haskell, so they're not finding their way to your IDE, or your spreadsheet app, any time soon.

But, no, AGI is not needed for program synthesis. What's really needed I think is more visibility of program synthesis research so programmers like yourself don't think it's such an insurmountable problem that it can only be solved by magickal AGI.


I said program synthesis good enough to replace programmers requires AGI. Program synthesis based off of informal specifications in natural language. Not talking about highly constrained environments with formal specs.

I am not belittling the work going in this space, and I’m sure for highly constrained and narrow use cases a lot can be done even now. But I believe solving the general problem of program synthesis based on informal spec requires AGI. I am hardly the only one who thinks this.


>> I am not belittling the work going in this space, and I’m sure for highly constrained and narrow use cases a lot can be done even now.

No. Program synthesis approaches work very well for a broad array of problems, not for "highly constrained and narrow use cases"- that is a misconception of the kind that results from lack of familiarity with modern program synthesis.

Here's a good recent review of the field:

https://www.microsoft.com/en-us/research/wp-content/uploads/...

Sumit Gulwani, that I mentioned in my previous comment, is an author. To clarify, I'm not in any way affiliated with him or his collaborators. I'm actually from a rival camp, if you will, but the paper I link to is a very good summary of the state of the art. It should help you if you wish to understand where program synthesis is at.

>> I said program synthesis good enough to replace programmers requires AGI. Program synthesis based off of informal specifications in natural language.

Program synthesis from natural language is hard to make work because it's difficult to translate natural language specifications to specifications that a program synthesiser can use. But that is a limitation of current natural language analysis, specifically natural language understanding, approaches - not a limitation of program synthesis approaches.

I think you equate formal specifications, or specification by example, with "narrow use cases". There's no connection between the two.


If program synthesis is as far advanced as you say it is, how come I make six figures doing something that you seem to be arguing can be totally automated?

The reality seems to disagree with your statements. Program synthesis is as of right now limited to academic research and highly narrow use cases. If the opposite was true, I’d be out of a job.

I think copilot is probably the first product of its type that might make its way into the hands of users en masse.

Edit:

Btw I was referring to program synthesis based off informal natural language spec. Spec inference is part of the synthesis pipeline, I think it’s not fair to just ignore that problem.


The purpose of program synthesis is not to get programmers out of a job. Rather, it's a tool to help programmers better do their job. I think it's easy to see why you're not using it. With few exceptions, advances in research take many years to percolate down to the industry. And of course the industry is famous for following trends without real understanding of anything.

Anyway the review I linked to has some examples of real-world applications of program synthesis. Don't be afraid to read it- it's light on formal notation and you don't need special skills to understand it. I appreciate that it's a long document but there's a Table of Contents at the start and you should be able to skim through in a short time just to get a general idea of the subject.

Anyway I can see you're trying to "wing it" and reason from first principles about something you know nothing about, in true SWE style. Yet, you don't know what you don't know, so you start from the wrong assumptions ("fully automated" etc) and arrive at the wrong conclusions. That's no way to understand anything. It's certainly not going to give you any good idea about what's going on in an entire field of research you know nothing about.

Of course you're not obliged to know anything about program synthesis, but in that case, maybe consider sitting back and listening rather than expressing strong opinions with absolute conviction that is not supported by your knowledge? I think that will make a better conversation, and a better internet, for everyone.


I think you're holding text-based approaches and synthesis based approaches to radically different expectations. Copilot isn't approaching replacing programmers; presumably a programmer is invoking it, deciding what to keep or change, etc, i.e. generating parts of programs under the guidance of a human programmer. Synthesis can work at the level of providing an expression or a helper function, as a useful tool under the guidance of a programmer.

Copilot suggests some code snippets, and not necessarily good ones. To be dismissive of another approach to generate parts of programs because they cannot replace programmers is like saying that belt-drive bikes aren't worth considering over chains because a belt-drive bike isn't a replacement for a Learjet.


To be clear I wasn’t dismissing anything, that was not my intent. I think as a programmer assist text based approaches work and I really like copilot for what it is.

I was merely saying that for the holy grail, program synthesis from informal spec generalised to any domain, the approach will have to be different.


Who said that's the holy grail?

Again I think you're trying to imagine what program synthesis must be like, rather than trying to find out what it actually is like.


Intuition tells me it’s the holy grail, confirmed by tradition and literature.

You are being a little condescending btw, it’s not in the spirit of hacker news.

> The ability to automatically discover a program consistent with a given user intent (specification) is the holy grail of Computer Science.

https://drops.dagstuhl.de/opus/volltexte/2017/7128/pdf/LIPIc...

> Since the inception of AI in the 1950s, this problem has been considered the holy grail of Computer Science.

https://www.microsoft.com/en-us/research/publication/program...

https://dl.acm.org/doi/fullHtml/10.1145/242224.242304

> Automatic synthesis of programs has long been one of the holy grails of software engineering.

https://people.eecs.berkeley.edu/~sseshia/pubdir/icse10-TR.p...


Your links say that program synthesis in general is the holy grail of computer science etc, you said that the holy grail is "program synthesis from informal spec" and by that you meant a natural language specification as per the context of our conversation so far. Are you now trying to subtly shift the goalposts?

If so, please leave them alone. You have no reason to assume that program synthesis from a natural language specification is "the holy grail" of anything. But I'm glad that our conversation at least made you look up a few links, even if only to try and win the internet conversation from what it looks like.

So what did you learn, from what you read about program synthesis? Can you see why your assumptions earlier on, about "narrow use cases" and the like were wrong?

Edit: btw, the Freuder paper you linked is about constraint programming, not program synthesis.

And did you notice that one of your links above is the abstract of the review paper I proposed you read, earlier?


Look, I think the main disagreement here is that you don’t seem to consider spec inference as part of the program synthesis process, whereas I do. Your position may very well be correct from an academic point of view.

From my perspective, spec inference from informal spec is the main thing to solve. Because for formal specs, I’d just be programming in a declarative language to create the formal spec.

Spec by example won’t scale because you can’t provide examples across the entire domain for apps of real world complexity.

Once spec inference is solved, then you are just left with a search problem. I understand that the search space is freakin huge but I’d still say the latter problem is easier to solve than the former.

And I’d guess that the problem of inferring a spec from an informal description is what requires AGI.

I hope this clarifies my POV. I don’t think we disagree, we just have different perspectives.


Thank you for a level-headed and well thought-out response! I'm glad to see that our communication isn't "ratcheting" to more and more forceful forms.

Now. I think the root of our disagreement was the necessity or not of AGI for various kinds of program synthesis, which I think we 've probably pared down to program synthesis from an informal specification, particularly a natural language one.

I don't agree that AGI is necessary for that. I think that, as many AI tasks, such a complete and uncompromising solution can be avoided and a more problem-specific solution found instead.

In fact, we already had almost a full solution to the problem in 1968, with SHRDLU [1], a program that simulated a robotic hand directed by a human user in natural language to grasp and rearrange objects. This was in the strict confines of a "blocks world" but its capability, of interpreting its user's intent and translating it accurately to actions in its well-delineated domain remains unsurpassed [2]. Such a natural language interface could well be implemented for programming in bounded domains, for example to control machinery or run database queries etc. This capability remains largely unexploited, because the trends in AI have shifted and everybody is doing something else now. That's a wider-ranging conversation though. My main point is that a fully-intelligent machine is not necessary to communicate with a human in order to carry out useful tasks. A machine that can correctly interpret statements in a subset of natural language suffices. This is technology we could be using right now, only nobody has the, let's say, political will to develop it because it's seen as parochial, despite the fact that its capabilities are beyond the capabilities of modern systems.

As to the other kind of informal specifications, by examples, or by demonstration, what you say, that it won't scale etc, is not true. I mentioned earlier one-shot learning of recursive hypotheses without an example of the base case [3]. To clarify, the reason why this is an important capability is that recursive programs can be much more compact than non-recursive ones and still represent large sets of instances, even infinite sets of instances. In fact, for some program learning problems, only recursive solutions can be complete solutions (this is the case for arithmetic and grammar learning for example). The ability to learn such solutions from a single example I think should conclusively address concerns about scaling.

To be fair, this is a capability that was only recently achieved by Inductive Logic Programming (ILP) systems, a form of logic program synthesis (i.e. synthesis where the target language is a logic programming language, usually Prolog, but not necessarily). The Gulwani survey I linked mentions this recent advance in passing only. But ILP systems in general have excellent sample complexities and can routinely generalise robustly from a handful of examples (in the single digits), and have been doing this since the 1990's.

The cost of searching a large hypothesis space is, indeed, an issue. There are ways around it, however. Here, I have to toot my own horn and mention that my doctoral research is exactly about logic program learning without search. I'd go as far as to say that what's really been keeping program synthesis back is the ubiquity of search-based approaches, and that program synthesis will be solved conclusively only when we can construct arbitrary programs without searching. But you can file that under "original research" (I mean, that's literally a description of my job). Anyway, no AGI is needed for all of this.

So, I actually understand your earlier skepticism, along the lines of "if all this is true, why haven't I heard of it?" (well, you said why do you still have a job but it's more or less the same thing). The answer remains: because it's not the trendy stuff. Trends drive both industry and research directions. Right now the trend is for deep learning, so you won't hear about different approaches to program synthesis, or even program synthesis in general. There's nothing to do about it. Personally, I just grin and bear it.

_________________

[1] https://en.wikipedia.org/wiki/SHRDLU

[2] For example, in the demonstration conversation quoted by wikipedia, note the "support supports support" sentence.

[3] This is already growing too long so I won't include examples, but let me know if you're curious for some.


Maybe slightly less than AGI.

Parsing intent in a programming context is easier than in other contexts. Also, most of the code is written to be parsed by a machine anyway. So with ASTs and all the other static (and maybe even some dynamic) checks, it should be possible.

We already have some of it with type detection, IntelliSense, etc.

It is a hard set of problems with no magic solutions like this one, needing years of development time. That approach won't happen commercially, only incrementally in the community.


Also the goal doesn't need to be "to replace programmers". As with copilot, the point of a program synthesis tool can be to assist the programmer. The point of the system in the video linked above is partly that interactively using such a system can aid development. My main point is this can be a lot better in combination with approaches from outside the ML community, which may involve much tighter integration to specific languages, as well as some awareness of a goal for the synthesized portion.

To "replace programmers", an organization would need to have a way of specifying to the system a high level program behavior, and to confirm that an output from the system satisfies that high level behavior. I think for specifications of any complexity, producing and checking them would look like programming just of a different sort.


> fundamental weaknesses of a solely text-oriented approach to suggesting code.

I don't think it is clear that such "fundamental weaknesses" exist. A text-based approach can get you incredibly far.


I mean, the cases where it tries to assign copyright to another person in a different year highlight that context other than the other text in the file is semantically extremely important, and not considered by this approach. Merely generating text which looks appropriate to the model given surrounding text is ... misguided?

If you think about it, program synthesis is one of the few problems in which the system can have a perfectly faithful model of the problem domain's dynamics. It can run any candidate it generates. It can examine the program graph. It can look at what parts of the environment were changed. To leave all that on the table in favor of blurting out text that seems to go with other text is like the toddler who knows that "five" comes after "four", but who cannot yet point to the pile of four candies. You gotta know the referents, not just the symbols. No one wants a half-broken Chinese Room.


> generating text which looks appropriate to the model given surrounding text is ... misguided?

Agreed - it represents a failure to adequately model/understand the task, but I don't think it is a "fundamental weakness" of text-based 'Chinese room' approaches.

> You gotta know the referents, not just the symbols. No one wants a half-broken Chinese Room.

"Knowing the referents" is not at all clearly defined. It's totally possible that, under the constraint of optimizing for next-word prediction, the model could develop an understanding of what the referents are.

You shouldn't underestimate the level of complex behavior emerging from a big enough system under optimization. After all, all the crazy stuff we do - coding, art, etc. - is produced by a system under evolutionary optimization pressure to make more of itself.


> "Knowing the referents" is not at all clearly defined. It's totally possible that, under the constraint of optimizing for next-word prediction, the model could develop an understanding of what the referents are.

Well, in this case, it would have been good to understand that "V. Petkov" is a person unrelated to the project being written, and that "2015" is a year and not the one we're currently in. Sometimes the referent will be a method defined in an external library, which perhaps has a signature, and constraints about inputs, or properties which apply to return values.

> You can't underestimate the level of complex behavior emerging from a big enough system under optimization. After all, all the crazy stuff we do - coding, art, etc. is produced by a system under evolutionary optimization pressure to make more of itself.

I think this can verge into a kind of magical thinking. Yes, humans also look like neural nets, and we might even be optimizing for something. But we learn to program (and we do our best job programming) by having a goal for program behavior, and we use interactive access to try to run something, get an error, set a break point, try again, etc. I challenge anyone to try to learn to "code" by never being given any specific tasks, never interacting with docs about the language, an interpreter, a compiler, etc, but merely to try to fill in the blank in paper code snippets. You might learn to fill in some blanks. I highly doubt you would learn to code.

This is totally a case where the textual representation of programs is easier to get and train against, and that tail is being allowed to wag the dog to frame both the problem and the product.

None of this is to say that high-bandwidth DNN approaches don't have a place here -- but I think we should be looking at language-specific models where the DNN receives information about context (including some partial description of behavior) and outputs of the DNN are something like the weights in a PCFG that is used in the program search.


>> I don't think it is clear that such "fundamental weaknesses" exist. A text-based approach can get you incredibly far.

Mnyeah, not really that "incredibly". Remember that neural network models are great at interpolation but crap at extrapolation. So as long as you accept that the code generated by Copilot stays inside a well-defined area of the program space, and that no true novelty can be generated that way, then yes, you can get a lot out of it and I think, once the wrinkles are ironed out, Copilot might be a useful, everyday tool in every IDE (or not; we'll have to wait and see).

But if you need to write a new program that doesn't look like anything anyone else has written before, then Copilot will be your passenger.

How often do you need to do that? I don't know. But there's still many open problems in programming that lots of people would really love to be able to solve. Many of them have to do with efficiency and you can't expect Copilot to know anything about efficiency.

For example, suppose we didn't know of a better sorting algorithm than bubblesort. Copilot would not generate mergesort. Even given examples of divide-and-conquer algorithms, it wouldn't be able to extrapolate to a divide-and-conquer algorithm that gives the same outputs for the same inputs as bubblesort. It wouldn't be able to, because it's trained to reproduce code from examples of code, not to generate programs from examples of their inputs and outputs. Copilot doesn't know anything about programs and their inputs and outputs. It is a language model, not a magickal pink fairy of wish fulfillment and so it doesn't know anything about things it wasn't trained on.

Again, how often do you need to write truly novel code? In the context of professional software development, I think not that often. So if it turns out to be a good boilerplate generator, Copilot can go a long way. As long as you don't ask it to generate something else than boilerplate.

There are approaches that work very well in the task of generating programs that they've never seen before from examples of their inputs and outputs, and that don't need to be trained on billions of examples. True one-shot learning of programs (without a model pre-trained on billions of examples) is possible. With current approaches. But those approaches only work for languages like Prolog and Haskell, so don't expect to see those approaches helping you write code in your IDE anytime soon.


This does make me wonder if this is susceptible to the same form of trolling as that MS AI got. Commit a load of grossly offensive material to multiple repos, and wait for Copilot to start parroting it. I think they're going to need some human moderation.


Way better. It's susceptible to copyright trolling.

Put up repos with snippets for things people might commonly write. Preferably use javascript so you can easily "prove" it. Write a crawler that crawls and parses JS files to search for matching stuff in the AST. Now go full patent troll, eh, I mean copyright troll.
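
A rough sketch of the matching half of that crawler -- using Python's own tokenizer on Python source rather than a JS AST, purely to show the idea: normalize identifiers and literals away so that renamed copies still produce the same fingerprint.

    import io
    import tokenize

    def fingerprint(source: str) -> tuple:
        """Token sequence with names and literals normalized away,
        so copied-and-renamed code still matches."""
        out = []
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type == tokenize.NAME:
                out.append("NAME")
            elif tok.type in (tokenize.STRING, tokenize.NUMBER):
                out.append("LIT")
            elif tok.type in (tokenize.NL, tokenize.NEWLINE, tokenize.INDENT,
                              tokenize.DEDENT, tokenize.COMMENT, tokenize.ENDMARKER):
                continue
            else:
                out.append(tok.string)
        return tuple(out)

    mine = "def area(w, h):\n    return w * h\n"
    theirs = "def compute_area(width, height):\n    return width * height\n"
    print(fingerprint(mine) == fingerprint(theirs))  # True: same code, renamed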


1) Write a project heavily using Copilot (hell, automate it and write thousands of them, why not?)

2) AGPL all that code.

3) Search for large chunks of code very similar to yours, but written after yours, licensed more liberally than AGPL. Ideally in libraries used by major companies.

4) Point the offenders to your repos and offer a "convenient" paid dual-license to make the offenders' code legal for closed-source use, so they don't have to open source their entire product.

5) Profit?


6) Arms race with someone who trained an obfuscation version that goes through your AGPL code and tweaks it to not be in violation.


I love living in cyberpunk already.


Offensive code is the least of my worries. What about vulnerable/exploitable code?


This was my first thought when reading about Copilot...it feels almost certain that someone will try poisoning the training data.

Hard to say how straightforward it'd be to get it to produce consistently vulnerable suggestions that make it into production code, but I imagine an attacker with some resources could fork a ton of popular projects and introduce subtle bugs. The sentiment analysis example on the Copilot landing page jumped out to me...it suggested a web API and wrote the code to send your text there. Step one towards exfiltrating secrets!

Never mind the potential for plain old spam: won't it be fun when growth hackers have figured out how to game the system and Copilot is constantly suggesting using their crappy, expensive APIs for simple things!? Given the state of Google results these days, this feels like an inevitability.


Targeted attacks to elicit output only in a given context are generally possible with AIs. And here, writing an implementation of a difficult and vulnerable process seems easy. Bad implementations of various hard things become common 'cause people cut and paste the code without looking closely since they don't understand it anyway.

//Implement elliptic curve cryptography below

//Sanitize input for SQL call below

Etc.
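
For the SQL case, the gap between a plausible-looking suggestion and a safe one can be a single line. A hedged illustration using Python's sqlite3 (nothing here is claimed to be actual Copilot output):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT)")
    conn.execute("INSERT INTO users VALUES ('alice')")

    def find_user_unsafe(name):
        # The plausible-looking version: string interpolation, injectable.
        return conn.execute(f"SELECT * FROM users WHERE name = '{name}'").fetchall()

    def find_user_safe(name):
        # Parameterized query: the driver handles quoting.
        return conn.execute("SELECT * FROM users WHERE name = ?", (name,)).fetchall()

    payload = "x' OR '1'='1"
    print(find_user_safe(payload))    # [] -- no such user
    print(find_user_unsafe(payload))  # [('alice',)] -- every row leaks

If a model has seen thousands of copies of the first version, that's the continuation it will tend to offer.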


1- re-upload all the shell scripts you can find, after having inserted `rm -rf --no-preserve-root /` every other line

2- …

3- profit


Yep, trivial to implement as an attack.


Given that code is easier to write than it is to read, this one is troubling.

I certainly wouldn't want to be using this with languages like PHP (or even C for that matter) with all the decades of problematic code examples out there for the AI to learn from.


Just ask it to prioritize safety


Coding with Adolf?


Jojo Rabbit except Adolf is in the cloud and not in a kid's imagination.


YC submission when?


Perhaps they think that any code that passed a review and got merged = human moderated


This is a very famous function [0] and likely appears multiple times in the training set (Google gives 40 hits for GitHub), which makes it more likely to be memorized by the network.

[0]: https://en.wikipedia.org/wiki/Fast_inverse_square_root#Overv...


It's worth keeping in mind that what a neural network like this (just like GPT3) is doing is generating the most probable continuation based on the training dataset. Not the best continuation (whatever that means), simply the most likely one. If the training dataset has mostly bad code, the most likely continuation is likely to be bad as well. I think this is still valuable, you just have to think before accepting a suggestion (just like you have to think before writing code from scratch or copying something from Stack Overflow).
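
A deliberately crude illustration of "most probable continuation" (a bigram counter, nothing like the transformer behind Copilot, but the failure mode is the same): the model can only ever emit whatever followed most often in its training text, good or bad.

    from collections import Counter, defaultdict

    training = "the cat sat on the mat the cat sat on the rat".split()

    # Count which word most often follows each word in the training data.
    following = defaultdict(Counter)
    for prev, nxt in zip(training, training[1:]):
        following[prev][nxt] += 1

    def continue_from(word, length=5):
        out = [word]
        for _ in range(length):
            if word not in following:
                break
            word = following[word].most_common(1)[0][0]  # most likely next word
            out.append(word)
        return out

    print(continue_from("the"))  # ['the', 'cat', 'sat', 'on', 'the', 'cat']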


I have no idea how this or GPT3 works or how to evaluate them, but couldn't you argue that it's working as it should? You tell copilot to write a fast inverse square root, it gives you the super famous fast inverse square root. It'd be weird and bad if this didn't happen.

As far as licenses go, idk. Presumably it could delete associated comments and change variable names or otherwise obscure where it's taking code from. Maybe this part is shady.


Maybe I could build a robot that goes out in the city and steal cars.

As far as licenses go, idk. Presumably it could delete the number plate and repaint the car or otherwise obscure where it's taking the car from. Maybe this part is shady.

Maybe.


> couldn't you argue that it's working as it should?

Let's say that it's doing exactly what it was trained to do.


In particular, fast approximate inverse square root is an x86 instruction, and not a super new one. I'd be surprised if it wasn't in every major instruction set.

This is an interesting issue. I suspect training on datasets from places like Github would be likely to provide lots of "this is a neat idea I saw in a blog post about how they did things in the 90's" code.


> the most probable continuation based on the training dataset

This is not wrong, but it's easy to misread it as implying little more than a glorified Markov model. If it's like https://www.gwern.net/GPT-3 then it's already significantly cleverer, and so you should expect to sometimes get the kind of less-blatant derivation that companies aim to avoid using a cleanroom process or otherwise forbidding engineers from reading particular sources.


Arguably the most famous block of code, of all time. Maybe fizzbuzz but there are so many flavors of it. And InvSqrt is way more memeable.

So I don't know if on this alone it proves Copilot regurgitates too much. I think other signs are more troubling, however, such as its tendency to continue from a prompt vs generate novelty.


It seems like a very sensible answer from copilot since the prompt includes "Q_" which makes it obvious that the programmer is specifically looking for the Quake version of this function.

To me it doesn't show that copilot will regurgitate existing code when I don't want it to, just that if I ask it to copy some famous existing code for me it will oblige.


I was able to trigger it without the Q prompt before. It just made for a nicer looking gif that way.

I got it to produce more GPL code too, that one is just not entertaining.


The claim for AI systems like this is that it has actually learned something and is generating code from scratch. Oftentimes the authors will claim regurgitation is simply not possible, and this example shows that's a lie.

Many arguments on the benefits, legality and power of AI systems rely on this claim.

To turn around now and say it's OK to regurgitate in the right setting is to move the goalposts.


> Oftentimes the authors will claim regurgitation is simply not possible

Do the Copilot authors claim this?

I get that you're suggesting that Copilot may benefit from absolute claims made by the authors of other, similar systems (or their proponents), but I also don't think it's reasonable to exclude nuance and the specifics of Copilot from ongoing discussions on that basis. The Copilot authors have publicly acknowledged the regurgitation problem, and by their account are working on solutions to it (e.g. attribution at suggestion-generation time) that don't involve sweeping it under the rug.


Nat Friedman explicitly stated that it shouldn't regurgitate [0]:

> It shouldn't do that, and we are taking steps to avoid reciting training data in the output

He's being woefully naive. To put it bluntly, we don't know how to build a neural network that isn't capable of spitting out training data. The techniques he pointed to in other threads are academic experiments, and nobody seems to have a credible explanation for why we should believe that they work.

[0] https://news.ycombinator.com/item?id=27677177


"Shouldn't" isn't the same as "doesn't".

I'm not anything close to an ML expert, and I have no opinion on whether what they're aiming for is possible, but this document^[1] (linked in your linked comment) states explicitly that they are aware of the recitation issue and are taking steps to mitigate it. So, in the context of the comment I replied to, I think Github is very far from claiming that recitation is "simply not possible".

^[1] https://docs.github.com/en/github/copilot/research-recitatio...


That kind of bullshit phrasing can only get you so far.

It's like if some corporate PR department told you "we're aware of the halting problem, and are taking steps to mitigate it." You would rightly laugh them out of the room.

It's not going to work, and the people making these statements either don't understand how much they don't understand, or are deluding themselves, or are actively lying to us.

An honest answer would be something like "We are aware that this is a problem, and solving it is an active area of research for us, and for the machine learning community at large. While we believe that we will eventually be able to mitigate the problem to an acceptable degree, it is not yet known whether this category of problem can be fully solved."


You're using some pretty strong language here, but do you have any more substantive criticisms of the analysis they present at https://docs.github.com/en/github/copilot/research-recitatio... ? They seem to think the incidence of meaningful (i.e. substantively infringing) recitation is very low, and that their solution in those cases will be attribution rather than elimination.

Again, I'm not an ML expert, but that sounds a lot more reasonable to me than announcing one's intention to solve the halting problem.


The analysis you're citing is just that -- a statistical analysis. They had some people use the thing for a while, and concluded "Hey look, it doesn't seem to quote verbatim very often. Yay!" There is nothing in there that describes any sort of mitigation. The three sentences about an attribution search at the very end are aspirational at best, and are presented as "obvious" even though it's not at all clear that such a fuzzy search can be implemented reliably.

I use the halting problem as an analogy because their naive attempts to address this problem feel a lot like naive attempts to get around the halting problem ("just do a quick search for anything that looks like a loop," "just have a big list of valid programs," etc.). I can perform a similar analysis of programs that I run in my terminal and come to a similar "Hey look, most of them halt! Yay!" conclusion. I can spin a story about how most of the ones that don't halt are doing so intentionally because they're daemons.

But this approach is inherently flawed. I can use a fuzz tester to come up with an infinite number of inputs that cause something as simple as 'ls' to run forever.

Similarly, I can come up with an infinite number of adversarial inputs that attempt to make Copilot spit out training data. Some of them will work. Some of them will produce something that's close enough to training data to be a concern, but that their "attribution search" will fail to catch. That's the "open research question" that they need to solve.

We don't have a general solution to this problem yet, and we may never have one. They're trying to pass off a hand-wavey "we can implement some rules and it won't be a problem most of the time" solution as adequate. I don't see any reason to believe that it will be adequate. Every attempt I've seen at using logic to try and coax a machine learning model into not behaving pathologically around edge cases has fallen flat on its face.


> The analysis you're citing is just that -- a statistical analysis. They had some people use the thing for a while, and concluded "Hey look, it doesn't seem to quote verbatim very often. Yay!" There is nothing in there that describes any sort of mitigation.

> The three sentences about an attribution search at the very end are aspirational at best, and are presented as "obvious" even though it's not at all clear that such a fuzzy search can be implemented reliably.

I agree with all of this, though I do think that the attribution strategy they describe sounds a lot easier than solving the halting problem or entirely eliminating recitation in their model. Obviously, the proof will be in the pudding.

Maybe you and others are reacting to them framing this as "research", as if they're trying to prove some fundamental property of their model rather than simply harden it against legally questionable behavior in a more practical sense. I think a statistical analysis is fine for the latter, assuming the sample is large enough.


The biggest issue with that analysis is that their model is clearly very able to copy code and change the variable names, copying code and changing variable names is very clearly still "copying", and the analysis doesn't seem to include that in its definition of "recitation event".


I'd fully expect it to copy code and change variable names in a lot of cases--if it wants to achieve the goal of filling in boilerplate, how could it do anything else? That's pretty much the definition of boilerplate: it's largely the same every time you write it.

What's less clear to me is that Copilot regularly does that sort of thing with code distinctive enough that it could reasonably be said to constitute copyright infringement. If somebody's actually shown that it does, I'd love to see that analysis.


They did! In the faq which I can't find anymore they said:

>GitHub Copilot is a code synthesizer, not a search engine: the vast majority of the code that it suggests is uniquely generated and has never been seen before. We found that about 0.1% of the time, the suggestion may contain some snippets that are verbatim from the training set.


There's a risk of confirmation bias though, because the search was performed by the developers of the system who are strongly motivated to find no problem with their work.

I have done this to myself many, many times. I look, carefully, at length, for problems with my work until I satisfy myself that there's no obvious problem with it. Then someone else points out the obvious problem I was overlooking. Actually, this has happened often enough and it's painful enough that I learned to really look nowadays.

In the context of the search for "snippets that are verbatim from the training set" there's all sorts of things that can go wrong. The search (a regex search I think?) can be unintentionally made too weak to catch obvious cases. Or too strong, probably. The search for "snippets that are verbatim from the training set" may ignore snippets that are 80% verbatim. Or the code generated during the experiment can be generated in such a way as to only generate verbatim snippets very rarely, contrary to more typical use. And so on and so forth.

There's so many ways to fool oneself when looking for errors in one's own work.

Edit: They explain their search methodology and with only a quick look I gave it, it seems legit, but it was a quick look. The devil is in the details, yes? Maybe people who are really interested in this issue should take a closer look.


This actually seems like an explicit acknowledgement that regurgitation is possible, and not remotely a claim that it is "simply not possible".

It stands to reason that cases where people are intentionally trying to produce regurgitation will strongly overlap with the minority of cases where it actually happens. So I think we are probably suffering from some selection bias in discussions on HN and similar forums--that might be unavoidable, and it certainly stimulates some interesting discussion, but we should try to avoid misrepresenting the product as a whole and/or what its creators have said about it.


I think only Github's lawyers would interpret what GP posted the way you did. Looks like weasel wording to make such an interpretation possible, while making customers believe that code is more or less synthesized in realtime. "Snippets" makes one think one or two lines of code, not entire functions and classes.


I think that until somebody shows that Copilot is willing to copy distinctive code fragments verbatim, unprompted, with a high occurrence rate, I'm not going to start accusing Github of building an engine to cynically exploit the IP rights of open source copyright holders for profit. I've seen no evidence of that, and in absence of evidence I prefer to remain neutral and open-minded.

How would that work, anyway? Rare, distinctive code forms seem much more difficult for an ML thing to suggest with a high-ish confidence level, since there won't be much training data. The Quake thing makes sense because it's one of the most famous sections of code in the world, and probably exists in thousands of places in the public Github corpus.

I'm emphasizing distinctive because a lot of boilerplate takes up a lot of room, but still doesn't make a reasonable argument for copyright infringement when yours looks like somebody else's.


It looks like you're responding to the wrong comment. I don't recall alleging that Github is "building an engine to cynically exploit the IP rights of open source copyright holders for profit".


> I think only Github's lawyers would interpret what GP posted the way you did. Looks like weasel wording to make such an interpretation possible,

So what are you suggesting here, except that Github is attempting a legal sleight-of-hand to hide real infringement?

> while making customers believe that code is more or less synthesized in realtime.

What are you suggesting here except that Github is (essentially) lying to customers, making them believe something that is substantially untrue?

When I say "building an engine to cynically exploit the IP rights of open source copyright holders for profit", I am talking about a scenario in which they are sweeping legitimate IP concerns under the rug with bad faith legal weaselry and misrepresentation of how the product functions, etc., to chase profit. I do not see how that is substantially different from the implications of your comment, especially in the context of this subthread.

Could you enlighten me as to how your intended meaning substantially differs from my interpretation? If you don't mean to accuse Github of malfeasance, we probably don't have much to discuss.


You're not wrong, but the very idea that it will regurgitate copyrighted code at all (and especially at this length, word for word), means that it will be totally unacceptable for many places. In fact, it is arguably not acceptable to use anywhere if you care deeply about copyright.


Apparently you haven't seen many of the demos that people are showing off? Because saying that this only occurs when the author is explicitly asking for copied code is blatantly false.


No I haven't. If you think the other demos are more interesting please link to them. I'm just saying that this demo is biased and that we can't draw any conclusion from it. Actually the author has just confessed optimizing it for entertainment in a sister comment. That doesn't mean that the claim is false but it doesn't show that it is true either.


I think you misunderstood my comment. The same code gets generated if you call the function `float fast_rsqrt` or `float fast_isqrt` for instance. I intentionally wanted it to be looking like `Q_rsqrt` so that people pick up on it quicker.


Thanks for making the video in the OP.

Do you have more examples like this that I can share with those who don't use Twitter, like a repo or blog post?


This reply from @AzureDevOps is bizarre: "We understand. However, the way to report this issues related to Windows 11 is through our Windows Insider even from another device. Thanks in advance."

I think I'm gonna give "AI" a few more years.

https://twitter.com/AzureDevOps/status/1411018079849619458


Wow. To think of it, nothing in this HN thread, including your link, is truly new and unexpected, but in this context it felt somehow more dystopian than ever. Talking about machines pretending to be humans doing stupid stuff, getting automated responses from machines pretending to be humans, that also are the same kind of stupid stuff... Almost feels like drowning.


We understand. However, the way to report this issues related to im̄̽̚m͚͠i͙̬͈̟̹̳ͨ͆̀ͅn̲͚̻ͩ̐͒ͩ̊è̹̱͖̼̰n̘̯ͥ̿̌͛͌t̳̖̣̻̯̱ͥ̅̿̇͜ ̥̻̺͒ͣ͒͠A͔͔͓ͨÌ̖̲̆͒̐̍ͅ ̝̙̼̤͖͍̆̀ͪdͤͨ͑̈҉̭̖y̤͔̮͚̞̺ͬͦͦ̎ͮ͐́s̤͓̲͓̖̪̊t͎̰̤̩̞̞͇͐̎͂̉̆̚o̱̣̰͇̟̻͎̿͒̋̎p̫̰̮̌͐ͧ͗̔̀ͣi̫̱̩̠̫͔͒̉ͤa̶ͧͦͭͩ is through our Citizen Satisfaction Department, even from another Autonomous Azure® Sub-district. Thanks in advance.


The user that account replied to was having a conversation with Microsoft Support about Windows 11, and they replied to the wrong tweet thread with the wrong account.

https://twitter.com/lnplum/status/1410599036311130113?s=21


I’m really dumbfounded by the Copilot team decision to not exclude GPL licensed code.

Why was this direction chosen? Is the inclusion of GPL really worth the risk and potential Google v. Oracle lawsuit? I’d like to know the reasoning.


Why would excluding GPL'd code be enough to not violate licenses? I don't understand why people think MIT or other licenses are free for alls to take code as they wish. The MIT license includes an attribution clause. And, as the linked video shows, Copilot is more than happy to take its code and put your pet license and copyright notice on instead. Isn't that equally as infringing as stealing GPL code? The idea of mining GitHub for training data was doomed from the start copyright-wise, as there's so much code that's misattributed, wrongly-licensed, or unlicensed.


At some level though, this suggests that the only way to be safe if you're writing a program (outside of a Copilot context) is probably simply not to look at GitHub (or maybe Stack Overflow and other code sources) except for, perhaps, using properly attributed entire functions. If you take a couple lines of code and tweak it a bit are you now required to attach copyright attribution? IANAL, but I'm guessing not.


Has anyone ever been sued IRL for using MIT/Apache/... code? Or are we stuck in imaginary land where this is something to be worried about?

Btw the GPLv2 death penalty is rather unique and I don't think anyone will deny that including GPL code in proprietary code is a hell of a lot worse in every way (liability, ethically, etc) than including permissively licensed code and forgetting to attribute it


At least that will reduce the chance of license violation as well as make a good legal argument for any uncovered violations as "unintentional" incidents.


Copilot is a tool. If you take copilot's suggestions uncritically and push them to GitHub, that's on you.


Sure but if you have to audit every suggestion to see if it violates copyright laws that's not a particularly useful tool.


Depends. If you find useful code on Github, Stack Overflow or anywhere else in the internet, you still need to check whether it is suitable with your licensing or not.


If you find useful code on Github or StackOverflow, you can check for the license directly there, or you can try to find where it was copied from, and look for a license there.

Copilot isn't copying, it's regurgitating patterns from its training dataset. The result may be subject to a license you don't know about, but modified enough that you won't find the original source. The result can be a blend of multiple snippets with varying licenses. And there's no way to extract attribution from Copilot - DNN models can give you an output for your input, they can't tell you which exact parts of the training dataset were used to generate that output.


But Copilot won't accurately tell you if it's directly copying code, and if so what the license is. If it provides MIT licensed code that I then need to include, how do I know that? Do I need to search for each set of lines of code it provides on GitHub?

When a person gets code from another source on the internet, they generally know where the code has come from.


In a real world scenario you wouldn't be mindlessly pressing Tab right after linebreak and accepting the first suggestion that comes your way. While entertaining, nobody gets paid to do that.

What you get paid is to write your own code. When you write your own code, generally you think first and then type. Well, with Copilot you think first and then start typing a few symbols before seeing automatic suggestions. If they are right, you accept changes and if they happen to be similar to any other code out there, you deal with it exactly the same as if you typed those lines yourself.


But it is not the same as if you typed it yourself.

If you happen to type code that is similar to copyright code, that is generally considered legally OK.

If you copypaste copyrighted code, that is not legally OK.

If you accept that same code from an autocomplete tool, that can easily be seen as equivalent to the latter case rather than the former.


Yeah, because I always check the code of my programming partner for license violations.

That's more Trainee than Copilot.


If you use it as a programming partner it will simply autofill whatever you're writing line-by-line. You're not forced to use code completion at a whole-function level and it's not even the suggested use-case.


Then name a usage of the tool that is legally sound. I can not think of one.


Code completion that can suggest the whole line instead of a single word (e.g. often it guesses function parameters and various math operations when you haven't even typed function name yet).


Apache / MIT / BSD all have restrictions e.g. attribution clause.

Excluding GPL does not solve the problem.


Isn't it entirely possible that they did exclude GPL licensed code, but somebody somewhere has violated copyright and copy-pasted that snippet into non-GPL-licensed code that they trained on?

They could try to trace every single code snippet they train on to its "true source" and use the license for that, but that's not very well-defined, and is a lot harder, and it's never going to be 100%.


Which raises another question: ideally Copilot wouldn't be trained on "somebody somwhere", but is that happening?

To use the old trope — if the majority of programmers can't implement Fizzbuzz, but they do have a Github profile, are they being included too?

Hopefully there's some quality bar for the training set, i.e. some subset of "good" code (e.g. release candidate tags from fairly established OSS tools/frameworks in different languages) rather than any old code on the internet.


Nope. They did include GPL code.

> Once, GitHub Copilot suggested starting an empty file with something it had even seen more than a whopping 700,000 different times during training -- that was the GNU General Public License.

https://docs.github.com/en/github/copilot/research-recitatio...


Looks like Copilot is smart enough to understand its own licensing situation. It should continue to suggest this for any empty file.


Just the MP4, since it's hard to read in the smaller size: https://video.twimg.com/tweet_video/E5R5lsfXoAQDRkE.mp4


This is pretty clearly just a search engine with more parameters.

I thought there was something more going on with copilot, but the fact that it is regurgitating arbitrary code comments tells me that there is zero semantic analysis going on with the actual code being pulled in.


It's more that the model is so large it is capable of memorizing a lot. This can be seen in other language models like GPT-3 as well.

Comments, I suspect, will be more likely to be memorized since the model would be trained to make syntactically correct outputs, and a comment will always be syntactically correct. That would mean there is nothing to 'punish' bad comments.


The model in this case is just a lossy compression of github, and you search that.


It is decidedly not "just a search engine with more parameters." Language models are just prone to repeating training examples verbatim when they have a strong signal with the prompt. Arguably, in this case, it is the most correct continuation.


They openly claim it is an AI. What about the state of AI currently in use made you think that there was any intelligence behind it?


When do words lose their meaning? There's the word intelligence in the thing, and yet, no intelligence in the concept itself as known today.


Quake and GitHub are both owned by Microsoft now, perhaps we can assume this is a relicense?


Is it possible that Copilot just put Quake's source code into the public domain?

From the Copilot FAQ:

    Who owns the code GitHub Copilot helps me write?

    GitHub Copilot is a tool, like a compiler or a pen.
    The suggestions GitHub Copilot generates, and the code you write with its help, belong to you, and you are responsible for it.
    We recommend that you carefully test, review, and vet the code, as you would with any code you write yourself.
Copilot can probably recite most of Quake's source code and according to the FAQ, the output of Copilot belongs to the user.

I think a point where this argumentation might fail is that Quake's source code does not belong to Github directly, but instead both Github and Quake belong to Microsoft. However, I am not a lawyer, so I might be wrong.


Wow, Quake is owned by Microsoft. This is mind blowing, and a little sad.


It belongs to id software -> ZeniMax -> Xbox Game Studios -> Microsoft.


They have 4 hand picked examples on their homepage: https://copilot.github.com/

One has the issue with form encoding: https://news.ycombinator.com/item?id=27697884

The python example is using floats for currency, in an expense tracking context.

The golang one uses a word ("value") for a field name that's been a reserved word since SQL-1999. It will work in popular open source SQL databases, but I believe it would bomb in some servers if not delimited...which it is not.

The ruby one isn't outright terrible, but shows a very Americanized way to do street addresses that would probably become a problem later.

And these are the hand picked examples. This product seems like it needs some more thought. Maybe a way to comment, flag, or otherwise call out bad output?


> The ruby one isn't outright terrible, but shows a very Americanized way to do street addresses that would probably become a problem later.

As someone who has been coding up address storage and validation for the past week in my current job, that one really made me laugh. Mostly because it tries to simplify all the stuff I have been analyzing and mulling over for a week into a single auto-complete.

Spoiler: The Github Copilot's solution simply won't work. It would barely work for Americanized addresses, but even then not be ideal. Of course trying to internationalize it, this thing isn't even close.

I get what Copilot is trying to do. But at the same time I don't get it. Because from my experience, typing code is the fastest part of my job. I don't really have a problem typing. I spend most of my time thinking about the problem, how to solve it, and considering ramifications of my decisions before ever putting code in the IDE. So Copilot comes around and it autocompletes code for me. But I still have to read what it suggested, making edits to it, and consider if this is solving the problem appropriately. I'm still doing everything I used to do, except it saved me from typing out a block of code initially. I still have to most likely rebuild, edit, or change the function somewhat. So it just saves me from typing that first pass. Well that's the easy part of the job.

I have never had a manager come to me and ask why a project is taking so long where I could answer "it just takes so long to type out the code, i wish I had a copilot that could type it for me". That's why we call it software engineering and not coding. Coding is easy. Software engineering is hard. Github Copilot helps with coding, but doesn't help with Software Engineering.


> I spend most of my time thinking about the problem, how to solve it...

A few years ago, I got a small but painful cut on my fingertip. I thought I would have a hard time on the job as a dev. To my surprise, I realized I spend 90-95% of my time thinking, and only 5-10% of the time typing. It turned out to be almost a non-issue.


As the owner of a fairly normal American address that is corrupted by the UPS address validation service, this is a good time to remind everyone: accept the address that your customer enters. If you offer a service to try to improve your customer’s address, keep in mind that it’s a value added service, it may be wrong, and you MUST test the flow in which your customer tells your service to accept the address as entered. And maybe even collect examples in which the address change is accepted to make sure it does something useful.

Vendors have lost sales to me because they were too incompetent to allow me to ship things to my actual address. Oops.

P.S. for the US, you need to offer at least two lines for the address part. And you need to accept really weird things that don’t seem to parse at all. I know people with addresses that have a PO Box number and a PMB number in the same address. Lose one and your mail gets lost.

P.P.S. If you offer discounted shipping using something like SurePost, make sure you let your customers pay a bit extra to use a real carrier. There are addresses that are USPS-only and there are addresses that work for all carriers except USPS (and SurePost, etc). Let your customer tell you how to ship to them. Do not second-guess your customer.


> I don't really have a problem typing.

I'm absolutely with you and want to upvote that part of the comment x100. Unfortunately it's often considered a fairly spicy opinion.

Entire frameworks (Rails) are built around the idea of typing as little as possible. Others can't even be mentioned without the topic of boilerplate/keystroke count causing a flame war (Redux).

A lot of engineers equate their value with the amount of lines they can pump out, so there's definitely a demand for tools like these.

There's also some legitimate stuff. There are a lot of very silly things I have to google every time because I have a bad memory. It saves the step of googling. In a way, it was the same debate around autocomplete at the very beginning, but pushed to the next level. Autocomplete turned out to be a very good thing (even though new languages and tools keep coming out without it).


I never commit something that I can easily google (with a high quality solution) to memory


> Because from my experience, typing code is the fastest part of my job. I don't really have a problem typing. I spend most of my time thinking about the problem, how to solve it, and considering ramifications of my decisions before ever putting code in the IDE

So very true.

[1] Understanding the problem > [2] thinking about all possible solutions > [3] working out which solution fits best > [4] working out which implementations are possible > [5] working out the most suitable implementation

... and finally, [6] implementing via code.


A lot of my job is thinking hard about how to do [X], incidentally needing to remember how to do [trivial thing Y] and looking it up.

Like, I did it before, remember that it was trivial, I just forget the snippet and I have to break focus to look it up - often by scrolling through my own commit history to try and find the time I did [trivial thing Y] four months ago.

I do kind of wish I could automate that. Skipping the actual typing of the snippet is sort of gravy on top of that.


It would be nice if there were a way to automate the "remembering what that one function is called and what order the parameters are in" portion of my job.

IME the best thing for this is looking at the method listing in the docs for the classes I'm using. E.g. for Ruby, it's usually looking at the methods in Enumerable, Enumerator, Array, or Hash. Or I'll drop a binding.pry into the function, run it, and then type ls to see what's in scope.


Even in the 90s that was a solved problem in Visual Basic with autocomplete. That a lot of dev environments "lost" the ability to do it is mind boggling. With that said, doesn't Rubymine let you do that with autocomplete with the prompt giving you all the info you need? (I haven't done Ruby in a long time).

Still, having to look up the doc or run the code to figure out how to type it is orders of magnitude slower than proper auto complete (be it old school Visual Studio style, or something like Copilot).


> orders of magnitude slower than proper auto complete

Having worked extensively with verbose but autocomplete-able languages like Java, compact dynamic languages like Ruby, and a variety of others including C, Scala, and Kotlin, I've come to the conclusion that, for me, autocomplete is a crutch and I develop deeper understanding and greater capabilities when I go to the docs. IDE+Java encourages sprawl, which just further cements the need for an IDE. Vim+Ruby+FZF+ripgrep+REPL encourages me to design code that can be navigated without an IDE, which ultimately results in cleaner designs.

If there's any lag whatsoever in the autocomplete, it breaks my flow state as well. I can maintain flow better when typing out code than when it just pops into being after some hundreds of milliseconds delay. Plus, there's always the chance for serendipity when reading docs. The docs were written by the language creators for a reason. Every dev should be visiting them often.


That's totally cool but the grandparent was talking about remembering shit they already knew. Not everyone has a fantastic memory, and remembering whether the arguments are A then B or B then A doesn't deepen your understanding of a language. Most of the time the autocomplete and the official doc use the exact same source anyway, formatted the same way, with the same info.

But if it works for you, more power to you!


> remembering whether the arguments are A then B or B then A doesn't deepen your understanding

What I meant is that you will coincidentally learn new things by going to the docs for old/simple things. In addition to remembering that method ordering, you might learn about a new method that simplifies your task.


This sounds super interesting. Is there a video or upload somewhere where I can watch this being performed in real time?


I very briefly show some of the interactivity of Ruby+Pry here: https://youtu.be/Gy7l_u5G928?t=805 (the overall code segment starts at https://www.youtube.com/watch?v=Gy7l_u5G928&t=626s)

I'd be happy to hear about better demonstrations, and there's also Pry's website (https://pry.github.io/) where they link to some screencasts.


> I spend most of my time thinking about the problem, how to solve it, and considering ramifications of my decisions before ever putting code in the IDE. So Copilot comes around and it autocompletes code for me. But I still have to read what it suggested, making edits to it, and consider if this is solving the problem appropriately.

So, rather than helping people program better, all it's done is replace a bunch of the offshore cut-and-paste shops with "AI."


Isn't address storage and validation a solved problem? Why is it so complicated?


You are right that USPS maintains a database of canonical delivery points. However, it's inevitable this database might not be correct or up to date.

If you don't want to validate, then yes addresses are just a series of text fields. However, mapping them to that delivery point is where the problems arise.


Ex:

412 1/2 E E NE

412 1/2 A E

1E MAIN

1 E MAIN

FOO & BAR

123 ADAM WEST RD

123 ADAM WEST

123 EAST WEST


> This product seems like it needs some more thought. Maybe a way to comment, flag, or otherwise call out bad output?

Wait for your colleagues to use it, fix the bad code in the pull request, and wait for copilot to learn from the new training data you just provided!


This is actually a good idea that is missing from nearly every machine learning product. How do you backpropagate lessons from user interaction into future training of the model? It can be done, though I can't think of a place I've seen it done.


It would be viewed as IP theft by most companies to upload private code to this for use by others


It would have to be in the same range of what is suggested, small patches and opt in.

If snippets are a legal problem, then Copilot is problematic by default, since it suggests code that may or may not be sourced from free software.


Even free software snippets come with conditions, like the GPL's copyleft or attribution requirements.

Putting GPL code in a proprietary codebase would cause a company massive headaches...

So I agree copilot is problematic by default: a liability for employers in the form of lawsuits and forced open sourcing, and a liability for IP lawsuits that will end up on employees' shoulders.


It's tricky, because once you start accepting user feedback, you need to moderate it, or else someone will poison your model for fun and profit.


But what about all the bad training data provided too?


Copilot appears to be “give more efficiency leverage to the worst kind of coder.”


It takes what should be your method of last resort - copypaste - and makes it the first thing you try.

All the steps in between - looking at the docstring for the function you're calling, googling for more general information, looking at and deciding not to use not-applicable or poorly-written SO answers - get pushed aside. So instead of you having to convince yourself "yes, it's safe to copy-paste these lines from SO, they actually fit my problem" you're presented with magic and I think the burden for rejecting it is going to be higher once it's in your editor than when you're just reading it on a SO post or Github snippet.

Even for a newcomer looking to learn, working on simple stuff that it has great completions for, it seems like it will sabotage your long-term growth, since it takes all the why and the reasoning out of it. Autocomplete for a function name isn't that relevant to gaining a deeper understanding. Knowing why a certain block of code is passed in in a certain style, or needs to be written at all? Probably that is.


Thinking about it more: there's a very small subset of problems that I think this is actually great for. And I do run into this somewhat often: relatively new libraries or frameworks that don't really care about thorough documentation so they only show you a few happy path snippets and nothing about how to do something more interesting, so you have to bridge the gap between "this one line in the doc obviously doesn't work with me, but I'd like to figure it out without reading all their source code from scratch..." - getting more example snippets barfed up onto my screen from other people who've figured it out before could be a sort of replacement for the library writers having provided documentation in the first place. But ... this is a somewhat insane way to work around a problem of shitty code documentation, and is still insufficient in a couple ways:

* some poor bastard is going to have to be the first person to figure out how to do something, so that copilot itself can know

* any non-code nuances around "oh, if you do that, your memory usage is going to explode" or "oh, by the way, if you do that, make sure you don't do your own threading" will still fail to be communicated.


I'm not really sure that type of tool could really be anything else.

How would a model become aware of all of the various edge cases that depend on which SQL database you use or differences in language versions over time?


> I'm not really sure that type of tool could really be anything else.

It can't be, because they've chosen to use a deep learning approach. That makes it a dead end right from the start.

> How would a model become aware of all of the various edge cases that depend on which SQL database you use or differences in language versions over time?

A lot of things that we call "edge cases" are only a problem for humans. They're not "edge cases" from the point of view of the grammar / semantics of programming languages and libraries. The way a hypothetical, better Copilot could work, is by having directly encoded grammars and semantics metadata corresponding to popular languages and tools. It could generate code in principled and introspectable way, by having a model of the computation it wants to express and encoding it in a target language.

Of course, such hypothetical Copilot is a harder task - someone would have to come up with a structure for explicitly representing understanding of the abstract computation the user wants to happen, and then translate user input into that structure. That's a lot of drudgery, and from my vague understanding of the "classical" AI space, there might be a bunch of unsolved problems on the way.

Real Copilot uses DNNs, because they let you ignore all that - you just keep shoving code at it, until the black-box model starts to give you mostly correct answers. The hard work is done automagically. It makes sense for some tasks, less for others - and I think code generation is one of those things where black-box DNNs are a bad idea.


> The way a hypothetical, better Copilot could work, is by having directly encoded grammars and semantics metadata corresponding to popular languages and tools. It could generate code in principled and introspectable way, by having a model of the computation it wants to express and encoding it in a target language.

But that sounds like too much work, let's just throw a lot of data into an NN and see what comes out! /s

> and introspectable

Which most importantly means "debuggable", I assume. From what I get there doesn't seem to be any way to ad-hoc fix an NN's output.


Can it submit pull requests to itself with if/else boolean logic/hacks?


a large data set covering exactly what you just mentioned?


This is my thought as well. I get the "make productive engineers even more productive" angle, but productive engineers' bottleneck isn't coding. Sure, coding up a boilerplate Go web server is tedious, but I have done it so many times that it takes me two seconds now.

On the flip side, coding can be the bottleneck for the worst kind of coder. When I first started coding, coding was hard simply because I had very little reps and was just learning to understand how to code common solutions, data structures, libraries, etc. Fast forward a few years and, if I were still struggling to understand these concepts, Copilot is a lifeline.


I’m gonna have to disagree - coding can and does take significant amounts of time even when I know exactly what problem I am solving.

I admit that at many organizations there are so many other factors and bottlenecks, but it’s not uncommon that I find myself 8+ hours deep into a coding task that I had expected would be much shorter.

On the other hand, usually that’s due to refactoring or otherwise not being satisfied with the quality of my initial solution, so copilot probably wouldn’t help…


Hmm... I mean, these all seem like mistakes I could make and I don't think I'm the "worst kind of coder".

The currency one I learned a while back, but it's not like I intuited using integers by default.

Value being a reserved keyword, I'm not sure I'd know that and I do Postgres work as part of my myriad duties at the startup I work at. Maybe I'd make that mistake in a migration, maybe I have already.

In a way, is it much different than what we do now as engineers? I'm hard pressed to call it much of an engineering discipline considering most teams I work on barely do design reviews before they launch into writing code, documentation and meeting minutes are generally an afterthought, and the code review process, while decent, isn't perfect either and oftentimes relies on arcane knowledge derived over months and years of wrangling with a particular <framework, project, technology>.

It's pretty neat, presumably it'll learn as people correct it, and it'll get better over time. I mean it's not even version one.

I get the concerns, but I think they're a bit overblown, and this'll be really useful for people who want to learn how to code. Sure they'll run into some bugs, but, I mean, they were going to do that anyways.


Is this any worse? Maybe not. Is it better? Absolutely not.

This kind of tool will only further entrench the production of mediocre, bug-ridden code that plagues the world. As implemented, this will not be a solution; it is an express lane in the race to the bottom.


it is a race to the bottom, and people are trying to win. any skilled trade is being turned into an unskilled job. it might suck, the results might suck, but it's more profitable, and that's what matters.


I've called it "The Fully Mechanized Stack Overflow Brigade" before, and everything that comes to light supports that assessment.

On the upside, think of the consultancy fees you can charge to clean up those messes.


I find it is reducing my research time by providing a decent starting solution space. Especially for boring stuff where you just need to google the signature of some standard library function.


> The golang one uses a word ("value") for a field name that's been a reserved word since SQL-1999. It will work in popular open source SQL databases, but I believe it would bomb in some servers if not delimited...which it is not.

In their defense they created the table with this column before invoking the autocomplete, so they sort of reap what they sow here.

It could at least auto-quote the column names to remove the ambiguity, but it's not a compiler, is it.


- The Go one (averaging) is non-idiomatic, and has a nasty bug in it: https://news.ycombinator.com/item?id=27698287

- The JavaScript one (memoization) is a bad implementation, it doesn't handle some argument types you'd expect it to handle: https://news.ycombinator.com/item?id=27698125

You can tell a lot about what to expect, if there are so many bugs in the very examples used to market this product.


The golang one also silently drops rows.Err() on the floor.

https://golang.org/pkg/database/sql/#Rows


I haven't seen anyone mention this issue for some reason, but in fetch_tweets.py:

  fetch_tweets_from_user(user_name):
      ...
      tweets = api.user_timeline(screen_name=user, count=200, include_rts=False)  
'user' isn't defined, should be user_name, right? Side note, 'copilot' is a decent name for this (though copilots are usually very competent, moreso than this right now). You must check the suggestions carefully. Maybe it'll make folks better at code review, lol.


What are the statistics of Copilot based on a validation set? How often does it get code right?

I want to see hard statistics, not 4 hand-picked examples.


That's what I thought when I first started working in text generation too. It's highly annoying people pitch their successful models with hand picked examples. It's literally the opposite of STATISTICAL learning imo.


Yeah. And like how would you even devise a metric? Like compile it down to assembly and see if it’s similar logic?


Well, this is the question which the producers of the tool should answer.

You can't just release a ML tool onto the public if you haven't validated it first.


The Golang example would not even compile, because `sql` is not imported.


That's for the best. We don't want products that pretend to write code for us, while copying others' code without attribution, and that may not even work.


Now that they have an AI that can be trained to replicate code, it looks like the next step is training it to replicate good code. That will be non-trivial, since step one is identifying good code and they may not have much big data signal to draw from for that.

We know you can't use StackOverflow upvotes. However, they should have enough signal to identify what snippets of code have been most frequently copy-pasted from one project to another.

Question is whether that serves as a good proxy for good code identification.


> The python example is using floats for currency.

Dumb question, but what is the proper way to handle currency? Custom number objects? Strings for any number of decimal places?


This is a complex topic, mainly for two reasons: 1. it works on two layers (storage and code) 2. there is a context to take care of.

[Modern] programming languages have decimal/rational data types, which (within limits) are exact. Where this is not possible, and/or it's undesirable for any reason, just use an int and scale it manually (e.g. 1.05 dollars = int 105).

However, point 2 is very problematic and important to consider. How do you account for 3 items that cost 1/3 of a dollar each (e.g. if in a bundle)? What if they're sold separately? This really depends on the requirements.

My 20 cents: if you start a project, start storing currency in an exact form. Once a project grows, correcting the FP error problem is a big PITA (assuming it's realistically possible).
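
To make that concrete, here's a minimal Python sketch of the two exact approaches above (scaled integers and a decimal type); the prices are made up:

    # Scaled integers: keep every amount in the smallest unit (cents).
    price_cents = 105                  # $1.05
    print(3 * price_cents)             # 315 -> format as $3.15 only at display time

    # Decimal type: exact decimal arithmetic, no binary float error.
    from decimal import Decimal
    price = Decimal("1.05")
    print(3 * price)                   # 3.15

    # Contrast with binary floats:
    print(1.03 - 0.42)                 # 0.6100000000000001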


>[Modern] programming languages have decimal/rational data types

This caveat is kind of funny, in light of COBOL having support for decimal / fixed precision data types baked directly into the language.

It's not a problem with "non-modern" languages, it's a problem with C and many of its successors. That's precisely why many "non-modern" languages have stuck around so long.

https://medium.com/the-technical-archaeologist/is-cobol-hold...

Additionally, mainframes are so strongly optimized for hardware-accelerated fixed point decimal computing that for a lot of financial calculations it can be legitimately difficult to match their performance with standard commercial hardware.


> It's not a problem with "non-modern" languages, it's a problem with C and many of its successors.

Not really. Any semi-decent modern language allows the creation of custom types which support the desired behavior and often some syntactic sugar (like operator overloading) to make their usage more natural. Take C++, for example, the archetypal "C successor": It's almost trivial to define a class which stores a fixed-precision number and overload the +, -, *, etc. operators to make it as convenient as a built-in type, and put it in a library. In my book, this is vastly superior to making such a type a built-in, because you can never satisfy everyone's requirements.


It is also trivial to keep making C mistakes with a C++ compiler; no matter how many ISO revisions there are, the lack of safety due to C copy-paste compatibility will never be fixed.


> [...] no matter how many ISO revisions there are, the lack of safety due to C copy-paste compatibility will never be fixed.

Okay, no idea how that's relevant to "built-in decimal types" vs "library-defined decimal types", but if it makes you feel better, you can do the same in Rust or Python, two languages which are "modern" compared to COBOL, don't inherit C's flaws, and which enable defining custom number types/classes/whatever together with convenient operator overloading.


Rust I agree, Python not really as the language doesn't provide any way to keep invariants.


> Python not really as the language doesn't provide any way to keep invariants

Again, how is that relevant? If there's no way to enforce an invariant in custom data types, then there's also no way to enforce invariants in code using built-in data types.


It is surely relevant.

Rust provides the mechanisms to enforce them, while in Python, like all dynamic languages, everything is up for grabs.


What I meant [1] was: In Python, invariants are enforced by conventions, not by the compiler. If that's not suitable for a given use case, then Python is entirely unsuited for that use case, regardless whether it provides built-in decimal types or user-defined decimal types. That's why I said that your objection regarding invariant enforcement is irrelevant to this discussion.

[1] (but was too lazy to write out)


It is quite simple to do the same in Julia


> How do account 3 items that cost 1/3$ each (e.g. if in a bundle)?

You never account for fractional discrete items, it makes no sense. A bundle is one product, and a split bundle is another. For products sold by weight or volume, it's usually handled with a unit price, and a fractional quantity. That way the continuous values can be rounded but money that is accounted for needs not be.


The problem is also stupid people and companies.

My last job they wanted me to invoice them hours worked, which was some number like 7.6.

This number plays badly when you run it through GST and other things - you get repeaters.

So I looked up common practice here, even tried asking finance who just said "be exact", and eventually settled on that below 1 cent fractions I would round up to the nearest cent in my favour for each line item.

First invoice I hand them, they manually tally up all the line items and hours, and complain it's over by 55 cents.

So I change it to give rounded line items but straight multiplied to the total - and they complain it doesn't match.

Finally I just print decimal exact numbers (which are occasionally huge) and they stop complaining - because excel is now happy the sums match when they keep second guessing my invoices.

All of this of course was irrelevant - I still had to put hours into their payroll system as well (which they checked against) and my contract specifically stated what my day rate was to be in lieu of notice.

So how should you do currency? Probably in whatever form matches how finance is using Excel, which does it wrong.


I wish this was untrue, but I have spent years hearing the words "why dont my reports match?" - no amount of logic, diagrams, explaining, the next quarter or instance - "why dont my reports match?"

BECAUSE EXCEL SUCKS MY DUDE.


Well, they did say to be exact, and you handed them approximations, so...


The “exact” version they wanted was full of approximations too. They just didn’t have enough numerical literacy to understand how to say how much approximation they are ok with.

I guarantee nothing in anyone’s time accounting system is measured to double-precision accuracy. Or at least, I’ve never quite figured out the knack myself for stopping work within a particular 6 picosecond window.


Sure, but at the end of the day someone had to pay me an integer amount of cents. They wanted a total which was a normal dollar figure. But when you sum up 7.6 times whatever a whole lot, you might get a nice round number or you might get a repeating decimal.

What's notable is clearly no one had actually thought this through at a policy level - the answer was "excel goes brrrr" depending on how they want to add up and subtotal things.


Generally what is done is that "int 1 != $0.01"; rather, it's "int 100 = $0.01", as in the base of the integer is 1/100th of a cent. That doesn't solve your example case perfectly though, admittedly.


There's no one answer, but decimal counts of the smallest unit that needs to be measured is common. Like pennies in the US, or maybe "number of 1/10 pennies" if there's things like gasoline tax.


Say what you like about COBOL, but it got this stuff right.


You can use integers instead of decimal if you're using the smallest unit.


For Python, I prefer decimal.Decimal[1]. When you serialize, you can either convert it to a string (and then have your deserializer know the field type and automatically encode it back into a decimal) OR just agree all numeric values can only be ints or decimals. You can pass parse_float=decimal.Decimal to json.loads[2] to make this easier.
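
For instance (a quick sketch; the payload and field names are made up):

    import json
    from decimal import Decimal

    payload = '{"item": "coffee", "price": 4.20}'

    # Default: JSON numbers with a decimal point come back as binary floats.
    print(json.loads(payload)["price"])               # 4.2 (a float)

    # parse_float=Decimal hands the raw "4.20" text to Decimal instead.
    data = json.loads(payload, parse_float=Decimal)
    print(repr(data["price"]))                        # Decimal('4.20')

    # When serializing, convert back to a string (or integer cents).
    print(json.dumps({"price": str(data["price"])}))  # {"price": "4.20"}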

My most obnoxious and spicy programming take is that ints and decimals should be built-in and floats should require imports. I understand why though: Decimal encoding isn't anywhere near as standardized as other numeric types like integers or floating-point numbers.

[1] https://docs.python.org/3/library/decimal.html [2] https://docs.python.org/3/library/json.html


> My most obnoxious and spicy programming take is that ints an decimals should be built-in and floats should require imports

I don't care about making inexact numbers require imports, but the most natural literal formats should produce exact integers, decimals, and/or rationals.


An integer of the smallest denomination. For example, cents for the American dollar. And you probably would want to wrap it in a custom type to simplify displaying it properly, and maybe handle different currencies. If your language has a fixed point type, that might also be appropriate, but that's pretty rare, and wouldn't work for currencies that aren't decimal (like the old British pound system).


Do they still use fractional cents (or whatever) in finance?

https://money.howstuffworks.com/personal-finance/financial-p...


What if I'm calculating sales tax? Can't use an integer anymore.


Yes, you can. There are algorithms for rounding up, rounding down, rounding to nearest, and banker's rounding, on the results of integer division. This is a solved problem.
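
For example, a sketch of round-half-up on integer cents (the 8.25% rate and the half-up rule are arbitrary choices here; the actual rounding rule is jurisdiction-specific):

    # 8.25% sales tax on a $19.99 item, everything in integer cents.
    price_cents = 1999
    rate_num, rate_den = 825, 10000        # 8.25% as an exact ratio

    # round-half-up(n / d) == (2n + d) // (2d), using only integer math
    tax_cents = (price_cents * rate_num * 2 + rate_den) // (rate_den * 2)

    print(tax_cents)                       # 165 (exact value is 164.9175 cents)
    print(price_cents + tax_cents)         # 2164 -> $21.64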


To pile on, here's a copy/paste from when this was asked a few days ago:

Googler, opinions are my own. Over in payments, we use micros regularly, as documented here: https://developers.google.com/standard-payments/reference/gl...

GCP on the other hand has standardized on unit + nano. They use this for money and time. So a unit would be 1 second or 1 dollar, then the nano field allows more precision. You can see an example here with the unitPrice field: https://cloud.google.com/billing/v1/how-tos/catalog-api#gett...

Copy/paste the GCP doc portion that is relevant here:

> [UNITS] is the whole units of the amount. For example if currencyCode is "USD", then 1 unit is one US dollar.

> [NANOS] is the number of nano (10^-9) units of the amount. The value must be between -999,999,999 and +999,999,999 inclusive. If units is positive, nanos must be positive or zero. If units is zero, nanos can be positive, zero, or negative. If units is negative, nanos must be negative or zero. For example $-1.75 is represented as units=-1 and nanos=-750,000,000.
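
A little sketch of that split in Python (just following the quoted rules, not any official library):

    from decimal import Decimal

    def to_units_and_nanos(amount: Decimal):
        """Split a money amount into (units, nanos), nanos being 10^-9 units."""
        sign = -1 if amount < 0 else 1
        magnitude = abs(amount)
        units = int(magnitude)
        nanos = int((magnitude - units) * 10**9)
        return sign * units, sign * nanos

    print(to_units_and_nanos(Decimal("-1.75")))   # (-1, -750000000), as in the doc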


In its base unit. So cents in USD. Which can be an int64

Or if your language has something specific built in, use that.


> Or if your language has something specific built in, use that.

Unless your language is PostgreSQL's dialect of SQL, apparently. https://wiki.postgresql.org/wiki/Don%27t_Do_This#Don.27t_use...


It has the same issue that the other suggestion of your parent comment had: it can’t deal with fractions of cents, which is an issue you will most likely run into before you run into floating point rounding issues.


Of course for databases you should use a decimal.


> In its base unit. So cents in USD. Which can be an int64.

Note that if you use cents in the US so that everything is an integer then, as long as you do not have to deal with amounts outside the range [-$90 trillion, $90 trillion] (about ±2^53 cents), you can also use double. Double can exactly represent all integer numbers of cents in that range.

This may be faster than int64 on some systems, especially on systems that do not provide int64 either in hardware or in the language runtime so you'd have to do it yourself.
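
A quick sanity check of that bound, in Python:

    # Every integer with magnitude up to 2**53 is exactly representable as an
    # IEEE-754 double; in cents, that's roughly $90 trillion.
    print(2**53)                            # 9007199254740992 cents
    print(2**53 / 100 / 1e12)               # ~90.07 (trillions of dollars)
    print(float(2**53) == 2**53)            # True
    print(float(2**53 + 1) == 2**53 + 1)    # False: 2**53 + 1 is not representable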


Not necessarily. It depends on the application.


Integer cents or an arbitrary precision decimal type.


Having worked on a POS system, the issue of using cents alone is if you've got something like "11% rebate" and you need to deal with fractional cents.

The arbitrary precision decimal type should be the default answer for currency until it is shown that the requirements do not now, and will not at any time in the future, require fractional units of the smallest denomination.

As an aside, this may be constrained by the systems that the data is persisted into too... the Buffett Overflow is a real thing ( https://news.ycombinator.com/item?id=27044044 ).


A lot of good answers, but they mostly relate to accounting types of problems (which granted, is what you need to do with currency data 99% of the time)

I’d just add that if you are building a price prediction model, floats are probably what you need.


The example code is the start of an expense tracking tool...


Depends what you’re doing. In fact it’s not always wrong to use floats for currency. For accounting you should probably use a fixed-precision decimal type.


If someone asks how to handle money the best answer is integers or fixed precision decimals. There may be a valid case for using floats, but if someone asks they shouldn't be using floats.

Also I'm hard pressed to come up with a case where floats would work. Can you give an example?


> Can you give an example?

The answer is the same as _any_ time you should use floats: where you don't care about answers being exact, either (1) because calculation speed is more important than exactness, or (2) because your inputs or computations involve uncertainty anyway, so it doesn't matter.

This is more likely to be the case in, say, physics than it is in finance, but it's not impossible in the latter. For example, if you are a hedge fund and some model computes "the true price of this financial instrument is 214.55", you certainly want to buy if it's being sold for 200, and certainly don't if it's being sold for 250, but if it's being sold for 214.54, the correct interpretation is that _you aren't sure_.

When people say "you should never use floats for currency", their error is in thinking that the only applications for currency are in accounting, billing, and so on. In those applications, one should indeed use a decimal type, because we do care about the rounding behavior exactly matching human customs.


You can't use a generic decimal type in that case either! You need a special-purpose type that rounds exactly matching the conventions you're following. This is necessarily use-, culture-, and probably currency-specific.


Good answer. I've only ever worked on accounting style financial apps, so I didn't think of those types of cases.


That's fair, though the example code I mentioned is the start of an expense tracker.


Fair enough -- in that case, you should definitely use either a decimal type or an integer.


Most things in front office use floats in my experience, e.g. derivative pricing, discounting, even compound interest. None of these things are going to be any better with integers or fixed-precision, but maybe harder to write and slower.


Yes, the risk management/instrument pricing part in the "Front Office" uses floats, because the calculations involve compound interest and discount rates.

And the downstream parts for trade confirmation ("Middle Office"), settlement and accounting ("Back Office") used fixed precision. Because they are fundamentally accounting, which involves adding things up and cross-checking totals.

These two parts have a very clear boundary, with strictly defined rounding rules when the floating point risk/trading values get turned into fixed point accounting values.


python has the `decimal` module in the stdlib


Create a Money class, or use one off the shelf. It should store the currency and the amount. There are a few popular ways of storing amounts (integer cents, fixed decimal) but it should not be exposed outside the Money class.

There's plenty of good advice in this subthread for how to represent currency inside your Money abstraction, but whatever you do, keep it hidden. If you pass around numbers as currency values you will be in for a world of pain as your application grows.
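
A minimal sketch of that kind of wrapper in Python (the names and the internal choice of Decimal are illustrative, not prescriptive):

    from dataclasses import dataclass
    from decimal import Decimal, ROUND_HALF_UP

    @dataclass(frozen=True)
    class Money:
        amount: Decimal        # internal representation, never exposed raw
        currency: str

        def __add__(self, other: "Money") -> "Money":
            if self.currency != other.currency:
                raise ValueError("cannot add different currencies")
            return Money(self.amount + other.amount, self.currency)

        def rounded(self) -> "Money":
            # The rounding policy lives in exactly one place.
            return Money(self.amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP),
                         self.currency)

        def __str__(self) -> str:
            return f"{self.amount} {self.currency}"

    print(Money(Decimal("19.99"), "USD") + Money(Decimal("1.647"), "USD"))  # 21.637 USD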


Either a fixed-point decimal (i.e. an integer whose unit represents 1/100, 1/1000, etc. of a dollar), or a ratio type if you need arbitrary precision.


> ratio type if you need arbitrary precision.

This is the better default, so I'd ditch the qualifier, personally. At the very least when it comes to the persistent storage of monetary amounts. People often start out thinking that they won't need arbitrary precision until that one little requirement trickles into the backlog...

Arbitrary precision rationals handle all the arithmetic you could reasonably want to do with monetary amounts, and they let you decide where to round at display time (or when generating a final invoice or whatever), so there's no information loss.
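
Rough sketch of what that looks like in Python with fractions.Fraction (the amounts are made up):

    from fractions import Fraction

    # Three line items at exactly one third of a dollar each.
    unit_price = Fraction(1, 3)
    subtotal = 3 * unit_price
    print(subtotal)                        # 1 -- exact, nothing was lost

    # Keep full precision in storage; round only when displaying or invoicing.
    total_cents = round(subtotal * 100)
    print(f"${total_cents / 100:.2f}")     # $1.00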


> Dumb question, but what is the proper way to handle currency?

In python, for exact applications (not many kinds of modeling, where floats are probably right), decimal.Decimal is usually the right answer, but fractions.Fraction is sometimes more appropriate, and if you are using NumPy or tools dependent on it, using integers (representing decimals multiplied by the right power of 10 to get the minimum unit in the ones position) is probably better.


Yeah, you probably want to use some sort of decimal package for a configurable amount of precision, and then use strings when serializing/storing the values


Every front office finance project I have ever worked on has used floating point, so take the dogma with a grain of salt. It depends entirely on the context.


They probably just accumulate the rounding errors into an account and write it off periodically without even realising why it happens.


No, it's just that we're in the realm of predictions and modelling, not accounting. If you're constructing a curve to forecast 50 years of interest rates from a limited set of instruments, you're already accepting a margin of error orders of magnitude greater than the inaccuracies introduced by floating point.

The models also use transcendental functions which cannot be accurately calculated with fixed point, rationals, integers etc.


Makes sense; I wasn't aware of the meaning of "front office" as a term of art in finance.


It's not like decimal or fixed point does not suffer from rounding errors either. In fact for many calculations, binary floating point gives more accurate answers.

In accounting there are specific rules that require decimal system, so one must be very careful with the floating point if it is used.


And they all suffer from rounding error problems?

I mean, fixed point and a specific type for currency (which also should include the denomination, while we are at it) are not rocket science. Spreadsheets get that right, at least.


Excel uses IEEE-754 floating point, so I don't get what you mean with the spreadsheet comment. It has formatting around this which rounds and adds currency symbols, but it's floating point you're working with.

Rounding error doesn't matter on these types of financial applications. It's the less glamorous accounting work that has to bother with that.

They're not rocket science, but they're unnecessary, and would still be off anyway. Try and calculate compound interest with your fixed point numbers.


Each country has a law or something similar that states how people should do calculations on prices.

The usual is to use decimal numbers with fixed precision (the actual precision varies from one country to another), and I don't know of any modern exception. But as late as the 90's there were non-decimal monetary systems around the world, so if you are saving any historic data, you may need something more complex.


Someone already mentioned there's a `decimal` package in Python that's better suited for currency. Back when I was a Java developer we used this: https://docs.oracle.com/javase/7/docs/api/java/math/BigDecim...


The Decimal class is one way if you roll your own. py-moneyed seems to be a well maintained library though I haven’t used it.

Disclaimer: I only work with currency in hobby projects.


These are great examples. I wrote about how this will propagate all sorts of bugs.

But my argument was that it's good enough that developers may get complacent and not review the autocomplete closely enough. But maybe I'm wrong! Maybe it's not that good yet.


“Maybe a way to comment, flag, or otherwise call out bad output?”

A copilot for copilot? :)


post process with some language aware heuristics maybe


> And these are the hand picked examples. This product seems like it needs some more thought.

Everyone's self-preservation instincts kicking in to attack Copilot is kinda amusing to watch.

Copilot is not supposed to produce excellent code. It's not even supposed to produce final code, period. It produces suggestions to speed you up, and it's on you to weed out stupid shit, which is INEVITABLE.

As a side note, Excel also uses floats for currency, so best practice and real world have a huge gap in-between as usual.


So how do you know if the code that Copilot regurgitates is almost a 1:1 verbatim copy of some GPL'ed code or not ?

Because if you don't realize this, you might be introducing GPL'ed code into your proprietary code base, and that might end up forcing you to distribute all of the other code in that code base as GPL'ed code as well.

Like, I get that Copilot is really cool, and that software engineers like to use the latest and bestest, but even if the code produced by Copilot is "functionally" correct, it might still be a catastrophic error to use it in your code base due to licenses.

This issue looks solvable. Train 2 copilots, one using only BSD-like licensed software, and one using also GPL'ed code, and let users choose, and/or warn when the snippet has been "heavily inspired" by GPL'ed code.

Or maybe just train an adversarial neural network to detect GPL'ed code, and use it to warn on snippets, or...


You have the same issue with MIT because it requires attribution


Doesn't this go beyond license and into copyright?

The license lets you modify the program, but the copyright still enforces that you can't copy/paste code from it into your own project, no?


The solution might be simpler than we think, just tell the algorithm


It's very easy: don't use copilot code verbatim, and you won't have GPL code verbatim.


> It's very easy: don't use copilot

Fixed that for you.

Verbatim isn't the problem / solution. If you take a GPL'ed library and rename all symbols and variables, the output is still a GPL'ed library.

Just seeing the output of GPL'ed code spit out by copilot and writing different code "inspired" by it can result in GPL'ed code. That's why "clean rooms" exist.

Copilot is going to make for a very interesting to follow law case, because probably until somebody sues, and courts decide, nobody will have a definitive answer of whether it is safe to use or not.


Have you met programmers? Even those who care about quality are often under a lot of pressure to produce. Things slip through. Before, it was verbatim copies from Stack Overflow. Now it'll be using Copilot code as-is.


So, nothing new, is your point?


Then why are you complaining? Unless something is new that warrants you getting mad about people getting mad at technology.


Not the parent, but people really like to get riled up on the same topics, over and over again, which quickly monopolizes and derails all conversation. Facebook bad, UIs suck, etc. We can now add to the list, "AI will never reduce demand for software engineering".


Well, "never" is a long time.

Copilot is definitely no replacement for anything except copying from Stack Overflow for juniors.

But in the long run, AI is basically us creating our own replacement. As a species. We don't realize it yet. It'll be really funny in retrospect. Too bad I probably won't be alive to see it.


It's true I probably wouldn't have laughed quite as loudly if there weren't a chorus of smug economists telling us that tools like this are gonna put me out of a job.


Business types hate dealing with programmers, that's a fact. And these claims of "we'll replace programmers" happen with certain precise regularity.

Ruby on Rails was advertised as so simple, startup founders who can't program were making their entire products in it in a few days, with zero experience. As if.


Economists don't believe this. It's non-economists who do. Economists know that it's not possible to run out of jobs because demand is infinite.


If I want random garbage in my codebase that I have to fix anyways, I might as well hire an underpaid intern/junior.

It's easier to write correct code than to fix buggy code. For the former you have to understand the problem, for the latter you have to understand the problem, and a slightly off interpretation of it.


> Everyone's self-preservation instincts kicking in to attack Copilot is kinda amusing to watch

Nobody is threatened by this, assuredly. As with IDEs giving us autocomplete, duplication detection, etc this can only be helpful. There is an infinite amount of code to write for the foreseeable future, so it would be great if copilot had more utility.


>As a side note, Excel also uses floats for currency

It's still problematic, but the defaults and handling there avoid some issues. So, for example:

Excel: =1.03-.42 produces 0.61, by default, even if you expand out the digits very far.

Python: 1.03-.42 produces 0.6100000000000001, by default.


Excel rounds doubles to 15 digits for display and comparison. The exact precision of doubles is something like 15.6 digits, those remaining 0.6 digits causing some of those examples floating (heh) around.



A lot of these edge cases are about theoretical concerns like "how many digits we need in decimal to represent an exact IEEE binary float".

In practice a double is 15.6 digits precise, which Excel rounds to 15 to eliminate some weirdness.

In their documentation they do cite their number type as 15 digit precision type. Ergo that's the semantic they've settled on.
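
Same arithmetic in Python, for comparison (a quick sketch):

    # Python prints the shortest decimal that round-trips, hence the stray ...01:
    print(1.03 - 0.42)              # 0.6100000000000001

    # Formatting to 15 significant digits (roughly Excel's display rule) hides it:
    print(f"{1.03 - 0.42:.15g}")    # 0.61
    print(round(1.03 - 0.42, 15))   # 0.61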


"self-preservation"

My suggestion was a way to comment or flag, not to kill the product. These were particularly notable to me because someone hand-picked these 4 to be the front page examples of what a good product it was.


I agree with you. This is basically similar to autocomplete on cellphone keyboard (useful because typing is hard on cellphone), but for programming (useful because what we type tends to involve more memorization than prose).


I'm not surprised to be honest. I've played around with AI dungeon, which also uses GPT-3. It regularly reproduces content directly from its training material, including even comments attached to the stories they trained the ai on.


Direct url to the "gif" in that twitter post: https://video.twimg.com/tweet_video/E5R5lsfXoAQDRkE.mp4

I could not figure out how to show it larger on the twitter UI. I don't have a twitter account so that may be the problem.


Sounds like we need another tool called "Auditor" that scans your code to see if it violates copyright laws.


This reminds me of an issue that came up when I was working with an intelligence agency, training machine translation.

If you think about language in general, individual words aren't very sensitive. The word for bomb in any language is public knowledge. But when you start getting to jargony phrases, some might be unique to an organization. And if you're training your MT on translated documents surreptitiously intercepted from West Nordistan's nuclear program, and make your MT model public, the West Nordistanis might notice - "hey, this accurately translates our non-public documents that contain rather novel phrases ... I think someone's been listening to us!"


Even includes the commented out code. Clearly Copilot has gained a deep understanding of code and is not simply the slowest way to make a terrible, opaque search engine ever!


From the tweet it looks like an awesome search feature. Just type what you wanted to search for right inline and then it can drop the result in without you ever changing a window or moving a hand to the mouse.

Problem is you don't know whose code you're stealing, which leads to all sorts of legal, security, and correctness issues.


Does GitHub Copilot write perfect code?

No. GitHub Copilot tries to understand your intent and to generate the best code it can, but the code it suggests may not always work, or even make sense. While we are working hard to make GitHub Copilot better, code suggested by GitHub Copilot should be carefully tested, reviewed, and vetted, like any other code. As the developer, you are always in charge.

https://copilot.github.com/

EDIT: the text above is a direct quote from the Copilot website


> ...may not always work, or even make sense...

Naively, as someone who just heard of this - that sounds worse than useless. If you can't trust its output and have to verify every line it produces and that the combination of those lines does what you wanted, surely it's quicker just to write the code yourself?


Just today I needed to quickly load a file into a string in golang. I haven't done that in a while, so I had to go look up what package and function to use for that. I'd love a tool that would immediately suggest a line saying `ioutil.ReadFile()` after defining the function. I would never accept a full-function suggestion from Copilot, similarly to how I never copy and paste code verbatim from StackOverflow. Using it as hints for what you might want to use next seems like a nice productivity boost.


Then write the code yourself. It's not like you're forced to use this demo.


Well, you're right. I was somehow expecting there might be a silver lining I'd missed but perhaps not.


Not exactly a confidence-inspiring reply from someone who just identified themselves as representing the project here!


I don't work for Github (nor MS) and do not represent Copilot.


It’s quite literally stealing code from repos under a GPL license and suggesting them to people regardless of license (if any) they’re using. I do not see how this is legal.


I disagree with this attitude. Many demos such as this one with Quake code are intentionally looking for (funny) outliers by bending the rules. But this is not how anyone would use the system in a real scenario (no one should select license by typing "// Copyright\t" and selecting whatever gets auto-completed), so it doesn't really demonstrate any new limits besides what you could reasonably expect anyway (and what's mentioned on the Copilot's landing page).

Basically, in order to fall victim for this "code theft" (or any other "footguns" from Twitter threads) you'd need to be actively working against all the best practices and common sense. If you actually use it as a productivity tool (the way it is marketed) you'll remain in full control of your code.


> are intentionally looking for (funny) outliers by bending the rules

By typing “fast inverse square root”? That is hardly an outlier or manipulating the machine


Sure, people should double check the code I don’t disagree, but a proprietary code suggestion tool shouldn’t be suggesting licensed code at all; let alone GPLd code unless they can somehow verify the code base they are suggesting it become part of is GPL. That is the problem here and I don’t see how you can disagree with that.


I get why marketing calls machine learning “AI”. I don’t get why engineers would think this is.

Dumb.


> I don’t get why engineers would think this is.

This claim that "AI" only means artificial general / human-equivalent intelligence completely ignores the long history of how that term has been used, by computer science researchers, for the last 70-odd years, to include everything from Shannon's maze-solving algorithms, to Prolog-y systems, to simple reinforcement learning, and so on.

https://web.archive.org/web/20070103222615/http://www.engagi...

It's true that there has been linguistic drift in the direction of the definition getting narrower (to the point where it's a joke that some people use 'AI' to mean whatever computers can't do _yet_). And you can have reasons to prefer your own very-narrow definition. But claiming that your own definition is the only valid one, to the point that anyone using a wider definition (one that has a long etymological history, and which remains in widespread use) is "dumb", is... not how language works.


It hasn't been AI the entire time. It's borderline fraud, tbh.


Exactly.

The clever OpenAI marketing hype squad on HN and Twitter know that they are re-selling a snake oil contraption. This 'thing' completely needs assistance from a human since it is producing insecure code, code that is also copyrighted and most of the time garbage from other sources, which is again totally dangerous.

Just look at this [0]. Make a simple 'typo' in the signature and the whole implementation is wrong.

I have to say that OpenAI, GitHub and Microsoft are very clever in selling this scam to engineers who use the code produced by this contraption as 'safe to use' in their projects; especially since GPT-3 still cannot explain why it is generating the code it's generating, or whether the code is under a non-commercial or otherwise restrictive license.

No thanks and most certainly no deal.

[0] https://twitter.com/leifg/status/1411083360756146177


it's the marketing magic bullet. each person shot is entranced by its promises, and given unlimited ammo to spread its lies. few possess armor capable of stopping them


I still consider anything with more than 3 if-statements to be AI. We just need more sensible expectations about what AI can do haha.


This is utterly damning. I have already instructed my team that Copilot can never be used for our projects. Compromising the product because of unknowable license demands isn't acceptable in the professional world of software engineering.

But if we put the licensing to one side for a moment...

1/ Everything I've seen it generate so far is 'imperative hell'. It is practically a 'boilerplate generator'. That might be useful for pet projects, smaller code bases, or even unit-test writing. But large swathes of application code looking like the examples I've seen so far is hard to manage.

2/ The boilerplate is what bothers me the most (as someone who believes in the declarative approach to software engineering). The future for programming and programming languages should be an attempt to step up to a higher level of abstraction, that has been historically the way we step up to higher levels of productivity. As applications get larger and code-bases grow significantly we need abstraction, not more boilerplate.

3/ As someone who develops a functional framework for C# [1], I could see Copilot essentially side-lining my ideas and my approach to writing code in C#. Not just style, but choice of types, etc. I wonder if the fallout of Copilot having a 'one true way' of generating code was ever considered? It appears to force a style that is at odds with many who are looking for more robust code. At worst it will homogenise code ("people who wrote that, also wrote this") - stifling innovation and iterative improvements in the industry.

4/ Writing code is easy. Reading and understanding code written by another developer is hard. Will we spend most of our time as code-reviewers going forwards? Usually, you can ask the author what their intentions were, or why they think their approach is the correct one. Copilot (as far as I can tell) can't justify its decisions. So, beyond the simple boilerplate generation, will this destroy the art of programming? I can imagine many juniors using this as a crutch, and potentially never understanding the 'why'.

I'm not against productivity tools per se; it's certainly a neat trick, and a very impressive feat of engineering in its own right. I am however dubious that this really adds value to professional code-bases, and actively may decrease code quality over time. Then there's the grey area of licensing, which I feel has been totally brushed to one side.

[1] https://github.com/louthy/language-ext


>2/ The boilerplate is what bothers me the most (as someone who believes in the declarative approach to software engineering). The future for programming and programming languages should be an attempt to step up to a higher level of abstraction, that has been historically the way we step up to higher levels of productivity. As applications get larger and code-bases grow significantly we need abstraction, not more boilerplate.

Just the other day someone in the Copilot threads was arguing that this kind of boilerplate optimizes for readability... It's like Java Stockholm syndrome and the old myth that easy to approach = easy to read (look how long it took them to introduce var).

I've always viewed code generators as a symptom of language limitations (which is why they were so popular in Java land) that lead to unmaintainable code, this seems like a fancier version of that - with all the same drawbacks.


I'm all for abstracting. I like Rails, for example. That said, it gets truly difficult to add or change stuff at the more abstract layers. For example, adding recursive querying to an existing ORM is tough. And on the rare occasion that there is a bug in the abstract layer, debugging that from the normal application code is also tough.

I understand why some corporations prefer dumb boilerplate everywhere for some applications. If there is an outage it's usually easy to fix quickly. Sometimes it's not, if it's an issue in the boilerplate (say, Feb 29 rolls around and all of the boilerplate assumed a 28-day month); that means a huge update all across the system, but that rarely happens in practice.


I would say ORM is tough with code gen or with metaprogramming because it maps two mismatched paradigms (OOP and relational) and tries to paper over the differences.

I do agree on the debugging aspect - especially in dynamic languages - metaprogramming stack traces can be really hard to follow.


Tools like xsd or T4 (in the .NET ecosystem) are great time-savers, but you would never consider directly modifying the code they generate. You would leave the generated code untouched (in case it ever needed to be generated again) and subclass it to make whatever changes you intend.
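A minimal sketch of that pattern, with made-up class names (not the output of any real generator):

    // Order.generated.cs -- emitted by the xsd/T4 template; never edited by hand.
    public class OrderBase
    {
        public int Id { get; set; }
        public decimal Total { get; set; }

        public virtual decimal Tax() => Total * 0.20m;
    }

    // Order.cs -- hand-written, so it survives regeneration.
    public class Order : OrderBase
    {
        public decimal Surcharge { get; set; }

        // Extend or override behaviour without touching the generated file.
        public override decimal Tax() => base.Tax() + Surcharge;
    }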

I think Copilot is so unfortunate because it's not building abstractions and expecting you to override parts of them. It's acting as an army of monkeys banging out Shakespeare on a typewriter. And the code it generates is going to require an army to maintain.


Even there I feel like code generators are just a band-aid around the fact that metaprogramming facilities suck. If you would never modify the generated code, why generate it in the first place? You could argue that stack traces are easier to follow, but TBH generated code is rarely pretty in that regard as well.

For example, I think F#'s idea of type providers > code generators.


Linq2Db is a great example of T4 code generation that works. It creates partial classes from database schema. Together with C# I have strongly typed database access.

https://github.com/linq2db/linq2db
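Roughly the shape of it, for anyone who hasn't seen it (table and column names here are made up; the LinqToDB types are real):

    using System.Linq;
    using LinqToDB;
    using LinqToDB.Mapping;

    // What the T4 template emits from the schema:
    [Table(Name = "Person")]
    public partial class Person
    {
        [PrimaryKey] public int    Id   { get; set; }
        [Column]     public string Name { get; set; }
    }

    public partial class AppDataConnection : LinqToDB.Data.DataConnection
    {
        public ITable<Person> People => this.GetTable<Person>();
    }

    // Hand-written usage: the query shape is checked by the C# compiler, not at runtime.
    public static class Queries
    {
        public static string FirstName()
        {
            using (var db = new AppDataConnection())
                return db.People.OrderBy(p => p.Id).Select(p => p.Name).First();
        }
    }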


Code generators = out of practice, out of mind


Awesome summary and thanks for trying it for the rest of us!

Copilot sounded terrible in the press release. The idea that a computer is going to pick the right code for you (from comments, no less) is really just completely nuts. The belief that it could be better than human-picked code is really way off.

You bring up a really important point. When you use a tool like Copilot (or copypasta of any kind), you are introducing the additional burden of understanding that other person's code -- which is worse than trying to understand your own code or write something correct from scratch.

I think you've hit the nail on the head. Stuff like Copilot makes programming worse and more difficult, not better and easier.


While I accept most of the concerns, it's better than your comment suggests. I see some promise for it as a tool for reminding you of a technique or inspiring you to a different approach than you've seen before.

For example, I wrote a comment along the lines of "Find the middle point of two 2D positions stored in x, y vectors" and it came up with two totally different approaches in Ruby - one of which I wouldn't have considered. I did some similar things with SQL, and some people might find huge value in it suggesting regexes, too, because so many devs forget the syntax and a reminder may be all it takes to get out of a jam.

I'm getting old enough now to see where these sorts of prompts will be a game changer, especially when dabbling in languages I'm not very proficient in. For example, I barely know any Python, so I just created a simple list of numbers, wrote a "Sort the numbers into reverse order" comment, and it immediately gave me the right syntax that I'd otherwise have had to Google, taking much longer.

Maybe to alleviate the concerns it could be sandboxed into a search engine or a separate app of its own rather than sitting constantly in my main editor - I would find that a fair compromise which would still provide value but require users to engage in more reflection as to what they're using (at least to a level that they would with using SO answers, say).


Yeah, but... I mean, I guess we all agree that copying code from, let's say StackOverflow without checking if it really does what you want it to do is a bad thing? Now here we have a tool that basically automates that (except it's copying from GitHub, not StackOverflow), and that's supposed to be a good thing? Even if its AI is smarter, you would still have to check the code it suggests, and that can actually be harder than writing it yourself...


The big boost, that I think parent is alluding to, is for rusty (not Rust!) languages in the toolbox, where you may not have the standard library and syntax loaded into your working memory.

As a nudge, it's a great idea. As a substitute for vigilance, it's a terrible idea.

I suspect that's why they named it Copilot instead of Autopilot, but it's unfortunately more likely to be used as the latter, humans being humans.


Right, so it might occasionally be useful as a search tool for divergent ideas of different approaches to a problem, and your suggestion to sandbox it in a separate area works for that.

But that does not seem to be its advertised or configured purpose, sitting in your main editor.


This is good stuff. As a search engine, it could very well be useful. As another poster pointed out, if some context or explanation were provided along with the source suggestions, its utility as a reference would really grow.

I totally agree with you that prompted help is a big deal and just going to get bigger. We have developed a language for fact checking called MSL that works exactly this way in practice -- suggesting multiple options rather than just inserting things.

One of the things that interests me about this thread is the whole topic of UI vs. AI, and how much help really comes from giving the user options (and a good UI to discover them) vs. how much is "AI" or really intelligence. I think the intelligence has to belong to the user, but a computer can certainly sift through a bunch of code to find a search-engine result, and those results could be better than what you get now from Google & Co.


If they're using something like GPT-3 on the backend, which they probably are, it probably can't provide any explanations or context (unless the output is memorized training data, like this); the output can be somewhat novel code not from any particular source, and while it might be possible to find relevant information on similar code, this would be a hard problem too.

EDIT: they appear to be interested in making it look for similar code, see here: https://docs.github.com/en/github/copilot/research-recitatio...


Hm odd takes here.

It's really weird for software engineers to judge something by its current state and not by its potential state.

To me, it's clearly solvable by Copilot filtering the input code by each repository's license. It should only train on certain open source licenses, maybe even user-selectable ones, or code creators could optionally sublicense their code to Copilot in a very permissive way.

Secondly, a way for the crowd to code review suggestions would be a start.


I've been in the business a long time and I just don't believe in generalized AI at all. Writing code requires general (not artificial) intelligence. All of these "code helping" tools break down quickly because they may be searching for and finding relevant code blocks (the "imperative hell" referred to by another commenter), but they don't understand the context or the overall behavior and goals of the program.

Writing to overall goals and debugging actual behavior are the real work of programmers. Coming up with syntax or algorithms is 3rd and 4th on the priority list because, let's face it, it's not that hard to find a reference for correct syntax or the overall recipe implied by an algorithm. Once you understand those, you can write the correct code for your project.

I do think Copilot has potential as a search engine and reference tool -- if it can be presented that way. But the idea of a computer actually coming up with the right code in the full context of the program seems like fantasy.


If we're coming up with potential uses, I think they got the direction wrong.

Don't tell me what to do, tell me what not to do. "this line doesn't look like something that belongs in a code base", "this looks like a line of code that will be changed before the PR is merged". Etc.


That would be fantastic! Imagine if it could catch common errors before you make them. So many things in loops and tests that we mess up all the time. My favorite is to confuse iterating through an array vs an object in JS. I'd love to have Gazoo step in and say, "Don't you mean, this, David?"


> It's really weird for software engineers to judge something by its current state and not by its potential state.

No, we're not afraid of Copilot replacing us. The thought is ridiculous, anyway. If it actually worked, we would be enabled to work in higher abstractions. We'd end up in even higher demand because the output of a single engineer would be so great that even small businesses would be able to afford us.

Yes, we are afraid of Copilot making the entire industry worse, the same way that "low-code" and "no-code" solutions have enabled generations of novices to produce volumes of garbage that we eventually have to clean up.


Sounds like projecting, because that's not what I was referring to.

I'm saying Copilot can be better with very simple tweaks.


Practically every open source license requires attribution. If Copilot has a licensing issue, training a model only on repositories with the same license won't fix it, except for the extremely rare licenses which do not require attribution.


Why not? It can just generate an attribution file or reminder.


Because it's an opaque neural network on the backend, it doesn't know if or from whom it copied code.


Could they handle this by generating a collective attribution file that covers every (permissively licensed) repository that Copilot learned from?

Of course this would be massive, so from a practical consideration the attribution file that Copilot generates in the local repository would have to just link to the full file, but I don't think that would be an issue in and of itself.


Maybe? Might depend on the license, I doubt the courts would be amused.

Almost certainly a link would not suffice, basically every license requires that the attribution be directly included with the modified material. Links can rot, can be inaccessible if you don't have internet access, can change out from underneath you, etc.

(I am not a lawyer, btw)


Makes sense. Maybe something like git-lfs/git-annex would be sufficient to address the linking issue, but it seems like the bigger concern is whether a court would accept this as valid attribution. In a sense it reminds me of the LavaBit stunt with the printed key.


I think a judge could be persuaded that a list of every known human does not constitute a valid attribution of the actual author, even though their name is on the list. The purpose of an attribution is to acknowledge the creator of the work, and such a list fails at that.


Makes sense. That's probably the best interpretation here. Any other decision would make attribution lists optional in general for all practical purposes.


Stuff like Copilot makes programming worse and more difficult, not better and easier.

Copilot makes programming worse and more difficult if you're aiming for a specific set of coding values and style that Copilot doesn't generate (yet?). If Copilot generates the sort of code that you would write, and it does for a lot of people, then it's definitely no worse (or better) than copying something from SO.

The author of a declarative, functional C# framework likely has very different ideas to what code should be than some PHP developer just trying to do their day-to-day job. We shouldn't abandon tools like Copilot just because they don't work out at the more rigorous ends of the development spectrum.


>If Copilot generates the sort of code that you would write, and it does for a lot of people, then it's definitely no worse (or better) than copying something from SO.

Disagree.

Most SO copy-paste must be integrated into your project -- maybe it expects different inputs, maybe it expects or works with different variables -- whatever, it must be partially modified to work with the existing code-base that you're working with.

Copilot does the integration tasks for you. Where one might have had to read through the code from SO to understand it enough to integrate it, the person using Copilot need not even invest that much understanding.

Because of these workflow differences, it seems to me as if Copilot enables an even more low-quality workflow than offered by copy-pasting from SO and patching together multiple code-styles and paradigms while hoping for the best; Copilot does that without even the wisdom that an SO user might have that 'this is a bad idea.'


I'm not firmly for or against the concept of Copilot, but it's fascinating to me that it will introduce an entirely new class of bugs. Rather than specific mistakes in certain blocks of code and edge case errors in handling certain inputs, now we're going to have lazy/overworked/junior developers getting complacent and committing code they haven't reviewed that isn't even close to their intent. Like you could have a backend method that was supposed to run a database query, but instead it sends the content of an arbitrary variable in a POST request to a third-party API or invokes a shell to run `rm -rf /`.


To me, the most interesting aspect is the new class of supply chain security vulnerabilities it will create. How people will act to exploit or protect¹ against those will be very interesting.

1 - I don't expect "not using a tool that generates bad code" to be the top option.


The arguments that the GP makes are not based on a specific style or value of coding. Instead, they're based on the simple truth that it is harder to understand code that somebody else wrote.

In some cases the benefits of doing so outweigh the costs (such as using a stack overflow answer that's stood the test of time for something you don't know how to do), but with Copilot you don't even get the benefit of upvotes, human intent, or crowdsourced peer review.


I don't think they work out past trivial applications. Any non trivial app requires an understanding of a much larger part of the codebase than a tool like Copilot is looking at at any one time.

Copilot does not understand the code in toto and is therefore really useless for debugging (70% of all coding) and probably useless for anything other than very simple parts of an app.


Any non trivial app requires an understanding of a much larger part of the codebase than a tool like Copilot is looking at at any one time.

I don't think that's important. Copilot, at least as it's been demo'd so far judging by the examples, is to help you write small, standalone functions. It shouldn't need to know about the rest of the application. Just as the functions that you write yourself shouldn't need to know about the rest of the application either.

If your functions need a broad understanding of the codebase as a whole how the heck do you write tests that don't fail the instant anything changes?


The reality of code is that stuff breaks when connected to other stuff, as it eventually must be for real work to happen. There's no getting around that.

Since that's where the work of programming is, debugging connected applications (not writing fresh, unencumbered code, a rare luxury), a tool that offers no help for that is, well, not much help.


Isn't the entire point of this to suggest code you may use, not to just blindly accept is correct without thinking?


5/ Boilerplate is easy to write but expensive to maintain in large quantities. Proper abstraction/templating requires careful thinking. Copilot encourages the first and discourages the second.

6/ Copilot learns from the past. It can only favor popularity and familiarity in code patterns over correctness and innovation.


I'm not sure we should throw the baby out with the bathwater here due to the large blurbs it stubs in when it doesn't have a lot to go on in mostly empty files. It is a preview release. They are working on proper attribution of suggested code and explainability [1]. Having a stochastic parrot that types faster than I do would be useful in a lot of cases.

Yes, better layers of abstraction could make us more productive in the future, but we're not there yet. By all means, don't accept the larger blurbs it proposes, but there is productivity to be gained in the smaller suggestions. If it correctly intuits the exact rest of the line that you were thinking of, it will save time and not make you lose understanding of the program.

In some areas complete understanding and complete code ownership are required, but in a lot of places they're not. If it produces the work of a moderately skilled developer it would be sufficient. I don't remember all the code I write as time passes. If it produces work that I would have produced, then I don't see how that's any different from work that was produced by my past self.

It may feel offensive but a lot of the comments against it sound like rage against the machine/industrialization opponents and the arguments sound pretty similar to those made in the past by those that had their jobs automated away. I'm not sure we're all as unique snowflakes as we like to think we are. Sure, there will be some code that requires an absolute master that is outside the capabilities of this tool. But I'd guess there is a massive amount of code that doesn't need that mastery.

[1] https://docs.github.com/en/github/copilot/research-recitatio...


I think it depends on how you look at it.

For small snippets that have likely been already written by someone else, this probably works great. For those, though, the time saved is probably at most going from 5-10 minutes down to 1 or less. The challenge is that that's not where my time goes unless I'm working in an unfamiliar language.

As someone who writes a lot of code quickly, I’m usually bottlenecked by reviews. For more complex changes I’m bottlenecked by understanding the problem and experimenting with solutions (and then reviews, domain-specific tests usually, fixing bugs etc). Writing code isn’t like waiting for code to compile since I’m not actually ending up task switching that frequently.

This does sound like a fantastic tool when I’m not familiar with the language although I wonder if it actually generates useful stuff that integrates well as the code gets larger (eg can I say “write an async function that allocates a record handle in the file with ownership that doesn’t outlive the open file’s lifetime”). I’m sure though that this is what a lot of people are overindexing on. For things like that I expect normal evolution of the product will work well. For things like “cool, understand your snippets but also weight my own codebase higher and understand the specifics of my codebase”, I think there’s a lot of groundbreaking research that would be required. That is what I see as a true productivity boost - I’d make this 100% required for anyone joining the codebase. The more mentorship can be offloaded, the lower the cost is to growing teams. OSS projects can more easily scale similarly.


This is a little off topic, but your framework looks really interesting! How come you opted for building a functional framework in C#, vs using F#? I couldn’t see anything in the README about what was specifically frustrating about F#? I ask because we’re looking at introducing it at my company.


I cofounded a company in 2005, the primary product is a never-ending C# web-application project. As the code-base grew to many millions of lines of code I started to see the very real problems of software engineering in the OO paradigm, and had the functional programming enlightenment moment.

We started building some services in F#, but still had a massive amount of C# - and so I wanted the inertia of my team to be in the direction of writing declarative code. There wasn't really anything (outside of LINQ) that did that, so I set about creating something.

We don't write F# any more and find functional C# (along with the brilliant C# tooling) to be very effective for us (although we also now use PureScript and Haskell).

I do have a stock wiki post on the repo for this though [1]. You might not be surprised to hear it isn't the first time I've been asked this :)

[1] https://github.com/louthy/language-ext/wiki/%22Why-don't-you...


Ha, it's good to see I'm full of original thoughts.

That post in the wiki sums it up perfectly, much appreciated!


Generating code has never been a problem for developers :)

I'd be more interested in a tool that notices patterns and boilerplate. It could offer a chance for generalization, abstraction or use of a common pattern from the codebase. This is of course much harder.


Great points. Really makes me question why so many developers were excited / worried about programming jobs being automated away by this technology. I really doubt that many jobs are going to be displaced by what is at best an improvement to autocomplete/intellisense and at worst an unreliable, copyright infringing boilerplate generator. Also agree with point #3 - I could see Copilot steering devs away from new code patterns toward whatever was most commonly seen in the existing codebases it was trained on. Doesn't seem good for innovation in that sense.


I'm inclined to agree with you, and actually I'm rather mistrustful of even basic autocomplete ever since a colleague caught me using it without even looking at the screen!

But I wonder...

Is this a difference of programmer culture?

I think there are people who write successful computer programs for successful businesses without delving into the details. Without considering all the things that might go wrong. Without mapping the code they're writing to concepts.

Lots of people.

What would they do with this?


> What would they do with this?

Not get a job working for me ;)

More seriously, when I think back to when I was first learning programming - in the heady days of 1985 - I would often copy listings out of computing magazines, make a mistake whilst doing it, and then have no idea what was wrong. The only way was to check character by character. I didn't have the deeper understanding yet, and so I couldn't contribute to solving the problem in any real way.

If they're at that level as a programmer, to the point where their code is being written for them and they don't really understand it, then they're going to make some serious mistakes eventually.

If you want to step up as a dev, understanding is key. Programming is hard and gets harder as you step up and bite off bigger and more complex problems. If you're relying on the tools to write your code, then your job is one step away from being automated. That should be enough to light a fire under your ambition!


I also typed stuff in from magazines in the 80’s, and my fast but imperfect typing really helped me learn programming: I often had to stop, go back to the first page, and actually read the damned thing in order to make it work.


My question is would Copilot be useful if you could choose the codebase it would be drawing from? Almost as an internal company tool?


It would certainly alleviate the license concerns. If it was possible to train it to a level (that produces effective output), then sure.

As a thought experiment, I thought "what would happen if we trained it on our 15 million lines of product code + my language-ext project". It would almost certainly produce something that looks like 'us'.

But:

* It would also trip over a million or so lines of generated code

* And the legacy OO code

* It will 'see' some of the extreme optimisations I've had to build into language-ext to make it performant. Something like the internals of the CHAMP hash-map data-structure [1]. That code is hideously ugly, but it's done for a good reason. I wouldn't want to see optimised code parroted out upfront. Maybe it wouldn't pick up on it, because it hasn't got a consistent shape like the majority of the code? Who knows.

Still, I'd be more willing to allow my team to use it if I could train it myself.

[1] https://github.com/louthy/language-ext/blob/main/LanguageExt...


> legacy OO code

Aside from OO vs FP, a concern I'd have with that is that it would encourage and enforce idiosyncrasies in large corporate codebases.

If you've ever worked for a large corporation on their legacy code, you know you don't want any of that to be suggested to colleagues.

This would enforce bad behaviors and make it even harder for fresh developers to argue against it.


> This would enforce bad behaviors and make it even harder for fresh developers to argue against it.

I think this is a significant point. It maintains the status quo. We change our guidance to devs every other year or so. New language features become available, old ones die, etc. But we're not rewriting the entire code-base every time, we know if we hit old code, we refactor with the new guidance; but we don't do it for the sake of it, so there's plenty of code that I wouldn't want in a training set (even if I wrote it myself!)


That would actually be potentially useful: it could do a kind of combination of autocompletion for internal libraries, automatic templates for common patterns, and internal style/linting-type tasks, all in one. Certainly augmenting those other things.

It would be interesting how much code you would need before it was useful (and how good does it have to be to be useful? Does even a small error rate cost so much that it erases other gains, because so many of the potential errors in usage of this type of tool are very subtle?)


That sounds interesting, though it still feels like it would need work. Like a way to annotate suggestions with comments, or flag them. Definitive licensing shown for each snippet. A way to mark deprecated code as deprecated to the training algorithm, etc.


If you find yourself copying code someone else in your organization wrote rather than abstracting it to a function in a shared library or building a more declarative framework to manage the problem, something horrible has happened.


Sometimes boilerplate is unavoidable. As an example, how do you send a GET request with libcurl in C with an authorization header? I can't tell you offhand, but I can tell you the file in my codebase that does have it, because I've duplicated the logic for two separate systems.


So you are saying you would rather every project in the world have at least one--if not, thanks to making it easier via Copilot, many--copies of this code rather than one shared library that provides a high-level abstraction for libcurl?... At least for your own code, how did you end up with two copies of duplicated logic rather than a shared library of functionality?


> So you are saying you would rather every project in the world have at least one--if not, thanks to making it easier via Copilot, many--copies of this code.

Absolutely not, not at all. I'm suggesting that copying and pasting happens, particularly in the context of a single project.

> At least for your own code, how did you end up with two copies of duplicated logic rather than a shared library of functionality?

At what point is it worth introducing an abstraction rather than copying? Using my libcurl example, you can create an abstraction over the ~10 lines of initialization, but if you then need to change it to a POST, you're just implementing an abstraction over libcurl, which is just silly.


If you have 10 lines of repeated code with one line changed to make it GET vs POST, introducing an abstraction isn't "silly": it is simultaneously ergonomic and advantageous. Not only is libcurl's API extremely verbose (as it is a low-level primitive), but if you ever need to add another line of code to that initialization--which totally happens over the years, due to various security extensions you might need to either enable or disable with respect to acceptable TLS settings, or to tune performance parameters related to connection caching, or to add a header to every request (for any number of reasons from debugging to authentication)--you can do it in one place instead of umpteen places.

The libcurl API is itself a leaky abstraction of the underlying TLS libraries in places, so if you ever realize you need to switch SSL libraries (a space in which there has been absolute upheaval in recent years) you are going to reach for shared abstractions. And, to take this to its ultimate conclusion: I use libcurl as a fallback for Linux, but if you want to correctly support the user's settings for proxy servers--which are sometimes needed for your requests to work at all--my code is abstracted so I can plug in entirely different HTTP backends instead of libcurl, such as Apple's CFNetwork (which you absolutely should be using if at all possible on iOS).

You act like abstraction is somehow a bad thing or some inherent cost you want to avoid, when it should absolutely take you less time to wrap duplicated code into a function than to duplicate it in the first place. And if IDE features (including Copilot) are somehow making you think it is easier to throw a ton of duplicated code everywhere, that is part of the argument for why those features are dangerous... they are apparently undermining all the work people did on refactoring code browsers that are designed to help users locally manage abstraction instead of mitigating poor architecture :/.
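For what it's worth, the abstraction being argued over is tiny. Sketched in C# with HttpClient rather than libcurl (the helper is made up; the point is language-agnostic): the repeated request setup lives in one place, so the next header, timeout or TLS tweak lands once.

    using System.Net.Http;
    using System.Net.Http.Headers;
    using System.Threading.Tasks;

    public static class Http
    {
        static readonly HttpClient client = new HttpClient();

        // The lines every call site would otherwise duplicate live here once.
        public static async Task<string> GetAsync(string url, string bearerToken)
        {
            using var request = new HttpRequestMessage(HttpMethod.Get, url);
            request.Headers.Authorization =
                new AuthenticationHeaderValue("Bearer", bearerToken);

            using var response = await client.SendAsync(request);
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }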


Neural net: ”It’s all in the training data, stupid.”


> programming languages should be an attempt to step up to a higher level of abstraction

Adding abstraction buries complexity. If all you do is keep adding more abstractions, you end up with an overcomplicated, inefficient mess. Which is part of why application sizes are so bloated today. People just keep adding layers, as long as they have room for more of them. Everything gets less efficient and definitely not better.

The right way to design better is to iterate on a core design until it cannot be any simpler. All of the essential complexity of software systems today comes from 40 year old conventions. We need a redesign, not more layers.

One example is version management. Most applications today could implement versioned functions, keep multiple versions in an application, and track dependencies between external applications. Make a simple DAG of the versions and let apps call the versions they were designed against, or express internally which versions are compatible with which. This would make applications infinitely backwards-compatible.

The functionality exists right now in GNU Libc. You can literally do it today. But rather than do that, we stumble around replacing entire environments of specific versions of applications and dependencies, because we can't seem to move the entire industry forward to new ideas. Redesign is hard, adding layers is easy.


> Adding abstraction buries complexity. If all you do is keep adding more abstractions, you end up with an overcomplicated, inefficient mess. Which is part of why application sizes are so bloated today. People just keep adding layers, as long as they have room for more of them. Everything gets less efficient and definitely not better.

Presumably you're writing code in binary then? This is a non-argument, because there's evidence that it's worked. Computers were first programmed with switches and punch cards, then tape, then assembly, then low level languages like C, then memory managed languages etc.

Abstraction works when side-effects are controlled. Composition is what we're after, but we must compose the bigger bits from smaller bits that don't have surprises in. This works well in functional programming, a good example would be monadic composition: monads remove the boilerplate of dealing with asynchrony, value availability, list iteration, state management, environment management, etc. Languages that have first-class support for these tend to have significantly less boilerplate.
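To make the 'value availability' case concrete, here's a deliberately minimal hand-rolled sketch (not language-ext's actual API): the "is the value even there?" check lives once, inside Bind, instead of being repeated by hand at every step.

    using System;

    public readonly struct Option<T>
    {
        readonly T value;
        readonly bool hasValue;
        Option(T value) { this.value = value; hasValue = true; }

        public static Option<T> Some(T value) => new Option<T>(value);
        public static Option<T> None => default;

        public Option<R> Bind<R>(Func<T, Option<R>> f) =>
            hasValue ? f(value) : Option<R>.None;
    }

    public static class Example
    {
        static Option<int> ParseInt(string s) =>
            int.TryParse(s, out var n) ? Option<int>.Some(n) : Option<int>.None;

        static Option<double> Reciprocal(int n) =>
            n == 0 ? Option<double>.None : Option<double>.Some(1.0 / n);

        // Two fallible steps composed without a single explicit failure check.
        public static Option<double> ReciprocalOf(string s) =>
            ParseInt(s).Bind(Reciprocal);
    }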

The efficiency argument is also off too. Most software engineering teams would trade some efficiency for more reliable and bug free code. At some point (and I would argue we're way past it) programs become too complex for the human brain to comprehend, and that's where bugs come from. That's why we're overdue an abstraction lift.

Tools like Copilot almost tacitly agree, because they're trying to provide a way of turning the abstract into the real, but then all you see is the real, not the abstract. Continuing the assault on our weak and feeble grey matter.

I spent the early part of my career obsessing over performance on crippled architectures (Playstation 3D engine programmer). If I continued to write applications now like I did then, nothing would go out the door and my company wouldn't exist.

Of course there are times when performance matters. But the vast majority of code needs to be correct first, not the most optimal it can be for the architecture.


That's an extreme form of simplification. Simplification is not performance optimization at all; it's removing non-essential complexity. You can still have abstraction and layers, monads and features. The thing is not to keep adding them when a refactor makes them redundant.

Like I say, our designs are ancient and lack features; we need to add more stuff to the code. But that will enable us to remove abstractions that were added only because our previous designs were crap.


Clearly, swearing is the only right way to write that function


Probably an excellent reminder that both Google and Microsoft decided to use your private emails for a training set to create Smart Reply behavior that can "write emails for you", and they swore up and down there's no way that could ever leak private information.

We need legislation banning companies from ingesting data into AI training sets without explicit permission.


Makes me wonder if github are using private repos in their training data.


GitHub clearly stated they only used publicly available repos in this project. However, as many people are rightfully pointing out, those projects might still be either closed source or copylefted, and if Copilot regurgitates chunks of those projects, people who use it may be subject to infringement lawsuits in the future.


Is Copilot aimed at programmers or at non-technical hiring managers?

I mean, it plays right into the narrative devaluing programming that has been going around for the last couple of years. To the "anyone can code" narrative we are now adding "even more so, if they have the AI-assisted Copilot".


What I would love even more than copilot helping me write code is a copilot to write my tests for the code I write.


Copilot is one of the worst ideas that have made it to production in recent years. I predict it will be quite successful considering Microsoft's track record.


One has to admit, Copilot raises many questions regarding global code quality, reviewing processes and copyright. It's a marketing success.


Honestly, I see this exact issue as the main accomplishment of Copilot. It shows that the black-box machines are to be considered harmful and are incompatible with the current intellectual property and privacy frameworks.

This issue goes way beyond just code - imagine GPT-like systems being used in medical diagnosis, where results can suddenly depend on the date of the CT scan or the name of the patient, because the black box simply regurgitates training data...


They could train on solely MIT-licensed code, and dump ALL the copyright notices of code used for training into a file. Problem solved.


Have you read the MIT license? It explicitly says: "The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software."


Plenty of people probably copy-paste GPL code with the comments and stick MIT on it. This kind of thing violates the GPL, but I’m pretty sure (IANAL) that such code is “fruit of the poison tree”, and if you then copy it, you too can be held responsible. Sure, you might not get caught, but it’s a rough situation if you do.


So what I'm reading here is that "Tay for code" is maybe going to need to be rethought and perhaps trained differently?


Very interesting that this was posted as I literally JUST watched an even MORE interesting youtube upload about this very bit of code just last weekend.

Here's the very fun video if anyone wants to take a look:

https://www.youtube.com/watch?v=p8u_k2LIZyo


License Laundering

Like Money Laundering for cash equivalents, or Trust-Washing for disinformation, but for hijacking software IP.

It might not be the intended use case, but that winds up being the practical result.

(On a related note, it would make me want to run GPT-* output through plagiarism filters, but maybe they already do that before outputting?)


The irony is that we're whinging about a tool that generates code that will be difficult to understand in the future...

... and the example is mathematically- and floating-point-spec obtuse enough that it was incomprehensible at the time it was written. (As evidenced by id comments)


Phew! Our jobs are safe!


though our companies will one day be competing with product manufacturers in China who get to use it to its fullest


snickers


Co-pilot fixes the wrong problem.

It should be a tool capable of one-shot learning.

I.e., I'm in the middle of a refactoring operation and have to do lots of repetitive work; the tool should help me by understanding what I'm trying to do after I give it 1 example.


With apologies to Martin Niemöller[1]:

  First the automation came for the farmers, and I did not speak out —
    Because I was not a farmer.

  Then the automation came for the factory workers, and I did not speak out —
    Because I was not a factory worker.

  Then the automation came for the accountants, and I did not speak out —
    Because I was not an accountant.

  Then the automation came for me (a programmer) —
    and there was no one left to speak for me.
---

[1] https://en.wikipedia.org/wiki/First_they_came_...


Honestly I’ve automated a large chunk of my day job. The trick is keeping it secret!


Well... wait until all the programmer salaries crash to minimum wage because Management believe that "CoPilot does most of the work anyway".


Then wait for them to realize how brittle the code is when nobody is considering the context into which this code is being foisted. They'll TRIPLE our salaries! :D


I've always assumed that we would eventually have a low-code, or no-code junior dev replacement, and was wondering if this was it. GH and MS actually have [Ed. had?] some cred for this kind of thing.

Nope. Game over. Play again?


Copying GPLed code as your own and passing it under an MIT license is not too far fetched of a thing for a junior dev to do.

Jokes aside, to have a proper junior dev replacement you need something that is able to learn and grow to eventually become a senior dev, an architect, or a CTO. That is the most important value of a junior dev. Not the ability to produce subpar code.


Depends on who you ask.

I think a lot of modern software development shops, these days, exist only to make their founder[s] as rich as possible, as quickly as possible.

If they are willing to commit their entire future to a lowest-bid outsourcing shop, then I don’t think they are too concerned about playing the long game.

Also, the software development industry, as an aggregate, has established a pervasive culture, based around developers staying at companies for 18-month stints. I don’t think many companies feel it’s to their advantage to incubate people who will bail out, as soon as they feel they have greener pastures, elsewhere.


Most low-code and no-code platforms go for junior dev empowerment, and senior dev replacement. This one also seems to be aimed at empowering juniors, but looks like it missed the senior replacement by miles.


There are a lot of good points made against copilot. But I’m optimistic in that it will improve with time. At worst it’s an efficient code copy-pasting tool, but at best it could be the next level of abstraction.


I think the problem might be in the training data. Famous code examples are probably copied a lot and therefore appear multiple times in the training data, prompting the neural network to memorise it completely.


Famous code examples are also much more likely to be noticed. For all I know, the thing might be spewing random GPL'd code from the long tail of GitHub all the time and nobody notices because it was written by some random guy and not John Carmack.


Well, it's certainly speculation on my part what the root cause is, but I think OpenAI is already trying to ensure the network generalises. It's just common behaviour for neural networks to memorise frequent samples, so I think my guess is quite realistic. I doubt OpenAI would fail to notice large-scale memorisation in their model. But as long as they don't publish more details, it's just guesswork.

Just keep in mind that it's a statistical tool. You can't really formally prove that it won't memorise, but I think with enough work you can get it unlikely enough that it won't matter. It's their first iteration.


Hash 10-grams and make a bloom filter. It will not generate more than 10 GPLed tokens from a source.
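A rough sketch of that idea (window size, filter size and hashing are all arbitrary here; a real system would tokenise the way the model does and use a stable hash):

    using System;
    using System.Collections;
    using System.Collections.Generic;
    using System.Linq;

    public class NGramBloomFilter
    {
        const int N = 10;              // 10 consecutive tokens per window
        readonly BitArray bits;
        readonly int hashes;

        public NGramBloomFilter(int sizeBits = 1 << 24, int hashes = 4)
        {
            bits = new BitArray(sizeBits);
            this.hashes = hashes;
        }

        static string Key(IEnumerable<string> tokens) => string.Join("\u0001", tokens);

        int Slot(string key, int seed)
        {
            unchecked
            {
                // string.GetHashCode is per-process randomised on .NET Core;
                // a real filter would use a stable hash (FNV, xxHash, ...).
                int h = key.GetHashCode() ^ (seed * 0x5bd1e995);
                return (int)((uint)h % (uint)bits.Length);
            }
        }

        // Index every 10-token window of a (GPL'd) source file.
        public void AddSource(IReadOnlyList<string> tokens)
        {
            for (int i = 0; i + N <= tokens.Count; i++)
            {
                var key = Key(tokens.Skip(i).Take(N));
                for (int s = 0; s < hashes; s++)
                    bits[Slot(key, s)] = true;
            }
        }

        // True if the last 10 generated tokens *might* match indexed source
        // (false positives are possible, false negatives are not).
        public bool MightBeCopied(IReadOnlyList<string> lastTen)
        {
            if (lastTen.Count != N) return false;
            var key = Key(lastTen);
            for (int s = 0; s < hashes; s++)
                if (!bits[Slot(key, s)]) return false;
            return true;
        }
    }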



Also the Pareto principle. 80% of code is shit that you don't want to copy. The vast majority of github is awful hacks and insecure code that should not be touched with a ten foot pole.


Is this function used verbatim in multiple projects? I know it's famous, but how often has anyone used an approximation of inverse sqrt instead of the readily available CPU instruction in the past 20 years?


Copilot may do more to move open source projects off GitHub than the news that Microsoft was the buyer did. Now you can host your code on GitHub to get your license violated, or be DMCA'd in the long run when your code becomes part of some big proprietary project. At least it makes me think about my choice of code hosting more than anything that happened before.


Yep, Copilot is insanely poorly thought out. Astonishing they'd release something as half-baked as this.


Similar story.

He tried to write a quine in Ruby, and ended up conjuring up a copyright-claim comment and fake licensing terms. https://twitter.com/mametter/status/1410459840309125121


It looks like the author of the linked tweet intended for it to reproduce the Quake code, by using the exact same function name and comment. Whatever the merits of CoPilot, in this case the human intended to write the quake function into their file, and put the wrong license on it.


I'm honestly kinda amazed this is as upvoted here as it is. Typically anything ML-related is upvoted to the top positions and any dissent harshly ridiculed. Anyway... it appears those who thought about this as if it were a glorified code search engine were close to being right.


I still don't think it's just a glorified code search engine.

Context-sensitive data retrieval is undoubtedly a part of it, though, and the question is how big and relevant that part is, and what the consequences are.

To me the biggest issue is that it's impossible to tell whether the suggestions are verbatim reproductions of training material and thus problematic.

It goes to show that this tool, and basically every tool relying on the same or similar technology, must now be assumed to do this, and thus any code suggestion must be regarded as plagiarism until proven otherwise. As a consequence such tools are now off-limits for commercial or open source development...


I expect it will lead to greater compliance than today, in the near future even if not today. Making tens of thousands of engineers understand the nuances of various licenses is infinitely more difficult than making copilot start taking better care of licensing requirements.


Remember that unlicensed content is still copyrighted. Generally, the implicit terms of a work with no license are less permissive than any explicit license. So however you feel about this using GPL code, the situation is even worse for most ML models that have no license at all to use their training data.


Copilot transitions programmers from writing code to reading auto-generated code. And the general feeling is that reading code is 10x harder than writing it? Seems like a rich source of problems.

(However, I'm still definitely going to try this out once I get off the waitlist.)


"0.1% of the time" indeed


The rate at which these bots implode imply something about the whole AI/ML zeitgeist.


I felt the need to write an article about this whole situation too:

"Built on Stolen Data": https://rugpullindex.com/blog#BuiltonStolenData


Copilot will most likely be useful as a Stack Overflow indexer or a better GitHub search. Developer speed can become faster because this eliminates some time spent on GitHub and SO. I don't think it's any more useful than that.


I don't understand why this is an issue. And now I have a tool to do my coding, as well as my social media contributions:

    "'Sweary comments' is not grammatically correct English." -> GPT-2 ->

    "Sweary comments" is not grammatically correct English. The "sweary" meaning is used as slang by those in charge, especially the ones who get a free ride through the media, for the sake of a quick buck, or the sake of getting a better job with a promotion. The word is commonly used in a social context and comes from an actual "sweary" comment. The statement or phrase, "You're too shy!" is derived from this phrase: "I wish he were as outgoing as you."
Am I right?


Curious if Microsoft is training Co-Pilot on my private repositories.


I can't even get intellisense to work correctly half the time.


Seems like this is less of an "AI that intelligently generates code based on context given" and more of a "google search autocomplete for code".


I'm going to go against the flow here and say that worrying about this is similar to worrying about the license we give to snippets of code we copy-paste from other licensed code.

The reality is that we never attribute the original source because we copy-paste it, change it up a bit, and make it our own. Literally everybody does this.

I still care about licensing and proper attribution, but the reality is that a snippet of code is not something so easy to attribute. Should we attribute all kinds of ideas, even the very small ones? How quickly is an idea copied, altered & reused? Can we attribute all the thoughts humans have?


This kind of thing devalues the work that programmers do.

The head example they show is using a sentiment analysis API which is about the most useless use of technology there is.


I assumed it was trained on source code that was explicitly licensed with a permissive license. Are they training it using private unlicensed repos also?


I’m reminded of the old saying:

The best person to have on your team is a productive, high-quality coder.

The worst is a productive, low-quality coder.

Copilot looks like it would give us more of the latter.


Welp, guess I'll be taking all my code off of GitHub now, lest it be copied verbatim while ignoring my licenses.

(I'm no John Carmack, but still.)


Copilot was only trained on public code right? Because that would be an absolute PR disaster if some proprietary code was leaked that way.


I'm curious if you can find certain keywords that may leak private code...


Waiting for the access to test this out


I saw the gif on Twitter. Sorry, I am not able to understand what is going on. Is copilot a character in the Quake game?


Copilot seems to be an AI tool to generate code for you[0]. In the gif, it's copying code from Quake, which is GPLv2 or later. If copying GPLed code wasn't bad enough, it then adds a MIT-like license header.

[0] https://copilot.github.com/


Now, consider that Quake is GPL'd. Any proprietary software using such code will have to bow to the license terms.


That salesforce engineer is the average developer when prompting to write an About me section lol


I struggle to comprehend how anyone thought Copilot was a good idea, law-wise and concept-wise.


I like tabnine. It's an autocomplete tool and doesn't pretend to be anything more.


Time to write a GPL-licensed Win32 and Win64 -compatible OS with the help of CoPilot...


stack overflow at its automated finest.

Or should we call it the Tesla of software?


Copilot NEEDS to be trained on the licenses of the code it ingests, so that it doesn't reproduce restrictively licensed code.


So, what would happen if I train a neural network to recreate Disney movies?


Isn't it already?


Ahh yes the infamous "evil floating point bit level hacking" code


Co-pilot is just lowest common denominator solutions with flashy tabbing.


Guess someone had to try it


Based on all the negative comments so far, and based on this website's aptitude at predicting the viability of a product, it really seems like Copilot is bound to be a success.


We're gonna Dropbox this thing all the way to the top!


Right?

Even given all its initial problems, I don't see a world where people completely avoid using it.


[flagged]


Yeah. I get why people's initial reaction is to dislike it, tbh. Honestly I doubt the utility will be huge for experts; most likely it will just alleviate having to remember how a certain language implements a specific concept.


Not a problem. Just don't use Copilot :)


I’ve always wondered this about the realistic photo generators. How do we know they’re generating new faces and not just regurgitating ingested faces?


Funny, the YouTube algo blessed me with an in-depth video (~1y old) about this Quake function yesterday.


Welp, there go our licensing models


A closed beta with only a few previews out in the wild has bugs. Unbelievable.

I cannot believe GitHub would do this.


I hate to be the one that says this but I think it's true:

"So you are an SWE and you take a break from work to go to Hacker News to complain that GitHub's Copilot, which is an AI-based solution meant to help SWEs, is utter shit and completely unusable.

And then you go back to writing AI-based solutions for some other profession. Which is totally not shit or anything."

Can anybody put this more elegantly?


"'I never thought leopards would eat MY face,' sobs woman who voted for the Leopards Eating People's Faces Party."


The most common example of this would probably be complaining about advertising whilst working for a business that depends on advertising to survive.

Ultimately it's a kind of Kafkaesque trap that modern living has us all in to a larger or lesser extent.


That's a bit different. Advertising is like a race to the bottom, where everybody takes part in order to survive. You can do that while wishing it could somehow not be that way. Same with environmental issues.

The GP comment by contrast is about hypocrisy. I personally found it funny that I didn't ever read about (or consider) copyright violations of deep learning until they tried to do it with code :-)

Of course programmers would find the problem with AI as soon as it exploited them.


You mean like the insanely annoying AIs that replaced Google search? The idiotic one that files Javascript books under "Law" in Amazon or the insulting one who runs Ad Sense and thinks my wife isn't good enough and I am stupid enough to leave her for some mail order bride?


Javascript books under "Law" is hilarious


I'm in for JavaScript Penal Code. Make that unwarranted type-coercing operator use punishable by law.


Instead of prison you go to callback hell


well, if you write a DRM in JS, it becomes law...


Maybe the Google AI is a polygamist and thinks you ought to have a 2nd wife?


Dunno. I go to HN because it's the one place where I can whine about AI being total bullshit, for exactly the reasons we're now complaining about wrt Copilot.


There are probably good ways to apply AI to software development (has anybody tried to build a linter already?). It is this product that is very bad.

The same certainly apply to other tasks.


How would an AI linter even work? And how would it be better than just a regular linter?


SWEs create AI-based solutions to X because people pay them to. Entrepreneurs and investors are the ones who actually think they're the answer to everything.

Also, Copilot might (or might not) be useless, or even interfere with real work. But it's probably low on the scale of awful things SWEs have helped create. The AI parole app is a thing that should haunt the nightmares of whoever created it, for example. Lots of AI apps may be useless but are probably also harmless, so building them might not be the worst thing.


A variation of the Gell-Mann Amnesia effect?

"Briefly stated, the Gell-Mann Amnesia effect is as follows. You open the newspaper to an article on some subject you know well. In Murray's case, physics. In mine, show business. You read the article and see the journalist has absolutely no understanding of either the facts or the issues. Often, the article is so wrong it actually presents the story backward—reversing cause and effect. I call these the "wet streets cause rain" stories. Paper's full of them. In any case, you read with exasperation or amusement the multiple errors in a story, and then turn the page to national or international affairs, and read as if the rest of the newspaper was somehow more accurate about Palestine than the baloney you just read. You turn the page, and forget what you know."

https://www.goodreads.com/quotes/65213-briefly-stated-the-ge...


I would do this with Reddit posts. I’d see the top comment under something I was familiar with and see it was full of holes or just incorrect but then I’d go to a post about something I didn’t know all that well and take the top comment at face value.


>And then you go back to writing AI-based solutions for some other profession.

I don't know what you're talking about, I'm a webshit developer.


So I guess we can just copy around copyrighted source now? Great! Now we can share all the proprietary driver and DSP code from Qualcomm.


I wonder when someone will try to use the “it came from Copilot” defense to get away with stealing copyrighted code.


What causes this in a net? I’m guessing the RNN gets in a catastrophic state..


I would say overfitting - the net doesn't "understand" the code in any meaningful sense. It just finds fitting examples and jumbles them a bit.

Understanding would mean to have an internal representation related to the intention of the user, the expected behavior, and say the AST of the code. My pessimistic interpretation of this and many other recent AI applications is that it is a "better markov chain".


A Markov chain can have an internal representation related to the intention of the user. I guess this example just got copied a lot and is therefore included multiple times in the training data, forcing the network to memorise it. Neural networks always memorise things that appear too frequently. Memorised artifacts in an otherwise working neural network are usually seen as a "bug" (since the training allowed the network to cheat), not as proof that the network didn't generalise.


This is the network working as designed.

I mean, if you wrote an autocomplete system for written english and asked it to complete the sentence "O Romeo, Romeo" what would you expect to happen?

You'd expect it to complete to "O Romeo, Romeo, wherefore art thou Romeo?" - a very famous quote.

How else could you produce the single right output for that unique input, other than memorising and regurgitating?


> You'd expect it to complete to "O Romeo, Romeo, wherefore art thou Romeo?" - a very famous quote.

What about completing it to "O Romeo, Romeo, brave Mercutio is dead", based on the context, as advertised?


Right, but the demonstration gave zero context and came up with the original function. It would have been interesting if it were instructed to produce the function in Haskell or some other programming model.


> catastrophic state

No, overfitting is the normal state for neural nets.


Neural nets aren't magic. You actually need quite a bit of complexity and modeling of interrelated problem spaces to get anything more than a childlike naivete or trauma savant-like mastery of one particular area with crippling deficiencies elsewhere.


Backstory?

WTF is "Copilot"?


It's a new product launched by GitHub in association with OpenAI

https://news.ycombinator.com/item?id=27676266


“Your AI pair programmer” - auto-completes entire functions while you’re coding. https://copilot.github.com/


Thanks.


Almost feels like a developer cultural thing to hate on something like this. If you don't like it, don't use it. If you don't want your team using it, become senior and then set the rules.

Kinda seems like maybe there's some level of insecurity at play here in the criticism. Like an "I coulda come up with that but it's a bad idea" type of hater philosophy.



