CodeCompose: A large-scale industrial deployment of AI-assisted code authoring (arxiv.org)
165 points by azhenley on June 3, 2023 | 70 comments



The abstract says: "...we present metrics from our large-scale deployment of CodeCompose that shows its impact on Meta's internal code authoring experience over a 15-day time window, where 4.5 million suggestions were made by CodeCompose. Quantitative metrics reveal that (i) CodeCompose has an acceptance rate of 22% across several languages, and (ii) 8% of the code typed by users of CodeCompose is through accepting code suggestions from CodeCompose. Qualitative feedback indicates an overwhelming 91.5% positive reception for CodeCompose."

In other words, out of 4.5 million suggestions roughly 78% were rejected, yet there is 91% positive reception. That's about 3.5 million rejected suggestions that potentially distracted programmers from their work. Yet users are happy. Is there a contradiction in these figures?


Reading these answers reminded me why I love HN - actually thoughtful perspectives :) Guess a lot boils down to two variables: (a) suggestion UX quality and (b) the definition of a 'rejection' event. I skimmed through the paper and it turns out the 91% figure is based on feedback from 70 people, and anonymous feedback wasn't allowed. So 'overwhelming 91% favorable' can be paraphrased as '64 people out of the total 16k user base said they liked it'. Would be interesting to see indirect metrics like retention on day 15.


Quite an insightful comment. In an institution that large it's surprising there were only 64 brown nosers. I expect that out of a captive audience of 16k employees you could probably get 64 people to give a positive opinion of replacing paychecks with Meta store scrip.


It's easy to:

- anticipate when the suggestions are likely to be useless and not even bother

- scan the proposals to see if they are what you want in cases it's useful

It's a boilerplate generator and you're happy when it saves you tedious mental effort.


>It's a boilerplate generator and you're happy when it saves you tedious mental effort.

On the other hand the person trying to track down a subtle bug afterwards might be a little less happy at having to wade through oceans of boilerplate.


Sounds like you haven't tried copilot. Basically scenarios like:

  if(a) {...}
  if(b) // here it predicts the body line by line from a once you start with similar logic
  if(c) // here it does a one-shot generalization of a and b for c
Likewise method variations, enum mappings, inferring call parameters from names and signatures, etc. - these things are trivial to check and test but take effort to type out. When you know what you want to do and someone suggests the solution you had in mind, you're happy you saved half a minute or a minute of typing.
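To make it concrete, here's a made-up Python flavour of that kind of repetition (the Status enum and labels are invented; it's just the shape of code that's trivial to verify but tedious to type):

  from enum import Enum, auto

  class Status(Enum):
      PENDING = auto()
      APPROVED = auto()
      REJECTED = auto()

  def status_label(status):
      if status == Status.PENDING:    # you type this branch yourself...
          return "Pending review"
      if status == Status.APPROVED:   # ...this one gets predicted line by line
          return "Approved"
      if status == Status.REJECTED:   # ...and this one arrives as a one-shot suggestion
          return "Rejected"
      return "Unknown"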

I'm as, if not more, likely to zone out on tedium and introduce the subtle bug myself.


That isn't boilerplate. It's isolated, repeated code.


In my book boilerplate = repetitive code you can't abstract.


I'd say it's hard to argue with the positive impression of the engineer using it. If they find its suggestions helpful, it's not a distraction, it's helpful.

Using GitHub Copilot daily, I find its suggestions often nonsense but interesting to see regardless. Often for boilerplate it's spot on and it saves me dozens of lines of typing. But it also suggests stuff on every keystroke, much of which I just type through, similar to intellisense. Assuming Meta's code thingy is better, I would find myself in that 91%, as I'm already there with what's available to the general public.

My only gripe, fwiw, with copilot in vscode is that it interferes with intellisense. Often I want to see the code completion from both, but copilot jumps in before intellisense and the intellisense popup never renders, and I rely on intellisense as an inline API reference. Sometimes it's so frustrating I have to turn off copilot. But copilot is generally useful enough that I re-enable it once I've understood the API stuff I'm unsure of. There's some escape-backspace-period dance I can do that sometimes lets intellisense win. I've not dug deeply enough into vscode configuration to know if there's some parameter to tweak the race condition. I'd note that when intellisense renders first, copilot still renders its suggestions, but the other way around doesn't work.


I treat it the same way I do pre-LLM LSP suggestions, which is basically inline documentation lookup. ‘Oh what was that function name for inserting something at the end? PushB- no, InsertAft- no, App - end! Yea that’s it’

In this case it gave me 3 suggestions but I only accepted 1. I could see this taking 5-10 suggestions from an LLM when it's not something as straightforward as a function name. It's still very useful despite the low acceptance rate.


I think the 8% number better explains why users were so overwhelmingly happy. Assuming the suggestions in general are not distractingly wrong, then 8% of code automatically written is a decent amount of time saved researching solutions.


But those 8% correspond only to the 22% of suggestions that were accepted, which means that the 78% of suggestions that were not accepted correspond to the equivalent of over 28% of all code written. Not sure that having to spend time evaluating an additional 28% of code in vain amounts to an overall win.
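Back-of-the-envelope, assuming rejected suggestions are roughly the same length as accepted ones:

  accepted_share_of_code = 0.08  # 8% of typed code came from accepted suggestions
  acceptance_rate = 0.22         # 22% of suggestions were accepted

  # suggested-but-rejected code, expressed as a fraction of all code written
  rejected_equivalent = accepted_share_of_code * (1 - acceptance_rate) / acceptance_rate
  print(round(rejected_equivalent, 3))  # 0.284, i.e. roughly 28%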

Though I guess the success rates when using Stack Overflow aren’t too dissimilar.


What it doesn't tell us though is how useful the rejected recommendations were.

Meaning, how many rejected solutions were sufficient to give the engineer enough context to turn a 30m task into a 5m task because they generated a recommendation, got an idea, rejected it, and rewrote it more efficiently or more correctly?

There's a lot of "devil in the details" likely buried in here.


Interesting that 91% find it useful but only 8% of the code is generated by the LLM. This is even with an LLM tuned on the internal codebase. This will give a mild boost but not replace anyone.


Have you tried GitHub Copilot? You don't have to accept the code suggestions, so they don't really distract you or get in the way once you get used to the UX.


I find them extremely distracting. Evaluating a suggestion is, for me, an entirely different mental process from the creative process I’m in the middle of. The tagline that copilot helps you stay in the flow is very much not my experience.

I am well aware that others are having a different experience with it.


I've found I am naturally ignoring the large complex suggestions because they usually have mistakes, and accepting the small easy suggestions. I respect your experience though, to each their own.


Mine doesn't even make complex suggestions. I can't get it to suggest more than one line at a time. Wonder what's different? I'm on the beta.


The thing can generate whole unit tests if you leave it a one-line description in a comment next to the function you want tested. It's actually amazing.


For example, sometimes I'll start out with a code comment for a function, hit enter and the next line suggestion will be the entire function.
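Something like this, as a hypothetical illustration (the comment and function here are made up):

  from datetime import datetime

  # parse "YYYY-MM-DD" strings, returning None for invalid input
  def parse_date(value):  # everything below was a single suggestion
      try:
          return datetime.strptime(value, "%Y-%m-%d").date()
      except (ValueError, TypeError):
          return None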


The Industrial Challenges section of the paper addresses specific areas of flow disruption they focused on.

Some folks may never accept AI code completion/suggestions (just as some prefer vim over modern IDEs), but at least the people working on this stuff can describe the known pain points to focus on.


It’s a different system, but it seems interesting to compare with what Google does for code review suggestions [1].

> The final model was calibrated for a target precision of 50%. That is, we tuned the model and the suggestions filtering, so that 50% of suggested edits on our evaluation dataset are correct. In general, increasing the target precision reduces the number of shown suggested edits, and decreasing the target precision leads to more incorrect suggested edits. Incorrect suggested edits take the developers time and reduce the developers’ trust in the feature. We found that a target precision of 50% provides a good balance.

Also, it seems like if the suggestions are too good then they’ll be blindly trusted and if they’re too bad they’ll be ignored?

Where to set the balance likely depends on the UI. For a web search, how many results do you click on?

[1] https://ai.googleblog.com/2023/05/resolving-code-review-comm...


A lot of the time, suggestions are provided but not used because you already knew the answer and typed fast enough not to take them.


Think of it like traditional code completion. It's mostly wrong but still useful. You either type through it, or tab/arrow to select the correct completion.

AI code completion (like Github Copilot) is like this. Still a time saver overall, even with a low acceptance rate.


If you take random questions from Stack Overflow, my guess is that 80% of them don't have a correct answer, yet I am very happy Stack Overflow exists.


I've had Bing provide me with code from SO that came from the question itself: code that was explicitly stated not to work, where the poster wanted to know what was wrong with it. Bing's AI didn't understand this and claimed it was a solution.


The UX is really important, and the paper covers it. This is super spiffy tab completion, so even if it's wrong a lot, reading is faster than typing, and having something autofill `x if x is not None else ''` correctly even 5% of the time is nice.


I was thinking the same; it feels the acceptance rate is a bit low, but maybe not… I wonder what the numbers are for Copilot?


It's not like programmers normally get everything right first time.


if it makes you a 1.2x dev?


Everyone in this space seems to be building on the LSP, and classic auto-complete in particular, as their UI. But I've found this to be non-ideal.

- As mentioned in this paper I definitely do not want the AI suggestion crowding out a suggestion generated directly from the type bindings

- I often do want the AI to write an entirely new block of boilerplate. To do this you have to write a comment string targeted at the AI, then delete this afterwards

- Sometimes I'd just like the AI to explain to me what some code does without writing anything

- This isn't something I always want on; I find myself turning the plugin on and off depending on the context

Overall I think we need a novel UX to really unlock the AI's helpfulness


I have been enjoying a chat based AI coding modality. I built some tooling that gets rid of the need to cut & paste code between the chat and your files. This makes chatting about code changes much more ergonomic. My tool also integrates directly with git, which provides a safety net. It’s easy to undo changes if the AI does something silly.

Here are some chat transcripts that give a flavor of what it’s like to code with AI this way:

https://aider.chat/examples/

My tool is open source, and currently only works if you have a gpt-4 api key.


Very cool. I've been using ChatGPT quite manually to similar effect, though I'm often using it for fragments of code rather than whole projects/files, given I'm often dealing with an existing codebase.


A lot of my recent work has been focused on making this chat style coding work better with larger, pre-existing codebases.

I wrote up some notes on one effective approach:

https://github.com/paul-gauthier/aider/blob/main/docs/ctags....
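For anyone curious, the general shape of the idea looks roughly like this (a rough sketch in Python, not aider's actual implementation): run ctags over the repo and condense the symbols into a compact map that fits in the prompt.

  import subprocess
  from collections import defaultdict

  def repo_map(root="."):
      # "ctags -R -x" prints one xref line per symbol: name, kind, line number, file, source text
      out = subprocess.run(
          ["ctags", "-R", "-x", root],
          capture_output=True, text=True, check=True,
      ).stdout

      symbols = defaultdict(list)
      for line in out.splitlines():
          parts = line.split(None, 4)
          if len(parts) >= 4:
              name, kind, _line_no, path = parts[:4]
              symbols[path].append(f"{kind} {name}")

      # a compact, file-grouped summary to include in the chat context
      return "\n".join(
          f"{path}: " + ", ".join(entries)
          for path, entries in sorted(symbols.items())
      )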


This echoes my sentiment exactly. My biggest gripe is when type suggestions are replaced with AI suggestions, as I more often just want to auto-complete a method/attribute. I frequently find myself toggling AI suggestions via hotkey.

As for getting a suggestion by writing comments: an "insert from prompt" action perhaps, or just a separate prompt pane/popup/whatever-you-prefer combined with good ol' copy+paste, would suffice.


Does it need to be that novel of a UX?

If you want to know what some code does, just select it & hit a keyboard shortcut (or right click and choose explain from menu).

If you want AI to write code for you, write a comment starting with a specific word, it suggests the implementation and you can choose to accept & replace the comment with it.


All of these things are pretty much exactly what GitHub Copilot chat does.

You can select code and ask it to explain it to you, or ask it to generate some boilerplate / code and then insert it at your cursor position without adding any comments like you described for prompting the Copilot autocomplete.

It seems okay, but I don't really use vscode that much


In vscode the Genie extension does these things and you can provide your own contextual hooks with custom prompts. It’s particularly good at explaining syntax and semantic errors.


What kind of novel UX are you imagining?


Hm. It seems to be like automated Stack Overflow. Only 8% of the code comes from the AI system, but it's useful for getting examples of how to do something.

Hallucination about API calls was reported as a problem. I've seen that one. There's an amusing, and seriously annoying, tendency for these systems to make up some plausible API call that does what you need, but doesn't exist. Maybe something should collect up such suggestions as proposals for new API calls.
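A typical flavor of it, with the made-up call clearly marked (Python, purely for illustration):

  import datetime

  d = datetime.date(2023, 6, 3)

  # plausible-sounding suggestion, but no such method exists in the stdlib:
  # d.add_business_days(5)  -> AttributeError

  # what you actually end up writing yourself:
  def add_business_days(day, n):
      step = datetime.timedelta(days=1)
      while n > 0:
          day += step
          if day.weekday() < 5:  # Mon-Fri
              n -= 1
      return day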


The future of API design — “yes it would make SENSE if that existed, but it doesn’t” => now it does


If you want to skip the paper and watch the video: https://youtu.be/ANDJ0TKjyWw

Disclaimer: I am the person in the video.


It was a great video, and a great paper.

As someone who writes quite a lot of Hack, I'm selfishly interested in whether you plan to open-source this work (not the weights, obviously, but everything else).


Are users finding usefulness and time savings in the suggestions even if they ultimately reject them?

I'm thinking of scenarios where they might get an idea from a suggestion, then reject it and rewrite it better themselves.


My 2 cents on AI-assisted code authoring:

The first time I got my hands on GitHub Copilot some time ago, I intentionally chose something "odd" (so no CRUD API in X, Kafka stuff, or sorting algorithms): I decided to re-create an Excel-sheet-like UI in React & Canvas.

At first everything proposed was misleading, but after the first couple of lines it astonished me by really proposing things like nested loops for iterating over a grid (admittedly, I used a fitting function name). I got quite far (for a quick test) with marking single and multiple cells, dragging the marked area, and double-clicking for text input. Then I realized it was about time to come up with a proper data model and stopped there.

All in all, I'm using it for maybe 50% of my daily coding, but it's frequently misleading in subtle, tricky ways that I expect to be quite dangerous, especially if using it for tests as well (you need to double-check that it's not BSing you twice). Debugging code is already a big drag in our work, and I'm somehow not happy about debugging AI-generated code.


I would very much like a local code assist tool. Assuming integration with editors is my problem, what's best in class this week if a) I have a respectable GPU; b) I don't, and need CPU-only?


On Visual Studio there's an extension (by Microsoft) called IntelliCode, which is a small AI assistant that runs locally on the CPU. It doesn't come close to these new large GPU models, but it's quite handy. It looks at what you're typing on the current line, your previous activity, and the current project, and tries to predict the full line, or even the same change on multiple lines if that makes sense.


We recently published a paper on IntelliCode in which we share some of the usage numbers.

https://austinhenley.com/pubs/Vaithilingam2023ICSE_IntelliCo...

Disclaimer: I'm one of the co-authors.


Check out https://github.com/TabbyML/tabby, which is fully self-hostable and comes with niche features.

On M1/M2, it offers a convenient single binary deployment, thanks to Rust. You can find the latest release at https://github.com/TabbyML/tabby/releases/tag/latest

(Disclaimer: I am the author)


I have just installed Fauxpilot <https://github.com/fauxpilot/fauxpilot> (nVidia GPU only) and it works... OK. Still evaluating, and I'm basically sceptical about the whole concept, but... let's see.


Nothing even comes close to copilot. I realise you said “local”, but if you insist on that you’re going to be disappointed.


Then the results are not really surprising. If you have access to a large body of code, the mere ability to query it efficiently and re-use boilerplate or snippets of code performing some standard function, and/or add documentation, is a productivity booster. Now we know it's 8%. Small but measurable.

Arguably, if you optimise the system to do precisely that, rather than pretend to hold civilized conversation, you could get quite a bit more bang for your gazillion-parameter buck.


As the tools improve, I expect a whole new breed of developers to spawn who can't really code but are great at rejecting code that isn't extremely easy to read. After that, very little new code will be written and all of the focus will be on the development of stronger, less toxic glue.

Then everything goes dark, a few companies will absorb and confuse the art or otherwise find ways to exclude everyone else from the process.


This work started as "Big Code" and the team responsible (Probability) was laid off: https://www.linkedin.com/posts/erikmeijer1_meta-layoffs-hit-...


It’s a funny game because they all need their own clones of each model / product.

Feels like tech is making billions but is a little lost?


Limit training to stackoverflow input and wham! we have automated modern programming ;)


I would like to work in this code copilot space; I think it will be one of the fastest-growing applications of LLMs in the near future. I have been working on a tool to autogenerate docstrings from a Python method in Google format.
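For reference, the kind of output I'm aiming for (the function below is just a hypothetical example):

  def transfer(amount, source, target):
      """Move funds between two accounts.

      Args:
          amount: Amount to transfer, in the source account's currency.
          source: Account to withdraw from.
          target: Account to deposit into.

      Returns:
          The updated balance of the source account.

      Raises:
          ValueError: If amount is negative or exceeds the source balance.
      """
      ...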



Who else is tired of these trillion-dollar companies that talk and talk and have no products?


It's a paper about a (for now) internal product:

"In this paper we present CodeCompose, an AI-assisted code authoring tool developed and deployed at Meta internally. CodeCompose is based on the InCoder LLM that merges generative capabilities with bi-directionality. We have scaled up CodeCompose to serve tens of thousands of developers at Meta, across 10+ programming languages and several coding surfaces."

Looks like the InCoder model it's based on can be downloaded here: https://huggingface.co/facebook/incoder-6B


Does publishing papers about internal tools serve any purpose for companies other than PR?


It was a useful read for me, especially seeing numbers on fine-tuning & use. We are piloting a DB analyst tool where users can use natural language to do DB queries, generate AI analyses, make interactive GPU charts, etc., so there are many nearby questions we think about a lot. Previously, as a PhD publishing on program synthesis, most of our writeups were at a much smaller scale wrt live user evals, so all combined... super cool to see.

For FB... It probably helps keep the team valued internally + helps with retention & recruiting. For PhD trained types, this kind of paper is almost table stakes.

Less obvious... FB has been laying off teams like this despite productivity-ROI intuitions, so if I were there, I'd be careful to quantify current + future ROI. I'm sure there are key #'s not being shared.


This paper says: “Customized for the organization: CodeCompose is fine-tuned on Meta’s internal code repository, and is thus able to handle Meta-specific languages such as Hack and Flow.” If you work at an org that might want to build their own LLM trained on their own internal code base, then the lessons of this paper would be of value to you.


Makes sense. For Meta, though: by publishing papers like this, are they hoping for something other than PR? My only other guess would be attracting talent.


Lots of companies have research departments and release papers; the people working in these departments have some academic roots at least. The incentives for releasing papers are:

* To raise your profile and reputation generally

* The specific publish or perish incentives in academia

* Because you really think you’ve done something interesting and novel, and want to share it with the world

Only the middle one is removed when going to industry.


There are many scholars working in industry (from software engineering to biotech) who believe in the ideals of information sharing and publication of research.


Of course it does. They give lots of details about the tools and their use which is obviously helpful to anyone wanting to do something similar.

A great example is https://www.uber.com/blog/research/keeping-master-green-at-s...

I think maybe Gitlab Merge Trains predate it, but it was definitely influential.


The Google ones seem to have sparked a few revolutions of their own.


For anyone interested in related research, I used https://mirrorthink.ai to find some background on the state-of-the-art.

(disclaimer: this is AI-generated, but grounded in the contents of the papers, with real references, so I'd say it is still constructive)

The state-of-the-art in code generation has seen significant advancements with the deployment of large language models (LLMs) in various code authoring tools. One such example is the study on GitHub Copilot, Amazon CodeWhisperer, and ChatGPT [1], which evaluates the code quality of these AI-assisted code generation tools. The study reveals that ChatGPT generates correct code 65.2% of the time, while GitHub Copilot and Amazon CodeWhisperer achieve 46.3% and 31.1% correctness, respectively. These results indicate that LLMs have made substantial progress in generating high-quality code, but there is still room for improvement.

Other research in the field has explored various techniques to enhance code generation and assistance. For instance, RepoCoder [2] focuses on repository-level code completion by integrating code generation and retrieval models in an iterative paradigm. This approach considers the repository-level context, including customized information such as API definitions and identifier names, to improve code completion suggestions. Serenity [3] leverages library-based Python code analysis for code completion and automated machine learning. The authors explore the potential of data flow analysis produced by Serenity to improve code completion when combined with neural models.

In addition to these advancements, the field has seen progress in incorporating contextual information into code completion models. The paper on enriching source code with contextual data [4] investigates the impact of incorporating contextual information on the performance of code completion models. The authors conduct an empirical study to analyze the effectiveness of this approach. These achievements, along with the advancements in LLMs, contribute to the ongoing progress in code generation and assistance. As the field continues to evolve, it is expected that AI-assisted tools will become increasingly sophisticated and effective in assisting developers with various aspects of the software development process.

[1] Evaluating the Code Quality of AI-Assisted Code Generation Tools: An Empirical Study on GitHub Copilot, Amazon CodeWhisperer, and ChatGPT - 2023: https://arxiv.org/abs/2304.10778

[2] RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation - 2023: https://arxiv.org/abs/2303.12570

[3] Serenity: Library Based Python Code Analysis for Code Completion and Automated Machine Learning - 2023: https://arxiv.org/abs/2301.05108

[4] Enriching Source Code with Contextual Data for Code Completion Models: An Empirical Study - 2023: https://arxiv.org/abs/2304.12269


This appears to be another shill account for MirrorThink. Users under 250 karma should not be allowed to post links/URLs.





