Phind Model beats GPT-4 at coding, with GPT-3.5 speed and 16k context (phind.com)
891 points by rushingcreek on Oct 31, 2023 | 347 comments
Hi HN,

We’re excited to announce that Phind now defaults to our own model that matches and exceeds GPT-4’s coding abilities while running 5x faster. You can now get high quality answers for technical questions in 10 seconds instead of 50.

The current 7th-generation Phind Model is built on top of our open-source CodeLlama-34B fine-tunes that were the first models to beat GPT-4’s score on HumanEval and are still the best open source coding models overall by a wide margin: https://huggingface.co/spaces/bigcode/bigcode-models-leaderb....

This new model has been fine-tuned on an additional 70B+ tokens of high quality code and reasoning problems and exhibits a HumanEval score of 74.7%. However, we’ve found that HumanEval is a poor indicator of real-world helpfulness. After deploying previous iterations of the Phind Model on our service, we’ve collected detailed feedback and noticed that our model matches or exceeds GPT-4’s helpfulness most of the time on real-world questions. Many in our Discord community have begun using Phind exclusively with the Phind Model despite also having unlimited access to GPT-4.

One of the Phind Model’s key advantages is that it's very fast. We’ve been able to achieve a 5x speedup over GPT-4 by running our model on H100s using the new TensorRT-LLM library from NVIDIA. We can achieve up to 100 tokens per second single-stream while GPT-4 runs around 20 tokens per second at best.

Another key advantage of the Phind Model is context – it supports up to 16k tokens. We currently allow inputs of up to 12k tokens on the website and reserve the remaining 4k for web results.

There are still some rough edges with the Phind Model and we’ll continue improving it constantly. One area where it still suffers is consistency — on certain challenging questions where it is capable of getting the right answer, the Phind Model might take more generations to get to the right answer than GPT-4.

We’d love to hear your feedback.

Cheers,

The Phind Team




I just spent a few minutes doing a comparison between Phind and GPT-4 for a very high-level question on a distributed job queue. I gave them both the same fairly vague sketch of a kind of system I would like to build. Here are my impressions:

In the positives of Phind:

* Phind was able, even eager, to recommend specific libraries relevant to the implementation. The recommendations matched my own research. GPT-4 takes some coaxing to get it to recommend libraries. Phind also provided sample code using the libraries it recommended.

* Phind provides copious relevant sources including github, stackoverflow and others. This is a major advantage, especially if you use these AI assistants as a jumping off ground for further research.

* Phind provides recommendations for follow-on questions that were very good. One suggestion to the Phind team: don't remove the alternate follow-on questions once I select one. A couple of times it recommended a few really good follow-up questions, but as soon as I selected one the others disappeared.

In the positives of GPT-4:

* GPT-4 gave better answers. This is my subjective opinion (obviously), but if I were interviewing two candidates for a job and using my question as the basis for a systems-design interview, then GPT-4 was just overall better. In many cases it added context beyond my question, recommending things like logging and metrics, for example. It seemed to intuit the "question behind the question" in a much better way than the literal interpretation of Phind. This is probably highly case-dependent; sometimes I just want an answer to my explicit question. But GPT-4 seemed to understand the broader context of the question and replied with that in mind, leading to an overall more relevant response.

* GPT-4 handled follow-up questions better. This is similar to the previous point - but GPT-4 gave me the impression of narrowing down the scope of the discussion based on the context of my follow-up question. It seemed to "understand" the direction of the conversation in a way that felt like it was following context.

NOTE: this was not a test on coding capability (e.g. implementing algorithms) but on using these AI coding assistants as sounding boards for high-level design and architecture decisions.


This is a good point about GPT-4; it can intuit the "question behind the question" really well compared with other models. And it's been profoundly useful for me with the most random tasks I knew nothing about beforehand (like fixing a wall in my house), etc.


That's probably because OpenAI can train on the (successful) conversations we have with ChatGPT!


> * Phind provides copious relevant sources including github, stackoverflow and others. This is a major advantage, especially if you use these AI assistants as a jumping off ground for further research.

Did you find them to be correct?


I don't use Phind for coding, except occasionally, but I like it best for generalised tech search because each para has a reference and there's a list of references down the side -- often the references would really be sufficient for me on their own.

I've had one glaring error. I can't quite remember the details, but it switched the names/characteristics of two different processes (i.e. it said exactly the opposite of what is true); it was something to do with instruction caching and TLB, IIRC. I assumed it was a problem with the input corpus not allowing antonyms to be disambiguated.

Anyway, for me it's the best of the LLM tools I have access to and it has mostly replaced search engines (Google, Dukgo) for my tech-related work.

I've only used chat.openai.com (free), bing chat, HuggingChat.


I don't think "correct" is the right word since these were open ended systems design type questions. There are many ways to accomplish the same task.

I also spent about 20 minutes on this which is why I mentioned this is a first impression. I'll leave it to researchers to develop a "relevancy" metric and objectively apply it.

In my experience, the sources were sufficiently relevant based on its responses. They were about as relevant as equivalent Google queries. There were some tiny, tiny niggles: for example, I was explicit that I wanted it to recommend approaches in Go, yet for one reference I recall, related to distributed locking mechanisms, it provided a link to an implementation in Java. However, that is completely fine for me since the context was more about the locking on the database side and not really the implementation in a specific language.


And the sources actually existed? i.e. there weren't any made-up ones?


The sources are urls to the cited page (e.g. stackoverflow.com, pkg.go.dev). In the side-bar next to the answer is a more standard search-result style link list with pulled quotes from the pages (like a Google search).

I didn't click every single link (as I mentioned, the citations are copious) but the few I did follow went to relevant articles. I just went back and randomly clicked several more and they all went to pages that exist and mostly relate to the content of the answer. The inline citations seem a bit more on-topic compared to the side bar which does seem more like the links were lifted directly from a search engine.

To be fair there are some lower-quality blog-spammy kinda stuff - more or less the same kind of thing you would get out of Google. But compared to GPT-4, which provides no sources whatsoever, it is an advantage IMO.


Do you have custom instructions? Everyone needs to mention and post their prompts, else it's entirely anecdotal.


We support custom instructions at https://phind.com/profile.


I’m trying to get it to answer only in executable Python. I used the template with instructions I use for my system prompt on gpt4. And I tried using the additional context field for the same.

It gets to writing the expected code but it still wants to include formatted headings instead of commenting those out so the entire response is executable Python.

As a follow up I provided an example heading with the hash out front. It didn’t work.

Any ideas on how to get it to do this? FWIW, GPT-4 also often ignores this request, but only about half the time, and when it does, the extra text is typically a single block of explanation.

For that, I include prose detection and commenting as part of my post processing.
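
As a rough illustration, a minimal sketch of that kind of post-processing, assuming the response mixes prose with fenced Python blocks (the real heuristics are more involved):

  def make_executable(response: str) -> str:
      """Comment out everything that isn't inside a fenced code block,
      so the whole LLM response can run as a Python script."""
      out, in_code = [], False
      for line in response.splitlines():
          if line.strip().startswith("```"):
              in_code = not in_code          # toggle on fence markers, drop the fence itself
              continue
          out.append(line if in_code else "# " + line)
      return "\n".join(out)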

Also, I don’t see it easily, but do you have an API for this or is it intended to be run by the user?


Getting it to not output additional text is not something that it can do super well at the moment, unfortunately. We'll work on that.


My trick for this has been one-shot training + regex. I tell the model to produce executable code within triple backticks suffixed by a keyword, like:

```keyword // code ```

and then I just ignore anything outside of those blocks.
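
The extraction side is only a few lines; a rough sketch, assuming the model fences its code as ```keyword ... ``` (the keyword itself is arbitrary):

  import re

  KEYWORD = "keyword"   # whatever sentinel you told the model to use
  BLOCK_RE = re.compile(r"```" + re.escape(KEYWORD) + r"\s*(.*?)```", re.DOTALL)

  def extract_code(response: str) -> str:
      """Keep only the contents of the keyword-tagged blocks; ignore everything else."""
      return "\n\n".join(m.strip() for m in BLOCK_RE.findall(response))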


I did not have custom instructions for either assistant. You can see the full conversation logs which I posted as a reply to another comment.


The “give context” part has a lot to do with prompting well for the particular model. To have a fair comparison, the prompt should be just code, and then see what each model comes up with.


mind providing some of the prompts you use to question them?


Here are the conversation logs:

https://chat.openai.com/share/867ff0c4-d4cf-4af9-a785-31a599...

https://www.phind.com/search?cache=ej8pn1dfjjwfr1tgc6ybwhlg

NOTE: there are a few more question/answer blocks in the phind conversation since I was testing out the follow up question feature.


would you be able to share your prompt(s)?

Edit: they are already posted as comment in this thread.


I tried my standard "trick" question I use for LLMs:

"Give me five papers with code demonstrating the state of the art of machine learning which uses geospatial data (e.g. GeoJSON) as both input and output."

There is no such state of the art. My hand-wavey understanding is that GIS data is non-continuous, which makes it useless for transformers, and also contextual, which makes it useless for anything else. Will defer to actual ML people for better explanations.

Point is, LLMs invariably give five papers with code that don't actually exist - it's a guaranteed hallucination.

Phind was able to give me five links that do in fact exist, as well as contextual information as to why these five links were not papers with code doing ML with GIS data. This is by far the best answer to this question from an LLM I've received yet.


I don't see how this would be relevant for a code model?

The code model isn't trained to retrieve papers/articles; it's meant to complete code. Whether or not you find hallucination in an unrelated task isn't particularly interesting.


Damn, this is how I learn that HN doesn't have a block function. What a shame.

My friend, can you do me a favour and actually click the link and have a play with the app? If you do, you will discover that what you're dealing with there is an LLM. That's literally why it's being compared to other LLMs.

No idea what you were trying to achieve with this comment. "The code model isn't trained to retrieve articles." a) neither is any other LLM, what's your point? and b) the app on the other end of that URL retrieves articles - it's not even tangential to the app, it's key functionality.


Lmao



ChatGPT 4 seems to be better than it was when I was using it (mere months ago)!


Yeah, all of OpenAI's stuff gets better much quicker than I'm used to. So does the general performance ceiling of all open source models, even if individual models don't improve as much.


> the state of the art of machine learning which uses geospatial data (e.g. GeoJSON) as both input and output

> There is no such state of the art

Some GIS work uses vector data: points/lines/polygons representing features (e.g., the location of roads or the outlines of buildings), which can be stored in formats like GeoJSON or WKT. But other work uses remote sensing data/satellite imagery that can be stored in raster formats like GeoTIFF - essentially TIFF image files with additional information stored to georeference them.

You can totally do machine learning on satellite imagery where both the input and output are geospatial data (e.g. to categorise land use - the inputs are multispectral images and the outputs can be images where the value of each pixel represents the identified land use).

You can also use machine learning for tasks like building footprint detection/delineation (e.g., [1]) based on satellite imagery. The output from such a pipeline can be a set of polygons, which could be saved as GeoJSON.

I'd consider either of these to be examples of "machine learning which uses geospatial data (e.g. GeoJSON) as both input and output".

[1]: https://azure.microsoft.com/en-us/blog/how-to-extract-buildi...



I'm glad there's growing competition, but GPT4 still outperforms. Here's a recent question I asked:

Could you please provide me with a query for my PostgreSQL table? It consists of three columns: 'id,' 'teaser,' and 'full_text,' all of which are of type 'text.' Most 'full_text' entries are articles around 1000 words in length. I'd like to update the 'teaser' column for each row with the first approximately 200 words from the 'full_text' column. How can I achieve this?

Phind:

  CREATE OR REPLACE FUNCTION get_first_n_words(text, integer)
  RETURNS text AS $$
  DECLARE
     word_count integer := 0;
     word_start integer := 1;
     word_end integer;
  BEGIN
     WHILE word_count < $2 AND word_start <= length($1) LOOP
         word_end := position(' ' in $1 from word_start) + 1;
         IF word_end = 0 THEN
             word_end := length($1) + 1;
         END IF;
         word_count := word_count + 1;
         word_start := word_end;
     END LOOP;
     RETURN substring($1 from 1 for word_start - 1);
  END;
  $$ LANGUAGE plpgsql;
GPT4:

  UPDATE your_table_name
  SET teaser = (
      SELECT STRING_AGG(word, ' ') 
      FROM (
          SELECT SPLIT_PART(full_text, ' ', i) AS word
          FROM generate_series(1, 200) AS i
      ) AS words
      WHERE word <> ''
  )
  WHERE full_text IS NOT NULL;


Running "Ignore Web Context" enabled can improve performance for design tasks like this. I just got a more plausible answer: https://www.phind.com/search?cache=f0fkv5mxscwvagxgkuwnwgtl. Consistency is something we're working on.


Thanks for sharing, you're right - that does improve performance!


How do you enable "Ignore Web Context"? I don't see that option anywhere on the page you linked, am I just being blind?


It's in the model dropdown under the search bar.


You mean "Ignore Search Results" ?


One example is not enough for performance conclusions


Obviously not. Perfectly reasonable to share anecdotes though.

Also, I ran a few different tests, and every GPT-4 response was superior, but I didn't want to clutter my comment with queries and code.


There is a performance conclusion in the title though.


That conclusion is based on a benchmark with many examples across different tasks.


That conclusion is based on their benchmarks. I'm not interested in those. I'm interested in community benchmarks, like those we're seeing in the comments. Lo and behold, GPT-4 is still king. The claims of any company should be taken with exactly a pinch of salt.


That benchmark (HumanEval) is a public benchmark built by others.


That kind of benchmark is a lot more reliable for models published before the benchmarks; models published afterwards have more opportunity to "study to the test". That's especially a concern when a company explicitly uses its score on that benchmark as a marketing point.


sure, but it is the best thing we have.


Well no, we have the anecdotes of all the HN folks, which I trust many, many times more than a benchmark.


lol, you can continue trusting anecdotes from the internet. Industry prefers more scientific methods.


So Paul Graham posted that Phind is better and got absolutely destroyed in the comments

https://twitter.com/paulg/status/1719657855240815026

No, I do not take these benchmarks seriously and for good reason. They're benchmarks. The only thing that matters is the user's direct experience of the product. And Phind isn't there.


> got absolutely destroyed in the comments

by Twitter trolls?..


AFAIK they haven't released the dataset they fine-tuned on, so we can't be 100% sure there wasn't benchmark contamination. I agree that we definitely need more than N=1 to challenge the performance claims, but I still think it's valid to call it out given how much benchmark-gaming we've seen in this space.


I think you can bring a contamination claim against every public benchmark result nowadays: models are trained on TBs of data crawled from the internet, and there is no guarantee the benchmark hasn't leaked in some way.


With respect to the pretraining data, it's true that we're probably SOL there in terms of verification. But for fine-tuning, they could still publish the dataset and let others reproduce their results as well as audit for contamination.

If we're comparing benchmark deltas between different fine-tuned variants that share the same base models, that seems like the bare minimum we should expect to come along with performance claims.


I think both the pretraining and fine-tuning datasets are essential secret information for commercial models/services.


In the case of Phind though, they also publish their models on HF with similar bold performance claims without publishing the datasets: https://huggingface.co/Phind/Phind-CodeLlama-34B-v2

Even if I grant that their subscription product has some secret sauce they want to keep close to the chest (ignoring for a moment that their paid product is GPT-4 based), not doing the same for the models they release to the open source community free of charge with a commercially permissive license seems suspect.

I realize this sort of open source contribution is mostly for marketing purposes, but being critical of the performance claims I think is still valid nonetheless.


From what I understand it's a single test suite? Of course I don't really mind the clickbait title that much, it's hard to attract attention otherwise.


I think it is valid criticism that the HumanEval benchmark is not completely representative; they also say so in the post.


Depends on the claims made.


With some simple clarification I got this

UPDATE your_table SET teaser = substring(full_text from '(\S+\s*){1,200}')


I really dislike article teasers and "read more" buttons. Now I know it's intentional clipping of the corresponding articles.


> We can achieve up to 100 tokens per second single-stream while GPT-4 runs around 20 tokens per second at best.

Is that with batching? If so, that's quite impressive.

> certain challenging questions where it is capable of getting the right answer, the Phind Model might take more generations to get to the right answer than GPT-4.

Some of this is sampler tuning. Y'all should look at grammar based sampling (https://github.com/ggerganov/llama.cpp/pull/1773) if you aren't using it already, as well as some of the "dynamic" sampling like mirostat and dynatemp: https://github.com/LostRuins/koboldcpp/pull/464

I think these should work with nvidia's implementation if you just swap the sampling out with the HF version.

BTW, all this is a great advantage of pulling away from OpenAI. You can dig in and implement experimental features that you just can't necessarily do through their API.


We leverage Flash Decoding (https://crfm.stanford.edu/2023/10/12/flashdecoding.html) in TensorRT-LLM to achieve 100 tokens per second on H100s.


is that impressive? I was thinking 100 tok/s on an H100 is really slow considering LMDeploy claims 2000+ on an A100 and a large batch size.


We get 100 tokens a second with batch size 1. Those 2000+ figures are for large batches.


Ah, that's fair, and faster than any of the LMDeploy stats for batch size 1; nice work!

Using an H100 for inference, especially without batching, sounds awfully expensive. Is cost much of a concern for you right now?


I don't think they're saying they're doing a batch size of 1, just giving expectations of user-facing performance.


Yeah, and this is basically what I was asking.

100 tokens/s on the user's end, on a host that is batching requests, is very impressive.


I think they _are_ saying batch size 1, given that rushingcreek is OP.


Yes they are saying batch size 1 for the benchmarks, but they aren't doing batch size 1 in prod (obviously).


I don't think that is obvious. If your use case demands lowest latency at any cost, you might run batch size 1. I believe replit's new code model (announced about a month ago) runs at batch 1 in prod, for example, because code completions have to feel really fast to be useful.

With TensorRT-LLM + in-flight batching you can oversubscribe that one batch slot, by beginning to process request N+1 while finishing request N, which can help a lot at scale.


I'm not sure about TensorRT, but in llama.cpp there are separate kernels optimized for batching and single-use inference. It makes a substantial difference.

I suppose one could get decent utilization by prompt processing one user while generating tokens for another.


Without batching, I was actually thinking that's kind of modest.

ExllamaV2 will get 48 tokens/s on a 4090, which is much slower/cheaper than an H100:

https://github.com/turboderp/exllamav2#performance

I didn't test codellama, but the 3090 TI figures for other sizes are in the ballpark of my generation speed on a 3090.

100 tokens/s batched throughput (for each individual user) is much harder.


I am a heavy user of GPT4, and Phind was surprisingly able to match GPT4 on several initial programming tasks I gave it. Given the large context window of Phind, it will likely be able to outperform GPT4 for some tasks.

That is quite an accomplishment, I am impressed


FWIW The default context window of GPT-4 via ChatGPT is about to change to 32k.


Given the number of times it just fails with large prompts on 32k contexts, I'm not sure they're ready for this. In my experience, if you're consuming 20k+ tokens, the failure rate is more than 50%.


Well over 50%, at least via the api, for me.


This would be great if true. Any source for this?



that would put them significantly ahead again, for my use cases


We will eventually increase the Phind Model to 100K tokens -- the RoPE embeddings in Code Llama were designed for this.


> the RoPE embeddings in Code Llama were designed for this.

The RoPE embeddings were not "designed" for that. The original RoPE was not designed with length extrapolation in mind. Subsequent tweaks to extrapolate RoPE (e.g. position interpolation) are post-hoc tweaks (with optional tuning) to an entirely vanilla RoPE implementation.
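
For what it's worth, position interpolation really is a small post-hoc rescaling; a rough numpy sketch of the idea (just the angle computation, attention omitted):

  import numpy as np

  def rope_angles(positions, dim, base=10000.0, train_len=None, target_len=None):
      """Vanilla RoPE rotation angles; with position interpolation the positions
      are simply rescaled by train_len / target_len so they stay in the trained range."""
      positions = np.asarray(positions, dtype=np.float64)
      if train_len and target_len:
          positions = positions * (train_len / target_len)   # the post-hoc tweak
      inv_freq = base ** (-np.arange(0, dim, 2) / dim)       # theta_i for each pair of dims
      return np.outer(positions, inv_freq)                   # shape: (seq_len, dim // 2)

  # e.g. running a 4k-trained model at 16k while keeping angles in the trained range
  angles = rope_angles(np.arange(16384), dim=128, train_len=4096, target_len=16384)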


100k tokens and good IDE support would be great. Copy-pasting back and forth between the browser and the IDE is kinda annoying and you always miss some context. I think the model is now good enough, but what is kinda missing is a good developer experience, e.g. deciding what to load into that context window and how the model integrates into the IDE. But this is kinda missing with Copilot and ChatGPT-4 as well.


Is it “100k” or really 100k? There are so many ways to do context. I remember seeing 100k before, but it was doing some cheap trick to get there.


What about ALiBi and Sliding Window Attention?

Additionally Apple researchers seem to be playing with "Attention Free" variants.


Source?


I love that Phind cites what it scrapes. This should be an obligation for all LLMs. I always suggest people use it over ChatGPT.


What they're citing isn't what the LLM "scraped"; it's what the retrieval model fed to the LLM. You're not guaranteed that it's what the model actually used to give you the output, and it's also definitely not all the text it drew on for the knowledge behind the answer, since that is spread over millions of training examples for the language (and for human language generally) in a way that isn't human-understandable.


A couple of times I've had the reference not include the detail being mentioned in the foregoing paragraph; the citations are still highly relevant, but it wasn't quite what I expected.


I've heard this cold take before, but OpenAI's source code isn't open to academic scrutiny. So I don't understand why some people are so confident about how it works. It's certainly not magic, and Phind seems to be capable of citation.


It's transformer-based language-modelling 101, not really a take, just stating facts. It's highly unlikely that Phind has single-handedly, in a purely novel way, fundamentally solved the exact same problems that the whole field is working on simultaneously. It's just how transformers work.


Phind appears to be doing it, though. LLMs are stochastic parrots; I don't see a radical difference. Input goes in, output comes out. Neural networks aren't magic, they're a complex function. One node or a billion, we can track the data that's changing the weights inside the network.


I do this for a living


As a user, I prefer getting the right response over the thing spitting out a link (not saying Phind is bad). Let's focus on getting LLMs right before nerfing them in their baby stages.


Who said anything about nerfing? Citation is just additive, no?


In fact, I'd argue that citation makes LLMs better. Kind of a "think carefully" indicator. When LLMs are able to verify those citations independently, it's going to level up again by skyrocketing the objective truthiness.


Interestingly, I'd say that _not_ being able to give citations helps protect the LLM from copyright issues. That being said, I'd much prefer if the LLM could provide citations for every piece of information it was trained on and uses to provide an answer.


Citations are essential for me as I'm using Phind for work and can't rely on "trust me bro". It needs to conform to my expectations or be confirmed by a couple of the citations that have trustworthy sources (e.g. ones from known domains, well-cited journals, etc.).


I've found great sites and devs using Phind.


Yeah, I prefer the context provided by the original creator. If I'm writing code and I need to reference someone else's work I put their name in my comments. I was digging through Box2D for polygon vs ray intersections and in the comments of the source code Erin Catto cites Collision Detection in Interactive 3D Environments by Gino van den Bergen. It makes me respect him even more.


I find it often makes the responses worse when it's being pre-fed these search results. That was the case when I tried GPT-4 with web browsing enabled, and it seems to be the case with this too, since even the person from the Phind team in this thread pointed out that turning this feature off improves performance for some tasks:

https://news.ycombinator.com/item?id=38089888

https://news.ycombinator.com/item?id=38090442


Nerf is the wrong word; it's more like regulatory capture. If all LLMs had to quote their sources, then along with all the other human-oriented changes we want to make, only the big players would be able to comply, effectively making it hard to enter and compete. Judging by the AI executive order released yesterday, the current big players want launching a new LLM product to be more like opening a new bank than opening a lemonade stand.


Give me the citations every day of the week. The source of information matters. For example, I don't rely on any ZFS info or opinions I find online if I can't verify it came from a contributor or highly reputable person that has a lot of experience with ZFS.

If you want to show the warts of all these LLMs, ask it about ZFS if you know enough to spot the commonly parroted misinformation that plagues the internet.

IMHO, these systems look super useful if they're citing sources and they're worthless without.


Transparency is paramount. If OpenAI doesn't want to make its proprietary software open to academic scrutiny, I completely understand. However, if their app is going to play an educational role, then sources and citations are mandatory in academic content.


Funny you bring up ZFS specifically. I embarrassed myself a couple weeks ago by parroting something GPT-4 told me about ZFS to someone on reddit, which turned out to be completely wrong.


why not both?


I asked it to write a program that I've written before, to compare with GPT-4. It didn't really get what I was asking for; GPT-4 understood it perfectly and is ready to continue prompting toward completion.

https://www.phind.com/agent?cache=cloeowfla000dl1084ermly3c vs https://chat.openai.com/share/4147da33-3669-4657-88fa-3a9dfc...

Might not be representative of the whole thing, but it went on about random things I didn't ask about and gave basic information I already knew.


The Pair Programmer mode currently either uses GPT-4 or GPT-3.5 (if you've run out). Please try again in the default search mode to use the Phind Model.

Using the Phind Model in the default search seems to work well: https://www.phind.com/search?cache=ln6dpdtv5auwn4cq1ofg3gs9


https://www.phind.com/search?cache=z5odlx0o9lspzpfm4sfpp131

Way way better, I'm stunned. Congratulations on this


Even though the phind model is selected? Is there a technical reason Phind doesn't do pair programming yet?


It's because we haven't updated the Phind Model to support function calling yet but we're working on it.


Can you share what your long term monetization model is? I'm noticing Phind is free to use right now.


We have a Pro plan where you can get (virtually) unlimited GPT-4 and soon, an even faster Phind model. https://phind.com/plans


Is there something you're doing with GPT-4 that would make me want to use it through you vs just using it myself?


The problem is that it's doing a search for your relatively niche problem, and probably getting pretty poor results. The text from the search is then weighted more heavily than the base model, but it's relative junk, so the model performs better without the additional (unhelpful) context.

You see this with Bing search on ChatGPT as well, and I’ve seen it in my own projects.


> it supports up to 16k tokens

> Llama 1 supports up to 2048 (2K) tokens, Llama 2 up to 4096 (4K), CodeLlama up to 16384 (16K). [0]

This is wild to me.

The token window is one of the limiting factors for having an AI that can actually remember you and past conversations. Having a large window is key for future AI applications that involve long running conversations (weeks, months, years). The tech is already very impressive, but imagine it as it becomes more like an actual pair programmer and remembers all the various things it's learned and worked on with you in the past.

[0] https://huggingface.co/docs/transformers/main/model_doc/llam...


640k is enough for anyone


Extending that analogy, imagine what one could do with 128B tokens.

On cast off/cheap workstation/server hardware.


Token window size is being virtualized with the likes of MemGPT, so its effect will diminish.


Still waiting for the day that medium-term memory (token average pooling, like in sentence transformers) becomes used for this. It's staring all of these companies in the face and apparently no one thinks to implement it.


I've been thinking along the same lines. The token window IMO should be a conceptual inverted pyramid, where the most recent tokens are retained verbatim but earlier context is compressed/pooled more and more as the conversation grows. I'm sure there's some effort/research in this direction; it seems pretty obvious.
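
Roughly the shape I have in mind, as a toy numpy sketch (the chunk sizes and pooling-by-averaging here are just placeholders for whatever real compression scheme you'd use):

  import numpy as np

  def pyramid_pool(token_embs, keep_recent=512, base_chunk=64):
      """Keep the most recent token embeddings verbatim; mean-pool older ones
      into chunks that double in size the further back in the context they are."""
      recent = token_embs[-keep_recent:]
      older = token_embs[:-keep_recent]
      pooled, chunk = [], base_chunk
      while len(older) > 0:
          piece, older = older[-chunk:], older[:-chunk]
          pooled.append(piece.mean(axis=0))   # one vector summarises the whole chunk
          chunk *= 2                          # older history gets compressed harder
      return np.vstack(list(reversed(pooled)) + [recent]) if pooled else recent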


But some of the earlier tokens are also the most important ones, right? Like the instructions and rules you want it to follow.


Phrase embeddings could bring a 32x reduction in sequence length because:

> Text Embeddings Reveal (Almost) As Much As Text. ... We find that although a naïve model conditioned on the embedding performs poorly, a multi-step method that iteratively corrects and re-embeds text is able to recover 92% of 32-token text inputs exactly. We train our model to decode text embeddings from two state-of-the-art embedding models, and also show that our model can recover important personal information (full names) from a dataset of clinical notes.

https://arxiv.org/abs/2310.06816


They are. Moreover, the idea that AI companies are missing and/or not implementing this “obvious” tactic is hilarious. Folks, these approaches have profound consequences for training and inference performance. Y’all aren’t pointing out some low hanging fruit here, lol


Actually, yes I am pointing out low hanging fruit here. These approaches do not have "profound consequences" for inference or training performance. In fact, sentence transformer models run orders of magnitude more quickly. Performance penalties will be small.

Also, I actually have several top NLP conference publications, so I'm not some charlatan when I say these things. I've actually physically used and seen these techniques improve LLM recall. It really actually works.

Here's more examples of low hanging fruit. The proof in that they work is in the implementations which I provide. You can run them, they work!: https://gist.github.com/Hellisotherpeople/45c619ee22aac6865c...

Check yourself before you try to check others.


> In fact, sentence transformer models run orders of magnitude more quickly. Performance penalties will be small.

They do not. Sentence transformers aren't new, and have well-known trade offs. What source or line of reasoning misled you to believe otherwise?

> Here's more examples of low hanging fruit. The proof in that they work is in the implementations which I provide. You can run them, they work!: https://gist.github.com/Hellisotherpeople/45c619ee22aac6865c...

This...is your blog about prompt engineering. What do you believe this "proves"? How have you blown away current production encoding or attention mechanisms?


Concur. LLM are still very young. We’re barely a year out from the ChatGPT launch. Everyone is iterating like mad. Several stealth companies working on new approaches with the potential to deliver performance leaps.

You ain’t seen nuthin’ yet…


Out of curiosity, why do you think the answer would be so simple and also completely untested?


Too much money is being thrown around on BS in the LLM space, and hardly any of it is going to the places where it matters. Ignorance on the part of investors.

For example, the researchers working hard on better text sampling techniques (i.e. https://arxiv.org/abs/2202.00666), or on better constraint techniques (i.e. like this https://arxiv.org/abs/2306.03081), or on actual negative prompting/CFG in LLMs (i.e. like this https://github.com/huggingface/transformers/issues/24536) are doing far FAR more to advance the state of AI than dozens of VC backed LLM companies operating today. They are all laboring in relative obscurity.

HN and the NLP community have some serious blind spots when it comes to knowing how to exploit their own technology. At least someone at Andreessen Horowitz got a clue and gave some funding to Oobabooga - still waiting for Automatic1111 to get any funding.


Another curiosity: what do we estimate (if it's even possible to estimate) the context window of a human to be? Obviously an extremely broad question, and of course it must have some sort of decay factor... but it would be interesting to get a rule-of-thumb number in terms of token count. I can imagine it's massive!


Human memory, in my limited understanding, doesn’t have the bifurcation of weights and context that LLMs do. It’s all a bit blurrier than that.

Something interesting that I heard from people trying to memorize things better is that memory “storage space” limits for people are essentially irrelevant. We’re limited by our learning and forgetting speeds. There’s no evidence of brains getting “full”.

Think of it like a giant warehouse of plants, with one employee. He can accept shipments (learning). He can take care of plants (remembering). Too long without care and they die (forgetting). The warehouse is big enough that it is not a limiting factor in how many plants he can keep alive. If it was 10x bigger it wouldn’t make a bit of difference.


I don't think it's massive. In fact, since it's roughly equivalent to working memory, I suspect it's on the order of 100 tokens at most.

It's just that, unlike these AIs, we're capable of online learning.


I know it isn't popular, but I wish there was a way to use this inside Emacs. Or, vim. I just don't want to use VS Code anymore.


The standardizing on VS code is one of the saddest developments over the last several years IMHO. I think it's great that VS Code exists, but we're headed for a world where you have to use VS Code if you want the best tooling because it won't support other options. The same thing happened with Java dev and IntelliJ, and IMHO it has been extremely unhealthy for the ecosystem. I'm immensely glad that Copilot supports vim, but I'm fearful that it soon won't.


Didn't VS Code standardise language servers, making it much easier for all the other close-to-IDE text editors to integrate? Is it really that sad?


Very fair point. Vim has benefited tremendously from that effort.


Same could have/could be said about Jetbrains products. People are likely always going to use vim/emacs and create tooling around whatever new hotness exists for them. And honestly? VS Code is just a new iteration on how vim/emacs work in a lot of ways: Providing a place to edit text and then a bunch of plugins that do things with that text.

And if you want vim/emacs to keep living, then you should spend time helping! Create your own extensions, maintain/contribute to existing ones, etc. They will only die out when the last person actively contributing to them stops, so keep the chain of people going :)


> The same thing happened with Java dev and IntelliJ, and IMHO it has been extremely unhealthy for the ecosystem.

While I agree, at the very least IntelliJ stood up on its own as a good IDE. I cut my baby teeth on Eclipse, and as soon as I realised how good IntelliJ is, I jumped ship without looking back. The same can barely be said about VS Code.


If only the depth of our feelings for Emacs counted for more in the market.

There's an argument that music and the arts are dumbed down by the fact that, for instance, making an album worth $10 to millions of people pays way better than making an album worth a million dollars to tens of people, since the album is going to get priced at $10 one way or the other. It only just now occurred to me that the same phenomenon applies to tools.


In Vim, I tried to assign a shortcut to send the selected text to Phind (or any other LLM) and came up with this:

:'<,'>y|call system('firefox <url>?q='.shellescape(@*).' &')

The only problem left is that the text is not urlencoded.

There probably is some elegant way to urlencode it. But I did not come up with one yet.
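
One workaround might be to shell out to python3 for the encoding (e.g. replacing shellescape(@*) with system('python3 urlencode.py', @*) in the mapping - untested); the helper itself is tiny:

  # urlencode.py -- read text on stdin, print it percent-encoded for use in a URL query
  import sys
  from urllib.parse import quote

  print(quote(sys.stdin.read()), end="")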


https://stackoverflow.com/a/76488059 claims to have one, though it's not explained.


I've hacked together a basic Emacs ollama api integration that does simplistic code completion against a local LLM from someone else's copilot example. It's slower than I want (about 7 seconds per inference on my M1 mac, typically) and very stupid about what context it sends, but nevertheless: it's just, and only just, enough to be useful. Hadn't considered publishing it because it relies on a python façade to convert copilot-style requests and responses back and forth to ollama, but if there's interest I'll spruce it up and get it out.


From downthread, just use ellama. They're further ahead than me by the looks of things.


Pretty sure GitHub Copilot has emacs/vim integration.


It does, although not the most recent features. I use the compatible features in Vim and I really like it. Not enough to switch editors though.


I have been a vs code power user and switched to pycharm two years ago and will never go back because of the features for working with multiple environments and projects in pycharm.

Working with Phind needs to be available in PyCharm for me to consider switching from GPT-4 to Phind. Chatting with Phind about my local files is the feature I am looking for.


Maybe ellama[1] would work? It doesn't support Phind yet, but a provider could be created for the underlying connection package llm[2].

[1]: https://github.com/s-kostyaev/ellama

[2]: https://github.com/ahyatt/llm



You and me both brother. LSP integration seems the way forward.



Awesome model, from a quick run-through comparison: it's comparable in results to GPT-4, with web search and references as a plus, and it runs faster. Two small nitpicks:

- Dark mode is hard to read; the answer text font has too much weight and brightness, which makes long paragraphs of non-code text hard to read. Light mode is obviously too bright overall, but it's already nighttime where I'm at, so maybe tomorrow at noon I'll have another opinion. I'd prefer gray (dark, i.e. OpenAI) and sepia (light, i.e. HN) as backgrounds when long lines of text are involved.

- Pricing page and ties to GPT-4: what does "500+ best model uses per day (GPT-4)" mean? What's the "GPT-4" part for? I saw I can pick GPT-4 as a model on the landing page, but I just don't get the best model/GPT-4 thing. Is Phind announcing it's a competitor while also proxying GPT-4? Sorry, I'm not up-to-date on GPT-4 "resellers" and the story behind Phind; it's just weird when it announces it "beats GPT-4" and then the pricing is about GPT-4 usage.


Thanks for the feedback. We also support GPT-4 as an answering model so users can pick and choose what's best for their use case, but we recommend the Phind Model for the majority of users.


Why is there an 8x difference in price-per-search between Plus and Pro?

I always shy away from stuff like this because I view it as one of two things. Either I'm getting ripped off if I pay for Plus, because 8x the cost to me means your margin is huge, or I'm getting subsidized by you with the Pro version which means I can't rely on it lasting long term.

I also dislike daily limits for search. My search usage isn't uniform day-to-day. I might go most of the month without searching for anything and then do a ton of searching over 2-3 days when I'm trying to learn something. So I'll be idle most of the month and then not have enough searches on the days I actually want to use it.

I prefer the model used by a lot of pre-paid services. Let me deposit a chunk of money (ex: $20-50 minimum) and charge me per search until my money is gone. That way I'm not "losing out" if I don't use it every day and I can "burst" as high as I want when I'm trying to learn something.

If the pricing is based on a certain amount of loss (on my side) from the use-it-or-lose it model, I don't like that. I want simple, fair pricing, not a complex pricing scheme where the primary purpose is to get me to overpay for my usage.


Plenty of people know their upper limit. The ability to pay 50% less if that limit applies is a feature, not a bug. (This applies to any service -- I am not affiliated with phind except as an occasional user).


Phind Plus is $15/month and Phind Pro is $30/month. There is a 2x price difference, not an 8x difference. And Phind Pro comes with (virtually) unlimited GPT-4 uses.

We understand that the incentives of setting daily limits for search aren't great, which is why the Phind model is unlimited for free. GPT-4, however, is unfortunately too expensive for us not to charge past a certain usage threshold.


Plus costs $0.016 per search and Pro costs $0.002 per search.

https://www.phind.com/search?cache=wgyz13tg4jkbl9pklptmpds5


To me the $15/mo plan is just bait so users pick the target $30/mo plan. Why would you pay $0.016/search when you can pay 8x less and feel smart about making that choice?

edit: looking at it again, I think the $15/mo plan is actually just for people who want Phind "private", so that their data is not used for training.


Cost per search isn't really a great metric. For me, I hit the cap of 30 searches/day pretty easily, but 500 is pretty hard to hit. It's just a question of which tier matches my volume.


> You can now get high quality answers for technical questions in 10 seconds instead of 50.

ChatGPT 4 does not take 50 seconds to answer, so I don't understand this comparison.


Recently I've used gpt 4 and yes it does take up to a minute even for easy questions.

I've asked it how to scp a file on Windows 11 and it'll take a minute to tell me all the options possible.

If this takes 1/5th the time for equivalent questions, I'd consider switching


Not my experience at all. Are you counting the entire answer in your time?

If so, consider adding one of the “just get to the point” prompts. GPT4’s defaults have been geared towards public acceptance through long-windedness which is imo entirely unnecessary when using it to do functional things like scp a file.


LOL, it’s not just for “public acceptance”. Look up Chain of Thought. Asking it to get to the point typically reduces the accuracy.


> LOL, it’s not just for “public acceptance”. Look up Chain of Thought. Asking it to get to the point typically reduces the accuracy.

Just trying to provide helpful feedback for you: this would have been a great comment, except for the "LOL" at the beginning, which was unnecessary and demeaning.


You are being snarky but you're right. I have scripts set up to auto-summarise expansive answers. I wish I could build this into the ChatGPT UI though.


I know this is silly, but I've had great success asking chatgpt to summarise chatgpt's answers.


Try the custom instructions feature


The words "briefly" or "without explanation" work well.

By keeping the prompt short, it starts generating output quicker too.


Yeah, I would say this is a prompting problem and not a model problem. In a product area we're building out right now with GPT-4, our prompt (more or less) tells it to provide exactly 3 values and it does that and only that. It's quite fast.

Also, use case thing. It is very likely the case that for certain coding use cases, Phind will always be faster because it's not designed to be general purpose.


This isn't a fair comparison because I have custom instructions that mention being brief but complete, but I did "how to scp a file on Windows 11"

ChatGPT4: 14 seconds

phind with "pair programmer" checked: 65 seconds

phind default: 16 seconds


Take a look at the AutoExpert custom instructions: https://github.com/spdustin/ChatGPT-AutoExpert

It lets you specify verbosity from 1 to 5 (e.g. "V=1" in the prompt). Sometimes the model will just ignore that, but it actually does work most of the time. I use a verbosity of 1 or 2 when I just want a quick answer.


> I've asked it how to scp a file on Windows 11 and it'll take a minute

https://imgur.com/a/iqxOJUV was 6.5 seconds.

https://imgur.com/a/pQFfWli was 15.

You can tell they're GPT-4 because the logo is purple (the logo is green when using 3.5).


ChatGPT4 is more often than not noticeably slow enough that I question why I pay for it.


Sometimes it's insanely quick - like GPT-3.5 Turbo or a cached answer or something.


We find that it takes around a minute for a 1024-token answer. Answers to less complex questions will take less time, but Phind will still be 5x faster.


That really depends on the complexity of your request and any prompt engineering techniques in use for that request. Especially with "think step by step" in certain contexts, it can improve answer quality at the expense of generation time (because more tokens are emitted).


Ran a quick test with a Rust async code snippet that contains an error. Compared with GPT-4, it gives a far clearer solution, with linked sources to learn more! Super impressive!


Amazing, that's great to hear.


Is it possible to output all steps of solutions in a single copyable block? I don't want to copy 4 separate blocks.


When I use it I often give a final prompt like "Now combine the above answers together into a function that accept the following arguments...". This has worked well for my use cases.


You can tell it that in a followup. Or, configure an answer profile and tell it to use that style: https://phind.com/profile.


Well, neither GPT4 nor this Phind model was able to answer my torture test: "Write amaranth code that can be used to control the readout of a frame from a kodak CCD with 4096 columns and 2048 rows."

Which, yes, is missing a lot of detail (you could feed in a datasheet - I have).

But Phind goes off on using pyserial (?!), and GPT4 assumes amaranth is a hypothetical CCD control library and writes a useless CCD control class using that hypothetical library.

Edit - Phind at least acknowledged that amaranth exists, unlike GPT4 with this prompt: "Write amaranth code that can be used to control the readout of a frame from a kodak CCD using an lattice FPGA with 4096 columns and 2048 rows. Assume the design will be hooked up to a larger litex SoC "


That’s torture for humans as well. The key to LLMs is communicating clearly to the information cloud.


Sure, but it's a good example of how far certain domains still have to go. These datasheets should be in the models' training data - at least one CCD datasheet - and Verilog & (migen | nmigen | amaranth) certainly are.

Controlling a CCD is actually pretty easy; I built (very simple, but working) controllers for several CCD chips in undergrad doing research for the ATLAS detector. You basically just clock rows out, N column clocks per row. Reset first. I'd expect a senior undergrad EE student to be able to design a simple core in a few class projects.
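
As plain-Python pseudocode (not Amaranth, and the chip interface here is a made-up stand-in), the readout sequencing is roughly:

  class FakeCCD:
      """Stand-in for the real chip interface -- entirely hypothetical."""
      def reset(self): pass              # global reset pulse
      def parallel_clock(self): pass     # shift one row into the serial register
      def read_pixel(self): return 0     # serial clock + sample the output amplifier/ADC

  def read_frame(ccd, rows=2048, cols=4096):
      """Conceptual readout: reset, then for each row shift it into the serial
      register and clock every pixel of that row out through the ADC."""
      ccd.reset()
      frame = []
      for _ in range(rows):
          ccd.parallel_clock()
          frame.append([ccd.read_pixel() for _ in range(cols)])
      return frame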


I have no idea what that means (even after googling it) lol. This is how my local WizardLM-70B responded to your prompt.

https://pastebin.com/BCAthV8y


Will you be offering the model as an API service? The product my team is working on would benefit from a significantly faster and possibly better performing model than GPT-4. If you're planning on keeping pace with competitive models we'd love to integrate the use of your model into our service.


If we get enough demand that's definitely something we'll consider. We're still a small team, however, and we do everything in our power to not get distracted from our main mission.


Please consider releasing an API. Having a faster alternative to GPT-4 would be amazing for so many use cases.

Especially for agents that do function calling.


If you offer an API then it can be used with tools like https://aider.chat/, which is the best way to use LLMs for coding. But if it's only available via the web, that's not possible. BTW, this is the main reason I pay for the OpenAI API.


Makes sense, we're also very small (pre-seed) so definitely no cash cow for you guys yet. We probably shouldn't be prematurely optimizing our prompting performance as it's not really a bottleneck, but a 4x improvement just by swapping an API would be too good not to act on.


If you offer an API you don't have to maintain a Visual Studio plugin. Trying to compete with tools like Cursor would be the real distraction.

And Cursor is just the start - there will be innovative workflows built on top of APIs you can't predict. You're missing out not having developers build an ecosystem for you.


just as a point to consider, NOT having an api (and thus no integrations into my editors of choice) is the main reason i haven’t given y’all a fair test run. i’d almost rather not know what i’m missing (though the threads here have convinced me to give it a shot.)


In my experience, Phind is not as good as GPT4, but it's by far the second best LLM for programming. I find that tremendously impressive considering they are competing against the whole world for that title right now.

I agree with the assessment about consistency being its major flaw. While with GPT-4 I can continue a conversation for quite a long time, Phind easily loses the required context. Perhaps it has to do with summarization capabilities, or messing with the context window has these types of side effects.


Have you tried clicking the model selection dropdown and enabling "Ignore web results"? That can help with keeping context for complicated design tasks.


Been using Phind for a bit now and started paying for pro

They're smashing it and can't do enough if you report an issue. They've also started a weekly voice call with senior devs to discuss algos and such, like a surgery; only 10 people join at the moment.

Don't think I've ever recommended anything as much as I have these guys in the last couple of months.


I use Phind daily, including the VSCode extension, and I love it. Much better than anything ChatGPT is able to come up with, and the code it generates requires little-to-no modification to work properly. Very big fan!


Far as I can tell it isn't possible to hook up the VSCode extension to the Phind model, only GPT-4. Do you know any different?


First off, congrats on building such a cool product. I love that I can just "jump into it" which is great.

Note that I'm not really a power user of these GPT style tools- here are my questions:

Is it possible to get right to the code without the ELI5 and general information?

Do you guys offer an API? I was browsing on my small iphone so maybe I missed this info.

Could you give an overview, for someone like me, of how something like Phind works technically? You mentioned those H100s, but at a very high level, without revealing any "secret sauce", how does this GPT work, from my input to getting a response?

Good luck!


Could you open source these great models? OK yes you need a competitive advantage. So maybe open source them when you are say 2 models ahead in production?

In any case I am happy there is some competition and that it has come from a more pragmatic scrappy space than one of the multiple billion dollar funded places.


Can we have a larger discussion about the tradeoffs that come with open sourcing a model?

When fb released Llama they obviously gained a huge amount of developer goodwill but it also required them to invest a serious amount of their own developer time to engage with the community.

I'm asking what the community can offer the company in return. Or is this just self-abnegation by the company that releases the model?


I question the word "required". They, or anyone else releasing an open source product into the world, don't owe anyone anything, least of all support. As long as there are enough instructions to run the thing, you are perfectly within your rights to let the community sort out the rest between themselves.


I've noticed that even though "They don't owe anyone anything," the community doesn't actually adhere to it. If they shove code over the wall like FAANG companies do now, it appears to upset the community, who will then treat them with hostility.


I don't know what model runs on Phind's site right now, but in August Phind published a fine tune of CodeLlama 34B

https://huggingface.co/Phind/Phind-CodeLlama-34B-v2


I gave it two tries; GPT-4 was much better in both cases. I tried it with two Leetcode questions. It came back with an empty response for one, and provided worse code (an O(n^2) solution when it can be done in linear time) for the other.

GPT-4, on the other hand, provided a good answer to both questions. Also, I guess the UI is buggy w.r.t. code formatting: it thinks the following line is code and switches to a code block.

``` You are given an array prices where prices[i] is the price of a given stock on the ith day. ```

The only downside of GPT-4 for me right now is its slowness.


GPT-4 has ingested all of Leetcode, you can literally just type "leetcode 100 python" and it will regurgitate a response for you.

Only exception I found is with some of the Leetcode Premium questions, you might have to actually type in the problem statement, but it's still very likely that multiple solutions have been ingested from GitHub and elsewhere.


I suggest you try enabling "Ignore search results" from the model dropdown for these types of questions. The web results can be distracting for the model for Leetcode-type questions.


I see you've had to suggest this a few times in this thread, and in my experience I would agree with the suggestion. I wonder if you could have a simple GPT model decide when ignoring search results would improve the result, and do it automatically.


Interesting idea.


I tried with that option enabled and now it can't generate code at all. Here's my prompt:

``` You are given an array prices where prices[i] is the price of a given stock on the ith day.

Find the maximum profit you can achieve. You may complete at most two transactions.

Note: You may not engage in multiple transactions simultaneously (i.e., you must sell the stock before you buy again).

Write Python code to solve this: def maxProfit(self, prices: List[int]) -> int: ```

Output:

``` It seems like you want to find the maximum profit that can be achieved by buying and selling stocks, with the constraint that you can only make at most two transactions. Is that correct?

Could you please provide some example input and output to help me better understand your requirements? ```

I also tried a more basic prompt, but the output is not what I'd consider good code.

Can you maybe share some examples where we can see how it exceeds GPT-4's capabilities? Thanks!
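For reference, the two-transaction problem quoted above does have a well-known linear-time dynamic-programming solution. A minimal standalone sketch of the standard approach (not output from either model):

    from typing import List

    def maxProfit(prices: List[int]) -> int:
        # Best cash position after each of the four states:
        # first buy, first sell, second buy, second sell.
        buy1 = buy2 = float("-inf")
        sell1 = sell2 = 0
        for p in prices:
            buy1 = max(buy1, -p)          # spend p on the first purchase
            sell1 = max(sell1, buy1 + p)  # sell the first holding at p
            buy2 = max(buy2, sell1 - p)   # reinvest the profit in a second purchase
            sell2 = max(sell2, buy2 + p)  # sell the second holding at p
        return sell2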



In my own RAG implementations in the industrial sector, I've found it effective to first have the AI decide whether it needs to search at all. If it doesn't, the answers are much better.
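A minimal sketch of that routing step, assuming a generic `ask_llm(prompt)` helper and a `search_documents(question)` retrieval step (both hypothetical stand-ins for whatever model API and index you actually use):

    def ask_llm(prompt: str) -> str:
        # Hypothetical helper: call your LLM API of choice and return its text.
        raise NotImplementedError

    def search_documents(question: str) -> str:
        # Hypothetical helper: your retrieval step (vector store, keyword index, ...).
        raise NotImplementedError

    def answer(question: str) -> str:
        # First let the model decide whether external search is needed at all.
        decision = ask_llm(
            "Answer strictly YES or NO: does the following question require "
            f"searching external documents to answer well?\n\n{question}"
        )
        if decision.strip().upper().startswith("YES"):
            context = search_documents(question)
            return ask_llm(f"Using this context:\n{context}\n\nAnswer: {question}")
        # Otherwise answer directly, without distracting retrieved snippets.
        return ask_llm(question)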


Hmm, I wonder what kind of code quality could be achieved by looping from Phind to GPT to Copilot for multiple iterations: asking for criticisms of the code, then asking for code that addresses those AI-generated criticisms, and so on, until it produces something better than I would, in a fraction of the time.


You could cook something like this up using Microsoft AutoGen. It lets you daisy-chain models.
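A rough sketch of that critique loop, using a generic `ask_model(name, prompt)` helper (hypothetical; AutoGen, or any thin API wrapper, could fill that role):

    def ask_model(name: str, prompt: str) -> str:
        # Hypothetical helper: route the prompt to the named model's API.
        raise NotImplementedError

    def refine(task: str, rounds: int = 3) -> str:
        models = ["phind", "gpt-4", "copilot"]
        code = ask_model(models[0], f"Write code for this task:\n{task}")
        for i in range(rounds):
            author = models[i % len(models)]
            critic = models[(i + 1) % len(models)]
            critique = ask_model(critic, f"List concrete problems with this code:\n{code}")
            code = ask_model(
                author,
                f"Rewrite the code to address these criticisms:\n{critique}\n\nCode:\n{code}",
            )
        return code

In practice you'd also want a stopping condition, e.g. when the critic stops finding substantive issues.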


I tried this question and GPT-4 did way, way better at getting to a final answer. Phind was horribly wrong. I can't help but think something is off with your eval, given just how badly Phind did on this.

I want to make an interactive plot in Colab where I can show

X axis is interest rate of a 15 year mortgage. Y axis is the relative advantage of buying a house vs. renting in terms of total net worth at 15 years.

Assume a monthly budget for renting + investing or buying a house of 10k

Plot different lines for a few different market returns.

Make a slider that controls the total size of the loan.


Seemed to give plausible results for me: https://www.phind.com/search?cache=lswmiuewv2l33jt337dgrsho


    def calculate_relative_advantage(interest_rate, loan_size, market_return):
        # Your calculation logic here
        pass

ChatGPT actually implements it.


Just prompt it to implement the function


I did and it was wrong. I was responding to the claim that they got a plausible result.


This model clearly makes a much better search engine than google/kagi/bing/etc.

I've been searching for an obscure connector -- the 8-pin connector you'd find on the cable that delivers power to a GPU, but in a form that can be wave-soldered. I've spent hours searching all the big electronics distributors -- no luck. This thing found it in seconds.

https://www.phind.com/search?cache=a7e9u5l5aw1r8ufls0icpb63

This is a very common connector but in a highly unusual form-factor. Molex refuses to make wave-solderable versions of it.

Edit: the first link does not lead directly to the obscure connector, but to the website of a company that does sell it. Here is the obscure connector: https://www.moddiy.com/products/Special-Mini-Low-Profile-ATX... Maybe it just got lucky.

On the other hand, it hallucinated the crap out of a very straightforward question "how do i connect the wake# pins when bifurcating a pcie port?" -- the answer is that it's an open-drain pin so (unlike the clock pins which need a buffer chip) you just wire them both together:

https://www.phind.com/search?cache=zf9witr85q740l4s3vjwzf01

Then it tried to write a bunch of code for an obviously-not-coding question. Not so great.


The first reference returned was a plug intended to be used with wire connections, not wave-solderable at all. The other two were a site search that returned nothing relevant and a large section of a manufacturer's website, respectively.


Correction, the first link was to a vendor who does in fact sell the obscure connector:

https://www.moddiy.com/products/Special-Mini-Low-Profile-ATX...

I'm just elated I finally found out where to buy these damn things from.


Phind co-founder here! Here's a link to our blog post: https://www.phind.com/blog/phind-model-beats-gpt4-fast


hi; great work. so is this more fine-tuning on Phind-CodeLlama-34B-v2?

will there be api access soon?

also: will it be open-source at some point? thanks


Thank you. Yes, it is the 7th iteration that started with our open-source models. We do plan to open source this model as well down the road, once we've released a few more generations.

API access is on the roadmap but we have no time estimates for when we will build it. We're trying to not get distracted from our main mission :)


I see. Thanks!

So Phind's main mission is to overtake Google right? ;)


"I have a table called `onboardings` with the state field. I want to return how many people we have in each state. The Postgres query should return the state, how many people count, and what percentage do those people represent."

Claude-2: correct response, and it rounds the percentage, which is a nice assumption to make for me:

    SELECT state, 
           count(*) AS people_count,  
           round(100.0 * count(*) / (SELECT count(*) FROM onboardings), 2) AS percent
    FROM onboardings 
    GROUP BY state
    ORDER BY people_count DESC;
Phind, correct response as well! Really fast too!:

    WITH state_counts AS (
     SELECT state, COUNT(*) as count
     FROM onboardings
     GROUP BY state
    ),
    total_counts AS (
     SELECT COUNT(*) as total
     FROM onboardings
    )
    SELECT sc.state, sc.count, (sc.count::decimal / tc.total::decimal) * 100 as percentage
    FROM state_counts sc, total_counts tc
    ORDER BY sc.count DESC;


I've been a pretty heavy user of phind and have been very satisfied! Haven't been using it to write code for me but to ask about features and docs and it's been pretty incredible.


I was just discussing using ChatGPT to make deploying serverless code easier.

I gave this as an example

“create a CDK typescript app that deploys a lambda + API Gateway where the lambda works with Get request and a dynamodb table. The lambda should have permission to read and write to the Table”

It wrote the code perfectly. I wanted to see if it was trained on the AWS APIs.


What's the best way to use an LLM with a large codebase that isn't RAG? Ideally we could have the full source in the context or already trained into the model... I was thinking I could set something up to fine-tune a model overnight so that every morning I'd have a fresh one ready. Any ideas?


I don't use LLMs in my workflow frequently. When I do, I have a hard time making sense of the very specific and long answers to my questions. Especially if I don't know the answer, it is hard to figure out whether the model's long answer points in the right direction or misses my point completely.

Maybe I'm not knowledgeable enough. But asking questions I already know the answer to has no real-life use case other than testing the model. Which, of course, might be a valid use case for some.

Having a way to let users specify their own level of knowledge might help produce answers that are better tailored to the person asking the question.


Have you tried custom instructions? I use this:

My dad always used to say: Everything that can be said, can be said simply. I prefer top-down structured, short and thoughtful responses.


So I gave it this prompt:

> I need a typescript function which takes in an object with an id string property and a string message property, and also takes an array of search strings, and returns a mapping of search strings to matching message ids

The response I got was close, but it assumed that each search string would match only one message, so it returned Record<string, string>. I fed this to GPT-3.5 and it answered 10x faster with the correct return type.

This is a slightly tricky example, because it requires the model to infer that multiple message matches are possible. But I think that it’s interesting that ChatGPT nailed it despite not using any chain of thought.


> I need a typescript function which takes in an object with an id string property and a string message property, and also takes an array of search strings, and returns a mapping of search strings to matching message ids

Your prompt is wrong. You want a function that takes an array of id/message objects, not an object.

It's quite impressive that GPT is just able to correct for that. As a human, I would first ask what you actually mean, because your prompt appears to be unclear.


Actually it was meant to be an object mapping of ids to messages, but yes, I phrased it weirdly. Both LLMs understood that part, though.


The results I get are so-so. The rubric I use to evaluate coding LLMs is to ask them to create a Python script that determines whether the contents of a given directory have changed since the last time the script was run. This should be done recursively, handle files being added, removed, or modified, and be based on the contents of the files rather than timestamps.

When I asked it as one statement it performed OK, but when I added more specifications in follow-up statements, it kept trying to go down one path even though I told it to do it a different way. A solid start, but it definitely needs some improvements, IMO.
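For reference, a minimal version of that rubric task (recursive, content-based via SHA-256, detecting added/removed/modified files by comparing against a saved manifest) might look like this sketch:

    import hashlib
    import json
    import os
    import sys

    MANIFEST = ".dirstate.json"

    def snapshot(root: str) -> dict:
        """Map each file's relative path to a hash of its contents."""
        state = {}
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                rel = os.path.relpath(path, root)
                if rel == MANIFEST:
                    continue  # don't hash the manifest itself
                with open(path, "rb") as f:
                    state[rel] = hashlib.sha256(f.read()).hexdigest()
        return state

    def main(root: str = ".") -> None:
        manifest_path = os.path.join(root, MANIFEST)
        old = {}
        if os.path.exists(manifest_path):
            with open(manifest_path) as f:
                old = json.load(f)
        new = snapshot(root)
        added = sorted(set(new) - set(old))
        removed = sorted(set(old) - set(new))
        modified = sorted(p for p in new if p in old and new[p] != old[p])
        print("changed" if (added or removed or modified) else "unchanged")
        print("added:", added, "removed:", removed, "modified:", modified)
        with open(manifest_path, "w") as f:
            json.dump(new, f)

    if __name__ == "__main__":
        main(sys.argv[1] if len(sys.argv) > 1 else ".")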


This is a problem that human programmers screw up… regularly.

E.g.: the efficient and robust way to monitor file changes on Windows is to read the NTFS change journal. Within a single process lifetime there are other change-notification APIs as well. Most software does neither, and is either very slow or misses changes…


Thanks for the feedback. We're working on improving consistency and precise instruction following in followups.


If you can make this best in class for code beyond just HumanEval, wow, that's the differentiator. Add Cursor, Replit, and VS Code support after. If it's best in class for code, it would be my daily driver.


I can't wait to see this open sourced; there are a lot of sampling strategies that help with coding.

And I also can't wait to see how much Phind will improve further if the Glaive dataset is added onto it.

Edit: Contrastive search, dynamic temperatures.


To the folks in this thread comparing the model with GPT-4, are you comparing it with GPT-4 in ChatGPT, or with GPT-4 on Phind? Because it should be the latter for a fair comparison. The Phind response seems to be heavily based on the top search results, which may affect the quality of the response.

(An even more interesting question would be to compare ChatGPT GPT-4 with Phind GPT-4, i.e. GPT-4 with relevant web results in context.)


I've been using GPT-4 for coding since it launched. I did try Phind when you launched it with GPT-4 and found it useful, but I use VS Code and hate having to switch. I see you do have a VS Code extension now (great job!). I tried the Phind model and at first glance it definitely looks better than GPT-4 w.r.t. coding. Do you have any plans to provide this model as an API?


Pretty big jump on the Java eval. What is the reason for Java being so notoriously difficult for LLMs? Never mind, I asked Phind[1] and it listed all the complexity... but do you have any tips or tricks for working with that language in your model?

[1] https://www.phind.com/search?cache=u3mnj3iwmjvgqlyf60bnbqo1


It failed for me at a much more basic level.

I asked 5 different, and increasingly explicit, variations of the following question: "Can you generate HTML and CSS for a JPG mockup I'm going to give you?"

Each time it answered along the following lines: "Sure, here is how you can create HTML and CSS from a JPG mockup. Follow this process..."

In my experience this never happens with GPT-4.


I've not seen that anywhere, ChatGPT does image input now? Do you have examples of the output from feeding it a JPEG?


Yep, ChatGPT did a pretty impressive job when I tested it a few days ago. Just grab a mockup from a Google search and prompt ChatGPT-4 to generate a web page. I'm sure your mileage may vary.

However, my point was that Phind's answer was worse than a No, or a hallucinated attempt would've been. By saying "Yes, here is how YOU can do it...", it left the impression that it didn't even understand the question.


Thanks for expanding on your answer.


> Show HN: Phind Model beats GPT-4 at coding

Does it? I don't see any evidence of this strong claim in your post, and I think it's quite deceptive how the only link is to a benchmark of open source models (which doesn't include GPT-4). I've tried Phind a few times in the past when it made equally strong claims and been somewhat unimpressed. (To be fair, comparing anything to GPT-4 is tough!) I think it would strengthen your position significantly to simply say that you're the best of all open-source models.

To be honest though I've been completely ruined by https://cursor.sh/; copying and pasting results back and forth from a web UI to my IDE is so painfully slow when you do it tens or hundreds of times that I don't think I would be able to go back. I'd be happy to try out a Phind extension that has similar UI/UX if you ever make one.


We do have a Phind extension and you can even use it inside Cursor :)


That is true, the extension inside Cursor is awesome.


"Python script to extract a list of all Elastic IP's from all regions, from multiple AWS accounts."

ChatGPT-4 gave me a solid answer hitting all the points I wanted. Phind didn't get the account handling correct, didn't address regions, and didn't handle pagination.

"Write a python based script that uses boto3 to query AWS Route53. It should print a list of every record for a given hosted zone ID."

ChatGPT4 did exactly as requested with pagination, and even smartly decided to use "input" so I could give it a zone ID at run time. Phind didn't handle pagination, or do ANY error handling. It was also slower than ChatGPT4 to generate currently, and it wasn't in a single block of copy/pasteable code.

ChatGPT's solution worked without modification. Copy-Paste. Run.
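For what it's worth, the record-set pagination the second script needs is essentially a one-liner with boto3's paginator. A minimal sketch of the Route53 request (assuming AWS credentials are already configured; alias records have no ResourceRecords, hence the .get default):

    import boto3

    def print_zone_records(zone_id: str) -> None:
        client = boto3.client("route53")
        # The paginator transparently follows the paged API responses.
        paginator = client.get_paginator("list_resource_record_sets")
        for page in paginator.paginate(HostedZoneId=zone_id):
            for record in page["ResourceRecordSets"]:
                values = [r["Value"] for r in record.get("ResourceRecords", [])]
                print(record["Name"], record["Type"], ", ".join(values))

    if __name__ == "__main__":
        print_zone_records(input("Hosted zone ID: ").strip())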


Just worked well for me: https://www.phind.com/search?cache=g9y2uizgjwcn378aovb65v92.

We do have issues with consistency sometimes -- please try regenerating if that is the case.


"We do have issues with consistency sometimes" That's a strange statement. Having issues with consistency means that sometimes the output is wrong. What does it mean to have issues with consistency sometimes ? You're either consistent or you're not.


There's a difference between models that are incompetent and aren't capable of getting the right answers ever and models that are capable of getting the right answer but may not do so every time. The Phind Model is in the latter camp.

Consistency issues can be caused by a wide range of factors from inference hyperparameters to prompting.


I meant that saying "something is inconsistent sometimes" is weird because inconsistency implies "sometimes"


Your example didn’t include pagination.


I tried this, but I have yet to get any LLM to give me a working answer to a programming question I actually want to solve.

Basically:

"How can I send network control commands to an AppleTV in C#"

They always make up some nonexistent library or give an example using some nonexistent API.


I’d guess the intersection of the two technologies has little training content, so it starts dreaming. If you break the question up into “AppleTV API” (or whatever the primary terms are), then use that context for the C# part, it might work better. Isolate the Apple bit so it draws on more specific parts of the training data.


That’s because you’re asking it something too obscure that I would have at first assumed wasn’t even possible.

“Make me a billionaire… I’m still poor! Bad AI!”

You need to collaborate with the AI, use it to help with each small step of the problem, with input references provided.

To a degree Phind can do the reference chasing for you, but it’s not magic.


It's definitely not impossible at least.

Someone is doing it in python here:

https://pyatv.dev/

GPT-4 actually sent me here:

"Here is an example of a C# library that implements the HAP: CSharp.HomeKit (https://github.com/brutella/hkhomekit). You can use this library as a reference or directly use it in your project."

Which, to no surprise based on my experience with LLMs for programming, does not exist and doesn't seem to have ever existed.

I get that they aren't magic, but I guess I am just bad at trying to use LLMs to help in my programming. Apparently all I do are obscure things or something. Or I am just not good enough at prompting. But I feel like that's also a reflection of the weakness of an LLM in that it needs such perfect and specific prompting to get good answers.


In a sense you’re asking it the wrong questions. It’s a bit like asking Google “my PC crashed, how do I fix!?” and then expecting something specific to a rare issue in the first hit.

Assuming a C# library even exists for what you’re doing (maybe not!) then still the best use of AI is to troubleshoot specific issues given an almost working piece of code as input.

Ask it to explain why something doesn’t work instead of asking it to do your job for you wholesale.

PS: GPT 4 (you are using the best coding AI, right? Right?) can get you going quickly:

“There are several libraries available for controlling Apple HomeKit from C#. One such library is *HapSharp* ². It is a .NET implementation of the HomeKit Accessory Server that allows you to create your own custom HomeKit accessory on a Raspberry Pi, Mac computer, or any other platform that can run Mono ².

Another option is *HomeKit* ¹. It is a native C# library for Apple's HomeKit Accessory Protocol. However, it is not a complete implementation and does not work ¹.

I hope this helps!

Source: Conversation with Bing, 31/10/2023 (1) netonjm/HapSharp: HomeKit Accessory Server .Net bridge! - GitHub. https://github.com/netonjm/HapSharp. (2) GitHub - ppumkin/HomeKit: Native C# Libary for Apple's HomeKit .... https://github.com/ppumkin/HomeKit. (3) homekit-accessory-protocol · GitHub Topics · GitHub. https://github.com/topics/homekit-accessory-protocol?o=asc&s...


Even all of that is on the wrong track. There is nothing that I can see anywhere about controlling an ATV with the homekit accessory protocol.


Then you asked the wrong question.

AFAIK Apple generally does not allow arbitrary remote control (headless mode) for security reasons — it could be used for spam automation!


They do though. Pyatv can do it (and home assistant is using pyatv since HA is python based) and commercial home automation systems like Crestron and Control4 can do it too.

Really I just need to get an LLM to port pyatv to C# for me I guess.


> Or I am just not good enough at prompting.

Or you're good enough at using your tools that you can do all the low-hanging fruit. LLMs excel at working around inadequate tooling, but (at least at the moment) they can't help you if you're trying to do something actually tricky and get stuck enough that no rubber duck can save you.


I’m working on an open source, terminal-based AI coding tool that is designed specifically for more complex, multi-iteration tasks and features. I think it could likely do a good job on this task.

I’m using it personally every day and while it still needs more work and polish, I’m finding it much better than ChatGPT or any other tools I’ve tried for bigger and more difficult tasks.

Please let me know if you (or anyone else reading this) would be interested to try a late alpha/early beta version: dane@envkey.com


Interesting, I seem to have gotten a decent answer: https://www.phind.com/search?cache=avbridtm69ejk8pdqpx8hcnf


Unfortunately there’s nothing correct about that answer. There’s no tcp service listening for requests like that on port 7000 on an AppleTV.


It's quoting that from a StackOverflow post: https://stackoverflow.com/questions/11857130/tcpclient-or-ht....


Yeah, that port 7000 service is the AirPlay protocol, and they are sending photos and videos to an ATV with it.

But I want to control a unit: send navigation commands like the four directions, back, and select.

The only app I know that can do it is pyatv, (a python app) but I want to do it in my C# app.

It would be nice if an LLM could port pyatv to C# for me as I don't really know python at all.
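For reference, the pyatv side of this is only a few lines of Python. A rough sketch based on pyatv's scan/connect/RemoteControl interface (method names from memory, so double-check against the current pyatv docs; pairing/credentials may also be required depending on the device):

    import asyncio
    import pyatv

    async def press_select() -> None:
        loop = asyncio.get_running_loop()
        # Discover Apple TVs on the local network and connect to the first one.
        confs = await pyatv.scan(loop, timeout=5)
        if not confs:
            raise RuntimeError("No Apple TV found")
        atv = await pyatv.connect(confs[0], loop)
        try:
            # RemoteControl exposes up/down/left/right/menu/select-style commands.
            await atv.remote_control.select()
        finally:
            atv.close()

    asyncio.run(press_select())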


Re: "We're excited to announce" - when did this get deployed? I was on Phind Pro ... a month ago or something, and curious if i already experienced this or not.

Phind was really good, but it still had a difficult time with library versions. Notably, a lot of the search results it saw seemed to pollute it with incorrect assumptions about which methods are available in specific library versions. The web results seemed to make the LLM worse at some things. In the end I switched back to ChatGPT, though I expect I'll retry Phind at some point; I tend to ping-pong on each respective release.

Does this version tackle that any better in your eyes?


Thanks for the feedback and I'm sorry to see you go. The new version should be better at library versions. If you're in our Discord, I'd be happy to help you one-on-one -- please send me a DM.


I'm sure I'll be back soon; the overall experience was good. There are so many competing products that it's difficult to pay for them all at once.


A few of Phind's models are open/available

https://huggingface.co/Phind/Phind-CodeLlama-34B-v2


Hi, is there any plan to improve the UI? About a third of the vertical space is used by the Phind logo and the search box below it. I sincerely believe the UI needs a more professional/business touch.


Just a reminder GPT-4 is almost 1.5 years old at this point (from before they started internal safety testing), and even the one we have is diminished from the first, uncensored version.


Phind can be very good for general tech searches too. I spent a long time looking for a way to stop my Pixel 4a from auto-updating to Android 14 (which it had already downloaded). With Google I only found one solution, which disables the update on restart. I asked the same thing on Phind and in one query got around 4 solutions [1].

[1]: https://www.phind.com/search?cache=vtzigjx3rnruc9ltocv9gqi1


Playing on the other AI prompt thread (https://news.ycombinator.com/item?id=38089247) here on HN I tried:

"write me an angry birds clone using JavaScript and the matter physics library"

And I really enjoyed the answer; in particular that it shows its sources, making it obvious how close its answer is to openly available tutorials.

This is much better than a black box pretending to do black magic.


It didn't work well when I asked it a design question: the code and API it used are not correct. GPT-4 did a better job.

https://www.phind.com/search?cache=ay8rx37gq8oy3z7uixftlqkt

https://chat.openai.com/share/a3a91dcc-a91a-4b04-8afd-40bd1a...


The GPT-4 answer is only better in so far as it uses RunTransaction. I don't know why it's trying to loop through the stores and then running the i'th operation on that store when it could have just had the store referenced in the operation instead of passing it as a parameter. And then it's also creating a new client for each transaction which seems wrong (to be fair I'm not familiar with Firestore so maybe this is idiomatic).


It's not idiomatic. I agree that the ChatGPT implementation is not very good, but at least it probably works (not tested) and uses correct APIs. I tried several iterations after that, and it came up with a better design.


Not looking deeply at the technical side of the answers, but the tone of GPT-4's answer is very casual/conversational (it starts with "Alright, listen up." and keeps that tone throughout).

I think you might get a better answer if you rewrote your prompt using full sentences and more formal language.


Thanks for sharing the links, we'll investigate this example.


I straight away asked it a stackoverflow question in which input and expected output samples were given. Phind didn't do well. ChatGPT though, [kissing hearts emoji]


This is awesome. Are you planning to open-source the V7 model?


Thanks! We generally plan to open-source our previous models once they're no longer cutting-edge, so yep :)


This is great work, but HumanEval is an extremely limited benchmark and I don’t think you can seriously claim to beat GPT-4 at coding based only on that metric.


Fifth sentence:

> However, we’ve found that HumanEval is a poor indicator of real-world helpfulness.


Thank you. You're right -- which is why we rely on feedback we've received from our own users for that claim. Many of our users who have the choice to use either GPT-4 or the Phind Model on Phind choose the Phind Model.


You likely know this, but keep in mind the kind of selection bias in taking feedback mostly from your own users. The number of times I've heard product designers claim that their users prefer some aspect of how their application already works, ignoring the fact that the users who didn't prefer it have left and hence are likely not available to survey.


Of course. We do our best to talk to churned users as well, but we're doing this Show HN to get even more diverse feedback.


I understand, but big claims require big evidence and so it’s still IMHO not rhetorically a strong position. I’m glad people find it more useful!


How have you liked using TensorRT-LLM? Did you come from FasterTransformer, vLLM, LMDeploy, TGI, or something else?

We started migrating to it the day it came out, very glad to have it, but lots of little annoyances along the way. Biggest one has been loading our model repository; having to hardcode the location of the engine file means we can't use the built-in ways Triton has for downloading from GCS!


Have you tried your model on this new benchmark from Princeton NLP? https://www.swebench.com/

It's more "real-world" than the toy-like problems in HumanEval, it would be interesting to see if you can crack some of those.


The speed is really impressive! I tried it with a moderately challenging task and it failed pretty spectacularly, hallucinating class methods and missing a bunch. It seemed like the UI struggled with my code too, breaking in and out of markdown somewhat randomly. I was impressed enough I may try again with some simpler stuff, but I'm not quite ready to switch away from GPT4.


Would you mind sharing the link? I'd also suggest trying to enable "Ignore search results" from the model dropdown for inputs with lots of specific details.


I like that it provides sources, but I have to check them EVERY time because too often it hallucinates bogus solutions or protocols for me. I'm asking network questions, though, not asking for code snippets. I've had it hallucinate PowerShell modules as well. If you're willing to check its work, then it's useful, maybe.


Thanks for the feedback. Do you have any cached links you can share? It'd be massively helpful.


I tried to remember specific examples, plugged them into Phind, and it seems to have improved since I used it last.

We had asked it if Cisco LISP supports map server database replication. It doesn't, and maybe this is a bad use case, asking it about something that doesn't exist.. But it basically said "yeah it does, here's how" and spat out some irrelevant stuff (how to do manual database mapping, etc). Again, probably a bad example considering the prompt was kind of garbage, but I did expect it to "know" the difference between replication and manual data entry.

Had a procedure in a powershell script to query a name server for A records. It used a deprecated system class and I believe missed some syntax. Ended up getting what we wanted out of it by re-wording the prompt a few times. I can't replicate this now, I don't even have the original script, so sorry I can't help more.

Has Phind improved SINCE going to Gpt-4? It seems to have?

Anyways.. I still use Phind. Hopefully I haven't undersold it, because it's usually great and I recommend it to everyone at work. Like everything else AI, the prompt really matters, and any frustration I had with it was undoubtedly due to expecting too much of the backend (actual ChatGPT) and not the Phind interface.


Phind has been pretty nice to use for some rubber-ducking with C#. The only disadvantage is walling off communication behind Discord.


> "Am I talking to real person?" >> "Yes, you are talking to a real person..."


If it’s trained on data (particularly docs) after 2021, that’s an automatic win over ChatGPT in some situations!


Gpt is 2023 now


Not generally, at least not for me: I still get the 2021 notice unless I select Bing in the dropdown.


I'd been subscribed to Phind for 3 months at 30€ per month, and constant outages made me unsubscribe this month. I did compare Phind and GPT-4 in the past, whenever Phind came out with these kinds of articles, and after the first question it was obvious Phind was nowhere near.


Sorry to hear that you didn't have a great experience. I'd love to chat further, my email is founders(at)phind.com


I was just testing to see the comparison and ran into a message saying I was out of GPT-4 queries, despite having deliberately selected Phind as my model.

Now I'm confused if the results I was seeing really were from a different model than GPT-4 or not.


Ah you were likely using the Pair Programmer. The Phind Model is not yet supported in the Pair Programmer, only the default search mode. Please try again using that.


If this is the case, the UX is off, as Pair Programmer is on and the model clearly says Phind. I'd recommend making the model clearer before and after a search, and instead of adding a bubble tag to a search box in the prompt response list, changing the background color of the search query boxes.


Oh okay, that makes sense, thank you for clarifying.


Your About page is really lacking in detail. https://www.phind.com/about I wouldn't feel comfortable using your service without a lot more detail about the founders and company etc.


This is amazing, kudos to the team


I found Phind V7 to be closer to GPT-3.5. The first answer was great, but it quickly started repeating mistakes from previous prompts. I also felt GPT-4 understood the constraints of the problem better.

Still a massive thumbs up to the phind team. Impressive stuff!


The speed and quality seem good to me. Will try it on some real scenarios this week.


https://www.phind.com/search?cache=hnqqc3fo3o3n61blb6bfh69b

It's not generating the wrong answer. It's quoting the wrong answer


So on Firefox with normal protections I get a blank page in reply to a Phind query, for whatever reason. On Chrome, Phind does seem to give some interesting answers (and it is a bit cheaper than GPT to begin with, for sure ;-) )


I tried “Draw a cone with tkz-euclide”. The result is not quite right: it outputs code that draws a circle and two vertical lines, and that’s it.

Just curious how well it works with niche languages.


Suggestion: When your title makes a claim like "Phind Model beats GPT-4 at coding, with GPT-3.5 speed and 16k context", link to a blog entry that explains the claim.



Will you open source anything newer than the v2 currently on HF?


What about more realistic benchmarks, like SWE-bench [1]?

[1] https://www.swebench.com/


I went over to my GPT-4 history and pasted some problems and refactor requests in verbatim; the GPT-4 outputs were of much higher quality.


Is the search part of this shelling out to something like bing? Or is this a novel internet search engine as well as an AI model?


Some small feedback/bug: (Mobile, Firefox, using pair programmer mode)

The text box gets hidden after the conversation exceeds the page height


Thanks, we'll take a look.


It would be great to have more clarity on the Plans page re: why I need GPT-4 in the context of Phind. I'm already paying for ChatGPT Plus, Copilot, and Kagi Search. It would also be great to have a reference point: is an input length of 8,000 tokens good for a web app, an iOS view, a Unix util, a Go server? It seems like the value-add is the Phind model, but you advertise GPT-4.


"Service is unavailable in this region"

It loses to GPT-4 because it apparently geo-gates its infrastructure, which GPT-4 does not.


In some of my coding questions, Google Bard seems to be slightly quicker and both may provide similar replies.


How recently was this LLM seeded with data? In the context of golang, this is easily a generation ahead of GPT-4.


I had a bug in my flutter app. The Phind model nailed it straight away. GPT-4 gave a working but awful solution.


Such an AWESOME time to be a programmer!!


Is the Phind model available outside of phind.com as an api or are weights available for fine tuning?


Wow they really simplified the front end


@rushingcreek

My primary role in IT is not as a programmer; however, I do program in Python, PowerShell, and batch files to automate some of my administrative duties, both to save time and to ensure the accuracy of my work. I am self-taught, and a lousy programmer. I do know what questions to ask and what is possible, so a tool like ChatGPT or Phind is truly helpful in my hands.

I am writing this comment after directly comparing Phind and ChatGPT. I did an experiment and Phind far exceeded my expectations. I provided an obscure prompt to Phind, not expecting it to intuitively provide the exact answer I was looking for. The prompt was simply 'Look at this batch file' and then I provided a copy of the file. I was looking to rewrite the code to use a for loop and a text file for the destination paths. Phind did not require a further prompt to assume this and provide me the code and explanation needed to understand the code. The latter is very useful for a crappy programmer like myself because it easily expands my knowledge.

The same prompt to ChatGPT got the response I anticipated: it simply reiterated, line by line, what I had submitted.

Thank you for providing this tool.

TLDR; Phind is pretty awesome. Useful to get work accomplished and useful for learning.


The headline seems a little disingenuous: “beats GPT-4 at coding”

The results are impressive and things have been really progressing quickly, so kudos.

But even by your own description in this post, something like “rivals GPT-4 at coding” seems a more accurate appraisal.


I tried a more niche language, Scilab, and it did considerably *worse* compared to GPT-4.

What are your experiences?


It's so fast ... and accurate.


Licensing and privacy details pls?


Looks nice, but it's quite pricy compared to OpenAI's API pricing or ChatGPT.


The vscode integration seems cool, but why do I have to have an account to use it?


Are the weights for the 70B version of the model available?


Is there a noscript/basic (x)html prompt somewhere?


That bot of yours is the second chatbot that has claimed it can program, or can help with programming. And it is the second one that utterly failed to provide me with an implementation of blocked clause decomposition in Haskell. I needed something; even the slowest version would do.

Your bot also tried to bullshit me about the validity of its answer, just like the other one.

The difference? Your bot mentioned a paper on arXiv about the problem. But the paper (and I read it a long time ago, of course) does not provide even pseudocode implementations for most of the algorithms mentioned there.

Color me not impressed.

As usual, bots like yours are not for when you need something new. If I have an idea, I cannot use any AI, including yours, for prototyping work.

It is expected, as neural networks are interpolators, not extrapolators, and for them to "extrapolate" one needs to train them over the "extrapolation" area quite well.


This is why I don't even bother reading the 'evals' or the claims about how it's better. I test it out myself, and it almost always turns out not to be true.


Does it respond in correct JSON?


Are you going to launch a VS Code extension? That would provide a better UX.


What‘s the cutoff date?


October 2023


are you planning on open sourcing the model eventually?


What data did you use to train and how do you evaluate your model for overfitting? I ask due to the issues with the HumanEval dataset.

-------------

For those that are unfamiliar with the issues, allow me to elaborate. You can find the dataset in the parent's link or here[0] and you can find the paper here[1].

I'll quote from the paper. First is page 2 right above the github link and second is page 4 section 2.2 (note, this paper has 58 authors... 58)

> To accurately benchmark our model, we create a dataset of 164 original programming problems with unit tests. These problems assess language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions.

> It is important for these tasks to be hand-written, since our models are trained on a large fraction of GitHub, which already contains solutions to problems from a variety of sources. For example, there are more than ten public repositories containing solutions to Codeforces problems, which make up part of the recently proposed APPS dataset

So we take from this that the problems are simple, leetcode-style questions, and that the authors "verified" the data is not in the training set simply by virtue of writing the code from scratch. If you aren't laughing now, you should be. So let's look and see if there are in fact samples of code, exact or near matches to those in the test set, that existed in public GitHub repos prior to May 2020, their cutoff date.

Now let's look at some of the test questions and see if we can find them on GitHub. GitHub search is total garbage, so I'm going to pull results from the last time I looked (search my comment history for "godelski human eval"). I apologize in advance for the formatting.

HumanEval/4:

Prompt:

    from typing import List

    def mean_absolute_deviation(numbers: List[float]) -> float:
        """ For a given list of input numbers, calculate Mean Absolute Deviation
        around the mean of this dataset.
        Mean Absolute Deviation is the average absolute difference between each
        element and a centerpoint (mean in this case):
        MAD = average | x - x_mean |
        >>> mean_absolute_deviation([1.0, 2.0, 3.0, 4.0])
        1.0
        """

canonical_solution:

    mean = sum(numbers) / len(numbers)
    return sum(abs(x - mean) for x in numbers) / len(numbers)

Found on GitHub[2], commit date Oct 5, 2019:

    if reduction == "median":
        return np.median(scores)
    mean = sum(scores) / len(scores)
    if reduction == "mean":
        return mean
    return sum(abs(x - mean) for x in scores) / len(scores)

This is a solution that is functionally equivalent: swap numbers and scores and remove the if statements. This constitutes a near collision, and ML models perform very well on near collisions. If you look at the testing method for the evaluation you will also see that this code would pass the test. Thus an LLM could very easily just copy-paste this code and pass, no problem. I'm not saying that's what happened, only that we cannot rule it out. What actually happened is an open question, and we're far from ready as a community to call LLMs fuzzy copy machines.
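As a concrete check on just how "near" this collision is, Python's standard difflib is enough to show the canonical two lines sitting verbatim inside the larger snippet once the single differing identifier is renamed (a rough heuristic, not a formal contamination test):

    import difflib

    canonical = (
        "mean = sum(numbers) / len(numbers)\n"
        "return sum(abs(x - mean) for x in numbers) / len(numbers)"
    )
    found = (
        'if reduction == "median":\n'
        "    return np.median(scores)\n"
        "mean = sum(scores) / len(scores)\n"
        'if reduction == "mean":\n'
        "    return mean\n"
        "return sum(abs(x - mean) for x in scores) / len(scores)"
    )

    # Rename the one differing identifier, then check the overlap.
    renamed = found.replace("scores", "numbers")
    print(all(line in renamed for line in canonical.splitlines()))    # True
    print(difflib.SequenceMatcher(None, canonical, renamed).ratio())  # high similarity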

I also have this search query bookmarked, which still seems to work[3], but you'll have to check the date manually.

You can repeat this process for many examples in the HumanEval dataset. Or simply look at the HumanEval questions and answers and ask yourself, "Have I written those exact lines of code?" The answer is: probably.

But note here that overfitting is perfectly okay in certain circumstances. HumanEval simply measures how good an LLM is at solving short leetcode-style questions. It does not measure an LLM's ability to write code in general, and certainly not non-leetcode code. The model may very well do those things, but this benchmark does not measure them. It can still provide utility to people, and these LLMs still learn a lot more than what HumanEval tests. My issue is with the metric and the claims about what the results indicate, rather than with the product itself. There is also a danger in chasing benchmarks like these, as you will not be able to disentangle overfitting from desired training outcomes. I am not critiquing OP's network nor the work they did to create this. I'll explicitly state it here: well done, OP. This took a lot of hard work and you should feel very proud. I hope this question and context do not come off as pejorative or overly cynical. I think your work is, without a doubt, something to be proud of and useful to our community.

This is a warning to all HN readers to help you avoid snake oil (I expect every ML person to already know this): scrutinize your metrics and know exactly what they measure. I mean that precisely: there are no metrics that directly measure abstract things like "image quality", "performance in language", "code generation performance", and so on. With generative models it is exceptionally difficult to determine which model is better, and we are unfortunately at a point where many of our metrics are weak proxies (remember: metrics are proxies for more abstract goals; metrics are models, and all models are wrong, just some more wrong than others), so you must do far more investigation to come to even a fuzzy answer to this question. Nuance is necessary.

[0] https://huggingface.co/datasets/openai_humaneval

[1]https://arxiv.org/abs/2107.03374

[2] https://github.com/danielwatson6/hate-speech-project/blob/8e...

[3] https://github.com/search?q=abs%28x+-+mean%29+for+language%3...


From my test of the open-source model last week, it kept repeating itself and gave broken outputs, using Q4 quantization.


This V7 model is much better than the V2 model that we previously open-sourced. And Q4 quantization would also likely have a large detrimental impact.


Are there plans to open source V7?


Great work boys


Wait, can't you use this to develop chemical weapons? Where's your 20-person government-mandated safety team?



