
Could you share a bit more about semantic chunking with Phi? Any recommendations/examples of prompts?


Sure, it'll look something like this:

""" Task: Divide the provided text into semantically coherent chunks, each containing between 250-350 words. Aim to preserve logical and thematic continuity within each chunk, ensuring that sentences or ideas that belong together are not split across different chunks.

Guidelines:

1. Identify natural text breaks such as paragraph ends or section divides to initiate new chunks.

2. Estimate the word count as you include content in a chunk. Begin a new chunk when you reach approximately 250 words, preferring to end on a natural break close to this count, without exceeding 350 words.

3. In cases where text does not neatly fit within these constraints, prioritize maintaining the integrity of ideas and sentences over strict adherence to word limits.

4. Adjust the boundaries iteratively, refining your initial segmentation based on semantic coherence and word count guidelines.

Your primary goal is to minimize disruption to the logical flow of content across chunks, even if slight deviations from the word count range are necessary to achieve this. """
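
For reference, here's roughly how you might wire that prompt up with the Hugging Face transformers stack. This is a minimal sketch; the "Chunks:" lead-out and the placeholder prompt string are my own conventions, not anything phi-2 expects:

    # Sketch: run the chunking prompt above through phi-2 with transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "microsoft/phi-2"
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")

    # Paste the full prompt from above here, with a {text} slot for the input.
    CHUNK_PROMPT = "...the prompt above...\n\nText:\n{text}\n\nChunks:\n"

    def chunk(text: str, max_new_tokens: int = 1024) -> str:
        inputs = tokenizer(CHUNK_PROMPT.format(text=text), return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=max_new_tokens)
        # Drop the echoed prompt; keep only the generated chunking.
        new_tokens = out[0][inputs["input_ids"].shape[1]:]
        return tokenizer.decode(new_tokens, skip_special_tokens=True)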


Might sound like a rookie question, but curious how you'd tackle semantic chunking for a hefty text, like a 100k-word book, especially with phi-2's 2048-token limit [0]. Found some hints about stretching this to 8k tokens [1], but I'm still scratching my head on handling the whole book. And even if we get the 100k words in, how do we smartly chunk the output into manageable 250-350 word bits? Is there a cap on how much output the model can produce? From what I've picked up, a reasonable summary ratio for a large text without missing the good parts is about 10%, which translates to around 10K words, or 30+ chunks of output. Appreciate any insights here, and apologies if this comes off as basic.

[0]: https://huggingface.co/microsoft/phi-2

[1]: https://old.reddit.com/r/LocalLLaMA/comments/197kweu/experie...


Wild speculation - do you think there could be any benefit from creating two sets of chunks with one set at a different offset from the first? So like, the boundary between chunks in the first set would be near the middle of a chunk in the second set?


No, it's better to just create summaries of all the chunks, and return summaries of chunks that are adjacent to chunks that are being retrieved. That gives you edge context without the duplication. Having 50% duplicated chunks is just going to burn context, or force you to do more pre-processing of your context.
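
The lookup itself is trivial; something like this sketch, where summarize() is a hypothetical stand-in for whatever summarizer you already use:

    # Store one summary per chunk; at retrieval time, wrap the hit with
    # its neighbors' summaries instead of storing overlapping chunks.
    def build_index(chunks: list[str]) -> list[dict]:
        return [{"text": c, "summary": summarize(c)} for c in chunks]  # summarize() is hypothetical

    def context_for(index: list[dict], i: int) -> str:
        parts = []
        if i > 0:
            parts.append("Previous (summary): " + index[i - 1]["summary"])
        parts.append(index[i]["text"])
        if i < len(index) - 1:
            parts.append("Next (summary): " + index[i + 1]["summary"])
        return "\n\n".join(parts)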


This just isn't working for me, phi-2 starts summarizing the document I'm giving it. I tried a few news articles and blog posts. Does using a GGUF version make a difference?


Depending on the number of bits in the quantization, for sure. The most common failure mode should be minor restatements, which you can choose to ignore or not.
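
If you want to rule quantization in or out, run the same prompt against a low-bit and a high-bit quant side by side. A sketch with llama-cpp-python (the file names are just examples; use whichever quants you downloaded):

    from llama_cpp import Llama

    prompt = "..."  # your chunking prompt plus the article text

    for path in ["phi-2.Q4_K_M.gguf", "phi-2.Q8_0.gguf"]:
        llm = Llama(model_path=path, n_ctx=2048)
        out = llm(prompt, max_tokens=1024, temperature=0.0)
        print(path, "->", out["choices"][0]["text"][:200])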


Any comments about using Sparse Priming Representations for achieving similar things?


That looks like it'd be an adjunct strategy IMO. In most cases you want to have the original source material on tap; it helps with explainability and citations.

That being said, it seems that everyone working at the state of the art is thinking about using LLMs to summarize chunks, and summarize groups of chunks in a hierarchical manner. RAPTOR (https://arxiv.org/html/2401.18059v1) was just published and is close to SoTA, and from a quick read I can already think of several directions to improve it, and that's not to brag but more to say how fertile the field is.
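
The core loop is simple to sketch, even though RAPTOR itself clusters chunks by embedding similarity rather than grouping neighbors like this toy version does (summarize() again stands in for your LLM call):

    # Build a summary tree: leaves are the raw chunks, each level above
    # summarizes groups from the level below, up to a single root.
    def build_tree(chunks: list[str], group_size: int = 4) -> list[list[str]]:
        levels = [chunks]
        while len(levels[-1]) > 1:
            prev = levels[-1]
            groups = [prev[i:i + group_size] for i in range(0, len(prev), group_size)]
            levels.append([summarize("\n\n".join(g)) for g in groups])  # summarize() is hypothetical
        return levels  # levels[0] = chunks, levels[-1][0] = root summary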


Is phi actually able to follow those instructions? How do you handle errors?


Whether or not it follows the instructions as written, it produces good output as long as the chunk size stays on the smaller side. You can easily validate that all the original text is present in the chunks and that no additional text has been inserted, and automatically re-prompt when validation fails.
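
The check itself is a few lines; a sketch, where run_chunker() and split_chunks() are hypothetical hooks into the prompt above:

    import re

    def normalize(s: str) -> str:
        # Collapse whitespace so pure reflowing doesn't count as a failure.
        return re.sub(r"\s+", " ", s).strip()

    def chunk_with_retries(text: str, retries: int = 3) -> list[str]:
        for _ in range(retries):
            chunks = split_chunks(run_chunker(text))  # hypothetical hooks
            # Joined back together, the chunks should reproduce the source:
            # nothing dropped, nothing invented.
            if normalize(" ".join(chunks)) == normalize(text):
                return chunks
        raise ValueError("model kept altering the source text")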


ChatGPT's search is powered by Bing.


Even worse


Words with capital letters typically have higher token counts. This causes the LLM to apply attention differently.
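
Easy to check with tiktoken (OpenAI's tokenizer rather than phi's, but the effect is similar across BPE vocabularies):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for s in ["the quick brown fox", "THE QUICK BROWN FOX"]:
        print(len(enc.encode(s)), repr(s))
    # The all-caps line usually encodes to more tokens: uppercase words
    # are rarer in training data, so they get fewer BPE merges.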


No one thinks that peanuts and vegetable oil combine into some magic superfood, yet Plumpy'nut is concretely beneficial to keeping people fed and healthy. Now, apply this same rationale to a kind of poverty closer to home. There is plenty of evidence that milk in school lunches is often the only reliable daily source of calcium and protein for millions of impoverished children worldwide.

Contextualized one way, milk is a meal replacement or nutrition supplement, and one that is more practical than most other options. A serving of whole milk requires zero on-site prep time, is relatively portable once packaged, and perhaps more importantly, it is often the most palatable option for picky eaters. Public health is complicated.


> for millions of impoverished children worldwide.

Did you read that I wrote "modern diet"? Those kids are not the typical HN reading crowd.

Also: you could give them leafy greens and lentils to achieve the same.


Those impoverished kids you talk about are often lactose intolerant.

Sure, the benefits of the dairy nutrition may outweigh the problems of the intolerance.


> Those impoverished kids you talk about are often lactose intolerant.

Exactly. Same goes for other common allergens like legumes, seafood, and gluten. That's why you provide options.


It's impossible to have a sane conversation about nutrition, especially in a historical context, when addressing folks' anecdotal cultural notions of what foods make their bodies feel good. Whatever point you're trying to make is meaningless without offering science or anthropological-historical context to back it up.


In the US (and many other places actually):

- 70% obese/overweight rate

- 20% of kids are obese before hitting 18

- The 2 leading causes of deaths are lifestyle related, with diet being a big part of both

Every single study and scientist you will find will tell you our diet is dog shit; they'll tell you we eat too much sugar, too many carbs, too much salt, and too much processed food

People discussing whether artificial sweeteners are better or worse than sugar are like drug addicts debating whether heroin is less harmful than fentanyl

Just cut that shit out of your life; there are no drawbacks and an endless list of benefits. They're not necessary, they're not needed, you're just addicted to them. We see the results, we have the stats, the science tells us why; we know. I don't get how anyone can argue the opposite: we eat shit, the vast majority of what's sold in supermarkets is shit, and most people eat shit and visibly suffer for it


People eat sugar around the globe, though; it's not just an American thing.


> When did we become so comfortable with the government mandating presentation of papers and tracking of private transactions?

After Timothy McVeigh blew up a building with a truck full of agricultural supplies.


How well did that prevent 9/11?

The decrease in domestic terrorism is due to societal changes and better old-fashioned policing of home-grown extremist groups. We didn’t somehow make it impossible (or even difficult) to improvise large explosives.


Well, those folks had to spend a few years planning, take flying lessons, dry run everything a few times, and then pull off the most complicated and coordinated terrorist attack in US history.

The actual bomb planning for the Oklahoma City bombing was less than a year and involved two people. So, seems like the bar was raised quite a bit.


You're giving the 9/11 terrorists far too much credit. They were supported/financed by the Saudis[0] and America dropping the ball was the only reason they succeeded.

It's not so much a big victory for the terrorists as a big black eye for America and our intelligence agencies.

One of the 9/11 pilots was reported to Federal agencies multiple times[1] and the hijacking still took place.

[0]https://theintercept.com/2021/09/11/september-11-saudi-arabi...

[1]https://abcnews.go.com/US/story?id=91659&page=1


Do you think that the bar was raised a bit or that the 9/11 terrorists set their sights a little higher?


How well did laws to stop isolated crazy people killing lots of people stop … um … a group of countries investing large amounts of money and years of time into planning and training for the largest single attack on a civilian target in history?

I mean sure, you could also ask how laws against shoplifting fail to stop bank robberies, and it would be just as coherent.


> largest single attack on a civilian target in history?

Japan would like to have a word with you.


Perhaps "largest single attack on a civilian target by a non-state actor" would be more accurate.


It prevented 7/15 pretty damn well


7/15: Never Remember


Survivorship bias. You don't know how many plots were prevented by making it difficult to obtain explosive precursors.


To start, thanks for the self-submitted promo piece on your company. If you want publicity, earn it. You ought to make a statement about this fact in the comment section. Or, submit this using an official account.

Here is basically how it goes:

- Phone screening

- Take home assignment

- "Resume” interview

- Technical interview

- Product interview

- Interview with another team

- Finalizing the hire

This might seem like a lot of steps… and maybe that's true. However, we feel it's good for both parties to get a good look at what working together would be like.

Are you kidding me? That is more time spent interviewing with you than the legal French work week. Who has time for that? I don't know about Paris, but most candidates would laugh in the face of your recruiter. Those that don't are pushovers with nothing better to do.

Put yourself in someone else's shoes and imagine going through 7 days of 1-4hr interviews while juggling a half dozen other companies at the same time. What makes your company so elite? Prove it.

Show some respect.


If you sum it all up, it is less than a day. The phone screening is ~20 minutes, the assignment takes a couple of hours, and so on.

I think it's fair to expect that a candidate is willing to invest at least 5-6 hours in an interview process. Compared to what I've seen before (full-day interviews, freelance periods, etc.) this seems fair to me. But as your comment proves, it might not be for everybody.


A take home assignment alone is likely more than a work day, if not multiples - I'll take a 7 hour onsite interview gauntlet over that any day.

The only side this process seems to be favorable for is the company interviewing.

To give a flip side, I just finished interviewing with over 10 companies in a rigorous search. Of those, two did take home tests, and ultimately I didn't have the time to complete either, especially since the requirements were written in a way where candidates were encouraged to dump a lot of time into them. My schedule was filled with many high stakes interviews, which was mentally exhausting. It simply is not in my interest to do a take home project, as it reduces the number of companies I can simultaneously interview at.


From experience, we've seen that for most people the coding test only takes about an evening, which seems reasonable as it removes the need for more on-site discussions.

I guess it's true that the take-home assignment is not optimal if you interview with more than 10 companies, and in that case we must be losing some candidates.


They lie. Most candidates will tell you the coding exercise took less time than it actually did because 1) they want to appear efficient, and 2) you said it would take only 3 hours, so admitting it took 8 hours would look like a failure

Source: me, last week, for another company. Plus, programming is not just writing code; most technical interviewers will want to see the global design, unit tests, comments, etc., none of which are accounted for in the expected time


If this is the case, you are merely writing a piece that says "our interview process is entirely average." I don't know about you, but 5-6 hours is standard. What's the point of breaking out your day-of process into 4 bullet points if it's just the same as your normal engineering interview? Fluff.

Finally, realize that your candidates (just like your Eng org) spend much more time prepping for your interview than you quantify on paper. How much they choose to is up to them, but don't pretend you're doing them a favor.


I never said our interview process was perfect; I mainly wanted to share it because I saw a lot of people complaining about whiteboard coding and so on... so I figured it could be interesting to some to see a less exhausting alternative.


Just a small observation: whiteboard coding is not inherently wrong. Bad experiences with whiteboard coding usually come from dealing with companies with broken hiring processes, where simply eliminating whiteboard coding and replacing it with something else wouldn't help.


I completely agree; this was just an example of a common complaint.


Dear Blaine/Solve-employee, we know this is you. Be honest with us and just post on the OP account. Faking it is even worse than self-promotion spam. :)


Eh, I don't think this assumption is warranted. It could be a prospective competitor.


Oakland, CA is a strong contender if you step back, especially for a Stephensonian-styled second-wave cyberpunk setting. It even plays a part in Neuromancer.

It has its dark parts along with strong facets of common cyberpunk themes: drastic social stratification, the social acceptance of regular drug usage, urban decay meets technocratic renewal, a renewed definition of suburbia, and a greater acceptance of non-binary genders.


I like it, especially because there's a strong artist population there, and a common theme in cyberpunk is showing what the non-techies' life is like (to contrast it to the main characters' lives), and it's quite often artists.


> How do you do that without some basic understanding of computer science-y stuff?

> How do you define "scalable", how do you measure it? How can you have some intuition about a design before we spend 3 months and many sprints building it first?

> How do I know when to cache stuff? Does it matter if I have calls to a remote cache in a tight loop? Should I be using an in-process, out-of-process, or remote cache for a particular piece of data?

You're proving the above poster's exact point. You are putting your weight behind applied questions that rest on the developer's specific experience. This method is the opposite of evaluating people on their ability to memorize a half dozen algorithms and data structures.

In my experience interviewing candidates, asking people to implement a caching algorithm is a distraction for both parties. A much better evaluation is their ability to draw a box-and-arrow diagram and talk it through. This is much more effective for understanding their thought processes and knowledge. It is also much, much closer to the _real_ day-to-day of today's engineer: communication, advocacy, and breadth of knowledge. Code is cheap; businesses should screen candidates for genuine interest.

CS textbook questions introduce enormous amounts of bias, especially in panel interviews. It is a dangerous trap that companies use to further entrench their team cliquiness and departmental monoculture. It is ripe for Simple Sabotage. Simply put, it's lazy.

