"""
Task: Divide the provided text into semantically coherent chunks, each containing between 250 and 350 words. Aim to preserve logical and thematic continuity within each chunk, ensuring that sentences or ideas that belong together are not split across different chunks.
Guidelines:
1. Identify natural text breaks such as paragraph ends or section divides to initiate new chunks.
2. Estimate the word count as you include content in a chunk. Begin a new chunk when you reach approximately 250 words, preferring to end on a natural break close to this count, without exceeding 350 words.
3. In cases where text does not neatly fit within these constraints, prioritize maintaining the integrity of ideas and sentences over strict adherence to word limits.
4. Adjust the boundaries iteratively, refining your initial segmentation based on semantic coherence and word count guidelines.
Your primary goal is to minimize disruption to the logical flow of content across chunks, even if slight deviations from the word count range are necessary to achieve this.
"""
Might sound like a rookie question, but curious how you'd tackle semantic chunking for a hefty text, like a 100k-word book, especially with phi-2's 2048 token limit [0]. Found some hints about stretching this to 8k tokens [1] but still scratching my head on handling the whole book. And even if we get the 100k words in, how do we smartly chunk the output into manageable 250-350 word bits? Is there a cap on how much output the model can produce? From what I've picked up, a neat summary ratio for a large text without missing the good parts is about 10%, which translates to around 7.5K words, or over 20 chunks, of output. Appreciate any insights here, and apologies if this comes off as basic.
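The usual way to sidestep a small context window is to stream the book through in overlapping windows and chunk each window separately, then stitch the results back together. A sketch of that idea, assuming roughly 1,500 words fits a 2,048-token context alongside the chunking prompt; the window sizes and file name are made-up placeholders:

    def windows(words, window_size=1500, overlap=150):
        """Yield overlapping word windows small enough to fit a ~2k-token
        context alongside the chunking prompt (sizes are rough guesses)."""
        step = window_size - overlap
        for start in range(0, len(words), step):
            yield " ".join(words[start:start + window_size])
            if start + window_size >= len(words):
                break

    with open("book.txt") as f:   # hypothetical input file
        book_words = f.read().split()

    for window in windows(book_words):
        # send each window to the model (or a plain heuristic chunker),
        # then stitch the per-window chunks back together in order
        ...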
Wild speculation - do you think there could be any benefit from creating two sets of chunks with one set at a different offset from the first? So like, the boundary between chunks in the first set would be near the middle of a chunk in the second set?
No, it's better to just create summaries of all the chunks, and return summaries of chunks that are adjacent to chunks that are being retrieved. That gives you edge context without the duplication. Having 50% duplicated chunks is just going to burn context, or force you to do more pre-processing of your context.
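A sketch of what that could look like at retrieval time; the chunk and summary lists indexed by position are assumptions for illustration, not any particular framework's API:

    def build_context(hit_ids, chunks, summaries):
        """Return retrieved chunks in order, each flanked by summaries of its
        immediate neighbours, so you get edge context without duplicated chunks."""
        hit_ids = set(hit_ids)
        parts = []
        for i in sorted(hit_ids):
            if i - 1 >= 0 and i - 1 not in hit_ids:
                parts.append(f"[summary of chunk {i - 1}] {summaries[i - 1]}")
            parts.append(chunks[i])
            if i + 1 < len(chunks) and i + 1 not in hit_ids:
                parts.append(f"[summary of chunk {i + 1}] {summaries[i + 1]}")
        return "\n\n".join(parts)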
This just isn't working for me, phi-2 starts summarizing the document I'm giving it. I tried a few news articles and blog posts. Does using a GGUF version make a difference?
Depending on the number of bits in the quantization, for sure. The most common failure mode should be minor restatements which you can choose to ignore or not.
That looks like it'd be an adjunct strategy IMO. In most cases you want to have the original source material on tap, it helps with explainability and citations.
That being said, it seems that everyone working at the state of the art is thinking about using LLMs to summarize chunks, and summarize groups of chunks in a hierarchical manner. RAPTOR (https://arxiv.org/html/2401.18059v1) was just published and is close to SoTA, and from a quick read I can already think of several directions to improve it, and that's not to brag but more to say how fertile the field is.
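Roughly, the hierarchical idea is: summarize groups of chunks, then groups of those summaries, and so on up to a root. The sketch below is only that rough idea, not RAPTOR itself (RAPTOR clusters by embedding similarity rather than grouping neighbours), and `summarize` stands in for whatever model call you use:

    def build_tree(chunks, summarize, group_size=4):
        """Summarize groups of chunks, then groups of those summaries, until a
        single root summary remains. Returns every level, leaves first."""
        levels = [list(chunks)]
        while len(levels[-1]) > 1:
            prev = levels[-1]
            groups = [prev[i:i + group_size] for i in range(0, len(prev), group_size)]
            levels.append([summarize("\n\n".join(g)) for g in groups])
        return levels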
Whether or not it follows the instructions as written, it produces good output as long as the chunk size stays on the smaller side. You can easily validate that all the original text is present in the chunks and that no additional text has been inserted, and automatically re-prompt if it isn't.
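That validation can be as simple as comparing whitespace-normalized text, e.g. the sketch below; in practice you may want something more forgiving of the minor restatements mentioned above:

    import re

    def _norm(s):
        return re.sub(r"\s+", " ", s).strip()

    def chunks_match_source(source, chunks):
        """True if the chunks, concatenated, reproduce the source text
        (ignoring whitespace differences); if not, re-prompt."""
        return _norm(" ".join(chunks)) == _norm(source)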
No one thinks that peanuts and vegetable oil combine into some magic superfood, yet Plumpy'nut is concretely beneficial to keeping people fed and healthy. Now, apply this same rationale to a kind of poverty closer to home. There is plenty of evidence that milk in school lunches is often the only reliable daily source of calcium and protein for millions of impoverished children worldwide.
Contextualized one way, milk is a meal replacement or nutrition supplement, and one that is more practical than most other options. A serving of whole milk requires zero on-site prep time, is relatively portable once packaged, and perhaps more importantly, it is often the most palatable option for picky eaters. Public health is complicated.
It's impossible to have a sane conversation about nutrition, especially in a historical context, when addressing folks' anecdotal cultural notions of what foods make their bodies feel good. Whatever point you're trying to make is meaningless without offering science or anthropological-historical context to back it up.
- The two leading causes of death are lifestyle related, with diet being a big part of both
Every single study and scientist you will find will tell you our diet is dog shit; they'll tell you we eat too much sugar, too many carbs, too much salt, too much processed food
People discussing if artificial sweeteners are better or worse than sugar are like drug addicts discussing if heroin is less worse than fentanyl
Just cut that shit out of your life, there are no drawbacks and an infinite number of benefits. They're not necessary, they're not needed, you're just addicted to them. We see the results, we have the stats, the science tells us why, we know. I don't get how anyone can argue the opposite: we eat shit, the vast majority of things sold in supermarkets is shit, most people eat shit and they visibly suffer from it
The decrease in domestic terrorism is due to societal changes and better old-fashioned policing of home-grown extremist groups. We didn’t somehow make it impossible (or even difficult) to improvise large explosives.
Well, those folks had to spend a few years planning, take flying lessons, dry run everything a few times, and then pull off the most complicated and coordinated terrorist attack in US history.
The actual bomb planning for the Oklahoma City bombing was less than a year and involved two people. So, seems like the bar was raised quite a bit.
You're giving the 9/11 terrorists far too much credit.
They were supported/financed by the Saudis[0] and America dropping the ball was the only reason they succeeded.
It's not so much a big victory for the terrorists as a big black eye for America and our intelligence agencies.
One of the 9/11 pilots was reported to Federal agencies multiple times[1] and the hijacking still took place.
How well did laws to stop isolated crazy people killing lots of people stop … um … a group of countries investing large amounts of money and years of time into planning and training for the largest single attack on a civilian target in history?
I mean sure, you could also ask how laws against shoplifting fail to stop bank robberies, and it would be just as coherent.
To start, thanks for the self-submitted promo piece on your company. If you want publicity, earn it. You ought to make a statement about this fact in the comment section. Or, submit this using an official account.
Here is basically how it goes:
- Phone screening
- Take home assignment
- "Resume” interview
- Technical interview
- Product interview
- Interview with another team
- Finalizing the hire
This might seem like a lot of steps… and maybe that's true. However, we feel it's good for both parties to get a good look at what working together would be like.
Are you kidding me? That is more time spent interviewing with you than the legal French work week. Who has time for that? I don't know about Paris, but most candidates would laugh in the face of your recruiter. Those that don't are pushovers with nothing better to do.
Put yourself in someone else's shoes and imagine going through 7 days of 1-4 hour interviews, concurrently with a half dozen other companies. What makes your company so elite? Prove it.
If you sum it all up, it is less than a day. The phone screening is ~20 minutes, the assignment takes a couple of hours and so on.
I think it's fair to expect that a candidate is willing to invest at least 5-6 hours in an interview process. Compared to what I've seen before (full-day interviews, freelance periods, etc.) this seems fair to me. But as your comment proves, it might not be for everybody.
A take-home assignment alone is likely more than a work day, if not multiples of one - I'll take a 7-hour onsite interview gauntlet over that any day.
The only side this process seems favorable for is the company doing the interviewing.
To give a flip side, I just finished interviewing with over 10 companies in a rigorous search. Of those, two did take-home tests, and ultimately I didn't have the time to complete either, especially since the requirements were written in a way that encouraged candidates to dump a lot of time into them. My schedule was filled with many high-stakes interviews, which was mentally exhausting. It simply is not in my interest to do a take-home project, as it reduces the number of companies I can simultaneously interview at.
From experience, we saw that for most people the coding test only took about an evening, which seems reasonable as it removes the need for more on-site discussions.
I guess it's true that the take-home assignment is not optimal if you're interviewing with more than 10 companies, and in that case we must be losing some candidates.
They lie. Most candidates will tell you the coding exercise took less time than it actually did because
1) they want to appear efficient
2) you said it would take only 3 hours, so admitting it took 8 hours would look like a failure
Source: me, last week, for another company.
Plus, programming is not just writing code: most technical interviewers will want to see the overall design, unit tests, comments, etc., which are not accounted for in the expected time.
If this is the case, you are merely writing a piece saying "our interview process is entirely average." I don't know about you, but 5-6 hours is standard. What's the point of breaking out your day-of process into 4 bullet points if it's just the same as a normal engineering interview? Fluff.
Finally, realize that your candidates (just like your Eng org) spend much more time prepping for your interview than you quantify on paper. However much they choose to is up to them, but don't pretend you're doing them a favor.
I never said our interview process was perfect; I mainly wanted to share it because I saw a lot of people complaining about whiteboard coding and so on, so I figured it could be interesting for some to see a less exhausting alternative.
Just a small observation: whiteboard coding is not inherently wrong. Bad experiences with whiteboard coding usually come from dealing with companies with broken hiring process, where simply eliminating whiteboard coding and replacing it with something else wouldn't help.
Dear Blaine/Solve-employee, we know this is you. Be honest with us and just post on the OP account. Faking it is even worse than self-promotion spam. :)
Oakland, CA is a strong contender if you step back, especially for a Stephensonian-styled second-wave cyberpunk setting. It even plays a part in Neuromancer.
It has its dark parts along with strong facets of common cyberpunk themes: drastic social stratification, the social acceptance of regular drug usage, urban decay meets technocratic renewal, a renewed definition of suburbia, and a greater acceptance of non-binary genders.
I like it, especially because there's a strong artist population there, and a common theme in cyberpunk is showing what non-techies' lives are like (to contrast with the main characters' lives), and quite often those non-techies are artists.
> How do you do that without some basic understanding of computer science-y stuff?
> How do you define "scalable", how do you measure it? How can you have some intuition about a design before we spend 3 months and many sprints building it first?
> How do I know when to cache stuff? Does it matter if I have calls to a remote cache in a tight loop? Should I be using an in-process, out-of-process, or remote cache for a particular piece of data?
You're proving the above poster's exact point. You are putting your weight behind applied questions that rest on the developer's specific experience. This method is the opposite of evaluating people on their ability to memorize a half dozen algorithms and data structures.
In my experience interviewing candidates, asking people to implement a caching algorithm is a distraction for both parties. A much better evaluation is their ability to draw a box-and-arrow diagram and talk it through. This is much more effective for understanding their thought process and knowledge. It is also much, much closer to the _real_ day-to-day of today's engineer: communication, advocacy, and breadth of knowledge. Code is cheap. Businesses should screen candidates for interest instead.
CS textbook questions introduce enormous amounts of bias, especially in panel interviews. They are a dangerous trap that companies use to further entrench their team cliquishness and departmental monoculture. It is ripe for Simple Sabotage. Simply put, it's lazy.