TikiTDO's comments

Ooh, this is a nice link. Totally going to add some of this stuff to my practice.


It doesn't need to be two huge models. If there is an advantage to doing this, I'd expect that you would see it even in a small test case. I'm sure we'll see something by the end of the week if not earlier if there's something to it.


One of the most significant quantization papers of the last year [1] found precisely that these outliers only start occurring in LLMs at 6.7B parameters and above.

One of the most important keys to the success of deep learning in the last couple years has been the fact that emergent features exist after certain scales, so I wouldn't be too quick to dismiss things that don't help at smaller scales, nor would I be certain that all the tricks that help in small data/parameter regimes will necessarily help in larger models. Unfortunately!

[1] https://timdettmers.com/2022/08/17/llm-int8-and-emergent-fea...


Looking at that paper, they appear to be saying that 6.7B is where the problem becomes so intense that no single quantization method can keep up. From what I gather, the paper claims that such outliers start occurring as far down as 125M param models, then at around 1.3B they begin to affect the FFN, and at around 6.7B the issue really becomes apparent because "100% of layers use the same dimension for outliers."

So while you obviously wouldn't be able to conclusively prove the idea fixes the issue in larger models, if you know what you are looking for you should be able to validate that the method works in general down to very small models.

That said, consumer grade cards should be able to train an 8B model with quantization, so you might as well train the whole thing.
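To make the outlier problem concrete, here's a minimal sketch of per-tensor absmax int8 quantization (the function name and values are made up for illustration, not taken from the paper): a single emergent outlier inflates the scale so much that all the ordinary weights collapse toward zero.

```go
package main

import (
	"fmt"
	"math"
)

// absmaxQuantize maps float values to int8 using a single per-tensor
// scale derived from the largest absolute value (hypothetical sketch).
func absmaxQuantize(xs []float64) ([]int8, float64) {
	maxAbs := 0.0
	for _, x := range xs {
		if math.Abs(x) > maxAbs {
			maxAbs = math.Abs(x)
		}
	}
	scale := maxAbs / 127.0
	q := make([]int8, len(xs))
	for i, x := range xs {
		q[i] = int8(math.Round(x / scale))
	}
	return q, scale
}

func main() {
	normal := []float64{0.1, -0.2, 0.3, 0.05}
	withOutlier := []float64{0.1, -0.2, 0.3, 0.05, 60.0} // one emergent outlier

	q1, _ := absmaxQuantize(normal)
	q2, _ := absmaxQuantize(withOutlier)
	fmt.Println(q1) // small values spread across the int8 range: [42 -85 127 21]
	fmt.Println(q2) // outlier forces a huge scale; small values collapse: [0 0 1 0 127]
}
```

This is why mixed-precision schemes end up isolating the outlier dimensions instead of sharing one scale across the whole tensor.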


The reason it might need to be huge is that the long tail of extreme weights might only begin to show up at that scale, but yes, best to just start with something you can run on a laptop.


I think this problem comes down to two core issues: discoverability and terminology.

You're going to be lucky if a paper from the 70s or 80s is available in a searchable database at all. That means someone bothered to scan it in, and OCR it since then. Even for the few papers that are searchable, they are old enough that they probably won't catch anyone's eye unless they are desperate.

Of course then there's also the problem of knowing what to search for. Programmers love to invent, reinvent, and re-reinvent terminology. It's only gotten worse with every other developer running a blog trying to explain complex ideas in simple terms.

The entire field of ML is a perfect example of this. I remember talking to my father about all sorts of new developments in ML back in the early 2010s, and I was quite surprised when he told me that he learned a lot of the things I was talking about back in the 80s just named a bit differently.

In most cases it ends up being a question of how much time you can put into any given problem. If I spend two weeks to find a paper that would have taken me a week to reinvent, am I really ahead? If the knowledge wasn't important enough to make it into textbooks/classes/common knowledge, then attempting to find it is akin to searching for a particular needle in a pile of needles.


I have never come across a popular CS paper that was not available on the web, for what it’s worth. Maybe some of the lesser known papers are lost, but all of the important ones, such as Codd’s writing, are very easily accessible with simple search engine searches.


The important and popular ones are absolutely available, but those are usually important because they have entered the realm of "common knowledge," at least in a particular sub-field. These are going to be at the top of the list when it comes to digitizing useful historic records. It's fairly easy to OCR a PDF, so as long as someone with some time decided "hey, this might be useful" then you'll probably be able to find it.

If you're doing databases then you've almost certainly been exposed to Codd's work, if not through his papers and books, then at least through textbooks and lectures. There are countless blogs, lecture series, and presentations that will happily direct you there.

The challenge is that there's also a mountain of work that never really got much popularity for whatever reason. Say a paper was ahead of its time, or was released with bad timing, or simply kept the most interesting parts until the end where few people might have noticed. It's these sorts of gems that are hard to find. It's hard to even know how many of these there are, because they are by definition not popular enough for most people to know about them.


That really depends on quite a few other factors: how big is the team? What development methodology do they use? Does the leadership understand how to manage and direct a rewrite? Are there people that understand the full scale and scope of the system? Does the system interact with legacy components that can't be modified? Are there political factors in play? These are just a few of the questions that can change the outcome of any given rewrite.

You mentioned hidden bugs, but what about hidden "features" that may be a critical part of existing business processes for core parts of the company? Developers really like to believe they are at the hub of the wheel due to the complex work they do, but a lot of the time they are not the ones that actually create the cash flow.

I've been part of rewrites that have succeeded tremendously, but I've also been privy to utter failures that have cost millions, and led to entire teams getting sacked.


10 years isn't really all that much, is it? From my experience that's around how long it takes for developers to get a big head about how much they know, but about 5 years short of what it takes to learn to respect how much they actually don't know about the different aspects of the field, and the real scale of the challenges that have to be solved (both technical and human).

Also, not all experience is equal. Someone that's spent 10 years working on 4 or 5 different systems in totally different problem domains, written in totally different languages, and operating in totally different ecosystems is going to have a very different view of development from someone that's spent 10 years doing essentially the same thing over and over again.

This guy seems to have a very focused view of the correct approach to problems. He's familiar with the tools that Linux offers (which I agree are great), but he doesn't seem to respect the degree of specialization it takes to use and maintain those tools effectively at scale. There's also no mention of the cost to rebuild existing systems in terms of developer time, the mental cost of re-training all of the developers, or the time to migrate and train the users.

Ironically, I remember getting into debates like this back in the mid-2000s when I was first starting to think I had it all figured out. The points I made back then were more or less the same things I see now in the article above. It's quite nostalgic, though it definitely makes me feel older than I like.


Just looking around, general available figures for public internet (as opposed to tor) suggest that anywhere between 0.1% to 1.0% of users have JS disabled. These numbers have also been consistently going down over time. That's a fairly small number to dictate how a system should be designed.


That depends on your target demographic. JS is more frequently disabled among tech-literate customers, so a cloud provider's home page would probably benefit from working without JS.


> These numbers have also been consistently going down over time.

That trend might reverse if vulnerabilities like these continue to surface.


Right. It’s like designing for any other tiny group: color blind, blind, people who don’t read any of the 3 languages your site is already translated to, etc.

I’m not saying that shouldn’t be done, but business-wise it’s usually best to instead add design changes for the latest smartphone screen.

The web isn’t a hypertext graph anymore; it’s a large JavaScript program with a thin HTML front end.


Was Zen 2017? Last I saw had it dropping in October of 2016. I actually decided to hold off on an upgrade for it.


I'm not actually sure. It might very well be Q4 2016.


> I'd also argue that if tension 1 were really a problem (i.e., Reddit staff were wrong), Reddit would be obviously going downhill, while tension 2 can fester as organizational debt for years before exploding, if everyone is well-intentioned.

I have found the issue to be not so much a matter of the two factors you've outlined, but more a consistent trend among the admin staff towards a stronger disconnect with the community. There has been less communication, and the communication that has happened has been less clear and less consistent. Even in this entire drama, reddit's response has come through a single point of contact.

For a site like reddit to work, the administrators really need to be able to also participate in the community at large. They need to have firm, definite rules and guidelines of what they will and will not do, and how they will or will not help. They need to make themselves available to the volunteer staff that help run these numerous communities.

This I think is the root cause of both of these tensions. The community simply doesn't know what to expect from the admins anymore.


I took a look at the website you linked at the bottom of your post.

The first thing I noticed is that most of the front-page articles are testimonials. Fortunately there was an article written by the author right at the start, but as soon as I started reading I noticed another problem. Almost every reference is to other articles on that same site, which themselves link deeper into the site. Occasionally I'd hit an article from Psychology Today (a popular psychology magazine, not a journal) or some other popular media source, but I did not find any references to proper scholarly articles.

A quick search of Google scholar turns up no articles to back up most of the major assertions he makes on here. Now that's certainly not enough to dismiss the site outright, but it's certainly enough to make me question what exactly he found, and how well he is interpreting the existing results. I have no trouble accepting the existence of porn addiction, but I do believe that making a case as strong as the one you seem to be making requires much stronger evidence than what you have presented.


> shouldn't it be a commonplace thing that doesn't take so much work to get around to explaining and using?

Why? It's a reality of the world that more complex things take more time and more effort to learn. However, often that is because these more complex things allow you to do a lot of very useful things much more efficiently. I would prefer to drive over a bridge built by an engineer who learned all those difficult equations, material properties, and building codes as opposed to a high school kid with a few physics courses under his belt.

Programming is similar in some respects. As you get better and better you acquire more and more tools to do what needs to be done. Granted, if you are working on an interface that needs to be easily accessible to the widest range of people, it makes sense to simplify. However, cleverness has its place in code that is expected to be read by specialists.

In the end, even if you avoid all the clever tricks and shortcuts you know, a large enough project will still be utterly inaccessible to a novice. The real challenge of projects that complex becomes less about the specific detail of how a piece works, but more about how all the pieces work together. If you're skilled enough to follow the design of a project like that, I don't think it's too much to ask that you either know these "clever" techniques, or you should be willing to learn.

Looking at the code you linked in the article, I think part of the problem is that there are entire pages of code without a single inline comment. When you're doing these clever things you really need to document every logical step in order to understand and verify your thought process later on. You also have to be ready to accept that sometimes you will mess up in your cleverness. In fact, if you are hitting a lot of edge cases, that's a good signal to go back, re-read your comments/design notes, and find where you could improve your approach.

Ironically, I would argue that Go channels are actually an example of doing something "clever" the correct way. Channels are very effective at separating a single concept out of a whole pile of abstractions, doing a lot of clever work under the hood to ensure it's all effectively synchronized. In other words, using Go channels is using the same type of "clever" techniques once they've been abstracted away.


" I would prefer to drive over a bridge built by an engineer who learned all those difficult equations, material properties, and buildings codes as opposed to a high school kid with a few physics courses under his belt."

I think this kind of analogy is misleading. Those things are more like the equivalent of understanding data structures and algorithms, performance estimation, being able to use a profiler effectively, etc. Civil eng is very conservative in terms of the kinds of language and graphics that can be used to express a design. Anyone doing the equivalent of currying or macros (making up one's own language) would be thrown out. I would think it's probable that when programming is as old as engineering, its modes of expression will be similarly limited/standardised.


> Civil eng is very conservative in terms of the kinds of language and graphics that can be used to express a design.

I would argue that programming is far more specific in terms of the kind of language that can be used, too. In fact each such language tends to be described in exhaustive specs.

> Anyone doing the equivalent of currying or macros (making up ones own language) would be thrown out. I would think its probable that when programming is as old as engineering its modes of expression will be similarly limited/standardised.

Both fields are very broad when it comes to what can be made using those languages. A civil engineer may use his language to build a house, a sky-rise, or a nuclear power plant. Each of those will have different complexities, and different requirements of knowledge and qualifications. In fact I imagine the engineer working on the latter will know how to do a lot of things that the engineer who works on the former would consider to be akin in complexity to currying and macros.

The situation is the same in programming. Some people may be working on projects where currying, macros, and other such techniques are a major benefit. These are, after all, extremely powerful tools. Just like with the civil engineer, the challenge is knowing how to use them properly.

