A project with a single 11,000-line code file (austinhenley.com)
498 points by todsacerdoti on April 3, 2022 | 329 comments



I remember many years ago coming across a reimplementation of the server side for a popular MMORPG of the time, reverse-engineered from the client (which was Flash) by what was likely a teenager --- it was over 100k lines in a single file, written in Visual Basic. Global variables everywhere, short names, and not even indentation. All the account data was stored in flat files, there was no actual DB. No "best practices" at all. Yet, not surprisingly, it worked pretty well and was actually not difficult to modify --- Ctrl+F would easily get you to the right place in the code.

I guess the moral is, never underestimate what determination and creativity can do, and always be skeptical when someone says there's only one best way to do something.


Ironically, this is a description of hacker news itself. https://github.com/shawwn/arc/blob/arc3.1/news.arc

(HN has indentation, though.)

It’s important to realize that this is good design. It’s hard to separate yourself from the time you live in, but the rewards are worthwhile.


This is a good point, but there's also a key difference.

There's a big difference between "code being in one file" and "code being in one function." It sounds like the OP had something reasonably close to "one function," whereas the HN code has a lot of (what appear to be) small well designed methods.


I'm not seeing the irony. The file you linked to is only ~2600 lines of code and seems to be well-modularized and fairly well commented. Assuming that's the whole site, I think that's pretty reasonable.


The only readability issue I have with that is the functions' expected arguments. Add some types and I’d be very happy to work on it. I believe Facebook uses a single directory of files now as best practice? With the file names including namespaces. That was an HN comment from ages ago so could be wrong or misinterpreted.


pg's new lisp, Bel, has something close to typed arguments:

  (def add1 (x|int)
    (+ x 1))
http://www.paulgraham.com/bel.html

I've been implementing it for a couple years now, though not seriously till the past couple months. There are some interesting (and overlooked) ideas in Bel.

Bel is sort of the limit case of generality. For example, you might expect the "type" above to be a separate kind of object, the way that types are separate kinds of things in TypeScript.

But in fact, it's simply a function that receives the argument and can throw an error. So for example, you can do something like:

  (def positive (x)
    (if (< x 0) (err 'negative) x))

  (def sqrt (x|positive)
     ...)

I just wish he'd solved keyword arguments as thoroughly as every other kind of argument. There are hints that it was always in the back of his mind. Though it's true he never needed them, so that's probably why he never added them.


It's nice to have invariants - I believe I've seen them in Racket's contracts but I'm a little away from that stuff in my Lisp journey :)


That doesn't really allow for type checking, which is a major purpose of types. It's more just some nice sugar for runtime assertions.


Well, you can interpret the predicate function body as a set of constraints on the type. Of course, such a predicate function would need to be restricted to what your type system can handle. Typed Racket does this by allowing you to implement type refinements[1]. As long as the predicate only uses operations listed there, it can be used for type checking. Idris also lets you write functions that operate on types and that are used for type checking.

[1] https://docs.racket-lang.org/ts-reference/Experimental_Featu...


> I believe Facebook uses a single directory of files now as best practice

Do you have any sources for this?


2600 lines is not 100k lines.


This is amazing. Are there syntax highlighting and linter kinds of things for the "Arc" language? If possible, I would like to try that. I love how Lisp looks.


Indeed there is. I've been using this vim plugin for over a decade: https://www.vim.org/scripts/script.php?script_id=2720

Which IDE do you like? I'll see about getting some highlighting for it.

As for actually running arc, it’s hard to run the original arc3.1 due to racket updates. I’ve made a few branches over the years that try to preserve the original spirit of arc (no significant changes) while making it easy to run. Try this one:

https://github.com/tensorfork/tlarc

I believe you can simply install racket, then run make && bin/arc and be dropped into a repl. From there you can follow the arc tutorial, whose link I’ll dig up after I’m finished driving home.

EDIT: That fork is actually a lot different from arc3.1 proper. I’ll try to locate a more faithful one.

EDIT 2: Unfortunately it's quite a lot of work to remove the mzscheme dependency from the old arc3.1 codebase. And I'm not sure it's even possible to install the mzscheme lib on the latest racket (e.g. `brew install racket` doesn't seem to have it).

So the above instructions are the best I can do for now.


I use Visual Studio Code; I couldn't find it in Extensions. I'm just asking out of curiosity: What do you develop with Arc usually?

I like how the "html" and "css" part was embedded in that "news.arc" file. Do you think that VIM script will highlight and lint the "css" part of an "arc" file?


> What do you develop with Arc usually?

I try to use Arc for as much as possible. We wrote our TPU monitoring software in it: http://tensorfork.com/tpus

Eventually I became frustrated with Racket's FFI. So I made my own arclike language called elflang: https://github.com/elflang/elf

... which itself is a fork of Lumen (https://github.com/sctb/lumen) by Scott Bell.

The performance is good enough to run a minecraft-style game engine: https://i.imgur.com/iyr0YrB.png which was satisfying.

Nowadays I've been trying to implement Bel, more for the challenge of it than for any practical reason.

> I like how the "html" and "css" part was embedded in that "news.arc" file. Do you think that VIM script will highlight and lint the "css" part of an "arc" file?

Nope. https://i.imgur.com/o9aUG6j.png

But it has one very important feature: it can properly highlight atstrings: https://i.imgur.com/wO4f742.png

It's probably hard to tell, but the "@(hexrep border-color*)" would normally be highlighted as if it were a string. Arc has a feature called atstrings, where you can use @foo to reference the enclosing variable "foo". It can also call functions, e.g. "The value of 1 plus 2 is @(+ 1 2)" will become "The value of 1 plus 2 is 3".


Looks totally fine to me. I'd say it is written very well.


Wait, hn is written in Lisp?


Funnily enough, I have recently had great success by reversing the "best practices" of a distributed "microservices" architecture application and collapsing it into a single big Java file.

The best practices were the usual suspects: DRY, IoC, SQL + NoSQL, separation of concerns, config files over code, composition over inheritance, unexplainable overlapping annotations, dozens of oversimplified components doing their own thing, and some $something_someone_read_on_a_medium_post.

The Single Java File was around 500 lines: no DB, lots of globals, a dozen or so classes and some interfaces, threads for simulating event-based concurrency, and generous use of Java queues and stacks, but I specifically made it static with zero dynamic hashmaps.

It actually runs in my IDE, I can understand what the hell the product is supposed to do and which component is doing more than it should, and, more valuably, I can predict what could break if I change that value in the Helm chart from 5.0 to 5.1.

It is quite useful and pleasing: I can actually reason about things, and I have a newfound use and appreciation for type systems and compile errors. And I can write tests that run in under 3 seconds.


Having the whole project actually in one place is critical. I think some of these “best practices” are actually very useful when applied with caution. But you sometimes need to break the rules. Everything should be optimized for developer convenience. Convenience in deployment. Convenience in debugging. Convenience in refactoring. Only do what HN and FAANG say is “right” when you need to.


This application must be really really simple ;), so no database?


You'd be surprised what garbage overhead brings into an app architecture...

These best practices really only make sense in large organizations, i.e. Conway's law.

After all, you can't really ask 100 developers to all add code to one file in a couple weeks - they will spend a month or so just resolving conflicts...

100-file repos are designed so that 100 developers can edit them (in theory) and have relatively few conflicts, not because it's better code.

As another anecdote, I find that whenever I code solo I can easily spin out thousands of lines of code within a week (including testing), whereas when coding on a project with many other devs my rate drops to maybe 200 a week, just because so much time is spent interlinking other code and finding and fixing bugs and tests strewn across many files...


No and no.


Presumably VB.NET? Because the VB6 IDE wouldn’t let you write more than 65534 lines [1]. Don’t ask how I discovered this.

[1] https://docs.microsoft.com/en-us/previous-versions/visualstu...


>Don’t ask how I discovered this.

Like most of the world at one point, it ran on Excel 2003

/s


I too learned this lesson the hard way.


Came here looking for this. <3


It always seemed surprising to me how some of the big Oblivion and Skyrim mods would get by with fairly few bugs, despite there being no way to have automated tests and some of them having 10k lines of scripting (or much more in some cases) spread across dozens or hundreds of quests. (Quests in the CE engine are not just the quests you see as a player, but also a huge number of invisible quests, because quest state machines and their associated scripting are how scripting works.)


I think it makes sense. Modders get a kind of obsessive testing from their communities (including themselves) that devs in most commercial contexts couldn’t dream of. Skyrim mods are, if anything, correcting for bugs in the game.


For a single modder, it may have to do with being a single developer working on it over a long period of time, and with it being a passion project.

For a while, I was the only developer working in a small module of a bigger project: I started the code base, discussed requirements with clients, implemented the needed features, tested the whole product end to end. I developed a very good instinct about it and about what any change would do, much better than any other project. My theory is that the code base matched my way of thinking, so thinking about it was pretty easy.


Fewer developers (in charge) make for a more stable product - even open source products fare better when they have a long-term maintainer versus when ownership changes frequently.

The other thing is style - when used well, state machines don't require testing. There's nothing to test - either your machine works or it doesn't, there is no point testing state transitions because that is the fundamental job of the state machine. You may as well test that addition works.

Ofc, they must be used well - problems may be difficult or overly complex to model as a state machine or even set of state machines - the pattern excels for small problems, less for large ones.
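
For concreteness, here's a minimal sketch (Python, with made-up quest-style states, not from any particular game) of the kind of table-driven machine I have in mind - the transitions are plain data, so the only thing left to "test" is the table itself:

    # Hypothetical quest-style state machine: the behaviour lives in a
    # transition table, not in branching code.
    TRANSITIONS = {
        ("not_started", "talk_to_npc"): "in_progress",
        ("in_progress", "collect_item"): "ready_to_turn_in",
        ("ready_to_turn_in", "talk_to_npc"): "completed",
    }

    def step(state, event):
        # Unknown events leave the state unchanged.
        return TRANSITIONS.get((state, event), state)

    state = "not_started"
    for event in ("talk_to_npc", "collect_item", "talk_to_npc"):
        state = step(state, event)
    print(state)  # -> completed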


This description has been my life for the last 4 years.


What are you working on?


I have observed that bugs tend to be introduced when you have more than one person working on a project. Single-person projects have very few bugs, especially if the coder is experienced and follows simplicity + structured programming.


You can get away with a lot on a single-developer project, and best practices aren’t in place solely to make code functional.

That application would likely fall apart if multiple developers with diverse backgrounds had to maintain it and add new features.


I don't know what you mean exactly by "diverse backgrounds" and it doesn't matter in this case either, because there were definitely multiple people working on it (although the initial version was the work of one.) They effectively used a forum thread as source control, and just attached their modified versions to the posts.


To be fair even with using best practices, code can still fall apart with multiple developers from diverse backgrounds.


The conclusion, of course, being that if you need multiple developers the code is less likely to fall apart if they have similar backgrounds ;)



Came to say this. I actually find QuickJS pretty easy to understand and modify, in part because everything is one file and it's easy to search, using vim's `#` and `*` keys, for example.

When working on projects by myself, I like putting everything in one big file too. Trying to find the "right" place for something is some unnecessary overhead, not to mention the navigation cost. It's a different story when a team is involved though.


Circa 2006, I was in college, and I got hired to write a webapp for a college department. I didn't know JavaScript could have classes and capture variables, so I made the app with entirely global variables and plain functions combined with `eval`. It's over 2000 lines, and nobody after me could understand it.


I suspect the game was Runescape. My brother used to be a fan of these custom servers.


A likely guess, although RuneScape isn't Flash based (Java & RuneScript).


Oh right.


I think the game is Dofus, a French MMO ;)


Ah, that takes me back. Commodore 64 freeware games written in Basic.

Ya, you could just go in there and mess with the code all over the place.


me when i look back at the code i pumped out as a 15yr old writing a java servlet web app that could admin a quake1 tf server game


Back in the day at Zynga, there was this ritual where new members of the STG (Shared Tech Group, which developed the game engine stack) had to try to refactor the road logic code.

Suffice it to say, it was a 28k LOC file so bad that it could even hold up in court as evidence that a South American company stole the code of Zynga's -ville games. We could reproduce each and every single bug and its effects 1:1 in their games, with all the crashing scenarios that were easy to reproduce, hard to debug, and almost impossible to fix.

Once you dig into the hole of depth sorting and being smart by "just slicing" everything into squared ground tiles on the fly, there's no way out of that spaghetti code anymore.

Fun times; it was always a joy seeing people give up on a single code file. The first step to enlightenment was always resignation :)


To be fair, the first step towards refactoring is understanding the existing code -- ideally, knowing everywhere it is used, all of its behaviors, and importantly, its history, so that you don't break anything, and so that you don't reintroduce bugs that have already been fixed over the years. Or, in lieu of all that, a robust automated test suite.

This cannot be done with a file containing 28k lines of code. That is an insurmountable task. They may as well have been asked to start from scratch and build a new engine.

I'm curious what the purpose of this ritual was. Was it just hazing, or was the thought that someone might actually be able to accomplish this?


It is possible to write tests in a single-file program. You could for example have a -test flag that runs all the tests when the program starts. It's never too late to introduce tests: the next bug you fix, start by writing a test that detects the bug, then fix the bug - and confirm that the bug is fixed by running the test. Then you never have to worry about a bug you already fixed showing up again.

The tests will build up to a decent test suite over time. The next step is to also write a new test for each new change or feature. Keep doing this and your program will soon have full test coverage.

The trick is to not test small units - instead write the tests as if a user was using the program, so that the actual tests cover a lot of code all over the place, not just the code you added. Mock key presses and button presses. Then when a user reports a bug, you can write the test so that it repeats the user's actions.
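As a rough sketch of that -test flag idea (Python, with a made-up toy "program" - not from any real project):

    import sys

    state = {"count": 0}  # the program's globals

    def press(key):
        # The (toy) program logic: handle a user key press.
        if key == "+":
            state["count"] += 1
        elif key == "-":
            state["count"] -= 1

    def test_user_can_count_up_and_down():
        state["count"] = 0
        for key in "++-":  # mock the user's key presses
            press(key)
        assert state["count"] == 1

    def run_tests():
        test_user_can_count_up_and_down()
        print("all tests passed")

    if __name__ == "__main__":
        if "-test" in sys.argv:
            run_tests()
        else:
            print(state["count"])  # the normal entry point would go here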

What makes testing hard is a lot of side effects, like the program writing to databases or calling external APIs - not the LOC of the source file. You might want to mock an automatic call to the fire department for a fire control program, but for API calls and databases, just have the test write to the prod environment, but include rollback/cleanup in the test. That way you don't need a separate testing environment.


> but for API calls and databases, just have the test write to the prod environment, but include rollback/cleanup in the test

I disagree with that. Automated testing should be done on a test database.

Holding a lock for too long could effectively block the entire production. This could happen while debugging through a test (e.g. by hitting a breakpoint or by just stepping line-by-line). Or by simply having a bug in the test causing a "transaction leak" - depending on the tech stack, this could keep the transaction and the associated locks alive until all tests have finished running, not just the one which had the bug.

Or you could commit instead of rollback by mistake.

Or you could simply put unexpected strain on the database, affecting the performance of real users.


It can be a lot of work creating an isolated dev/test environment if you already have a large app that communicates with a lot of services. But if you can, that is preferable, as the tests will otherwise create strange artifacts - but making sure those systems don't get corrupted is probably someone else's job ;)


I would expect people new to the code base to look at this gargantuan file and ask why it hasn’t been refactored, and this probably was seen as an opportunity to familiarise new hires with the code base, as well as get them on the same level as to the effort involved in a refactor.

I agree with you, a rewrite is probably how they should have tackled this one.


> Was it just hazing, or was the thought that someone might actually be able to accomplish this?

Actually no. Most devs that had just started familiarizing themselves with the codebase wanted to refactor the file and came up with the idea themselves. Usually they thought it was a crappy file and this must be an easy task to do, because they saw all the nested if/elseif/else statements in the code.

The problem, architecture-wise, was that the road logic was the glue code that integrated a lot of different parts, layers, and NPC behaviours from the rest of the codebase as it was changing the surrounding game world.

If there was a hospital placed with a non-squared ground tile next to it, if it was placed with a 1 offset (roads were 3x3 tiles), if it was placed with a 2 offset next to another road... It went as far as influencing the path heatmap that was necessary for the A* guessing algorithm to make the NPCs walk correctly on the sidewalks. The permutations of possible sidewalks alone were enough complexity on their own...

So in a lot of ways necessary features that historically had no place in the Entity/Component based engine at some point made it in there.

The next best thing (and also spaghetti code) was the Cursor Entity, which had to have line tracing algorithms to be able to select things that are visible under a donut-like shape when the user was hovering over the hole, or, say, a tree in the game world. Convex and concave shapes were integrated, and lots of edge cases in there, too, which are actually huge mathematical problems in terms of available performance once you dig more into it, so we ended up with binary height map sprites that helped both the slicing and the cursor at some point.

The important lessons learned from the road logic were very valuable for newbies, as it taught the practical problems of isometric game worlds.

So afterwards everyone was able to grasp why the complexity was added, and what was necessary to remove it (in future sprints).

At some point we decided a couple of things because of the road logic and cursor entity for new iterations of the engine, like:

- always use a 1x1 road tile

- always use square based tiles for all objects

- don't make sidewalks, use just road tiles

- don't make trees with holes in their leaves

- don't make trees higher than the buildings

- no artist can ever request crosswalks. Never ever.

...etc


A 28k LOC file can be modular.


It's impossible to refactor spaghetti code without a comprehensive test suite. But you can do it with a test suite - I've done it with large code bases.


You sometimes can. Maybe for any legacy code base someone could, but I have tried and failed on more than one occasion. Some people's thought process is just perversely different from mine, and I keep feeling, oh, this is the layer where that happens - but no, every time I have an aha moment I am disillusioned.


If you don't have a test suite, you can't know if you're making progress or making things worse.

Learned repeatedly from painful experience.


Developing a comprehensive test suite can sometimes be hard, especially for code that deals with, say, concurrency, multi-threaded code, locks, 2D/3D physics, video, analog, hardware-related code, procedurally generated content, or ML (meta-language) and the other ML (machine learning), etc.

A lot of edge cases and race conditions would easily slip through; also, a different set of edge cases or race conditions you never considered, and therefore never tested for in your first version, could pop up in your rewrite.


Of course. But dealing with that is why we get paid the big bucks.

I've dealt with concurrency issues. grep is a handy tool to find related synchronization code, then I try to replace it with an encapsulation. In general, I look for things I can replace with algorithms, and things I can encapsulate. And so on.
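
A tiny sketch of that kind of encapsulation (Python, hypothetical names - not from any particular codebase): instead of grepping for locking code scattered through a big file, the synchronization is pulled behind one small class you can reason about in isolation:

    import threading

    class Counter:
        """Owns its lock, so callers never touch synchronization directly."""

        def __init__(self):
            self._lock = threading.Lock()
            self._value = 0

        def increment(self):
            with self._lock:
                self._value += 1

        def value(self):
            with self._lock:
                return self._value

    # Callers just use the object; the locking lives in one place.
    counter = Counter()
    threads = [threading.Thread(target=counter.increment) for _ in range(8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counter.value())  # -> 8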


It might sometimes be hard but I have never seen a case where it was impossible (25+ years of experience dealing with undocumented legacy code more often than not).


If you work with code that isn't easily deterministic - typically but not always ML models, like say speech-to-text or face recognition or classification/recommendation systems, or network-performance-dependent applications like video conferencing - that wouldn't be that feasible.

Almost nothing is impossible to test, yes; however, knowing and being able to mock the data for each test case can be extremely hard, and at some point not worth the effort to even attempt.

The most I have seen these kinds of systems do is statistical testing with reference benchmark/sample data, and maybe monitoring real-world feedback, either telemetry or user complaints.


> If you work with code that isn't easily deterministic - typically but not always ML models, like say speech-to-text or face recognition or classification/recommendation systems, or network-performance-dependent applications like video conferencing - that wouldn't be that feasible

Nah, most ML systems (actually doing something in the world) are mostly just ordinary code, which can be tested like any other code (as you put it into functions, etc). The models themselves are pretty awkward, but you can normally freeze the model and just use that to ensure that things stay working while you refactor, and then re-run a few (10+) times to check coverage and intervals and stuff.

It tends to be more difficult, as many DS/data people are not software engineering focused, but it's not impossible.


I haven't heard of any method to test the model apart from statistical analysis of reference/training data.

The model is what gets continually updated and is the critical path that needs coverage. Testing interfaces is trivial, and at times not critical if they have already been running in production for a while (you probably have already caught most/all issues and know what to test or take care of in an interface rewrite).

It is not about being impossible; here is an example. Let's say you are working on an English speech-to-text model, and the next version works better on your set of benchmarks.

It could, for example, perform very poorly (compared to your previous model) for accented English or English mixed with other languages, for older people or in noisy environments like a car, or for specific subjects like medical/legal dictation, and so on - and since your benchmarks originally didn't cover these types of scenarios, you wouldn't know one way or another.

These were real cases, all added to speech-to-text models after user feedback, adequate demand being identified, and research effort being put in; now training/benchmark data includes them. There are plenty of scenarios not yet solved (mixing two languages is an active area of research), or not included because user feedback didn't capture them, or not yet worth solving.

Neural network testing is hard because by design they have millions (and these days billions) of parameters as inputs and you cannot feasibly test every possible outcome; you will not know all the things to check until people start using your app in ways you never thought of.

NN/ML is not a hard requirement; this is true for any complex system. Shazam-type fingerprinting, for example, is just spectrography and Fourier transforms; NN is just the newest tool devs use. All complex systems with thousands of parameters or more have the same problems.


Which means it's often difficult. The traits "has a comprehensive test suite" and "is spaghetti code" are rarely seen together. A poor code base has become poor because it's not refactored and cleaned up all the time - and that's often because there are no tests to help with that.

And if there is no test suite, there are often very few ways to add a test suite. Poor code has very few points where you can attach a test. If the code contains file databases or structured input of some sort (a web page), you can add some very high-level end-to-end tests. But not all code has easily verifiable endpoints like that. Perhaps my bad experiences come from "hard to test" domains (sound, drawing, ...), and not "given this input, this is written to the database and this is written to screen".


My experience as well. Also the main recommendation from the most excellent “Working Effectively with Legacy Code” book.


Man, now I want to test myself against your road logic code. Sounds like a worthy challenge.

Always tricky though when the hacks have both undefined features AND bugs.


If you're looking for obfuscated code to refactor, nroff[0][1] is pretty notorious.

[0] http://dtrace.org/blogs/eschrock/2004/07/01/real-life-obfusc...

[1] https://www.youtube.com/watch?v=l6XQUciI-Sc&t=1h28m35s


This could be like that "endless civ2 game save" where OP thought he was in a permanent stalemate but random internet civ2 veterans found it pretty easy to win.


So afraid to write bad, spaghetti code, I ended up writing no code at all.

This thread made me realize that it's better to have a working profitable project with bad code, than a perfect unfinished project, with meticulously chosen design patterns.

Afraid of being judged for bad code, I could not start until I had the right architecture.

I'm glad I read this.

This is developers therapy.


I've sadly come to realize (after witnessing on many projects) that there's a pattern that goes like this:

* Team A writes code quickly. Not bad code, really, but they take shortcuts everywhere they can. They don't have the strongest tests, they don't generalize for all the known use cases, etc. Their code goes to beta and gets users and makes progress.

* Team B deliberates and deliberates. They try to avoid taking shortcuts. But in the end, even their code doesn't have the strongest tests, doesn't generalize for all the known use cases, etc. Team B never gets users or gains momentum, and their code+architecture was probably no better than Team A -- they just took 3x the time to get there.


+1

I had a lot of trouble trying to explain this to juniors.

The most important thing is to have code that is easy to refactor when you know what you're doing (i.e. everything is working properly). Juniors I worked with had a nasty definition of pretty code: split into a hundred files, each no longer than a screen, and each function no longer than 5 lines. The onboarding of new devs onto such code was way worse than onto code that is 10k lines in one file but with a flat structure and less interdependency.


"Flat is better than nested" - The Zen of Python

I had an "everything should be broken into a hierarchy!" stage back when I was learning to code, and boy was I off track. In my defense, at the time (and this dates me) OOP was all the rage.


I find OOP spaghetti can be incredibly difficult to navigate. Example: class hierarchies 4 or 5 layers deep, some subclasses overriding the parent, others not. It can be very difficult to follow what's actually going on.

Procedural spaghetti is more manageable, though I once had to update a C app with a 3000 line case statement. Pure madness.


I remember working on a project with endless exception class hierarchies - that was a monumental pain when trying to diagnose what led to an exception.

In a subsequent project we banned subclassing of exceptions.


Inheritance hell would be more lasagna code.

Golang does a good job of addressing this one particular problem.


This depends a lot on the development environment. In Smalltalk years ago, we were encouraged to limit each method to few lines at the most. This makes sense within Smalltalk and the code browsing inherent in the system.

It makes debugging in Java or C# an exercise in face shredding frustration. Where each class is in a file, it's better to structure the class consistently with other classes in the project, and things like naming conventions become a lot more important.

You could argue that the C/C++/Java/C# languages are fundamentally broken because they don't encourage the succinct, small class methods that Smalltalk did, but you could also argue that those small methods don't necessarily work very well in a different class of languages from Smalltalk, and that neither approach is really more productive than the other - with the caveat that Smalltalk is largely dead and irrelevant to modern programmers other than as a curio.


> The most important thing is to have code that is easy to refactor

Very true.

I'll just add that another most important thing is to actually take time to refactor, even when things are busy.

I spend maybe 1/3 of my time refactoring, and that feels good.


I like to refactor once I have passing tests and before committing.

Once you know it works take a minute to clean up and make the changes fit your preferred style, extract repeated code into shared methods, comment the tricky bits, etc.


This Is The Way


> The most important thing is to have code that is easy to refactor

This whole post is about how refactoring doesn't matter because your project's development lifetime isn't long and wide enough for maintenance to matter.


Less interdependency is absolutely the key to everything.

But... isn't the easiest way to show that there is little interdependency to put them in separate files that don't import from each other?


People misjudge where to draw the lines. You will have an orchestration API call that does five things and each of those five things, not used anywhere else, will get its own class, interface, factory, and configuration, so to read thru the five things you have to open like twenty files. And to notice that despite all this engineering they have static credentials in the code itself, you have to be alert across so many lines of code. The whole thing can be one longer file that reads coherently and in fact lessens the cross class importing.


Be Team C.

Team C works like team A. However every time a feature ships, someone who knows that feature well immediately refactors the relevant code to remove the prototype scaffolding. When code becomes static, an expert adds good quality comments. When a bug is found, it is recreated in a regression test prior to being fixed for good.


The sad reality of tech companies is that there is little incentive or bonuses for improving the situation. You won't get a bonus for cleaning up the code or for rewriting rotten code.

Hack at the code for 4 years, collect your options and leave the mess to someone else.

To be honest, a hacky codebase written fast is not the worst codebase to deal with. The worst type is when someone had the time to overarchitect and overengineer things.

Following references across 200 different files, tracing calls through hundreds of microservices. Graphql servers with complex resolver logic.


> The worst type is when someone had the time to overarchitect and overengineer things.

I concur. Refactoring should be as much about removing unneeded abstraction and features as it is about adding same.

> there is little incentive or bonuses for improving the situation.

Yeah, I just can't seem to believe this in my soul. I just want to fix ugly code and can't stop myself. I get huge satisfaction from speeding up, tidying up or fixing up bad code.

It doesn't help when management wants to minimize time spent on such tidy-up, especially when it's hurting our productivity to maintain it without fixing it.


Just do it. Don’t ask for permission. You will end up more productive not less.


This - I've been doing this for years now and simpler code (both simpler logic and the removal of unused code) makes working on legacy codebases feel completely different.


Yep exactly my experience. That combined with solid tests => I spend almost 100% of my time adding new features instead of staring at code and fixing bugs.


I try to sneak in refactoring with other tasks.


I find the best method is:

1. Figure out what needs doing and what code you need to use

2. Refactor the code you're using until the new code or change is easy

3. Make the change

4. Tidy and document.

Repeat

I often also document during step 1, while I am trying to understand code and realise that comments are missing.


4 years is a long time to suffer through inscrutable code.


I don’t think this dichotomy is helpful. I’m presently working at a startup that’s trying to dig itself out of a hole created by the first CTO, who in doing things “quickly” created an MVP so buggy, inefficient, crash-prone, and unmaintainable that we can’t retain customers or engineers. As always, there’s a balance to be struck, and ways to operate quickly that don’t sacrifice quality too much.


I'm kind of surprised that you can't find engineers interested in creating a new implementation of an existing application that is actually used by people. I think that might be my dream role.


At the risk of crushing your dream role, re-implementations can be long slogs. I'm in the middle of one right now. The Product owners don't know what the thing does. The engineers who originally wrote it are gone, and their replacements are relatively new to the codebase. The bright side is customers are hugely interested in our progress to date and we've received positive feedback. The business wishes we could move faster.


> The Product owners don't know what the thing does.

This is the dumbest part. You’d think that someone documented something when they originally built it, but nooo. Don’t even know the requirements, just that it has to be the same as the previous one.


Business doesn’t want us to stop the bus to change the tires - or spend so much time changing tires that the bus never reaches the destination.


Makes me think of Lightning McQueen losing the first race in Cars because he refuses to pit to change his tires, then blows them out on the last lap.


That type of role typically requires senior engineers at a junior salary


If your dream job could be in Charleston, SC, let's chat :)


I think something that is lost in this conversation is that "quickly" maps to wildly different results in code quality depending on the programmer.

It sounds like your CTO did not just operate quickly, but also sloppily and chaotically. From what I am gathering from this thread, the best practice is to move quickly AND stay organized, such that refactoring is reasonable.


There's a YouTube video about beginner musicians vs intermediate vs advanced.

The beginner uses simple chords

The intermediate uses advanced chords, crazy fills and runs and riffs.

The advanced uses simple chords


"First there is a mountain, then there is no mountain, then there is."


That's not far from a similar saying in software: expert developers write code that looks like a beginner's, but simpler and with fewer bugs.


Beginner devs write code that is over-simplified, and needs hack on top of hack to do anything useful.

Intermediate devs write code that is over-engineered in places where it could be simple, while still needing a lot of extensive refactoring in the parts that deal with irreducible complexity.

Expert devs think deeply about the problem at hand, understand where the complexity is, and create a solution that is as simple as possible, but no simpler. After the fact, beginner and intermediate devs tend to think of this code as something trivial, that anyone could do.

This is why it may sometimes be tactically useful for the expert dev to sometimes take on projects that some beginner or intermediate team has been struggling with for some extended period of time, analyze it properly, and show how elegant the difficult parts can be solved.

Care is needed, though, as it can affect the morale of the other devs and even cause hostility. If the expert dev has a secure position in the organization and wants to keep those other devs around, keeping a low profile, as well as letting those devs take as much of the credit as possible, is advised.


"It took me a lifetime to paint like a child." -- Picasso


I kinda agree on this, but not for the implicit reason you're probably thinking of.

Just starting and doing it is just unreasonably effective because very few projects actually need novel solutions - most are just fine with off-the-shelf hacked together solutions.

Thinkers are required if the software is actually groundbreaking new work. Almost everyone's work on this forum probably isn't that however (mine included), which is why I agree with your sentiment


Plus... when someone from Team B initiates the code, with hundreds of files & massive boilerplate, and can't continue the work for some reason (e.g. sickness / resignation), the new dev who takes the position will require a super long time just to understand the code. Some of them will eventually write their own, instead of following the existing code standards, leading to multiple standards in one project - and resulting in hell if it runs into trouble..

Also, when someone from Team A initiates the code and Team B takes over, there have been a few times when Team B felt the code was damn no good and just massively refactored it into what they think is good (re: the boilerplate) without consulting others. Then when Team B leaves, it goes back to the 1st paragraph..

I think as a team we need to consider the learning curve of our own code, because we don't code just for ourselves.. And it's good to know other people's tolerance & acceptance of the 'structure' of the code..


Something this reminds me of, that I've been doing lately when stuck on a particular problem, is just coding something. Even if it's the shittiest, most inefficient and naïve solution. More often than not I either discover a more proper solution along the way or just realize my shitty solution actually wasn't all that bad to begin with.


Start with the simplest idea that might work.



Tests, ha


The Player controller from the game Celeste is a single 5600 line file that includes things like systems only used in the tutorial. I honestly don't think it's as bad as some of the criticism it got when the code was released makes it seem, but it certainly could be better looking code.

But ultimately, Maddy Thorson isn't selling a block of code. They're selling a game and it has extremely satisfying control of the character. And that's all that really matters for a player controller.

Maybe better organization and design patterns could have made it faster to develop? But I don't believe they would.

But also the type of product does matter for this. Celeste had 2 programmers so a lot of the things necessary for a team of 100 devs would just be harmful. If you're making a library/framework to be used and modded by others, architecture matters a lot more. If you're designing an enterprise application that you know will need business logic customizations for 25 business customers it matters more. It's all about knowing the scope of your project. But also until you start getting that many customers, maybe the unsustainable method is what will allow you to reach those first few sales more quickly to be able to stay in business long enough to be bit by the technical debt.


I remember sharing that Player.cs code from Celeste in the gamedev subreddit, and getting all kinds of weird novice comments about how the code doesn't adhere to 'OOP principles', or 'there aren't any unit tests', or 'you should split it into multiple files with 100 lines each', or 'you should use an ECS to make a real game'.

Later on, Noel Berry did give a response explaining the various design choices behind their code:

https://github.com/NoelFB/Celeste/tree/master/Source/Player

Anyways, kudos to the team sharing their code even if it's a bit messy.


I hadn't seen the new Readme update (even though I Googled the repo again to link the cs file lol). Thanks!


I am part of a small team that maintains a legacy point of sale system that is still used by thousands of stores around the world. It started life as a DOS application written in C with some ASM bits, and has since accumulated some C++ and C#. There are functions over 5000 lines long. Files over 50,000. Globals all over the place. It can be a challenge sometimes, but after almost 30 years, it still brings in millions of dollars a year in maintenance and enhancements, and still processes millions of transactions for those retailers.


The world runs on software like that. Hard to maintain “crappy” code that makes $ beats clean pure “perfect” code that makes zero $ every time.


Getting something working and out there is 90% of the battle, especially on small or single person teams. I wrote a saas php app with vanilla html and JS that ran without issue for 8.5 years for a fortune 1000 company. About twice a year I would return to it to add or modify a feature and I had no idea how a lot of it worked and even had duplicated or redundant files that I was too afraid to delete. It worked though and I got paid every month for a very long time. Sometimes delivering a product is all it takes and getting trapped in delivering 'clean' code is just a blocker. Not often, but sometimes :)


I want to hear more from you.


Been there. I worked in a company where we had a codebase like the one mentioned in the article and over the years we started developing microservices with 100% code coverage.

The new shiny services took much longer to identify bugs and add new features due to the complexity of the design and endless interfaces.


+100

I'm so afraid of creating programs in languages that don't enforce any structure at all, even though I know how to write everything from scratch and make it work.

If it's some framework, then it'll already be structured somewhat.

In the rare event that I do create something with no frameworks, I ensure that there aren't many global variables.


You sound like you should read this: [removed]

apparently jwz decided not to be linked from here :/

there's an archive.org link below.


https://www.dreamsongs.com/RiseOfWorseIsBetter.html

For future reference, a non-archive, non-jwz.org link. Straight from the source as that's the author's own site.


Heads up: jwz.org redirects Hacker News visitors (via "Referer") to an image of a slightly-hairy testicle sitting inside an egg cup. So maybe don't click the link at work.



Ok, So I wanted to see it. Quick hack of the link tag in sibling comment, and this is what you see:

NOT SFW! https://cdn.jwz.org/images/2016/hn.png

It's funny. Why does he hate us so?


That’s friggin’ hilarious. What a boss.


Uhh you should know that domain redirects traffic from hacker news to an offensive image.


Put it like this: www.jwz.org/doc/worse-is-better.html

Then just copy & paste it.


From the last time I saw that link on HN, opening it in a private browsing window avoids the redirection.


happy medium: write shitty code with strong API boundaries. at least the damage is localized and every so often you can go back and clean up or replace components independently. or more likely, just leave it that way and make more poorly implemented features that make money.


let me recommend one of my favourite programming blog posts: https://prog21.dadgum.com/21.html


> it's better to have a working profitable project with bad code, than a perfect unfinished project, with meticulously chosen design patterns.

A lot of businesses were built on PHP this way


Restaurant industry version:

    I was so afraid to cook in a dirty kitchen, I ended up not cooking at all.

    This thread made me realize that it's better to sell food prepared on dirty surfaces with unrefrigerated ingredients half-eaten by rodents and roaches that makes people sick, than fresh food prepared on clean surfaces with clean utensils.

    I'm glad I read this.

    This is a restaurant worker story.
Construction industry version:

    I was so afraid of not using the right construction materials and not building code-compliant structures, I ended up not building at all.

    This thread made me realize that it's better to sell houses with structural problems and low quality materials that will be unsafe to live in, than houses built according to code.

    I'm glad I read this.

    This is a builder story.
In any other industry, a person would go to jail for saying that. You won't, because luckily for you, software development is not a regulated activity, and people with your mindset can make a happy living outside of jail. But hopefully one day some types of neglect in software development become illegal.

"Better is the enemy of the worse" is no excuse to have spaghetti code, or 50,000 lines of code files. It means that good is sometimes more convenient than perfect. Spaghetti code is not good to begin with.


Your analogy is flawed. The CPU doesn't care at all what the code looks like: all the nice comments explaining what it does, nice naming conventions, whether it's easy to understand or not - they have zero impact on the final compiled code. The executable ends up as a spaghetti of machine instructions with countless gotos in a single large file.

Using bad ingredients in food, or poor quality materials in construction has tremendous impact on the final product.

>"Better is the enemy of the worse" is no excuse to have spaghetti code, or 50,000 lines of code files. It means that good is sometimes more convenient than perfect. Spaghetti code is not good to begin with.

Just calling code good or bad doesn't mean much - ultimately, results matter. If your code doesn't have tons of bugs, if your team can add features without any problems, if you can ship reasonably on time, if your product delivers value to the end user, etc., then you have succeeded. It doesn't matter what outsiders think about the code or what labels people give it. It's best to ignore them and continue doing good work.


[flagged]


Why do you think you can pressure me into accepting your opinions? I don't want to argue with you so how about this - you do it your way and I'll do it my way. But, thanks for your concern.


If you're writing code that controls radiation therapy machines or trains or autoclaves or something, yah, maybe there should be some regulation and potential jail time for negligence. But 95% of all software written (especially if it's the first version of new software) isn't life or death. If the next social media startup or saas-for-painters is spaghetti code it's not going to hurt anyone.

Failure is cheap in our industry. That's largely a very good thing.


No, that is not the restaurant industry version or the construction industry version. You can't possibly compare "good practices" in software with building codes and food safety regulations. Building codes and food safety regulations are based on facts, science, and decades of experience. Good practices are rooted in ideology and hype, and not based on facts. There are no studies showing that "having tests helps". There are no studies showing that "splitting code into multiple files means that you'll have a better end product". And experienced people (like in this thread) even say that there is often not that much correlation between how code "looks" and how well it works.


> Good practices are rooted in ideology, hype, and not based on facts.

That is a really sad statement that is predicated on the assumption that nothing can be objectively compared and therefore nothing can be ranked, which is also a way to kill arguments that lead to innovation and iterative improvement.

You can measure the complexity of an algorithm, you can measure the cyclomatic complexity of a function, you can measure code in terms of length, you can count references to external functions or modules, etc.

There are many ways in which you can compare code and make decisions about what style is more convenient for your team.

What is clearer for you to understand?

a) 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1

b) 12

If your argument was true, they would both be the same. We both know that a) is a waste of time.


Your example is oversimplifying the problem; I don't think you're proving anything here. I've used tools that calculate cyclomatic complexity, and sometimes they're a good indicator, and sometimes they aren't at all. You can measure the complexity of an algorithm, sure, but unless it comes with a real-life benchmark, it's only part of the answer. Same with the length of the code: you can measure the length, but it doesn't tell you much about how hard the code is to follow.

If you take any style guide, book about good practices, or stuff like that, you'll find that there are some good ideas, and there are some bad ideas. Even with something as simple as code formatting, we still don't know as an industry whether it's better to format everything the same, or to use formatting to convey information. The debates about OO, FP, and static/dynamic typing are endless, and there's no evidence about which is better. There is a rough idea that you need more organization when more people work on code (static types, microservices, more documentation, more isolated parts), but even that isn't really clear.

My assumption is not that "nothing can be objectively compared and therefore nothing can be ranked". It's not even an assumption, it's what I've seen in this industry: we lack data to give objective good practices that go beyond anything trivial ("try to make code easy to read and simple", "give meaningful names to your variables"). There is a wide gap between "common sense" and "cargo cults", which is the gap between easy and complex topics. I would really like it if in the last 50 years we had learned a lot about how to build software as an industry, as people. The reality is that we haven't.


We cannot objectively determine what a good sandwich length is, therefore we should eat 1000-mile sandwiches.


I think your examples are unfair. Getting stuck on needing the perfect architecture is closer to scrapping a building plan because it wouldn't hold up to a 2km asteroid strike than it is building something that will ultimately kill people.

Also, all of your examples have wildly different impacts than a dev "portfolio project". They all cause physical harm to people, which a poorly coded website/CLI tool/etc. almost certainly won't. Unless this person's hobby is writing code for MRI machines, in which case, go ahead and make sure everything is perfect, but that doesn't seem to be the case here.


Basic hygiene when cooking is not perfectionism, and is not only done at elite restaurants.

Dismissing basic development good practices as "perfectionism" is just gaslighting people into believing that any form of thinking is overengineering.


> Basic hygiene when cooking is not perfectionism, and is not only done at elite restaurants.

Ok, so let's continue this analogy on the other one.

It's less like basic hygiene, and more like refusing to cook outside a clean room.

> Dismissing basic development good practices as "perfectionism" is just gaslighting people into believing that any form of thinking is overengineering.

Basic development good practices are something you develop during the "portfolio building phase", not before.


There's being uncharitable and then there's this.

Your entire point rests on a baseless assumption. There's absolutely nothing in the parent's post that indicates that the programs he would create could have the potential to harm humans.


I learned that not all text editors go to the effort of loading file data carefully, with careful underlying data structures, when I tried to open a 67K LOC COBOL file on a 32-bit system a while back. (Sidenote: COBOL has a 999,999 LOC hard limit in the compiler spec.)

So very many editors just couldn't open it.

Some would use so much memory that the system would either freeze, or the OS would kill them.

Some would silently truncate at 65,535 lines.

Some would produce a load error.

Some would pop up with an error indicating the developer thought it was an unreachable state. e.g. "If you're seeing this error... Call me. And tell me how the fuck you broke it."

Others would manage to open it, but were completely unusable - moving the cursor would take literal minutes.

There were exactly three editors I found at the time that worked (none of which were graphical editors). And they worked without any increased latency, letting you know that the developers just thought through what they were doing: vim, emacs, nano.

(A few details because people are probably curious - the vast majority of that single file project was taxation formulae. It was National Mutual's repository of all tax calculations for every jurisdiction that they worked in, internationally, for the entire several hundred years of the company. They just transcribed all their tax calculation records into COBOL.)


Emacs is actually quite poor at opening large files, at least comparatively—depending on the machine, 65K lines may be enough. However, there's an addon, i.e. a ‘mode’, that implements editing of large files somehow.

Vim, on the other hand, does it splendidly: it keeps only a chunk of the text in memory, and iirc the ‘swap file’ that it creates for every opened file, keeps the changes in some kind of a sparse structure, so they can be tracked at various places in the original file. This ‘swap file’ also serves as a savepoint of the editing session, so the changes can still be recovered even if the machine crashes while the user never saves.

Alas, editors still tend to deal badly with very long lines (just in the low thousands characters). IIRC both Emacs and Vim drop into a big think if the user attempts to put the cursor further down that line.


> Emacs is actually quite poor at opening large files, at least comparatively—depending on the machine, 65K lines may be enough. However, there's an addon, i.e. a ‘mode’, that implements editing of large files somehow.

Yeah, essentially this (apparently) mostly occurs when the file has no newlines (like json often does). I think the hacks are around turning off font-lock mode and one or two other things (install long-lines-mode if this is a problem you're having).


Again alas, long lines is not the only large-file problem in Emacs. Though perhaps most of my woes pertain to Org-mode, but I had to look for a solution to edit large files in the past.

This is probably the most current implementation of a ‘view large files’ package: https://github.com/m00natic/vlfi


What editors ended up working?


> There were exactly three editors I found at the time that worked (none of which were graphical editors). And they worked without any increased latency, letting you know that the developers just thought through what they were doing: vim, emacs, nano.


No doubt dozens of devs will throw in their own 10k LOC story here, and yes it's painful to watch so many people having professional cramps over it.

But don't forget that society itself is governed by orders-of-magnitude larger bodies of text with no referential integrity, no machine to tell you if it's inconsistent, and no way to test anything, other than making humans write more text to each other and occasionally show up in court. The law itself, even parts of it like the tax code, and regulations on various areas, are a melange of text and cultural understandings between lawyers, judges and government. We collect the data for this machine in the form of contracts and receipts, and it piles up in mountains.

As with code, it's not just legal professionals who have to deal with law. It spills into everyone's life, and there's nothing to do about it other than either guess what to do or pay a pro to tell you what to do.


You are wrong to say there's no way to test anything. Imagine an enormous AI generating test cases for you constantly, in an adversarial fashion, with built-in rewards for advancing a more correct understanding of the text. Lawyers call this "testing", rightly so. If you are interested in efficiency / cost-effectiveness, it leaves a lot to be desired. But if you are interested in the internal integrity of the document etc., then this is better than almost anything developers have.

I hate these words as I type them but the law is also "agile" (ugh). It gets modified as it's used. It does not need high-assurance machine-verified "referential integrity". In my entire course of studying the law I don't think I've seen a single legal dispute over a problem of referential integrity. Mistakes, especially drafting mistakes, are corrected on the fly pretty much everywhere they appear, and then they disappear. For a dev, using the wrong variable name in a bad language could mean you introduce a huge security vulnerability and massive loss of trust. (Or if you write smart contracts, $100M down the drain.) For lawyers, referring to the wrong section has essentially zero consequences. Nobody cares. Maybe you get a funny look from a senior.

Finally re the 10k LOC tangent that this is supposed to be connected to, I'm not really sure what you're complaining about. You get "10kLOC" cases, but you also get well-organised practice guides & bench books. Laws in statute are typically very well organised, in my experience about 5-10x better than the average codebase. Laws are organising large swaths of the sum total of human endeavour, just as code does. I would say developers are behind overall, which makes sense for a discipline that's less than a century old.


The payroll check printer for my employer was once a couple thousand lines that generated raw PCL to be sent to a LaserJet that used magnetic toner to produce checks that had a working MICR number. It was rendered into spaghetti by multiple GOTOs that jumped to helpful labels like "666", and calls into other helper programs to generate more PCL that did things like change fonts and draw graphics. Of course none of it was commented, so you had to have a copy of the PCL spec on hand to know what any of it did. It was the product of a retired cowboy that had also written the rest of our custom payroll system over a number of years.

I attacked it by printing out and taping together each program into "scrolls" and tracing control flow with highlighters and sharpies. Had them all taped up on my office wall so I could refactor the whole thing from scratch, coworkers found that entertaining. Got a much more readable replacement working nicely. Then a couple years later HR bought a new system and we stopped printing our own checks. I was not sorry to see the whole thing go.


Reading your process, I have that stereotypical TV-series image in my mind of a person so deep into the subject matter that they plaster every wall with notes and pull strings all across the room at head height to hang up ever more notes, kinda like that one NCIS episode (S8 E6 "Cracked"): https://img.sharetv.com/shows/episodes/standard/616591.jpg (although that image is only a small part of that whole view).


A printout of a one-file spaghetti code with gotos is the only case where I can imagine that trope of the wall of connected strings actually being a useful tool.


We call that a Murder Board in my office.

Here's a TV Tropes article on the cliche: https://tvtropes.org/pmwiki/pmwiki.php/Main/StringTheory


I went through something similar during my PhD when my advisor printed out the main code of the program that we were going to work on. At first I thought he was kidding (he's a very chill guy) but... heck, after a few hours of "paper debugging" we discovered a lot of nasty issues, got new ideas, and found redundant and spaghetti code that we hadn't found when we debugged digitally (obviously, we were not a team of CS students/coders, just a bunch of chemists, kinda newbies to coding). It was a really useful and fun approach.


I worked on Word for years. Office has thousands of files over 10,000 lines with, uh, various degrees of test coverage and comprehensibility. After some time and experience, your mental model of the architecture ends up being way more important than simple metrics on source code organization.

IMO, organizing source code in files seems archaic. E.g. tracing the history of a function moved across files can be tedious even with the best tools. I’d like to see more discussion around different types of source storage abstraction.

There are benefits of large source files... When compiling hundreds of thousands of files (like Office), the overhead of spawning a compiler process, re-parsing or deserializing headers, and orchestrating the build is non-trivial. Splitting large files into smaller ones could add hours of just overhead to the full build time.


What's an alternative to files that doesn't just have all the same attributes of files anyway? If it involves breaking code into multiple chunks of related functions, and possibly having these chunks act as namespaces, that sounds like what a file does.


Maybe it's the editors that need updating. It would be pretty neat to have the chunks of functions / namespaces model as lots of tiny separate files on the disk, but a sort of view-layer so that you can view all the related ones in one virtual file. You could even then have multiple virtual files that include the same raw files.

For example in a video game your player, monsters, health potions, and attacks could all have the code for HealthComponent as a part of their virtual file. And updating the HealthComponent would affect the raw file so the virtual files would have the updates automatically. Yeah you can open dozens of editor tabs or always use jump-to-definition, but just being able to scroll around or ctrl+f within a restricted set of limited files would be nice.
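
Roughly, the idea as a toy Python sketch (everything here -- Fragment, the file names, the view composition -- is invented for illustration, not any real editor feature):

  # A toy "virtual file": stitch fragments of real files into one buffer
  # and map any line of that buffer back to its raw (path, line) origin.
  from dataclasses import dataclass

  @dataclass
  class Fragment:
      path: str
      start: int  # 1-based first line included
      end: int    # inclusive last line

  class VirtualFile:
      def __init__(self, fragments):
          self.fragments = fragments

      def render(self):
          """Concatenate all fragments into one scrollable text buffer."""
          chunks = []
          for frag in self.fragments:
              with open(frag.path) as f:
                  lines = f.readlines()[frag.start - 1:frag.end]
              chunks.append(f"# --- {frag.path}:{frag.start}-{frag.end} ---\n")
              chunks.extend(lines)
          return "".join(chunks)

      def resolve(self, virtual_line):
          """Map a 1-based line in the rendered buffer back to (path, real_line)."""
          offset = 0
          for frag in self.fragments:
              size = frag.end - frag.start + 2   # content lines + 1 header line
              if virtual_line <= offset + size:
                  return frag.path, frag.start + max(0, virtual_line - offset - 2)
              offset += size
          raise IndexError("line outside the virtual file")

  # The HealthComponent-sharing idea above would then just be a fragment list:
  player_view = VirtualFile([
      Fragment("player.py", 1, 40),
      Fragment("health_component.py", 1, 25),
  ])

Writing edits back from the rendered buffer to the raw files needs the reverse mapping too, which is where the hard part lives.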


Look up "class browser", in particular "smalltalk class browser" to see examples, current and historic, of editors supporting that kind of approach to navigating codebases.


Maybe I'm looking up the wrong thing but these basically list the classes and method names and hotlink to the code?

I mean something that would dump them all into one contiguous "file" from the editor's perspective. Included components (I don't want to say classes because it could work in a functional language, or maybe you just don't include full classes but certain methods) wouldn't have to be coupled to the main class you're editing. Like if you have a decoupled event system you could pull in just the events relevant to the idea you're working on. You could have different views depending on what idea you're working on and save them as their own file.

To use the gamedev analogy again you could have MonsterCombat.view MonsterAI.view MonsterAnimations.view which would all expose different subsets of the Monster class and various related methods from other classes/modules.


Something like this https://tibleiz.net/code-browser/ maybe?


I might be remembering wrong but I thought Visual C++ 6 might have had a class browser as an alternative to a file browser. Maybe modern Visual Studio has it too?


I don't really mean a class browser but something that completely abstracts the underlying code organization and lets you create a sort of meta-file that includes all the related code for various ideas.

Like you could have all of the code from class A except the debug related methods, a few methods from class B, just a few functions from a static MathUtils class, so on.

Maybe your ClassAUnitTest meta-file could include some of the ugliest methods from the class you're testing but not the entire class.


Give https://www.unison-lang.org/learn/tour/ a read. The Unison language stores code as hashed syntax trees.


I love how much this questions the status quo.



Someone please tell me this is transpiled from a separate project.


It was originally written in LISP, which is why it's a single C++ file. However, I believe that it's now being maintained in its C++ form.


If I remember correctly it was in fact written in Common Lisp; the output was originally that file but it may have been modified since. You can probably google the truth with those breadcrumbs :)


Python's main interpreter loop is a single 4k line function.

https://github.com/python/cpython/blob/main/Python/ceval.c#L...


I've refactored monolithic code several times in my career. It starts with a thorough going-over, making notes, identifying the state machines and drawing what was handled and what was not.

Then, reimplement as a simple state machine but this time, fill in all the transitions (event+state => new state + action)
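
A minimal sketch of what such a transition table can look like, in Python (the states, events and actions here are invented, not taken from any of the projects mentioned below):

  # Table-driven state machine: (state, event) -> (new state, action).
  # Filling in *every* cell is what drags the unhandled transitions into the open.
  def log(msg):
      def action(event):
          print(f"{msg} (on {event})")
      return action

  def ignore(event):
      pass

  TRANSITIONS = {
      ("idle",        "start"): ("configuring", log("begin configuration")),
      ("idle",        "stop"):  ("idle",        ignore),
      ("configuring", "ack"):   ("running",     log("configuration accepted")),
      ("configuring", "error"): ("idle",        log("configuration failed")),
      ("running",     "stop"):  ("idle",        log("shutting down")),
      ("running",     "error"): ("idle",        log("crashed, resetting")),
  }

  def step(state, event):
      try:
          new_state, action = TRANSITIONS[(state, event)]
      except KeyError:
          # The old monolith would silently fall through here; be loud instead.
          raise ValueError(f"unhandled transition: {state} + {event}")
      action(event)
      return new_state

  state = "idle"
  for event in ["start", "ack", "error", "start"]:
      state = step(state, event)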

One was an Infiniband code base from the vendor - a 'computer scientist' had written several layers to do what one or two could accomplish. Another was the Windows CE DHCP client (it went from seconds to choose an address to milliseconds). Then there was an HDLC modem protocol - by the time I was done, it had been sped up several times over and no longer crashed.

I can't understand them by just reading. I had to make a road-map of all the states, events, actions and interfaces. Design the new code. Then make sure every function of the 'old' code was represented in the new code - line by line - so nothing got dropped.

Satisfying. But more like turning the crank and making sausages than design or architecture.


I like your approach. Having a clear understanding that all software is just state machines is a great way to solve hard problems.


Better one well-organized file than hundreds of folders and subfolders and files and symlinks. I have worked on projects where even after 2 years I didn't grasp the folder structure and just used search to locate files.


People love to complain about things that are simple, fast, and easy to complain about, without regard to whether the complaint is insightful or useful. It's sort of the dark twin of bikeshedding.

If you divide the single 11k-line file into a thousand 11-line files, it may become objectively much harder to understand, but it'll also receive much less flak, guaranteed.

I suspect this is also why Architecture Astronaut-ery can be so successful within a company. If code is chock-full of superficial signs of order and craftsmanship, such as hierarchy, abstraction, and Design Patterns(TM), it takes a lot of mental effort to criticize it, and most people won't.


If you divide a single 11k-line file into 20 files averaging 550 lines per each, by semantics and levels of abstraction, your code will quite possibly be easier to read, maintain and add to. Maybe. Perhaps.


I mostly agree, but it's often not that big a deal, and some people and applications may favor bigger files.

I have a 4000-line script in a single file that has served me very well. It's perfectly organized and modular. I thought about breaking it into more files but it seemed pointless. It's very convenient for jumping through every mention of a variable, for example.


> If you divide the single 11k-line file into a thousand 11-line files, it may become objectively much harder to understand, but it'll also receive much less flak, guaranteed.

A thousand 11-line files? You definitely could not make that guarantee of the people I work with.


> dark twin of bikeshedding

is it different from regular bikeshedding? or are you saying that the dark twin is the evolutionary process of eg. architecture gaining complexity until it becomes difficult to criticize..


Bikeshedding is usually more about how something should be done than how it should not be done. But, yes, you could think of them as basically one thing.


I once inherited a mission-critical PHP project which had no version control, no tests, and no development environment (all edits were made directly on the server). It used a custom framework of the original author’s own devising which made extensive use of global variables and iframes and mostly lived in several enormous PHP files. I was able to clean it up somewhat, but there was one particularly important file that was so dependent on global variables and runtime state that I never dared touch it.

When I was finally able to retire the project several years later, I first replaced the home page with this picture: http://2.bp.blogspot.com/-6OWKZqvvPh8/UjBJ6xPxwjI/AAAAAAAAOv...


It wasn't mission critical but my very first production programming project (n.b. I'm not a programmer and never had any classical training or education as one) was an abomination. I'd like to think the realisation of how bad it was, despite it just about working, was a call to arms to up my game a little. I ended up learning a lot about data structures, writing understandable code and comment, when not to write code, all that OOP stuff and things like STI, Generics (still not sure what they are), testing (TDD AND BDD!!! Yea, Cucumber!) and a plethora of other useful things.

I'm still not a programmer.


Haha, same!

The first project I inherited was a PHP app that used a custom UI framework created by an agency that didn't work with us anymore.

One file had 7,000 LoC, and it would generate hundreds of lines of code, with JS sprinkled in, and send it to the browser on every click.

Debugging that thing was a nightmare.


My first thought when reading this description was that step one is to make a local copy and get a development environment set up where you can toy around and see how things fit together. The 'stupid'er the setup (like using plain old files instead of a database), the easier that actually gets (apt install apache2 php; rsync da:files /var/www). Wouldn't that have helped with that particularly important but untouchable file?


If I remember correctly, the file was processing global state from other parts of the system, and it was such a Byzantine bit of code that I had almost no hope of understanding what it was actually doing without being able to observe state in the production system as it was being used. Plus at the time I wasn’t a particularly competent programmer myself (this was my first programming job). In the end I figured it wasn’t worth risking breaking it when its replacement was on the way.


If well done, single file projects are not bad. They save a lot of boilerplate code. It is also easier to find things, since it is all in the same file.

EDIT: I'll go even further. Programmers who don't like long files are probably using the scrollbar to navigate around the file. Vim saves me from that bad habit.


What programming language requires 'a lot' of boilerplate code to use multiple files? That sounds awful. I don't think the argument for things being easier to find goes up either, with a tool like grep.


You don't need to go far. In C, function prototypes in header files are boilerplate ;)


Perl XS (the system used to interface with C) requires module == file, so if you have a particularly large module then it just has to live in a single file. Here's one:

  $ wc -l perl/lib/Sys/Guestfs.xs 
  11930 perl/lib/Sys/Guestfs.xs
Worse still, this expands to C which can be large and takes a noticeable time to compile:

  30019 perl/lib/Sys/Guestfs.c


I don't get the obsession with file length. What's the benefit of having 100 files with one 50-line function per file, over having a single 5,000-line file with 100 functions? Obviously not counting extreme cases where the file size would break some editors' buffers.


Usually (but not always) a single, huge file points towards missing structure, missing abstractions, missing boundaries that aid with understanding.

If it were a huge, single file, with very understandable modularity within that file, likely nobody would've bothered to write a blog post about it :-)


Personally I find it much more difficult to keep n places in one giant file in my head than I do n individual files.

We have a few multi-kloc legacy monsters where I work and I quite often completely lose my place when working on them (and, by association, my train of thought), even though they’re actually structured somewhat reasonably.


I had this problem until I found an editor that has outlining as its core design paradigm. Now, with the outline always visible, it's _really_ easy to navigate a file of any length.

Unfortunately, at one point I got so used to navigating with the outline that I ended up making a 1500 line function in C (I was an even worse C programmer then than I am now). Because of the outline, I could read and follow it easily, but anyone with a different editor was royally screwed :-(

If you're interested, the editor is LEO (http://leoeditor.com/) it's been mentioned on HN a few times


I think the problem in this case was that the entire file was the script that ran top to bottom. It’s not so much that the file was big, but that the function was huge and impossible to reason about.

I agree that obsessing over file length is its own kind of anti-pattern. I have had colleagues who insist on putting every little thing in a different file and that is its own special kind of hell.


Try debugging a single 10k loc file versus fifty small modules where each takes care of a distinct part of the logic.


As someone who did a bit of enterprise Java, I much prefer the former. Jumping around between lots of tiny files and not being able to see where the actual work happens because it's spread everywhere is a debugging nightmare.


I think you need a better IDE.


I'm not too bothered by the single 10k loc file (and I've seen plenty of files with thousands of lines). I would aim at files in the range of 200-300 LOC

If you split it, it's crucial that you're splitting the logic in the right way (if the modules are too small, they'll just waste your time) and that you're making sure references can be easily traced (eg. if you have modules with some DI system which prevents references from being recognised, as it happens frequently in certain node.js enterprise applications).


But what's the difference between file1.c ... file50.c vs cat file*.c > onefile.c?


I imagine the 50 files have meaningful names. Just kidding, no-one knows how to name anything.


It depends on your tooling

I just searched for the largest code files on my system and found a 100k file

I opened it on the online repo and gitlab did not want to show it at first. When I clicked on show anyways, Firefox broke trying to load the website and I had to restart it. (then I could not post on HN anymore due to the noprocrast setting there)

When I opened that file in an IDE, it was shown quickly without any issues. But there is a notable delay when typing, so 100k lines are too much.

Other IDEs might already fail with smaller files

Although when the IDE has a "search in the open file" and not a "search in all files of the project" feature, one file is much easier to use than multiple files


At one point - and perhaps still today - Java would refuse to JIT class files with more than a couple of thousand lines in them, falling back to interpreted mode. So in that case, you really, really wouldn't want the 5,000 line file.


* Editor support for perfect semantic navigation may not be taken for granted

* Compiler support for function-level incremental compilation may not be taken for granted

* Editor shows a nice file tree (although you can do that with symbols too)

* Working with git is easier

* Reading code on sites like GitHub is easier


Code could be well-modularized in one single file, of course. But we don't have the tools to write code like that (editors and languages basically).


I find IDEs work better with multiple files (i.e. navigation around if you want to have several windows open at the same time), but agree that’s not so well defined.


Ability to structure your code base hierarchically

Ability to search through your codebase by file name

Ability to hide irrelevant information and expose a higher level API through private functions


None of those are necessarily driven by file-level organization, though, except for the one about the file names themselves.

My mortgage is paid by a 50 kLoC C program with a single 11000-line function. I'm always blown away by how many so-called "code editors" can't give me a simple list of C functions in the file, the way BRIEF could in 1987.

Few things annoy me more than having to trudge through a codebase with hundreds of .c files, inevitably all with 8-character filenames. Any day when I have to break out Eclipse to navigate an unfamiliar project is an official Bad Day At Work.


Eventually a single file won't scale. Maybe not in the lifetime of your project, but you never know. If a day comes and you have a 500,000+ line file that must be split into separate files, that could be a nightmare. Why not just follow best practices from the beginning and separate your programming logic into different logical/related units?

Similarly having the discipline to separate your programming logic into different files will force you to think about the architecture(or lack thereof) of your program. This is a good thing. The Java "one file for one class" model is overkill IMO, but it does force programmers to discipline themselves by thinking in terms of namespaces/classes when they write code, which for beginners at least is not a terrible thing.

Obviously version control is another reason. Hard to get work done if everyone is working on the same file.

It also will make it easier for someone else to grok your program. When I git clone someone else's large project, I start trying to understand the project by writing down what each folder and file is designed to do before I go any further. I suppose if everything was in one file I would just have to do the same for functions, but what if there were thousands of them? Imagine if a large program like WordPress, Doom, the Linux Kernel etc was a single file?

TLDR: For small one person projects, no big deal. Otherwise, it's just a bad/unscalable practice.


A brilliant tool I once worked with is TetGen; it takes a hollow 3D shape and creates a volumetric, space-filling mesh of the inside using tetrahedra. Most of TetGen is in one giant C++ file, clocking in at 36,566 raw SLOC.

https://github.com/libigl/tetgen/blob/master/tetgen.cxx


But it’s a class with lots of small methods. Maybe not the same as the single large VB script the OP described.


The most strangely maintainable code I have ever seen - though I should probably put maintainable in scare quotes - was an astrophysics code that calculated the changes to a spectrum during interactions with background fields. That thing had two long nested loops: in the outer loop it calculated local backgrounds, and the inner loop was basically an Euler solver that used the backgrounds from the outer loop.

The outer loop was something like 4 kLOC, and consisted of blocks where there were first 20 lines of loadTable(filename) calls, then a call to calculateLosses( <all the tables just loaded> ), and then freeTable( <just loaded tables> ) calls. The inner loop was a little bit of setup and then a very long part where all those losses would be subtracted from the spectra.

The funny thing was that once you got the structure, the code was actually not that bad. However, I told my boss several times that the second something comes along that doesn't exactly fit into that pattern the entire thing will blow up, and I was always told that they had maintained that code for 15 years and it hadn't happened yet.


10 years back when I was first starting my company, we wanted to build a phone IVR system with Twilio to book tables at restaurants. It was fairly complex; it had to track many different aspects of call state, including the client being able to enter things like date/time/party size etc. with push-dial, and it had to call our internal APIs. I assigned a recent college grad to the task.

In a week, she came back and said "OK I've finished the prototype." I thought no way and I asked her to demo. Try X, try Y, try X + Y, etc. -- it all worked.

Then I looked at the code.

She had written the API handler as a one gigantic function, presumably because Twilio gives you a single API callback on an incoming call. It was a maze of nested if-statements going 10+ levels deep, subroutines relentlessly copy-pasted inline throughout the whole thing. Then she manually tested by dialing the phone 100s of times, putting in hacks throughout the if-tree.

Her prototype ended up being pretty easy to refactor and was ultimately the basis (at least logic-wise) of what we put into production.


I've never written a file with 11,000 lines of code, but I have often built Clojure projects like this, with everything in one file. I think I might have once had a file with 4,000 lines of code. Maybe 5,000? A complete system might be 5 apps, working together, each made of one large file. It does help with some things. Especially if I try to on-board another programmer who doesn't know Clojure very well, using one file means they don't ever get tripped up by namespaces; instead, they just open one file, and then they can load it into the REPL and start working. I would not recommend this style for every project, but it does offer a kind of simplicity for the projects I work on.


I used to work on 40-50k line files with one function and a bunch of gotos in Perl, at a multi-billion dollar company.

It's fine.

You just binary search your way into it: put print "AAAA" in the middle, see if it's printed, then put it in the half of the half and try again.

Emacs couldn't even find the bracket ending the if condition (not the block, the condition..) - have you ever seen if conditions (again, not the block) that span your whole screen?

It's not as bad as you think. It made me realize we take code very seriously, but it's actually OK - 10k line file, 100k line file, whatever.. it's all the same.


This is the right attitude, sometimes you really do need the duct tape approach.


The size of a code file doesn’t matter. What matters is the amount of state the code in the file manipulates. For example, a 100k line code file with 500 pure functions not using any global state is fine. It is simple. However a 100k line code file with 500 functions that all manipulate 1000+ global variables is extremely complex and hard to maintain because of the undocumented global state invariants and hidden side effects.
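
A tiny illustration of the difference, in Python (the names are invented):

  # Hidden global state: every caller has to know that `balance` and
  # `last_error` get mutated as a side effect somewhere in the file.
  balance = 0
  last_error = None

  def deposit_global(amount):
      global balance, last_error
      if amount <= 0:
          last_error = "invalid amount"
          return
      balance += amount

  # Pure version: all state goes in and comes out explicitly, so the function
  # can be read, tested and moved in isolation -- even in a 100k line file.
  def deposit_pure(balance, amount):
      if amount <= 0:
          raise ValueError("invalid amount")
      return balance + amount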


File length is a bit of a bike shed in my opinion. My main concern here would be separation of concerns and code quality.

I prefer many short files and folders structured hierarchically and grouped semantically. I have no proof this is better so I would probably just leave it to a vote with the team.

In the end I think that is how a lot of this should be viewed until we get proper research. How do you WANT to code? TDD? No tests? One giant file? It should be a team and executive decision.

If you don't like the style on your team, and nobody wants to change it, move on or adapt.

Technical debt is like a superfund site. It renders the real estate worthless and poisons the rest of the company.

It does matter. My current gig is hemorrhaging money because we can't keep devs even though the pay and benefits are great. We cannot execute on mission critical initiatives.

We cannot adapt our product to meet the needs of the market in an agile way.

This is due to people saying "a working product is more important blah blah.." for years. I would argue there is a balance to strike and you can do both with a good team and realistic planning. But there is always the nay-sayer who is willing to step in and say whatever product wants to hear.

It is so bad we cannot train people to use the software anymore. It is too poor quality and we can't on-board them before they decide to go elsewhere.

Everyone who knew anything has left and there is too much of it. So the remaining devs get overwhelmed, they leave... It is a vicious cycle.

The funny thing is the money machine works, but it is so frustrating to see all of the extra money we could be making and having to leave it on the table.


I can beat that.

One of my first jobs was as a maintenance programmer for a 100KLoC (or more) single-file FORTRAN IV (1970s vintage) application (a proto-email server).

Three-character variable names, no documentation, and having been stepped on by every junior programmer that went before me.

My best debugger was a Ouija board.

The original author was a ringer for Donald "Duck" Dunn. Interesting chap.

It taught me the value of writing good code.

It sucked.

It was great (because of the lesson learned).


Surprised nobody's mentioned the Telegram Android app yet. Their ChatActivity.java is 27,720 lines! https://raw.githubusercontent.com/DrKLO/Telegram/master/TMes...

Thing is, there's a great deal of knowledge of the Android OS in there, and the app works great when I use it. I think the OP blog post is correct: 'Users don't care about the technologies or code.'


Well, this is not necessarily a bad thing. If it was approachable by a non-IT person in HR, and business rules could be updated without contacting IT and waiting, then more power to them. I have seen this sort of thing developed as a coping mechanism because the official IT team could not be used, whether due to time, cost, priority, or whatever. Also, even being in IT, one multi-thousand-line file can be a lot more manageable to work in vs. dozens of smaller files where it's not clear where to look without being in an IDE.


At my first job out of college, there was a "utils" file with IIRC over 100k LoC. Nearly every file in the codebase imported it. This was in Perl. That single import statement would increase the time to start anything by upwards of three seconds. One of the best things I did for my efficiency was to factor out subsets of functionality that didn't need any of those utils, since those subsets would run unit tests in a tenth of a second instead of three to five seconds.

All of which is to say, by all means argue about whether colossal files are acceptable software engineering, but sometimes that fight takes a back seat to "a double-digit percentage of the company's CPU and memory is wasted on parsing and loading this file in literally every new process".


TypeScript's checker.ts is a 2.65 MB file containing 44,932 lines at the moment.

https://github.com/microsoft/TypeScript/blob/main/src/compil...

Does anyone know why and how they maintain it?


Unlike the OP's file, there's a rather substantial test suite and massive corpus of TypeScript code to work with, so at the very least, you'd have some grumpy people knocking on your door if you did something that negatively impacted the greater ecosystem.

Some documentation from Orta Therox on the checker:

https://github.com/microsoft/TypeScript-Compiler-Notes/blob/...


  const anon = "(anonymous)" as __String & string;

What does that even mean? It seems that TypeScript uses __String as an alias for string in the source, but then a bitwise operator with string?


Debugging a single file is much easier compared to debugging a tangled mess of interconnected .h, .c, .tcc files with include directives that only work in a specific sequence and with a specific compiler.

Fix your include/import systems before preaching for modularity.


I worked on a project like this once. If I remember correctly, it was even larger than 11,000 lines. ASP classic project with VB-script. We tried to get the company to do an overhaul, but it was more cost effective to try building it into the existing system.

After the better part of a week becoming acquainted with the code, I found a suitable integration point. Luckily for me, the new feature being requested didn't depend too much on the existing code so I didn't have to make too many modifications to the existing code. I added the entry point to the new section along with some comments describing how things worked and some ascii art of a dragon. In the end, the new feature worked great and the customer was very happy with the results.

Some years later, I was working for a different consulting firm and that project surfaced again. This time it was being re-written in ASP .NET after being passed around to a couple of different off-shore development teams. My coworker was working on it and asked me if I had written a specific piece of code in index.asp. I took a look and we both had a laugh, because my ASCII art had survived after all those years!


That's me when I use languages that don't support circular imports. Can't have circular imports if everything is in the same file. Taps head.


For me the limitation on circular imports forced me to drastically rethink how I architected my software (golang; I rewrote the whole MVP 3 times because in the first two I was either completely blocked by no-circular-imports or the structure felt so hacky that I didn't even wanna touch it… then I learned about interfaces!!)


Just in passing: generally you can break a circular import by isolating the coupling that triggers the cycle in a file dedicated to that. Bonus: each coupling use case gets to be explicit.
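
A small sketch of that with made-up Python module names: if orders.py and customers.py import each other only because they share one type, the cycle disappears once the shared type moves into its own module.

  # shared_types.py -- the new, dedicated home for the coupling
  from dataclasses import dataclass

  @dataclass
  class Address:
      street: str
      city: str

  # orders.py now does:     from shared_types import Address
  # customers.py now does:  from shared_types import Address
  # ...and neither module needs to import the other anymore.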


Python allows for circular imports as long as you don't directly import the things you use but instead import their modules, for example.

Is that the same for all languages?
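
In Python the pattern looks roughly like this (two made-up modules shown together in one block; the key is binding the module object, not its names, at import time):

  # a.py
  import b                  # fine: binds the module object only

  def helper():
      return "a.helper"

  def call_b():
      return b.process()    # attribute looked up at call time, after b is loaded

  # b.py
  import a                  # also fine
  # from a import helper    # this, executed at import time, is what tends to blow up

  def process():
      return a.helper()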


> Once I dared to clean this up and reuse the authentication response, but it broke everything. I never figured out why. To this day I sometimes lie in bed wondering what could have caused this.

Is it just me that gets piqued by the sound of it? I often spend my nights figuring out something intriguing and chasing that aha moment. I also like playing puzzle games, and to me, it's one way or another to spend the time. Then if I get to clean the mess up, usually that's another rewarding effort. However, I definitely understand the frustration if one's hands are tied or there is a deadline and you just want it to disappear.

I love to clean up my own mess as well. We all make messes, at different levels and from different perspectives. Just like playing a game, it is just a boss fight at a different level. Novices make messes, veterans make messes as well. Usually a novice's messes are easy messes.


That beats the 1000 lines inside a single if {} block that I once found.

(The conditional in that if {} always evaluated to true).


Lol, reminds me of the meme:

  var a = true;
  if(a == true) then return true;
  else if(a == false) then return false;

Or something like that.


I have seen that in real code. No kidding.


I once looked at 30 lines of code, analyzed them, and found out that they always computed true, never false. My background knowledge told me that true was indeed the correct value. But I did not dare to eliminate it in the main branch, only in a refactoring branch that I think never got merged. I am not proud.


I've seen this in production code:

    public static bool ConvertToBoolean(this int number)
    {
        var TrueOrFalse = false;

        if(number != 0)
           TrueOrFalse = true;

        return TrueOrFalse;
    }


I hate to say it, but I've definitely written code like that. Not any time recently, thankfully (at least I think).


The thing is it’s easy to imagine a situation that leads to this. It’s five o’clock on Friday, your partner is hassling you to get home because you have visitors, you are exhausted because you worked 60 hours this week and your boss is breathing down your neck because they want their pet feature finished right now.

This is why I’m always loath to criticise stuff I see on WTF.


I could totally conceive places where "you have to write at least N lines of code per day" and then this kind of thing explains itself.


else throw "this should never happen";


Yeah, but the function only had a single exit point, so it was following best practices.


I don’t remember if it did or didn’t have a single exit point.

But the thing as a whole didn’t follow any best practices I’ve ever heard of: the project also had what looked like a bizarre attempt to reimplement the concept of properties(!), in that the UI classes had fixed length arrays of all their subviews, which were accessed by constants. Each section of the UI had its own god class, where different views in that section were all the same class, called with a constructor that determined if a view ought to be created for any given constant.

There were also something like 20,000 blank comments. No idea why, the guy who added them didn’t even understand my question:

  int something = foo();
  //
  double baz = bar(something, 5);
That kind of thing.

(The project is no longer available and the business who made it has since closed, before anyone asks).


I was being tongue in cheek about people who religiously follow that "single entry and single exit to any function" rule and then get wrapped around the axle when a function needs to make several discrete sanity checks at the start before doing work.


Fair enough, it’s hard to convey a tongue-in-cheek tone in writing ^_^


One can develop* what serves as the edifice[1] of a $1B hedge fund that handles signals, orders, trades, positions, and p&l, among other things, using a system not unlike this one.[2]

[1] Merriam Webster definition of edifice: a large abstract structure

[2] Source: Me

* Or inherit and maintain


CORSIKA, currently at version 7, is a program to simulate extensive air showers, i.e. particle physics in the atmosphere.

Its main file is 88 kloc of Fortran 77, started in the eighties and still actively developed.

https://www.iap.kit.edu/corsika/index.php

Currently a rewrite in C++ is underway.


It depends on how you see it: I see a project that is quite successful, since it's running in production for mission-critical needs and the code is solid enough that even non-programmers can make improvements to it.


That's like running down the side of a very steep mountain and calling it a successful endeavor before you've reached the bottom


Depends on your measure of success. Given a steep enough slope, you are almost guaranteed to reach the bottom one way or another.


The scariest part of this isn't the 11K lines of code, it's the lack of automated tests. It IS impossible to make any sort of substantial change without breaking another part of the darn thing.

My favorite quote has got to be: "Unit tests aren't meant for you now, it's insurance against a future developer".


The number of lines in a file doesn't necessarily mean anything. SQLite is compiled as a single file: https://www.sqlite.org/amalgamation.html . It depends on the structure that is in that file.

If this app was factored into 300 different files, it would still be an impossible mess. The redundant and buggy logic would just be in different files.


SQLite concatenates its sources into a single file to ease distribution, but it is developed in many separate and well-organized .c and .h files.


This reminds me of an app we had at work for static content, written in JSP of course. It was made so designers and UI developers could make mostly static content pages.

Someone had a good idea to make a header.jsp template for common header stuff.

But it was hilarious. The file essentially became a giant if-else condition with a few hundred conditions like "if path == 'some_page'" followed by CSS and sometimes JavaScript for that page.

Absolutely horrendous.


Think of all your code as a single file with line separators, file separators and a special UI that considers both to present these in a text editor.

The existence of large files is mostly just a style issue.

Text editors with different - or let's say more semantic - interfaces to the code would not care about file size.

You could then happily have 1M loc in a single file.

You would care about it as much as you care about how the code was laid out on sectors and pages on your HDD.
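
As a toy sketch of that framing (purely illustrative, not how any real editor stores code): keep everything in one blob, mark boundaries with the ASCII file-separator character, and let the tooling present whatever slices it wants.

  # Toy "one giant file" storage: modules separated by the ASCII FS byte.
  FS = "\x1c"

  def pack(modules):
      """modules: dict of name -> source text, flattened into one string."""
      return FS.join(f"{name}\n{source}" for name, source in modules.items())

  def unpack(blob):
      """Rebuild the name -> source mapping; the UI never shows the raw blob."""
      out = {}
      for chunk in blob.split(FS):
          name, _, source = chunk.partition("\n")
          out[name] = source
      return out

  blob = pack({"player.py": "hp = 100\n", "monster.py": "hp = 30\n"})
  assert unpack(blob)["monster.py"] == "hp = 30\n"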


I once wrote a sql query that was (well, is, since it is still in use) 5,000 lines long. It still ran in less than a millisecond, but I was surprised how long it really was once it was finished and tested.


It really speaks to the state of software engineering when there are ample comments here defending this practice. This is virtually indefensible in my book, as it screams technical debt and strongly suggests that there are much deeper issues hiding in that codebase. Personally I would not be willing to work on it without first addressing those issues.


It speaks to the state of software engineering that people are complaining about file length, instead of something meaningful like cyclomatic complexity.

A single 10k-line program can be easier to understand and better organized than an overly abstracted mess strewn across multiple files that checks all the "best practice" checkboxes.
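
For what it's worth, getting a rough cyclomatic complexity number doesn't take much tooling. Here's an approximate counter for Python code using only the standard library (decision points + 1, which simplifies the real metric a bit):

  import ast

  # Rough cyclomatic complexity: 1 + number of decision points.
  # Real tools count a few more node types; this is just a ballpark figure.
  DECISIONS = (ast.If, ast.For, ast.While, ast.ExceptHandler,
               ast.BoolOp, ast.IfExp, ast.Assert, ast.comprehension)

  def complexity(source):
      tree = ast.parse(source)
      return 1 + sum(isinstance(node, DECISIONS) for node in ast.walk(tree))

  print(complexity("""
  def f(x):
      if x > 0:
          for i in range(x):
              if i % 2 == 0 and i > 2:
                  print(i)
      return x
  """))  # -> 5: one for, two ifs, one boolean operator, plus one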


Do you think the people who wrote this were measuring cyclomatic complexity? Obviously you can have >10k LOC file that is well-maintained and useful.. but there's ample evidence in the post that this was _not_ one of those, but a machine held together by duct tape, spit, and hope.


> Do you think the people who wrote this were measuring cyclomatic complexity?

As a first-order guess, I would guess that no code has its cyclomatic complexity measured. As a fraction, I'm probably not too far off.


I think the idea that optional extras are seen as required is more telling.

I suspect there are a lot of younger programmers with these views, who don't know elsewise.

Testing suites, having dedicated test/dev environments, all of these things are relatively new across most programming on the web. "Best practice" has changed more times than I can count in the last 20 years, and we've gone full circle from "monoliths are bad, split everything into microservices" to "maybe try combining them to reduce complexity".

I'm not saying this style of coding is the best, but automatically assuming the current practices are the definitive best ones in all cases is silly, and the idea you have to refactor because it doesn't meet those practices is - in my view - insane and wasteful.


This is more of a "rubber hits the road" forum than a theoretical one. There is a time and a place for stuff like this. It's not a reflection of the bigger picture of software.


And that's fine. Nobody is qualified to work on every piece of software.


Is it weird that I think that, armed with a modern editor, this would be basically easy and practically pleasant compared to the FooFactoryFactory “best practices” crap I’ve had to deal with?


I always start a new project with all code in a single file then only split it out when I feel I gain more than I lose. Judgment call, and a tad subjective.

One factor in favor of the mono-file is that it's often easier to navigate around fast and do search/replace in a single file than across many. Believe me, I understand all the benefits of multiple files too -- not my first rodeo -- but under the right conditions it's not always a sin to have a "fat" file.

Generally, after a project has left its prototype stage and become more of a stable thing, primarily under maintenance, it should certainly be split out into smaller files with sensible module boundaries. That's best in the long term, and plays better with version control and with large teams doing concurrent mutations on the source tree.


In 2012 I worked on a JSP web app project; we replaced another consultant, so the project was inherited. A monstrous JSP page, of 13k lines IIRC, once transpiled to pure Java, hit the max length for a class method: 65,535 bytes.


We've got many of these, including a 13k LoC file, in our OSS project. Yes, it isn't ideal, but sometimes for performance and practical reasons these things grow over time.


Obviously Sam from accounting did this before Excel was Turing complete.

https://www.felienne.com/archives/2974

https://www.microsoft.com/en-us/research/blog/lambda-the-ult...

No mention of a cell-u-lite version.


An early-2000s web analytics tool, awstats, was a 500 KB Perl file. It was surprisingly easy to modify and hard to break - I spent a lot of time adding SEO goodies to it.


In Pascal that is really common, since the files are the modules. You publish your library as one file, and the user can import it by the file name.

I ran wc on FreePascal to search for some. There are a few, but not as many as I expected.

9k file, data structures for the compiler itself: https://gitlab.com/freepascal.org/fpc/source/-/blob/main/./c...

30k file: Pascal parser/scope resolver: https://gitlab.com/freepascal.org/fpc/source/-/blob/main/pac...

And the record:

119k file, Sharepoint API (but it seems to be autogenerated): https://gitlab.com/freepascal.org/fpc/source/-/blob/main/pac...

As far as libraries go, this is one of my favorites:

23k file, regular expression library: https://github.com/BeRo1985/flre/blob/master/src/FLRE.pas

I searched my own files and found a 197k file to parse HTML entities. But that was an autogenerated trie (one switch/case for each letter)


> Once I dared to clean this up and reuse the authentication response, but it broke everything.

Yet! Those non-programming people somehow managed to add their little requirements over the years, without breaking the other forms?

There is probably more to the story. I suspect, for instance, that users who needed to produce a certain form had private, years-old copies of the program that they used, impervious to subsequent changes.


I'd wager they did break those other forms; the whole thing had its own long list of bug reports per the writeup. They just didn't break it for themselves.


My $DAYJOB task this week is beginning the process of splitting up a 14554 line C++ file. It's basically the implementation of a single class, but by way of comparison, the header file is only 743 lines, so there's a lot of code in some of those methods.

It'd be pretty unmanageable without Visual Assist (a plugin for Visual Studio that does fairly fast searching for symbols and files).


I once refactored a 15k plus line sql file. Fun times


I'm guessing / hoping there was a lot of data in there as INSERT statements.

My horror story was being asked to do maintenance on PHP sites written in the early days of PHP. Hundreds - maybe thousands - of PHP files with copy-pasted HTML and intermingled logic. As far as I can tell, the idea of instantiating a whole MVC framework from a single entrypoint file came after that particular site was created, so every possible page was its own entrypoint with its own boilerplate. Source control also seemed to postdate this project, so you had plenty of .old.php and .old.v2.php files.

Programming at webdev agencies is a challenging experience.


Nope. It was a crazy stored procedure. When all you know is SQL… you’ll solve some crazy problems using SQL.

Oof. That tbh sounds worse


I worked on untangling a 29,000 line single .c file MacOS app in the mid-1990's that featured a 14,000 line event loop function. Fun.


At my first company I quickly established myself as a good programmer. As a result I was rewarded with a tough project for a tough client. What I came across in the code base astounded me. It was a Java-based code base with everything happening in the constructor of a class. A 3,000-line-long constructor. I didn't stay with the company for long after that.


I worked at a large ecommerce company on their Android app. I was tasked with using the Sonar tool to analyze the source code of their app. A bad cyclomatic complexity is 30, at which point it is difficult for an experienced engineer to follow the code paths.

There was a file in that source that had a cyclomatic complexity of over 900.

The reason was that at that point Android had a limit on method count, called the dex limit: you could only have 64K methods, before multidex was introduced.

The engineers couldn't add more methods, so they had to jam code into existing methods. No DRY. No refactoring.

It was fucking impossible to understand at the code level. You had to just use the app for years to understand it.

That analysis got a bunch of us targeted by the senior developers because it made them look bad. They fired multiple people, including me, who pointed this out. We said it made it impossible for new engineers on the project to succeed, and the incumbents didn't want that to be known, lest they lose their bonuses.


An honorable mention goes to the 47,000-line file containing the entire garbage collector for the C# runtime. It was originally written in Lisp, then automatically translated to C++ by a script before .NET was released. It's been hand-maintained since, and the folks at Microsoft have rejected attempts to refactor the code.


In the 90s I was given the task of making some changes to a shell script which provided a complete systems-engineer configuration management tool for country-scale network management systems.

This shell script was 9,000 lines long. Each new "menu" after entering a number corresponding to your selection from a previous menu took 60-120 seconds to appear. Needless to say, configuring systems was a painfully slow process.

I quickly found it practically impossible to extend this tool to meet the new requirements, so I quietly rewrote it (also still in shell script, because of host limitations). The new version was 1,000 lines, and menu changes took < 2 seconds in most cases (thanks to appropriately placed caching).

Management was not happy that I decided to rewrite (since the manager had written it himself originally), but the users were ecstatic about the performance increase. I would not be surprised if that tool was still in use today.


I'll never forget the compiler bug we found once. We had a precompiled header file that was over 65,536 lines, and for some reason, it would just stop reading after that many lines. Of course we had some interesting compiler errors, but we never expected that it was a compiler bug.


Quote: "There was no test environment. If I made a change, I had to test it in "production"."

What year is this? The 70's and mainframes? Because there is no way in hell you cannot, in a large organization that had a "Jeff in marketing", since the 80's and PCs, duplicate the environment. Especially given this was used by almost everybody in the organization.

And once you're done duplicating production and have created a proper test environment, you can start actually refactoring and creating a beautiful app out of it instead of just "To this day I sometimes lie in bed wondering what could have caused this".

Conclusion - the article's author is a whiner instead of a solver, no better than the ones before him who "copy and pasted at some point then later diverged".


If the author was the only person editing the file, and had a reasonable expectation of owning it for at least 6 months, then yes, start building a second environment.

But one antipattern might be that the file is being modified by other people, even if on paper he is the only owner. So while you create a test, someone you don't even know exists modifies prod right under you.

Another antipattern is forever-changing ownership. You own it for 2 weeks. Then someone anonymous in some layer above you decides something else, and off it goes. Maybe they did bother to tell you you're not the owner anymore, maybe they only told the new owner. In those 2 weeks, you've got ownership of 3 new programs and lost it for 2 others.

I've seen it happen. There is no way to build something stable in these circumstances. Management will need to provide some stability before the underlings can do any work. If you're living there, run away, you can't save them.


"There is no way to build something stable in these circumstances"

Yes, there is. Since he clearly stated that it started as a job assigned to his team, he could've versioned it right there where it officially lived. Then it would've been very easy to see changes made to it by somebody else, even if who that somebody else was remained unknown to him.

"Then someone anonymous in some layer above you decides something else"

He also stated in the article that he had a lot of time on his hands and tried a refactoring. You don't have time to do a refactoring of a file with 11k lines of code in only 2 weeks, so he clearly had at least several months.

My conclusion still stands.


Most of the comments here seem to be focusing on Austin's reference to it as an 11k line of code file, but if you read the article it sounds a lot more like it's both an 11k line of code file and an 11k line of code procedure. That is, that there are no sub-procedures, just one straight execution through the whole thing. If you've only encountered large-ish source files like this but with reasonable sets of functions inside, that's a far cry from finding an 11k line of code procedure itself. The former is often justifiable (though perhaps a stretch), the latter is almost never justifiable. It's just garbage.


https://github.com/SheetJS/sheetjs/blob/master/xlsx.js with its 24,500 LOC - the author is cool.


One of the things I often tell people is that if you hear someone say "Yes! It broke!" that person is probably an engineer.

But the moral of the story in the article is that unfortunately some things break much less easily than we'd like.


I once rewrote one of these programs and translated tens of thousands of lines into a well unit-tested 500-line purging script. Got a spot award, and then the program was never used again… they chose to keep the “working original version”.


This was common with Delphi/Pascal because units were designated per file, and nobody wanted to type hundreds of "uses" statements to import every one of those. So, people leaned towards keeping code in as few files as possible. For example, TurboVision had units named "Objects", "Views". I'm sure they were thousands of lines long.

It was possible to use include files to split up a single unit, but that wasn't used much either because lack of cross-file navigation features in IDEs of the era made it really difficult to manage many files.


I encountered the QGSJET-II model, which we used for modeling cosmic ray showers in the atmosphere. At one point I was asked about finding a way to parallelize the code, but the 17k+ line Fortran file for the model, which I recall also included self-modifying code, was too deep for an undergrad to penetrate.

https://gitlab.iap.kit.edu/AirShowerPhysics/crmc/-/blob/mast...


Modularity applies at different levels and the purpose of modularity varies according to the level. Code can be modular at one level and not modular at another: functional, domain knowledge, deployability, organisational alignment (e.g. operational support and maintainability), etc.

I see lots of people getting confused these days about modularity and people missing the point altogether. What is it you're after? What's most important to you? That's where you need to focus your modularity efforts.


I've seen a complete high-level language compiler in a single file of around 250k lines. Granted, it was assembly language, but it was entirely hand written.


This one time, working at a country's state electricity provider, I was tasked with drawing all the power stations in the country on a map. I was stumped, I couldn't figure out why the markers weren't showing up. Then I zoomed out and saw them being drawn on the other side of the world. The latitudes and longitudes were the wrong way around. The data was from a government-maintained database. lol.


Had one like that, too. The previous programmer didn't believe in constants and the code was interpreting a binary protocol. Globals all over the place, too - I refactored a variable once and got a message that production was down; it was - the database was busy deleting customer history on all its connection pool threads. To this day I don't know how that refactor could've caused it. Fun times.


I worked one year with "accounting software". The single-file-VBS approach was almost all I ever saw. Anyone who knows anything about software development quits, and those who don't stay there and fiddle more with the incomprehensible 11,000-line VBS script to try to figure out why nobody born before 1981 got their salary paid out last month...


Code quality, internal modularity and clean interfaces between the parts don't have anything to do with how it is split into files.


Those are rookie numbers.

About 10 years ago I was briefly working on an app for bank tellers.

The app had been in production for 2-3 years at that point, with 4 people implementing new features.

The whole app was 4 Java files/classes.

2 files had about 1,000 LOC, 1 had about 15,000 LOC, and the last one had 80,000+ LOC.

The big one did it all: UI, DB calls, forms for printing, network calls.

Most of the variables were named something like “c12”, “bkf22”, etc.


Magboltz [1] is a single >95k line Fortran file (although I'm not sure if it's multiple files combined into one before upload).

[1] https://magboltz.web.cern.ch/magboltz/


Moral of the story: no engineer worth their salt wants to work on a "codebase" like this, so as an independent contractor, you can virtually name your price if you're willing to wade through the mess and solve acute problems.



A lot of what we call best practices are good practices for collaborating. When I do things just for me, 10,000 lines in a file isn't a problem. I can see the whole thing in a minimap and know every comma.


> What is the moral of the story?

Either figure out how to improve the situation, or leave.


When I learned to write FORTRAN IV for the CDC 6400 in 1974, I was told the standard compiler was written as a single routine. Didn't see the code myself, sadly.

My, how our knowledge of how to do things has changed!


My company has a REST client that is a single 9,000-line Go file that I have nightmares about. By my estimate it is really a ~300-line program written by someone who hated DRY.


More likely it was autogenerated from the API specs; if it hasn't been edited too much, you might be the hero who made regenerating it part of the build.
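
If so, wiring regeneration into the build is usually a small change. Here's a minimal sketch in Go, assuming the file came from an OpenAPI spec and using oapi-codegen as a stand-in generator - the spec path, package name, output file, and tool are all assumptions for illustration, not details from the parent comment:

  // doc.go (hypothetical) -- regenerate the client instead of hand-editing it.
  // The spec path and generator flags below are assumptions; substitute
  // whatever tool actually produced the original 9,000-line file.
  //
  //go:generate oapi-codegen -generate types,client -package restclient -o client_gen.go api/openapi.yaml
  package restclient

With something like that in place, running "go generate ./..." in CI rebuilds the generated file from the spec whenever it changes, so nobody ever has to edit it by hand again.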


That was my first thought too, but if there's a generator for this it has been lost to time and retirements.


Just so you know, awstats was a 25k+ line Perl script which I maintained as an internal fork for an employer once. It was epic, and as hard to work with as you'd expect.


> What is the moral of the story?

Countless non-software developers, ranging from IT support to business analysts, were able to implement business requirements. Good for business if you ask me.


All software projects should be a single, self-contained file.


And on a single line if you're a real pro!


Single-line programs are my expertise. :)


Reminds me of an article about the code that Neil Ferguson's Imperial College team used to model COVID. Though I think the Imperial code sounded worse.


Short and sweet article. Worth it just for the last line.


> what is the moral of the story

Your boss will still expect you to be able to fix bugs in an hour or two without introducing any new issues.


Those are rookie numbers; I have had PHP code doing mostly SQL migration work in PDO at twice that length in a single file :)


Guilty. I once wrote a Win32 app in a single 30,000-line C++ file.

It was an interesting exercise but I’d never do it again.


The horror of working at a place that treats IT staff as second-class citizens is an interesting one.


There are two types of people in the world:

1. Those who blog about how bad an 11,000-line project is

2. Those who shipped the 11,000-line project


You can fit the plan9 kernel into that multiple times (and it's not a microkernel).


Man, I really love the look and feel of that guy's website. It's perfectly simple.


I see people do this all the time with terraform and it's madness.


Terraform is declarative, so it's much easier to refactor.


> What is the moral of the story?

> I have no idea.

Sometimes successful software just grows.


Obligatory "Do not try to simplify this code" reference here https://github.com/kubernetes/kubernetes/blob/ec2e767e593953... (https://news.ycombinator.com/item?id=18772873) from Kubernetes persistent volume manager code.


Choose one:

Perfect code you never shipped.

A single 11,000-line file you shipped.


This is normal.


I don't understand why people complain about such things.

A seasoned programmer should have no problem navigating a multi-million-line codebase; that's just routine.

There isn't anything that special about a 10k-line file.


Writing long files is okay, but unless a language's module system gets in your way, separating different aspects of your code into different files does no harm.

There is no need — at all — to be evangelical about it one way or another.



