Can't quite tell if this is a joke, but here's a related "story about a bug" from Doug Crockford [0]:
I made a bug once, and I need to tell you about it. So, in 2001, I wrote a
reference library for JSON, in Java, and in it, I had this line
private int index
that created a variable called "index" which counted the number of characters in
the JSON text that we were parsing, and it was used to produce an error message.
Last year, I got a bug report from somebody. It turns out that they had a JSON
text which was several gigabytes in size, and they had a syntax error past two
gigabytes, and my JSON library did not properly report where the error was — it
was off by two gigabytes, which, that's kind of a big error, isn't it? And the
reason was, I used an int.
Now, I can justify my choice in doing that. At the time that I did it, two
gigabytes was a really big disk drive, and my use of JSON still is very small
messages. My JSON messages are rarely bigger than a couple of K. And — a
couple gigs, yeah that's about a thousand times bigger than I need, I should be
all right. No, turns out it wasn't enough.
You might think well, one bug in 12 years you're doing pretty good. And I'm
saying no, that's not good enough. I want my programs to be perfect. I don't
want anything to go wrong. And in this case it went wrong simply because *Java
gave me a choice that I didn't need, and I made the wrong choice*.
He did not need the choice, but others do. And he is wrong when he says it makes no difference whether you use one byte or eight of them. Yes, adding two of them takes the same amount of time, but eight-byte values also cost eight times more cache space and memory bandwidth to move around. That may not matter for a single number or ten of them, but it certainly does once you have an array with millions or billions of them.
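Roughly what I mean, as a minimal C++ sketch (the sizes and counts are illustrative, not a benchmark):
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    const std::size_t n = 10'000'000;                  // ten million counters
    std::vector<std::int8_t>  narrow(n, 1);            // ~10 MB
    std::vector<std::int64_t> wide(n, 1);              // ~80 MB
    std::cout << narrow.size() * sizeof(narrow[0]) << " vs "
              << wide.size() * sizeof(wide[0]) << " bytes\n";
    // Summing either vector performs the same number of additions, but the
    // int64_t version streams eight times more data through the caches.
    long long a = std::accumulate(narrow.begin(), narrow.end(), 0LL);
    long long b = std::accumulate(wide.begin(), wide.end(), 0LL);
    std::cout << a << " " << b << "\n";
}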
There are use cases where you need the choice, but most people do not need the choice.
Most programs written in the real world (enterprise-y Java apps) do not need fine-grained control over the GC, a choice of integer types, or many of the other things offered to them. Reducing choice will increase code and tool quality.
I think that we should make the uncommon choice reallllly hard to put into place. Make it a pain to configure the GC, give specific integer types really long names. Just stop people from premature optimization and leave these tools to people who know what they're doing.
I guess that most programming languages, with the possible exception of COBOL, were not created with boring stuff in mind - like enterprise-y apps or webdev, which make up most of programming today.
If you are unable to make a good decision between different number types, you'd better not write software, IMHO. How do you reason about the operations you apply to your data if you are ignorant of the possible values?
Being able to make good decisions does not guarantee you will never make a mistake. And he even explains the reasoning behind his choice, and it was a justifiable decision when he made it. But I stand by what I said - if you are unable to decide between 8-bit or 64-bit integers, or between a floating-point type and a decimal type, you should not be a professional software developer, because this is a very basic and fundamental skill.
In the web industry, which is the biggest part of the software industry by people employed, most developers use dynamically typed languages and very happily never make this "fundamental" choice. Should they all resign?
Not only the web industry: in Smalltalk every integer can have an unlimited number of bits. On 32-bit systems, 31-bit small integers are used internally at first, and on overflow the runtime automatically switches to variable-size integers... There is no need to choose a number of bits... Just use integers as they are in nature...
Dynamically typed does not mean that you don't have to think about number types - 3/2 and 3/2.0 may yield different results in dynamically typed languages.
And how many web developers don't know about or are unable to decide between different number types? JavaScript type coercion is such a mess, how could you get away without thinking about types, even though number types are usually not an issue?
Thankfully Python fixed that discrepancy about 3/2
Anyhow, I don't agree with you: I think that getting a bug years down the road because of a too-small numeric type is something that the programming language itself should prevent... not because "developers ought not to have to know about it", but because mistakes happen.
Anyhow, even with a dynamically typed programming language like Python or Javascript you can care about the size of your numbers.
Just import the array module (in Python) or use the Int32Array/Int8Array/etc. types (in Javascript)
In what way isn't it? Just look at this beauty [1] and its consequences. But probably nobody uses the equality operator anymore because it is such a mess. Less-than is even more messed up [2].
Well, your first line doesn't make sense: the first two comparisons use type coercion but the third doesn't, so it's like asking how `0 == false` can hold while `0 !== false` does too. (`" " != ""` behaves exactly like `" " !== ""`.)
The second line at least uses type coercion, but you are still making the wrong assumptions. true could be coerced to many strings - 't', '1', 'true', 'yes', 'on' - but they chose '1' (you may not like it, but I think it's a good choice). Infinity, on the other hand, doesn't leave many choices when coercing it to a string: I can think of '∞' (which is difficult to type), 'Infinity', and maybe 'Inf.', so I think they made a good choice there as well.
I'm not saying type coercion in js has no problems, but you said that it's a mess and I just think you chose the wrong examples.
That is not specifically type coercion, but the behavior of the equality (==) operator in JavaScript. Type coercion in JavaScript can be very useful; for example, !!('foo') is coercive and easily understood (and has an unsurprising result). Thankfully the == operator is completely optional and has a more easily understood === (identity) counterpart, making your point less about a language and more about a specific operator within that language.
Other operators in JavaScript may behave more intuitively than ==, but I don't think you can really make a good case for JavaScript's type coercion being 'unsurprising'.
I suppose what I meant was: you point out that the coercive behavior of some operators is counterintuitive, ergo type coercion in JS is a mess. I don't think that follows, because some operators behave in a coercive fashion that is more convenient than in many other languages.
None - because === does not coerce types. Type coercion, not double-equal weirdness as such, is the gripe of the GGP. Non-transitive, inconsistent equality is nothing more than a symptom of JS's rules for implicit conversions.
Language-war disclaimer: I love javascript and everything, it's a very expressive language; but a good wodge of the tooling around it nowadays is to help people avoid things like implicit type-coercion 'surprises'.
Ruby, Python and PHP all support several types of numbers. But even if they did not, this would not preclude people developing in those languages from being able to make such decisions.
I write software, and the number of times I even have to explicitly manipulate numbers in a given week is very close to zero. Even iteration is done through iterators instead of indexes, so I write a plus sign pretty sparingly.
Data manipulation beyond "pull out of database" or "submit user input to database" is a lot rarer in enterprise software like this than in scientific computing. I'm not saying it's bad to be aware of it, but software is more than numbers.
I develop enterprise software, too, and I definitely think it matters there, too. You better make sure your database columns have the correct number type or you will get in trouble if your inventory numbers or monetary values start showing rounding errors.
I just wanted to address your point that enterprise software does not usually involve dealing with numbers.
When it comes to integers you are right - signed and 32 bits is a viable choice in north of 90% of all cases. And when I wrote that you should be able to make a good decision about the number type to use, I was already thinking of all the number types; however, I did not express this well. But then I really don't see a lot of difference between being able to choose between integer, floating-point and decimal types on the one hand and various integer types on the other.
Yeah, you're right, data types matter. Like sibling said, this was more in response to signed/unsigned or different bit sizes. We could get rid of minutiae while still allowing for broad choice when it actually matters.
I do think that the difference between Integer/Fractional is important, but honestly if you're dealing with money you should be using some Money datatype that's smart about this instead of raw numbers.
Even if you abstract away the indexes, when working with datasets that large you have to worry about whether the standard implementation makes the same class of error.
For example, using Java's binarySearch on arrays of length over 2^30 was broken until 2006[0][1].
»But in todays CPUs there is no advantage using the short thing. You can add 64 bits or 8 bits, takes the same amount of time. And you look up what is the cash value of having saved seven bytes on a number. When you add that up it is zero. So there is no benefit.« [1]
As long as you are concerned with adding a bit of eye candy and interactivity to a web page, this may be true enough to get away with the JavaScript way of making every number a double-precision floating-point number, but there are other domains where this will not fly. And even in the world of JavaScript, asm.js is trying hard to overcome this limitation.
He is comparing Java and JavaScript and implies that it is a bad choice of Java to offer several options. And given the broad range of applications Java is used for I don't think this is a justifiable opinion.
Interesting, this is similar to the discussion going on for "int" in Rust (or the exact opposite, depending on how you view it). [1, 2]
On the one hand, the implicit `int` is being phased out in favour of explicit integer sizes. Your variables cannot just be `int` anymore; you have to sit down, think and choose: u8? u16? u32? i32? u64? i64? This avoids all the pain of programs behaving differently or crashing when compiled on different architectures.
On the other hand, a new "native integer for sizes that do not matter, minimum 32 bits" is being brewed, for example for pointer offsets or collection sizes. The idea is that you will not be able to have a collection with more than 2^32 elements on a 32-bit architecture nor more than 2^64 on a 64-bit architecture.
After this discussion, my hope is to see the introduction of a fast-ish dynamic bigint (that starts native and grows up to 256 or 512 bits) which can be used in all the cases where you do not care about the exact size, yet you want to be future-proof (this `private int index` fits that case, IMO).
It would affect the standard library of Rust, not Rust itself, I think. Second, the word size matching the architecture has only a slight performance impact, i.e. operations on a 64-bit word on a 32-bit architecture take more cycles/instructions than on a 64-bit architecture.
It's only important for things like pointers that they match the size of the addressing space, and not even that is a very hard constraint, just a very convenient one.
16 bit architectures are weird enough that a lot of tools don't support them. And 8 bit architectures are basically only used for things like Arduino nowadays.
However, using a 32 bit number to store values that will never be larger than 16 bits isn't that bad, it's just very slow.
I'm guessing (since this is Doug Crockford talking about JSON) that this was in reference to how JavaScript does things differently, in that it just stores everything as floats, which are quite capable of representing integers within the 32-bit range anyway.
However, an overflow to floating point isn't necessarily an improvement because, while a float will hold bigger numbers, it does so with limited precision and sometimes that lack of precision will cause bugs too. Probably more often, in fact.
In the example given it wouldn't be so bad, but you'd only get an approximate indication of where the error occurred rather than a specific line/character. So of course, whatever is reporting the error would now need to understand and handle the much more complex scenario of "fuzzy" location information instead of a simple unique index to a specific character. Depending on what it then needs to do with that information, the complexity could spiral from there.
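For concreteness: a double can represent every integer exactly only up to 2^53, after which consecutive integers start to collapse. A minimal sketch of the effect:
#include <cstdio>

int main() {
    double index = 9007199254740992.0;                 // 2^53
    // Both lines print the same value: 2^53 + 1 rounds back to 2^53,
    // so an index stored in a double is only approximate past this point,
    // which is exactly the "fuzzy location" problem described above.
    std::printf("%.0f\n%.0f\n", index, index + 1.0);
}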
If you want to just have things work no matter what, you have no choice but to use bignums. I was wondering about this recently, so did some benchmarks in Clojure. The performance was horrible, so frankly this is still not a viable alternative. Maybe in 10 years time, if every CPU has a bignum coprocessor by then.
Also, there are times, particularly in low-level graphics programming or cryptography, where you actually want integer modulo arithmetic, or to be able to do bitwise booleans predictably. In those cases, JavaScript-style loose typing can be a huge pain.
BTW, I've been a big advocate of JavaScript for about as long as Doug Crockford, so my point isn't that JavaScript-style type handling is bad: just that it's very far from a silver bullet.
While negative numbers very often provoke exceptions like IndexOutOfBoundsException, with unsigned integers an error could go uncaught for a much longer time. I'm all for signed integers, unless storage requirements are so tight that you really need that one bit.
This is a decision a compiler can never make in a reliable way because it entirely depends on the actual input and is not known until runtime. You may get away with dynamic recompilation when you realize that the input is not what you assumed when you compiled the code but I really doubt that this is a smart and efficient way to go about it.
And asm.js is no evidence for the 'no' side - the information is in the original source code, it does matter, and asm.js works around the JavaScript limitation to make this information available to the JavaScript compiler.
If you are working with files bigger than 2GB, hoping they're smaller than 4GB is NOT a good habit.
And I certainly don't believe there are programmers making only 1 mistake for 12 years. I believe he's just making a joke, or using the example as a means to an end.
Make the default numeric type effectively unbounded, and allow those who need it to choose more compact types when needed. This is what many languages do; it is possible both to generate efficient code when needed and to make correct code more likely.
I'm constantly amused how Java is supposedly stupid for protecting programmers from things they shouldn't do and yet also stupid for not protecting programmers from things they shouldn't do.
IMHO Java makes some choices about safety. If you don't agree with those choices, use a different tool. It doesn't make Java wrong for having a different opinion. Likewise I wouldn't berate C for being too low level or Ruby for favouring readability over performance.
So for that kind of software you use Python. Doesn't everyone know that? Java gives you that choice because it's for the kind of software where you need that choice.
The interesting meta-point, though, is that an audience of 20 million viewers is a big hit [1], so a billion views is 20M people watching it 50 times, or 200M people watching it 5 times. And 2 billion views is double that.
Put in perspective, that is probably in excess of the number of times the most favored "I Love Lucy" show has been seen. Or, put another way, you've got a music video with the same eyeball impact as the highest-rated television show ever.
That says to me that either advertising on Youtube is a bargain or advertising on TV is way over priced :-)
Or advertising on TV seriously under-represents the total number of impressions over time through alternate consumption streams. Right now, supposedly "unpopular" shows are cancelled, and then immediately get a successful Kickstarter from what turns out to be millions of fans who happened to be watching only through Netflix, or iTunes, or DVD box sets.
(Of course, none of these streams show the same ads the original broadcast does—but if you're a clever ad agency, you're already doing product-placement instead of interstitials most of the time anyway.)
> Right now, supposedly "unpopular" shows are cancelled, and then immediately get a successful Kickstarter from what turns out to be millions of fans who happened to be watching only through Netflix, or iTunes, or DVD box sets.
Can you name any examples of this?
The closest thing I can think of is Veronica Mars which was Kickstarted many years later and raised ~5 million dollars from 91,000 backers to make a single movie.
I think perhaps the "alternate consumption streams" viewers are not as lucrative as you think.
The Firefly series got enough support (in the form of written letters - this was pre-Kickstarter) to be made into a movie after a comically botched distribution through normal channels. (The first season's episodes were aired out of order in random time slots on Fox. It never had a consistent weekly time. This was the only season, natch.)
Family Guy had a similar fate, not because of a botched launch but because its audience existed yet did not consume television through mainstream sources. It was canceled after 2.5 seasons and then went on to become the best-selling animated DVD series. Fox brought it back the next year.
>The closest thing I can think of is Veronica Mars which was Kickstarted many years later and raised ~5 million dollars from 91,000 backers to make a single movie.
> (Of course, none of these streams show the same ads the original broadcast does—but if you're a clever ad agency, you're already doing product-placement instead of interstitials most of the time anyway.)
I think you just backdoored into the most interesting ad campaign ever:
1) Find a show with a director / writer / production team known for producing content that "stands the test of time" (e.g. likely to have a high total_views_over_time:broadcast_views ratio)
2) Include product placement for a non-existent product by a currently-existing company with strong brand recognition
3) Test response to non-existent product by initial viewers
4) Start viral campaign around non-existent product (this likely favors "Hunh?" shows a la Lost or Fringe)
5) Trigger view bump in show (win award, produce new episodes in partnership with Netflix, produce new movie, etc.)
6) Launch real-product multiple years after initial product placement
I think you're missing a unit. You should be measuring eyeball-minutes. An episode of The Walking Dead might be 20M x 45min = 900 megaeyeball-minutes. Gangnam Style is 2B x 3min = 6000 megaeyeball-minutes. Disregarding target demographics for the moment, that says the advertising spend for a first run episode of Walking Dead should be about equal to 15% of the lifetime spend for Gangnam Style.
You're equating two things of different lengths, which require different attention spans. They also differ in how the audience views the content, which gives the advertiser a different experience with the viewer.
For example, with I Love Lucy, the audience member likely sat and watched the entire commercial. With a YouTube video, the audience member can skip the ad or move on to other content.
TV = 22 minutes of content.
YouTube Video = 3 minutes of content.
Plus, the metrics that constitute views between the two media formats are completely different.
I'd probably do the same if I had similar watching habits :). Right now I mostly use YouTube for either a particular search result or just to play some music that's not on Spotify, and having to listen through a minute of advertising to watch a three minutes long video is a bit anger-inducing.
And god knows how many times the top music videos on YT are played at parties & other semi-public events! Heck, as the parent of young children, I've probably watched things like Gangnam Style >20 times just within my house.
Indeed! There's a cartoon rabbit for small kids here in the Netherlands called "Nijntje", and there are a few "official Nijntje songs" on YouTube. Our 1 year old daughter's favourite is this: https://www.youtube.com/watch?v=20J8DUJMgA4&app=desktop "Nijntje dansles" - it has 12 million views, and there are only 20 million people in the Netherlands, total!
This song has been played many times by a relatively small section of the Dutch population :)
That assumes YouTube eyeball count is of equal value to TV eyeball count though, right? Which doesn't seem like something we can assume- YouTube's targeting doesn't seem great, and there are plenty of other things to do on a computer while you wait through the ad.
Maybe YouTube ads aren't the best, but the ads on the Internet can have much better targeting and performance tracking than TV ads.
>there are plenty of other things to do on a computer while you wait through the ad.
Yea, for example you can buy the advertised product with a few clicks. If you are quick enough you can even finish the purchase before the video ad does (not the most realistic scenario, but it's possible). Or with a quick search you can learn more about the product to check how honest the ad is. TV ads cannot compete with this efficiency. The only thing TV ads do better is reach a bigger and less tech-interested audience.
"You should not use the unsigned integer types such as uint32_t, unless there is a valid reason such as representing a bit pattern rather than a number, or you need defined overflow modulo 2^N. In particular, do not use unsigned types to say a number will never be negative. Instead, use assertions for this." [0]
Which is a completely birdbrained policy, given that signed integer under- and overflow is completely undefined. If you want to catch implicit signed-to-unsigned conversions, enable that warning on your compiler... what they're advocating is just dangerous.
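The warning I have in mind is -Wsign-conversion on gcc/clang; a small sketch of the kind of conversion it flags:
#include <vector>

std::vector<int> make_buffer(int n) {
    // If n is negative, this implicit int -> std::size_t conversion wraps
    // around to a huge value; -Wsign-conversion reports it at compile time.
    return std::vector<int>(n);
}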
In a strict typing environment, the other major issue is that int is cross-platform and forward compatible whereas uint32_t, uint64_t, uint8_t, uint16_t, etc. will all always be unsigned within a specified bound, so whenever we have 128-bit or 256-bit registers, we'll have to go back and update all this code that effectively "optimizes" 1 bit of information (nevermind the fact that int is usually more optimized than uint these days).
Furthermore, casting uintx_t to int and back again while using shared libraries is a huge pain in the ass and can waste a lot of programmer time that would be better spent elsewhere, especially when working with ints and uints together (casting errors, usually in the form of a misplaced parenthesis, are pretty small and can take a very long time to find).
> int is cross-platform and forward compatible whereas uint32_t, uint64_t, uint8_t, uint16_t, etc. will all always be unsigned within a specified bound, so whenever we have 128-bit or 256-bit registers, we'll have to go back and update all this code
uintN_t (and intN_t) are MORE portable and cross-platform than int, in the sense that you get much better guarantees about their size and layout.
Furthermore, int is NOT the size of the register (x64 commonly has an int of 32 bits) so any updating you'd have to do to uintN_t, you'd have to do to int as well. Regardless, I can't imagine why you'd need to do any updating in the first place - it's perfectly valid to stick a uint32_t in a 64 bit register.
> nevermind the fact that int is usually more optimized than uint these days
Where are ints more optimized than uint? Not in the processor, not in the compiler (modulo undefined behavior on overflow) and not in libraries.
> so whenever we have 128-bit or 256-bit registers, we'll have to go back and update all this code that effectively "optimizes" 1 bit of information
This is why we have uint_least8_t and friends. In fact, int is really just another int_least16_t.
> Furthermore, casting uintx_t to int and back again while using shared libraries is a huge pain in the ass and can waste a lot of programmer time that would be better spent elsewhere
Could you give an example? It sounds like you're just talking about performing the casts, which shouldn't take much effort at all, given how indiscriminately C already converts between integral values.
You don't have to update code. If 64 bits was enough on a 64-bit CPU, it'll be enough on a 128-bit CPU. The one exception is when dealing with quantities that actually depend on the bit width of the CPU, like dealing with array sizes. The language already has good types for this, like size_t, and using int won't save you. (Quite the contrary, int will sink you, because int is almost always 32 bits even on 64-bit systems.)
I had my first nasty production bug (back in the early 2000s) when I assumed an Integer was 32bit in VBScript.
2 billion survey results was never going to happen. 32,767 would have been fine as well, except that, to compound the issue, ops pointed the production site at the test database.
Are your choices between variable-width "int" and fixed-width "uint_x"? After all, in C you can just declare something "unsigned" and it's the width of int.
However, I think this is a problem. The expected value ranges of your variables don't change just because your memory bus got wider - maybe you can use more than 4GB memory in a process now, but it's a mistake to plan for single array indexes being more than 32bit.
If you do try to be more flexible, I'm sure this would introduce more bugs than the forward-compatibility it'd add. Especially if 'int' is smaller than on the platform you tested on. That's why languages like Swift, Java, C# always have 32-bit int on every platform.
> casting errors, usually in the form of a misplaced parenthesis, are pretty small and can take a very long time to find
Agreed, but writing casts also adds unwarranted explicitness. What if someone made a typo and put the wrong type in the cast? How do you tell what's right? What if you change the type of the lvalue or the casted value? Now you have to think about each related cast you added.
What's the alternative? Well, the compiler should just know what you mean…
Int is not cross-platform and forward-compatible. It's implementation-defined, so it's up to the compiler. Practically speaking, every modern compiler defines int as 4 bytes, and can be expected to never change that (because of the vast swaths of bad code out there that is written with the assumption that an int is 4 bytes). So it's not forward-compatible. And while on most platforms you can expect the compiler to have picked 4 bytes, it's certainly possible for compilers to pick other sizes for int (I would assume compilers for embedded architectures might do that), which means it's not cross-platform either.
The size of int is implementation-dependent, but its minimal range isn't. If I'm representing integer quantities between -32767 and +32767 with int, then it will work reliably across all platforms and compilers that are C99 compliant. I believe that's what the GP is referring to.
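And if code assumes more than that guaranteed range, it can at least say so explicitly; a sketch:
#include <climits>

// The standard only guarantees INT_MAX >= 32767; make any stronger
// assumption explicit instead of silently relying on the platform.
static_assert(INT_MAX >= 2147483647, "this code assumes int is at least 32 bits");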
"Completely undefined" is a good thing, because it's a strict line between good and bad (good and evil?). So, now that you know all integer overflows are bad, you can:
* dynamically test your program with ubsan to be sure they really don't happen, and then
* let your compiler optimize with the knowledge that integers won't overflow.
This last one eliminates maybe half the possible execution paths it can see, and loop structure optimizations practically don't work without it.
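The textbook illustration of what that assumption buys is roughly this (with gcc or clang at -O2, the first function is typically folded to a constant):
bool always_true(int x) {
    // Signed overflow is undefined, so the compiler may fold this to
    // "return true" even though x == INT_MAX would wrap at runtime.
    return x + 1 > x;
}

bool not_always_true(unsigned x) {
    // Unsigned overflow wraps by definition, so this really must be
    // evaluated: it is false when x == UINT_MAX.
    return x + 1 > x;
}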
On the other hand, unsigned overflows? Some of those are bad, but some are fine, right? How will an analyzer know which is which?
Some notable libraries like C++ STL want you to write loops with unsigned math (size_t iterations), but those people invented C++, so why would you trust them with anything else?
Ubsan won't catch signed integer overflow unless you happen to hit the overflow case during your tests. Relying on dynamic analysis to catch errors you should have avoided statically is shoddy.
It's certainly less complete, but it's a little harder to decide what you want to prove statically.
If a function must overflow, the optimizer (hopefully) replaces the entire thing with an abort under ubsan, so you could look for that. But that's probably not sensitive enough.
And if the function is just 'x + 1', that may overflow, but it's not important.
To be fair, even though unsigned integer overflow is very well defined, it's most certainly NOT what you want when used as an index or counter of anything.
+1. Every time I see for(int i=0;...;i++) I wonder why we have developed this habit of defaulting all ints to signed and considering uint taboo (most coding guidelines ask not to use them unless "you know what you are doing"). Most of the time we use integers for counting, so uint would have been the more natural choice. I did this in one of my libraries that I was writing from scratch and was happy for a while, but then I got into trouble because there is a lot of code out there with interfaces expecting signed ints even where they should be using uint. So ultimately the legacy forced me back to defaulting to signed int.
> I wonder why we have developed this habit of defaulting all ints to signed and considering uint taboo (most coding guidelines ask not to use them unless "you know what you are doing").
I'm pretty sure that it's just because "int" is one word and "unsigned int" is two, plus more than twice the characters. I suspect if "int" defaulted to "unsigned int" and you'd have to specify signed ints explicitly, the taboo would be reversed.
Never underestimate the power of trivial inconveniences.
Forget about for statements for a second and let's write both a counting up and a counting down loop using while statements.
// count up
std::size_t i = 0;
while (i != 10)
{
std::cout << i << "\n";
++i;
}
// count down
std::size_t i = 10;
while (i != 0)
{
--i;
std::cout << i << "\n";
}
After initialization a for statement repeats "test; body; advance", this is ideal for counting up loops, but what we need for counting down loops is "test; advance; body". Since C/C++ do not provide the latter as a primitive you have to use a while loop as shown above. Using a signed integer to shoehorn a counting down loop into a for statement at the cost of 1/2 your range is a hack IMO. Note that when working with iterators you have to resort to a while statement as iterating past begin is UB.
> Still, the trick makes it look suspect and that's an argument against using it.
This is true. The code is confusing to people not used to it. A workaround could be to hide this code inside a macro, so people not interested in digging into the code would take the macro's word:
#define REVERSE_LOOP( x, i ) for( size_t i = x.size(); i-- > 0; )
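Usage would look roughly like this (assuming the macro above and a std::vector):
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    std::vector<int> v{1, 2, 3};
    REVERSE_LOOP(v, i)
        std::cout << v[i] << "\n";   // prints 3, then 2, then 1
}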
But unfortunately, that doesn't help with the fear that people have of unsigned types.
In this case, it makes absolutely no difference at all. It could be argued that writing unsigned int would make the code slightly harder to read. That said, I like to use stdint.h, and uint32_t would, I think, not have any drawbacks.
> there is lot of code out there with interfaces expecting signed ints even though they should using uint
That's not a good reason not to use unsigned integers; it's a zero-overhead cast from unsigned to signed (at the risk of overflowing into the negative).
Yeah, some numbers can never be negative, but their difference can, and that's when it usually comes back to bite me in the ass. I almost never use unsigned ints now.
I disagree, signed integer arithmetic in C and C++ is just toxic. Sure, if you need to compute the difference between two integers, which have both been pre-checked to lie between say -100 and +100, then fine, use signed ints... but for arbitrary input you need to do more work.
There's example code in the CERT secure coding guidelines (look under 'Subtraction'):
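Roughly in that spirit (a sketch, not the guideline's exact code), a pre-checked signed subtraction looks like:
#include <climits>
#include <stdexcept>

int checked_sub(int a, int b) {
    // Reject inputs whose difference cannot be represented in an int
    // before performing the (otherwise undefined) signed subtraction.
    if ((b > 0 && a < INT_MIN + b) ||
        (b < 0 && a > INT_MAX + b)) {
        throw std::overflow_error("a - b does not fit in an int");
    }
    return a - b;
}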
All arithmetic in C and C++ is toxic. That's the reality of using bounded-precision types. Honestly, I wish they'd had the foresight not to use the traditional infix operators for built-in types; they practically beg programmers to implicitly treat built-in types like the mathematical types they very vaguely resemble.
Really, working directly in fixed-precision arithmetic is absurd. In order to be able to rely on its correctness with any degree of certainty, you need to very carefully track each operation and its bounds, at which point you may as well have just used arbitrary-precision types, explicitly encoded your constraints, and had the compiler optimize things down to scalar types when possible, warning when not.
The funny thing is that fixed-precision arithmetic is used literally everywhere and it just works. I'd say it's good enough for most practical purposes.
It is not used literally everywhere. It does not always "just work," as the original post demonstrates. It often happens to be good enough for most practical purposes, yes, but arbitrary-precision arithmetic is better for most practical purposes.
Fixed-precision arithmetic has one main advantage over arbitrary-precision arithmetic: it is more time- and space-efficient. This advantage only applies if the fixed-precision arithmetic is actually correct and meets some concrete time or space constraint that arbitrary-precision arithmetic fails to meet. It generally takes time and effort to demonstrate that these conditions hold; because one can rely on the correctness of arbitrary-precision arithmetic without doing so, arbitrary-precision arithmetic should generally be the default choice.
This assumes that you care about making relatively strong guarantees about the correctness of your programs. If for some reason you don't, then sure, use ints and whatnot for everything. If you do, though, I suspect you'll find that it's easier to track down a performance bottleneck caused by using bignums than an obscure bug triggered by GCC applying an inappropriate optimization based on overflow analysis.
PostgreSQL doesn't give you an unsigned int option but if they did I wouldn't use it.
Having a negative pkey space is actually useful. In LSMB we reserve all negative id's for test cases, which are guaranteed to roll back. This has a number of advantages including the ability to run a full test run on a production system without any possibility of leaving traces in the db.
Most DBs don't support unsigned int [0] as a type (though it's perfectly sensible to have a constraint that enforces >0).
[0] though several do support UUIDs, which are essentially unsigned 128-bit ints, and which (with a well-selected generation mechanism) are better as server-assigned surrogate keys than sequential integers, signed or unsigned, anyway.
That seems like bad advice to me. A possible infinite loop is given as the justification, in the case of wrongly implemented reverse iteration (counting down an unsigned loop variable). Well, I claim that an infinite loop is a much more noticeable bug than undefined overflow behaviour, negative view counts, etc. Unsigned ints make impossible the kinds of bugs that, with signed ints, would (hopefully, famous last words) only trigger assertions, if those are even enabled...
One problem with this is that the sizes of STL containers are returned unsigned, and with high warning levels, compilers will warn about comparing a signed int with one of these sizes.
Surely only temporarily, though. I mean, this is an exponential process - adding one bit only doubles the space, and it will not take another nine years before some video passes 4 billion.
By that line of reasoning, 15.75 years from now there will be a viewcount greater than 8 billion.
... and now I realize you may be correct, and that it's probably inevitable that a viewcount will not only exceed the total number of people alive, but will double or even quadruple it. Our total population is actually about 100 billion, but only ~7% of us are still alive.
The shadows of the dead will be forever enshrined as YouTube view counts. Our shadows.
YouTube is slightly under 10 years old. It's interesting how willing we are to project it forward into the far future. Not that I think that's wrong or anything - it's just that it's gone from nothing to essential in a fraction of a human lifetime.
"We always overestimate the change that will occur in the next two years and underestimate the change that will occur in the next ten." - Bill Gates
Why not just use some sort of unlimited BigNum implementation? Yeah, for small numbers it's still ~2x the size of just storing an int, depending on implementation (or it can be: "int unless MAX_VALUE, in which case bignum is stored somewhere else") and it might be slower to operate on... but, on the other hand, you are already storing and processing a full video for every such counter!
Edit: Now I realize that would mean Google couldn't have made this joke. But I am still not sure this was foreseen by Youtube devs from day one.
Yea, according to a reddit post from a Googler, this was more of a staged easter egg than a real bug. Google coding style actually prohibits the use of unsigned integers in C++ code.
Interesting, until very recently I was writing a whole bunch of C++ at Google but I can't recall any such restriction. Most internal data structures at Google are expressed as protocol buffers and surely these support unsigned integers. In fact if someone suggested that a view counter should be signed that would invite a snark (kind of Google's specialty) like 'can it ever be negative'?
[One problem with unsigned integers in protocol buffers is that some supported languages like Python have no concept of signed/unsigned integers. Which is why for example Thrift does not support that distinction. This has nothing to do with C++ though.]
> One problem with unsigned integers in protocol buffers is that some supported languages like Python have no concept of signed/unsigned integers.
You can still implement range checks to enforce that the numbers are in the correct domain. It's not that much of a problem.
This is more of a problem in a statically typed language like Java, because it means there is no native data type to map protobuf's unsigned types to. In Python this doesn't matter as much because numbers will automatically promote themselves to bigints.
I suggest someone write a browser add-in that replays this hundreds of times when machines are idle, to mount a massive distributed view-count attack and force YouTube to upgrade to a 64-bit unsigned int now!
When youtube launched in April, 2005, the initial source code was based on another completely unrelated website that I had worked on before, written in PHP and running on Apache and MySQL. It’s always fascinating how implementations of complex systems evolve.
The interesting question to me is why this particular video is so wildly popular. I don't generally go in for music videos, but I find this one fascinating and have watched it a dozen times. I read an article that tried to explain to non-Koreans like me the meaning of it all, and apparently there are several layers of parody and social satire. I think I love it for its combination of attitude, surrealism, bizarre humor, and self-mockery, plus the music that seems to fit magically.
The explanations of Korean parody/satire are largely irrelevant to its success given its popularity elsewhere, surely? I think it's the bizarre visuals that had it spread (why I tweeted it when it first emerged), then catchiness plus a repeatable dance move. It's the Macarena of its time in that regard.
Being Korean might've given it crossover appeal into much of Asia? Just a guess.
I love it because, in a world of fake pop musicians, this guy comes off as such a genuine goof. I can't help but like the guy, I'm very happy for him for this level of success on YouTube. And the song is super catchy. He's one of the very few pop musicians I appreciate (though so far, this is probably the only song of his I care for). The political satire makes it all the more compelling. I love the horse riding on top of a sky scraper.
In the words of my niece, it's fun. She likes the silly man, and while the sexual connotations of some of the things he does might make parents wince, they fly right over her head. She still has that innocence of youth. So while we can enjoy the irreverent humor and the sexual innuendo, she enjoys the silliness at her level. (Plus she can do his horse-stepping dance.)
Aren't most Linux servers already 64 bit? And we aren't even close to 2020.
I'm sure some software will need to be re-written between now and 2038, but I don't think it will be quite as bad as Y2K just because that was only a 15 year gap (Sometimes less), whereas this is over 24 years.
I just think a lot of software will be naturally replaced between now and then. And while there will be a slight mad scramble to fix stuff at the last minute, I don't think it is Y2K-2.
People who think that 64-bit servers are immune are part of the problem. Even if you've got a 64-bit server, you've still got file formats with 32 bit timestamps embedded. For that reason, time_t remains a 32 bit integer, which means that functions like UNIX_TIME on MySQL will stop working. And then there is the mess of embedded software that most decidedly is NOT 64-bit and will be in machines that are still in operation.
Just as many as all of the 8-bit systems in use today. There is no need, in the vast majority of cases, for wide data busses in embedded applications. 16-bit is going to die out, though, like the 4-bit and bitslice processors.
I set the clock to one minute before time_t overflow on an iMac once. Recovering from that and just getting the machine to boot afterwards was no joke.
I'm not an XNU hacker, but it looks like they haven't. Their time_t typedef seems to be a __darwin_time_t [1], which in turn looks to be a 32-bit signed long [2].
As far as I know, the only major operating system that has dealt with Y2038 is OpenBSD [3].
I did tech support when the 99->00 switch happened, got paid 3x overtime. I got one call, and it was actually legitimate, but was a third party piece of software so after that we left and went to a party :)
I doubt this will be a real problem in 2038. Then again, the prevalence of computing devices is much greater now and will continue to grow by 2038, but so will technical aptitude, so hopefully they'll cancel out and this will still not be a problem.
Same, but 4x overtime here :) I was just on the PC team though so I left at 7pm after finishing the last few BIOS updates; the AS400 and HP-UX teams got the pleasure of staying past midnight.
This is a minor problem. In the 1980s, the number of tradable things with ticker symbols in US markets passed 32767, and some new issues had to be delayed until it was fixed.
Nifty example. Billionaires, trillion-dollar budgets, billion-view celebrities, fast CPUs, and large memories: all reasons I am done with 32-bit architectures (old article of mine, but only on large memories http://www.win-vector.com/blog/2012/09/i-am-done-with-32-bit... ).
32-bit architectures have nothing to do with the size of different data types that have existed forever. We had 64-bit longs in 8-bit cpus.
Also, there are perfectly valid applications that require numbers of 8, 16, 32 or 64 bits (or variable encodings with arbitrary precision). Petabytes, embedded microcontrollers, etc.
Sorry, I was unclear. "32-bit architecture" can mean a lot of different things (bus sizes, address word sizes, and so on). Mostly I am done with small pointers (having to use segments to address all of your memory, or not being able to memory-map a disk, sucks) and small counters (only being able to index a collection with a signed 32-bit integer sucks).
I saw this a few days ago, at first I thought it was an easter egg on youtube's part - saying "so many views we overflow!"
But it's real?! It seems incredibly absurd that it could actually overflow, how are signed values useful for a count of views? How are you going to have negative views?
Say you're comparing the number of views between two different videos as (video_A.views - video_B.views). How do you represent that the second video has more views than the first?
And what if you want to say how many more? You'd have to first check which one is greater, and then subtract in the right direction. The code is simpler if you just use signed integers.
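Concretely, a tiny sketch with hypothetical view counts:
#include <cstdint>
#include <iostream>

int main() {
    std::int64_t a_views = 100, b_views = 250;
    std::cout << a_views - b_views << "\n";   // -150: the sign says B has more
    std::uint32_t ua = 100, ub = 250;
    std::cout << ua - ub << "\n";             // wraps around to 4294967146
}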
To say "b has n more views than a", you need two operations whether you use signed or unsigned ints. Signed let's you say "these two videos have n different views" but, to be honest, it seems unlikely that you be doing either in so many places in your codebase, or so frequently during execution, to make the extra operation that significant.
Yes, in this specific example you still need the same number of operations. My point was, when you just use signed integers, it's easier to conceptualize and harder to screw up. It's more flexible for unforeseen future use cases. Also, this is a very simplified example case; in most scenarios those unforeseen possibilities will be more significant.
Given that switching to unsigned saves you exactly one bit, it's just not usually worth it. How often do you need exactly 32 bits of unsigned space, when 31 isn't enough, and you can't use 63? (I'm talking about standard-length integers here, not extreme situations where you're trying to make maximum use of 8 bits of storage or similar.)
Java is mostly confined to Android and the frontend (think GWT) which do not constitute a large proportion of the Google internal code as far as I can recall.
It's just a 2 billion limit crossed, 32 bits can count up to 4 billion. Afterwards, they certainly don't have to change to 64-bits, just add a few bits more.
If you do it at home, for your 10 videos, there isn't. At YouTube scale, they can certainly benefit from using a different number of bits in storage, in transit, and in the CPU. Only in the CPU, and only if you actually want to use the value in calculations, is 64 bits the best step after 32. See also the discussion here: https://news.ycombinator.com/item?id=8691291
uint_32 strikes again! And one day we'll stop using it in favor of int_64, and all unique identifiers will be string, and all will be well.
I remember when Twitter had rolled over their tweet ID's because they were using an int type that was too short. Should have gone with variable length strings to avoid that problem.
Using strings avoids one problem but introduces a bunch of others (e.g. a string is harder to verify, therefore less secure, and therefore needs to be handled with kid gloves). Checking that every character is between 0-9 and dropping all other characters is easy, cheap, and effective. Then just check that it is between uint64.Min and uint64.Max, and you're done.
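A rough C++ sketch of that kind of check (rejecting, rather than dropping, stray characters), using std::from_chars:
#include <charconv>
#include <cstdint>
#include <string_view>

bool parse_id(std::string_view s, std::uint64_t& out) {
    if (s.empty()) return false;
    auto [end, ec] = std::from_chars(s.data(), s.data() + s.size(), out);
    // ec covers both non-digit input and values that overflow uint64_t;
    // requiring end to reach the end of the string rejects trailing junk.
    return ec == std::errc() && end == s.data() + s.size();
}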
uint32 gives them twice as much capacity (which isn't enough at this stage), they'll likely want to go with a uint64.
That's 16 bytes instead of 8 bytes for a uint64 that still grants you 18,446,744,073,709,551,615 variations. Seems overkill, particularly if you actually generate GUIDs "correctly" in which case you've allocated 16 bytes but will only ever use a sub-set of them[0].
ObjectId is a 12-byte BSON type, constructed using:
a 4-byte value representing the seconds since the Unix epoch,
a 3-byte machine identifier,
a 2-byte process id, and
a 3-byte counter, starting with a random value.
You can actually convert the _id to/from a timestamp, which lets you do cool things like never keep a timestamp field (you can convert a datetime to an ObjectId and use that for comparison).
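A sketch of that conversion, assuming the documented layout above (timestamp in the first 4 bytes, most significant byte first):
#include <array>
#include <cstdint>
#include <ctime>

std::time_t objectid_seconds(const std::array<std::uint8_t, 12>& oid) {
    // Reassemble the leading 4 bytes into the seconds-since-epoch value.
    std::uint32_t secs = (std::uint32_t{oid[0]} << 24) |
                         (std::uint32_t{oid[1]} << 16) |
                         (std::uint32_t{oid[2]} << 8)  |
                          std::uint32_t{oid[3]};
    return static_cast<std::time_t>(secs);
}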
If the 4 bytes are unsigned, then did they just intentionally introduce a year 2038 problem?
12 bytes is enough to just store a random value, with low chance of collisions until you've got around 100 trillion items. It confuses me why anyone would want to waste bytes of an ID on low entropy values like machine and process ID.
Or any combination of high res time plus random works nicely.
OTOH, Mongo's not exactly been a bastion of engineering excellence.
Every time I check the most viewed videos on YouTube I get depressed and lose all faith in humanity.
Landing on a comet gets you 250K views, announcing the discovery of the Higgs gets you less than 100K, while the latest twerking video or PewDiePie gets at least 2M...
Not sure why the pessimistic poster is being downvoted. Even factoring in repeat viewers, it is sad how little society values scientific achievements. That said, I was up in the wee hours of the morning watching the LHC start up, and I've probably put more than 20 views on that song myself. Have ye hope :D
One is a little news report about a recent scientific finding (which you don't even need to watch, since there are tons of other, better sources for it); the other is a viral entertainment video, which is specifically made to get as many viewers as possible. They are not the same type of thing, so comparing them doesn't mean much.
Scientific achievements don't necessarily make for entertaining video content. I'm sure most people would rather read about most scientific achievements (which these days are largely invisible or theoretical) than read a literary translation of Gangnam Style, too.
EDIT: is there a reference for formatting comments? I've never been able to find one.