I have to disagree. You should always put limits on everything, as said nicely here [1]:
"Put a limit on everything. Everything that can happen repeatedly put a high limit on it and raise or lower the limit as needed. Block users if the limit is passed. This protects the service. Example is uploading files of logos for subreddits. Users figured out they could upload really big files and harm the system. Don’t accept huge text blobs either. Someone will figure out how to send you 5GB of text."
Sure, you should ideally do this in your application code. But if there are multiple interfaces (such as a REST API etc) to your database then you have to remember to put them in place everywhere. I don't see a good reason to make a username field TEXT instead of a generous VARCHAR(300). If somebody wants to choose a username longer than that, they're probably malicious. It protects you with zero cost and allows you to make some user-input sanitization mistakes (we're all human) in your application code.
Just define a column constraint:

CREATE TABLE t (col TEXT CHECK (length(col) < 50));
What is nice is that if you want to change the max-length check, you just need to drop and re-add a constraint, rather than changing the data type, which can result in a table rewrite depending on the scenario. You can also add a minimum-length check using this method, or a regex check, etc.
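For example, raising the limit later is just a constraint swap rather than a type change (a sketch; t_col_check is Postgres's default auto-generated name for the constraint above, yours may differ):

ALTER TABLE t DROP CONSTRAINT t_col_check;
ALTER TABLE t ADD CONSTRAINT t_col_check CHECK (length(col) < 100);

Adding the constraint still scans the table to validate existing rows, but it doesn't rewrite it.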
I think the author should have made this point rather than just glossing over it with "constraints, triggers are more flexible".
Doesn't the use of check constraints add an overhead and possibly negate the slight performance advantage that was gained by using text instead of varchar/char?
Yeah. Technically, `foo varchar(n)` amounts to syntactic sugar for `foo varchar check(length(foo) <= n)`.
What the benchmarks show, if anything, is that while the `check` involves negligible overhead per row, the overhead can eventually be enough to make a measurable difference.
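Concretely, the two declarations reject overlong input alike (a sketch; sugar_b_check is the auto-generated default constraint name):

CREATE TABLE sugar (a VARCHAR(5), b VARCHAR CHECK (length(b) <= 5));
INSERT INTO sugar (a) VALUES ('toolong');  -- ERROR: value too long for type character varying(5)
INSERT INTO sugar (b) VALUES ('toolong');  -- ERROR: violates check constraint "sugar_b_check"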
Not to mention, the Postgres manual actually says as much in its description of text types.
> Sure, you should ideally do this in your application code. But if there are multiple interfaces (such as a REST API etc) to your database then you have to remember to put them in place everywhere.
If it's not a logical constraint on the data but a limit on input to prevent malicious actions, the right tradeoff to prevent malicious action may not be the same across all interfaces; and even if it is, it's not clear that -- not being a logical constraint on the data -- it belongs in the schema at all.
> It protects you with zero cost
No, as TFA, the detailed analysis linked in TFA, and the documentation point out, it is not "zero cost". CHAR(x) and VARCHAR(x) have a non-zero performance cost compared to VARCHAR and TEXT.
It may be a justified cost, but it's absolutely not zero cost.
> As you can see times are very similar – there are differences between data types, but there is no clear "winner", and I'm much more inclined to assume that these differences come from random fluctuations of load on my machine, than from anything else.

> [...]

> This (I believe) proves that performance of all 4 datatypes is the same.
In other words, there is no difference between them in terms of performance. There is a possible slight increase in disk usage for CHAR, but really, that's used for fixed-size fields anyway (like product codes), so that's not an issue.
The idea that you should just blindly use TEXT fields is horrible advice. You should model your data accurately to make sure you can use your database to do its job - protect and store your data. Otherwise, why not just skip the pretenses and use a NoSQL storage engine.
If you need a TEXT field to store data that could be large, then do it. But don't make your "username" field a TEXT when VARCHAR(300) would do.
>the right tradeoff to prevent malicious action may not be the same across all interfaces
right, but it is the same at the same place, the database, so that's where you should put that constraint. There's not a lot of conceptual load here.
Put limits on the database so your database doesn't get knocked over. No, don't put limits in your client so your database doesn't get knocked over; put the limits for the database in the database.
I don't see where the gap is here. Sounds like premature optimization to me. Yes, it does matter that Postgres abstracts the standard SQL datatypes away in the backend; no, it doesn't matter what the performance impact of that is. Why? Because unless you're committed to this database backend, trying to make it run faster is a waste of effort. Let Postgres do that for you.
Do your job as a programmer and set up your database/schema right using the standardized standards at the standard level, then let the database do its job and set up the actual bits how it thinks is best.
>I have to disagree. You should always put limits on everything
like any orthodoxy it should have a limit put on it.
>But if there are multiple interfaces (such as a REST API etc) to your database then you have to remember to put them in place everywhere.
Multiple interfaces going directly to the database - that's a much bigger problem; the rest pales before it. In particular, having to update limits in multiple places would constitute only a minor share of the headache-causing updates, and the duplicated updates themselves would be just a minor share of the whole headache.
Good God, yes. I've worked on too many older projects where all the old business logic lived in the database, and all the new business logic lived in the application, and keeping them sorted out was a nightmarish pain. The defense for using horrible database-specific procedural languages for all that business code was that it meant we could have one back-end for all the front-ends. I've seen this pattern so many times, and it's pure agony.
> Sure, you should ideally do this in your application code. But if there are multiple interfaces (such as a REST API etc) to your database then you have to remember to put them in place everywhere.
Whenever I've done this I've just separated out the model/db access layer into a module that is shared between projects that need it and built the validation into that layer. DRY.
The article certainly doesn't advocate for removing any constraints; there are just much, much more flexible ways to accomplish them in postgres, some of which offer similar performance. A domain using a TEXT field and constraints is probably the most performant (and flexible).
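A minimal sketch of that domain approach (the domain name and limit here are made up):

CREATE DOMAIN username AS TEXT CHECK (length(VALUE) <= 300);
CREATE TABLE users (name username NOT NULL);

Every column declared as username then shares one length rule, defined in a single place.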
One difference is that changing the length of a varchar column in an existing table can be a major pain if you have other database objects that must be the same type. For example, any views which also carry this column will need to be dropped and recreated (or otherwise modified), or else Postgres won't allow you to make the change.
Working with the text datatype and using check constraints on length makes this much easier.
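A sketch of the pain (table and view names are hypothetical; the error text may vary slightly by version):

CREATE TABLE t2 (col VARCHAR(50));
CREATE VIEW v AS SELECT col FROM t2;
ALTER TABLE t2 ALTER COLUMN col TYPE VARCHAR(100);
-- ERROR: cannot alter type of a column used by a view or rule

With TEXT plus a check constraint, no view depends on the length, so you just drop and re-add the constraint.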
Yes, but with some minor caveats[1]. I prefer always using check constraints, since then you get all length constraints in the same place in the table definition.

1. The SQL standard requires truncation of trailing excess spaces on inserts which overflow the limit. Additionally, the limit must be less than or equal to 10485760, which is less than the maximum length of a string, 1GB.
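The truncation caveat is easy to demo (a sketch; table name is made up):

CREATE TABLE s (v VARCHAR(3));
INSERT INTO s VALUES ('abc   ');  -- accepted: the excess trailing spaces are truncated
INSERT INTO s VALUES ('abcd');    -- ERROR: value too long for type character varying(3)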
Another thing is to plan for a solution that can expand. Don't use a data type that requires massive table rebuild times if you ever increase its size. If adding a column or expanding a field takes O(n) time, don't expect n to be small forever. If an operation takes greater than O(n) time, realize that you'll pay that price eventually.
Basically, establish soft upper bounds.
And as a P.S.: don't expect anything sanitized by your application layer to be an invariant. Your database and the rules enforced by it are the only real invariants. If an unexpected character in a name field will blow up your application, you should fix it in the database (varyingly difficult with many RDBMS solutions) or treat it as user input and sanitize/scrub it at the application layer (more common with NoSQL solutions). Ideally, do both!
Wouldn't that kind of code live in a data layer that sits between the outside world and your database? So you may manipulate the database in multiple ways but they all go through the same/similar codepaths.
As others have pointed out, you don't have to put length constraints in multiple places if you have DB access go thru a suitable layer or module, and this is generally (AFAIK) good architecture.
Surprised someone hasn't pointed out, too, that catching this in code is faster/cheaper than a database hit and erroring back to the front end. (It's less of a difference with AJAX but significant in traditional server-side applications, of which there are more than you might think.)
I do have all my different applications go through a suitable layer to access the data. That layer is called postgresql. It stores all the data, and all the rules about what constitutes valid data. This way, there is never any chance of invalid data getting in from one of the half dozen applications written in various languages forgetting one of the hundreds of rules.
As long as the limits are actually generous and based on actual user studies. There are so many horror stories about VARCHAR(1000)s in places such as medical reports, where it can be a matter of life and death that users can include everything they think is relevant, without awkward abbreviated language that's easily misunderstood...
If you want to change the max length to be larger, postgres will have to rewrite the table, which can take a long time and requires an exclusive table lock for the entirety of the operation.
As of (IIRC) 9.2, this is no longer true. If you alter a varchar column to be narrower than it currently is, you'll rewrite the table. If you make it wider, or convert from varchar(n) to text, you won't.
EDIT: And if you're willing to hack the system catalogs instead of using an ALTER TABLE, you can even get around the rewrite to narrow a varchar(n) column. See: http://stackoverflow.com/a/7732880
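Concretely, on 9.2+ something like the following should behave as described (a sketch; assuming t.col is currently varchar(50)):

ALTER TABLE t ALTER COLUMN col TYPE VARCHAR(100);  -- widening: no table rewrite
ALTER TABLE t ALTER COLUMN col TYPE TEXT;          -- varchar(n) to text: no rewrite
ALTER TABLE t ALTER COLUMN col TYPE VARCHAR(20);   -- narrowing: full table rewrite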
1. While the linked blog post is new today, it's mostly a link back to a different 2010 blog post.

2. The linked blog post and the 2010 blog post basically discuss performance considerations that have been documented clearly in the PostgreSQL documentation for character data types since version 8.3 (and less completely for several versions before that): CHAR(x) is worse than VARCHAR(x), which is worse than VARCHAR and TEXT.

3. The linked blog post says "don't use CHAR or VARCHAR", but really, it should be "don't use CHAR(x) or VARCHAR(x)". VARCHAR is pretty much identical to TEXT.
Yes, because varchar is implemented as a fixed-length list of characters. Text fields are implemented as blobs, and as such can grow large enough to have to be stored off-page (with all of the performance hits associated with that).
As someone who regularly has to write SQL in 7 different dialects I much prefer VARCHAR to TEXT as the former is handled much more consistently across platforms. A TEXT column in MS SQL Server is an entirely different proposition to a VARCHAR column. VARCHAR on the other hand is treated with relative similarity between most of the systems I regularly use.
Additionally, one of the key benefits of more explicit datatypes is documentation. Knowing that a column is 30 characters wide is useful information to have at hand (without having to check check constraints) and often reflects a business rule. Where joins have to be performed on character columns it also helps to know if both sides of the join are both (say) CHAR(8). If every text column is TEXT then that's much less clear. Check constraints help but you don't always know if they've been applied to all current data (some platforms allow for constraints to ignore current data). If I know a column is VARCHAR(50) then I am 100% certain that there will be no value longer than 50 in it.
Couldn’t agree more. Database constraints should be thought of as the last line of defence against madness rather than as a means to validate input. Database constraints are not really suitable to defend against attackers. Constraints might stop users from creating extremely large records, but they won't stop users from creating an extremely large number of records, etc. You need to sanitise your input thoroughly in the application layer.
There is also a maintenance cost to putting constraints on the database, as your application will almost assuredly have similar constraints. Therefore, if those constraints change, the workload is not just doubled because you have to update the logic in two places, but perhaps tripled or more, as the developer tries to find any other locations where the same logic might have been put as well.
If the logic is in two places it might very well be in three or more.
Silly example, who decides what is 'an extremely large number of records'? Whoever has a view about this should monitor and police the limit. If you are saying that you can't protect against every eventuality so you may as well guard against none, then that is asinine.
Applications should enforce correct application behaviour regardless of user behaviour. Databases should maintain database integrity regardless of application behaviour.
From my database course I learnt that nothing is slow in a database until you can't fit your join operation in memory. Having to do a join operation on varchar(300) is predictable. Having to do one on text is unpredictable, and can slow your system to a crawl if the operation needs to be done on disk rather than in memory.
From what I know, the article did not talk about joins at all. I would love it if someone had a good article comparing what happens when you do a join on varchar vs text.
EDIT: One question remains: how is the "text" stored when doing a join? For varchar you just make an array, because all values are the same length. With "text", do you first find the longest text and use that as the default array element size? This may not seem important for in-memory operations, but seeking on disk is considerably faster when all your elements have the same size.
The memory usage of a join has many more variables than the size of the joined field (in some cases it might not matter at all as long as that field is not included in the result set) so I would not say that joining on a short string is that much more predictable than joining on a possibly long string. What matters the most is what the query actually does, what the data looks like, and what indexes you have.
Joins are usually implemented in one of these three ways, and the planner can select which one to use depending on the query and table statistics:
1. Hash join: Build a hash table of all the joined fields of set 1 and then traverse the rows of set 2 looking them up in the hash table.
2. Merge join: Sort both sets of rows and merge them. With the right indexes you may not even need to sort anything, just traverse in index order.
3. Nested loop join: For every row in set 1, find matching entries in set 2.
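You can check which strategy the planner chose for a particular query with EXPLAIN (hypothetical tables a and b):

EXPLAIN SELECT * FROM a JOIN b ON a.name = b.name;
-- the plan will contain a "Hash Join", "Merge Join" or "Nested Loop" node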
Use VARCHAR because constraints are a good thing, and use CHAR if you are storing strings of a fixed length, because semantics are a good thing. The point about padding being wasteful for variable-length data in CHAR is moot, because CHAR is for storing fixed-size strings like state codes.
As the PG docs say, there is virtually no performance difference at all between all three, so stick with standard practices.
> and use CHAR if you are storing strings of a fixed length, because semantics are a good thing
The CHAR type is not what most people think it is. It is a blank-padded string, not a fixed-length one. Inserting 'a' into a CHAR(2) results in 'a ' being stored and retrieved from the database. You should always use VARCHAR or TEXT in PostgreSQL and never CHAR (at least I cannot think of a case when you would want it). CHAR is there for SQL standard compliance. Instead use one of these:
field VARCHAR(2) CHECK (length(field) = 2)
field VARCHAR CHECK (length(field) = 2)
field TEXT CHECK (length(field) = 2)
The rules for working with blank-padded strings are, in my personal experience, a hell of non-obvious results when casting, sorting, comparing, and doing other operations.
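A few examples of those non-obvious results (a psql sketch; behavior as described in the Postgres docs on character types):

SELECT 'a'::CHAR(2) = 'a '::CHAR(2);  -- true: trailing blanks are ignored in comparison
SELECT 'a'::CHAR(2) || '!';           -- 'a!': blanks are stripped when cast to text
SELECT length('a'::CHAR(2));          -- 1, not 2: trailing blanks don't count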
I do not see how either VARCHAR or CHAR would provide more or less information since both are variable size strings. One is space padded and one is not.
If you care about semantics you should create a domain based on VARCHAR.
the fact that one sees "CHAR" in the schema definition is self-documenting that this is a fixed-length field (and CHAR is "fixed length": the size of the string you get back will always be N); that's what I mean by "semantics". E.g., "what does it mean that the developer used CHAR here and not a VARCHAR?"
The CHECK constraint you illustrated earlier can just as well be placed on a CHAR (using trim() as well to adjust for padding). Then there's no chance of any blank padding issues either.
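e.g., mirroring the earlier examples (a sketch):

field CHAR(2) CHECK (length(trim(field)) = 2)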
But the semantics of CHAR are not what most people expect and almost never what you actually want. If you want an actual fixed length of non-blank data you need an additional check constraint to enforce the min-length.
CHAR semantically represents fixed-length text fields from old data file formats, not "this data always has n (non-blank) characters".
If you do have varying lengths then a VARCHAR is more appropriate. Also, a lot of application frameworks that interact with the database only deal with VARCHAR, so as soon as you use a CHAR you have to start worrying about trimming your text data, because one of the biggest database text-type gotchas is accidentally comparing a VARCHAR and a CHAR improperly.
While I can see good reasons to include length checks, there is never a good reason to use a CHAR unless you're trying to interoperate with COBOL programs written in the 80's.
> But the semantics of CHAR are not what most people expect
how do you know that? did you take a survey? I've been working with DBAs for years, in my world everyone knows that's how CHAR works. The padding behavior is nothing new, and is intuitive - the value must be N characters in length, so if you stored less, you get back a right-padded string. This is exactly what I'd expect.
> CHAR semantically represents fixed length text fields from old data file formats
and two or three letter character codes like country codes, state codes, etc. are what we use CHAR for; these are fixed-length text fields. They are still in modern use today. Plus lots of us still have to write apps that actually read old files too - CHAR is appropriate for these as well, assuming you are storing fields that aren't right-padded in the source datafile (such as social security numbers, etc.).
Your app will of course work with a VARCHAR instead, but the point of CHAR is that it's self-documenting as to the type of data to be stored in the field - fixed length, as opposed to variable length.
> If you do have varying lengths then a VARCHAR is more appropriate.
if you are storing variable length, then you should absolutely use VARCHAR. That's why it's called "VAR", it means, "variable".
> Also, a lot of application frameworks that interact with the database only deal with VARCHAR, so as soon as you use a CHAR you have to start worrying about trimming your text data
If you are using CHAR correctly, you don't have to trim anything, because you are storing a string that is exactly the length of the CHAR type. I'm not familiar with how an application framework would only know how to deal with VARCHAR and not CHAR; database adapters return strings for both types. And if there were such a framework, I'd not be using it.
> one of the biggest database text type gotchas is accidentally trying to compare a VARCHAR and a CHAR improperly.
all of which stems from the same, singular mistake - don't store variable length data in a CHAR - plus if you are comparing VARCHAR to CHAR, that is also usually doing it wrong, as an adequately normalized database wouldn't be repurposing some kind of fixed length datatype out into a VARCHAR of some kind elsewhere.
The aforementioned CHECK constraint is a good way to enforce that if the developers/frameworks in question tend to be error-prone about this kind of thing (it's not an error I've had much issue with, since I know how CHAR behaves).
> While I can see good reasons to include length checks there is never a good reason to use a CHAR unless you're trying to interoperate with COBOL programs written in the 80's
as it turns out, a vast portion of the world economy is supported by mainframes and old software that often spits out fixed-length datafiles; there is even Python code in my current work project, written just in the past six months, which is tasked with parsing such files (they are actually quite simple to parse, since you just pull out each field based on an exact position). Not to mention that boring things like state codes, country codes, and the like are often fixed-length fields.
I say it's not what people expect because everyone emphasizes the "fixed length" rather than the "blank padded" nature of CHAR. CHAR is only actually fixed-length if you ensure that it is so yourself. That's possible, but then you're just using CHAR as a placeholder for those semantics, not as something that naturally enforces them.
If you actually really really have fixed-length fields then yes, CHARs could be appropriate. But for many things, even though you intend for them to be static-length codes, things change when you start having to interoperate with systems designed with different constraints (for example, after mergers or acquiring a competitor). And I know that mainframes still exist, but they aren't the use case in mind when many say "USE CHAR".
Also, the database adapter that handles CHAR poorly is none other than JDBC on Oracle: http://stackoverflow.com/questions/5332845/oracle-jdbc-and-o...
(Yes, that is hilariously bad.) But the MySQL way of always ignoring trailing whitespace is not standard in all databases.
Yes, I did read it, and what I disagreed about is CHAR being semantically correct. It is not. The explanation was provided for the benefit of readers other than myself and the parent poster.
> As the PG docs say, there is virtually no performance difference at all between all three
Btw, there is a clear (though infrequent) case where using CHAR(X) vs. VARCHAR(X) may cause a huge performance penalty with iron-platter HDDs. CHAR(X) will always contribute X to the row length (even if the data value is Y < X, where VARCHAR(X) would contribute only Y), thus making it more probable to have your row TOAST-ed (i.e. have parts of the row stored separately in a separate table), thus causing a noticeable (when only a few rows are accessed) extra disk-head seek on non-cached data. Similar to exceeding the C/LOB in-row limit on other databases.
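If you want to check whether rows of a given table are being TOASTed, you can look up its backing TOAST table and its size (a sketch; the table name t is hypothetical):

SELECT reltoastrelid::regclass FROM pg_class WHERE relname = 't';
SELECT pg_size_pretty(pg_relation_size(reltoastrelid)) FROM pg_class WHERE relname = 't';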
Is anything really fixed-length? ISO country codes had a 2-char standard that got upgraded to a 3-char one, and RFC 5646 language codes are variable length. Plan for an extensible API, and just make sure that you control what data ends up on those tables.
What are you referring to with ISO country codes? There were 2-char and 3-char options from the beginning, and AFAIK the 2-char option is still the widely-used one. http://en.wikipedia.org/wiki/ISO_3166-1
What if the performance changes? What if you decide to migrate to a different DB at a later time? What if your software determines field types and sizes in the GUI based on the database schema? IMHO, always use the right field for the job.
If you're thinking of migrating to another database later, you're basically ruling out using half of postgresql's more compelling features up front. I kind of don't understand this line of thinking.
Sure, and ruling them out is a very reasonable thing to do if you don't need them. No reason to adopt gratuitously incompatible features. I've needed to move databases between Oracle and Postgres quite a number of times, and I can promise you it wasn't part of the initial "requirements".
> What if you decide to migrate to a different db at a later time?
Then chances are your VARCHAR will not work anyway, because while VARCHAR exists everywhere, its semantics and limitations change from one DB to the next (Postgres's VARCHAR holds text, its limit is expressed in codepoints, and it can hold ~1GB of data; Oracle's and SQL Server's limits are in bytes and have significantly lower upper bounds (8000 bytes IIRC)).
> What if the performance changes? What if you decide to migrate to a different db at a later time?
Couldn't agree more.
Unless you're at the point where preventing the DB from performing a length check on data it receives is going to provide a tangible benefit this article is awful advice.
Nothing is preventing you from adding your own check constraints, it's just moving from specifying the data storage as having a length to explicitly specifying the constraint.
I think you missed the entire point of the GP's message.
Tomorrow, Postgres could release an update that massively improves the performance of CHAR compared to TEXT. You can take advantage of that by using the correct (i.e. semantic) field.
Add to that, project requirements change - yeah, it happens. What if you need to port to MySQL, MSSQL, Oracle, etc.? Using the correct field types will make that easier; using a premature optimisation combined with a check constraint (blowing away any gains of that premature optimisation) makes it harder.
That's interesting. So can you put an index on a TEXT column in PG? From what I know, you can't do that in SQL Server/Oracle, you can only use full text searching (I think).
Note however that normal b-tree indexes cannot index column values larger than 8191 bytes. The 8191 byte limit is applied after possible compression of the string.
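Within that limit, an ordinary b-tree works on TEXT just as on VARCHAR (a sketch; table and index names are made up):

CREATE TABLE notes (body TEXT);
CREATE INDEX notes_body_idx ON notes (body);
CREATE INDEX notes_body_lower_idx ON notes (lower(body));  -- expression indexes work too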
Uh, shouldn't you use the most appropriate type available to describe your data, since that will simplify the process if you ever need to migrate to a different DBMS?
Nowadays the most appropriate type available to describe your data is a non-standard type that's specific to the RDBMS you're using, often as not. With SQL databases, you can generally only pick one of the following:
1. Any kind of expectation of hassle-free migration to a different RDBMS.
2. Benefiting from much of anything that's happened in any dialect of SQL since about the time Windows 3.1 hit the market.
Personally, I generally prefer #2, because #1 is kind of a myth anyway. For example, PostgreSQL's VARCHAR type has different semantics from Oracle's: one supports Unicode and the other doesn't.
Microsoft follows Oracle's approach and uses NVARCHAR to describe their infernal 2-byte format that might be UCS2-wrongendian or it might be UTF-16 depending on their mood and the tool you're using at the moment. Naked VARCHAR means you have to pick an 8-bit character-set like a savage.
Ya, or for Oracle you might be better off using VARCHAR2, which uses UTF-8. That way you aren't mired in savagery, but don't have to pay the performance hit of storing 2 bytes per character for text that's mostly in western European languages. Whereas SQL Server users are stuck choosing between doubling up on I/O and suffering codepages. Meanwhile in PostgreSQL you just use regular VARCHAR and pick utf8 as your character set like a proper subgenius.
Fun fact: In earlier versions of Portal, it was database portability that GLaDOS promised to give you after the experiment.
I've used PostgreSQL quite successfully for the past few years at rather large scales and I can tell you, using TEXT everywhere is sooooooooo much easier on everyone involved. Especially on large teams (hundreds of developers) where migrations are a big deal. And especially when the business teams are essentially dictating the use cases. Those people change their minds ALL THE TIME ("Yeah, I know we agreed that we only need about 20 characters for the description here, but we now think 25 will really make the whole thing pop, ya know?").
And as far as the argument for keeping schemas in strict SQL so that some future database switch can be made more smoothly...I mean c'mon. Even if you actually do that (who does that?) you're going to have more annoying things to do than replacing some TEXT columns to VARCHAR.
I wish more articles had tl;drs like this one. Very helpful. Often you have to skim through preludes and side stories with jokes to piece the gist together.
I'm of the opinion that your data structure should model your data. If your piece of data is best represented by char or varchar, then use it. I'm not super familiar with Postgres, but among other things, modelling your data correctly helps when another developer has to step in and maintain your app. They can easily get a sense of how the presentation layer should look if you've done so.
So is there actually any benefit to using text over varchar when the constraint is actually 0 to X, or instead of char when your input actually needs to be exactly X characters? It seems the real point of the article is make sure that these are really the constraints you want.
Interesting. I was just thinking about a variable-width item on disk expanding or contracting: if it had to be arranged in a serial physical order, it would necessarily be forced to move the other bits if it wasn't inserted at the end.
"Put a limit on everything. Everything that can happen repeatedly put a high limit on it and raise or lower the limit as needed. Block users if the limit is passed. This protects the service. Example is uploading files of logos for subreddits. Users figured out they could upload really big files and harm the system. Don’t accept huge text blobs either. Someone will figure out how to send you 5GB of text."
Sure, you should ideally do this in your application code. But if there is multiple interfaces (such as a REST api etc) to your database then you have to remember to put them in place everywhere. I don't see a good reason to make a username field to be TEXT instead of a generous VARCHAR(300). If somebody wants to choose a longer username than that, he's probably malicious. It protects you with zero cost and allows you to make some user input sanitation mistakes (we're all humans) in your application code.
[1] http://highscalability.com/blog/2013/8/26/reddit-lessons-lea...