
I think you have already homed in on the basics: "You have to actually do it". If it seems like there are too many things to try and they all look appealing, you just need to understand yourself better, and self-observation will help you there.

You are wrong to suggest that we grow to like anything life forces us to do. If you have an analytical mind, a job flipping burgers or pressing buttons as a tester is never going to satisfy you. You might be able to change the job to something you like more (e.g. write test scripts) but that's a whole other topic.

Some additional things I have learnt that may be useful for you are: (1) There is no one thing predestined to be your calling in life, there are several things you may like. (2) If you really want something, you'll usually get it eventually or find something that you like better.

It is useful to set a long-term goal based on what you know you like so far. At least initially, while you are still in college, state it broadly, and don't tie yourself to a specific way of getting to that goal.

For example, I know I enjoy coding, technology, business and teaching. There are several combinations of these that could end up being my calling. The specific image I may have in mind is to work as an engineer / technical entrepreneur now and become a …


Old: 2.4 GHz Pentium 4, 4 GB RAM, 32-bit FreeBSD 5.3.

New: 3.0 GHz Core whatever, 12 GB RAM, 64-bit FreeBSD 7.1.


You were running FreeBSD 5.3 until recently? You know it reached its EoL at the end of October 2006, right? There have been lots of security issues over the past 2.5 years which weren't fixed in FreeBSD 5.3.

And by "FreeBSD 7.1" you mean 7.1-RELEASE-p3, right? :-)


So, if you've got all those in-memory data structures, there are lots of lists of pointers to objects. How does it compare on a 32-bit OS, with 32-bit pointers, vs. a 64-bit OS with 64-bit pointers?


64-bit mzscheme uses about 50% more memory than 32-bit mzscheme to hold all of the Hacker News data.


You mean compare in terms of memory usage? It will certainly eat more memory for the same data structures.

We've got a simple functional language here. Being very unscientific and looking at the key data structures, for my current workload straight pointers are about 5-10% of the allocation. In practice I do see a 10-25% memory usage increase going from 32-bit to 64-bit (Linux).

(Anecdotally, I've read that 20% is a rough rule of thumb for 64-bit anyway.)

There is also an increase in data structure size due to data alignment, but this is going to vary quite a bit, depending on the structure and the compiler. If you're doing dynamic memory allocation, you may even find it makes no difference at all, as malloc may have been delivering an oversize allocation anyway.
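As a rough illustration of the alignment effect (a hypothetical struct, not measured from the workload above):

    /* On ILP32: flags (4) + next (4) = 8 bytes.
       On LP64:  flags (4) + 4 bytes of padding to align the pointer
                 + next (8) = 16 bytes -- double, even though only the
                 pointer itself grew. */
    #include <stdio.h>

    struct node {
        int   flags;
        void *next;
    };

    int main(void)
    {
        printf("sizeof(struct node) = %zu\n", sizeof(struct node));
        return 0;
    }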

So it's certainly not a trivial increase, especially as we're currently running on a 256 MB slice (I sort of wish they had a supported 32-bit option), but it's not massive.

However, for us, 64 bit still has a lot of advantages. For example, you can do more expansive memory mapping and the like.

Not sure if that helps.


Obviously the pointers take twice as much space but I don't think that is what you are asking.

arc runs in MzScheme 352 (http://download.plt-scheme.org/mzscheme/v352.html)

The FreeBSD binaries on this page are i386; if that's what the new server is using, then it won't make a difference. The source is available as well; if pg compiled it in 64-bit mode, memory usage would increase.

But using a 64-bit OS doesn't necessitate using a 64-bit address space.


I'd guess that most of the memory is used by blocks of text which are much larger than the pointers which reference them, so probably the size_t expansion has a minimal impact on the total memory usage.


That's true, but 64-bit will affect other things as well: pointer size goes up, your stack will be bigger, and the packing (alignment) of your structures changes.

If your compiler is unfriendly, your structure size can change significantly. AFAIK, GCC is pretty clever about structure packing, but I'd imagine it varies a lot depending on the architecture.


thanks


Nonsense.


foof


167 small transactions per second on MySQL 4.1 / InnoDB / FreeBSD / FFS / SCSI, which looks like one per rotation (a 10,000 RPM disk makes 10,000/60 ≈ 167 rotations per second). Presumably that's the time required to write a block at the end of the write-ahead log.

5000 per second with --innodb_flush_log_at_trx_commit=0, which only writes the log to disk once per second.

The MySQL documentation claims that this configuration does crash-recovery correctly, though you may lose the last second's worth of transactions.
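For reference, the three values of that knob behave roughly as follows (per the MySQL documentation):

    innodb_flush_log_at_trx_commit = 1  # write + flush the log at every
                                        # commit (the default; durable)
    innodb_flush_log_at_trx_commit = 2  # write at every commit, flush
                                        # ~once a second (survives a mysqld
                                        # crash, not an OS crash)
    innodb_flush_log_at_trx_commit = 0  # write + flush ~once a second
                                        # (the 5000/second case above)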


BTW, in case anyone is thinking, "why would I need more than 167 transactions per second", imagine that you have a web site that displays everyone's "last active" time (for chat maybe). If users are performing some activity once every 30 seconds, then you will pretty much be at your limit with about 5000 simultaneous users (5000/30 ≈ 167 writes per second).

Robert, did you test MySQL with a parallel load? To give it the best chance at maximizing tps, it will need to be able to combine multiple pending transactions in a single write. Also, can you tell if it's updating its B-trees? (If not, it may not be able to sustain these rates.)


I ran the test with 1,000,000 INSERTs (taking 200 seconds), with the InnoDB cache size set to only 200,000 bytes. So there's a fair chance it updated the B-trees on disk.
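For anyone who wants to reproduce this, a rough sketch of that kind of benchmark using the MySQL C API (table and column names are hypothetical; error handling mostly omitted):

    #include <mysql.h>
    #include <stdio.h>

    int main(void)
    {
        MYSQL *conn = mysql_init(NULL);
        if (!mysql_real_connect(conn, "localhost", "user", "password",
                                "test", 0, NULL, 0)) {
            fprintf(stderr, "connect: %s\n", mysql_error(conn));
            return 1;
        }
        /* autocommit is on by default, so each INSERT is its own
           transaction -- which is exactly what this test measures */
        for (int i = 0; i < 1000000; i++)
            mysql_query(conn, "INSERT INTO t (v) VALUES (1)");
        mysql_close(conn);
        return 0;
    }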


On MySQL 4.1 / InnoDB / FreeBSD / FFS / SCSI, small inserts run at about 167 transactions per second, i.e. one per rotation.

Same setup but with --innodb_flush_log_at_trx_commit=0: a million transactions in 200 seconds, or 5000/second.

I don't know if InnoDB wrote its B-Tree to disk during the 200 seconds. Same performance even with InnoDB's buffer pool size set to 200 KB with --innodb_buffer_pool_size=200000.

The MySQL documentation claims that this configuration does crash-recovery correctly, though you may lose the last second's worth of transactions.


An ordinary DB writes each update to disk twice, once in the log and once in the permanent DB. DBs do defer updating the permanent DB. But they tend to be eager about flushing the log in order to achieve durability. If you read the documentation for PostgreSQL, for example, the only knob that seems to allow you to defer log flushes also says it may corrupt the internals of the DB if there's a crash. (Though there's no fundamental reason PostgreSQL couldn't defer log writes and have correct crash recovery, sacrificing only durability of recent updates.)
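A minimal sketch of that ordering rule (a made-up function, not any particular DB's code):

    /* Write-ahead logging in miniature: the log record must be durable
       before the permanent page it describes; fsync() is the ordering
       barrier between the two writes. */
    #include <sys/types.h>
    #include <unistd.h>

    void commit_update(int logfd, int dbfd,
                       const void *logrec, size_t loglen,
                       const void *page, size_t pagelen, off_t pageoff)
    {
        write(logfd, logrec, loglen);   /* append to the log... */
        fsync(logfd);                   /* ...and wait for the platter */
        /* only now may the permanent page be written -- eagerly here,
           though a real DB would defer this write for a long time */
        pwrite(dbfd, page, pagelen, pageoff);
    }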


That's a very strange knob. What's the point of doing any logging at all if it doesn't let you recover from a crash?

I don't imagine that deferred logging is a big deal, though, because log writes are by definition sequential. As Paul pointed out, sequential writes just aren't that slow. You can build an array of 62 commodity drives for maybe $4000.


Deferred logging is a huge deal! Appending a record to the log takes one rotation (half a rotation, on average). A rotation takes about the same amount of time as a disk seek. So a DB that synchronously appends to the on-disk log for each transaction will be slow.

The huge win is if you can append many transactions to the log in each rotation. To do that you have to gather up many updates per disk operation. So deferred logging is critical.
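A minimal group-commit sketch (pthreads; a toy, not MySQL's actual implementation): committing threads append records to a shared buffer, and a single flusher thread does one write+fsync per batch, so the cost of one rotation is shared by every transaction in the batch.

    #include <pthread.h>
    #include <string.h>
    #include <unistd.h>

    static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  flushed = PTHREAD_COND_INITIALIZER;
    static char   buf[65536];
    static size_t used;
    static long   flush_seq;        /* bumped after every fsync */
    static int    logfd;

    /* called by each transaction at commit time */
    void log_commit(const void *rec, size_t len)
    {
        pthread_mutex_lock(&mu);
        memcpy(buf + used, rec, len);   /* no overflow check: a toy */
        used += len;
        long seq = flush_seq;
        while (flush_seq == seq)        /* block until a flush covers us */
            pthread_cond_wait(&flushed, &mu);
        pthread_mutex_unlock(&mu);
    }

    void *flusher(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&mu);
            if (used > 0) {
                write(logfd, buf, used);
                fsync(logfd);           /* one rotation, many commits */
                used = 0;
                flush_seq++;
                pthread_cond_broadcast(&flushed);
            }
            pthread_mutex_unlock(&mu);
            usleep(1000);   /* a real system would pace this smarter */
        }
        return NULL;
    }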

I suspect the reason PostgreSQL doesn't really support delayed log flush is that they are thinking about ACID transactions, where you really need the data to be on disk immediately. A more technical issue is that the log data must be on disk before the corresponding permanent data (otherwise crash recovery will break), and I suspect postgresql.conf's "fsync" option has the effect of not fsync()ing the log at all, which indeed would cause permanent corruption after a crash.


[ Replying to a very old thread ]

Indeed, fsync = off just means that the WAL isn't fsync'd at all, which can cause permanent corruption after a crash.

PostgreSQL does support a "deferred logging" mode, in which one or more transactions can avoid fsync'ing the WAL without risking data corruption -- the only risk is that those particular transactions might not be durable if the system crashes before the next fsync. This allows you to mix must-be-durable transactions with more transient ones, which is a nice feature.
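That mode is presumably synchronous_commit (PostgreSQL 8.3 and later), which can be set per session or per transaction. A sketch, with a hypothetical table:

    -- This transaction may be lost if the server crashes before the
    -- next WAL flush, but the database is never corrupted.
    BEGIN;
    SET LOCAL synchronous_commit TO OFF;
    UPDATE users SET last_active = now() WHERE id = 42;
    COMMIT;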


Don't drives have hardware buffers designed so that you don't have to worry about that?


The drives whose documentation I've read say they may not copy the write-cache to the surface during a power failure. I don't know about other drives, or about why.

Such a feature would anyway be hard or impossible to use as part of a design to get fast writes and crash recovery. Crash recovery usually depends on constraints on the order writes were applied to the disk surface -- for example that all the log blocks were on the surface before any of the B-Tree blocks. Or (for FFS) that an i-node initialization goes to the surface before the new directory entry during a creat(). Drives that just provide write caching don't guarantee any ordering (much of the point of write-caching is to change the order of writes), and don't tell the o/s which writes have actually completed. So the write-order invariants that crash recovery depends on won't hold with write-caching.

That's why tagged command queuing is popular in high-end systems: TCQ lets the drive re-order concurrent writes, but tells the o/s when each completes, so for example a DB can wait for the log writes to reach the surface before starting the B-Tree writes.

In our case, perhaps a pure log-structured DB could use a disk write-cache. Crash recovery could scan the whole disk (or some guess about the tail of the log) looking for records that were written, and use the largest complete prefix of the log. But we would not be able to use the disk for anything with a more traditional crash recovery design -- for example we probably could not store our log in a file system! Perhaps we could tell the disk to write-cache our data, but not the file system's meta-data. On the other hand perhaps we'd want to write the log to the raw disk anyway, since we don't want to be slowed down by the file system adding block numbers to the i-node whenever we append the log.
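A sketch of that recovery scan, assuming a made-up record format of a 4-byte length and a 4-byte CRC followed by the payload:

    /* Scan the log from the start; stop at the first record that is
       truncated or fails its checksum. Everything before that point is
       the largest complete prefix. crc32_of() stands in for any real
       CRC implementation. */
    #include <stdint.h>
    #include <string.h>

    uint32_t crc32_of(const void *p, size_t n);   /* assumed elsewhere */

    size_t log_prefix(const uint8_t *disk, size_t disklen)
    {
        size_t off = 0;
        while (off + 8 <= disklen) {
            uint32_t len, crc;
            memcpy(&len, disk + off, 4);
            memcpy(&crc, disk + off + 4, 4);
            if (len == 0 || off + 8 + len > disklen)
                break;                  /* torn tail: ran off the end */
            if (crc32_of(disk + off + 8, len) != crc)
                break;                  /* partially written record */
            off += 8 + len;             /* intact; keep scanning */
        }
        return off;                     /* log valid up to this offset */
    }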


You can configure a drive to delay writing to the disk surface, and instead just write into its cache, until some later point when it's convenient to write the surface. But the reason a DB issues a write to the disk is that the DB needs the data to be recoverable after a crash before the DB can proceed. So DBs cannot easily use the disk's write-to-cache feature; the disk's cache is no more durable than main memory.

You might imagine that the disk would write-cache only an amount of data that it could write to the surface with the energy stored in its capacitors after it detected a power failure. But this is not the way disks work. Typical disk specs explicitly say that the contents of the write-cache may be lost if the power fails.

You may be thinking of "tagged queuing", in which the o/s can issue concurrent operations to the disk, and the disk chooses the order in which to apply them to the surface, and tells the o/s as each completes so the DB knows which transaction can now continue. That's a good idea if there are concurrent transactions and the DB is basically doing writes to random disk positions. In the log-append case we're talking about, tagged queuing is only going to make a difference if we hand lots of appends to the disk at the same time. In that specialized situation it's somewhat faster to issue a single big disk write. You need to defer log flushes in either case to get good performance.


You might imagine that the disk would write-cache only an amount of data that it could write to the surface with the energy stored in its capacitors after it detected a power failure.

That's exactly what I assumed, at least for high-end disks. Any idea why they don't do that? It seems like a pretty trivial hardware feature that would save an awful lot of software complexity.


The knob is there to let the OS do its best buffering "if you trust your operating system, your hardware, and your utility company."

http://www.postgresql.org/docs/8.2/interactive/runtime-config-wal.html


A challenge in a pure version of a log-structured database is reclaiming space from the log when data is no longer needed. Sometimes you don't need to, as in Venti (http://citeseer.ist.psu.edu/531029.html).


I wonder what this means for practical system design. Do people currently build assumptions about hard drive failure patterns into their systems, in a way that they should change? I suppose independent failure (i.e. copying data to two drives is better than storing it on just one) is the main assumption behind e.g. RAID; I wonder whether Google has any new insight there.


You should be able to improve over naive RAID by pairing a relatively high-probability-of-failure drive with a low-probability one. I.e., what you *shouldn't* do is the common practice of putting two new drives in a mirror, since they are both in the infant-mortality part of the failure curve. What this data suggests is that you'll get a smaller chance of losing data (via simultaneous failure) if you pair a new drive with an older, "proven" one (but not one so old that it is nearing end of life).


Infant mortality is practically non-existent for enterprise-class drives and rare for consumer-class drives.

There is something to gain by using drives from different manufacturers (or different lots from the same manufacturer) within an array.


Companies would be interested in saving on cooling costs if it's not providing any significant benefit.


Yes, the classic RAID paper assumes that faults are independent. This is not the case.

Some recent work extends the basic analysis to deal with correlated faults.

