Programmers: how to make the systems guy love you (mocko.org.uk)
89 points by mocko on Oct 17, 2010 | 35 comments



One thing to add to the points about documentation: if you as a developer either don't write it or do a half-assed job of it, then you won't be able to just hand it off to the systems guy - you're going to end up doing 2nd/3rd/whatever line support for it because the documentation is YOU. And that, of course, will mean less time for coding.


In my limited experience, you can never make the system guys love you, nobody can.

Simply put, system guys have a lot of downtime, where they do stuff that they enjoy, rather than work. So any time anyone interrupts them, they're going to be annoyed; it's only natural. You're taking them away from fun to something boring and annoying to fix.

Regarding the article, I am not sure why hosting considerations aren't talked about at the beginning of the project. Even working with non-technical people, this has always been one of the first issues: are we going to host it here, are you going to host it, can it use a shared platform, does it need a database? This is all thought about at the start of the project, and if need be a systems guy is brought in.


If the system guys you have worked with have lots of downtime, you have only worked with crap sysadmins.


I'm guessing in this case downtime does not refer to systems being in a down state but rather the sysadmins have time to work on whatever interests them.


logicalstack's comment still makes sense if you interpret downtime to mean what you've just explained. Here are some ways that good systems guys fill their time in between deployments and triaging problems:

1. Reading security bulletins and proactively trying to determine if systems are affected and what to do about it.

2. Reading about upcoming hardware and software so that they can plan the best platforms to deploy applications to.

3. Auditing applications for various problems.

4. Doing routine testing of the validity of backups and how easily they might be restored.

I upvoted logicalstack back to 0 because he may have a point: if your systems guys are reading Hacker News all day and commenting on threads about TechCrunch articles, they might have better places to spend their reading time. Believe it or not, there really are systems guys who have their hands full doing real work and don't just sit back in their chair surfing the net and playing video games.


> Simply put, system guys have a lot of downtime, where they do stuff that they enjoy, rather than work.

Assuming you mean 'free time'.

It seems you think System Administration is mainly fire fighting and waiting around to fire fight. Most of the basic stuff got offshored in the early 2000s, and the stuff that hits third level will actually require serious investigation.

But most system administration these days isn't fixing broken stuff: it's design, automation, audits, monitoring, and an ongoing list of continual pre-emptive and proactive improvement.

The shit SAs who can't program and are still employed are living under a sword of Damocles. Their jobs have already gone to Bangalore; they just don't know it yet.


One missing item is instrumentation and metrics. Understanding and debugging a complex application is made much easier by having an abundance of easily collectable metrics that describe the running or cumulative state of the system.


Totally. As important as logging is to profiling and debugging a system, collecting metrics is invaluable for keeping a system running. Being alerted to potential (or actual) problems can allow an admin to respond to them effectively, rather than trying to patch things together after it's hit the fan.


Another one: usually one of the first places a sysadmin is going to look when there is a problem is the log file. A good sysadmin will be able to read compiler-generated errors but still may not be able to do anything if the messages reference cryptic variable names and methods from code they've never seen. Spend a little time thinking about how an end-user, rather than a developer, would read the log file.

Obviously don't short-change yourself if you're the one who will be writing bug fixes and need detail, but a little can go a long way here, especially for problems a sysadmin may be able to diagnose and fix (connection issues, resource issues, etc.).


It can also save you a 2 a.m. phone call.

Some simple, "If you see message X, it means Y, so you should do Z. If that doesn't work _then_ call me."

Your debug-level messages can be esoteric, but make your error-level messages reflect what's broken from the perspective of the SA, who hasn't seen your source code.


Great read, but I was hoping for some secret advice on how to be loved by sys guys/gals that can't be known to us programmers unless somebody tells us :)


Okay, here are some:

Be aware of how the init/startup system works on the machines your software is going to be deployed to. If you're writing a long-running network service, this means providing an init.d script that supports at least start, stop, restart, and status actions. Learn enough shell scripting to write this script in sh or bash, not in a heavier-weight language like python or perl or php or ruby.

Write services that don't need to be running as root and/or know how to cease being root after their setup stage. Don't hardcode any settings (like port being listened on, or user to run as). Make a config file. And don't make it XML. Make it easy to build a package, deb or rpm, by providing the necessary debian/ dir contents or .spec file.

When the systems team says there's a bug in your code, fix it. If you don't, you risk having the notifications for the failed service sent to your phone, waking you up. When code is changing all the time, a bug in your code is more likely than a random system configuration or hardware issue. In many cases, machines with hardware problems have already been removed from service before you even know about it.

Don't make the systems team debug your code and provide a patch. I know a lot of developers who think systems people don't/can't program, or only know how to "script"; that's not true. Being able to program is, I think, required to be on the systems team. We'll help you debug and provide the tools to get things resolved, but we shouldn't have to patch your code for you.

Don't write services that keep a lot of internal, in-memory state without saving it to disk periodically; a service should be able to pick up where it left off if it's restarted.

Do write services that allow easy introspection into their state. The memcached "stats" command is an example. This allows us to easily hook your code into things like ganglia. Some programs, when sent a signal, dump info to a log file.
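
A minimal sketch of the signal-triggered variant, in C (my own illustration, not memcached's code; the counters and names are hypothetical):

    /* Dump internal counters to the log when sent SIGUSR1. */
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static volatile sig_atomic_t dump_requested = 0;

    /* Hypothetical counters your service would maintain. */
    static unsigned long requests_served = 0;
    static unsigned long errors_seen = 0;

    static void on_sigusr1(int sig) { (void)sig; dump_requested = 1; }

    int main(void) {
        signal(SIGUSR1, on_sigusr1);
        for (;;) {
            /* ... do real work, updating the counters ... */
            if (dump_requested) {
                dump_requested = 0;
                fprintf(stderr, "stats: requests=%lu errors=%lu\n",
                        requests_served, errors_seen);
            }
            sleep(1);
        }
    }

Then "kill -USR1 <pid>" gets you a snapshot without restarting anything.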

Speaking of log files, use techniques to detect whether your log file has changed, and reopen it if it's been rotated (stat(logfilename) != fstat(logfilefd)).
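
Roughly, the check looks like this (a sketch; "logpath" and "logfd" are placeholder names):

    /* If the inode behind our open log fd no longer matches the inode
     * at the log's path, logrotate has moved the file, so reopen it. */
    #include <fcntl.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int reopen_log_if_rotated(const char *logpath, int logfd) {
        struct stat by_path, by_fd;
        if (stat(logpath, &by_path) != 0 ||      /* path gone: rotated away */
            fstat(logfd, &by_fd) != 0 ||
            by_path.st_dev != by_fd.st_dev ||
            by_path.st_ino != by_fd.st_ino) {
            int newfd = open(logpath, O_WRONLY | O_CREAT | O_APPEND, 0644);
            if (newfd >= 0) {
                close(logfd);
                return newfd;                    /* caller logs here now */
            }
        }
        return logfd;                            /* unchanged */
    }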

Those who don't learn /bin and /usr/bin are doomed to reinvent them, poorly. Don't waste time reimplementing things that already exist. Like xargs. Or head. Or wc. Or yes. Or host. Or watch.

Learn how to use the stuff in binutils, like objdump, nm and strings.

Don't write scripts in python or perl or ruby or php that do some weird setup (like parsing command line arguments) and then just invoke something via the shell with system(). On the off chance you do need to do this, use exec instead.

Don't name scripts that might be integrated with other tools or batch jobs with a language-specific extension (unless this is required by your platform, cough ahem, or it's a language-specific library; reserve .py for python modules, for example). The reason is that your perl script might be rewritten in ruby one day, or optimized by porting it to C, and there might be stuff calling it, and then you end up with a ruby script named .pl. We don't name C binaries with a .c extension; your "production" scripts are logically "binaries" if not physically. (I've actually run across this.)

Use or create proper library routines to access structured data files. As a contrived example, use getpwent, don't parse /etc/passwd by yourself.
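
For example, in C (a sketch; getpwnam is the lookup-by-name sibling of getpwent, and "nobody" is just a sample account):

    /* Look up an account through the passwd API. Because this goes
     * through NSS, users from LDAP/NIS resolve too, which a
     * hand-rolled /etc/passwd parser would miss. */
    #include <pwd.h>
    #include <stdio.h>

    int main(void) {
        struct passwd *pw = getpwnam("nobody");
        if (pw)
            printf("uid=%u home=%s shell=%s\n",
                   (unsigned)pw->pw_uid, pw->pw_dir, pw->pw_shell);
        return 0;
    }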

Put a password on your ssh private key.

Don't architect distributed things that require passwordless ssh keys to run. If you need to distribute files to a bunch of machines, they should be pulled from an rsync or http server rather than pushed via scp or by remotely invoking rsync over ssh.

Learn about sudo, because you're going to have to use it.

Learn the difference between tmpnam, mktemp and mkstemp, and why you should use mkstemp.

Honor the TMPDIR environment variable.
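
A sketch in C combining the last two tips ("myapp" is a placeholder prefix):

    /* Honor TMPDIR, fall back to /tmp, and create the file with
     * mkstemp, which opens it atomically with O_EXCL and so avoids
     * the race that makes tmpnam/mktemp unsafe. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void) {
        const char *dir = getenv("TMPDIR");
        if (dir == NULL || *dir == '\0')
            dir = "/tmp";

        char path[4096];
        snprintf(path, sizeof path, "%s/myapp.XXXXXX", dir);

        int fd = mkstemp(path);          /* creates and opens atomically */
        if (fd < 0) { perror("mkstemp"); return 1; }

        /* ... write temporary data via fd ... */
        close(fd);
        unlink(path);                    /* clean up after ourselves */
        return 0;
    }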

Don't fill up the disks. When you get automated quota emails, do something about it.

Don't work around the limits that have been put in place on systems, they are most likely there for a reason; if you are hitting a limit, we're open to having them changed. For example, if there's a 4 gig address space limit on a machine for user xyz, don't work around that by changing the user your code runs as.

Write runbook documentation. This means a somewhat Q&A-style "How do I..." or "What to do when X, Y, Z". Make them easy to visually scan, and make keywords stand out. Hopefully your systems team has provided a place for these to be stored that is easily accessible and easy to update. A lot of people get bogged down when writing documentation, trying to figure out what's appropriate to document. A runbook, which is a living document, makes this easier, because it's obvious that you add to it as problems or corner cases crop up.

I'm sure I could come up with some more.


As a contrived example, use getpwent, don't parse /etc/passwd by yourself.

Absolutely not a contrived example. I still remember some $XXX,XXX-per-installation enterprise software which wouldn't integrate with our LDAP database because it was parsing /etc/passwd on the server it ran on. It introduced a huge headache in operations because we had to start maintaining login permissions in /etc/passwd in parallel, when it had previously been just a stub.


Oh, I've had to deal with all of these at one time or another. I said that was contrived because of this audience... I seriously hope no one is creating a web service that is reading /etc/passwd.


"Don't architect distributed things that require passwordless ssh keys to run. If you need to distribute files to a bunch of machines, they should be pulled from an rsync or http server, rather than via scp or remoting invoking rsync over ssh."

Is this statement directed at me? It seems to me that if you need some combination of deploying files and executing commands, the chances that you're going to implement it more securely than SSH are pretty low.

It's true that compromising that one machine with the SSH key is going to compromise your whole cluster, but at least you had to compromise a machine rather than hijack some network service.


Heh, someone found their hacker news password.

ssh is a network service just like any other, one which can potentially be compromised via any means (exposed key, social engineering, remote exploit, etc.). Why expose things like passwordless keys and make it easier to be compromised? One machine that exposes inbound rsync to transfer files is more secure than a cluster of machines that allow passwordless-key inbound ssh. ssh is also harder (more error-prone) to secure for limiting the commands that can be run or only allowing file transfers. When you configure rsyncd, you already know it's only supposed to allow file transfers, so it's straightforward to audit; ssh has many more capabilities, making auditing more difficult. The security of the files themselves is already not an issue, because they are distributed to the entire cluster, so transferring them over ssh doesn't gain any security advantage.

As for it being directed at you: this has occurred more than once in our environment, I suspect because something/someone set a precedent early on. Whether this is a serious problem depends on the environment. These are the things that keep systems people up at night. I mean, you don't actually expect me to say "put a passphrase on your ssh key" and "it's okay to have passwordless ssh keys scattered around the cluster", do you?


I'm not sure I agree with the objection to passwordless ssh keys. A system that uses passwordless ssh keys may be more secure than one using an unauthenticated protocol.


True. I guess it really depends on the data being transferred. I have not seen a case where remotely invoked rsync over ssh was necessary given the type of data it was. It's mostly a security thing. With passwordless ssh keys scattered around the network in use by cronjobs, the machines those keys can access are at somewhat more risk. If you run an rsync server to transfer the files, you know exactly how it can be accessed. With ssh keys, you have to ensure that, say, certain keys can only do certain things (which is more complex to set up and get right in the authorized keys file).


> init script

Doesn't seem very sensible to me. Why should the programmer write this, and not the packager or sysadmin? As a programmer I already have to deal with 2949832 platforms; isn't it the packager's/sysadmin's job to make sure it integrates into your specific system, provided that the programmer writes general usage instructions?

> Honor the TMPDIR environment variable.

Funny. My software used to honor $TMPDIR until a sysadmin complained to me that it shouldn't, and should honor some other my-app-specific env var/config option instead. And he's a competent sysadmin.

> Don't work around the limits that have been put in place on systems, they are most likely there for a reason

My experience is that the limits are very, very often not there for a reason, and the administrator didn't even know about the limits until my software hit them.

One good example is the file descriptor limit. If too many clients connect to my server then the fd limit will be hit, but instead of raising the limit, a considerable number of people go to my support forum to ask what's going on. I'm considering raising the limit automatically in the next version just to get rid of the support emails.

Another example is the stack size. A major user accidentally set his system stack size to 80 MB. My software creates a bunch of threads, but after a while it runs out of virtual address space because of the huge stack size. We found out about the 80 MB stack size after a while and had to explain to them why that was a mistake. They fixed it afterwards, but we wanted to avoid support questions like this again, so from that point on we hardcoded our thread stack sizes to 64 KB.


Re: initscript. If you provide a set of start commands, stop commands, and a status method, a nice SA will write it and ask you to test it, if you like.

Re: $TMPDIR. Alas, bad (i.e. most) enterprise monitoring tools assign calls for OS directories to systems teams directly, without looking at the file's owner.

Some systems guy gets an alert because you filled up /tmp, but obviously he can't delete your data, because he's a nice guy. All he can do is redirect the call to the app guys. It's better, in this case, to have your own dedicated storage in a custom folder.


This was a list of things that get your systems guy to love you. If you want your systems team to love you, make their job easier. If your systems guy creates the init script for you or just increases limits when you say, great: your systems guy already loves you and you need to do nothing.

[init script] Doesn't seem very sensible to me. Why should the programmer write this, and not the packager or sysadmin? As a programmer I already have to deal with 2949832 platforms; isn't it the packager's/sysadmin's job to make sure it integrates into your specific system, provided that the programmer writes general usage instructions?

I assume that anyone who has a systems guy/team is working towards deploying to an internally controlled production environment. Production environments for the kind of software the audience of hacker news creates usually do not include 2.95 million different target systems. If they do, then your systems team is doing it wrong.

The programmer should write this because the programmer knows how their software should be started and stopped.

My software used to honor $TMPDIR until a sysadmin complained to me that it shouldn't, and should honor some other my-app-specific env var/config option instead. And he's a competent sysadmin.

The way to figure out where to write temporary files is whichever of the following succeeds first: 1) a configuration-specified location that is writable, 2) the value of TMPDIR, if writable, 3) /tmp. The point of this tip was not to just assume /tmp is the best place to put temporary files, and not to read some other environment variable when TMPDIR is the de facto way to specify the temporary directory via the environment. Obviously, you should favor what your local systems guy says, since you're trying to win his love, not mine. But if he's anything like me, he can appreciate this list.

My experience is that the limits are very, very often not there for a reason, and the administrator didn't even know about the limits until my software hit them.

This is a function of different distributions, of any operating system, having different defaults. A good systems guy will be aware of defaults and be able to almost immediately know that a limit is being reached after a few questions about the problem.

One good example is the file descriptor limit. If too many clients connect to my server then the fd limit will be hit, but instead of raising the limit, a considerable number of people go to my support forum to ask what's going on. I'm considering raising the limit automatically in the next version just to get rid of the support emails.

You should be using getrlimit to determine how many file descriptors can be used, and then issuing a meaningful error message ("out of file descriptors, try increasing per-process file descriptor limits, see ulimit" or something of the sort) if that limit is reached. Even better, use getrlimit to get the soft limit and the hard limit, and use setrlimit to raise the soft limit to the hard limit. If you reach that, then also issue a meaningful error message. This would save a lot of support time, because your software's error messages indicate what to do, and any competent systems person will know how to increase the limit (or will know how to find out).
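
A rough sketch of that startup dance in C (the message wording is illustrative):

    /* Raise the fd soft limit to the hard limit at startup; if we
     * still run out later, the error message should say what to do. */
    #include <stdio.h>
    #include <sys/resource.h>

    int raise_fd_limit(void) {
        struct rlimit rl;
        if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
            return -1;
        rl.rlim_cur = rl.rlim_max;       /* soft limit -> hard limit */
        if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
            return -1;
        return 0;
    }

    /* Later, when accept() or open() fails with EMFILE:
     *   fprintf(stderr, "out of file descriptors; try increasing "
     *           "the per-process limit, see ulimit -n\n");
     */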

Another example is the stack size. A major user accidentally set his system stack size to 80 MB. My software creates a bunch of threads, but after a while it runs out of virtual address space because of the huge stack size. We found out about the 80 MB stack size after a while and had to explain to them why that was a mistake. They fixed it afterwards, but we wanted to avoid support questions like this again, so from that point on we hardcoded our thread stack sizes to 64 KB.

This is the perfect example of what happens when one just randomly increases limits. "Mmh.. if 16k is good, then 80 meg will be even better!" Someone who sets an 80 meg stack size obviously doesn't know what they are doing and is just blindly increasing limits in an attempt to get rid of some problem they don't understand. That's a band-aid, not a real solution.


An excellent list, I would add:

* Understand that disk is 300,000 times slower than memory and not an adequate substitute for it.

* Understand your VM. In particular, if it's Java, the JVM will eventually 'walk' the entire heap you set with -Xmx. If you have a number of VMs whose combined maximum heaps exceed your physical memory, you will be screwed shortly.


I'm sure I could come up with some more.

Please do. This was very instructive.


Learn about ipcs and ipcrm. It seems few people actually use SysV IPC stuff these days, but software you use, like apache, uses it, and if you stop/kill apache wrong, it leaves semaphores around. There is both a per-user and per-system limit to how many can be created.

Related to the above, learn the possible error results of system calls. ENOSPC, which expands to "No space left on device", is a possible errno value from calling semget. This doesn't mean "/tmp is full".
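
For instance (a minimal sketch; the error wording is illustrative):

    /* The same ENOSPC that means "disk full" from write() means
     * "semaphore limit reached" from semget(), so report which call
     * failed, not just the errno string. */
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ipc.h>
    #include <sys/sem.h>

    int main(void) {
        int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
        if (semid < 0) {
            if (errno == ENOSPC)
                fprintf(stderr, "semget: system semaphore limit reached "
                        "(check ipcs -s, not df)\n");
            else
                fprintf(stderr, "semget: %s\n", strerror(errno));
            return 1;
        }
        semctl(semid, 0, IPC_RMID);      /* clean up the semaphore set */
        return 0;
    }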

Read man pages. Use man -k.

Don't ask to have Some Random Editor(tm) installed on every system. Learn vim or emacs.

Know how to NFS mount from your OSX machine. Be familiar with setting up the autofs/automounter on OSX.

Know when and how to use the -L, -R, -g, -A, and -t ssh options.

Don't give things cute names, give them meaningful names. If it runs as a daemon, consider ending the name with a d.

If it's a daemon, give it the ability to daemonize itself. It's fragile to play weird shell tricks to daemonize something that can't do it on its own.

If it's a daemon, make a mode where it can run in the foreground for debugging purposes.
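
Both daemon tips in one bare-bones C sketch (the -f flag name is my choice, not any standard):

    /* Daemonize by default; stay in the foreground with -f for
     * debugging. daemon(3) exists on Linux and the BSDs; roll your
     * own fork/setsid pair where it's unavailable. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        int foreground = (argc > 1 && strcmp(argv[1], "-f") == 0);

        if (!foreground && daemon(0, 0) != 0) {
            perror("daemon");
            return 1;
        }

        for (;;) {
            /* ... service loop; in foreground mode, log to stderr ... */
            sleep(60);
        }
    }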

Know how to use strace.

Know how to use ltrace and ldd.

Set SO_REUSEADDR on listening sockets.
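
In C (port 8080 is just an example):

    /* Set SO_REUSEADDR before bind() so a quick restart doesn't fail
     * with EADDRINUSE while old connections sit in TIME_WAIT. */
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int make_listener(void) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        int one = 1;
        setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one);

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof addr);
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(8080);

        if (bind(fd, (struct sockaddr *)&addr, sizeof addr) != 0 ||
            listen(fd, 128) != 0) {
            close(fd);
            return -1;
        }
        return fd;
    }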

Use IM to ask throwaway questions.

Don't start conversations with "can I ask you a question?"

Don't expect a response right away.

Read emails from me, even if they look long. Ask about things you don't understand.

If there isn't a package for something that you need, install it into your home directory first, test it, test your code with it, before you ask it to be packaged and installed.

Don't skirt the firewall.

Know how to change your hosts file as a debugging aid.

Challenge us.


> SysV IPC

Which I find incredibly stupid. The system should have supported garbage collection of shm files from the beginning. It's almost impossible to write software that properly cleans up things like Unix domain sockets, FIFO files and shm files when killed by SIGKILL.

The best solution that I've found so far is putting garbage collection code in the admin tools. When you run the admin tool for querying the status of the server, it'll check for stale resource files and delete them.


Agreed, there should be some better way to see which SysV IPC resources are actually in use. All of them are supposed to be persistent (you can reattach to a shared memory segment later on). Given that they are persistent, they need to be cleaned up just like another, much more common resource: files on the filesystem. There is no built-in support for garbage collection of files either.

I find the whole key and id thing to be a somewhat weird part of SysV IPC too. It does allow you to consistently reattach to existing resources and avoid some collisions, but it just moves the identifier collision management from ints to key_ts, or, with ftok(3), to file paths. See the NOTES section of ftok(3).

However, I didn't design this stuff, I just manage and maintain it.


These are all things that any developer should know on his first day.


This is one of the reasons programmers building startups use "platforms as a service" like Google App Engine, Heroku, Rackspace Cloud et al. They really do simplify deployment and systems management and let you focus on code, particularly on platforms that are traditionally hard to scale without a lot of knowledge and work.


Can I just indulge in a bit of patriotism here? A few paragraphs in, I knew I was reading a British dude, and was confirmed right by the URL. Sure, no doubt there are many shibboleths to tell, but the main clue was the joy and easy humour of the writing. Just an incredibly British style, an utter lack of fear in mingling jokes and serious points.

The biggest thing the author won't dare admit - that he's a nice guy. That he cares about making it easier for others. I guess it's self-effacing, and I can see how others may prefer actual open honesty, but this is the way I work, and as such, this prose is like a warm bath to me.

"What are the five things most likely to break it? If somebody is trying to fix a problem and you left them a solution on page 5 of the manual they’re going to be really, really grateful. Once it’s working again they’ll buy you a beer to say thanks for all the trouble you saved them. Seriously, they will."


Most of those excuses mentioned really come down to one of two things: arrogance (I know best so you don't need to know) or laziness (why bother because see former), both of which are hardly unique to programmers.


It's much simpler than this. Buy him/her a case of beer. That's all you have to do. Works wonders. People bend over backwards for you after you buy them a case of beer. It doesn't even have to be good beer; you could get some shitty $40/case beer and they will love you. Soma.


$40/case what? That's the shitty range? Where do you live? Around here you can get shitty beer for $12/case!


How big is a case?


24? 30 for stones.


Wouldn't work with my crew; one of the IT Security/Systems people brews his own beer. The best thing with us is being upfront and not wasting our time, or we will find ways to waste yours.



