Storing IP addresses as integers

yagibear · on Oct 7, 2009

Compatibility with IPv6 is much more important than saving a couple of bytes, IMHO. Better to keep as a string.

pmorici · on Oct 7, 2009

Storing the address as a string doesn't implicitly give you any better ipv6 compatibility than storing it as an integer.

I would think that integers would allow for better compatibility, seeing as a number's a number no matter what; just expand the data type size of the column. The fact that ipv6 allows you to write the same address in multiple ways, using the :: notation to denote a single span of zeros, means there are a multitude of string representations of the same address.

tptacek · on Oct 7, 2009

Huh? Littering your code with u_int32_t declarations and comparisons against magic numbers like "3232235520" doesn't hurt IPv6 compatibility? Pants are shirts!

pmorici · on Oct 7, 2009

Maybe that's the way you'd do it. I was thinking more along the lines of...

inet_addr_t some_var = inet_aton("123.456.789.123");

The way you represent constants in code doesn't have to be a direct correlation with the way you store their representation.

tptacek · on Oct 7, 2009

I don't know what language this is in, but if it's C, you meant to say inet_aton(string, &addrstruct), and that string is probably more valuable than struct in_addr. Since C doesn't offer a 128 bit scalar type, I'm not sure how inet_addr_t helps you any more than u_int32_t.

pmorici · on Oct 8, 2009

Pseudocode; the language isn't important. The point is, because there are a number of ways to write the same ipv6 address (assuming it contains runs of zeros longer than 4 or multiple runs of zeros), you are going to have to normalize your string input no matter what, so why not normalize to an integer/byte representation, which is more compact?

tptacek · on Oct 8, 2009

The conversation is about portability, and the observation is that the code you're talking about still assumes an address is a scalar variable, which effectively means it assumes IPv4 addressing. That's all.

I'm not super thrilled to debate the performance merits of optimizing 16-byte strings vs. 4-byte integers. I use whatever is most convenient. It's slightly easier to convert a charstar to a u_int32_t, but it's much easier to index a u_int32_t.

pmorici · on Oct 8, 2009

The conversation is about compatibility, not portability, i.e., what is the best way to store ipv4 addresses so that they are compatible with ipv6 addresses when you need to start supporting those.

At any rate we are talking about two different things.

If you have the ipv6 address 1234:0000:5678:0000:0000:9212... you can write it a number of different ways, using the :: abbreviation for the various runs of zeros. ipv4 really doesn't have that pitfall in the string representation of its addresses.

So, if you store it as a string one way and then later get input that has the same address written another way, you are going to have problems if you don't normalize. Since we are talking about storing in a database, as the parent to this chain was, you would likely want to normalize to the most compact representation to save space in your database. The precise type semantics of your language of choice are irrelevant.
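To make the normalization point concrete, here is a minimal sketch using Python's standard `ipaddress` module. The address `2001:db8::1` and the variable names are illustrative assumptions, not values from this thread.

```python
import ipaddress

# Three spellings of one IPv6 address (illustrative value, not from the thread).
spellings = [
    "2001:0db8:0000:0000:0000:0000:0000:0001",
    "2001:db8:0:0:0:0:0:1",
    "2001:db8::1",
]

# Parsing maps every spelling to the same 16-byte / 128-bit value.
packed_forms = {ipaddress.IPv6Address(s).packed for s in spellings}
int_forms = {int(ipaddress.IPv6Address(s)) for s in spellings}

# Converting back to text yields one canonical string.
canonical = str(ipaddress.IPv6Address(spellings[0]))

print(len(set(spellings)), len(packed_forms), canonical)
```

Comparing the raw strings would treat these as three different addresses; comparing the packed bytes or the integers treats them as one.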

dpifke · on Oct 7, 2009

Python has socket.inet_pton, which also supports IPv6:

http://docs.python.org/library/socket.html#socket.inet_pton

Edit: Ooh, and Py3k has the ipaddr module:

http://docs.python.org/dev/py3k/library/ipaddr.html
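A small sketch of what `socket.inet_pton` gives you, assuming a platform whose Python build exposes `AF_INET6`. Note that the module linked above shipped in the Python 3 standard library under the name `ipaddress`. The sample addresses are illustrative.

```python
import socket

# Text -> packed binary, for both address families.
v4 = socket.inet_pton(socket.AF_INET, "192.168.1.1")   # 4 bytes
v6 = socket.inet_pton(socket.AF_INET6, "::1")          # 16 bytes

# The packed IPv4 form is the 32-bit integer in network byte order.
v4_as_int = int.from_bytes(v4, "big")

# Packed binary -> text again.
v4_text = socket.inet_ntop(socket.AF_INET, v4)

print(len(v4), len(v6), v4_as_int, v4_text)
```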

neilc · on Oct 7, 2009

Compatibility with IPv6 does not mean you need to store addresses as strings -- most IP address ADTs support both IPv4 and IPv6 in some way. But in reality, IPv6 is still not worth worrying about for the vast majority of software (YAGNI).

pmorici · on Oct 7, 2009

MySQL, also has built-in functions to do this, http://dev.mysql.com/doc/refman/5.0/en/miscellaneous-functio...

neilc · on Oct 7, 2009

PostgreSQL has IMHO a nicer solution: abstract data types to represent IPv4, IPv6, and MAC addresses, along with the functions over those types that you'd expect.

http://www.postgresql.org/docs/8.4/static/datatype-net-types...

http://www.postgresql.org/docs/8.4/static/functions-net.html

audidude · on Oct 7, 2009

why would you ever store a 32-bit ip address on disk or memory in anything other than a 32-bit word

lucumo · on Oct 7, 2009

Because it's easy. CGI scripts usually retrieve the IP address in dotted quad notation from the environment. Converting it can be a waste of time if the space doesn't matter that much.

ars · on Oct 7, 2009

Because of IPv6.

You'll need a 128-bit word for that. Plus some bignum math libraries for 32-bit machines that don't do 128 bits natively. (Do 64-bit machines handle 128 bits natively?)

Also watch out that you use an unsigned int. In PHP for example (and probably most other dynamic languages that don't do bignum as well) all ints are signed. So you'll have to work with the number as a string.
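The signed-integer pitfall can be shown with a short sketch. It uses Python's `struct` module to reinterpret the same 32 bits as signed; this illustrates the general problem and is not a model of PHP's exact behavior.

```python
import struct

ip = 3232235777  # 192.168.1.1 as an unsigned 32-bit value

# Reinterpret the same 32 bits as a signed 32-bit integer.
as_signed = struct.unpack(">i", struct.pack(">I", ip))[0]

# Masking with 0xFFFFFFFF recovers the unsigned value.
recovered = as_signed & 0xFFFFFFFF

print(as_signed, recovered)
```

Any address at or above 128.0.0.0 has its top bit set, so in a signed 32-bit column it comes out negative and range comparisons stop behaving as expected.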

dpifke · on Oct 7, 2009

I'd be curious if matching 128 bit words is slower or more painful than variable-length string comparisons in any RDBMS. I would bet not.

Plus, you can't do CIDR operations on strings, making it a pain to, e.g., match all addresses within a given /27.

audidude · on Oct 7, 2009

You will often have to do a page-fault for both, so loading won't be too much of an issue. However, you have to get it into an integer to compare it to begin with. So the string method is purely overhead.

tptacek · on Oct 7, 2009

A page fault, because you touched a 4 byte word, a 16 byte string, or a 16 byte binary address? We're talking about data types that fit in a single L1 cache line.

audidude · on Oct 7, 2009

Something has to load it from disk into the cache line ..

ori_b · on Oct 7, 2009

Compatibility with other addressing formats (hostname? ipv6? $OTHER_THING?) or human readability.

audidude · on Oct 7, 2009

Read my post again. I said store. There is no reason to store in a compiled format for humans.

tptacek · on Oct 7, 2009

You mean, besides the fact that storing binary IP addresses breaks "grep", and basically all of the rest of Unix too.

audidude · on Oct 7, 2009

That's fair. I was a bit narrow-sighted in my vision of high-performance scenarios.

brianobush · on Oct 7, 2009

I have done this in code that I have written (yes, IPv6 - haven't seen it yet and it has been talked about for years) that handles IP addresses.

One thing that is really easy once the ip address (x) is in the integer space: private address determination becomes a simple integer comparison. e.g., 0.0.0.0 is simply x > 0, in the range 192.168.0.0 to 192.168.255.255 is written as: x >= 3232235520 && x <= 3232301055, etc.
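A sketch of that integer-range check. The 192.168.0.0/16 bounds are the ones quoted above; the 10.0.0.0/8 and 172.16.0.0/12 ranges are added here from RFC 1918, and the helper names `ip_to_int` and `is_private` are hypothetical.

```python
import ipaddress

def ip_to_int(dotted):
    """Convert a dotted-quad string to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(dotted))

# (low, high) bounds of the RFC 1918 private ranges, as integers.
PRIVATE_RANGES = [
    (ip_to_int("10.0.0.0"), ip_to_int("10.255.255.255")),
    (ip_to_int("172.16.0.0"), ip_to_int("172.31.255.255")),
    (ip_to_int("192.168.0.0"), ip_to_int("192.168.255.255")),
]

def is_private(x):
    """True if integer address x falls inside any private range."""
    return any(low <= x <= high for low, high in PRIVATE_RANGES)

print(is_private(ip_to_int("192.168.1.1")), is_private(ip_to_int("8.8.8.8")))
```

Naming the bounds via `ip_to_int` keeps the comparison in integer space without scattering magic numbers through the code.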

tghw · on Oct 7, 2009

Or, much easier for anyone else reading your code:

  ip.startswith('192.168.') or ip.startswith('10.')

No, it's not as fast, but are you really doing that many is_ip_private() calls?

tptacek · on Oct 7, 2009

That works for /16's and /8's. Now do it for /27.

ars · on Oct 7, 2009

Grab the last octet, and see if it's <= 30

Personally I store a string until 128bit math becomes easier (so I can handle IPv6). But usually I just want to log it, not check netmasks.

tptacek · on Oct 7, 2009

There are, what, 8 /27s in a /24? How does seeing if the last octet is less than 30 help you?
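For reference, a minimal sketch of a /27 membership test on integer addresses. The function name `in_prefix` and the sample addresses are assumptions for illustration.

```python
import ipaddress

def in_prefix(addr, network, prefix_len):
    """True if 32-bit integer addr lies inside network/prefix_len."""
    mask = (0xFFFFFFFF << (32 - prefix_len)) & 0xFFFFFFFF
    return (addr & mask) == (network & mask)

def to_int(dotted):
    return int(ipaddress.IPv4Address(dotted))

net = to_int("192.168.1.32")  # the /27 covering .32 through .63

inside = in_prefix(to_int("192.168.1.40"), net, 27)
outside_high = in_prefix(to_int("192.168.1.64"), net, 27)
# Last octet is <= 30, yet this address is in a different /27.
outside_low = in_prefix(to_int("192.168.1.20"), net, 27)

# A /24 splits into eight /27 subnets.
subnets = list(ipaddress.ip_network("192.168.1.0/24").subnets(new_prefix=27))

print(inside, outside_high, outside_low, len(subnets))
```

The mask-and-compare works for any prefix length, which is where a string prefix test or a single-octet comparison falls short.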

Confusion · on Oct 7, 2009

This explanation is clear to everyone that already understands it and will be inexplicable to anyone that doesn't. Foremost, he should explain that ipv4 addresses simply are 4 byte numbers and that www.xxx.yyy.zzz is just a human readable presentation. Then it is immediately clear that the latter isn't necessarily the most common way to store the datum.

walesmd · on Oct 7, 2009

In a project here at work we are storing IP Addresses in both string format and in integer format (primarily, so we can sort the addresses intelligently). By sorting on the integer column, yet displaying the string column, you get the result set in the order that makes the most sense.
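A quick sketch of why the integer column sorts better than the string column. The sample addresses are made up for illustration.

```python
import ipaddress

addrs = ["10.0.0.10", "10.0.0.9", "192.168.1.1", "9.255.255.255"]

# Lexicographic order: compares character by character.
by_string = sorted(addrs)

# Numeric order: sort on the integer, display the string.
by_int = sorted(addrs, key=lambda s: int(ipaddress.IPv4Address(s)))

print(by_string)
print(by_int)
```

String order puts 10.0.0.10 before 10.0.0.9 and 9.255.255.255 after 192.168.1.1; integer order gives the sequence a person would expect.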

albertsun · on Oct 7, 2009

How often would this actually be worth it? My hunch is that the computational time involved in packing and unpacking IP addresses into integers is more valuable than the space saved by storing them as integers.

audidude · on Oct 7, 2009

Your hunch would be wrong.

  #include <stdio.h>

  int
  main (int   argc,
        char *argv[])
  {
    unsigned int ip = 3232235777u;  /* 192.168.1.1 */
  
    /* %u, not %d: the masked-and-shifted operands are unsigned */
    printf ("%u.%u.%u.%u\n",
            (ip & 0xFF000000) >> 24,
            (ip & 0x00FF0000) >> 16,
            (ip & 0x0000FF00) >> 8,
            (ip & 0x000000FF));
  
    return 0;
  }

tptacek · on Oct 7, 2009

What does this code demonstrate, other than that you're unconcerned with endianness?

Converting "192.168.1.1" to an integer in Ruby involved creating multiple array, string, and integer objects, not to mention several multiply-indirected method calls.

haberman · on Oct 7, 2009

This code is not endian-dependent. You can only observe endianness when you address the same memory as two different types:

    int x = 5;
    char *chp = (char*)&x;
    printf("This value is endian-dependent: %hhd\n", *chp);

tptacek · on Oct 7, 2009

The code isn't endian-dependent because it doesn't do anything. The only thing you can do with "3232235777" on a little-endian machine is compare it to another number to see if it's also "3232235777".

If you're going to store IP addresses as 32 bit integers, or work with them that way in your C code, 192.168.1.1 should be "16885952", so you can do > and <.
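The two numbers in this exchange are the same four octets read in two byte orders. A small sketch with Python's `struct` module shows both readings; the second address, 192.168.2.0, is an added example used only to check ordering.

```python
import struct

def read_both(a, b, c, d):
    """Return (big-endian value, little-endian value) of four octets."""
    raw = bytes([a, b, c, d])
    return struct.unpack(">I", raw)[0], struct.unpack("<I", raw)[0]

big_1, little_1 = read_both(192, 168, 1, 1)
big_2, little_2 = read_both(192, 168, 2, 0)

# Does numeric order match address order (192.168.1.1 < 192.168.2.0)?
big_preserves_order = big_1 < big_2
little_preserves_order = little_1 < little_2

print(big_1, little_1, big_preserves_order, little_preserves_order)
```

In this sketch, only the value with the first octet as the most significant byte compares in the same order as the addresses themselves.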

But your point is well taken, and audidude, I'm sorry for being such an asshole in my comment.

audidude · on Oct 7, 2009

ip addresses are always stored in network-byte-order

tptacek · on Oct 7, 2009

Your ALU doesn't care what the RFC says. What's the point of storing addresses in binary if you can't do math on them? There is no point, is the point.

jbyers · on Oct 7, 2009

The computational time to pack and unpack an IP to an integer is vanishingly small. My old MacBook Pro does 500,000 per second of the corresponding PHP function.

The difference between an integer and a fifteen byte string is eleven bytes. Our database has a few hundred-million row tables (barely considered big by today's standards) that store IPs. Storing IPs as integers saves us a GB per hundred million rows in addition to a substantial index size reduction.

Your application may not need to store that much data, but it's my experience that tables with IPs are the ones that tend to get big. :)

potatolicious · on Oct 7, 2009

Storage is cheap - the primary win here is computation time.

if(ip1 == ip2) is a lot faster as ints than strings.

tptacek · on Oct 7, 2009

Seeing as how the largest dotted-quad IP address fits inside rax:rdx on a modern CPU, and that two of them fit in a single cache line, I'm guessing x == y, while faster, is not "much" faster with integers than with strings.

I wouldn't populate an address trie with strings, but I also wouldn't give a second thought to passing them around a random C program as charstars either.

potatolicious · on Oct 7, 2009

Most string libs aren't that smart though - string comparisons are still byte by byte. You're now comparing something 15 times instead of 1. You can do an int32/int64 (depending on architecture) compare in a single op.

The point I guess is, you can keep 'em around as charstars, but eventually you'll have to do this cast to compare them...

tptacek · on Oct 7, 2009

All memcmp's are this smart. But that's kind of beside the point, right? 1 time, 14 times, if we're talking about L1 cache, we're really epsilon from pure reg/reg ALU operations, implementing effectively constant-time algorithms.

I agree, int32 is faster. Like I said, it's just not "much" faster.

sunkencity · on Oct 7, 2009

But what if I want to count all the 10.x requests, not sure I can bitshift in an SQL query.

cema · on Oct 7, 2009

  select count(*) from ipTable
  where ip >= 167772160 and ip < 184549376

An IP is four digits of a base-256 integer. You are looking for IPs whose first digit is 10. So, the value boundaries are: 167772160 = 256 * 256 * 256 * 10; 184549376 = 256 * 256 * 256 * 11.
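The base-256 arithmetic can be checked with a few lines. The helper name `octets_to_int` is hypothetical.

```python
def octets_to_int(a, b, c, d):
    """Treat four octets as digits of a base-256 number."""
    return ((a * 256 + b) * 256 + c) * 256 + d

lower = octets_to_int(10, 0, 0, 0)   # first address in 10.x.x.x
upper = octets_to_int(11, 0, 0, 0)   # first address past 10.x.x.x

# The last 10.x address sits just below the exclusive upper bound.
last_in_range = octets_to_int(10, 255, 255, 255)

print(lower, upper, last_in_range)
```

Using a half-open range (`>= lower and < upper`) avoids having to spell out 10.255.255.255 as a constant.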

there · on Oct 7, 2009

i've written ip address database management systems that did everything in integers not necessarily because of the size benefits, but just because it's easier to sort ips stored as integers, do addition/subtraction easily when they cross network boundaries (10.10.10.254 + 6 is what?), and do cidr calculations.

if nothing else, storing ips in a sql database as integers will make searches on their indexes faster.
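To answer the parenthetical, a minimal sketch with Python's `ipaddress` module, whose address objects support integer addition:

```python
import ipaddress

start = ipaddress.IPv4Address("10.10.10.254")

# Adding an integer carries across the octet boundary automatically.
result = start + 6

print(result)
```

The carry out of the last octet (254 + 6 = 260) bumps the third octet, which is the kind of arithmetic that is awkward to do on the dotted string directly.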

tptacek · on Oct 7, 2009

Last time I looked, Ruby's IPAddr code had a horrible DNS dependence.