HD Moore (creator of Metasploit) gave a talk about doing something like this. He scanned all of IPv4, hitting common TCP and UDP ports, and collected data about the services running.
He found a lot of cool stuff. For instance, there are apparently 7 Windows NT 3.51 boxes with SNMP open sitting on the internet, and about 300 IIS servers that give out the same session ID cookie when you log on.
That's actually quite interesting! Thanks for the link; it's always awesome to hear talks by the people "behind the magic."
Also, I thought this quote on the page you linked was pretty humorous: "[HD Moore] is the Chief Architect of Metasploit, a popular video game designed to simulate network attacks against a fictitious globally connected network dubbed “the internet”."
They recorded the whole response: "...and the second one simply writes down the response packets received." They also randomized the order in which they scanned, so they might have needed to keep track of already-scanned addresses: "Although we have sent just a single packet per IP, we mixed the scans to prevent a network receiving a high number of consecutive packets."
You can use an LFSR[0] (it's deterministic and needs minimal state) to efficiently convert a sequential power-of-two range [0, 2^n - 1] into one that appears to "randomly" walk around.
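A minimal sketch of that trick in Python, using the textbook 16-bit maximal-length Galois tap mask (0xB400); covering the full IPv4 space would need a 32-bit primitive polynomial instead:

    def lfsr_sequence(taps=0xB400, seed=1):
        # Galois LFSR: with taps for a primitive polynomial (0xB400 is the
        # standard maximal-length choice for 16 bits), this visits every value
        # in [1, 2^16 - 1] exactly once, in a scrambled order, using only a
        # couple of integers of state.  Zero is never produced and would need
        # separate handling.
        state = seed
        while True:
            yield state
            lsb = state & 1
            state >>= 1
            if lsb:
                state ^= taps
            if state == seed:
                return

    order = list(lfsr_sequence())
    assert len(order) == 2 ** 16 - 1
    assert len(set(order)) == len(order)   # every non-zero 16-bit value, once each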
This reminds me of a research project that was done at my university a few years back[1], except they were specifically scanning for web servers. Out of the 3,690,921,984 addresses they scanned, 18,560,257 had web servers running on port 80[2].
Based on Shodan there are at least 87,494,410 web servers [1] on port 80 now. For port 443 there are currently 14,918,407 servers listed [2], and for port 8080 there are 7,570,586 [3].
I did a ping "scatter" scan of a random set of IP addresses and then mapped them into 2D space, with the boxes representing the ping times. I thought it looked cool in the end :) Myspace low-res'd it and I can't find the original, but if there's any interest I can keep looking.
As someone who works with this, I would like to know how they can be sure their results are reliable. Just starting a sender thread and a receiver thread simply won't do: at that rate congestion happens, and we start seeing packet loss. With a stateless approach, the only thing you can do to prevent this is arbitrarily slow down the rate at which packets are sent, and with that approach it is going to take far longer than ten hours to scan from one location.
It works perfectly well as long as your results are not used for anything important, I guess. But if you have customers who need reliable results, this naïve approach simply doesn't cut it, in my experience.
During the scan we monitored the bandwidth, and we used control pings to check the whole time that the server could still send and receive. We added some monitoring and slowed things down when needed. Sure, it was not perfect, but reliability was considered during the experiment.
The problem isn't sending packets fast enough; it's not about bandwidth. The problem is sending them just fast enough, which is impossible if you're scanning statelessly with just ICMP echoes.
Let's say you're on 100 Mbit/s Ethernet but your uplink is only 8 Mbit/s. If you send packets at 10 Mbit/s, packet loss will happen, and since you're not the only one using the network, it can happen well before that. And that's only the part of the network you control; there might be a lot of hops between you and the host you're sending packets to. With your approach (the way I understand it) you're not going to notice that packet loss.
I might be making too many assumptions here, but ten hours is just too short a time period for a network of that size to give a reliable result. I'm very sceptical. But please prove me wrong, because it will definitely make my job easier.
I guess you could publish the code, so I could test it myself.
What if you did a much slower (i.e., reliable) scan of a small sample? Then you could compute the probability of false negatives in the fast search and get a much more accurate count.
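A rough illustration of that correction, with entirely made-up numbers (nothing here comes from the article):

    def corrected_host_count(fast_total, slow_sample_hits, fast_sample_hits):
        # If a careful slow scan of a random sample finds slow_sample_hits live
        # hosts, and the fast scan had found fast_sample_hits of those same
        # hosts, the fast scan's detection rate is roughly their ratio, and its
        # global count can be scaled up by that factor.
        detection_rate = fast_sample_hits / slow_sample_hits
        return round(fast_total / detection_rate)

    # Made-up example: fast scan saw 420M responders; a slow scan of a sample
    # found 10,000 live hosts, of which the fast scan had only seen 9,100.
    print(corrected_host_count(420_000_000, 10_000, 9_100))   # ~461.5 million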
I guess it depends on what kind of results you want. It's possible to scan the public IPv4 address space reliably, but it requires a bit more effort than just sending out packets to see what you get back over a relatively short time frame.
You can split the address space up across several different scanners on different physical links. You can estimate the RTT to a network segment you're scanning and base your timings on that. Probing with TCP packets can yield better results than ICMP packets for this type of activity. There are so many variables involved.
Build a tool that lets you send ICMP packets at a fixed rate (preferably in the kernel, or even without an OS at all if you're into that; getting precise timings in user land is hard), or just a tool that sleeps between packets, with the option of not sleeping at all. It's an educational experience. Scan a relatively small range of addresses bound to hosts on the other side of the world at different speeds and see the difference in results. Maybe there's already a good tool for that.
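A minimal sketch of the sleep-between-packets variant in Python; the rate, identifier, and placeholder targets are mine, it needs root for the raw socket, and the user-land sleep pacing is exactly as coarse as noted above:

    import socket
    import struct
    import time

    def icmp_checksum(data: bytes) -> int:
        # Ones'-complement sum over 16-bit words (RFC 1071).
        if len(data) % 2:
            data += b"\x00"
        total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
        total = (total >> 16) + (total & 0xFFFF)
        total += total >> 16
        return ~total & 0xFFFF

    def echo_request(ident: int, seq: int) -> bytes:
        # ICMP echo request (type 8, code 0) with the checksum filled in.
        payload = b"rate-test"
        header = struct.pack("!BBHHH", 8, 0, 0, ident, seq)
        csum = icmp_checksum(header + payload)
        return struct.pack("!BBHHH", 8, 0, csum, ident, seq) + payload

    def paced_ping(targets, pps=100):
        # Send one echo request per target at roughly `pps` packets/second.
        # Requires root for the raw socket; sleep-based pacing is coarse.
        sock = socket.socket(socket.AF_INET, socket.SOCK_RAW,
                             socket.getprotobyname("icmp"))
        interval = 1.0 / pps
        next_send = time.monotonic()
        for seq, ip in enumerate(targets):
            sock.sendto(echo_request(0x1234, seq & 0xFFFF), (ip, 0))
            next_send += interval
            delay = next_send - time.monotonic()
            if delay > 0:
                time.sleep(delay)

    # e.g. 50 pps against the TEST-NET-1 documentation range (placeholder targets):
    # paced_ping(["192.0.2.%d" % i for i in range(1, 101)], pps=50)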
Whenever I read about "We've scanned/product X can scan the internet in X hours" I'm very sceptical. Unless the results are verifiable in some way (which is hard to guess/estimate for such a large sample) or the approach they took seems like a sane one (very subjective I guess), I assume they don't know what they're doing. The reason I assume this is because I've been there myself.
2^32 pings in 10 hours is ~120kHz. So I imagine that the complaining companies were the ones assigned /8 blocks (IBM, Apple, Ford etc.). They probably noticed the 500 pings/second aimed at their address space.
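For reference, the arithmetic:

    pps = 2 ** 32 / (10 * 3600)    # ~119,000 probes per second overall (~120 kHz)
    per_slash8 = pps / 256         # a /8 is 1/256 of the space -> roughly 470/second
    print(round(pps), round(per_slash8))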
I'd love to know the story behind the three responses they received from 10.0.0.0/8.
I am also curious about this: "With the extracted data more interesting analysis can be done,...such as the issue with network and broadcast addresses (.0 and .255)." Why would responses from .0 or .255 be an issue? My cable modem sits on a /20, and there are a number of valid IP addresses ending in .0 or .255 in a range that size.
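For instance, with Python's ipaddress module and a stand-in /20 (the commenter's actual range isn't given):

    import ipaddress

    # Only the very first and very last addresses of the block are special.
    net = ipaddress.ip_network("10.1.16.0/20")
    hosts = set(net.hosts())                              # excludes network + broadcast only

    print(net.network_address, net.broadcast_address)     # 10.1.16.0 10.1.31.255
    print(ipaddress.ip_address("10.1.17.0") in hosts)     # True: ends in .0, still a host
    print(ipaddress.ip_address("10.1.17.255") in hosts)   # True: ends in .255, still a host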
Yes, if you're using ranges larger than /24, addresses ending in .0 and .255 are valid host addresses; however, because some people making network equipment are dumb, you will have reduced reachability compared to an address not ending in .0 or .255.
Re: responses from 10/8: they may have some connectivity to local 10/8 resources, or it's possible someone was sending them spoofed ping responses and the network path they're on doesn't do proper ingress filtering (many don't).
Some consumer routers filter traffic to/from addresses ending in .0 or .255 in a naive effort to prevent smurf attacks.
I've seen ISPs that use 10.x.x.x addresses internally, so a tracepath from the local network would show an intermediate router with a 10.x.x.x address.
I'm aware of the significance of the netblock; that is why I brought the issue up. Why do you think they could not have conducted the test from a machine sitting behind a NAT box?
But is this because only 7% of all IP addresses are allocated to real, live hosts, or (more likely, given the number of 0% counts) because large swathes of the Internet just routinely ignore ICMP and drop it on the carpet?
This has been complained about for years, and it means the most obvious approach to diagnosing problems fails 93% of the time...
I remember throwing together a super basic Python script that would connect to random IP addresses on port 80 and see which ones sent back a response, basically an attempt to find random web pages by IP address. It seemed like most of my hits were router status pages or Apache server responses.
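Something along those lines can be sketched in Python; the timeout, sample size, and crude reserved-range filter here are my own choices, not the original script's:

    import random
    import socket

    def random_public_ip():
        # Pick a random IPv4 address; crude filter for a few reserved ranges
        # (not exhaustive, just enough for a toy scan like the one described).
        while True:
            octets = [random.randint(1, 223)] + [random.randint(0, 255) for _ in range(3)]
            if octets[0] in (10, 127) or (octets[0] == 192 and octets[1] == 168):
                continue
            return ".".join(map(str, octets))

    def probe(ip, timeout=1.0):
        # Try a TCP connect to port 80 and grab the first chunk of any HTTP reply.
        try:
            with socket.create_connection((ip, 80), timeout=timeout) as s:
                s.sendall(f"HEAD / HTTP/1.0\r\nHost: {ip}\r\n\r\n".encode())
                return s.recv(256)
        except OSError:
            return None

    for _ in range(100):
        ip = random_public_ip()
        banner = probe(ip)
        if banner:
            print(ip, banner.splitlines()[0])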
10 hours is rather impressive! I am also working on a similar project and always wonder who else may be doing this, so I can learn from their methods. Not many of these have been published, though. This is also a good reminder that the Internet is not that vast after all.
I mean the internet address space for IPv4 is now so tiny relative to our computing resources that visualizing and interpreting the data is fairly easy.
Of course storing a response packet for every IPv6 address might cost slightly more on S3.
Interesting stuff. The data would be a lot easier to read if they only reported a couple of significant figures, though. Having nine decimal digits actually obscures easy interpretation by humans.
The article says "After 10 hours", so I'm assuming that's how long it took (that took me a while to find, also). Actually pretty amazing that they were able to do it so quickly.
To ping the entire internet in 10 hours you would generate approximately 60 Mbps of ICMP traffic.
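That figure works out if you assume echo requests of roughly 64 bytes on the wire (the packet size is my assumption, not stated in the comment):

    pps = 2 ** 32 / (10 * 3600)    # ~119,000 packets per second
    mbps = pps * 64 * 8 / 1e6      # ~61 Mbit/s of outbound ICMP
    print(round(mbps, 1))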
The site is down for me, but I assume they used multiple machines to do this. I SYN scanned about 70% of the globally routed prefixes last month and it took a little over 4 days from a single box (but I was doing some detailed packet captures that hurt disk IO).
The visualizations are also really nice. The talk is here: http://www.irongeek.com/i.php?page=videos/derbycon2/1-1-2-hd...