
This demonstrates the dangers of loose path resolution rules.

Traditionally, consecutive slashes in a path name are treated as equivalent to a single slash, presumably to simplify apps that need to join two path fragments -- they can safely just concatenate rather than call a library function like path.join().

Unfortunately, this makes it much harder to write code that blacklists certain paths, as robots.txt is designed to do. Clearly, Google's implementation of robots.txt filtering does not canonicalize double-slashes, and so it thinks //search is different from /search and only /search is blacklisted.
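A minimal sketch of that mismatch, assuming a filter that does plain prefix matching against a disallow list. The rule set and function names are hypothetical, and this is not how Google's crawler is actually implemented:

```python
import re

# Hypothetical disallow list, standing in for one robots.txt rule.
DISALLOWED_PREFIXES = ["/search"]

def collapse_slashes(path: str) -> str:
    """Replace every run of two or more slashes with a single slash."""
    return re.sub(r"/{2,}", "/", path)

def is_blocked_naive(path: str) -> bool:
    """Prefix match on the raw path, with no canonicalization."""
    return any(path.startswith(prefix) for prefix in DISALLOWED_PREFIXES)

def is_blocked_canonical(path: str) -> bool:
    """Prefix match after collapsing consecutive slashes."""
    return is_blocked_naive(collapse_slashes(path))

if __name__ == "__main__":
    for candidate in ["/search", "//search"]:
        print(candidate, is_blocked_naive(candidate), is_blocked_canonical(candidate))
```

Whether collapsing is the right choice depends on whether the server itself treats //search and /search as the same resource.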

My wacky opinion: Path strings are an abomination. We should be passing around string lists, e.g. ["foo", "bar", "baz"] instead of "foo/bar/baz". You can use something like slashes as an easy way for users to input a path, but the parsing should happen at point of input, and then all code beyond that should be dealing with lists of strings. Then a lot of these bugs kind of go away, and a lot of path manipulation code becomes much easier to write.
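A rough sketch of that idea, assuming the parser drops empty segments at the point of input. That is one possible convention rather than the only one, and all function names here are made up for illustration:

```python
from typing import Tuple

Path = Tuple[str, ...]

def parse_path(raw: str) -> Path:
    """Parse user input once; empty segments are dropped by convention."""
    return tuple(segment for segment in raw.split("/") if segment)

def join(left: Path, right: Path) -> Path:
    """Joining two paths is plain concatenation; no separator handling."""
    return left + right

def starts_with(path: Path, prefix: Path) -> bool:
    """Segment-wise prefix test, so "foobar" never matches prefix "foo"."""
    return path[:len(prefix)] == prefix

if __name__ == "__main__":
    blocked = parse_path("/search")
    print(starts_with(parse_path("//search/results"), blocked))
    print(join(parse_path("foo/bar"), parse_path("baz")))
```

After parsing, nothing downstream has to think about separators at all.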




We should be passing around string lists, e.g. ["foo", "bar", "baz"] instead of "foo/bar/baz".

But that doesn't in and of itself solve the problem, because "foo/bar//baz" would map to ["foo", "bar", "", "baz"] without any additional convention.
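To illustrate with Python's built-in str.split (the two helper names are hypothetical): the two conventions disagree on whether the doubled slash is significant.

```python
def split_keep_empty(raw: str) -> list:
    """Split on "/" and keep empty segments exactly as they appear."""
    return raw.split("/")

def split_drop_empty(raw: str) -> list:
    """Split on "/" and discard empty segments (the extra convention)."""
    return [segment for segment in raw.split("/") if segment]

if __name__ == "__main__":
    single, double = "foo/bar/baz", "foo/bar//baz"
    # Without the extra convention the two inputs stay distinct.
    print(split_keep_empty(single) == split_keep_empty(double))
    # With it, they compare equal.
    print(split_drop_empty(single) == split_drop_empty(double))
```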

This is actually not that unusual. This site does not treat two consecutive slashes as a single slash. There are likely other implementation differences.

Certainly in POSIX consecutive slashes count as one for file paths, but URL query strings are not file paths.
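Python's standard library happens to show the same split, as a small illustration (example.com is a placeholder host): posixpath.normpath collapses repeated slashes in a file path, except that it preserves exactly two leading slashes, while urllib.parse.urlsplit returns the URL path untouched.

```python
import posixpath
from urllib.parse import urlsplit

def normalized_file_path(path: str) -> str:
    """File-path view: POSIX-style normalization collapses repeated slashes."""
    return posixpath.normpath(path)

def url_path(url: str) -> str:
    """URL view: the path component is returned exactly as written."""
    return urlsplit(url).path

if __name__ == "__main__":
    print(normalized_file_path("foo/bar//baz"))
    print(normalized_file_path("//search"))
    print(url_path("https://example.com//search?q=1"))
```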


... "foo/bar//baz" would map to ["foo", "bar", "", "baz"] ...

No, I think it'd be more like proto://host/thing?foo&bar&baz (put an =1 on each of those if you like).

Yeah, I'm employing a convention, but so too is the concept of a list of strings that the commenter invoked.


Does the HTTP standard or robots.txt specification mandate the collapse of consecutive slashes, though? I agree that it is common, but if it is a server-side implementation detail, then a correct implementation of robots.txt should not collapse them, as they might mean different things to a particular server.


I agree. If there's a bug here, it's in the server which collapses slashes seen in request paths, not in the indexer's interpretation of robots.txt.



