When I designed Cloudflare's filters, a.k.a. Firewall Rules, we had an API written in Go and the edge written in Lua. The risk I foresaw was that building and testing URL-matching rules in one language (Go), then deploying and executing them in another (Lua), would let differences in behaviour between the two engines (no matter how faithfully both implemented the same spec) produce a rule that behaved ever so slightly differently at the edge. For me that was a huge hole: if it were known, it would be exploited.
The first piece of Rust in Cloudflare was introduced for this reason and for this project. We needed a single piece of code to implement the filters that could be executed from Lua, and that Go could call too. It needed to ensure that only a single piece of code was responsible for things like UPPER(), LOWER(), URL_PARSE() precisely to prevent any differences being leveraged.
Yup, and the filters described above replaced the PCRE regex matching in Lua.
That was already planned at the time of this incident, but the incident accelerated the pace at which the new system was put in place, and it introduced controls and process for the release of WAF rules (regardless of which engine they were applied in).
It uses Wireshark display filter syntax to match traffic at any part of the stack.
It's still used internally, but clearly they stopped updating the OSS version just after I left. When I left, they were extending it to be fully compatible with everything OWASP, which meant additions beyond the Wireshark definition of display filters. Everything firewall and WAF was moving to the filters, and other traffic matching features were too.
Thank you for releasing it! I adopted wirefilter for a firewall rule testing project, firewalker [1]. But indeed, I wish Cloudflare kept maintaining its OSS version.
Unfortunately Cloudflare has a poor OSS track record: either not maintaining the public version, or promising to open things that then don't see the light of day (the Quicksilver database and replication, and their Rust reverse proxy and nginx/Lua replacement, both of which were announced but never released).
Most of what is OSS at Cloudflare came in from elsewhere (V8) or was needed for collaboration (IETF), rather than being started at Cloudflare and then opened.
Lots of reasons that on their own were not enough to decide but together made it compelling.
We weren't going to use C (Cloudbleed). cgo didn't give us as much control over the FFI as we wanted and produced slightly harder-to-read output (maintainability, for me, is everything: always write code that the drunk, 2am future version of yourself can understand when you're paged). We didn't necessarily want another garbage-collected process, despite being good at this (Cloudflare's DNS server is written in Go). We wanted memory safety and wanted to keep things small (Go or Rust stand out). We wanted compile-time safety over runtime errors. We believed that Rust macros would help make the filter code more readable and maintainable than rolling our own parser in Go or using YACC (see the earlier point about your future self maintaining this at an ungodly hour). No single thing was the determining factor, but the reasons accrued until it felt overwhelming.
We certainly were not looking to add another language at the time. Being first to do so incurs pushback from the org, the work of adding it to all of the build and release chains, and the typical higher bar for proving you know what you're doing. Once it's done, everyone benefits from having the option of another language and being able to select the right tool for the job, but going first incurs a legitimate cost.
A student and I have been using coverage-guided grammar-aware differential fuzzing to discover bugs in URL parsers for a while now. There is extreme variation in this space; it's trivial to turn up meaningful bugs in widely-used URL parsers.
".://" is a particularly egregious example. (and, by the same principle, "evil.com://good.com")
- Python 3.6's urllib.parse sees the "." as the URL's scheme, and an empty authority.
- Python 3.11's urllib.parse sees the entire ".://" as the URL's path.
- urllib3.util.parse_url sees the "." as the URL's hostname, the ":" as the separator for an empty port number, and the "//" as the path. (this is one of the most downloaded packages on PyPI)
- Boost::URL rejects the URL outright.
If you're going by RFC 3986, then only Boost::URL is exhibiting the correct behavior.
If you're going by the WHATWG URL standard, then I don't know which one of these behaviors (if any) is correct.
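To make the "differential" part concrete, here is a minimal sketch of the comparison step only. It is not coverage-guided or grammar-aware, and it is not the tooling described above. `parse_naive` is a hypothetical hand-rolled parser invented for illustration; `parse_stdlib` wraps Python's `urllib.parse.urlsplit`. The corpus deliberately leaves out ".://" because, as noted above, the standard library's answer for it depends on the Python version.

```python
from urllib.parse import urlsplit


def parse_stdlib(url):
    """Parser A: Python's standard library."""
    parts = urlsplit(url)
    return (parts.scheme, parts.hostname or "")


def parse_naive(url):
    """Parser B: a hypothetical hand-rolled parser (illustration only)."""
    scheme, sep, rest = url.partition("://")
    if not sep:
        return ("", "")
    host = rest.split("/", 1)[0]
    return (scheme.lower(), host.lower())


def disagreements(inputs):
    """Return the inputs on which the two parsers report different results."""
    return [u for u in inputs if parse_stdlib(u) != parse_naive(u)]


corpus = [
    "http://example.com/a",
    "evil.com://good.com",
    "http://user@example.com/",
]

# Only the userinfo case differs: the naive parser treats
# "user@example.com" as the host.
print(disagreements(corpus))  # ['http://user@example.com/']
```

A real fuzzer would generate the corpus instead of hard-coding it, but the core check is the same: any input in the returned list is one that two components of the same system could interpret differently.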
If you're interested in collaborating on this project, please send me email. My address is in the footer at https://kallus.org
I recently published bindings for ada (an implementation of the WHATWG URL Spec) for Python with the hope of having something that follows a single standard.
Indeed, ".://" is a hard error under the WHATWG URL spec. If the URL doesn't start with an ASCII alpha character, then the scheme start state transitions to the no scheme state [0]. In that state, if there's no base URL that the input is relative to, then parsing must fail [1].
However, "evil.com://good.com" is a valid URL string per WHATWG, since its state machine accepts "." within the scheme after the first codepoint. The resulting URL object has a scheme of "evil.com", a host of "good.com", an empty path, and a null port, query, and fragment.
It’s not fair to call it a hard error: it’s only invalid as an absolute URL. As a relative URL, it’s fine, just like “example.com” is invalid as an absolute URL but valid as a relative URL.
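The scheme handling described above can be sketched roughly as follows. This is a simplified illustration covering only the "scheme start" and "scheme" states; it ignores base URLs, state overrides, and the input cleanup the WHATWG algorithm does first. `scan_scheme` is a name made up for this sketch, not part of any library.

```python
import string

ASCII_ALPHA = set(string.ascii_letters)
SCHEME_CHARS = ASCII_ALPHA | set(string.digits) | set("+-.")


def scan_scheme(url):
    """Return the lowercased scheme, or None if the input has no scheme.

    None corresponds to falling into the "no scheme" state, which is a
    failure only when there is no base URL to resolve against.
    """
    if not url or url[0] not in ASCII_ALPHA:
        return None  # first code point must be an ASCII letter
    for i, c in enumerate(url):
        if c == ":":
            return url[:i].lower()
        if c not in SCHEME_CHARS:
            return None  # not a scheme after all
    return None  # ran out of input before reaching ":"


print(scan_scheme(".://"))                 # None
print(scan_scheme("evil.com://good.com"))  # evil.com
print(scan_scheme("example.com"))          # None
```

The last case matches the point about relative URLs: "example.com" has no scheme, so whether it parses at all depends on whether a base URL is supplied.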
They used to do the protocol, username, password, host, pathname, etc. But scammers used it to have a user name that looked like a well-known URL, while actually directing the user to a domain under the scammer's control. Not honoring the spec was therefore a security feature.
Which means if you're presented with that and don't send it to that host, you've violated the RFC. But this was demonstrably resulting in confused users being sent to scammers' domains.
That ("if you're presented with that and don't send it to that host, you've violated the RFC") has nothing to do with the comment you responded to, which described a UI choice compatible with RFC 3986, which says "Applications should not render as clear text any data after the first colon (":") character found within a userinfo subcomponent" (and which also goes on to say "Applications may choose to ignore or reject such data when it is received as part of a reference").
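For illustration, here is what the spoofing trick looks like to a parser, together with one possible display policy (dropping userinfo entirely before showing the URL). The URL is made up, `display_form` is a hypothetical helper, and the sketch does not re-bracket IPv6 hosts. It uses Python's `urllib.parse`, not any browser's actual code.

```python
from urllib.parse import urlsplit


def display_form(url):
    """Return the URL with any userinfo removed, for showing to a user."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    return parts._replace(netloc=host).geturl()


url = "https://www.bank.example:secret@evil.example/login"
parts = urlsplit(url)

print(parts.username)     # www.bank.example  (looks like a host, is not)
print(parts.hostname)     # evil.example      (where the request really goes)
print(display_form(url))  # https://evil.example/login
```

The request still goes to the host named in the URL either way; the only thing that changes is what the user is shown.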
URL parsing semantics are defined by and dependent upon the scheme. (That's spec.) By definition, if you don't recognize the scheme, you cannot guarantee a correctly parsed URL. From RFC 1738:
"The Internet Assigned Numbers Authority (IANA) will maintain a registry of URL schemes. Any submission of a new URL scheme must include a definition of an algorithm for accessing of resources within that scheme and the syntax for representing such a scheme."
The behavior described (extracting the scheme and treating the rest as an opaque string) is pretty much the only thing you can do when the scheme is unrecognized. (The other options being to throw an exception or return null.)
Based on the description, it sounds like neither are breaking spec—it's just that Node supports "postgres". That is, unless it's true that Node's URL implementation is supposed to match what browsers do, in which case Node is breaking spec—its own.
RFC 3986 provides a generic URI grammar that is not scheme-specific, though other standards that define URL schemes may choose to subset that grammar as they see fit. If a URL parser does not recognize a scheme, I would expect it to parse the URL using the generic parsing procedure.
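As a sketch of what generic parsing can look like, the regular expression below is the one given in RFC 3986 Appendix B for breaking a URI reference into its components. The `split_generic` wrapper and the example URL are made up for illustration. Note that this regex only splits; it does not validate against the ABNF.

```python
import re

# Regular expression from RFC 3986, Appendix B.
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_generic(uri):
    """Split a URI reference into the five generic components."""
    m = URI_REFERENCE.match(uri)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


print(split_generic("postgres://user:pw@db.example:5432/mydb?sslmode=require"))
# {'scheme': 'postgres', 'authority': 'user:pw@db.example:5432',
#  'path': '/mydb', 'query': 'sslmode=require', 'fragment': None}
```

No knowledge of the "postgres" scheme is needed to get this far. The flip side is that the same regex happily splits ".://" into a scheme of "." and an empty authority, even though the RFC's grammar requires a scheme to start with a letter, so a splitter like this still needs a validation step on top.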
When I worked at one of the tech giants that develops a certain suite of office apps, the C++ class they used to model URLs in the cross platform layer had a flaw around escaped characters. I worked in the group that developed the iOS/macOS versions of the apps and we had a number of kludges in place to deal with conversion between these C++ classes and NSURL and CFURL. In my time on different occasions devs discovered the flaws and composed lengthy emails explaining the fixes that needed to be made, but the C++ class was too entrenched to be fixed. That was about 5 years ago, I doubt it has changed.
I remember having some difficulty with characters for HTML URLs on the clipboard (for images) that happened only on the Mac version of a certain office app, but not the Windows version.
I recently had a wild ride with using java.net.URL and java.net.URI, to parse and deconstruct URIs so they can be stored in a database with at least a little normalization.
The documentation page is great but the API is bizarre. So many different constructors with different permutations of whether or not they expect various segments to be URL-encoded.
If nothing else, the complexity convinced me not to write my own!
One of the weirdest things about Java's URL API is this:
equals: "Two hosts are considered equivalent if both host names can be resolved into the same IP addresses; else if either host name can't be resolved, the host names must be equal without regard to case; or both host names equal to null."
If you're doing bulk operations on stored URLs, you may just be sending out tons of DNS resolutions without ever calling an obvious networking API (like openConnection). If your DNS is down or slow, equals() will block.
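For contrast, a comparison that never touches the network can be built from purely syntactic normalization. The sketch below is in Python rather than Java and is not how any Java class behaves; `comparison_key` and `DEFAULT_PORTS` are names invented here, and the normalization is deliberately minimal (case of scheme and host, default ports, empty path).

```python
from urllib.parse import urlsplit

# Assumption: only these two schemes get default-port handling here.
DEFAULT_PORTS = {"http": 80, "https": 443}


def comparison_key(url):
    """Build a key for comparing or storing URLs without any DNS lookups."""
    parts = urlsplit(url)  # lowercases the scheme; .hostname is lowercased too
    port = parts.port
    if port == DEFAULT_PORTS.get(parts.scheme):
        port = None  # an explicit default port is the same as no port
    return (parts.scheme, parts.hostname, port, parts.path or "/", parts.query)


# Same resource, different spelling: equal keys.
print(comparison_key("HTTP://Example.COM:80/a") == comparison_key("http://example.com/a"))  # True

# Different host names are never merged, whatever they resolve to.
print(comparison_key("http://a.example/") == comparison_key("http://b.example/"))  # False
```

Because the key is a plain tuple of strings and an optional integer, bulk operations over stored URLs stay local and cannot block on a slow resolver.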
I'm sure most Java programmers here have learned this lesson, but new programmers get surprised by this every time.
So I'm certainly not a Java developer and I know very little about URLs, but isn't that a semantically surprising comparison as well? You can host wildly different websites on the same server/IP as a temporary or permanent arrangement, and I'm not sure I would expect that to make you group them together.
Learned, but completely forgot about this. What remained was a vague suspicion that perhaps the benefits of clear type separation that would be lost by using a plain string might not necessarily be worth it. (and then you eventually start rolling your own and then one day someone goes wild with the allowed characters liberties within the username:password prefix...)
To sum things up: never ever use java.net.URL. It's a bad HTTP client and it's worse at everything else you might expect it to be.
It doesn't really make sense to use URL today (it does not even support encoding). URI can be used to identify the resource, and then a more explicit API can be used for I/O instead of URL.openConnection().
It is worth mentioning that Java started supporting URIs even before RFC 2396 became a standard. And then RFC 3986 came and made incompatible changes, e.g. by moving the asterisk to the reserved characters (I could not find an explanation of why in the W3C email archive; someone asked this question at the draft stage, but it was not answered there).
One thing that drives me mad on the Python side of things is the way frameworks and middleware (even the WSGI spec) corrupt the most basic "quoting" mechanism of RFC 3986. All of these layers of trying to be clever or friendly subvert the correctness of the original concept. And these broken approaches become precedents that reinforce the culture of web breakage and compatibility.
Specifically, "reserved" characters have a different meaning from their URL quoted form. It is obvious that the intent of this was to allow their use as meta-syntax to separate parts of a URL that might then have the quoted form embedded in them. But every stupid middleware layer that tries to unquote _before_ parsing and routing is throwing away this information and making it impossible to know what was the original meta-syntax and what was embedded quoted material that might be a fragment derived from arbitrary, user-generated content.
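A small sketch of the information loss described above, using only `urllib.parse.unquote`. The path and the two function names are made up for illustration; the point is the order of the two operations, not any particular framework's code.

```python
from urllib.parse import unquote


def route_then_unquote(raw_path):
    """Split on the literal "/" first, then decode each segment."""
    return [unquote(seg) for seg in raw_path.split("/") if seg]


def unquote_then_route(raw_path):
    """Decode first, then split: quoted and literal "/" become identical."""
    return [seg for seg in unquote(raw_path).split("/") if seg]


raw = "/files/a%2Fb/meta"  # the second segment is the single name "a/b"

print(route_then_unquote(raw))  # ['files', 'a/b', 'meta']
print(unquote_then_route(raw))  # ['files', 'a', 'b', 'meta']
```

Once the second form has been produced, nothing downstream can tell that "a" and "b" were originally one segment, which is exactly the problem when that segment came from user-generated content.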
I think XXE would have been fine if it followed the same rules as AJAX does in web browsers. Although I guess it is a glaring example of XML not being sure what level of abstraction it's trying to be.
The real wtf is recursive entity expansion (billion laughs)