How I might approach it. Interested in feedback from people closer to the space.
First, split the load into simple asset-specific data streams with a front-end FPGA for raw speed. Resist the temptation to actually execute here, as the friction is too high for iteration, people, supply chain, etc. Input might be a FIX stream or similar; output is a series of asset-specific binary event streams fanned out along low-latency buses to asset-specific segments of a scalable cluster of low-end MCUs. Second, drop the general-purpose operating system assumption on the asset-specific MCU-based execution platform to enable faster turnaround, using low-level code you can actually find people to write on hardware you can actually purchase. Third, profit? In such a setup you'd need to monitor the overall state with a governor running on a general-purpose OS, which could pause or change strategies by reprogramming the individual elements as required.
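To make the MCU half of that concrete, here is a minimal bare-metal sketch of what one asset-specific execution element might look like: a packed event record DMA'd in by the FPGA, and a polling loop that reacts to it with no OS in the way. Every struct field, address, and threshold here is invented for illustration; the real wire format would be dictated by the FPGA side, and production code would need proper memory barriers and completion flags.

    #include <cstddef>
    #include <cstdint>

    // Hypothetical wire format for one asset-specific event from the FPGA.
    struct __attribute__((packed)) MarketEvent {
        uint64_t exchange_ts_ns;   // timestamp stamped by the FPGA
        uint32_t instrument_id;    // pre-filtered per segment, so usually constant
        int32_t  best_bid_ticks;
        int32_t  best_ask_ticks;
        uint32_t seq_no;           // written last by the FPGA to mark the slot complete
    };

    // Hypothetical addresses: an inbound DMA ring and an outbound order-entry FIFO.
    constexpr uintptr_t RING_BASE  = 0x20000000;
    constexpr uintptr_t TX_FIFO    = 0x20010000;
    constexpr size_t    RING_SLOTS = 1024;

    volatile MarketEvent* const ring = reinterpret_cast<volatile MarketEvent*>(RING_BASE);

    // Push a (price, qty) pair into the outbound FIFO register.
    inline void send_order(int32_t price_ticks, uint32_t qty) {
        *reinterpret_cast<volatile uint64_t*>(TX_FIFO) =
            (static_cast<uint64_t>(static_cast<uint32_t>(price_ticks)) << 32) | qty;
    }

    int main() {
        size_t   head = 0;
        uint32_t expected_seq = 1;            // FPGA starts numbering at 1 in this sketch
        for (;;) {                            // no OS, no scheduler: just spin on the ring
            if (ring[head].seq_no != expected_seq) continue;   // slot not written yet
            // Trivial illustrative "strategy": join the bid one tick back.
            send_order(ring[head].best_bid_ticks - 1, 100);
            head = (head + 1) % RING_SLOTS;
            ++expected_seq;
        }
    }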
Just how low are the latencies involved? At a certain point you're better off paying to get the hardware closer to the core than bothering with further engineering, right? I'd guess that depends heavily on the rules and the available DCs / link infrastructure offered by the exchanges or pools in question. I'd also guess a number of profitable operations don't disclose which pools they connect to and make a business of front-running, regulations or terms of service be damned. In such cases, the relative network latency between two points of execution matters more than the absolute latency to either one.
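For a rough sense of scale on the "pay to get closer" point: signals in fibre propagate at roughly 200,000 km/s, i.e. about 5 microseconds per kilometre one way, so distance converts to latency almost directly. A throwaway sketch of that arithmetic (the distances are arbitrary examples, not any venue's real figures):

    #include <cstdio>

    int main() {
        // Rule of thumb: ~200,000 km/s in fibre, i.e. ~5 us per km, one way.
        const double us_per_km = 1e6 / 200000.0;
        const double distances_km[] = {0.1, 1.0, 10.0, 50.0, 300.0};   // arbitrary examples
        for (double d : distances_km)
            std::printf("%7.1f km  ->  %8.1f us one-way in fibre\n", d, d * us_per_km);
        return 0;
    }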
The work I do is all in the single-to-low-double-digit microsecond range to give you an idea of timing constraints. I'm peripheral to HFT as a field though.
> First, split the load into simple asset-specific data streams with a front-end FPGA for raw speed. Resist the temptation to actually execute here, as the friction is too high for iteration, people, supply chain, etc.
This is largely incorrect, or more generously out-of-date, and it influences everything downstream of your explanation. Think of FPGAs as far more flexible GPUs and you're in the right arena. Input parsing and filtering are the obvious applications, but this is by no means the end state.
A wide variety of sanity checks and monitoring features are pushed to the FPGAs, along with fixed calculation tasks and output generation. It is possible for the entire stack for some (or most, or all) transactions to be implemented at the FPGA layer; for such transactions the timescales are mid-to-high hundreds of nanoseconds. The stacks I've seen with my own two eyeballs talked to supervision algorithms over PCIe (which themselves must be not-slow, but not in the same realm as the <10us work), but otherwise nothing crazy fancy. This is well covered in the older academic work on the subject [1], which is why I'm fairly certain it's long out of date by now.
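As a rough illustration of that host-side supervision (not any particular firm's design), the governor can be as plain as a process that memory-maps a PCIe BAR exposed by the FPGA, watches a heartbeat and a position counter, and flips a kill-switch register when something looks wrong. The device node, register offsets, and limits below are all invented:

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    // Hypothetical register map exposed by the FPGA over a PCIe BAR (via UIO here).
    constexpr size_t BAR_SIZE        = 4096;
    constexpr size_t REG_HEARTBEAT   = 0x00 / 8;   // incremented continuously by the FPGA
    constexpr size_t REG_POSITION    = 0x08 / 8;   // signed net position, in lots
    constexpr size_t REG_KILL_SWITCH = 0x10 / 8;   // host writes 1 to halt order generation

    int main() {
        int fd = open("/dev/uio0", O_RDWR | O_SYNC);           // device node is an invention
        if (fd < 0) { perror("open"); return 1; }
        void* map = mmap(nullptr, BAR_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) { perror("mmap"); return 1; }
        auto* regs = static_cast<volatile uint64_t*>(map);

        uint64_t last_hb = regs[REG_HEARTBEAT];
        for (;;) {
            usleep(1000);   // governor runs at ms granularity; the fast path never waits on it
            uint64_t hb  = regs[REG_HEARTBEAT];
            int64_t  pos = static_cast<int64_t>(regs[REG_POSITION]);
            // Halt the strategy if the FPGA stalls or breaches a made-up position limit.
            if (hb == last_hb || pos > 500 || pos < -500) {
                regs[REG_KILL_SWITCH] = 1;
                std::fprintf(stderr, "kill switch set (hb=%llu pos=%lld)\n",
                             (unsigned long long)hb, (long long)pos);
            }
            last_hb = hb;
        }
    }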
HRT has some public information on the pipeline they use for testing and verifying trading components implemented in HDL.[2] With modern tooling, namely Verilator, development isn't significantly different from modern software development. If anything, SystemVerilog components are much easier to unit test than typical C++ code.
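For flavour, a generic Verilator-style harness (not HRT's pipeline): Verilator compiles the SystemVerilog module into a C++ class, and the testbench is then ordinary C++ that slots into any unit-test framework. The module and port names below are made up:

    // Built with something like: verilator --cc decoder.sv --exe tb_decoder.cpp
    #include <cstdio>
    #include <memory>
    #include <verilated.h>
    #include "Vdecoder.h"   // class generated by Verilator from a hypothetical decoder.sv

    // Drive one clock cycle.
    static void tick(Vdecoder& dut) {
        dut.clk = 1; dut.eval();
        dut.clk = 0; dut.eval();
    }

    int main(int argc, char** argv) {
        Verilated::commandArgs(argc, argv);
        auto dut = std::make_unique<Vdecoder>();

        // Reset (port names are illustrative).
        dut->rst_n = 0;
        tick(*dut);
        dut->rst_n = 1;

        // Feed one input word and check the decoded output, exactly like a unit test.
        dut->in_valid = 1;
        dut->in_data  = 0x0001002A;   // e.g. instrument 1, quantity 42
        tick(*dut);
        dut->in_valid = 0;

        bool ok = dut->out_valid && dut->out_qty == 42;
        std::printf(ok ? "PASS\n" : "FAIL\n");

        dut->final();
        return ok ? 0 : 1;
    }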
Beyond that it gets far too firm-specific to say much, and I'm certainly not the one to comment anyway. There are maybe three dozen HFT firms in the entire United States? It's not a huge field with widely acknowledged industry norms.