Even with a PCIe FPGA card you're still going to be memory bound during inference. When running llama.cpp on CPU alone, memory bandwidth, not CPU power, is always the bottleneck.
Now, if the FPGA card had a large amount of GPU-tier memory on board, then that would help.
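To put rough numbers on the memory-bound point, here's a back-of-the-envelope sketch. It assumes a dense model at batch size 1, where every weight has to be read once per generated token, so bandwidth divided by model size gives a ceiling on tokens/s. The function name and all the figures are my own illustrative assumptions (a ~3.5 GB model, i.e. roughly 7B parameters at 4 bits per weight; theoretical peak bandwidths), not measurements.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough upper bound on decode speed, assuming every weight is read once per token."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# Assumed, illustrative figures (theoretical peaks, not benchmarks).
MODEL_SIZE_GB = 3.5  # roughly a 7B-parameter model at 4 bits per weight
BANDWIDTH_GB_S = {
    "dual-channel DDR4-3200 (2 x 8 bytes x 3200 MT/s)": 51.2,
    "PCIe 4.0 x16 link (approximate)": 32.0,
    "HBM-class on-card memory (assumed round number)": 1000.0,
}

for name, bandwidth in BANDWIDTH_GB_S.items():
    ceiling = max_tokens_per_second(bandwidth, MODEL_SIZE_GB)
    print(f"{name}: at most ~{ceiling:.0f} tokens/s")
```

Real throughput will land below these ceilings, but the ratio is the point: an accelerator that has to pull weights across the PCIe link is no better off than the CPU reading from system RAM, while fast memory on the card itself changes the picture.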