How much it depends on hard IP blocks ? I mean, can it be ported to FPGAs of other vendors, like Lattice ECP5 ? Did you implement PCIe in HDL or used vendor specific IP block ? Please, provide some resource utilization statistics. Thanks.
The GPU uses https://github.com/alexforencich/verilog-pcie + the Xilinx PCIe hard IP core. When using the device-independent DMA engine, that library supports both Xilinx and Intel FPGAs.
Implementing PCIe in the fabric without using the hard IP would be foolish, and definitely not the kind of thing I'd enjoy spending my time on! The design makes extensive use of the DSP48E2 and various BRAM/URAM blocks available in the fabric. I don't have exact numbers off the top of my head, but roughly it's ~500 DSP units (primarily for multiplication), ~70k LUTs, ~135k FFs, and ~90 BRAMs. Porting it to a different device would be a pretty significant undertaking, but would not be impossible. Many of the DSP resources are inferred, but there is a lot of timing stuff that depends on the DSP48E2's behavior - multiple register stages following the multiplies, the inputs are sized appropriately for those specific DSP capabilities, etc.