I just started playing around with PIO and DMA on a Pico, and it's really fun just how much you can do on the chip without involving the main CPU. For context, PIO is a set of small programmable state machines at the edge of the chip that can directly respond to and drive external IO with deterministic timing. DMA moves data between memory and peripherals on its own, and it can be chained, looped, or set to raise an interrupt so the CPU rarely has to re-arm it. The linked repo uses both heavily for its fast Ethernet communication.
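For a flavor of what "without involving the main CPU" looks like, here's a minimal pico-sdk sketch (my own illustration, not code from the linked repo): a DMA channel feeds a PIO state machine's TX FIFO, paced by the FIFO's data request, so the CPU isn't touched per word. Buffer name, PIO instance and state machine number are all arbitrary.

    #include "hardware/dma.h"
    #include "hardware/pio.h"

    static uint32_t tx_buf[256];  // example payload, filled elsewhere

    void start_streaming(void) {
        int ch = dma_claim_unused_channel(true);
        dma_channel_config c = dma_channel_get_default_config(ch);
        channel_config_set_transfer_data_size(&c, DMA_SIZE_32);
        channel_config_set_read_increment(&c, true);    // step through the buffer
        channel_config_set_write_increment(&c, false);  // always the same FIFO register
        channel_config_set_dreq(&c, pio_get_dreq(pio0, 0, true)); // pace by SM0's TX FIFO
        dma_channel_configure(ch, &c,
                              &pio0->txf[0],                       // write: PIO0 SM0 TX FIFO
                              tx_buf,                              // read: our buffer
                              sizeof(tx_buf) / sizeof(tx_buf[0]),  // word count
                              true);                               // start immediately
        // From here the transfer runs with no CPU involvement; the PIO program
        // (loaded separately) clocks the words out onto the pins.
    }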
Thanks, and you're correct; not sure why you got downvoted for this. For anyone curious, here are the data sheets for the RP2040 [for the original Pico] and RP2350 [for the Pico 2], which describe these systems in detail.
> receive side uses a per-packet interrupt to finalize a received packet
This has kept much faster systems from processing packets at line speed. A classic example was that standard gigabit network cards and contemporary CPUs were not able to process VoIP packets (which are tiny) at line speed, while they could easily download files (which are basically MTU-sized packets) at line speed.
Fortunately, the receive ISR isn't cracking packets, just calculating a checksum and passing the packet on to LWIP. I wish there were two DMA sniffers, so that the checksum could be calculated by the DMA engine(s), as that's where a lot of processor time is spent (even with a table-driven CRC routine).
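For reference, here's roughly what using the one sniffer looks like with the pico-sdk. This is a hedged sketch of the general technique, not the repo's code; the mode value and seed are assumptions, and Ethernet's FCS needs the bit-reversed CRC-32 variant, so check the datasheet's SNIFF_CTRL.CALC table before relying on it.

    #include "hardware/dma.h"

    // Accumulate a CRC over a buffer while DMA copies it, so the CPU doesn't
    // have to run a table-driven loop over the same bytes.
    uint32_t crc_via_dma(const uint32_t *src, uint32_t *dst, uint words) {
        int ch = dma_claim_unused_channel(true);
        dma_channel_config c = dma_channel_get_default_config(ch);
        channel_config_set_transfer_data_size(&c, DMA_SIZE_32);
        channel_config_set_read_increment(&c, true);
        channel_config_set_write_increment(&c, true);
        channel_config_set_sniff_enable(&c, true);   // route this channel through the sniffer

        dma_sniffer_enable(ch, 0x0, true);           // 0x0 = plain CRC-32 (see SNIFF_CTRL.CALC)
        dma_hw->sniff_data = 0xFFFFFFFFu;            // seed the accumulator

        dma_channel_configure(ch, &c, dst, src, words, true);
        dma_channel_wait_for_finish_blocking(ch);

        uint32_t crc = dma_hw->sniff_data;           // result after the last word
        dma_channel_unclaim(ch);
        return crc;
    }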
You can do it using PIO. I did that for emulating a Memory Stick slave on the RP2040: one PIO SM plus two DMA channels with chained descriptors. The XOR is achieved by writing to any IO register you don't need through its 0x1000 atomic alias (the datasheet describes this as the XOR alias).
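A CPU-side sketch of that alias trick for anyone unfamiliar; a DMA channel would simply use the aliased address as its write target (with write-increment off) to fold a stream of words down to one XOR result. The choice of a WATCHDOG scratch register here is just an example of a spare read/write register, not what the Memory Stick project actually used.

    #include "hardware/address_mapped.h"
    #include "hardware/structs/watchdog.h"

    // Writing a value to a register's +0x1000 "XOR" alias toggles exactly those
    // bits, with no read-modify-write cycle on the CPU or DMA side.
    static inline void xor_accumulate(uint32_t value) {
        hw_xor_bits(&watchdog_hw->scratch[0], value);  // SCRATCH0 used as a spare register
    }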
Luckily the RP2040 has a dual-core CPU, so one core can be dedicated entirely to servicing the interrupts, passing packets to user code on the other core via a FIFO or whatever else you fancy.
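Something like this hedged sketch, using the pico-sdk's hardware inter-core FIFO; the packet-producing side is stubbed out here since it depends entirely on the actual driver.

    #include "pico/stdlib.h"
    #include "pico/multicore.h"

    // Core 1: stand-in for the network core. In a real driver this would run the
    // receive ISR / PIO servicing and push a buffer index when a packet completes.
    static void core1_net_loop(void) {
        uint32_t pkt_index = 0;
        while (true) {
            sleep_ms(10);                               // pretend a packet just arrived
            multicore_fifo_push_blocking(pkt_index++);  // hand it to core 0
        }
    }

    int main(void) {
        multicore_launch_core1(core1_net_loop);
        while (true) {
            uint32_t pkt_index = multicore_fifo_pop_blocking(); // blocks until core 1 pushes
            (void)pkt_index;  // user code on core 0 would process the packet here
        }
    }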
Why would there be context switching? One core is exclusively running user code and polls for new pre-processed packets in some loop; the other core is exclusively running low-level network code and dealing with interrupts.
It's a Cortex-M33, so there's no meaningful cache to speak of. Access to all memory takes essentially the same amount of time. If you're really worried about access time you could probably use SRAM banks 8&9 (each 4 kB, with their own connection to the AHB crossbar) and flip-flop between the two - but I highly doubt it's going to have a measurable impact.
If the interrupt and userspace code run on the same core, there is a chance that the data will still be in the processor's cache lines and it won't have to go through main memory.
I expect the RP2350 to perform much better in this scenario! At the minimum, one of the DMA channels should be eliminated, and I'm hoping the CRC calculation will get faster.
> Is there enough room to have it control the ethernet port for another weaker or perhaps more powerful microcontroller?
Well, there is a whole unused core and plenty of built-in SRAM. Seems like a good way to have an open-source version of the Wiznet chips [1]. It could support full protocol offloading like Wiznet's, or act as a lower-level raw packet sender/receiver like the ENC424J600.
I just quickly tried to fit the whole RP2040 + Ethernet PHY into the WIZ850io form factor (mainly because I've already used that module in some projects), and I haven't yet been able to make it fit without using the more expensive JLCPCB features like buried vias. It would be very cool to have, though, since the W5500 really needs an update.
I'm unable to respond to your deeper comment, but I don't see any issue at all with this. The concern about the vias doesn't make sense, as you can just tent the vias anywhere you're worried about shorts. I'm 100% certain you can fit both chips, all passives, etc. in this form factor. If the flash size is a concern, the RP2350 (the successor to the RP2040) has integrated flash in some of its packages. Or just use a chip-scale (or similar) flash instead of the one normally used in RP2040 designs.
A 4-layer board in that form factor should be pretty doable with no fancy features like blind vias. The RP2040 and W5500 are the same size, and Ethernet PHYs can be found in roughly 3x3 mm packages or even smaller. There should be about 20x25 mm of usable space in that module form factor (even conservatively, something like 18x23 mm).
I don't have the time to give it a shot myself, but I could try to help if needed.
The issue is more the space needed by all the passives, the crystal, and the massive flash chip. I can just about make it fit, but then the PHY needs some vias to its center ground pad, and those always land right where my Ethernet jack sits on the other side.
Make a package that has an RP2350 mounted on a microSD card and you've got a NAS that nobody will ever find.
Back when I was doing a dumb-server/smart-client desktop environment, something like this would have been pretty cool. It needed a tiny API to save files, but the bulk of the environment worked as a static server.
This all already exists: the Raspberry Pi Zero 2 W. The board is slightly bigger than a Pico but runs a full-blown Linux system, with a 4-core arm64 CPU, 512 MB of RAM, an SD card slot, and WiFi - no Ethernet though (add-ons are available). Or you could use a larger Pi.
It would be interesting to see a short writeup of what kind of magic was required to achieve this, as there have been multiple failed attempts before this.
I'm also curious about the performance boost from 2.81Mbit/link failure at 150MHz to 65.4Mbit/31.4Mbit at 200MHz. That doesn't sound like basic processor bottlenecks, but rather some kind of catastrophic breakdown at a lower level? Does it just occasionally completely fail to lock onto an incoming clock signal or something?
I did some further investigating - it's apparently due to not having enough setup time on the RX PIO SM. Even though the PIO clocking is fixed at 100 MHz, there are CRC errors at the lower system clocks. I tried changing the delay in the PIO instruction that starts the RX sampling, but that only made things worse (as expected). I also tried disabling the input synchronizers, with no improvement.
Hmm, interesting. Am I understanding it correctly that you're doing some kind of reset on the RX PIO from regular C code, and the time for "RX finish -> interrupt CPU -> reset RX PIO" is longer than the gap between packets?
If so, might it be possible to use two RX PIOs, automatically starting the next one via inter-PIO IRQ when a packet is finished? That'd give you an entire packet receive time to reset the original PIO, which should be plenty.
Nothing nearly so complex. Here's the code in question:
    .wrap_target
        irq set 0          ; Signal end of active packet
    start:
        wait 1 pin 2       ; Wait for CRS_DV assertion
        wait 1 pin 0       ; Wait for RX<0> to assert, signalling preamble start
        wait 1 pin 1 [2]   ; Wait for Start of Frame Delimiter, align to sample clk
    sample:
        in pins, 2         ; accumulate di-bits
        jmp PIN, sample    ; as long as CRS_DV is asserted
    .wrap
It's run at a fixed 100 MHz, regardless of system clock speed, by setting the PIO clock divider to a fraction of the system clock. So, for a 300 MHz system clock, the PIO is clocked once every three system clocks. I'm speculating that the extra two clocks (at 300 MHz) allow more setup time at the PIO inputs. The [2] above adds an extra two PIO clock delays before executing the next instruction. I tried changing this from zero to three at a 100 MHz system clock (i.e. a PIO clock divisor of one), and wasn't able to fix the problem. Though it should be noted that the LAN8742 isn't a very forgiving chip - I've seen RX Data Valid (DV) go metastable when the TX clock is interrupted/changed, so another pass through might be worthwhile.
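For anyone wanting to replicate that fixed-rate clocking, here's a hedged pico-sdk sketch (not the repo's exact setup code; the state machine, program offset, and the rest of the SM configuration are placeholders) that derives the divider from whatever clk_sys happens to be:

    #include "hardware/clocks.h"
    #include "hardware/pio.h"

    // Keep the RX state machine at ~100 MHz whatever the system clock is:
    // 300 MHz gives a divider of 3 (one PIO tick per three system clocks),
    // a 100 MHz system clock gives a divider of 1.
    static void set_rx_sm_to_100mhz(PIO pio, uint sm, uint offset) {
        pio_sm_config c = pio_get_default_sm_config();
        sm_config_set_clkdiv(&c, (float)clock_get_hz(clk_sys) / 100000000.0f);
        pio_sm_init(pio, sm, offset, &c);   // offset = where the RX program was loaded
    }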
BTW, Sandeep's original code clocked the RX PIO SM at 50 MHz, pushing all the samples to the output FIFO, and relied on the processor getting interrupted at the falling edge of DV to figure out what samples constituted a packet.
Usually I can grok the significance of almost any item on HN that catches my eye, but here I'm at a loss. Can someone explain why this matters?
As far as I can tell, someone has figured out how to send Ethernet packets at a relatively high rate using hardware with a very limited CPU. Cool, but what can you _do with that_? If the RPi Pico has the juice to run interesting network _application-level traffic_ at line rate it's more intriguing, but I doubt anyone's going to claim it can serve web traffic at line rate on this device, for example.
It's quite popular in the retro-computing scene, for example, to bring these old machines into the 21st century with modern microcontrollers being used to add peripheral support.
For example, the Oric-1/Atmos computers recently got a project called "LOCI" which adds USB support to the 40-year old computer[1], by using an RP2040's PIO capabilities to interface the 8-bit DATA bus with a microcontroller capable of acting as the 'gateway' to all of the devices on the USB peripheral bus.
This is amazing, frankly.
And now, being able to do Ethernet in such a simple way means that hundreds of retro-computing platforms could be put on the Internet with relative ease ..
RP2040/2350 are IO monsters. You could for example make a logic analyzer that transfers logic data through ethernet.
This "very limited" microcontroller has two cores. Each of them can execute about 25 instructions per byte while generating "application-level traffic" (100 Mbit/s is 12.5 MB/s, so a core at ~300 MHz gets roughly 24 cycles per byte). You could definitely saturate a 100 Mbps connection with just one core.
Now that you mention it, I think I would like to see a logic analyzer that does just that. No buffering, just straight-up shovel the data to a MAC address, or even an IP address, and be done with it (maybe lose a few frames here and there). Let the PC worry about what to do with it, like triggers etc.
Should be cheap, right? Though a 1 Gbit version might still be expensive.
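The capture side of that idea is surprisingly small. Here's a hedged pico-sdk sketch of just the sampling path (one PIO instruction plus a DMA channel); GPIO setup and the Ethernet TX path are omitted, and the pin count, buffer size and names are all arbitrary.

    #include "hardware/pio.h"
    #include "hardware/dma.h"

    static uint32_t capture_buf[4096];  // raw samples, to be handed to the network path

    void start_capture(PIO pio, uint sm, uint pin_base) {
        // One-instruction PIO program: "in pins, 8", looping forever via wrap.
        uint16_t instr = pio_encode_in(pio_pins, 8);
        struct pio_program prog = { .instructions = &instr, .length = 1, .origin = -1 };
        uint offset = pio_add_program(pio, &prog);

        pio_sm_config c = pio_get_default_sm_config();
        sm_config_set_in_pins(&c, pin_base);            // sample 8 consecutive GPIOs
        sm_config_set_in_shift(&c, false, true, 32);    // autopush a word every 4 samples
        sm_config_set_wrap(&c, offset, offset);         // loop on the single instruction
        pio_sm_init(pio, sm, offset, &c);

        int ch = dma_claim_unused_channel(true);
        dma_channel_config dc = dma_channel_get_default_config(ch);
        channel_config_set_transfer_data_size(&dc, DMA_SIZE_32);
        channel_config_set_read_increment(&dc, false);  // always read the RX FIFO
        channel_config_set_write_increment(&dc, true);  // fill the buffer
        channel_config_set_dreq(&dc, pio_get_dreq(pio, sm, false)); // pace by RX FIFO
        dma_channel_configure(ch, &dc, capture_buf, &pio->rxf[sm],
                              sizeof(capture_buf) / sizeof(capture_buf[0]), true);

        pio_sm_set_enabled(pio, sm, true);              // sampling runs at the PIO clock rate
    }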
How is this different from the cheap Saleae clones now? Just sub out Ethernet for USB and that's how they work now: a cheap IC with nothing but an A2D and a USB PHY samples and sends as fast as it can.
Back in the day, in the x86 world, there was this "rule of thumb" that you needed about 1 GHz of CPU speed to saturate a 1 Gbit network link. So a server with four 2 GHz CPUs could saturate eight 1 Gbit links and still be somewhat useful.
This was AFAIR based on empirical knowledge, nothing scientific.
So a Pi Pico running at 300 MHz pushing 100 Mbit is not totally unexpected, if you consider the low-power, low-cost CPU design in a Pi Pico (and the fact that you have to push the bits onto the wire manually).
"Line rate" is "fill the 100Mbit link with 100 million bits each second". Of course the overhead is included in that, since the overhead also goes over the wire
I'm many years removed from such topics, but I don't remember this being the case; moreover, specs for network equipment were (and are) given in pps, with the details usually stating 2-3 packet size categories. I'd be interested in a reference for what you wrote.
As the article calls it, the gold standard. If a device is capable of forwarding/switching packets at line rate at the smallest packet size, on all interfaces at the same time, you don't have to think too much about its performance when designing your network. I haven't worked much with hardware for a few years, but it was common that Cisco switches were not capable of this.
Vendors I've seen usually use one of a few "standard" packet size mixes, e.g. IMIX. Nobody uses smallest-size frames because nobody can hit their headline perf numbers that way, and it's not representative of real-world usage anyway.
That 8200, for example, is capable of line rate at the smallest packet size, so the IMIX marketing is kinda useless. When evaluating these kinds of devices, this is what matters.
IMIX makes sense for devices that are not capable of small-packet line rate, like firewalls, where bandwidth is much more costly and needs to be sized appropriately.
I don't have any Cisco core routers, nor have I personally tested any, but the document I provided found their Q200 ASIC (in the 8000 series) required at least 170-byte frames to hit line rate:
> Both DUTs can achieve line rate performance on all ports with an NDR of 170 Bytes for the 88-LC0-36FH-M line card and 215 Bytes for 8201-32FH router. Same values were observed for both IPv4 and IPv6 traffic. This exceeds all real-life deployments requirements regardless of position in the network.
The 9000 series analysis reports something like 400B packets to hit line rate.
Fundamentally, everyone has to scale their internal bus width and clock rate to hit the headline numbers, always at the cost of small frame performance.
This is a lazy definition and won't get you past "Go" when making network equipment. Why not use 9000-byte "Jumbo" frames? You'll only need to process 1,383 packets per second to fill the link! (That's 100 Mbit/s divided by the roughly 9,038 bytes each jumbo frame occupies on the wire.)