I just started playing around with PIO and DMA on a Pico, and it's really fun just how much you can do on the chip without involving the main CPU. For context, PIO is a set of small programmable state machines at the edge of the chip that can directly respond to and drive external IO with deterministic timing. DMA moves data between memory and peripherals on its own, and it can be chained, looped, or set to raise an interrupt so the CPU rarely has to re-arm it. The linked repo uses both heavily for its fast Ethernet communication.
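For a flavor of what "without involving the main CPU" looks like, here's a minimal pico-sdk sketch (my own illustration, not code from the linked repo): a DMA channel feeds a PIO state machine's TX FIFO, paced by the FIFO's data request, so the CPU isn't touched per word. Buffer name, PIO instance and state machine number are all arbitrary.

    #include "hardware/dma.h"
    #include "hardware/pio.h"

    static uint32_t tx_buf[256];  // example payload, filled elsewhere

    void start_streaming(void) {
        int ch = dma_claim_unused_channel(true);
        dma_channel_config c = dma_channel_get_default_config(ch);
        channel_config_set_transfer_data_size(&c, DMA_SIZE_32);
        channel_config_set_read_increment(&c, true);    // step through the buffer
        channel_config_set_write_increment(&c, false);  // always the same FIFO register
        channel_config_set_dreq(&c, pio_get_dreq(pio0, 0, true)); // pace by SM0's TX FIFO
        dma_channel_configure(ch, &c,
                              &pio0->txf[0],                       // write: PIO0 SM0 TX FIFO
                              tx_buf,                              // read: our buffer
                              sizeof(tx_buf) / sizeof(tx_buf[0]),  // word count
                              true);                               // start immediately
        // From here the transfer runs with no CPU involvement; the PIO program
        // (loaded separately) clocks the words out onto the pins.
    }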
Thanks, and you're correct; not sure why you got downvoted for this. For anyone curious, here are the data sheets for the RP2040 [for the original Pico] and RP2350 [for the Pico 2], which describe these systems in detail.
> receive side uses a per-packet interrupt to finalize a received packet
This has kept much faster systems from processing packets at line speed. A classic example was that standard gigabit network cards and contemporary CPUs were not able to process VoIP packets (which are tiny) at line speed, while they could easily download files (which are basically MTU-sized packets) at line speed.
Fortunately, the receive ISR isn't cracking packets, just calculating a checksum and passing the packet on to LWIP. I wish there were two DMA sniffers, so that the checksum could be calculated by the DMA engine(s), as that's where a lot of processor time is spent (even with a table-driven CRC routine).
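For reference, here's roughly what using the one sniffer looks like with the pico-sdk. This is a hedged sketch of the general technique, not the repo's code; the mode value and seed are assumptions, and Ethernet's FCS needs the bit-reversed CRC-32 variant, so check the datasheet's SNIFF_CTRL.CALC table before relying on it.

    #include "hardware/dma.h"

    // Accumulate a CRC over a buffer while DMA copies it, so the CPU doesn't
    // have to run a table-driven loop over the same bytes.
    uint32_t crc_via_dma(const uint32_t *src, uint32_t *dst, uint words) {
        int ch = dma_claim_unused_channel(true);
        dma_channel_config c = dma_channel_get_default_config(ch);
        channel_config_set_transfer_data_size(&c, DMA_SIZE_32);
        channel_config_set_read_increment(&c, true);
        channel_config_set_write_increment(&c, true);
        channel_config_set_sniff_enable(&c, true);   // route this channel through the sniffer

        dma_sniffer_enable(ch, 0x0, true);           // 0x0 = plain CRC-32 (see SNIFF_CTRL.CALC)
        dma_hw->sniff_data = 0xFFFFFFFFu;            // seed the accumulator

        dma_channel_configure(ch, &c, dst, src, words, true);
        dma_channel_wait_for_finish_blocking(ch);

        uint32_t crc = dma_hw->sniff_data;           // result after the last word
        dma_channel_unclaim(ch);
        return crc;
    }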
You can do it using PIO. I did that for emulating a Memory Stick slave on the RP2040: one PIO SM plus two DMA channels with chained descriptors. The XOR is achieved by writing to any IO register you don't need through its 0x1000 atomic alias (the datasheet describes this as the XOR alias).
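A CPU-side sketch of that alias trick for anyone unfamiliar; a DMA channel would simply use the aliased address as its write target (with write-increment off) to fold a stream of words down to one XOR result. The choice of a WATCHDOG scratch register here is just an example of a spare read/write register, not what the Memory Stick project actually used.

    #include "hardware/address_mapped.h"
    #include "hardware/structs/watchdog.h"

    // Writing a value to a register's +0x1000 "XOR" alias toggles exactly those
    // bits, with no read-modify-write cycle on the CPU or DMA side.
    static inline void xor_accumulate(uint32_t value) {
        hw_xor_bits(&watchdog_hw->scratch[0], value);  // SCRATCH0 used as a spare register
    }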
Luckily the RP2040 has a dual-core CPU, so one core can be dedicated entirely to servicing the interrupts, passing packets to user code on the other core via a FIFO or whatever else you fancy.
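Something like this hedged sketch, using the pico-sdk's hardware inter-core FIFO; the packet-producing side is stubbed out here since it depends entirely on the actual driver.

    #include "pico/stdlib.h"
    #include "pico/multicore.h"

    // Core 1: stand-in for the network core. In a real driver this would run the
    // receive ISR / PIO servicing and push a buffer index when a packet completes.
    static void core1_net_loop(void) {
        uint32_t pkt_index = 0;
        while (true) {
            sleep_ms(10);                               // pretend a packet just arrived
            multicore_fifo_push_blocking(pkt_index++);  // hand it to core 0
        }
    }

    int main(void) {
        multicore_launch_core1(core1_net_loop);
        while (true) {
            uint32_t pkt_index = multicore_fifo_pop_blocking(); // blocks until core 1 pushes
            (void)pkt_index;  // user code on core 0 would process the packet here
        }
    }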
Why would there be context switching? One core is exclusively running user code and polls for new pre-processed packets in some loop; the other core is exclusively running low-level network code and dealing with interrupts.
It's a Cortex-M33, so there's no meaningful cache to speak of. Access to all memory takes essentially the same amount of time. If you're really worried about access time you could probably use SRAM banks 8&9 (each 4 kB, with their own connection to the AHB crossbar) and flip-flop between the two - but I highly doubt it's going to have a measurable impact.
If the interrupt and userspace code run on the same core, there is a chance that the data will still be in the processor's cache lines and it won't have to go through main memory.
I expect the RP2350 to perform much better in this scenario! At the minimum, one of the DMA channels should be eliminated, and I'm hoping the CRC calculation will get faster.
> Is there enough room to have it control the ethernet port for another weaker or perhaps more powerful microcontroller?
Well, there is a whole unused core and plenty of built-in SRAM. Seems like a good way to have an open-source version of the Wiznet chips [1]. It could support full protocol offloading like Wiznet's, or act as a lower-level raw packet sender/receiver like the ENC424J600.
I just quickly tried to fit the whole RP2040 + Ethernet PHY into the WIZ850io form factor (mainly because I've already used that module in some projects), and I haven't yet been able to make it fit without using the more expensive JLCPCB features like buried vias. It would be very cool to have, though, since the W5500 really needs an update.
I'm unable to respond to your deeper comment, but I don't see any issue at all with this. The concern about the vias doesn't make sense, as you can just tent the vias anywhere you're worried about shorts. I'm 100% certain you can fit both chips, all passives, etc. in this form factor. If the flash size is a concern, the RP2350 (the successor to the RP2040) has integrated flash in some of its packages. Or just use a chip-scale (or similar) flash instead of the one normally used in RP2040 designs.
A 4-layer board in that form factor should be pretty doable with no fancy features like blind vias. The RP2040 and W5500 are the same size, and Ethernet PHYs can be found in roughly 3x3 mm packages or even smaller. There should be about 20x25 mm of usable space in that module form factor (even conservatively, something like 18x23 mm).
I don't have the time to give it a shot myself, but I could try to help if needed.
The issue is more the space needed by all the passives, the crystal, and the massive flash chip. I can just about make it fit, but then the PHY needs some vias to its center ground pad, and those always land right where my Ethernet jack sits on the other side.
Make a package that has an RP2350 mounted on a microSD card and you've got a NAS that nobody will ever find.
Back when I was doing a dumb-server/smart-client desktop environment, something like this would have been pretty cool. It needed a tiny API to save files, but the bulk of the environment worked as a static server.
This all already exists: the Raspberry Pi Zero 2 W. The board is slightly bigger than a Pico but runs a full-blown Linux system, with a 4-core arm64 CPU, 512 MB of RAM, an SD card slot, and WiFi - no Ethernet though (add-ons are available). Or you could use a larger Pi.
It would be interesting to see a short writeup of what kind of magic was required to achieve this, as there have been multiple failed attempts before this.
I'm also curious about the performance boost from 2.81Mbit/link failure at 150MHz to 65.4Mbit/31.4Mbit at 200MHz. That doesn't sound like basic processor bottlenecks, but rather some kind of catastrophic breakdown at a lower level? Does it just occasionally completely fail to lock onto an incoming clock signal or something?
I did some further investigating - it's apparently due to not having enough setup time on the RX PIO SM. Even though the PIO clocking is fixed at 100 MHz, there are CRC errors at the lower system clocks. I tried changing the delay in the PIO instruction that starts the RX sampling, but that only made things worse (as expected). I also tried disabling the input synchronizers, with no improvement.
Hmm, interesting. Am I understanding it correctly that you're doing some kind of reset on the RX PIO from regular C code, and the time for "RX finish -> interrupt CPU -> reset RX PIO" is longer than the gap between packets?
If so, might it be possible to use two RX PIOs, automatically starting the next one via inter-PIO IRQ when a packet is finished? That'd give you an entire packet receive time to reset the original PIO, which should be plenty.
Nothing nearly so complex. Here's the code in question:
    .wrap_target
        irq set 0          ; Signal end of active packet
    start:
        wait 1 pin 2       ; Wait for CRS_DV assertion
        wait 1 pin 0       ; Wait for RX<0> to assert, signalling preamble start
        wait 1 pin 1 [2]   ; Wait for Start of Frame Delimiter, align to sample clk
    sample:
        in pins, 2         ; accumulate di-bits
        jmp PIN, sample    ; as long as CRS_DV is asserted
    .wrap
It's run at a fixed 100 MHz, regardless of system clock speed, by setting the PIO clock divider to a fraction of the system clock. So, for a 300 MHz system clock, the PIO is clocked once every three system clocks. I'm speculating that the extra two clocks (at 300 MHz) allow more setup time at the PIO inputs. The [2] above adds an extra two PIO clock delays before executing the next instruction. I tried changing this from zero to three at a 100 MHz system clock (i.e. a PIO clock divisor of one), and wasn't able to fix the problem. Though it should be noted that the LAN8742 isn't a very forgiving chip - I've seen RX Data Valid (DV) go metastable when the TX clock is interrupted/changed, so another pass through might be worthwhile.
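For anyone wanting to replicate that fixed-rate clocking, here's a hedged pico-sdk sketch (not the repo's exact setup code; the state machine, program offset, and the rest of the SM configuration are placeholders) that derives the divider from whatever clk_sys happens to be:

    #include "hardware/clocks.h"
    #include "hardware/pio.h"

    // Keep the RX state machine at ~100 MHz whatever the system clock is:
    // 300 MHz gives a divider of 3 (one PIO tick per three system clocks),
    // a 100 MHz system clock gives a divider of 1.
    static void set_rx_sm_to_100mhz(PIO pio, uint sm, uint offset) {
        pio_sm_config c = pio_get_default_sm_config();
        sm_config_set_clkdiv(&c, (float)clock_get_hz(clk_sys) / 100000000.0f);
        pio_sm_init(pio, sm, offset, &c);   // offset = where the RX program was loaded
    }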
BTW, Sandeep's original code clocked the RX PIO SM at 50 MHz, pushing all the samples to the output FIFO, and relied on the processor getting interrupted at the falling edge of DV to figure out what samples constituted a packet.
Usually I can grok the significance of almost any item on HN that catches my eye, but here I'm at a loss. Can someone explain why this matters?
As far as I can tell, someone has figured out how to send Ethernet packets at a relatively high rate using hardware with a very limited CPU. Cool, but what can you _do with that_? If the RPi Pico has the juice to run interesting network _application-level traffic_ at line rate it's more intriguing, but I doubt anyone's going to claim it can serve web traffic at line rate on this device, for example.
It's quite popular in the retro-computing scene, for example, to bring these old machines into the 21st century with modern microcontrollers being used to add peripheral support.
For example, the Oric-1/Atmos computers recently got a project called "LOCI" which adds USB support to the 40-year old computer[1], by using an RP2040's PIO capabilities to interface the 8-bit DATA bus with a microcontroller capable of acting as the 'gateway' to all of the devices on the USB peripheral bus.
This is amazing, frankly.
And now, being able to do Ethernet in such a simple way means that hundreds of retro-computing platforms could be put on the Internet with relative ease ..
RP2040/2350 are IO monsters. You could for example make a logic analyzer that transfers logic data through ethernet.
This "very limited" microcontroller has two cores. Each of them can execute about 25 instructions per byte while generating "application-level traffic" (100 Mbit/s is 12.5 MB/s, so a core at ~300 MHz gets roughly 24 cycles per byte). You could definitely saturate a 100 Mbps connection with just one core.
Now that you mention it, I think I would like to see a logic analyzer that does just that. No buffering, just straight-up shovel the data to a MAC address, or even an IP address, and be done with it (maybe lose a few frames here and there). Let the PC worry about what to do with it, like triggers etc.
Should be cheap, right? Though a 1 Gbit version might still be expensive.
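The capture side of that idea is surprisingly small. Here's a hedged pico-sdk sketch of just the sampling path (one PIO instruction plus a DMA channel); GPIO setup and the Ethernet TX path are omitted, and the pin count, buffer size and names are all arbitrary.

    #include "hardware/pio.h"
    #include "hardware/dma.h"

    static uint32_t capture_buf[4096];  // raw samples, to be handed to the network path

    void start_capture(PIO pio, uint sm, uint pin_base) {
        // One-instruction PIO program: "in pins, 8", looping forever via wrap.
        uint16_t instr = pio_encode_in(pio_pins, 8);
        struct pio_program prog = { .instructions = &instr, .length = 1, .origin = -1 };
        uint offset = pio_add_program(pio, &prog);

        pio_sm_config c = pio_get_default_sm_config();
        sm_config_set_in_pins(&c, pin_base);            // sample 8 consecutive GPIOs
        sm_config_set_in_shift(&c, false, true, 32);    // autopush a word every 4 samples
        sm_config_set_wrap(&c, offset, offset);         // loop on the single instruction
        pio_sm_init(pio, sm, offset, &c);

        int ch = dma_claim_unused_channel(true);
        dma_channel_config dc = dma_channel_get_default_config(ch);
        channel_config_set_transfer_data_size(&dc, DMA_SIZE_32);
        channel_config_set_read_increment(&dc, false);  // always read the RX FIFO
        channel_config_set_write_increment(&dc, true);  // fill the buffer
        channel_config_set_dreq(&dc, pio_get_dreq(pio, sm, false)); // pace by RX FIFO
        dma_channel_configure(ch, &dc, capture_buf, &pio->rxf[sm],
                              sizeof(capture_buf) / sizeof(capture_buf[0]), true);

        pio_sm_set_enabled(pio, sm, true);              // sampling runs at the PIO clock rate
    }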
How is this different from the cheap Saleae clones now? Just sub out Ethernet for USB and that's how they work now: a cheap IC with nothing but an A2D and a USB PHY samples and sends as fast as it can.
Back in the day, in the x86 world, there was this "rule of thumb" that you needed about 1 GHz of CPU speed to saturate a 1 Gbit network link. So a server with four 2 GHz CPUs could saturate eight 1 Gbit links and still be somewhat useful.
This was AFAIR based on empirical knowledge, nothing scientific.
So a Pi Pico running at 300 MHz pushing 100 Mbit is not totally unexpected, if you consider the low-power, low-cost CPU design in a Pi Pico (and the fact that you have to push the bits onto the wire manually).
"Line rate" is "fill the 100Mbit link with 100 million bits each second". Of course the overhead is included in that, since the overhead also goes over the wire
I'm many years removed from such topics, but I don't remember this being the case; moreover, specs for network equipment were (and are) given in pps, with the details usually stating 2-3 packet size categories. I'd be interested in a reference for what you wrote.
As the article calls it, the gold standard. If a device is capable of forwarding/switching packets at line rate at the smallest packet size, on all interfaces at the same time, you don't have to think too much about its performance when designing your network. I haven't worked much with hardware for a few years, but it was common that Cisco switches were not capable of this.
Vendors I've seen usually use one of a few "standard" packet size mixes, e.g. IMIX. Nobody uses smallest-size frames because nobody can hit their headline perf numbers that way, and it's not representative of real-world usage anyway.
That 8200, for example, is capable of line rate at the smallest packet size, so the IMIX marketing is kinda useless. When evaluating these kinds of devices, this is what matters.
IMIX makes sense for devices that are not capable of small-packet line rate, like firewalls, where bandwidth is much more costly and needs to be sized appropriately.
I don't have any Cisco core routers, nor have I personally tested any, but the document I provided found their Q200 ASIC (in the 8000 series) required at least 170-byte frames to hit line rate:
> Both DUTs can achieve line rate performance on all ports with an NDR of 170 Bytes for the 88-LC0-36FH-M line card and 215 Bytes for 8201-32FH router. Same values were observed for both IPv4 and IPv6 traffic. This exceeds all real-life deployments requirements regardless of position in the network.
The 9000 series analysis reports something like 400B packets to hit line rate.
Fundamentally, everyone has to scale their internal bus width and clock rate to hit the headline numbers, always at the cost of small frame performance.
This is a lazy definition and won't get you past "Go" when making network equipment. Why not use 9000-byte "Jumbo" frames? You'll only need to process 1,383 packets per second to fill the link! (That's 100 Mbit/s divided by the roughly 9,038 bytes each jumbo frame occupies on the wire.)