LaiNES – Cycle-accurate NES emulator in around 1000 lines of code

glandium · on Nov 28, 2016

This makes me wonder. The 6502 in the NES ran at 1.79MHz.

According to https://en.wikipedia.org/wiki/Instructions_per_second , a modern i7 processor handles north of 100,000 MIPS.

According to https://en.wikipedia.org/wiki/Transistor_count, the 6502 had 3510 transistors.

At 100,000 MIPS, a modern CPU would have a budget of ~56k instructions to process one cycle of that CPU, or about 16 instructions per transistor.
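As a quick check of that arithmetic, here is a minimal C++ sketch that uses only the three figures quoted above. The constant and function names are made up for illustration, and the inputs are rough estimates rather than measurements.

```cpp
// Figures quoted in the comment above (rough estimates, not measurements).
constexpr double kHostInstructionsPerSecond = 1.0e11;  // 100,000 MIPS
constexpr double kNesCpuHz = 1.79e6;                   // 1.79 MHz
constexpr double kTransistors = 3510.0;                // 6502 transistor count

// Host instructions available for each emulated 6502 clock cycle.
constexpr double instructions_per_cycle() {
    return kHostInstructionsPerSecond / kNesCpuHz;
}

// Host instructions available per transistor, per emulated clock cycle.
constexpr double instructions_per_transistor() {
    return instructions_per_cycle() / kTransistors;
}
```

The first function comes out just under 56,000 and the second just under 16, which matches the estimate. It says nothing about how many host instructions a transistor-level update actually needs.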

So it would seem it might now be possible to simulate those old processors, at the transistor level, in real time. Is anyone aware of experiments in this domain? If not really useful, that sounds like an interesting fun (side) project.

mattbee · on Nov 28, 2016

http://www.visual6502.org/ does it in the browser & flashes transistor states at you!

And if you liked that you'll probably also enjoy http://www.megaprocessor.com/progress.html

vincnetas · on Nov 28, 2016

Also: Introducing the MOnSter 6502 http://www.evilmadscientist.com/2016/6502/

nathan_f77 · on Nov 28, 2016

This is incredible! But it's also nowhere near realtime. The max clock rate of a 6502 is 1 MHz to 2 MHz, while the simulation runs at around 7.3 Hz in my browser.

SeanDav · on Nov 28, 2016

It turns out that for several reasons, creating an exact emulation of even much older gaming technology can be very computationally intensive.

http://arstechnica.com/gaming/2011/08/accuracy-takes-power-o...

anonbanker · on Nov 28, 2016

When people link to this, it's a good idea to mention that 65816 emulation runs as one enormous switch statement. Yes, it's accurate, but at a rather large performance penalty.

I'm a daily/weekly higan user, but I make no excuses for how it's written. Take a peek into the code.

mathw · on Nov 28, 2016

I'm aware that hardware simulation of processors is something which is done, particularly by the people who design those processors.

It takes a lot of grunt to simulate a modern GPU or CPU, and of course getting near realtime is impossible, but it's handy if you want to run a load of code through it to see if it actually works.

So... yes, you can unit test the design of your processor. If you have a big enough server farm.

throwbsidbdk · on Nov 28, 2016

Verilog does exactly this. You design the logic, then simulate all the gates. I'm sure there are some open cores out there ready for simulation if someone wants to fiddle with them.

nsteel · on Nov 28, 2016

Simulations of complex modern chips don't run in real time.

ashmud · on Nov 28, 2016

You might find ladder logic for PLCs interesting. It simulates old hardware relay systems.

raldi · on Nov 28, 2016

What does "cycle-accurate" mean? The README assumes the reader already knows; Wikipedia via Google is totally unhelpful: "A cycle-accurate simulator is a computer program that simulates a microarchitecture on a cycle-by-cycle basis."

0xcde4c3db · on Nov 28, 2016

The traditional way of coding a console emulator was to figure out the time interval to the next interrupt in units of CPU cycles, emulate enough instructions to cross that threshold, then emulate the interrupt and hook other routines (redrawing the screen, filling sound output buffers, reading input, etc.) off of those events (see e.g. Marat Fayzullin's classic Emulator HOWTO [1]). This approach runs a lot of stuff just fine because it does synchronize to the most important events, but can cause problems. For example, "well-behaved" code generally only writes to graphics registers or sprite tables during blanking periods, as writing during active display is usually undefined behavior. Some code breaks the rules. Sometimes this is done intentionally to do cool effects with the hardware. Other times it's a side effect of a bug that wasn't caught because there are coincidentally no symptoms with the timing of the actual hardware. But then you plug it into an emulator with only roughly accurate timing and it blows up.

In reality, the clocks for the various components don't necessarily run at the same rate, or even at integer multiples of the CPU clock. You can have a situation where, for example, there are 3.5 clock cycles on the graphics hardware for every one CPU clock cycle. For a lot of the classic systems, this happened because a single higher master clock is divided down for each component.

A "cycle-accurate" emulator is one that operates as if the emulated state of all hardware were updated on every tick of the master clock. This wasn't generally done in the past because it was far too slow ~20 years ago when emulation of classic consoles and computers really took off.

More sophisticated hardware doesn't necessarily have any single master clock in this sense, so it doesn't make much sense to talk about a "cycle-accurate" emulator of a modern PC, for example.
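For the classic systems described above, the "update everything on every master-clock tick" idea can be sketched roughly as follows. The dividers 12 (CPU) and 4 (PPU) are the usual NTSC NES values and are an assumption brought in for illustration, not something stated in this thread. `Component`, `Machine` and `run_master_ticks` are hypothetical names.

```cpp
#include <cstdint>

// Hypothetical component that only counts how many of its own cycles ran.
struct Component {
    std::uint64_t divider;      // master-clock ticks per component cycle
    std::uint64_t cycles = 0;   // component cycles executed so far
    explicit Component(std::uint64_t d) : divider(d) {}
    void step() { ++cycles; }   // a real emulator would update hardware state here
};

struct Machine {
    Component cpu{12};          // assumed: CPU clock = master clock / 12
    Component ppu{4};           // assumed: PPU clock = master clock / 4
    std::uint64_t master = 0;   // master-clock ticks elapsed

    // Advance the whole machine by exactly one master-clock tick.
    void tick() {
        ++master;
        if (master % cpu.divider == 0) cpu.step();
        if (master % ppu.divider == 0) ppu.step();
    }
};

// Run a fresh machine for the given number of master ticks.
Machine run_master_ticks(std::uint64_t ticks) {
    Machine m;
    for (std::uint64_t i = 0; i < ticks; ++i) m.tick();
    return m;
}
```

With these dividers, 120 master ticks give 10 CPU cycles and 30 PPU cycles. Real emulators usually avoid a literal per-tick loop and step the faster component a fixed number of times per CPU cycle instead, which gives the same ordering at lower cost.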

[1] http://fms.komkon.org/EMUL8/HOWTO.html

colanderman · on Nov 28, 2016

I played around with this in my own NES emulator, which works roughly the way you describe. I found that, at least for major-brand titles (Mario/Zelda/Metroid/Kirby), cycle accuracy actually doesn't matter. Everything's based off PPU/APU/mapper interrupts.

In fact, doubling (or more) the CPU clock enhanced some of the games I tried. Animations in Kirby's Adventure became smoother (e.g. the Spark ability). Screens full of enemies in Metroid ran with no slowdown. The glitchy first scanline of status overlays in various games cleared up. I haven't yet observed negative effects in major-brand titles.

I had designed my emulator as an experiment: treat the NES as an abstract specification, rather than a concrete implementation. Turns out that a lot of games seem to have actually been designed following the same principle. It was a wonderful feeling to see these games as I imagine they were intended to be experienced, as if I opened a letter from the game designers left unopened for thirty years.

makomk · on Nov 28, 2016

Bear in mind that sometimes, people want to do tricks that depend on the slowdowns and glitches that happen on real hardware.

colanderman · on Nov 28, 2016

I'm sure. That wasn't the point of my experiment.

emodendroket · on Nov 28, 2016

And in fact some games won't work without them. http://arstechnica.com/gaming/2011/08/accuracy-takes-power-o...

emodendroket · on Nov 28, 2016

Yeah, every emulator is pretty much going to be fine with Mario, Zelda, and Kirby, but that isn't really the point.

colanderman · on Nov 28, 2016

In my case it was.

emodendroket · on Nov 28, 2016

I guess what I mean to say is that someone endeavoring to write a cycle-accurate emulator wants every game to work exactly like it did originally regardless of what bizarre hardware-specific hacks it relies on. For someone who just wants to play any of the top 20 most popular games on a system and isn't too concerned with minor deviations pretty much any emulator is going to suit their needs (in fact it's a common model to have hacks in the emulator just to make a particular popular game work right).

Narishma · on Nov 28, 2016

> It was a wonderful feeling to see these games as I imagine they were intended to be experienced

Did you mean the way they were not intended to be experienced? Otherwise it doesn't make sense. The way they were intended to be experienced is on a real console connected to a CRT TV.

0xcde4c3db · on Nov 28, 2016

It really depends on the game and what the designers and programmers considered ideal or non-ideal at that time. There might not even be unanimous opinion among the creators of what the ideal behavior would be; that's often seen in film, where an actor, director, and screenwriter all have somewhat different interpretations of a character. On one hand you have stuff like Cave shooters where they've tried to reproduce the slowdown in ports because it's expected in the genre and affects difficulty. On the other hand you have stuff like Shadow of the Colossus, which was almost certainly not intended to have extreme framerate drops.

We can objectively talk about various metrics of accuracy compared to the original hardware. Authorial intent is, pretty much by definition, a matter of opinion (and authors themselves, over the years, often change their ideas of what their intent was).

colanderman · on Nov 28, 2016

Well-said. For my experiment I took "authorial intent" to include:

* no framerate drops

* audio synthesized with band-limited step functions (including the "triangle" channel)

* video composed of band-limited scanlines

* video interrupt running at NTSC rate

* audio interrupt running at 240 Hz

* free-running audio synthesis

Like you said, it is entirely a matter of opinion what authorial intent is. For my purposes this definition provided an interesting experiment, and the result was aesthetically pleasing.

WorldMaker · on Nov 28, 2016

There's also the part of "authorial intent" where particularly at that time many of the games were designed and developed sometimes on much bigger hardware (corporate mainframes, for instance) targeting the consoles and then later QAed on development consoles (that might not even be the same hardware as production consoles). There certainly are questions for some games if the authors intended something much better that they could get their development hardware to perform. It's not that many console hardware generations back where even development consoles still varied in hardware from production consoles (at least as recently as the early history of the PS3/Xbox 360).

All told, it's all part of gaming's version of "the artwork is never quite finished/realized, it's just eventually published".

kLeeIsDead · on Nov 28, 2016

You still have the source?

userbinator · on Nov 28, 2016

> More sophisticated hardware doesn't necessarily have any single master clock in this sense, so it doesn't make much sense to talk about a "cycle-accurate" emulator of a modern PC, for example.

Even in a modern PC, many of the clocks are divided down using PLLs which maintain a fixed frequency and phase ratio, so it is theoretically possible to make such an accurate emulator, but it would be difficult to write and many orders of magnitude slower than the real hardware.

For an example of PC software which doesn't run correctly on anything but real hardware or really accurate cycle-emulation, see this amazing demo:

https://trixter.oldskool.org/2015/04/07/8088-mph-we-break-al...

blt · on Nov 28, 2016

It seems like the "cycle-accurate" emulator would be a lot easier to write, is this true?

andxor · on Nov 28, 2016

I'm the author.

I disagree with this statement. It takes a considerable amount of effort and research to reach cycle-accuracy, especially on the PPU side. Understanding how the PPU pipeline works will take you more time than everything else.

If you settle for frame-based emulation you can write the functions to emulate the opcodes without worrying about the order and duration of the internal operations. Once the opcode is emulated, you just add cycles to a global counter based on a table that contains the number of cycles per instruction.

After 29781 cycles (you may end up being a little off each time, if you are not cycle-accurate), you call a function to update the state of the PPU. There you can use familiar iterative constructs to perform the rendering.
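A minimal sketch of the frame-based approach just described, assuming a hypothetical `FrameEmulator` type. The cycle table is mostly placeholder: only four opcodes carry their documented 6502 timings, and the "PPU update" is reduced to a counter. This is not how LaiNES works (it is cycle-accurate); it only illustrates the simpler alternative.

```cpp
#include <array>
#include <cstdint>

constexpr int kCyclesPerFrame = 29781;  // figure quoted in the comment above

// Cycle cost per opcode. Only a few entries carry real 6502 timings;
// every other opcode gets a placeholder cost of 2 for this sketch.
std::array<std::uint8_t, 256> make_cycle_table() {
    std::array<std::uint8_t, 256> t;
    t.fill(2);
    t[0xEA] = 2;  // NOP
    t[0xA5] = 3;  // LDA zero page
    t[0xAD] = 4;  // LDA absolute
    t[0x4C] = 3;  // JMP absolute
    return t;
}

struct FrameEmulator {
    std::array<std::uint8_t, 256> cycles_for = make_cycle_table();
    int cycle_counter = 0;    // cycles accumulated in the current frame
    int frames_rendered = 0;  // stand-in for "call the PPU update function"

    // Account for one executed opcode; the opcode's actual effect is omitted.
    void execute(std::uint8_t opcode) {
        cycle_counter += cycles_for[opcode];
        if (cycle_counter >= kCyclesPerFrame) {
            cycle_counter -= kCyclesPerFrame;  // carry the overshoot forward
            ++frames_rendered;                 // render the whole frame at once
        }
    }
};
```

Because an instruction can straddle the frame boundary, the counter overshoots by a cycle or two, which is the "little off each time" mentioned above.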

Compare this with the state machine approach in my code (ppu.cpp) and how much more careful I have to be composing microinstructions to form opcodes (cpu.cpp). I came up with a (I think) clean design but it wasn't trivial.

0xcde4c3db · on Nov 28, 2016

I've only actually written one in the classic style. I doubt that cycle-accurate is easier in general. It definitely isn't if you count the effort that goes into discovering the timing information by running experiments on the original hardware. If you start with everything documented, I guess it might be the least effort needed to bridge the gap between 99.5% compatibility and 100%.

throwbsidbdk · on Nov 28, 2016

TLDR: each CPU instruction takes the exact amount of time it does on the real cpu.

You generally only care about cycles in real-time stuff. In simpler CPUs like those used for microwaves and such, each CPU instruction takes constant time. You can count the number of instructions in your assembler loop and know how long the loop will take. Mind you, each instruction takes constant time, but some take more time than others. A CPU cycle lasts 1/clockspeed. A fast instruction can take 1 cycle, others 3 or more.

Usually there's multiple cpu instructions for GOTO and loading from memory, a fast set for things "close by" and a slow set for far away code or data. Loops and conditional instructions even on simple cpus can take different amounts of time depending on the outcome, so it gets pretty complicated to time things sometimes.

So when doing really low level timing critical code, you can use a super accurate signal generator to drive your CPU clock, then count your instructions to time things. Common uses include generating audio from bit banging and similar.

In more complex CPUs a number of things make it difficult or impossible to figure out exactly how long an instruction will take. None of the software for these CPUs is written to depend on exact cycle timings. Because nothing depends on cycle accuracy you don't have to worry about the timings when emulating these kinds of CPUs.
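To make the cycle-counting idea for simple CPUs concrete, here is a small sketch that times a two-instruction delay loop. The 6502 is used purely as an example (`DEX` takes 2 cycles; `BNE` takes 3 when the branch is taken within the same page and 2 when it falls through). The function names are hypothetical.

```cpp
// Documented 6502 timings for the two instructions in the loop.
constexpr int kDexCycles = 2;          // DEX: decrement X
constexpr int kBneTakenCycles = 3;     // BNE, branch taken, no page crossing
constexpr int kBneNotTakenCycles = 2;  // BNE, branch not taken

// Cycles spent in `loop: DEX ; BNE loop` when X starts at `count` (1..255).
// The branch is taken count-1 times and falls through once.
constexpr int delay_loop_cycles(int count) {
    return count * kDexCycles
         + (count - 1) * kBneTakenCycles
         + kBneNotTakenCycles;
}

// Convert a cycle count to microseconds for a given clock frequency in Hz.
constexpr double cycles_to_microseconds(int cycles, double clock_hz) {
    return cycles * 1.0e6 / clock_hz;
}
```

For example, starting with X = 10 the loop takes 49 cycles, which is 49 microseconds on a 1 MHz clock. The taken and not-taken branch costs differ, which is the outcome-dependent timing mentioned above.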

khedoros1 · on Nov 28, 2016

For extra speed, a lot of emulators take shortcuts. So, you might have a "frame-accurate" emulator that doesn't match the state of a "real" Nintendo on a cycle-by-cycle basis, but does by the end of the frame. There are a limited number of things triggered in the hardware (interrupts, set register flags, etc), and sometimes, like when the CPU is in an idle loop, the emulator can just skip ahead by a bunch of instructions until the next thing that needs to be handled.

Or in the graphics processor, maybe you'll blit out whole sprites/tiles at once, instead of rendering them pixel-by-pixel, the way that the hardware does.

In a cycle-accurate emulator, you're going to run every opcode, process the graphics pixel-by-pixel, etc.

stonogo · on Nov 28, 2016

Some understanding of context is expected to grasp implications of software. I'm not sure a github readme is the place to provide a complete education about the domain of the repository.

In this case, a 'cycle' is a processor cycle. Most emulators are imperfect, usually due to speed considerations, or to lack of access/understanding of the specifications of the original hardware. A short codebase that is entirely accurate is an achievement.

pubby · on Nov 28, 2016

NES instructions are broken down into a series of smaller, simpler steps by the CPU. Each of these steps executes in a fixed amount of time - a cycle. For example, the "load value from memory into register A" instruction might take two steps: 1) fetch memory on the bus 2) move the fetched value into the register.

Cycle-accurate means the emulator is emulating all of the little steps that make up an individual instruction. Instruction-accurate emulators ignore the smaller steps and treat instructions as indivisible.
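One common way to express those per-cycle steps in code is to charge one cycle for every bus access, so the rest of the hardware can advance between the steps of a single instruction. The sketch below uses a hypothetical `Cpu` type with flat memory and shows `LDA` absolute, which takes four cycles on a 6502. It is an illustration of the idea, not the LaiNES implementation.

```cpp
#include <array>
#include <cstdint>

struct Cpu {
    std::array<std::uint8_t, 0x10000> mem{};  // flat 64 KiB memory, illustration only
    std::uint16_t PC = 0;                     // program counter
    std::uint8_t A = 0;                       // accumulator
    int cycles = 0;                           // CPU cycles elapsed
    int other_hw_steps = 0;                   // stand-in for PPU/APU work per cycle

    // One CPU cycle passes; other hardware gets a chance to advance.
    void tick() { ++cycles; ++other_hw_steps; }

    // Every bus read costs exactly one cycle.
    std::uint8_t read(std::uint16_t addr) { tick(); return mem[addr]; }

    // LDA absolute (opcode 0xAD) written as four one-cycle steps.
    void lda_absolute() {
        read(PC++);                    // cycle 1: fetch opcode
        std::uint8_t lo = read(PC++);  // cycle 2: fetch low address byte
        std::uint8_t hi = read(PC++);  // cycle 3: fetch high address byte
        A = read(static_cast<std::uint16_t>(lo | (hi << 8)));  // cycle 4: read operand
    }
};
```

An instruction-accurate emulator would instead perform the whole load in one go and add 4 to a counter afterwards, so nothing else could observe the intermediate cycles.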

Cycle-accurate NES emulators only really matter for emulating certain graphical and sound effects.

johndoe90 · on Nov 28, 2016

I guess it's about emulation accuracy. Some emulators fail to emulate correctly for the purpose of speed, which causes visual and sound artifacts.

This comes to the host CPU power. The more accurate the emulator, the more power from the host will be needed.

UPD: Found a wiki article: http://emulation-general.wikia.com/wiki/Emulation_Accuracy

posterboy · on Nov 28, 2016

I take that to mean that the timing of instruction execution is true to the original, hence no unintended glitches.

tlack · on Nov 28, 2016

People always say that terse code is hard to read and understand. I'd say I just learned a lot from a quick skim of the concise source here. If it had been structured with tons of white space and split over dozens of files/folders/modules, I'd have no chance of understanding it at a glance.

qwertyuiop924 · on Nov 28, 2016

Terse code is easy to read. Gratuitously compact code isn't.

The rule of thumb is that if it's short but in no way obfuscated, it's terse. If it's short, and impossible to read because of it, it's gratuitously compact.

posterboy · on Nov 28, 2016

It is in multiple files, but still very terse, e.g. using single-letter variables.

   /* CPU state */
   u8 ram[0x800];
   u8 A, X, Y, S;
   u16 PC;
   Flags P;
   bool nmi, irq;

I like it.

andxor · on Nov 28, 2016

I'm the author.

The reason for the single letters is that those are the actual names of the 6502 registers.

Glad you like it overall. :)

pawadu · on Nov 28, 2016

It's really nice that you keep it so close to the actual hardware (not just the variable names, but also the way the logic works in for example PPU).

I also love the way you use C++ templates to simplify things!

posterboy · on Nov 28, 2016

accumulator xchange ... y, something stacky perhaps?

program counter

flagPort

non maskable interrupt, interrupt request

I'm sure you knew that though, it's common micro processor nomenclature.

Two9A · on Nov 28, 2016

Specifically on 6502-derived processors, you have the following registers: Accumulator, X and Y indexes, Stack Pointer, Program Counter, Flags. Of these, only the Program Counter is double-width.

Additionally, the zeroth "page" of memory (the lowest 256 bytes) is accessible with a dedicated addressing mode which saves space and time in the program, and can be used as a form of cache.
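The space and time saving from zero-page addressing can be shown with a small sketch that picks an `LDA` encoding the way an assembler typically would. The opcodes, lengths and cycle counts are the documented 6502 values; the `Encoding` struct and `encode_lda` function are hypothetical.

```cpp
#include <cstdint>

struct Encoding {
    std::uint8_t opcode;  // first byte of the instruction
    int bytes;            // total instruction length
    int cycles;           // documented 6502 timing
};

// Choose the shorter zero-page form whenever the address fits in one byte.
constexpr Encoding encode_lda(std::uint16_t addr) {
    return addr < 0x100 ? Encoding{0xA5, 2, 3}   // LDA zero page
                        : Encoding{0xAD, 3, 4};  // LDA absolute
}
```

So `encode_lda(0x0010)` is one byte shorter and one cycle faster than `encode_lda(0x0200)`.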

serge2k · on Nov 28, 2016

the variables are named for the registers they represent, it's not single letter variables for the sake of compactness.

posterboy · on Nov 29, 2016

OK, that was an unfair pick, other variables have longer strings. I just picked the example because I thought it was cute how short the CPU status is, and because I couldn't remember what nmi meant at first. Trying to talk about it brought it back to mind, so that worked for me, but I blamed the project in vain, because the decision was made elsewhere. Those electronic engineers!! Let's argue about better names for x and y. Is anyone even reading this? I'm prolly not gonna come back to this either.

Graziano_M · on Dec 2, 2016

Absolutely agree. I'm writing an NES emulator myself and just glancing at this code has given me a few line-saving ideas.

qwertyuiop924 · on Nov 28, 2016

That is very cool.

I'm always kind of in awe of this sort of thing. Maybe I should try to do it. Should take some of the awe away.

But I should probably focus on sucking less, first.

andxor · on Nov 28, 2016

I'm the author.

Everything you need is on the NESdev wiki or it's linked there. A couple of particularly helpful resources are linked in my README. The PPU diagram and the 6502 reference were especially useful.

I definitely encourage you to do it, it's a great learning experience and very rewarding. When games start running, it's pure programming ecstasy :)

qwertyuiop924 · on Nov 28, 2016

Hey, thanks.

khedoros1 · on Nov 28, 2016

Do it. Emulation has always fascinated me, so I wrote a (kind of crappy) NES emulator a few years ago. It demystifies things, there's a ton of documentation out there, and you'll get the joy of increasing numbers of games working as you make progress.

bluedino · on Nov 28, 2016

Try making a small virtual machine first - https://github.com/tekknolagi/carp/blob/master/README.md

tekknolagi · on Nov 28, 2016

Whoa, that's my project. Wouldn't recommend looking at that one, but definitely take a look at its successor (linked on the README).

colanderman · on Nov 28, 2016

I recommend this reference quite highly: http://wiki.nesdev.com/w/index.php/NES_reference_guide

Jach · on Nov 28, 2016

Another interesting project to take away some magic: https://github.com/ssloy/tinyrenderer/wiki

Arnt · on Nov 28, 2016

I did it precisely in order to not suck. The courses at the university seemed too toy-like, so I decided to write something bigger and with a real goal. 7500 lines of 386 and then some Modula 2, and a bit of AWOL from university.

In my case the Sinclair Spectrum, which I did cycle-accurately on about a 25MHz 386 and fast enough to play Jetpac on the slowest 386 ever sold. Cycle accuracy really only helped with pitch-perfect sound.

It was fun, and pushing something from a blank sheet to a working program is... shall we call it a practical exercise in not sucking?

xem · on Nov 28, 2016

Starting with a chip8 emulator is also a good exercise. We made one in JS in less than 1kb for this 13kb games compilation: http://js13kgames.com/entries/26-games-in-1 (unminified version here: http://xem.github.io/chip8/c8.html)

adefa · on Nov 28, 2016

I agree! I started a chip8 emulator in Rust. Once I get video and input finished, I'm considering writing a Gameboy emulator.

https://github.com/TrevorS/rustychip8

plandis · on Nov 28, 2016

As someone currently writing an NES emulator, do it! There are a ton of references to help you figure everything out, and it's been a pretty interesting experience.

toast0 · on Nov 28, 2016

There's a lot of docs out there, and some test roms to help. A CPU emulator is actually pretty straightforward. Adding the PPU is more complex, of course.

rounce · on Nov 28, 2016

First I'd like to say this project represents a fantastically impressive achievement regardless.

However, putting this repo through `cloc` reveals the HN title to be rather misleading.

EDIT: I had initially read through {cpu,apu,ppu,gui,joypad,mapper}.{cpp,h} and noticed that the mental tally I was taking had run well over 1000LOC. In my haste I quickly cloned the repository and ran cloc against the current dir which massively inflated the result (Doh!). See author's comment below for a more sensible figure.

andxor · on Nov 28, 2016

Sorry, but that's really not fair. You are running that on all the folders inside the src/ directory (and the README?), including blargg's libraries that I'm using. I'm not counting that. Go ahead and include those if you want. It's code I didn't write, and bigger than the rest of the emulator! I don't think that's representative.

CPU and PPU implementations tend to be in the order of the thousands of lines -- they are around 200 and 300 lines respectively in LaiNES. In fact, most of the code is in the GUI that I could easily strip away if this was a competition. And this wasn't written to be small - it was written to be simple. It also came out small, but that's incidental.

Here's how I counted the lines and how I decided on the description for the repository, which by the way has been catapulted from totally unknown to worldwide attention overnight, and is now the object of unexpected, ruthless scrutiny that I couldn't foresee.

  [andrea@manhattan src]$ rm -rf boost nes_apu Sound_Queue.*
  [andrea@manhattan src]$ cloc .
        24 text files.
        24 unique files.                              
         1 file ignored.

  github.com/AlDanial/cloc v 1.70  T=0.03 s (780.3 files/s, 63170.2 lines/s)
  -------------------------------------------------------------------------------
  Language                     files          blank        comment           code
  -------------------------------------------------------------------------------
  C++                             11            210            110           1163
  C/C++ Header                    12             87              7            285
  -------------------------------------------------------------------------------
  SUM:                            23            297            117           1448
  -------------------------------------------------------------------------------

rounce · on Nov 28, 2016

I agree with this figure, please see my edited comment above.

Still though, my gripe with the HN title still stands: to say this is ~1000 LOC is a bit rich. It's still bloody small so why try to shoehorn it into this category?

> And this wasn't written to be small - it was written to be simple. It also came out small, but that's incidental.

I think that's why this is pure gold! Because it's simple, it's easy to understand. I would have been hopping with glee if this had been available to me as a teenager, instead of having to read tons of articles/textfiles of varying quality with lots of trial, error and head-scratching. Its size is beside the point, which is why the title is still - in my opinion - misleading. I can't help but feel the very people that would benefit the most from this might be put off by the thought that they're going to be presented with some indecipherable demo comp entry.

Either way keep up the good work.

andxor · on Nov 28, 2016

No problem, we are cool. I see your point.

I believe there is still a lot of room for improvement in terms of accuracy, clarity and code size. Note that this repository was more than 3 years old. Maybe this will motivate me to improve it even further.

rounce · on Nov 28, 2016

I think that'd be pretty cool to see. I'd be especially interested in how quickly the returns in size diminish given your starting point. Likewise, how much it would have to change architecturally as it gets smaller. Sounds like a slightly more extreme form of the game Shenzhen I/O.

While reading through the repo, one thing I kept thinking was that it would be nice, IMO, to decouple everything from the GUI so that it was a little 'flatter'. So `main` would call `NES::run()` (or something), and `NES` would leverage GUI. GUI would be just drawing stuff. (In my head at least,) it feels like that way it might be easier to mentally partition things, for those using it as a learning project, as `NES` would be responsible for the ownership and interop that GUI is doing now. I'll add it to my todo list and perhaps in god-knows-when I'll fork it and do this if you haven't gotten round to it :) Having said that, it's inspired me to finish my first Go project, which was a GameBoy emulator. It was started mainly to deep-dive the language, but I think it could be useful in a similar way if cleaned up and documented for folks.
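A rough sketch of what that layering might look like, to make the suggestion concrete. Everything here is hypothetical (`Gui`, `NES`, `CountingGui`, `emulate_frame`) and none of it is taken from the LaiNES sources; the emulation itself is replaced by a stub that writes a frame counter into the first pixel.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical front-end interface: the GUI only draws finished frames.
struct Gui {
    virtual ~Gui() = default;
    virtual void draw(const std::vector<std::uint32_t>& pixels) = 0;
};

// Hypothetical top-level object that owns the emulated hardware state
// and drives the GUI, rather than the other way around.
class NES {
public:
    explicit NES(Gui& gui) : gui_(gui), frame_(256 * 240, 0) {}

    // Emulate the requested number of frames, handing each one to the GUI.
    void run(int frames) {
        for (int i = 0; i < frames; ++i) {
            emulate_frame();
            gui_.draw(frame_);
        }
    }

private:
    // Stub: a real implementation would step the CPU, PPU and APU here.
    void emulate_frame() {
        ++frame_counter_;
        frame_[0] = frame_counter_;
    }

    Gui& gui_;
    std::vector<std::uint32_t> frame_;  // 256x240 output buffer
    std::uint32_t frame_counter_ = 0;
};

// A GUI stub that only records what it was asked to draw.
struct CountingGui : Gui {
    int frames = 0;
    std::uint32_t last_first_pixel = 0;
    void draw(const std::vector<std::uint32_t>& pixels) override {
        ++frames;
        last_first_pixel = pixels[0];
    }
};
```

With this split, `main` would construct a concrete GUI, pass it to `NES`, and call `run()`. A stub like `CountingGui` also lets the emulation core be exercised without opening a window.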

mentat · on Nov 28, 2016

I'm guessing that that figure is just for the emulation code. The UI and sound interface code eat up quite a few lines of code. Doesn't hurt to actually dig instead of karma farming...

rounce · on Nov 28, 2016

I had read the repo first and just from counting in my head it was deffo over 1000 lines. Turns out the author agrees, that it's almost 1500 LOC. I was a bit hasty (read: dumb) in generating the table is all :)

Maybe calm down with the "karma farming" accusations, my problem was with the HN title, not the repository.

andxor · on Nov 28, 2016

Also I'm not the author of the HN submission... ;)

anonbanker · on Nov 28, 2016

Any plans for a libretro port?

ggggtez · on Nov 28, 2016

Maybe if you inline entire functions...

userbinator · on Nov 28, 2016

Functions which are only "called" once, which I think is a perfectly reasonable thing to do.