Notes on the WDC 65c816

And why it isn’t more popular

Architecture

If you’re not familiar with the NMOS 6502, this article may be a bit over your head. There are plenty of great introductions around, and I would recommend Easy6502 JS. Have fun!

As a refresher, the 6502 had these registers:

Register name	6502
A - Accumulator	8 bits
X - Index	8 bits
Y - Index	8 bits
S - Stack	8 bits
PC - Program Counter	16 bits
P - Flags	8 bits

As you can see, not a lot of registers were added. In point of fact, the main problem in programming this processor is probably the lack of registers, and I’ll explain why.

Juggling registers under pressure

Register pressure is a term used to describe when you want to have a lot of variables in your program at one time, and how they fill up your available registers. On contemporary 16/32-bit designs like the Motorola 68000, people were already getting luxurious access to 8 data registers and 7 general-use address registers. You can do arithmetic with them however you please, and life is easy!

On 6502-derived architectures, however, you have just the Accumulator. Want to store a value to memory? Load it into the Accumulator and then store that to memory. Want to add two numbers? Load one of the operands into that single Accumulator. Want to set your Data Bank register? Push the Accumulator to stack (making sure you’re in the correct processor mode first), then pull the DBR from the stack. Want to increment X by something other than 1? Save your Accumulator, Transfer X to A, add whatever number, transfer A to X, and load back your accumulator.

It gets so that programming this processor feels like a juggling act. It is eased a little bit by the “Direct Page” addressing mode, which gives you a bunch of easily-addressible variables in a convenient, fast, and space-efficient format, with lots of operations available on them, but that’s nothing that can’t be done on the Motorola 68000 or equivalents.

I recently came up with a very good example of how much a problem this can be, and I’ll be writing a post about this in more depth. To get “Hello, world!” to render on a very simple, monochrome bitmap display, the first pass at the algorithm I worked on took around 180,000 cycles, or about 108 cycles per pixel. After spending a few hours re-factoring the code, I was eventually able to get that down to 25,000 cycles, or about 15 cycles per pixel, by exploiting instructions with faster addressing modes, and quirks in the architecture.

With that said, it’s not all bad. It’s an improvement over the 6502. One major optimization was to move from 8-bit memory reads and writes, to 16-bit ones, when drawing to the screen.

Separating memory for the…loss?

Almost as bad is the segmented memory structure and variety of addressing modes. You may have noticed in the table above, that the PC, S, X, Y, and D registers are all 16 bits. How do you address all the memory in the 24-bit address space?

The 65816 segments memory into 64k “banks.” 64K is how much RAM is addressable by a 16-bit register with an 8-bit data bus.

You have three banks of memory you’ll be easily working with. The zero-bank is hard-coded into many instructions, and is very useful. It’s just Bank 0, the first 64k of memory. The PBR points to the bank that the PC will fetch from, and the DBR points to a bank that some instructions will operate on. However, due to the opcode encoding, insisting on only using 1 byte for every complete opcode, many useful operations are not available to work on data where it’s convenient in memory.

The usual practice then, becomes indirect long pointers. You want to work on data in two different banks of memory? Store a 3-byte “long” pointer to one in the Zero Bank, and use an Indirect Long addressing mode to read the address to perform the operation you want to. Of course, this will add 3-4 cycles to your operation, but who’s counting cycles?

Counting cycles, modes, and bugs.

Literally everyone counts cycles, because early processors from WDC had a bug in them. The traces for the REP and SEP instructions were too long, so if you clocked the processor much faster than about 3.5MHz, using them would cause serious issues.

WDC fixed this with later revisions, but it was too late for the poor SNES. Some carts have an “SA-1” enhancement chip, and this is just a newer revision 65c816 clocked around 10.74MHz.

REP and SEP are extremely vital instructions, too. They allow you to set and clear bits on the P flag. There are two specific flags that are very frequently set and cleared in this matter: M and X.

When M is set to 1, operations that deal with the Accumulator and Memory (such as incrementing a value in memory, or loading or storing from accumulator, or adding, etc.) are operated on as 8 bits. When M is set to 0, 16-bit mode is used. Want to make an 8-bit write to a memory location? You’ll be changing that flag.

The X flag is similar, but affects the X and Y registers. When set to 8 bits, their MSBs are cleared, leaving them unable to index more than 256 values. I’m actually not too sure of the use for this, outside of emulation mode.

Wait, emulation?

Remember that * in the field about the P flag? That’s where this comes in. You see, the 65c816 has a “hidden flag” called E, accessed through an instruction that swaps it out with another bit inside P. The 65816 boots up in 6502 Emulation mode. In this mode, it supports all 6502 opcodes you know and love from programming on the NES. It even adjusts the timings of many instructions in order to match the 6502. As I’m sure you can imagine, this is a LOT of fun to emulate. All you need to do to get out into native mode is set that E to 0.

This emulation mode worked so well that a pin-compatible version - the 65802, which is basically just a 65816 with only 16 address lines - was offered as an in-place upgrade for many home computers. As long as programs didn’t rely on undocumented opcodes, they would work almost exactly the same, and other programs could take advantage of the added features.

Lack of instructions

One last, perhaps fatal flaw in the 65c816 is a lack of hardware support for multiply, divide, remainder, and barrel shifter operations. This was such a large oversight that in the Ricoh 5A22 - a custom chip made for Nintendo, which included a 65816, as well as some additional timers and other functions - included a separate hardware multiplier/divider that was accessible from game code. Without it, an optimized 8x8-bit multiply could take around 80 cycles, and divisions could take 151 cycles. For reference, the 65816 only has about 170-220 cycles per scanline in an SNES.

It only supports shift left or right by 1 bit at a time, too, making a 9-bit shift require a minimum of 9 minimum-2-cycle instructions, or a general-purpose loop that takes even longer.

It may not be the best processor for the time, but…

The 6502 was designed at a time when microprocessor costs were astronomical, and it removed a lot of important features in order to get the cost down. The 65816 didn’t really add them back, it just expanded it to work with more addressable RAM and 16-bit basic arithmetic.

It was, however, familiar for developers who worked on a system with a 6502, which was a VERY large amount of developers. You could jump right in in Emulation mode, and the new instructions mostly worked, and if you wanted the newer features, that’s just an XCE away.

With all this said, I do have a soft spot in my heart for this processor. It may have been a victim of very outdated design decisions, but a lot of good software was written for it on the SNES and Apple IIGS. I would rather use it than a 6502 if I were working on a homebrew PC.

Also, it’s JUST simple enough that you can accurately, fully emulate the pin states for each cycle, even in JavaScript. I’ll be writing another post on what that looks like in the near future, or you can check out the (admittedly very badly formatted) code at my JSMoo repository.

Blog Archive

Archive of all previous blog posts

Emulating in Javascript, and also, emulating the 65c816 with cycle accuracy and speed