                           OS Game Techniques
                           ==================

	(c) Copyright 1992-1999 Amiga, Inc.  All Rights Reserved


Reasons to use the OS for games:

o	Why reinvent the wheel? Spend your time doing things that only you can
	do.

o	Compatibility with future chipsets. For instance, some planned future
	chipsets are not register-level compatible with AA.

o	Easier adaptation to future hardware. For instance, it takes
	less time to convert a 16 color ECS game which uses the OS into
	a 256 color AGA game than it does to convert a hardware-banging
	ECS game.

o	RTG compatibility possible for some games.

o	The OS automatically supports pre-ECS, ECS,
	AA, and future chips.

o	Easier integration with other system components (CD-ROM, networks,
	serial ports, etc).

o	Easy hard-disk install.

o	Less code to write. The OS has routines for handling screen positioning,
	scrolling, mouse movement, etc., which means less development time. You
	can spend more time making the game more playable and less on getting
	the hardware to work.

o	More robustness. For instance, the OS floppy-disk code is far less
	picky about drive parameters than 99% of custom floppy i/o code.

o	Hides bugs and quirks of the chipset.  The AA chip set has a few bugs which
	the OS hides from you.

o	The code runs out of ROM, which is faster than running the code out of
	CHIP RAM.

o	Multiple platforms. OS code will run on all Amiga-based machines,
	whatever their flavour.

o	Tools that help you debug your code (e.g. Mungwall and Enforcer) rely
	on the OS being around.

o	A properly written game can be promoted, and thus work on cheap VGA
	monitors.


Things the OS can't currently do:

o	Scroll individual scanlines of a viewport

o	AA colour copperlist fades

o	Dynamically update user copper lists

	All these are planned to be addressed in future OS releases.
One of our goals is to make it possible to perform as many Amiga tricks
as possible in normal Intuition screens.

ECS-AA incompatibilities that the OS handles:

o	Vertical counter behaves differently in programmable beam modes.

o	No SuperHires color scrambling.

o	Bitplane alignment problems.


Future envisioned chip changes that the OS will handle correctly:

o	Chips with no fetch-mode selections. All selections automatic.

o	Different DDFSTART/STOP calculations.

o	Color loading differences

o	Exact copper timing differences

o	No SuperHires

o	Multiple blitters


Game programming problems and solutions:

Q:	What if the graphics rendering routines are much slower than my own
	blitter code?

A:	Use the blitter yourself. Call OwnBlitter(), do setup, call WaitBlit(),
	poke the blitter registers, and then call DisownBlitter() when all blits
	are done.

	Note: OwnBlitter() is only 3 instructions (counting rts) when no-one
	else is trying to use the blitter.


Q:	What if input.device eats too many cycles?

A:	Install a high priority input handler which chokes off all events. This
	handler is also a convenient way to get keys and mouse events yourself.
	Simply store the raw keypresses and mouse moves in your own variables.

Q:	How do I change both bitmap pointers and colors in sync?

A:	Use a user copper list to cause a copper interrupt on line 0 of your
	viewport. The copper interrupt handler will signal a high-priority
	task which calls LoadRGB32() and ChangeVPBitMap() (or ScrollVPort()) to
	cause the changes. This allows perfect 60Hz animation on an A1200,
	even while moving the mouse as fast as possible and inserting floppy
	disks.

	Under 3.0, you can also do this in an exclusive screen. You can
	tell if it was your screen which caused the copper interrupt by
	checking the flag VP_HIDE in your ViewPort->Modes.

	
Q:	I need to use the blitter in an interrupt driven manner instead of
	polling it for completion. Aren't the QBlit routines too slow?

A:	The QBlit/QBSBlit system was completely re-written for 3.0, and now
	has quite low overhead.

Q:	How do I determine elapsed time in my game?

A:	A simple, low-overhead way to determine elapsed time is to call
	ReadEClock(). It fills in a 64-bit timer value which counts EClocks, and
	returns the number of EClocks per second as its result. If you use these
	results properly, you can ensure that your game runs at the proper speed
	regardless of CPU type, chip speed, or PAL/NTSC clocking.
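
	The arithmetic behind this answer can be sketched in portable C. This is
only an illustration: the structure EClockSample and the functions below are
hypothetical stand-ins (not the timer.device definitions), and the samples are
assumed to have been filled in from two ReadEClock() calls along with the
ticks-per-second value it returned.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a 64-bit EClock reading, split into two
   32-bit halves. Not the real timer.device structure. */
struct EClockSample {
    uint32_t hi;    /* upper 32 bits of the tick counter */
    uint32_t lo;    /* lower 32 bits of the tick counter */
};

/* Join the two halves into one 64-bit tick count. */
static uint64_t eclock_to_u64(const struct EClockSample *s)
{
    return ((uint64_t)s->hi << 32) | s->lo;
}

/* Microseconds between two samples, given the ticks-per-second value.
   Assumes the span is short enough that ticks * 1000000 fits in 64 bits,
   which holds for any realistic time between game frames. */
static uint64_t elapsed_micros(const struct EClockSample *start,
                               const struct EClockSample *end,
                               uint32_t ticks_per_sec)
{
    assert(ticks_per_sec != 0);
    uint64_t ticks = eclock_to_u64(end) - eclock_to_u64(start);
    return ticks * 1000000u / ticks_per_sec;
}

/* Scale a speed given in units per second by the elapsed time, so that
   objects move the same distance per second on any machine. */
static int32_t scaled_step(int32_t units_per_sec, uint64_t micros)
{
    return (int32_t)((int64_t)units_per_sec * (int64_t)micros / 1000000);
}
```

	A game loop would take one sample per frame, compute the elapsed time since
the previous sample, and move each object by scaled_step() instead of by a
fixed amount per frame.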


A1200 speed issues:

	The A1200 has a fairly large number of wait-states when accessing chip-ram.
ROM is zero wait-states. Due to the slow RAM speed, it may be better to use
calculations for some things that you might have used tables for on the A500.

	Add-on RAM will probably be faster than chip-ram, so it is worth segmenting
your game so that parts of it can go into fast-ram if available.

	For good performance, it is critical that you code your important loops
to execute entirely from the on-chip 256-byte cache. A straight line loop 258 bytes
long will execute far slower than a 254 byte one.

	The '020 is a 32-bit chip. Longword accesses will be twice as fast when they
are aligned on a longword boundary. Aligning the entry points of routines on
32-bit boundaries can also help. You should also make sure that the stack is
always longword aligned.
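
	As a rough illustration of the alignment advice above, the following C
helpers round sizes and pointers up to a longword boundary. The names are
hypothetical and the code is a sketch, not part of any OS interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LONG_ALIGN 4u   /* a longword is 4 bytes */

/* Round a byte count up to the next multiple of 4. */
static size_t align_up_long(size_t n)
{
    return (n + (LONG_ALIGN - 1u)) & ~(size_t)(LONG_ALIGN - 1u);
}

/* Nonzero if the pointer sits on a longword boundary. */
static int is_long_aligned(const void *p)
{
    return ((uintptr_t)p & (LONG_ALIGN - 1u)) == 0;
}

/* Advance a pointer to the next longword boundary. The caller must have
   at least 3 spare bytes in the buffer for the adjustment. */
static void *align_ptr_long(void *p)
{
    assert(p != NULL);
    uintptr_t a = (uintptr_t)p;
    return (void *)((a + (LONG_ALIGN - 1u)) & ~(uintptr_t)(LONG_ALIGN - 1u));
}
```

	Padding structure sizes with align_up_long() keeps every element of an
array of structures on a longword boundary, provided the array itself starts
on one.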

	Write-accesses to chip-ram incur wait-states. However, other processor 
instructions can execute while results are being written to memory:

	move.l	d0,(a0)+	; store x coordinate
	move.l	d1,(a0)+	; store y coordinate
	add.l	d2,d0		; x+=deltax
	add.l	d3,d1		; y+=deltay

	will be slower than:

	move.l	d0,(a0)+	; store x coordinate
	add.l	d2,d0		; x+=deltax
	move.l	d1,(a0)+	; store y coordinate
	add.l	d3,d1		; y+=deltay
		
	The 68020 adds a number of enhancements to the 68000 architecture, including
new addressing modes and instructions. Some of these are unconditional speedups, while
others only sometimes help:

	Addressing modes:

o	Scaled Indexing. The 68000 addressing mode (disp,An,Dn) can have a scale
	factor of 2, 4, or 8 applied to the data register on the 68020. This is
	totally free in terms of instruction length and execution time. An example is:

	68000			68020
	-----			-----
	add.w	d0,d0		move.w	(0,a1,d0.w*2),d1
	move.w	(0,a1,d0.w),d1

o	16 bit offsets on An+Rn modes. The 68000 only supported 8 bit displacements
	when using the sum of an address register and another register as a memory
	address. The 68020 supports 16 bit displacements. This costs one extra cycle
	when the instruction is not in cache, but is free if the instruction is in
	cache. 32 bit displacements can also be used, but they cost 4 additional clock
	cycles.

o	Data registers can be used as addresses. (d0) is 3 cycles slower than (a0),
	and it only takes 2 cycles to move a data register to an address register,
	but this can help in situations where there is not a free address register.

o	Memory indirect addressing. These instructions can help in some circumstances
	when there are not any free registers to load a pointer into. Otherwise,
	they lose.

	New instructions:

o	Extended precision divide and multiply instructions. The 68020 can perform
	32x32->32 and 32x32->64 multiplication, and 32/32 and 64/32 division. These
	are significantly faster than the multi-precision operations which are
	required on the 68000.

o	EXTB.L. Sign extend byte to longword. Faster than the equivalent EXT.W EXT.L
	sequence on the 68000.

o	Compare immediate and TST work in program-counter relative mode on the 68020.

o	Bit field instructions. BFINS inserts a bitfield, and is faster than 2 MOVEs
	plus an AND and an OR. This instruction can be used nicely in fill routines
	or text plotting. BFEXTU/BFEXTS can extract and optionally sign-extend a bitfield
	on an arbitrary boundary. BFFFO can find the highest order bit set in a field.
	BFSET, BFCHG, and BFCLR can set, complement, or clear up to 32 bits at arbitrary
	boundaries.


o	On the 020, all shift instructions execute in the same amount of time,
	regardless of how many bits are shifted. Note that ASL and ASR are slower
	than LSL and LSR. The break-even point on ADD Dn,Dn versus LSL is at
	two shifts.
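
	To make the bit field instructions described above more concrete, here is
a small C model of what BFEXTU, BFEXTS and BFINS compute on a single 32-bit
value. It is a sketch under stated assumptions: the function names are
hypothetical, offsets count from the most significant bit as on the 68020,
and fields that wrap around or span memory longwords are not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Mask with the low 'width' bits set (width is 1..32). */
static uint32_t field_mask(unsigned width)
{
    return width == 32 ? 0xFFFFFFFFu : ((1u << width) - 1u);
}

/* Model of BFEXTU: extract an unsigned field. Offset 0 is bit 31. */
static uint32_t bf_extract_u(uint32_t value, unsigned offset, unsigned width)
{
    assert(width >= 1 && width <= 32 && offset + width <= 32);
    unsigned shift = 32u - offset - width;
    return (value >> shift) & field_mask(width);
}

/* Model of BFEXTS: extract a field and sign-extend it. */
static int32_t bf_extract_s(uint32_t value, unsigned offset, unsigned width)
{
    uint32_t field = bf_extract_u(value, offset, width);
    uint32_t sign = 1u << (width - 1u);
    return (int32_t)((int64_t)(field ^ sign) - (int64_t)sign);
}

/* Model of BFINS: replace a field with the low 'width' bits of src.
   This is the AND/OR sequence that the single instruction replaces. */
static uint32_t bf_insert(uint32_t dest, uint32_t src,
                          unsigned offset, unsigned width)
{
    assert(width >= 1 && width <= 32 && offset + width <= 32);
    unsigned shift = 32u - offset - width;
    uint32_t mask = field_mask(width) << shift;
    return (dest & ~mask) | ((src << shift) & mask);
}
```

	For example, bf_insert(0x12345678, 0xAB, 8, 8) replaces the second byte
from the top and gives 0x12AB5678.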


Hardware resources:

1) Blitter.
	Use OwnBlitter()/DisownBlitter() to claim and relinquish ownership of
	the blitter.

	YOU MUST USE THE GRAPHICS.LIBRARY WAITBLIT(). This is as fast as
	possible, uses no CPU registers, and knows about blitter bugs. You
	cannot possibly write one that is more efficient and works on all
	Amigas.

2) Copper.
	If you really have to take over the copper, call LoadView(NULL),
	do 2 WaitTOF()s, and then install your own copperlists in the cop1lc/cop2lc
	registers. I do not recommend this though. Future chipsets may have
	faster and more efficient coppers with 32 bits, and we will
	want to use these. If you load the old copper registers behind
	graphics' back, we have no way of switching back to the old 16-bit mode.

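	temp=GfxBase->ActiView;		/* remember the system's current view */
	LoadView(NULL);			/* remove the system display */
	WaitTOF();			/* wait two frames for the change */
	WaitTOF();			/* to take effect */
	/* custom.cop1lc = ??? */	/* install your own copper list here */

	...

	WaitTOF();
	WaitTOF();
	LoadView(temp);			/* restore the system's view */
	custom.cop1lc=GfxBase->copinit;	/* point the copper back at the system list */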

3) Audio.
	Use the Audio device. There are functions to change the volume, period,
	frequency, data etc that is played on any of the channels. If you must
	hit the audio hardware, you can ask for the channel you need with the
	highest priority (127), and the audio channel will never be stolen from
	you until you give the channel back to the system.

4) Timers.
	Use the timer device. Some of the timer.device functions work as
	library calls, and so are easy to use. This allows you to stay compatible
	should we, say, use a third CIA timer.

	The vertical blank can be used as a special low-frequency timer. See
	below.

	CIA timers can be allocated via the resource allocation calls. The 
	"Resources" chapter of the V37 RKM: Devices manual has a good example.


5) Input.
	Input will usually come from keyboard, mouse, joystick, infra-red etc.
	Mouse and joystick can be easily read from the hardware. Keyboard input
	could come from the keyboard.device, which knows how to handle keyboard
	timings, but it is far easier to open an Intuition window and read
	either RAWKEY or VANILLAKEY IDCMP messages. These give either the raw
	key number pressed, or the character that the keypress represents
	(useful for international games).

6) Interrupts.
	Set up interrupt servers with high priority. Your server will
	then be the first called.

7) Disk drives.
	Just use dos.library. It's so much easier, works on all possible
	drives, past, present and future, and makes software so much more friendly
	to the user. Floppy based copy protection can be accomplished by
	allocating the blitter and inhibiting the drive while checking for
	the key track.

Do's and Don'ts:


o	DO clear unused bits when writing, and mask out unused or unneeded bits
	when reading.

o	DON'T use timing loops. The reasons should be obvious.

o	DON'T write self-modifying code unless you know how instruction caches
	work.

o	DON'T steal memory. You can always call CloseWorkbench().

o	If you are hardware banging, don't assume anything about the initial
	contents of the display registers when your program is started.
	Initialize everything.

o	If using ViewPorts, be sure to have a properly allocated ViewPortExtra.
	Some graphics calls are faster when one is present.

o	DO NOTE WELL THE WARNING AROUND THE COPINIT STRUCTURE.
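
	The first rule in the list above (clear unused bits when writing, mask
them out when reading) might look like this in C. The register layout is
purely hypothetical: it assumes a 16-bit register in which only bits 0-5
are defined.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: only bits 0-5 of this register are defined. */
#define DEFINED_BITS 0x003Fu

/* Reading: ignore whatever the undefined bits happen to contain. */
static uint16_t read_field(uint16_t raw)
{
    return (uint16_t)(raw & DEFINED_BITS);
}

/* Writing: never set bits that have no defined meaning today, since a
   future chip revision may give them one. */
static uint16_t make_write_value(uint16_t wanted)
{
    uint16_t value = (uint16_t)(wanted & DEFINED_BITS);
    assert((value & (uint16_t)~DEFINED_BITS) == 0);
    return value;
}
```

	Code that compares a raw register value against a constant without masking
may work on current chips and break when a previously unused bit starts
reading back as 1.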


CPU Differences:

o	Caches.

o	Copyback and write-through modes.

o	Access to CHIP RAM.

o	'020, '030, '040 instructions and effective addressing modes.

o	MMUs and FPUs