mirror of
https://github.com/dingusdev/dingusppc.git
synced 2025-01-25 04:31:03 +00:00
zdocs: improve MMU emulation documentation.
This commit is contained in:
parent
212cd58f40
commit
40d0aa70da
23
zdocs/cpu/powerpc/mmu.md
Normal file
23
zdocs/cpu/powerpc/mmu.md
Normal file
@ -0,0 +1,23 @@
|
||||
## Disabling BAT translation
|
||||
|
||||
BAT translation can be disabled by invalidating BAT registers. This is somewhat CPU specific.
|
||||
MPC601 implements its own format for BAT registers that differs from the PowerPC specification.
|
||||
|
||||
MPC601-specific lower BAT registers has the "V" bit. If it's cleared, the corresponding BAT pair
|
||||
is invalid and won't be used for address translation. To invalidate BATs on MPC601, it's enough
|
||||
to write NULL to lower BAT registers. That's exactly what PowerMac 6100 ROM does:
|
||||
```
|
||||
li r0, 0
|
||||
mtspr ibat0l, r0
|
||||
mtspr ibat1l, r0
|
||||
mtspr ibat2l, r0
|
||||
```
|
||||
|
||||
PowerPC CPUs starting with 603 uses the BAT register format described in the PowerPC specification.
|
||||
The upper BAT registers contain two bits: Vs (supervisor state valid bit) and Vp (problem/user state valid bit).
|
||||
PowerPC Architecture First Edition from 1993 gives the following code:
|
||||
|
||||
```BAT_entry_valid = (Vs & ~MSR_PR) | (Vp & MSR_PR)```
|
||||
|
||||
If neither Vs nor Vp is set, the corresponding BAT pair isn't valid and doesn't participate in address translation.
|
||||
To invalidate BATs on non-601, it's sufficient to set the upper BAT register to 0x00000000.
|
107
zdocs/cpu/powerpc/mmuemu.md
Normal file
107
zdocs/cpu/powerpc/mmuemu.md
Normal file
@ -0,0 +1,107 @@
|
||||
# PowerPC Memory Management Unit Emulation
|
||||
|
||||
Emulation of a [memory management unit](https://en.wikipedia.org/wiki/Memory_management_unit)
|
||||
(MMU) in a full system emulator is considered a hard task. The biggest challenge is to do it fast.
|
||||
|
||||
In this article, I'm going to present a solution for a reasonably fast emulation
|
||||
of the PowerPC MMU.
|
||||
|
||||
This article is based on ideas presented in the paper "Optimizing Memory Emulation
|
||||
in Full System Emulators" by Xin Tong and Motohiro Kawahito (IBM Research Laboratory).
|
||||
|
||||
## PowerPC MMU operation
|
||||
|
||||
The operation of the PowerPC MMU can be described using the following pseudocode:
|
||||
|
||||
```
|
||||
VA is the virtual address of some memory to be accessed
|
||||
PA is the physical address of some memory translated by the MMU
|
||||
AT is access type we want to perform
|
||||
|
||||
if address translation is enabled:
|
||||
PA = block_address_translation(VA)
|
||||
if not PA:
|
||||
PA = page_address_translation(VA)
|
||||
else:
|
||||
PA = VA
|
||||
|
||||
if access_permitted(PA, AT):
|
||||
perfom_phys_access(PA)
|
||||
else:
|
||||
generate_mmu_exception(VA, PA, AT)
|
||||
```
|
||||
|
||||
A HW MMU usually performs several operations in a parallel fashion so the final
|
||||
address translation and memory access only take a few CPU cycles.
|
||||
The slowest part is the page address translation because it requires accessing
|
||||
system memory that is usually an order of magnitudes slower than the CPU.
|
||||
To mitigate this, a PowerPC CPU includes some very fast on-chip memory used for
|
||||
building various [caches](https://en.wikipedia.org/wiki/CPU_cache) like
|
||||
instruction/data cache as well as
|
||||
[translation lookaside buffers (TLB)](https://en.wikipedia.org/wiki/Translation_lookaside_buffer).
|
||||
|
||||
## PowerPC MMU emulation
|
||||
|
||||
### Issues
|
||||
|
||||
An attempt to mimic the HW MMU operation in software will likely have a poor
|
||||
performance. That's because modern hardware can perform several tasks in parallel.
|
||||
However, software has to do almost everything serially. Thus, accessing some memory
|
||||
with address translation enabled can take up to 300 host instructions! Considering
|
||||
the fact that every 10th instruction is a load and every 15th instruction is a store,
|
||||
it will be nearly impossible to achieve a performance comparable to that of the
|
||||
original system.
|
||||
|
||||
Off-loading some operations to the MMU of the host CPU for speeding up emulation
|
||||
isn't feasible because Apple's computers often have hardware being accessed like an
|
||||
usual memory. Thus, an emulator needs to distinguish between accesses to real memory
|
||||
(ROM or RAM) from accesses to memory-mapped peripheral devices. The only way to
|
||||
do that is to maintain special software descriptors for each virtual memory region
|
||||
and consult them on each memory access.
|
||||
|
||||
### Solution
|
||||
|
||||
My solution for a reasonable fast MMU emulation employs a software TLB. It's
|
||||
used for all memory accesses even when address translation is disabled.
|
||||
|
||||
The first stage of the SoftTLB uses a
|
||||
[direct-mapped cache](https://en.wikipedia.org/wiki/Cache_placement_policies#Direct-mapped_cache)
|
||||
called **primary TLB** in my implementation. That's because this kind of cache
|
||||
is the fastest one - one lookup requires up to 15 host instructions. Unfortunately,
|
||||
this cache is not enough to cover all memory accesses due to a high number of
|
||||
collisions, i.e. when several distinct memory pages are mapped to the same cache
|
||||
location.
|
||||
|
||||
That's why, the so-called **secondary TLB** was introduced. Secondary TLB is a
|
||||
4-way fully associative cache. A lookup in this cache is slower than a lookup in the
|
||||
primary TLB. But it's still much faster than performing a full page table walk
|
||||
requiring hundreds of host instructions.
|
||||
|
||||
All translations for memory-mapped I/O go into the secondary TLB because accesses
|
||||
to such devices tend to be slower than real memory accesses in the real HW anyways.
|
||||
Moreover, they usually bypass CPU caches (cache-inhibited accesses). But there
|
||||
are exceptions from this rule, for example, video memory.
|
||||
|
||||
When no translation for a virtual address was found in either cache, a full address
|
||||
translation including the full page table walk is performed. This path is the
|
||||
slowest one. Fortunately, the probabilty that this path will be taken seems to be
|
||||
very low.
|
||||
|
||||
The complete algorithm looks like that:
|
||||
```
|
||||
VA is the virtual address of some memory to be accessed
|
||||
PA is the physical address of some memory translated by the MMU
|
||||
AT is access type we want to perform
|
||||
|
||||
PA = lookup_primary_tlb(VA)
|
||||
if VA in the primary TLB:
|
||||
perform_memory_access(PA, ART)
|
||||
else:
|
||||
PA = lookup_secondary_tlb(VA)
|
||||
if VA not in the secondary TLB:
|
||||
PA = perform_full_address_translation(VA)
|
||||
refill_secondary_tlb(VA, PA)
|
||||
if is_real_memory(PA):
|
||||
refill_primary_tlb(VA, PA)
|
||||
perform_memory_access(PA, ART)
|
||||
```
|
Loading…
x
Reference in New Issue
Block a user