82 lines
3.6 KiB
Markdown
82 lines
3.6 KiB
Markdown
|
Here's an overview of v86's workings. For details, check the
|
||
|
[source](https://github.com/copy/v86/tree/master/src).
|
||
|
|
||
|
The major limitations of WebAssembly are (for the purpose of making emulators with jit):
|
||
|
|
||
|
- structured control flow (no arbitrary jumps)
|
||
|
- no control over registers (you can't keep hardware registers in wasm locals across functions)
|
||
|
- no mmap (paging needs to be fully emulated)
|
||
|
- no patching
|
||
|
- module generation is fairly slow, but at least it's asynchronous, so other things can keep running
|
||
|
- there is some memory overhead per module, so you can't generate more than a few thousand
|
||
|
|
||
|
v86 has an interpreted mode, which collects entry points (targets of function
|
||
|
calls and indirect jumps). It also measures the hotness per page, so that
|
||
|
compilation is focused on code that is often executed. Once a page is
|
||
|
considered hot, code is generated for the entire page and up to `MAX_PAGES`
|
||
|
that are directly reachable from it.
|
||
|
|
||
|
v86 generates a single function with a big switch statement (brtable), to
|
||
|
ensure that all functions and targets of indirect jumps are reachable from
|
||
|
other modules. The remaining control flow is handled using the "stackifier"
|
||
|
algorithm (well-explained in
|
||
|
[this blog post](https://medium.com/leaningtech/solving-the-structured-control-flow-problem-once-and-for-all-5123117b1ee2)).
|
||
|
At the moment, there is no linking of wasm modules. The current module is
|
||
|
exited, and the main loop detects if a new module can be entered.
|
||
|
|
||
|
In practice, I found that browsers don't handle this structure (deep brtables,
|
||
|
with locals being used across the entire function) very well, and `MAX_PAGES`
|
||
|
has to be set to fairly low, otherwise memory usage blows up. It's likely that
|
||
|
improvements are possible (generating fewer entry points, splitting code across
|
||
|
multiple functions).
|
||
|
|
||
|
Code-generation happens in two passes. The first pass finds all basic block
|
||
|
boundaries, the second generates code for each basic block. Instruction
|
||
|
decoding is generated by a [set of
|
||
|
scripts](https://github.com/copy/v86/tree/master/gen) from a [table of
|
||
|
instructions](https://github.com/copy/v86/blob/master/gen/x86_table.js). It's
|
||
|
also used to [generate
|
||
|
tests](https://github.com/copy/v86/blob/master/tests/nasm/create_tests.js).
|
||
|
|
||
|
To handle paging, v86 generates code similar to this (see `gen_safe_read`):
|
||
|
|
||
|
```
|
||
|
entry <- tlb[addr >> 12 << 2]
|
||
|
if entry & MASK == TLB_VALID && (addr & 0xFFF) <= 0xFFC - bytes: goto fast
|
||
|
entry <- safe_read_jit_slow(addr, instruction_pointer)
|
||
|
if page_fault: goto exit-with-pagefault
|
||
|
fast: mem[(entry & ~0xFFF) ^ addr]
|
||
|
```
|
||
|
|
||
|
There is a 4 MB cache that acts like a tlb. It contains the physical address,
|
||
|
read-only bit, whether the page contains code (in order to invalidate it on
|
||
|
write) and whether the page points to mmio. Any of those cases are handled in
|
||
|
the slow path (`safe_read_jit_slow`), as well as walking the page tables and
|
||
|
triggering page faults. The fast path is taken in the vast majority of times.
|
||
|
|
||
|
The remaining code generation is mostly a straight-forward, 1-to-1 translation
|
||
|
of x86 to wasm. The only analysis done is to optimise generation of condional
|
||
|
jumps immediately after arithmetic instructions, e.g.:
|
||
|
|
||
|
```
|
||
|
cmp eax, 52
|
||
|
setb eax
|
||
|
```
|
||
|
|
||
|
becomes:
|
||
|
|
||
|
```
|
||
|
... // code for cmp
|
||
|
eax <- eax < 52
|
||
|
```
|
||
|
|
||
|
A lazy flag mechanism is used to speed arithmetic (applies to both jit and
|
||
|
interpreted mode, see
|
||
|
[`arith.rs`](https://github.com/copy/v86/blob/master/src/rust/cpu/arith.rs) and
|
||
|
[`misc_instr.rs`](https://github.com/copy/v86/blob/master/src/rust/cpu/misc_instr.rs)).
|
||
|
There's a wip that tries to elide most lazy-flags updates:
|
||
|
https://github.com/copy/v86/pull/466
|
||
|
|
||
|
FPU instructions are emulated using softfloat (very slow, but unfortunately
|
||
|
some code relies on 80 bit floats).
|