This changes registers to be temporarily stored in wasm locals, across
each complete wasm module. Registers are moved from memory to locals
upon entering the wasm module and moved from locals to memory upon
leaving. Additionally, calls to functions that modify registers are
wrapped between moving registers to memory before and moving back to
locals after. This affects:
1. All non-custom instructions
2. safe_{read,write}_slow, since it may page fault (the slow path of all memory accesses)
3. task_switch_test* and trigger_ud
4. All block boundaries
5. The fallback functions of gen_safe_read_write (read-modify-write memory accesses)
The performance benefits are currently mostly eaten up by 1. and 4. (if
one calculates the total number of read/writes to registers in memory,
they are higher after this patch, as each instructions of typ 1. or 4.
requires moving all 8 register twice). This can be improved later by the
relatively mechanical work of making instructions custom (not
necessarily full code generation, only the part of the instruction where
registers are accessed). Multi-page wasm module generation will
significantly reduce the number of type 4. instructions.
Due to 2., the overall code size has significantly increased. This case
(the slow path of memory access) is often generated but rarely executed.
These moves can be removed in a later patch by a different scheme for
safe_{read,write}_slow, which has been left out of this patch for
simplicity of reviewing.
This also simplifies our code generation for storing registers, as
instructions_body.const_i32(register_offset);
// some computations ...
instruction_body.store_i32();
turns into:
// some computations ...
write_register(register_index);
I.e., a prefix is not necessary anymore as locals are indexed directly.
Further patches will allow getting rid of some temporary locals, as
registers now can be used directly.
This commit prevents creation of entry points for jumps within the same
page. In interpreted mode, execution is continued on these kinds of
jumps.
Since this prevents the old hotness detection from working efficiently,
hotness detection has also been changed to work based on instruction
counters, and is such more precise (longer basic blocks are compiled
earlier).
This also breaks the old detection loop safety mechanism and causes
Linux to sometimes loop forever on "calibrating delay loop", so
JIT_ALWAYS_USE_LOOP_SAFETY has been set to 1.
Makes the following a block boundary:
- push
- Any non-custom instruction that uses modrm encoding
- Any sse/fpu instruction
This commit affects performance negatively. In order to fix this, the
above instructions need to be implemented using custom code generators
for the memory access.
This commit makes the return type of most basic memory access primitives
Result, where the Err(()) case means a page fault happened, the
instruction should be aborted and execution should continue at the page
fault handler.
The following primites have a Result return type: safe_{read,write}*,
translate_address_*, read_imm*, writable_or_pagefault, get_phys_eip,
modrm_resolve, push*, pop*.
Any instruction needs to handle the page fault cases and abort
execution appropriately. The return_on_pagefault! macro has been
provided to get the same behaviour as the previously used JS exceptions
(local to the function).
Calls from JavaScript abort on a pagefault, except for
writable_or_pagefault, which returns a boolean. JS needs to check
before calling any function that may pagefault.
This commit does not yet pervasively apply return_on_pagefault!, this
will be added in the next commit.
Jitted code does not yet properly handle the new form of page faults,
this will be added in a later commit.
This commit contains the final changes requires for porting all C code
to Rust and from emscripten to llvm:
- tools/wasm-patch-indirect-function-table.js: A script that rewrites
the wasm generated by llvm to remove the table limit
- tools/rust-lld-wrapper: A wrapper around rust-lld that removes
arguments forced by rustc that break compilation for us
- src/rust/cpu2/Makefile: A monstrosity to postprocess c2rust's output
- gen/generate_interpreter.js: Ported to produce Rust instead of C
- src/rust/*: A few functions and macros to connect the old Rust code
and the new Rust code
- src/*.js: Removes the loading of the old emscripten wasm module and
adapts imports and exports from emscripten to llvm
The generated rust code doesn't call read_imm* functions for custom
instructions now for the memory variant branches when both immediate
values and modrm byte is used
The following files and functions were ported:
- jit.c
- codegen.c
- _jit functions in instructions*.c and misc_instr.c
- generate_{analyzer,jit}.js (produces Rust code)
- jit_* from cpu.c
And the following data structures:
- hot_code_addresses
- wasm_table_index_free_list
- entry_points
- jit_cache_array
- page_first_jit_cache_entry
Other miscellaneous changes:
- Page is an abstract type
- Addresses, locals and bitflags are unsigned
- Make the number of entry points a growable type
- Avoid use of global state wherever possible
- Delete string packing
- Make CachedStateFlags abstract
- Make AnalysisType product type
- Make BasicBlockType product type
- Restore opcode assertion
- Set opt-level=2 in debug mode (for test performance)
- Delete JIT_ALWAYS instrumentation (now possible via api)
- Refactor generate_analyzer.js
- Refactor generate_jit.js
- introduce multiple entry points per compiled wasm module, by passing
the initial state to the generated function.
- continue analysing and compiling after instructions that change eip, but
will eventually return to the next instruction, in particular CALLs
(and generate an entry point for the following instruction)
This commit is incomplete in the sense that the container will crash
after some time of execution, as wasm table indices are never freed
This commit consists of three components:
1. A new generated x86-parser that analyses instructions. For now, it
only detects the control flow of an instruction: Whether it is a
(conditional) jump, a normal instruction or a basic block boundary
2. A new function, jit_find_basic_blocks, that finds and connects basic
blocks using 1. It loosely finds all basic blocks making up a function,
i.e. it doesn't follow call or return instructions (but it does follow
all near jumps). Different from our previous analysis, it also finds
basic blocks in the strict sense that no basic block contains a jump
into the middle of another basic block
3. A new code-generating function, jit_generate, that takes the output
of 2 as input. It generates a state machine:
- Each basic block becomes a case block in a switch-table
- Each basic block ends with setting a state variable for the following basic block
- The switch-table is inside a while(true) loop, which is terminated
by return statements in basic blocks which are leaves
Additionally:
- Block linking has been removed as it is (mostly) obsoleted by these
changes. It may later be reactived for call instructions
- The code generator API has been extended to generate the code for the state machine
- The iterations of the state machine are limited in order to avoid
infinite loops that can't be interrupted
These instructions, if included within a compiled JIT block, may alter the
state_flags of a block entry (such as whether flat segmentation is used or not),
which may invalidate the block that is running - this caused bugs in OpenBSD
because of a block like this being compiled:
0xF81F2: 8E DB mov ds, bx
0xF81F4: 8E D3 mov ss, bx
0xF81F6: 66 8B 26 B8 F5 mov esp, dword ptr [0xf5b8] <--
0xF81FB: 66 89 36 B8 F5 mov dword ptr [0xf5b8], esi <--
The memory accesses implicitly use DS. If we include flat-segmenetation as a
flag within state_flags and optimize calls to get_seg based on it, this behavior
would cause issues (and did, in OpenBSD).
By marking these instructions as block boundaries, we remediate that issue.
Previously a callback was used to insert the read_imm call after
modrm_resolve. The new api uses two calls, allowing the removal of some
edge cases and deletion of extraneous functions
We generate a version of the push/pop instruction with the stack_size_32 fixed,
since the state tends not to change much. If it does change, state_flags won't
match the output of pack_current_state_flags and the cache entry will therefore
be invalidated.
In order of precedence:
--all generates all tables
--table jit{,0f_16,0f_32} / interpreter{,0f_16,0f_32}
And optionally:
--output-dir /path/to/output (defaults to v86 build directory)
This is in prep to let the make system generate individual tables as required
using this script instead of the script generating all 3.
Have output of generate table files use .c suffix
Remove write_sync_if_changed
The function existed to stop make from recompiling v86*.wasm everytime from
having the tables regenerated. With the upcoming change, this becomes unnecessary.
Correct Makefile to show dependency structure for generate scripts
See:
https://gist.github.com/AmaanC/faff7066d16f1dee4bbbd6b73a72d831
From the geek32[1] sheet, the criteria for nonfaulting instructions used was:
```
groups = ['arith', 'logical', 'conver', 'datamov', 'datamov arith', 'shftrot', 'flgctrl'];
// Excluded because they may trigger faults, so the optimization can't apply to them
excluded_opcodes = [
// May trigger_ud
'0x8C',
// switch_seg may fault
'0x8E',
// mov to/from seg:offset (memory accesses)
'0xA0',
'0xA1',
'0xA2',
'0xA3',
// Unimplemented in v86
'0xD50A',
'0x0F38F0',
'0x0F38F1',
];
// Keywords that indicate a group/instruction which may fault
excluded_row_words = ['fpu', 'simd', 'mmx', 'sse', 'vmx', 'XLAT', 'DIV', 'AMX', 'AAM', 'CLI', 'STI', 'CMPXCHG8B'];
```
[1] http://ref.x86asm.net/geek32.html#x0F90
This gets rid of the old global variable that let us return status flags from
jit_instruction. As a caveat, though, all prefix instructions also need to
return the status / tack it onto the existing status. This could give rise to
subtle bugs if instructions that mean to update the status don't propogate the
return status all the way up the chain.
Since we can only know an instruction's length (and other characteristics) after
its been decoded, this change allows us to first decode the instruction and
generate the instruction function calls to a scratch buffer, which can then be
copied over to the main (cs) buffer as appropriate.
Tests for the scratch buffer were added too.
Since jit_opcode is called multiple times for prefix instructions (for the
prefixes and then for the eventual opcode too), calling it jit_opcode seems more
accurate.
jit_instruction now encapsulates the entire decoding process of one instruction,
and returns all the status flags - for now, only regarding whether it was a jump
instruction.