Kickoff Meeting JP, SJH, KIE, JB
Actions for two weeks time
- Where should the optimization pass go?
- What representation should the superoptimizer operate on?
- Does the SMT approach work for AVR?
- Analyse some basic blocks
- Average instruction count
- Metrics for complexity
- Number of memory ops
- Interdependency between memory ops and rest of instructions
- Find a set of basic blocks for further discussion
Bristol Meeting #1 JP, SJH, KIE
Discussed cost models to bias the search for instructions
- Instruction sequences length. Generate instruction sequences which are the same size or shorter is likely to be a good heuristic. Need to consider instruction size however, since some instructions could be 2 words
How can we handle larger basic blocks?
- Hopefully be able to generate instruction sequences up to length 10 (maybe 15).
- The set of instructions used for superoptimization should be restricted (conservatively).
- E.g. if the sequence does not load or store memory, then perhaps we do not need the memory instructions
- Some basic blocks are larger, however we may be able to split the basic block
- Consider the dataflow DAG, cyclomatic complexity. Can the DAG be partitioned (possibly make a cut which cuts the shortest amount of edges – where, how, how many?)
- The edges represent the live registers/flags. The search can be restricted based on liveness information
- Possibly not too difficult if only considering a single basic block
- Memory liveness needs to be taken into account (To discuss next week)
- Skip instructions
- Possibly difficult to encode. Consider as seperate basic blocks for now?
- Naively modelling flags results in a large number of constraints
- However, the constraints should fall naturally out of the bitvector arithmetic in most cases.
- Increase in constraint-generator complexity for reduction in final constraint complexity (which should reduce superoptimization time)
- Few instructions that use the flags (except branches), so maybe low effectiveness in superoptimization
Optimizing simultaneous sequences.
- Peephole superoptimizer study could bruteforce multiple sequences simultenously, by hashing the test vectors
- Unclear to see how this can be done for SMT based methods
- Naively attempting to find multiple solutions at the same time will limit the 'branch pruning' of the solver – it will be unable to rule out large areas of the search space due to multiple divergant solutions
- Could be the thing that limits the effectiveness of the SMT method
- Possible solutions
- Attempt to find multiple solutions to 'similar' sequences. Based on the assumption that the solutions will be similar. How to define a similarity metric?
Bristol Meeting #2 JP, KIE
- Potential for superoptimization often inhibited by load/stores dispersed through the code.
- The load/store are a single chain which cannot be reordered/eliminated, since the volatility of the address is not known (see Illustration 1).
- Code to be superoptimized could chosen by selecting a subgraph with few connections to the load/store chain (highlighted area on the right).
- The complexity of the DAG and size of the basic blocks are higher than expected. The number of instructions per basic block could greatly slow down superoptimization, although partitioning the DAG into smaller areas may help.
- [[Image:|thumb|Illustration 1: This basic block has a sequence of memory operations which will be difficult to superoptimize]]Many basic blocks are either too small, dominated by memory operations, or are too closely couple with the memory operations (see Illustration 2).
[[Image:|thumb|Illustration 2: Dataflow DAG. Arrows indicate a data dependency. Blue arrows are register dependencies, red are flags, and green is memory]]
AVR instruction set
- The simplicity of the AVR instruction set means there are few instructions with 'interesting' side effects.
- It is possible that since the instructions are simply, there is not much scope for superoptimization.
- The simplicity of the instructions further hinders superoptimization since a large number of operations is require to perform one 'high level' operation – this slows down the search for a replacement sequence.
- Since the operations are simple, is the compiler already exploiting them close to optimality?
- Survey other instruction sets – is there a more appropriate instruction set?
- Needs scope for superoptimization
- Trade offs between number of registers, number of instructions and simplicity of instructions
Higher level superoptimization
- The SMT solver will find a sequence of instructions for any given relationship between input and output machine states. Can we use information from earlier in the compiler (perhaps before register allocation and spillcode generation?)
- Explore intermediate representation in GCC – however these may suffer from the lack of instruction complexity, similar to AVR assembly, and a disconnect from the assembly instructions.
- Does unoptimized code have more 'scope' for superoptimization? If the optimization process loses some information about the sequence, or transforms it in a way that makes superoptimization difficult, then operating on unoptimized code may be better.
- In much of BEEBS the operations are on 16 or 32 bit integers, resulting in many 8bit integer ops to emulate the bit width.
- It may be that the superoptimizer can cope with restricted instances like this, which can then be reused, rather than generic instruction sequences.
Lymington Meeting #1 JP, JB
Where should the optimization pass go?
- Intermediate representations suffer from the same problems as
What representation should the superoptimizer operate on?
- The selection of the region to superoptimize should be done via the dataflow DAG. This allows only the necessary instructions to be superoptimized, and composes nicely if the whole basic block is not to be superoptimized.
- To enable multiple basic blocks to be considered, the representation needs to be simplified. With the DAG representation the need to track register assignments is unnecessary, due to the data dependencies being encoded in the transitions of the DAG.
Does the SMT approach work for AVR?
- Early experimental work suggests there isn't any particularly difficulty in encoding the AVR instruction set, however the efficiency of the approach suffers from AVR's lack of complexity.
Analyse some basic blocks
- Overall instruction count per basic block is high for AVR
- Instruction complexity is low for AVR – more varied types of instructions for similar operations
Summary + Next
Seems like optimizing basic blocks is not productive enough. The compiler does a good job with what is possible over the basic block, minimizing what is possible for the super optimizer to find. Rather than changing the instruction set, is it possible to consider more than just a single basic block.
More than basic blocks
- To superoptimize AVR code we need to consider multiple basic blocks as input – this is where many other superoptimizers found potential. For example, Massalin's superoptimizer considered a target function with branches, even if the superoptimizer itself could not generate branches.
- Loops would be the best target, since these are where the majority of the time is spent. However there is significant complexity in verifying loop equivalence, and likely even more complexity generating loops. Is there a hybrid approach that can be used, between SMT and bruteforce?
- Just optimizing the sequential code inside a loop (without changing the overall loop structure) may be sufficient for performance gains.
- At the very least straight-line code should be considered as input (even if the final output cannot contain branching).
- Optimizing larger amounts of code will make it more difficult to find common sequences between different programs, resulting in the superoptimizer having to be rerun more often.
Lymington Meeting #1.5 JP, JB
- Talk to Andrew – superoptimization wiki, put meeting notes on the wiki
- Talk to Kerstin about relational verification, verification of programs with loops
- Email David Greaves
- Analyse loops, regions bigger than basic blocks, common structures before and after optimization.
- Look at libgcc + compiler-rt (longer, one-shot superoptimization)
- Comparison with hand-written, compiler generated and superoptimized
Bristol Meeting #3 JP, SJH
What are we aiming to speed up?
- Are we targeting the common case? This should be handled well by the compiler already since there are operations commonly performed.
- Can we target more specific things, but which happen less frequently?
- The speed up in Henry Massalin's signum exampe came not just from exploiting the hardware flags, but from eliminating the control flow.
- The calculations performed by the branches became implicit, handled by values stored in the flags. Similarly, superoptimizers have been used to eliminate branches by executing multiple possible basic blocks then conditionally moving the results into the appropriate registers (if conversion).
- A more generic form of this could be speculative execution. Potentially eliminates difficult to handle control flow for the superoptimizer. Merge in a probable basic block, allowing more extensive superoptimization, then undo the result if the branch should not have been taken?
- Memory operations remain difficult to handle.
Metrics for superoptimization potential
- How does the instruction set relate to the potential of superoptimization?
- The size of the search space
- A larger search space means it is more difficult to searching
- More registers → larger search space
- Larger instruction set → larger search space
- The complexity of the operations that operate on the search space
- More complex instructions → smaller instruction set (?)
- More complex instructions → shorter instruction sequence
- Each instruction transforms the search space
Bristol Meeting #4 JP, KIE
- What do we expect we can do by superoptimizing loops? To be discussed on Tuesday.
- Different iteration patterns
- Different numbers of loops
Verification of two programs with loops
- Usually only verify if you already know the loop invariant. The loop invariant is very difficult to calculate from existing code automatically. See Alan Bundy work on rippling.
- “Relational verification using product programs” may provide insight into if programs with loops can be proved to be equivalent.
- If there is little to be gained by changing the loop structure, only consider the loop body
Generating programs with loops
- Harder than verifying whether two loops are the same
- Iteratively unroll loops, constraining each unroll to the same instructions, until reach a fixed point? Constraining the loop to the same instructions reduces the search space, but the number of constraints increases.