Mapping with Fast & Fusiest#
Mapping workloads onto accelerators uses the Fast and Fusiest Mapper (FFM), which consists of two parts:
The Turbo-Charged Pmapper: This part generates all Pareto-optimal pmappings for all Einsums.
The Fast and Fusiest Mapper (FFM): This part joins the Pareto-optimal pmappings into full mappings.
This document walks you through how to use FFM to map a workload onto an accelerator.
It follows the notebooks/tutorials/mapper.ipynb notebook.
Creating a Spec#
Before we dive into the mapper, we need to set up a
Spec object describing the architecture and workload. We can initialize
Spec objects from YAML files.
import os

import accelforge as af

# Set the number of parallel threads that the mapper can use. If you are running out of
# memory, you may decrease this number. By default, the number of threads is set to the
# number of cores on your machine.
af.set_n_parallel_jobs(os.cpu_count(), print_message=True)
# Initialize the spec and show the workload. `examples_dir` is assumed to point to the
# repository's examples directory (it is defined earlier in the notebook).
BATCH_SIZE = 1
N_TOKENS = 8192
FUSE = False
spec = af.Spec.from_yaml(
    examples_dir / "arches" / "tpu_v4i.yaml",
    examples_dir / "workloads" / "gpt3_6.7B.yaml",
)
# Fusion happens when tensors bypass the outermost Memory object, so, to disable fusion,
# force all tensors to be in the outermost memory.
if not FUSE:
    for node in spec.arch.nodes:
        if isinstance(node, af.arch.Memory):
            print(f'Keeping all tensors in {node.name}')
            node.tensors.keep = "All"
            break
We can set optimization metrics for the mapper by setting the spec.mapper.metrics
attribute to one of the Metrics enum
values or a logical OR (|) of multiple values.
See the Metrics enum for the full list of available optimization metrics.
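For example, to optimize for both energy and latency at once (a minimal sketch; the exact Metrics member names, ENERGY and LATENCY, are assumptions):
# Optimize for both energy and latency. The member names here are illustrative.
spec.mapper.metrics = af.Metrics.ENERGY | af.Metrics.LATENCY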
Calling the Mapper#
We call the mapper with the
map_workload_to_arch() function, which runs both parts and returns the resulting mappings.
# Commenting this out will be slower, but may generate better mappings. It limits the
# number of fused loops that can exist in a single pmapping.
spec.mapper.max_fused_loops = 1
mapping = spec.map_workload_to_arch()
# Render the mapping with mapping.render(), or in the last line of a notebook:
mapping
In this code, the max_fused_loops
parameter makes mapping faster by limiting the number of fused loops that can exist
in a single pmapping. The FFM class has a
variety of knobs that can be used to speed up mapping:
force_memory_hierarchy_order: If set to True, storage nodes for lower-level memories must be placed below storage nodes for higher-level memories. For example, all MainMemory storage nodes must go above all LocalBuffer storage nodes. This constraint always applies to same-tensor storage nodes (e.g., MainMemory reusing Output must go above LocalBuffer reusing Output); turning it off will permit things like MainMemory reusing Output going above LocalBuffer reusing Input.
info_metrics: Metrics to be reported for final mappings.
max_fused_loops: The maximum total number of fused loops in a pmapping.
max_fused_loops_per_rank_variable: The maximum number of fused loops in a pmapping for a given rank variable.
max_loops: The maximum total number of loops in a pmapping.
max_loops_minus_ranks: The maximum total number of loops in a pmapping minus the number of ranks. For example, 3 means that the number of loops can be up to (the number of ranks + 3).
max_pmapping_templates_per_einsum: The maximum number of pmapping templates per Einsum. Once this many templates are generated, the mapper will stop generating more. This is useful for debugging (why are so many templates being generated?).
memory_limit: The maximum memory limit for the mapper.
memory_limit_per_process: The maximum memory limit for one of the mapper's processes.
metrics: Metrics used to optimize mappings.
out_of_order_hierarchy_explore_removing_spatials_for_more_temporals: If force_memory_hierarchy_order is set to False (globally or for any particular component), a spatial fanout may end up being raised above a storage node that does not have the associated fanout. When this happens, we may not place, between the spatial loop and the storage node, any temporal loops that affect the same indexing expressions as the spatial loop. For example, the following is not allowed:
Arch:
- Global Buffer
- 2x fanout
- Register
Mapping:
spatial-for-reg n in [0, 10): [Register reuses input]
for n in [0, 2): [Global Buffer reuses output]
By default, if there are spatial loops that are not constrained away, the mapper will not explore placing any conflicting temporal loops; in the above example, it will never place the temporal loop. If this is set to True, then the mapper will also explore removing the spatial loop in order to allow the temporal loop to be placed.
time_limit: The maximum time limit for the mapper.
time_limit_per_pmapping_template: The maximum time limit per pmapping template.
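For example, several of these knobs can be set on spec.mapper before calling map_workload_to_arch(). This is a minimal sketch; the values are illustrative, and the units of the limit knobs are assumptions:
# Illustrative speed/quality trade-offs; values and units are assumptions.
spec.mapper.force_memory_hierarchy_order = True
spec.mapper.max_loops_minus_ranks = 2  # allow up to (number of ranks + 2) loops
spec.mapper.time_limit = 3600          # assumed to be in seconds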
Interpreting Output Results#
The mapper outputs a Mappings object, which
contains all Pareto-optimal mappings found for the given cascade of Einsums.
In general, if there is only one objective function (such as energy), this will include only one mapping. If there are multiple objective functions, then many mappings may be Pareto-optimal.
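When multiple mappings are returned, we can index into the Mappings object to select a single mapping (indexing is supported; see render() below). A minimal sketch:
# Select the first Pareto-optimal mapping from the returned set.
best = mapping[0]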
We can access the results in the Mappings
object with functions such as energy() and
latency(), which return the energy
and latency of the mapping.
print(f'Energy: {mapping.energy()}J, {mapping.per_compute().energy()}J/compute')
for k, v in mapping.per_compute().energy(per_component=True).items():
print(f'\t{k}: {v}J/compute')
print(f'Latency: {mapping.latency()}s, {mapping.per_compute().latency()}s/compute')
for k, v in mapping.per_compute().latency(per_component=True).items():
print(f'\t{k}: {v}s/compute')
The Mappings object is a wrapper around a
dataframe that contains all of the resulting mapping stats. These stats are accessible
using a variety of functions:
access(): Returns a new Mappings object with only the columns that contain the given keys. Column names are strings separated by "<SEP>", and this method will return columns with one <SEP>-separated string matching the given key. Then, for all remaining columns, the key will be removed. For example, if the columns are "Compute<SEP>Energy", "DRAM<SEP>Energy", and "DRAM<SEP>Latency", then access("Energy") will return a Mappings object with columns "Compute" and "DRAM", and access("DRAM") will return a Mappings object with columns "Energy" and "Latency". If multiple keys are given, then the procedure is repeated for each key. col_idx: The index of the key in the column name. This can be used if the given key is found at multiple indexes in different columns.
drop(): Returns a new Mappings object with the given keys dropped from all columns. Column names are strings separated by "<SEP>", and this method will drop columns with one <SEP>-separated string matching the given key. For example, if the columns are "Compute<SEP>Energy", "DRAM<SEP>Energy", and "DRAM<SEP>Latency", then drop("Energy") will drop "Compute<SEP>Energy" and "DRAM<SEP>Energy", and drop("DRAM") will drop "DRAM<SEP>Energy" and "DRAM<SEP>Latency". If multiple keys are given, then the procedure is repeated for each key.
drop_zeros(): Returns a new Mappings object with all columns that contain only zeros dropped.
energy(): Returns the energy consumed. A dictionary is returned with keys that are tuples of (Einsum name, Component name, Tensor name, Action name), with any of these omitted if the corresponding parameter is not set to True. If none of the per_ parameters are set to True, a float or a list of floats is returned. NOTE: Leak power is not per-tensor. If per_tensor is True, then the tensor name for leak will be None.
latency(): Returns the latency consumed. A dictionary is returned with keys that are tuples of (Einsum name, Component name), with either omitted if the corresponding parameter is not set to True. If neither of the per_ parameters is set to True, a float or a list of floats is returned. NOTE: If per-Einsum is False and per-component is True, then the latency of each component will be summed across all Einsums. THE TOTAL LATENCY MAY BE GREATER THAN THE MAX OF THE PER-COMPONENT LATENCIES. This is because different components can be the bottleneck for different Einsums.
num_computes(): Returns the number of computes for the given Einsum name, or the total computes if einsum_name is None.
per_compute(): Returns the per-compute evaluation results by dividing all statistics by the number of computes.
per_tensor_size(): Returns a dictionary of {Tensor name: Number of bits} for each tensor in the spec. If return_n_elements is True, then the number of elements is returned instead of the number of bits.
render(): Renders the mapping as a Pydot graph. Returns an SVG string. This is only supported if there is a single mapping; if there are multiple mappings, either index into this Mappings object first or pass in an index.
to_dict(): Returns the data in this Mappings object as a dictionary. Each column in this Mappings' data becomes a key in the dictionary. Values in the dictionary may be a single value if there is only one mapping, or a list of values if there are multiple mappings.
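For example, combining access(), drop_zeros(), and to_dict() (a minimal sketch; the "Energy" key follows the <SEP> examples above):
# Keep only energy-related columns, drop all-zero columns, and export to a dict.
energy_stats = mapping.access("Energy").drop_zeros()
print(energy_stats.to_dict())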
How the Mapper Works, Debugging, and Advanced Usage#
The Fast and Fusiest Mapper (FFM) works using partial mappings (pmappings), which are mappings for a subset of the workload.
The mapper works in two steps:
Make Partial Mappings: This step uses the Turbo-Charged Pmapper to generate all Pareto-optimal pmappings for all Einsums.
Join Partial Mappings: This step uses the Fast and Fusiest Mapper (FFM) to join the Pareto-optimal pmappings into full mappings.
To help with debugging, we can run each of these steps separately to inspect the results.
Making Partial Mappings#
Partial mappings are generated using the
make_pmappings() function. Pmappings are generated
in two stages:
Generate Pmapping Templates: This stage generates pmapping templates, which are pmappings without the loop bounds filled in.
Fill Pmapping Templates with Tile Shapes: This stage fills the loop bounds in the pmapping templates with tile shapes.
When make_pmappings() is called, it first generates
all pmapping templates for each Einsum. Each pmapping template starts as an ordering of
storage nodes (data placement) as well as a compute node choice (if there are multiple
compute nodes in the architecture). The mapper will then insert spatial and temporal
loops into the pmapping template.
Then, for each pmapping template, the mapper will generate all possible loop bounds for the temporal and spatial loops. As it enumerates loop bound choices, it will prune non-Pareto-optimal combinations to keep the search space tractable.
For debugging in this stage, you can use two main tools. First, you can directly inspect the generated pmapping templates and check if all expected templates are included. These will be logged as they are generated.
Next, you can inspect the returned
MultiEinsumPmappings object. This object
includes a variety of functions to report the proportion of pmappings that were kept or
removed for various reasons, such as resource oversubscription or Pareto pruning.
drop_einsums(): Removes all pmappings for the given Einsums.
n_evaluated_pmappings(): Returns the number of pmappings that were evaluated for each Einsum. This is greater than the number of Pareto-optimal pmappings because some mappings are found to be suboptimal after they have been evaluated.
n_pareto_optimal_pmappings(): Returns the number of Pareto-optimal pmappings for each Einsum. This is the number of mappings that will be returned by the make_pmappings function.
n_pmapping_string(): Returns a string representation of the number of pmappings in the mapspace. Printing this can help diagnose if the mapper is not finding any pmappings or mappings.
n_total_pmappings(): Returns the total number of pmappings in the mapspace.
n_valid_pmappings(): Returns the number of valid pmappings for each Einsum. A valid pmapping is one that satisfies all constraints and resource usage limits.
pmapping_keep_rates(): Returns the keep rates for each cause of pmapping removal. For example, if only 25% of the pmappings have a valid spatial fanout, then the keep rate for the spatial fanout cause will be 0.25.
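To see these diagnostics in action, we can generate pmappings separately and print the results (a minimal sketch, assuming make_pmappings() is called on the Spec like map_workload_to_arch()):
# Generate Pareto-optimal pmappings for each Einsum without joining them.
pmappings = spec.make_pmappings()
# Diagnose the size of the pmapping space and why pmappings were removed.
print(pmappings.n_pmapping_string())
print(pmappings.pmapping_keep_rates())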
Joining Partial Mappings#
After we have all Pareto-optimal pmappings for all Einsums, we can join them into full
mappings with the join_pmappings() function. This
function returns a Mappings object, which
contains all Pareto-optimal mappings found for the given cascade of Einsums.
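Continuing the sketch above (assuming join_pmappings() is also called on the Spec; the exact signature is an assumption):
# Join the per-Einsum pmappings into full, Pareto-optimal mappings.
mappings = spec.join_pmappings(pmappings)
print(f'Energy: {mappings.energy()}J')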
Joining examines each pmapping's compatibility, a representation of how the pmapping exchanges data with other pmappings. Pairs of pmappings must respect data dependencies to be compatible. Specifically, for every shared tensor, pmappings must agree on the shared storage node and the loops above that storage node. This ensures that they agree on the shape of shared tensor tiles and on the order and memory level in which these tiles are exchanged.
If there are no valid mappings to join, an error will be raised reporting the compatibilities of the pmappings that were considered. It may be useful to cross-reference these compatibilities with the pmapping templates generated by the pmapping creation stage.