accelforge.frontend.mapper package#
Submodules#
accelforge.frontend.mapper.ffm module#
- class accelforge.frontend.mapper.ffm.FFM[source]#
Bases: EvalableModel
Configuration for the Fast and Fusiest Mapper.
- explore_imperfect_spatial_loops: bool = False#
If True, spatial loop bounds may not perfectly divide the full rank shape. This takes longer to explore and requires more RAM, but mappings found may have better spatial utilization. This is especially helpful when the rank shapes have few prime factors.
For example, if the rank shape is 7, then explore_imperfect_spatial_loops=False would explore loop bounds of 1 and 7, while explore_imperfect_spatial_loops=True would explore loop bounds of 1, 2, 3, 4, and 7. This is helpful for a size-4 PE array, where we could get full utilization using 4 PEs in one timestep and 3 PEs in another timestep.
Only “simple” rank variables (those appearing alone and not inside an expression in any tensor access) may have imperfect loop bounds.
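The bound sets in the example above can be reproduced with a small sketch (an illustration, not the mapper's implementation). One way to arrive at the documented imperfect set is to keep one representative bound per distinct iteration count, since bounds that yield the same number of iterations are redundant:

```python
import math

def perfect_loop_bounds(shape: int) -> list[int]:
    """Bounds that divide the rank shape exactly (imperfect loops disabled)."""
    return [b for b in range(1, shape + 1) if shape % b == 0]

def imperfect_loop_bounds(shape: int) -> list[int]:
    """One bound per achievable iteration count ceil(shape / bound).

    For each iteration count k, ceil(shape / k) is the smallest bound that
    finishes in k iterations; larger bounds with the same count are redundant.
    """
    return sorted({math.ceil(shape / k) for k in range(1, shape + 1)})

print(perfect_loop_bounds(7))    # [1, 7]
print(imperfect_loop_bounds(7))  # [1, 2, 3, 4, 7]
```

With shape 7 this reproduces the documented sets: 1 and 7 for perfect bounds, and 1, 2, 3, 4, 7 for imperfect bounds (e.g., bound 4 runs 4 PEs in one timestep and 3 in the next).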
- explore_imperfect_temporal_loops: bool = False#
If True, temporal loop bounds may not perfectly divide the full rank shape. This takes longer to explore and requires more RAM, but mappings found may have lower memory usage. This is especially helpful when the rank shapes have few prime factors.
For example, if the rank shape is 7, then explore_imperfect_temporal_loops=False would explore loop bounds of 1, 7 and explore_imperfect_temporal_loops=True would explore loop bounds of 1, 2, 3, 4, 7.
Only “simple” rank variables (those appearing alone and not inside an expression in any tensor access) may have imperfect loop bounds.
- explore_loop_orders: bool = True#
Whether to explore loop orders for loops where we may get partial reuse. Loop orders that don’t matter (i.e., those that yield either full or no reuse) are not explored, except when joining, where we may join partial mappings that have different-but-equivalent loop orders.
- force_memory_hierarchy_order: bool = True#
If set to True, storage nodes for lower-level memories must be placed below storage nodes for higher-level memories. For example, all MainMemory storage nodes must go above all LocalBuffer storage nodes.
This constraint always applies to same-tensor storage nodes (e.g., MainMemory reusing Output must go above LocalBuffer reusing Output); turning it off will permit things like MainMemory reusing Output going above LocalBuffer reusing Input.
- classmethod from_yaml(*files, jinja_parse_data=None, top_key=None, **kwargs)#
Loads a dictionary from one or more yaml files.
Each yaml file should contain a dictionary. Dictionaries are combined in the order they are given.
Keyword arguments are also added to the dictionary.
- Parameters:
  - files (str | list[str] | Path | list[Path]) – A list of yaml files to load.
  - jinja_parse_data (dict[str, Any] | None) – A dictionary of Jinja2 data to use when parsing the yaml files.
  - top_key (str | None) – The top key to use when parsing the yaml files.
  - kwargs – Extra keyword arguments to be passed to the Jinja2 parser.
- Return type:
  TypeVar(T)
- Returns:
  A dict containing the combined dictionaries.
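The combining rule can be sketched in plain Python (a stand-in for the YAML loading itself; the input dictionaries below are hypothetical, and later-files-override-earlier-keys is an assumption, since the docstring only says dictionaries are combined in order):

```python
def combine_dicts(*dicts: dict, **kwargs) -> dict:
    """Sketch of from_yaml's combining rule: each loaded dictionary is
    merged in the order given (assumed: later entries override earlier
    ones on key conflicts), and keyword arguments are added last."""
    combined: dict = {}
    for d in dicts:
        combined.update(d)
    combined.update(kwargs)
    return combined

# Stand-ins for dictionaries parsed from two yaml files:
base = {"explore_loop_orders": True, "objective_tolerance": 0}
override = {"objective_tolerance": 0.05}
print(combine_dicts(base, override, max_fused_loops_per_rank_variable=2))
```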
- max_fused_loops_per_rank_variable: int = 1#
The maximum number of fused loops in a pmapping for a given rank variable.
- max_loops_minus_ranks: float | int = inf#
The maximum total loops in a pmapping minus the number of ranks. For example, 3 means that the number of loops can be up to (the number of ranks + 3).
- max_pmapping_templates_per_einsum: float | int = inf#
The maximum number of pmapping templates per Einsum. Once this many templates are generated, the mapper will stop generating more. This is useful for debugging (why are so many templates being generated?).
- memory_limit_per_process: float | int = inf#
The maximum amount of memory that each of the mapper’s processes may use.
- objective_tolerance: float = 0#
Reduces memory usage and runtime for the mapper. When set to a nonzero value, the mapper may return mappings up to (1 + tolerance)× optimal. Also see resource_usage_tolerance to further reduce mapper memory usage and runtime.
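As an illustration of the guarantee above (not the mapper's internal code; assuming a scalar objective where lower is better), a tolerance-based prune could look like:

```python
def prune_by_objective(objectives: list[float], tolerance: float) -> list[float]:
    """Keep candidates whose objective is within (1 + tolerance) of the
    best (lowest) objective seen. Candidates above the threshold may be
    dropped, trading optimality for lower memory usage and runtime."""
    best = min(objectives)
    threshold = (1 + tolerance) * best
    return [o for o in objectives if o <= threshold]

# With a 5% tolerance, only mappings within 1.05x of the best survive:
print(prune_by_objective([100.0, 104.0, 110.0, 150.0], tolerance=0.05))
```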
- out_of_order_hierarchy_explore_removing_spatials_for_more_temporals: bool = False#
If force_memory_hierarchy_order is set to False (globally or for any particular component), a spatial fanout may end up raised above a storage node that does not have that fanout; that is, a spatial loop may be placed above a component without the associated fanout.
When this happens, we may not place, between the spatial loop and the storage node, any temporal loops that affect the same indexing expressions as the spatial loop.
For example, the following is not allowed:

Arch:
  Global Buffer
  2x fanout
  Register

Mapping:
- spatial-for-reg n in [0, 10):
  - [Register reuses input]
  - for n in [0, 2):
    - [Global Buffer reuses output]
By default, if there are spatial loops that are not constrained away, the mapper will not explore placing any conflicting temporal loops; in the above example, it will never place the temporal loop. If this is set to True, the mapper will also explore removing the spatial loop in order to allow the temporal loop to be placed.
- prioritize_reuse_of_unfused_tensors: bool = False#
If set to True, then for all memory levels, the mapper will place the storage nodes of unfused tensors above those of fused tensors. This is overridden if there is any tensor_order_options specified for a memory level. The result of this is that the mapper will avoid mappings that repeatedly fetch unfused tensors in order to allow for smaller tiles of fused tensors. This may lead to better mappings, but slows down the mapper.
- resource_usage_tolerance: float = 0#
Reduces memory usage and runtime for the mapper. When set to a nonzero value, the mapper may drop mappings with resource usage > (1 - tolerance)× optimal. The mapper is guaranteed to return all Pareto-optimal mappings with resource usage below this, and perhaps more. If Metrics.RESOURCE_USAGE is set, then this is ignored. Setting this, as well as objective_tolerance, to a greater-than-zero value will reduce memory usage for the mapper.
accelforge.frontend.mapper.mapper module#
- class accelforge.frontend.mapper.mapper.Mapper[source]#
Bases: EvalableModel

- classmethod from_yaml(*files, jinja_parse_data=None, top_key=None, **kwargs)#
Loads a dictionary from one or more yaml files.
Each yaml file should contain a dictionary. Dictionaries are combined in the order they are given.
Keyword arguments are also added to the dictionary.
- Parameters:
  - files (str | list[str] | Path | list[Path]) – A list of yaml files to load.
  - jinja_parse_data (dict[str, Any] | None) – A dictionary of Jinja2 data to use when parsing the yaml files.
  - top_key (str | None) – The top key to use when parsing the yaml files.
  - kwargs – Extra keyword arguments to be passed to the Jinja2 parser.
- Return type:
  TypeVar(T)
- Returns:
  A dict containing the combined dictionaries.
accelforge.frontend.mapper.metrics module#
- class accelforge.frontend.mapper.metrics.Metrics[source]#
Bases: Flag
Metrics used to optimize mappings or reported by the model.
- ACTIONS = 32#
Action counts.
- DETAILED_MEMORY_USAGE = 64#
Memory usage broken down by tensor and Einsum.
- DYNAMIC_ENERGY = 4#
The amount of dynamic energy consumed by the workload.
- ENERGY = 2#
The amount of energy consumed by the workload.
- LATENCY = 1#
The amount of time taken to execute the workload.
- LEAK_ENERGY = 8#
The amount of leakage energy consumed by the workload.
- RESOURCE_USAGE = 16#
The amount of resources used by the workload.
When used as a mapper objective, this objective is multivariate, and must consider every resource available to the hardware.
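Since Metrics is a Flag, individual metrics can be combined with bitwise OR. A stand-in enum mirroring the documented values (defined locally here to avoid depending on accelforge) shows the mechanics:

```python
from enum import Flag

class Metrics(Flag):
    # Values mirror the documented members above.
    LATENCY = 1
    ENERGY = 2
    DYNAMIC_ENERGY = 4
    LEAK_ENERGY = 8
    RESOURCE_USAGE = 16
    ACTIONS = 32
    DETAILED_MEMORY_USAGE = 64

# Combine several metrics into one objective/report specification:
objectives = Metrics.LATENCY | Metrics.ENERGY
print(Metrics.LATENCY in objectives)  # membership test via `in`
print(objectives.value)               # combined flag value: 3
```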
- __new__(value)#
Module contents#
- class accelforge.frontend.mapper.FFM[source]#
Bases: EvalableModel
Configuration for the Fast and Fusiest Mapper.
- explore_imperfect_spatial_loops: bool = False#
If True, spatial loop bounds may not perfectly divide the full rank shape. This takes longer to explore and requires more RAM, but mappings found may have better spatial utilization. This is especially helpful when the rank shapes have few prime factors.
For example, if the rank shape is 7, then explore_imperfect_spatial_loops=False would explore loop bounds of 1 and 7, while explore_imperfect_spatial_loops=True would explore loop bounds of 1, 2, 3, 4, and 7. This is helpful for a size-4 PE array, where we could get full utilization using 4 PEs in one timestep and 3 PEs in another timestep.
Only “simple” rank variables (those appearing alone and not inside an expression in any tensor access) may have imperfect loop bounds.
- explore_imperfect_temporal_loops: bool = False#
If True, temporal loop bounds may not perfectly divide the full rank shape. This takes longer to explore and requires more RAM, but mappings found may have lower memory usage. This is especially helpful when the rank shapes have few prime factors.
For example, if the rank shape is 7, then explore_imperfect_temporal_loops=False would explore loop bounds of 1, 7 and explore_imperfect_temporal_loops=True would explore loop bounds of 1, 2, 3, 4, 7.
Only “simple” rank variables (those appearing alone and not inside an expression in any tensor access) may have imperfect loop bounds.
- explore_loop_orders: bool = True#
Whether to explore loop orders for loops where we may get partial reuse. Loop orders that don’t matter (i.e., those that yield either full or no reuse) are not explored, except when joining, where we may join partial mappings that have different-but-equivalent loop orders.
- force_memory_hierarchy_order: bool = True#
If set to True, storage nodes for lower-level memories must be placed below storage nodes for higher-level memories. For example, all MainMemory storage nodes must go above all LocalBuffer storage nodes.
This constraint always applies to same-tensor storage nodes (e.g., MainMemory reusing Output must go above LocalBuffer reusing Output); turning it off will permit things like MainMemory reusing Output going above LocalBuffer reusing Input.
- classmethod from_yaml(*files, jinja_parse_data=None, top_key=None, **kwargs)#
Loads a dictionary from one or more yaml files.
Each yaml file should contain a dictionary. Dictionaries are combined in the order they are given.
Keyword arguments are also added to the dictionary.
- Parameters:
  - files (str | list[str] | Path | list[Path]) – A list of yaml files to load.
  - jinja_parse_data (dict[str, Any] | None) – A dictionary of Jinja2 data to use when parsing the yaml files.
  - top_key (str | None) – The top key to use when parsing the yaml files.
  - kwargs – Extra keyword arguments to be passed to the Jinja2 parser.
- Return type:
  TypeVar(T)
- Returns:
  A dict containing the combined dictionaries.
- max_fused_loops_per_rank_variable: int = 1#
The maximum number of fused loops in a pmapping for a given rank variable.
- max_loops_minus_ranks: float | int = inf#
The maximum total loops in a pmapping minus the number of ranks. For example, 3 means that the number of loops can be up to (the number of ranks + 3).
- max_pmapping_templates_per_einsum: float | int = inf#
The maximum number of pmapping templates per Einsum. Once this many templates are generated, the mapper will stop generating more. This is useful for debugging (why are so many templates being generated?).
- memory_limit_per_process: float | int = inf#
The maximum amount of memory that each of the mapper’s processes may use.
- objective_tolerance: float = 0#
Reduces memory usage and runtime for the mapper. When set to a nonzero value, the mapper may return mappings up to (1 + tolerance)× optimal. Also see resource_usage_tolerance to further reduce mapper memory usage and runtime.
- out_of_order_hierarchy_explore_removing_spatials_for_more_temporals: bool = False#
If force_memory_hierarchy_order is set to False (globally or for any particular component), a spatial fanout may end up raised above a storage node that does not have that fanout; that is, a spatial loop may be placed above a component without the associated fanout.
When this happens, we may not place, between the spatial loop and the storage node, any temporal loops that affect the same indexing expressions as the spatial loop.
For example, the following is not allowed:

Arch:
  Global Buffer
  2x fanout
  Register

Mapping:
- spatial-for-reg n in [0, 10):
  - [Register reuses input]
  - for n in [0, 2):
    - [Global Buffer reuses output]
By default, if there are spatial loops that are not constrained away, the mapper will not explore placing any conflicting temporal loops; in the above example, it will never place the temporal loop. If this is set to True, the mapper will also explore removing the spatial loop in order to allow the temporal loop to be placed.
- prioritize_reuse_of_unfused_tensors: bool = False#
If set to True, then for all memory levels, the mapper will place the storage nodes of unfused tensors above those of fused tensors. This is overridden if there is any tensor_order_options specified for a memory level. The result of this is that the mapper will avoid mappings that repeatedly fetch unfused tensors in order to allow for smaller tiles of fused tensors. This may lead to better mappings, but slows down the mapper.
- resource_usage_tolerance: float = 0#
Reduces memory usage and runtime for the mapper. When set to a nonzero value, the mapper may drop mappings with resource usage > (1 - tolerance)× optimal. The mapper is guaranteed to return all Pareto-optimal mappings with resource usage below this, and perhaps more. If Metrics.RESOURCE_USAGE is set, then this is ignored. Setting this, as well as objective_tolerance, to a greater-than-zero value will reduce memory usage for the mapper.
- class accelforge.frontend.mapper.Mapper[source]#
Bases: EvalableModel

- classmethod from_yaml(*files, jinja_parse_data=None, top_key=None, **kwargs)#
Loads a dictionary from one or more yaml files.
Each yaml file should contain a dictionary. Dictionaries are combined in the order they are given.
Keyword arguments are also added to the dictionary.
- Parameters:
  - files (str | list[str] | Path | list[Path]) – A list of yaml files to load.
  - jinja_parse_data (dict[str, Any] | None) – A dictionary of Jinja2 data to use when parsing the yaml files.
  - top_key (str | None) – The top key to use when parsing the yaml files.
  - kwargs – Extra keyword arguments to be passed to the Jinja2 parser.
- Return type:
  TypeVar(T)
- Returns:
  A dict containing the combined dictionaries.
- class accelforge.frontend.mapper.Metrics[source]#
Bases: Flag
Metrics used to optimize mappings or reported by the model.
- ACTIONS = 32#
Action counts.
- DETAILED_MEMORY_USAGE = 64#
Memory usage broken down by tensor and Einsum.
- DYNAMIC_ENERGY = 4#
The amount of dynamic energy consumed by the workload.
- ENERGY = 2#
The amount of energy consumed by the workload.
- LATENCY = 1#
The amount of time taken to execute the workload.
- LEAK_ENERGY = 8#
The amount of leakage energy consumed by the workload.
- RESOURCE_USAGE = 16#
The amount of resources used by the workload.
When used as a mapper objective, this objective is multivariate, and must consider every resource available to the hardware.
- __new__(value)#