Arch Specification#

The architecture, defined by the Arch class, describes the hardware that is running the workload. An architecture is represented as a tree, where branches in the tree represent different compute paths that may be taken. For the rest of this section, we will assume that the architecture has been flattened, meaning that there are no branches in the tree. The flattening procedure is described in the Flattening section.

A flattened architecture is a hierarchy of components with a Compute at the bottom. The following components are supported:

  • Memory components store and reuse data.

  • Toll components perform some non-compute action (e.g., quantizing or transferring data) and charge for data passing through them.

  • Compute components perform the Einsum’s computation.

In architecture YAML files, each component is represented by a YAML dictionary. Component types are preceded by the ! character. An example architecture is shown below:

arch:
  nodes:
  - !Memory
    name: MainMemory
    size: inf
    leak_power: 0
    area: 0 # Don't include off-chip DRAM area
    actions:
      # Upper end of the range from the TPU paper. The lower end came from their
      # reference, and they said it left out some things.
    - {name: read, energy: 7.03e-12, latency: 1 / (8 * 614e9)}
    - {name: write, energy: 7.03e-12, latency: 1 / (8 * 614e9)}
    tensors: {keep: ~Intermediates, may_keep: All}

  - !Memory
    name: GlobalBuffer
    size: 1024*1024*128*8 # 128MB
    total_latency: max(read_latency, write_latency) # Separate ports
    leak_power: 0
    area: 112e-6 # 112 mm^2
    actions:
    - {name: read, energy: 1.88e-12, latency: 1 / (8 * 2048e9)}
    - {name: write, energy: 2.36e-12, latency: 1 / (8 * 1024e9)}
    tensors: {keep: ~MainMemory.tensors, may_keep: All}

  - !Memory
    name: LocalBuffer
    spatial: [{name: Z, fanout: 4, may_reuse: Nothing, min_usage: 1}]
    size: 1024*1024*4*8 # 4MB
    leak_power: 0
    area: 50e-6 # 50 mm^2. Very rough estimate based on die photo.
    actions:
    - {name: read, energy: 0.249e-12, latency: 0}
    - {name: write, energy: 0.293e-12, latency: 0}
    tensors: {keep: input | output}

  - !Compute
    name: ScalarUnit
    area: 10e-6 # 10 mm^2. Very rough estimate based on die photo.
    leak_power: 0
    actions:
    - {name: compute, energy: 0, latency: 1 / 1.05e9 / 128}
    enabled: len(All) == 2

  - !Fanout
    name: ArrayFanout
    spatial:
    - {name: reuse_input, fanout: 128, may_reuse: input, reuse: input, min_usage: 1}
    - {name: reuse_output, fanout: 128, may_reuse: output, reuse: output, min_usage: 1}

  - !Memory
    name: Register
    size: weight.bits_per_value if weight else 0
    area: 1e-11 # 10 um^2. Very rough estimate based on die photo.
    leak_power: 0
    actions:
    - {name: read, energy: 0, latency: 0}
    - {name: write, energy: 0, latency: 0}
    tensors: {keep: weight}

  - !Compute
    name: MAC
    leak_power: 0
    area: 9e-11 # 90 um^2. Very rough estimate based on die photo.
    actions:
    - {name: compute, energy: 0.084e-12, latency: 1 / 1.05e9}
    enabled: len(All) == 3

Flattening#

A given Einsum may be executed only on a single Compute, and it may use hardware objects between the root of the tree and the leaf for that Compute. Flattening an architecture converts a tree architecture into multiple parallel Flattened Architectures, each one representing one possible path from the root of the tree to the leaf for that Compute.

For example, in the architecture above, there are two compute units, the ScalarUnit and the MAC. Flattening this architecture will produce two Flattened Architectures: one with the ScalarUnit and one with the MAC. The partial mappings for each of these architectures can be combined, and can share hardware that exists above both compute units.

Inserting a Compute directly into the top-level architecture hierarchy creates an optional compute path that runs from the top node to that Compute. More complex topologies (e.g., giving an upper-level compute a private cache) can be created with sub-branches, as described in the Sub-Branches section.
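The TPU v4i example above already uses this pattern: the ScalarUnit is a Compute placed mid-hierarchy, above the ArrayFanout. A minimal sketch of the same idea follows (component names and cost values here are illustrative, and fields such as area and leak_power are omitted for brevity):

```yaml
arch:
  nodes:
  - !Memory
    name: Buffer        # Shared by both compute paths
    size: 1024*8
    actions:
    - {name: read, energy: 1e-12, latency: 0}
    - {name: write, energy: 1e-12, latency: 0}
    tensors: {keep: All}

  - !Compute
    name: VectorUnit    # Optional compute path ending here...
    actions:
    - {name: compute, energy: 1e-12, latency: 1e-9}

  - !Compute
    name: MAC           # ...or the path may continue to this Compute.
    actions:
    - {name: compute, energy: 0.1e-12, latency: 1e-9}
```

Flattening this sketch should yield two Flattened Architectures, one ending at VectorUnit and one ending at MAC, both sharing the Buffer above them.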

Sub-Branches#

Sub-branches in the architecture can represent different execution paths. The primary ~accelforge.frontend.arch.Arch class is a ~accelforge.frontend.arch.Hierarchical node, which represents a single hierarchy in which each node is the parent of the node that follows it. Additionally, a ~accelforge.frontend.arch.Fork can branch off from the main hierarchy to represent alternate compute paths. Forks may be written with the following syntax:

- !Memory
  ...

- !Memory
  ...

- !Fork
  nodes:
  - !Memory
    ...
  # This compute is the final node in the Fork. The Fork is terminated afterwards
  # (because we end the list), and the main hierarchy continues.
  - !Compute
    ...

# Continuing the main hierarchy
- !Memory
  ...

- !Compute
  ...

Spatial Fanouts#

Spatial fanouts describe the spatial organization of components in the architecture. Any component may have spatial fanouts, and fanouts are allowed in any dimension. While any Leaf node can instantiate spatial fanouts, it is often convenient to use the dedicated Fanout class.

When a fanout is instantiated, the given component and all of its children are duplicated in the given dimension(s). For example, in the TPU v4i architecture above, the LocalBuffer component has a size-4 spatial fanout in the Z dimension, meaning that there are 4 instances of the component. The Register component has the size-4 Z spatial fanout, as well as the two size-128 spatial fanouts in the reuse_input and reuse_output dimensions.

Reuse in spatial dimensions may be controlled with the may_reuse keyword, which takes a set expression (evaluated as described in the Set Expressions guide). In the example, nothing is reused spatially between LocalBuffer instances, while inputs and outputs are reused across registers in the reuse_input and reuse_output dimensions, respectively. Additionally, the reuse keyword can be used to force reuse; for example, reuse: input means that all spatial instances must use the same input values, otherwise the mapping will be invalid.
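To illustrate, the relevant spatial declarations from the example above can be read as follows (comments added here for explanation):

```yaml
# LocalBuffer: no tensor may be reused between the 4 Z instances,
# so each instance must work on disjoint data.
spatial: [{name: Z, fanout: 4, may_reuse: Nothing, min_usage: 1}]

# ArrayFanout: inputs MUST be reused across the reuse_input dimension,
# and outputs MUST be reused across the reuse_output dimension; a
# mapping that does not reuse them across these instances is invalid.
spatial:
- {name: reuse_input, fanout: 128, may_reuse: input, reuse: input, min_usage: 1}
- {name: reuse_output, fanout: 128, may_reuse: output, reuse: output, min_usage: 1}
```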

Spatial fanouts support the following keywords:

  • fanout: The size of this fanout.

  • loop_bounds: Bounds for loops over this dimension. This is a list of `Comparison` objects, all of which must be satisfied by the loops to which this constraint applies. Note: Loops may be removed if they are constrained to only one iteration.

  • may_reuse: The tensors that can be reused spatially across instances of this fanout. This expression will be evaluated for each mapping template.

  • min_usage: The minimum usage of spatial instances, as a value from 0 to 1. A mapping is invalid if less than this proportion of this dimension’s fanout is utilized. Mappers that support it (e.g., FFM) may, if no mappings satisfy this constraint, return the highest-usage mappings.

  • name: The name of the dimension over which this spatial fanout is occurring (e.g., X or Y).

  • power_gateable: Whether this spatial fanout has power gating. If True, then unused spatial instances will be power gated if not used by a particular Einsum.

  • reuse: A set of tensors or a set expression representing tensors that must be reused across spatial iterations. Only spatial loops that reuse ALL tensors given here may be placed. Note: Loops may be removed if they do not reuse a tensor given here and they do not appear in another loop bound constraint.

  • usage_scale: This factor scales the usage in this dimension. For example, if usage_scale is 2 and 10/20 spatial instances are used, then the usage will be scaled to 20/20.
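As a sketch, a single fanout entry combining several of these keywords might look like the following (the dimension name and all values are illustrative, not taken from the example above):

```yaml
spatial:
- name: X                # Dimension over which the fanout occurs
  fanout: 16             # 16 spatial instances in X
  may_reuse: input       # Only inputs may be reused across X instances
  min_usage: 0.5         # At least 8 of the 16 instances must be used
  power_gateable: True   # Instances unused by an Einsum are power gated
```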

Tensor Holders#

Tensor holders, which include Memory and Toll components, hold tensors.

A Memory is a TensorHolder that stores data over time, allowing for temporal reuse.

A Toll is a TensorHolder that does not store data over time, and therefore does not allow for temporal reuse. Instead, it acts as a toll that charges for reads and writes every time a piece of data moves through it. Every write to a Toll is immediately forwarded to the next Memory (which may be above or below, depending on where the write came from), and likewise for reads. The access counts of a Toll are only included in the “read” action: each traversal through the Toll is counted as a read, and writes are always zero.

Memory and Toll components support the following fields:

  • actions: The actions that this `TensorHolder` can perform.

  • area: The area of a single instance of this component in m^2. If set, area calculations will use this value.

  • area_scale: The scale factor for the area of this component. For example, if the area is 1 m^2 and the scale factor is 2, then the scaled area is 2 m^2.

  • bits_per_action: The number of bits accessed in each of this component’s actions. If set here, this acts as a default for all of this component’s actions, and is overridden by a bits_per_action set on an individual action.

  • bits_per_value_scale: A scaling factor for the bits per value of the tensors in this `TensorHolder`. If this is a dictionary, keys in the dictionary are evaluated as expressions and may reference one or more tensors.

  • component_class: The class of this `Component`. Used if an energy or area model needs to be called for this `Component`.

  • component_model: The model to use for this `Component`. If not set, the model will be found with `hwcomponents.get_models()`. If set, the `component_class` will be ignored.

  • component_modeling_log: A log of the energy and area calculations for this `Component`.

  • enabled: Whether this component is enabled. If the expression resolves to False, then the component is disabled. This is evaluated per-pmapping-template, so it is a function of the tensors in the current Einsum. For example, with `len(All) >= 3`, the component will only be enabled for Einsums with three or more tensors.

  • energy_scale: The scale factor for dynamic energy of this component. Multiplies the calculated energy of each action.

  • extra_attributes_for_component_model: Extra attributes to pass to the component model. In addition to all attributes of this component, any extra attributes will be passed to the component model. This can be used to define attributes that are known to the component model, but not accelforge, such as the technology node.

  • latency_scale: The scale factor for the latency of this component. Multiplies the calculated latency of each action. For example, if an action’s latency is 1 ns and the scale factor is 2, then its scaled latency is 2 ns.

  • leak_power: The leak power of a single instance of this component in W. If set, leak power calculations will use this value.

  • leak_power_scale: The scale factor for the leak power of this component. This is used to scale the leak power of this component. For example, if the leak power is 1 W and the scale factor is 2, then the leak power is 2 W.

  • n_parallel_instances: The number of parallel instances of this component. Increasing parallel instances will proportionally increase area and leakage, while reducing latency (unless latency calculation is overridden).

  • name: The name of this `Component`.

  • spatial: The spatial fanouts of this `Leaf`. Spatial fanouts describe the spatial organization of components in the architecture. A spatial fanout of size N for this node means that there are N instances of this node. Multiple spatial fanouts lead to a multi-dimensional fanout. Spatial constraints apply to the data exchange across these instances. Spatial fanouts specified at this level also apply to lower-level `Leaf` nodes in the architecture.

  • tensors: Fields that control which tensor(s) are kept in this `TensorHolder` and in what order their nodes may appear in the mapping.

  • total_area: The total area of all instances of this component in m^2. Do not set this value. It is calculated when the architecture’s area is calculated.

  • total_latency: An expression representing the total latency of this component in seconds. This is used to calculate the latency of a given Einsum. The following special variables are available:

      - `min`: The minimum value of all arguments to the expression.
      - `max`: The maximum value of all arguments to the expression.
      - `sum`: The sum of all arguments to the expression.
      - `X_actions`: The number of times action `X` is performed. For example, `read_actions` is the number of times the read action is performed.
      - `X_latency`: The total latency of all actions of type `X`. For example, `read_latency` is the total latency of all read actions; it is equal to the per-read latency multiplied by the number of read actions.
      - `action2latency`: A dictionary of action names to their latencies.

    Additionally, all component attributes are available as variables, as are all other functions generally available in parsing. Note that this expression is evaluated after other component attributes are evaluated. For example, the following expression calculates latency assuming that each read or write action takes 1 ns: `1e-9 * (read_actions + write_actions)`.

  • total_leak_power: The total leak power of all instances of this component in W. Do not set this value. It is calculated when the architecture’s leak power is calculated. If instances are power gated, actual leak power may be less than this value.
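As a sketch, several of these fields could be combined on a Memory as follows. All values are illustrative, and `technology` is a hypothetical attribute assumed to be known to the component model, not a field defined by accelforge:

```yaml
- !Memory
  name: SRAM
  size: 1024*1024*8          # 1MB
  n_parallel_instances: 2    # Doubles area/leakage, halves latency
  energy_scale: 1.1          # Multiplies each action's energy
  latency_scale: 0.9         # Multiplies each action's latency
  bits_per_action: 64        # Default bits accessed per action
  enabled: len(All) >= 2     # Disabled for single-tensor Einsums
  extra_attributes_for_component_model:
    technology: "16nm"       # Hypothetical attribute for the model
  actions:
  - {name: read, energy: 1e-12, latency: 1e-9}
  - {name: write, energy: 1e-12, latency: 1e-9}
  tensors: {keep: All}
```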

Additionally, Memory objects include:

  • size: The size of this `Memory` in bits.

Toll objects also include:

  • direction: The direction in which data flows through this `Toll`. If “up”, then data flows from the `TensorHolder` below, through this `Toll` (paying the associated costs), and then to the next `TensorHolder` above it. Other data movements are assumed to avoid this Toll.
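The example architecture above contains no Toll, so the following is a sketch assuming a Toll is declared like the other component types (the `!Toll` tag, name, and costs are illustrative): a quantization unit between two memories that charges for each value passing upward through it.

```yaml
- !Toll
  name: Quantizer
  direction: up            # Data below pays the toll on its way up
  actions:
  # Only the read action is charged: each traversal counts as a read.
  - {name: read, energy: 0.5e-12, latency: 0}
  # Write accesses of a Toll are always zero.
  - {name: write, energy: 0, latency: 0}
  tensors: {keep: output}
```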

TensorHolder components also have a tensors field, which defines the tensors held by the component. It is represented by the Tensors class, which supports the following fields:

  • back: A set expression describing which tensors must be backed by this `TensorHolder`. If this is not defined, then no tensors must be backed.

  • force_memory_hierarchy_order: If set to true, storage nodes for lower-level memories must be placed below storage nodes for higher-level memories. For example, all MainMemory storage nodes must go above all LocalBuffer storage nodes. This constraint always applies to same-tensor storage nodes (e.g., MainMemory reusing Output must go above LocalBuffer reusing Output); turning it off will permit things like MainMemory reusing Output going above LocalBuffer reusing Input. This is identical to the `force_memory_hierarchy_order` field in the `FFM` class, but only applies to this tensor holder.

  • keep: A set expression describing which tensors must be kept in this `TensorHolder`. If this is not defined, then all tensors must be kept. Any tensors that are in `back` will also be added to `keep`.

  • may_keep: A set expression describing which tensors may optionally be kept in this `TensorHolder`. The mapper will explore both keeping and not keeping each of these tensors. If this is not defined, then all tensors may be kept.

  • no_refetch_from_above: The tensors that are not allowed to be refetched from above. This is given as a set of `TensorName` objects or a set expression that resolves to them. These tensors must be fetched at most one time from above memories, and may not be refetched across any temporal or spatial loop iterations. Tensors may be fetched in pieces (if they do not cause re-fetches of any piece).

  • tensor_order_options: Options for the order of tensor storage nodes in the mapping. This is given as a list-of-lists-of-sets. Each list-of-sets is a valid order of tensor storage nodes. Order is given from highest in the mapping to lowest. For example, an option could be [input | output, weight], which means that there is no relative ordering required between input and output, but weight must be below both.

  • tile_shape: The tile shape for each rank variable. This is given as a list of `Comparison` objects, where each comparison must evaluate to True for a valid mapping.
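Putting several of these fields together, a tensors entry might look like the following sketch (tensor names follow the example architecture above; values are illustrative):

```yaml
tensors:
  keep: input | output          # Must be kept in this TensorHolder
  may_keep: All                 # Mapper may explore keeping others too
  back: output                  # Must be backed here (also added to keep)
  no_refetch_from_above: input  # Inputs fetched at most once from above
  # Two valid orders: weight below or above the others. Within each
  # option, input and output have no required relative order.
  tensor_order_options:
  - [input | output, weight]
  - [weight, input | output]
```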