259x Filetype PDF File size 3.12 MB Source: htor.inf.ethz.ch
Productivity, Portability, Performance: Data-Centric Python
Alexandros Nikolaos Ziogas, Timo Schneider, Tal Ben-Nun, Alexandru Calotoiu, Tiziano De
Matteis, Johannes de Fine Licht, Luca Lavarini, and Torsten Hoefler
Department of Computer Science, ETH Zurich
Switzerland
ABSTRACT
Python has become the de facto language for scientific computing.
ProgramminginPythonishighlyproductive,mainlyduetoitsrich
science-oriented software ecosystem built around the NumPy mod-
ule.Asaresult,thedemandforPythonsupportinHighPerformance
Computing(HPC)hasskyrocketed. However, the Python language
itself does not necessarily offer high performance. In this work, we
present a workflow that retains Python’s high productivity while
achieving portable performance across different architectures. The
workflow’s key features are HPC-oriented language extensions and
a set of automatic optimizations powered by a data-centric inter-
mediate representation. We show performance results and scaling
across CPU, GPU, FPGA, and the Piz Daint supercomputer (up to
23,328 cores), with 2.47x and 3.75x speedups over previous-best
solutions, first-ever Xilinx and Intel FPGA results of annotated 1.7x
Python, and up to 93.16% scaling efficiency on 512 nodes.
Figure 1: Data-Centric Python overview.
CCSCONCEPTS
· Softwareanditsengineering→Parallelprogramminglan- Asaresult, the scientific community now pushes to also make
guages; Distributed programming languages; Data flow lan- Python the language for writing high-performance code. For most
guages; Source code generation. scientific applications, NumPy arrays [36] provide the core data
KEYWORDS structures, interfaces, and BLAS and LAPACK library interoper-
Data-Centric, High Performance Computing, Python, NumPy ability. NumPy is optimized to provide efficient data structures
ACMReferenceFormat: and fast library implementations for many common operations.
Alexandros Nikolaos Ziogas, Timo Schneider, Tal Ben-Nun, Alexandru However, the performance benefits of NumPy are tied to optimized
Calotoiu, Tiziano De Matteis, Johannes de Fine Licht, Luca Lavarini, and methodcalls and vectorized array operations, both of which evap-
Torsten Hoefler. 2021. Productivity, Portability, Performance: Data-Centric orate in larger scientific codes that do not adhere to these con-
Python . In The International Conference for High Performance Computing, straints. Therefore, there is a significant performance gap between
Networking, Storage and Analysis (SC ’21), November 14ś19, 2021, St. Louis, the numpythonic code-style and general code written by domain
MO, USA. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/ scientists.
3458817.3476176 InHPC,thethreePs(Productivity,Portability,Performance)
1 INTRODUCTION aredrivingrecentdevelopmentsininfrastructureandprogramming
model research to ensure sustainability [14, 53]. In Python, this
Python is the language to write scientific code [30]. The capabil- drive has resulted in many optimized domain-specific libraries and
ity to write and maintain Python code with ease, coupled with a frameworks [2, 17, 46, 61, 72]. Simultaneously, the diversity of the
vast number of domain-specific frameworks and libraries such as hardware landscape motivated the creation of interface libraries,
SciPy, Matplotlib, scikit-learn [62], or pandas [74], leads to high such as CuPy [56], which provides replacements to NumPy opera-
productivity. It also promotes collaboration with reproducible sci- tions for NVIDIA and AMD GPUs, and MPI4PY [22], which offers
entific workflows shared using Jupyter notebooks [42]. Therefore, direct MPI bindings. Lower-level interfaces, such as Cython [12],
numerousscientific fields, ranging from machine learning [2, 61] promisehighperformanceatthecostofwritingcodethatresembles
to climate [72] and quantum transport [75] have already adopted C, and lazy evaluation of array operations [10, 43, 61] that enable
Python as their language of choice for new developments. high-performance runtime systems. Furthermore, a variety of JIT
compilers [17, 34, 45] address the performance degradation result-
SC’21, November 14ś19, 2021, St. Louis, MO, USA ing from the interpreter. Last but not least, runtime systems [10],
©2021Association for Computing Machinery. distributed tasking (Dask) [23], and remote procedure calls [52] fur-
This is the author’s version of the work. It is posted here for your personal use. Not
for redistribution. The definitive Version of Record was published in The International ther scale Python to distributed systems. Despite the abundance of
Conference for High Performance Computing, Networking, Storage and Analysis (SC ’21), choices, Python still struggles: while each approach works towards
November 14ś19, 2021, St. Louis, MO, USA, https://doi.org/10.1145/3458817.3476176. one or more of the Ps, none of them supports all at the same time.
SC’21, November 14ś19, 2021, St. Louis, MO, USA Ziogas et al.
WeproposeawaytobridgethegapbetweenthethreePsfor Python is an imperative language and, therefore, not designed
Python programming using a data-centric paradigm. In particular, to express data movement. Its terseness makes the process of un-
weempowerPythonuserswithanautomaticoptimizationandspe- derstanding dataflow difficult, even when comparing to other lan-
cialization toolbox, which spans the entire Python/HPC ecosystem guages like C and FORTRAN, as the types of variables in Python
(Fig. 1) Ð from the code, through communication distribution, to code cannot be statically deduced.
hardware mapping. At the core of the toolbox, we use the State- Below, we define high-performance Python programs, discuss
ful Dataflow multiGraphs (SDFG) [13] data-centric intermediate the decorators that we must add to Python code to make the
representation, which enables these optimizations in the form of dataflow analyzable, and then detail how we translate them into
multi-level data movement transformations. With a data-centric the SDFG data-centric intermediate representation.
representation, as opposed to library bindings, all data dependen-
cies and potential overlap are inferred statically from the code, and 2.1 HighPerformancePython
interpreter overhead is mitigated. Compared with implicit and lazy Ourapproachsupports a large subset of the Python language that
evaluation approaches, we also provide a set of extensions that is important for HPC applications. The focus lies on NumPy ar-
give power users complete control over parallelism and partition- rays [36] and operations on such arrays. In addition to the low-
ing schemes, using pythonic principles and syntax (e.g., returning overhead data structures NumPy offers, it is central to many frame-
łlocal viewž objects from global data, but allowing users to operate works focused on scientific computing, e.g., SciPy, pandas, Mat-
onthełglobal viewž as well). plotlib.Asopposedtolazyevaluationapproaches,high-performance
Wedemonstrateawidevarietyofbenchmarksusingtheauto- Python must take control flow into account to auto-parallelize and
matic toolbox over annotated Python code, both on individual avoid interpreter overhead. This tradeoff between performance
nodes and the Piz Daint supercomputer. For the former, we show and productivity is necessary because Python features such as co-
that it consistently outperforms other automatic approaches on routines are not statically analyzable and have to be parsed as
multicore CPUs and GPUs, and for the first time show automatic łblack-boxesž. To combat some of these Python quirks, we propose
Python HPCcompilation results for both Xilinx and Intel FPGAs, to augmentthelanguagewithanalyzableconstructsusefulforHPC.
whichvastly differ in architecture and programming language. In
distributed-memoryenvironments,weshowhighscalingefficiency 2.2 AnnotatingPython
andabsoluteperformancecomparedwithdistributedtasking.Thus, TheData-Centric (DaCe) Python frontend parses Python code and
werealize all three Ps within a single system. convertsittoSDFGsonaper-functionbasis.Thefrontendwillparse
Thepapermakesthefollowingcontributions: only the Python functions that have been annotated explicitly by
• (Productivity) Definition of high-performance Python, a theuserwiththe@dace.programdecorator.DaCeprogramscanthen
methodology to translate it to a data-centric IR, and exten- be called like any Python function and perform Just-in-Time (JIT)
sions to improve said conversion via explicit annotation. compilation.
• (Portability)AsetofautomaticoptimizationsforCPU,GPU Static symbolic typing. To enable Ahead-of-Time (AOT) com-
andFPGA,outperformingthebestpriorapproachesby2.47× pilation, which is of key importance for FPGAs and for reusing
onCPUand3.75×onGPUonaverage(geometricmean[1]). programs across different inputs, SDFGs should be statically typed.
• (Performance)AutomaticimplicitMPItransformationsand Therefore, the function argument data types are given as type an-
communicationoptimizations,aswellasexplicitdistribution notations, providing the required information as shown below:
management, with the former scaling to 512 nodes with up
to 93.16% efficiency. N = dace.symbol()
@dace.program
def jacobi_1d(TSTEPS: dace.int32,
2 DATA-CENTRICPYTHON A: dace.float64[N],
Thecentral tenet of our approach is that understanding and opti- B: dace.float64[N]):
mizing data movement is the key to portable, high-performance for t in range(1, TSTEPS):
code. In a data-centric programming paradigm, three governing B[1:-1] = 0.33333 * (A[:-2]+A[1:-1]+A[2:])
principles guide development and execution: A[1:-1] = 0.33333 * (B[:-2]+B[1:-1]+B[2:])
(1) Data containers must be separate from computations. ThePythonmethodjacobi_1dhasthreearguments; TSTEPS is a 32-
(2) Data movement must be explicit, both from data containers bit integer scalar, while A and B are double precision floating-point
to computations and to other data containers. vectors of length N. The symbolic size N, defined with dace.symbol,
(3) Control flow dependencies must be minimized, they shall indicates that the vector sizes can be dynamic (but equal). All sub-
only define execution order if no implicit dataflow is given. sets are then symbolically defined (e.g., the subset B[1:-1] becomes
B[1:N-1], and symbolic manipulation can then be performed in
In the context of SDFGs, examples of data containers are arrays subsequent data-centric transformations.
andscalar data, which have a NumPy-compatible data type, such Parametric parallelism. An important feature that has no direct
as int32 or float64. expressioninPythonisaloopthatcanruninparallel.Ourapproach
Productivity, Portability, Performance: Data-Centric Python SC’21, November 14ś19, 2021, St. Louis, MO, USA
Python SDFGEquivalent A[0: , 0: ] C[0:, 0: ]
Declarations and Types | alpha |
, ∈0.. − 1 ∧ ∈ 0. . − 1 , ∈0.. − 1 ∧ ∈ 0. . − 1
Primitive data types Scalar data container alpha A[, ] C[, ]
out = inp1 * inp2 out = inp
NumPyarray Array data container tmp0[, ] alpha (+)
Assignments tmp0[0:, 0: ] alpha (+)
Assignments Tasklet or map scope with incoming
andoutgoingmemletsforread/writ- (a) Element-wise array operation (b) Augmented assignment with
ten operands tmp0 = alpha * A. WCR.
Array subscript Memlet(dataflow edge) Figure 2: SDFG representations
Statements
Branching (if) Branch conditions on state transi-
tion edges Our first pass traverses the Python AST to simplify the expres-
Iteration (for) Conditions, increments on state sions into individual steps, similar to static single assignment [7],
transition edges respecting order of operations:
Control-flow Edgetocontrolstructureorfunction tmp0 = alpha * A
(break, continue, return) exit state tmp1 = tmp0 @ B
tmp2 = beta * C
Functions C = tmp1 + tmp2
Functioncalls(withsource)and Nested SDFG for content, memlets The first step in the above code multiplies each element of A
decorator for argument types reduce shape of inputs and outputs with alpha. SDFGs view data containers separately from the com-
External/Library calls Tasklet with callback or Library putations the data are part of, as per the first data-centric tenet.
Node These containers are represented by oval Access nodes. In the first
Table 1: Mapping of Python syntax and constructs to SDFG. statement, these refer to tmp0, alpha, and A (see Fig. 2a).
In SDFGs,connectionstodatacontainersarecalledmemlets,and
they describe the data movement Ð the edge direction indicates
supports explicit parallelism declaration through map scopes, simi- whetheritisreadorwritten,anditscontentsrefertothepartofthe
larly to an N-dimensional parallel for. There are two ways to data container that is accessed. Computations consume/produce
take advantage of this feature. DaCe provides the dace.map iterator, memlets and can be divided into multiple types:
which can be used in Python code as a substitute to the Python (1) Stateless computations (Tasklets, shown as octagons), e.g.,
built-in range iterator and generates a map scope when parsed: representing scalar assignments such as a = 1.
for i, j in dace.map[0:M, 0:N]: (2) Calls to external libraries (Library Nodes, folded rectangles),
A[i, j] = B[j, i] that represent calls to functions that are not in the list of
functions decorated with dace.program. Matrix-matrix mul-
Alternatively, the DaCe framework provides a LoopToMap transfor- tiply is a common and important operation tmp1 = tmp0 @ B
mation that detects for-loops in the IR, whose iterations can be andis contained in a Library Node called MatMul.
executed safely in parallel (using symbolic affine expression analy- (3) Calls to other SDFGs (Nested SDFGs, rectangles), which rep-
sis), and converts them to map scopes automatically. resent calls to functions decorated with dace.program.
(4) Maps are a particular type of Nested SDFGs matching the
2.3 FromPythontoDaCe language augmentation previously discussed in Section 2.2
WeturntopresenttheSDFGintermediaterepresentation(IR)anda andexpress that the content can be processed in parallel.
novel data-centric Python translation procedure in tandem. While Element-wise array operations automatically yield Map scopes.
previous work [13] converted a restricted, low-level Python def- Similarly, assignments to arrays yield a Map scope, containing a
inition of the SDFG IR, here we aim to cover the majority of the Taskletwiththeelement-wiseassignment.Augmentedassignments
Python/NumPylanguageconstructsviastatic analysis and fallback such as C += 1 are a special case where the output is also an input
for unsupported features. We summarize the equivalence between whennodataracesaredetected.
Python constructs and SDFG counterparts in Table 1, and present Parallel maps can be augmented to express how Write-Conflict
the generation of an SDFG from a Python program using the gemm Resolution (WCR)shoulddeterminethevalueofdatawhenmultiple
kernel as an example: sourceswritetoitconcurrently.Ifdataracesarefound,theoutgoing
edges are marked as dashed. E.g., the following program requires
@dace.program WCR(SDFGrepresentationisshowninFig.2b):
def gemm(alpha, beta, C, A, B):
C[:] = alpha * A @ B + beta * C for i, j in dace.map[0:NI, 0:NJ]:
alpha += C[i, j]
SC’21, November 14ś19, 2021, St. Louis, MO, USA Ziogas et al.
Byconnecting the inputs and outputs of computations with the A[0:, 0: ]
containers explicitly, the data-centric representation of a program | alpha
, ∈0.. − 1 ∧ ∈ 0. . − 1 alpha A[0:, 0: ]
can be created. To represent control dependencies, we encapsulate alpha A[, ] , | ∈0.. − 1 ∧ ∈ 0. . − 1
codedrivenbypuredatadependenciesinStates(brightblueregions) out = inp1 * inp2 alpha A[, ]
in the SDFG. The states are connected with State Transition Edges tmp0[, ] StateFusion out = inp1 * inp2
(in blue) that can be annotatedwithcontrolflow,representingloops, tmp0[0:, 0: ] tmp0[, ]
branch conditions, and state machines in general. The following tmp0[0:, 0: ]
Python program is represented by the SDFG in Fig. 3: tmp0[0:, 0: ] B[0:, 0: ] tmp0[0:, 0: ] B[0:, 0: ]
MatMult
for i in range(NI): MatMult tmp1[0:, 0: ]
C[i] += 1 tmp1[0:, 0: ]
=0 +=1 C[] Figure4:Statefusionoftmp0 = alpha * Aandtmp1 = tmp0 @ B.
out = inp + 1
≥ < C[] coarseningpasshappensautomaticallyaspartofourproposedtool-
box, one can also apply transformations manually and separately,
Figure 3: SDFG representation of a Python for-loop. There without changing the original Python source code (we color such
are twostates (left: guard, right: body) connected by control łperformance engineering codesž in cyan):
flow(state transition) edges that define the loop range.
sdfg = gemm.to_sdfg()
TheaboveloopalsoisanexcellentusecasefortheLoopToMaptrans- sdfg.apply(StateFusion)
formation (Section 2.2) since its iterations are independent.
In the conversion to the SDFG IR, we also replace calls to library 2.5 PythonRestrictions
functions (e.g., np.linalg.solve) and object methods (A.view())
with custom subgraphs or Library Nodes, which users can extend Somefeatures available in Python are incompatible with our defi-
for other libraries and object types via a decorated function. Fol- nition of high performance Python, and are discussed below. This
lowing this initial conversion, the resulting SDFG contains a state doesnotexcludeprogramsusingthefullfeaturesetofPythonfrom
per statement and a nested SDFG per function call. analysis, but calls to functions containing unsupported features
will not benefit from our optimization. The restricted features are:
2.4 DataflowOptimization (1) Python containers (lists, sets, dictionaries, etc.) other than
Thedirect translation to SDFGs creates a control-centric version NumPy arrays as arguments, as they are represented by
of the code, which matches the Python semantics but does not linked lists and difficult to analyze from a dataflow view.
contain dataflow beyond a single statement (similarly to compilers’ Note that this does not preclude internal containers (which
-O0). To mitigate this, we run a pass of IR graph transformations can be analyzed) and list comprehensions.
that coarsens the dataflow, exposing a true data-centric view of the (2) Dynamictyping:fixedstructsareallowedinSDFGs,sofields
code (similar to -O1). The transformations include redundant copy can be transformed and their class methods into functions
removals, inlining Nested SDFGs, and others (14 in total), which decorated with the dace.program decorator. However, dy-
onlymodifyorremoveelementsinthegraph,suchthattheycannot namic changes to classes or dynamic types are unsupported
be applied indefinitely. as their structure cannot be statically derived.
Tounderstandthispass,weshowcaseonetransformationÐstate (3) Control-dependent variable state (i.e., no scoping rules), e.g.,
fusion Ð which allows merging two states where the result does the following valid Python:
not produce data races. For example, the two states containing the
assignments below can be merged: x = ...
if x > 5:
tmp0 = alpha * A y = np.ndarray([5, 6], dtype=np.float32)
tmp1 = tmp0 @ B # use y (will raise exception if x <= 5)
Internally, the transformation matches a subgraph pattern of two (4) Recursion. This is a limitation of the data-centric program-
states connected together and compares source and sink Access mingmodel[13], as recursion is a control-centric concept
nodes using symbolic set intersection. If no data dependency con- that is not portable across platforms.
straints are violated, Access nodes are either fused (if they point to
the same memory, see Fig. 4) or set side by side, creating multiple After performing the full translation and coarsening, the re-
connected components that can run in parallel. sulting gemm kernel can be seen in Figure 5 (left-hand side). This
All transformations in the DaCe framework follow the same data-centric representation can now use the SDFG IR capabilities
infrastructure, either matching a subgraph pattern or allowing to further optimize and map the original Python code to different
the user to choose an arbitrary subgraph [13]. While the dataflow architectures and distributed systems.
no reviews yet
Please Login to review.