Many-Task Computing with Python

Monte Lunacek
Thomas Hauser

University of Colorado Boulder

https://github.com/mlunacek/data_science_meetup_2013
https://bit.ly/17KaVl3

Outline

  1. High Performance Computing
    • Types of applications
    • Defining many-task computing
  2. Why Python for HPC?
    • Key packages for scientific computing
    • Abstraction
  3. Many-task computing with Python
    • Scaling
  4. Conclusions and questions

Goals

  1. Understand the landscape of HPC
  2. Links to key packages
  3. Tools
    • Notebook
    • Easy parallelism
  4. Lots of examples
    • Learn something useful
    • Download and try it!

High Performance Computing

Using applications to solve problems

  1. Size
    • Solve problems that can't fit on a laptop
    • Need more than a few GB of RAM
    • Need more than a few hundred GB of disk
  2. Speed
    • Solve the same problem, faster
    • Make a bigger problem feasible

(Henry Neeman, Supercomputing in Plain English)

Supercomputing

[Image: InfiniBand cluster interconnect]
  • Definition changes daily!
  • Cluster of computers linked together
  • 100x bigger, faster, better than a PC

Janus vs. MacBook

[Image: the Janus supercomputer]

          MacBook            Janus (per node)     Janus (total, 1360 nodes)
CPU       2.4 GHz Intel x2   2.8 GHz Intel x12    ~8000x the cores
RAM       8 GB               24 GB                ~4000x the RAM
Cache     3 MB               12 MB

Parallel Computing

Traditional
  • Shared memory: OpenMP
  • Distributed: MPI
  • Accelerator: OpenACC, CUDA
  • Hybrid combinations of the above

New
  • MapReduce
  • Message brokers: AMQP, ZeroMQ

The newer approaches:
  • Solve a different problem
  • Offer a different set of characteristics

Landscape of Applications

Ioan Raicu, Many-Task Computing: Bridging the Gap between High Throughput Computing and High Performance Computing

Comparing MTC and HTC

"Applications that are communication-intensive but are not naturally expressed in MPI." (Ioan Raicu)


Python for High Performance Computing

Efficiency
  • PyPy
  • numba
  • f2py
  • cython

Parallel
  • IPython Parallel
  • celery
  • scoop
  • and many more...

(Andy Terrel, Getting Started with Python in HPC)

Success Stories

~500,000 simulations on ~7,000 cores with mpi4py

Parameter optimization on ~100 cores with Scoop and DEAP

Improved biological workflow with IPython Parallel

Wrapped an engineering simulation with f2py and IPython Parallel

Packaged multiple MPI tasks with Jinja2

Benchmarking: mpi4py, pandas, Jinja2, Django

Working on MapReduce with Disco and Spark

Working on workflows with NetworkX, IPython Parallel, and Scoop

Why Python?

NumPy (built against MKL) and PyTables

import sys
import argparse
import numpy as np
import tables as tb

def get_args(argv):
    # the two HDF5 input paths
    parser = argparse.ArgumentParser()
    parser.add_argument('--matrixA')
    parser.add_argument('--matrixB')
    return parser.parse_args(argv)

def read_h5(filename):
    # read the matrix stored at /x in an HDF5 file
    h5 = tb.openFile(filename, mode="r")   # tb.open_file in newer PyTables
    X = h5.root.x.read()
    h5.close()
    return X

def write_h5(filename, C):
    # write the result matrix to /c
    h5 = tb.openFile(filename, mode="w")
    h5.createArray(h5.root, 'c', C)        # tb.create_array in newer PyTables
    h5.close()

if __name__ == '__main__':
    args = get_args(sys.argv[1:])
    A = read_h5(args.matrixA)
    B = read_h5(args.matrixB)
    C = np.dot(A, B)                       # MKL-backed matrix multiply
    write_h5('matrixC.h5', C)              # illustrative output path

2 reads, multiply, 1 write

mpi4py

# broadcast a Python object from rank 0 to every rank
# run with e.g.: mpiexec -n 4 python bcast.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # only the root rank builds the data
    data = {'key1': [7, 2.72, 3.2],
            'key2': ('abc', 'xyz')}
else:
    data = None

# every rank returns with its own copy of data
data = comm.bcast(data, root=0)
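
Broadcast is a collective operation; for many independent tasks, mpi4py can also serve as a simple task farm. The sketch below is hypothetical (it is not the benchmark code from the paper): rank 0 scatters one chunk of task parameters to each rank, every rank works through its chunk, and rank 0 gathers the results.

# task_farm.py -- a minimal, hypothetical mpi4py task farm
# run with e.g.: mpiexec -n 4 python task_farm.py
from mpi4py import MPI
import time

def work(x):
    # stand-in task: sleep for x seconds
    time.sleep(x)
    return x

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    tasks = [0.1 * i for i in range(100)]           # illustrative parameters
    chunks = [tasks[i::size] for i in range(size)]  # one chunk per rank
else:
    chunks = None

chunk = comm.scatter(chunks, root=0)   # each rank receives its chunk
local = [work(x) for x in chunk]       # run the tasks locally
results = comm.gather(local, root=0)   # collect all chunks on rank 0

if rank == 0:
    flat = [r for c in results for r in c]
    print("%d tasks completed" % len(flat))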

Many-Task Computing

#!/usr/bin/env python
import argparse, sys, time

def get_args(argv):
    parser = argparse.ArgumentParser()  
    parser.add_argument('-t','--time', help='e.g. --time=50')
    return parser.parse_args(argv)

def work(x):
    # simulate a task that takes x seconds
    time.sleep(x)
    return x

if __name__ == '__main__':

    args = get_args(sys.argv[1:])
    work(float(args.time))

What is the most efficient way to execute work() a thousand times?

python work.py --time=10
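
On a single node, the standard library's multiprocessing module is the simplest answer (it appears in the comparison later as the single-node option). A minimal sketch; the pool size and task list are illustrative:

# run work() a thousand times in parallel on one node
from multiprocessing import Pool
import time

def work(x):
    # stand-in task: sleep for x seconds
    time.sleep(x)
    return x

if __name__ == '__main__':
    tasks = [0.1] * 1000            # a thousand small tasks
    pool = Pool(processes=12)       # e.g. one worker per core
    results = pool.map(work, tasks)
    pool.close()
    pool.join()
    print("%d tasks completed" % len(results))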

Approach

  • Scheduler: Condor / Moab / SLURM
  • Bash with pbsdsh: a painful example
  • MPI (mpi4py): not always what we want
  • Python message queues (celery, IPython Parallel)
  • and many, many more...

Bash/pbsdsh

#!/bin/bash
# wrapper.sh: runs on every core that pbsdsh reaches;
# PBS_VNODENUM plus the offset in $1 gives each copy a unique trial id
PATH=$PBS_O_WORKDIR:$PBS_O_PATH
TRIAL=$(($PBS_VNODENUM + $1))
python work.py --time=5

#!/bin/bash
# launch script: one pbsdsh round per iteration, advancing the
# offset by the 12 cores used in each round
count=0
for i in $(seq 1 $N)    # N rounds of tasks
do
  pbsdsh wrapper.sh $count
  count=$(($count + 12))
done

A little painful

A little inefficient

Message queues

Trade-offs: elasticity, memory, fault-tolerance

Examples: celery (AMQP) and IPython Parallel (ZeroMQ), sketched below.
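
As a concrete illustration, here is a minimal IPython Parallel sketch using the 2013-era IPython.parallel API (the package is called ipyparallel in current releases). It assumes a cluster was already launched, e.g. with ipcluster start -n 12:

# distribute work() over a running IPython cluster
from IPython.parallel import Client    # `ipyparallel` in newer releases

def work(x):
    import time     # import on the engine, not just the client
    time.sleep(x)
    return x

rc = Client()                      # connect to the controller
view = rc.load_balanced_view()     # dynamic, fault-tolerant scheduling
result = view.map(work, [0.1] * 1000)
print("%d tasks completed" % len(result.get()))   # block until done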

Examples

Weak Scaling

  • Compare mpi4py, IPython Parallel, and Celery
  • What recommendations can we offer?
  • Best-case scenario
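
For comparison, a Celery version of the same task might look like the sketch below; the broker URL is illustrative, and an AMQP broker (e.g. RabbitMQ) must already be running. Workers are started with something like celery -A tasks worker.

# tasks.py -- a minimal, hypothetical Celery task
from celery import Celery
import time

app = Celery('tasks',
             broker='amqp://localhost//',   # illustrative broker URL
             backend='amqp')                # store results in the broker

@app.task
def work(x):
    # stand-in task: sleep for x seconds
    time.sleep(x)
    return x

# a client then submits tasks asynchronously and collects results:
#   results = [work.delay(0.1) for _ in range(1000)]
#   print(sum(r.get() for r in results))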

Weak scaling holds the work per core fixed, so ideally the total time stays flat as cores are added and efficiency T(1)/T(N) stays near 1.

[Figure: initialization time]

[Figure: weak scaling time]

[Figure: weak scaling efficiency]
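
A one-line helper makes that efficiency definition concrete (the function name is illustrative):

def weak_scaling_efficiency(t1, tn):
    # E(N) = T(1) / T(N): near 1.0 when adding cores does not
    # slow the fixed per-core workload down
    return t1 / tn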

Compare

mpi4py            many cores, not fault-tolerant
IPython Parallel  ~100 cores, fault-tolerant
IPython Parallel  many cores, fault-tolerant, consistent times
Celery            many cores, fault-tolerant, variable times
multiprocessing   single node only

  • IPython Parallel and Celery require launching workers
  • All provide user-level abstraction

Conclusions

Python is an excellent way to manage MTC jobs

Python provides great abstraction

IPython Parallel and Celery are both solid choices

Moving forward

References

Paper:
  Scaling of Many-Task Computing Approaches in Python on Cluster Supercomputers.
  Monte Lunacek et al., IEEE Cluster 2013.

Tutorials:
  University of Colorado Computational Science and Engineering

Slides and code:
  https://github.com/mlunacek/data_science_meetup_2013

Thanks!