[SLUG] Jon "maddog" Hall in Orlando last Friday

From: Robert Foxworth (rfoxwor1@tampabay.rr.com)
Date: Sun Mar 09 2003 - 23:23:22 EST


Summary of Jon "maddog" Hall's (JH) address to LEAP/MLUG on Friday
07 March. Transcription by R. Foxworth, who is responsible for any
errors and omissions. Most but not all of the points presented
were captured while taking notes by ear. Please do not ask me
questions about content. I hope all find this informative and useful.
---------------------------------------------------------------------
Jon Hall, Executive Director of Linux International, spoke in Orlando.
Linux is a TM of Linus Torvalds. Unix is a TM of X/Open. Discussion
of the SCO-IBM legal matter. Schools should be teaching assembler
and hardware; JH believes schools have a duty to teach this. Example:
a discussion of "recursive reentrant" and a general lack of
understanding of this principle - a reason that basics should be
taught in schools but now are not.

Progression of Linux: Desktop system / Beowulf 1994-5 / Small midrange
servers 1996 / Embedded systems 2000 / Commodity-based NUMA 2003 /
Desktop 2003-4. Early on, Linux would be put on older systems when
hardware upgrades were made (the new machines running MS); thus it had
little value to mfgrs. Oct 98: Informix ported its database to Linux;
Oracle had announced one day earlier that it would do so "in future,"
pre-empting the news. By 10/98 all major database systems supported
Linux. Embedded systems could use Linux by 2000. In 2003: NUMA
(Non-Uniform Memory Architecture) transition from proprietary to
commodity-based hardware.

Why Beowulf? Supercomputers expensive, mfgrs failing. S/W is
expensive, tools are few. Debugger/compiler writers are faced with
different architectures and small markets.

Why small/midrange servers? Cheaper than SPARC. More stable and
cheaper than NT. Source is available. Good for ISP's. Databases
were ported in 1998.

Why embedded systems? They need multi-architecture support,
multi-tasking, security, stability, royalty-free licensing, available
network stacks and device drivers, and modularity with cheap memory.
Linux meets all these requirements.

The sweet spot of manufacturing is the highest-volume part: the 8 MB
memory chip. PDA's etc. all have the 8 MB RAM chip, which is cheaper
than a 4 MB chip due to volume, so the size of the OS (100k) is not a
factor. Linux went from zero to 3rd most used, in embedded systems,
in 1 year. WinCE is 4th most used.

Linux has one set of API's, from embedded systems to supercomputers.
An emerging set of tools, programmers, and applications. Reduced cost
of S/W in supercomputers when commodity chips are used. Mention of
Itanium with Linux.

Problems: when AT&T System 2 or 3 was out, Unix did not have
asynchronous I/O; writes blocked. To counter this, Sys 3 included an
I/O buffer cache: writes returned at once, and a sync() forced the
buffered data out to disk. Parallelism is present as buffers fill and
are flushed. Database mgrs don't like this, as transactions are
assumed to be written out when done, not buffered. Sys 4 included
synchronous mounts; when a program did a write, activity halted until
the write completed.
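
(A minimal C sketch, not from the talk, contrasting a buffered write -
which returns as soon as the data reaches the kernel's buffer cache -
with a synchronous write that blocks until the data is on disk, as a
synchronous mount forces for every write. -ed.)

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const char rec[] = "transaction record\n";

        /* Buffered: write() returns once the kernel buffer cache has
         * the data; the disk write happens later (or on sync()). */
        int fd = open("buffered.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
        write(fd, rec, strlen(rec));
        fsync(fd);       /* force this one file's buffers to disk now */
        close(fd);

        /* Synchronous, as on a synchronous mount: each write() blocks
         * until the data has actually reached the disk. */
        int sfd = open("sync.log",
                       O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0644);
        write(sfd, rec, strlen(rec));
        close(sfd);
        return 0;
    }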

Unix/Linux and parallelism - the network is built-in, not added-on.
Parallelism: "live fast and die" (do a lot at once). Shell pipelines,
client-server programming, SMP, fork/exec, threads, PVM/MPI/
OpenMP - utilize subroutines and calls for even more parallelism.
Even on a single-CPU machine, parallelism speeds up clock-time
execution, cuts down on I/O wait time, and keeps pages in memory
'warmer'.

Pipes and Filters - the shell pipeline is SMP-friendly. VMS has a
heavy startup cost, lots of CPU effort to start a process, but is then
efficient; an example was given of a for loop to output a word. Unix,
by contrast, was efficient at starting a process. The need is there
to create lightweight processes. fork(2)ing and exec(2)ing is
lightweight parallelism. Unix is best at encouraging parallelism in
code.
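
(A minimal C sketch, not from the talk, of the fork/exec parallelism
behind a shell pipeline such as "ls -l | wc -l": two children run
concurrently, connected by a pipe. -ed.)

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void)
    {
        int pfd[2];
        if (pipe(pfd) == -1) { perror("pipe"); exit(1); }

        if (fork() == 0) {          /* child 1: "ls -l" into the pipe */
            dup2(pfd[1], STDOUT_FILENO);
            close(pfd[0]); close(pfd[1]);
            execlp("ls", "ls", "-l", (char *)NULL);
            perror("execlp"); exit(1);
        }
        if (fork() == 0) {          /* child 2: "wc -l" from the pipe */
            dup2(pfd[0], STDIN_FILENO);
            close(pfd[0]); close(pfd[1]);
            execlp("wc", "wc", "-l", (char *)NULL);
            perror("execlp"); exit(1);
        }
        close(pfd[0]); close(pfd[1]); /* parent holds no pipe ends */
        wait(NULL); wait(NULL);       /* both children ran in parallel */
        return 0;
    }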

SMP machines 'rock' (since the 2.0 kernels). Only so many CPU's can
sit on a memory bus. Race conditions, locks - spin locks, semaphores,
counting semaphores. Thread-safe vs. multi-threaded. Process- or
thread-based scheduling. H/W as well as S/W can create a bottleneck
in tightly coupled systems. Linux 2.0 solved the problem with a
single lock on the kernel: one CPU idles while the other does a
kernel update. Linux 2.2 and 2.4 created good scaling - you get
better performance as new CPU's are added - by solving that lock
problem. One can probably scale tightly-coupled systems to 12 or
14 CPU's.
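
(A minimal C sketch, not from the talk, of the locking being
described: a mutex makes a shared update safe, where an unprotected
increment would be a race between the two CPU's. -ed.)

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        for (int i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&lock);  /* without this, updates race */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter); /* 2000000 with the lock */
        return 0;
    }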

Libraries come in three kinds: not SMP-safe; SMP-safe; threaded.
When a library is not SMP-safe, the 2nd CPU can overwrite data from
the 1st CPU. When it is SMP-safe, a semaphore allows use of the
library in a multi-CPU system. (Small amount of data not recorded
-ed.) X11 was thread-safe but not multi-threaded, so X11 cannot run
as fast on a multi-CPU system.

The CPU schedules stack, memory and I/O channels. Threading is
bound by the process priority, and thread-based algorithms can
change the priority of all the threads. If one thread fills a buffer
and another empties a buffer (the same one? -ed.) then one can adjust
the rate at which the buffers are used.
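
(A minimal C sketch, not from the talk, of that filler/emptier
pattern: condition variables pace the two threads so neither overruns
the shared buffer. -ed.)

    #include <pthread.h>
    #include <stdio.h>

    #define SLOTS 8
    static int buf[SLOTS], count = 0;
    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t not_full  = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;

    static void *filler(void *arg)
    {
        for (int i = 0; i < 32; i++) {
            pthread_mutex_lock(&m);
            while (count == SLOTS)            /* buffer full: wait */
                pthread_cond_wait(&not_full, &m);
            buf[count++] = i;
            pthread_cond_signal(&not_empty);
            pthread_mutex_unlock(&m);
        }
        return NULL;
    }

    static void *emptier(void *arg)
    {
        for (int i = 0; i < 32; i++) {
            pthread_mutex_lock(&m);
            while (count == 0)                /* buffer empty: wait */
                pthread_cond_wait(&not_empty, &m);
            printf("%d\n", buf[--count]);
            pthread_cond_signal(&not_full);
            pthread_mutex_unlock(&m);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t f, e;
        pthread_create(&f, NULL, filler, NULL);
        pthread_create(&e, NULL, emptier, NULL);
        pthread_join(f, NULL);
        pthread_join(e, NULL);
        return 0;
    }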

Scaling and kernel locks: too few locks waste CPU resources; too
many, or the wrong kind, create too much locking overhead. Discussion
of atomic operations, such as moving data or reading the clock: these
have to be done in 1 step. The right locks solve the overhead issues.
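
(A minimal C sketch of an atomic operation - it uses C11 atomics,
which postdate this talk: the increment is a single indivisible
read-modify-write, so no lock is needed for this one update. -ed.)

    #include <stdatomic.h>
    #include <stdio.h>

    int main(void)
    {
        atomic_long counter = 0;
        atomic_fetch_add(&counter, 1);  /* done in 1 step, no lock */
        printf("%ld\n", (long)atomic_load(&counter));
        return 0;
    }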

Clusters: High reliability/availability/scalability (RAS). A high-
throughput cluster has many processes, e.g. batch-sharing.
Checkpoint restarts (in-process failure recovery). High performance,
where one problem is addressed in parallel to reduce execution time
(the classic Beowulf). Single system image, where one simulates a
giant CPU with all I/O attached; runs threaded apps well.

Failover capability: a modern Beowulf with 4096 CPU's (actually
4096 cases/system boards, 1 CPU per system); if each system has a
300k-hour MTBF you could expect 1 failure every 3 days somewhere in
the system, and someone has to locate it. It is cheaper to throw the
failed unit out and replace it than to fix it and reinstall. This
type of system needed 2 new electrical service feeds installed.
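
(Worked out: a 300,000-hour MTBF per board, spread across 4096
boards, gives 300,000 / 4096 = ~73 hours between failures for the
whole machine - close to the 3 days quoted. -ed.)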

Process Migration: how do you balance the load across all the
systems? With e.g. 200 CPU's, the problem being run may only
decompose well enough to use, say, 64 CPU's while the rest sit idle.
Some problems do not decompose well.

Various programs to create a Beowulf cluster: cfengine, Oscar,
Rocks (rocks.npaci.edu); (5 more that I did not have time to record
-ed.). Commercial products: Platform Computing (www.platform.com),
a load-sharing facility; Scyld (the Donald Becker project).

Workstation farms - a "horizontal Beowulf". Discussion of using
hundreds of idle desktops; discussion of the SETI project. It is
possible to create library kiosks with 1 system, 4 mice/keyboards
etc. and no disk, fed from a server; one then needs to maintain just
1 server. Discussion of the idle CPU time between instructions, e.g.
while waiting on keyboard input, and how best to use it. Discussion
of the Stone SouperComputer at ORNL (Oak Ridge), built from 48 old
cast-off systems. What is the cost of this? It assumes only a H/W
cost, that your labor cost is zero; but even labor, floorspace and
AC/heating have a cost - so there is NO FREE CPU. Discussion of a
problem running on old equipment that took 10 weeks per iteration.
There is a need for RAS along with high performance, and a need to
maintain clean power, with checkpoint restarts in case of power
loss.

Superparallelism: the Beowulf concept: inexpensive H/W, use of
replication of complex S/W, new algorithms; hard to program due to
S/W latency overhead. Latency with Myrinet (in usec) is still higher
than CPU/memory latency. Using CPU cycles to formulate a data packet
means those cycles are lost to the problem-solving effort.

Discussion of an old problem of sorting records on a PDP-11 with
700k disk accesses and 32M memory accesses. Moving the work out of
the I/O path and into memory reduced the sort time from 10 hr to 3
min. The original problem was coded that badly because people do not
know or understand what speed is available in their H/W; they do not
code to work around I/O-cache-memory bottlenecks, and so waste time
waiting on them. Discussion of David Tang at an expo in Raleigh RTP:
2 arrays were multiplied by the classic method using loops. But if
you took the second array and transposed it, then multiplied, you
had better cache utilization; transposing the answer back gave a 40x
speed increase (but only 10x on Intel, due to how the Intel
architecture uses the cache).
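
(A minimal C sketch, not the exact demo described: transposing the
second array up front lets the inner loop walk both operands
sequentially in memory, which is the cache win in question. Here the
indexing is adjusted so the answer needs no final transpose; the 40x
figure is from the talk, not this sketch. -ed.)

    #include <stdio.h>

    #define N 512
    static double A[N][N], B[N][N], BT[N][N], C[N][N];

    int main(void)
    {
        /* Transpose B once... */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                BT[j][i] = B[i][j];

        /* ...then C[i][j] = sum over k of A[i][k] * BT[j][k]: both
         * A and BT are now read row-wise (sequentially), instead of
         * striding down B's columns and missing the cache. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double sum = 0.0;
                for (int k = 0; k < N; k++)
                    sum += A[i][k] * BT[j][k];
                C[i][j] = sum;
            }
        printf("%f\n", C[0][0]);
        return 0;
    }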

Use of NUMA reduces bottlenecks. Lower overhead
reduces latency.

Grid: do we have the skills to take advantage of grid computing? We
need to teach high school students so they can solve tough problems
when in college. A reason to revitalize Extreme Linux. We need EL
boot camps, workshops, miniclusters, 'open teaching' methods, and
contests to spur interest. The answer is yes to all. We have to do
these as a community.

Thanks to the Melbourne LUG, LEAP (Orlando) and to SGI for providing
transportation and assistance.

QA: Awareness among companies. A Linux International project to
create marketing skills and materials to give local
gov't/business/schools: why you should use open source. This has to
be a grass-roots effort. Influence legislators to 'sell' open
source. The open source effort can create a bigger marketing force
than Microsoft has available.

QA: System Admin: the idea of creating a Sysadmin "Merit Badge" for
Beowulf. Discussion of the idea that there is a "Black Box" level
below which one cannot figure out what is inside the black box.
Today, all high-level teaching has a Black Box component below it.
Discussion of coding in assembler. The duty of any university is to
teach these concepts all the way down to the basic elemental
concepts and ideas. Students in earlier days had single-board
computers. They could work through block-level teaching exercises,
building adders etc., to understand basic concepts (which are now
'hidden' in monolithic IC's). It is essential for students to
understand these basic concepts.

Today administrators throw money at problems by buying
more commercial software. Alternative: Solve problems
by running Linux when that money for commercial S/W is
not available.

Discussion of SCO and IBM. Trend towards a 'service industry': the
computer industry is becoming a service industry. The day is coming
when buying shrink-wrapped S/W off the shelf will be replaced by
services that exist to tailor custom S/W for individual users. The
need for these programmers will increase. The days of intellectual-
property rights to stable, single-model S/W are going away.

Discussion of (many) books on Extreme Linux or clustering,
available in the presentation (contact MLUG when available).
(I could not write them down fast enough -ed.)

Discussion then by Andy Fenselau, Director of Program Management
at SGI (fenselau@sgi.com), of the Altix 3000 family of NUMA
servers/superclusters. Approaches to scaling Linux, the economics
of tech strategies, real-world deployment options. SGI is offering
H/W architecture, O/S, High-Performance Computing, management
tools, and Linux development tools. Discussion of corporate SGI
background, markets served, product emphasis (3D graphics,
visualization, digital media, interactive TV, OpenGL, shared-memory
computing). The cost of hardware is now approaching $1/hour.
Progression through Vector, RISC, COTS (Commodity Off The Shelf).
However, the costs of ISV S/W and of people are increasing: IT and
engineering personnel are approaching $45/hour, ISV application S/W
is approaching $5/hour. Zookeepers of clusters all cost money, even
graduate students. ( !! -ed.) Standards-based HPC does not require
compromising on costs.

Technology drivers: There is always a "larger problem".
Datasets are growing exponentially, not linearly. There
are different problem elements that interrelate: CPU,
Memory access, bandwidth and latency. Parallel
processing. Productivity computing: The Realized
Gain vs. the Theoretical Peak Performance.

HPC technologies shorten the time-to-solution. They offer balanced,
scalable performance; low-latency memory access; operating
environments optimized for HPC; system-, resource- and data-
management tools; and easy deployment with ongoing investment
protection.

Discussion of an x-y-z visualization cube, x=memory, y=CPU and
z=I/O, and where applications fit into this cube. Webservers are
small and integrated. Genomics needs many compute cycles. Signal
processing needs network and compute cycles. Database/CRM/ERP needs
storage, as does media streaming. What do users need? Architectural
scaling of processor/memory/I/O for individual and collective
workloads, and O/S and S/W support for realtime HPC systems and
data management. What can users do? Workflow and process
improvements - time-to-solution. Workflow/process breakthroughs -
time-to-market. Capability breakthroughs - a new process can be
modelled at the atomic, not the molecular, level.

How are user needs and problems evolving? Memory size and
addressability: is 32-bit enough? I/O performance and scalability.
Higher processor counts for problem apps. Flexibility for mixed
workloads. Manageability from the operator's point of view. Storage
volume. Operator scheduling. Operator flexibility for mixed apps.

Example of weather computing, which is highly scalable: I/O is a
low resource requirement, but high processor power is needed.
Spectral and coupled climate models, explicit and semi-implicit
finite differences, etc. The issues are: CPU, memory bandwidth, I/O
bandwidth, communications bandwidth, latency and scalability.
Different models use some of these preferentially over others.
Analyze the resource constraints, and pick the architecture that
best meets them. Those with big, proprietary SMP/NUMA/Vector H/W
pay a penalty when using older systems.

Altix 3000 offers the first Linux with 64 CPUs in a single-O/S
image. Global shared-memory access across multiple nodes. HPC
system- and data-management tools. Good floating-point calcs, memory
performance, I/O B/W and real tech apps. All Irix tools are ported;
AIX and HP-UX haven't ported theirs. Discussion of supercomputer vs.
cluster performance and how this system addresses the "best of both
worlds". Discussion of the "brick" architecture of modularized
components and how they are integrated into the case.

A wonderful evening, just fascinating. -ed.
{end}


