This is the user guide for the OODT Catalog and Archive Service (CAS) Resource Manager
component, or Resource Manager for short. This guide explains the Resource Manager architecture
including its extension points. The guide also discusses available services provided
by the Resource Manager, how to utilize them, and the different APIs that exist. The guide
concludes with a description of Resource Manager use cases.
The Resource Manager component is responsible for the execution, monitoring, and tracking of jobs,
storage and networking resources for an underlying set of hardware resources. The Resource Manager is an
extensible software component that provides an XML-RPC external interface, and a fully
tailorable Java-based API for resource management. The critical objects managed by the Resource
Manager include:
-
Job - an abstract representation of an execution unit that stores information about an
underlying program or execution that must be run on some hardware node, including the
Job Input that the Job requires, the job load, and the queue to which the job should
be submitted.
-
Job Input - an abstract representation of the input that a Job requires.
-
Job Spec - a complete specification of a Job, including its Job Input, and the Job definition
itself.
-
Job Instance - the physical code that performs the underlying job execution.
-
Resource Node - an available execution node that a Job is sent to by the Resource Manager.
Each Job Spec contains exactly one Job and one Job Input. Each Job Input is provided to a single Job.
Each Job describes a single Job Instance. Finally, each Job is sent to exactly one Resource Node.
These relationships are shown in the figure below.
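To make these relationships concrete, here is a minimal Java sketch of the data model. It is not the Resource Manager's actual API: the nested types (`JobInput`, `NameValueInput`, `JobInstance`, `HelloInstance`, `Job`, `JobSpec`, `ResourceNode`) are hypothetical simplifications named after the concepts in this guide, and the sketch assumes a JDK with record support (Java 16 or later).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of the Resource Manager's core objects.
// These are NOT the real gov.nasa.jpl.oodt.cas.resource.structs classes.
public class JobModelSketch {

    /** Abstract input that a Job requires. */
    interface JobInput {
        String getId();
    }

    /** Name/value input, loosely modelled on the NameValueJobInput idea. */
    static final class NameValueInput implements JobInput {
        private final Map<String, String> props = new HashMap<>();

        @Override
        public String getId() { return "name-value-input"; }

        void set(String name, String value) { props.put(name, value); }

        String get(String name) { return props.get(name); }
    }

    /** Job Instance: the physical code that performs the work. */
    interface JobInstance {
        boolean execute(JobInput input);
    }

    /** Example Job Instance that greets the user named in its input. */
    static final class HelloInstance implements JobInstance {
        @Override
        public boolean execute(JobInput input) {
            NameValueInput nv = (NameValueInput) input;
            System.out.println("Hello " + nv.get("user.name"));
            return true;
        }
    }

    /** A Job names its Job Instance class, its queue and its load. */
    record Job(String id, String name, String instanceClass, String queue, int load) {}

    /** A Job Spec holds exactly one Job and exactly one Job Input. */
    record JobSpec(Job job, JobInput input) {}

    /** An execution node; each Job is sent to exactly one of these. */
    record ResourceNode(String id, int capacity) {}

    public static void main(String[] args) {
        NameValueInput input = new NameValueInput();
        input.set("user.name", "Homer!");
        Job job = new Job("abcd", "TestJob",
                "gov.nasa.jpl.oodt.cas.resource.examples.HelloWorldJob", "quick", 1);
        JobSpec spec = new JobSpec(job, input);
        ResourceNode node = new ResourceNode("node001", 2);

        System.out.println(spec.job().name() + " -> " + node.id()); // TestJob -> node001
        new HelloInstance().execute(spec.input());                  // Hello Homer!
    }
}
```

The field values mirror the example job file used later in this guide.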
There are several extension points for the Resource Manager. An extension point is an interface
within the Resource Manager that can have many implementations. This is particularly useful when
it comes to software component configuration because it allows different implementations of an
existing interface to be selected at deployment time. So, the Resource Manager component may
submit Jobs to a custom XML-RPC batch submission system, or it may use an available off-the-shelf
batch submission system, such as LSF or Load-Share. The selection of the actual component implementations
is handled entirely by the extension point mechanism. Using extension points, it is fairly simple to support
many different types of what are typically referred to as "plug-in architectures". Each of the core extension
points for the Resource Manager is described below:
-
Batch Manager - The Batch Manager extension point is responsible for sending Jobs to the nodes
on which the Resource Manager has determined they should execute. A Batch Manager typically includes
a client service that communicates with remote "stubs", which run on the local compute nodes and
actually handle the physical execution of the provided Jobs.
-
Job Queue - The Job Queue extension point is responsible for queueing up Jobs when the Resource Manager
determines that there are no Resource Nodes available to execute them. Capabilities such as
persistence and queueing policy (e.g., LRU, FIFO) are handled by this extension point.
-
Job Repository - The Job Repository is responsible for the persistence of a Job throughout its lifecycle
in the Resource Manager. A Job Repository handles the retrieval of Job and Job Spec information,
whether the Job is queued, executing, or finished.
-
Monitor - The Monitor extension point is responsible for monitoring the execution of a Job once it has
been sent to a Resource Node by the Batch Manager extension point.
-
Job Scheduler - The Job Scheduler extension point is responsible for determining the availability of the
underlying Resource Nodes managed by the Resource Manager, and for determining the policy for pulling Jobs
off the Job Queue to schedule for execution. To do so, it interacts with the Job Repository, the Batch
Manager, the Monitor, and nearly all of the other extension points in the Resource Manager.
-
System - The extension point that provides the external interface to the Resource Manager
services. This includes the Resource Manager server interface, as well as the associated
Resource Manager client interface that communicates with the server.
The relationships between the extension points for the Resource Manager are shown in the figure below.
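The extension point mechanism amounts to "program against an interface, pick the implementation from configuration". The sketch below shows that idea for a Job Queue: two interchangeable factories (FIFO and stack/LIFO) and a loader that instantiates whichever factory class is named by the `resource.jobqueue.factory` property (the property name comes from the Appendix). The `JobQueue` and `JobQueueFactory` interfaces and both factory classes are hypothetical stand-ins, not the real CAS interfaces.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Properties;

// Hypothetical sketch of an extension point selected from configuration.
// JobQueue / JobQueueFactory are illustrative, not the real CAS interfaces.
public class ExtensionPointSketch {

    /** The extension point: any queueing policy can sit behind this. */
    public interface JobQueue {
        void add(String jobId);
        String next(); // returns null when the queue is empty
    }

    /** One factory per implementation; the factory is what gets configured. */
    public interface JobQueueFactory {
        JobQueue createQueue();
    }

    /** First-in, first-out policy. */
    public static class FifoQueueFactory implements JobQueueFactory {
        @Override
        public JobQueue createQueue() {
            Deque<String> jobs = new ArrayDeque<>();
            return new JobQueue() {
                @Override public void add(String jobId) { jobs.addLast(jobId); }
                @Override public String next() { return jobs.pollFirst(); }
            };
        }
    }

    /** Last-in, first-out (stack) policy. */
    public static class StackQueueFactory implements JobQueueFactory {
        @Override
        public JobQueue createQueue() {
            Deque<String> jobs = new ArrayDeque<>();
            return new JobQueue() {
                @Override public void add(String jobId) { jobs.push(jobId); }
                @Override public String next() { return jobs.poll(); }
            };
        }
    }

    /** Instantiates whichever factory class the configuration names. */
    static JobQueue loadQueue(Properties config) throws ReflectiveOperationException {
        String factoryClass = config.getProperty("resource.jobqueue.factory");
        JobQueueFactory factory = (JobQueueFactory) Class.forName(factoryClass)
                .getDeclaredConstructor()
                .newInstance();
        return factory.createQueue();
    }

    public static void main(String[] args) throws ReflectiveOperationException {
        Properties config = new Properties();
        config.setProperty("resource.jobqueue.factory", StackQueueFactory.class.getName());

        JobQueue queue = loadQueue(config);
        queue.add("job-1");
        queue.add("job-2");
        System.out.println(queue.next()); // job-2 (stack: last in, first out)
    }
}
```

Switching policies is then a configuration change only: pointing the property at `FifoQueueFactory` makes the same calling code hand out `job-1` first.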
The Resource Manager is responsible for providing the necessary key capabilities for managing
job execution and underlying hardware resources. Each high level capability provided by the Resource
Manager is detailed below:
-
Easy execution - of compute jobs on heterogeneous computing resources with very different underlying
specifications: large and small disks, network file storage, storage area networks, and exotic processor
architectures.
-
Cluster Management Pluggability - the ability to plug into existing batch submission systems (e.g., Torque, LSF), and
resource monitoring (e.g., Ganglia).
-
Scalability – The Resource Manager uses the popular client-server paradigm, allowing new Resource Manager
servers to be instantiated, as needed, without affecting the Resource Manager clients, and vice-versa.
-
Communication over lightweight, standard protocols – The Resource Manager uses XML-RPC, as its main
external interface, between Resource Manager client and server. XML-RPC, the little brother of SOAP, is
fast, extensible, and uses the underlying HTTP protocol for data transfer.
-
Wrapping - the use of wrappers to insulate the internal code of Job Instances from their external interfaces
allows a variety of different popular programming languages (e.g., shell scripting, Java, Python, Perl, Ruby) to be
used to implement the actual job.
-
Scheduler Pluggability - the ability to define the underlying job scheduling policy.
-
XML-based job description - allows existing XML-based editing tools to be used to visualize the
different job properties, and enables standard job definitions and interchange.
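As an illustration of the Wrapping capability, the sketch below hides an arbitrary external program (a shell script, Python, Perl, and so on) behind a small Java interface using the standard `ProcessBuilder` API. The `JobInstance` interface and `ExternalCommandJob` class are hypothetical, and the example assumes a *nix system with `echo` on the `PATH`.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical wrapper: the Java side sees only JobInstance; the real work
// can be done by any external program.
public class WrapperJobSketch {

    /** Hypothetical Job Instance contract: run the job, report success. */
    interface JobInstance {
        boolean execute() throws IOException, InterruptedException;
    }

    /** Wraps an external command (shell script, Python, Perl, Ruby, ...). */
    static final class ExternalCommandJob implements JobInstance {
        private final List<String> command;

        ExternalCommandJob(List<String> command) {
            this.command = List.copyOf(command);
        }

        @Override
        public boolean execute() throws IOException, InterruptedException {
            Process process = new ProcessBuilder(command)
                    .inheritIO()   // pass the program's stdout/stderr through
                    .start();
            return process.waitFor() == 0; // exit status 0 means success
        }
    }

    public static void main(String[] args) throws Exception {
        JobInstance job = new ExternalCommandJob(List.of("echo", "Hello from a wrapped job"));
        System.out.println("succeeded: " + job.execute());
    }
}
```

Because the wrapper only relies on the exit status, the wrapped program can be replaced or rewritten in another language without touching the Java side.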
This capability set is not exhaustive, and is meant to give the user a "feel" for what
general features are provided by the Resource Manager. Most likely the user will find that the
Resource Manager provides many other capabilities besides those described here.
There is at least one implementation of all of the aforementioned extension points for
the Resource Manager. Each extension point implementation is detailed below:
-
Batch Manager
-
XML-RPC based Batch Manager – an implementation of the Batch Manager extension point that uses
a custom, light-weight XML-RPC Batch Submission system, and batch stubs deployed on each of the Resource
Nodes.
-
Job Queue
-
Stack based Job Queue - an implementation of the Job Queue extension point that uses a common
Stack data structure to queue up Jobs in memory.
-
Job Repository
-
Memory based Job Repository - an implementation of the Job Repository extension point that uses
an in-memory persistence layer to record Job and Job Spec information.
-
Monitor
-
Assignment Job Monitor - an implementation of the Monitor extension point that uses internal profiling
to keep track of Job status, and Resource Node load.
-
Job Scheduler
-
LRU based Scheduler - an implementation of the Scheduler extension point that uses a
Least Recently Used (LRU) approach to selecting Jobs for submission to the Batch Manager.
-
System (Resource Manager client and Resource Manager server)
-
XML-RPC based Resource Manager server – an implementation of the external server interface
for the Resource Manager that uses XML-RPC as the transportation medium.
-
XML-RPC based Resource Manager client – an implementation of the client interface for the
XML-RPC Resource Manager server that uses XML-RPC as the transportation medium.
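The guide does not spell out exactly how the LRU based Scheduler makes its choices. As a hedged illustration of the general Least Recently Used idea only, the sketch below applies it to picking a Resource Node: nodes are kept in least-recently-used order, and each new job goes to the least recently used node that still has spare capacity for the job's load. The class and method names are hypothetical, and the real LRUScheduler may work differently.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical LRU node selection; illustrative only, not the real LRUScheduler.
public class LruNodeSelectorSketch {
    private final Deque<String> lruOrder = new ArrayDeque<>(); // head = least recently used
    private final Map<String, Integer> capacity = new HashMap<>();
    private final Map<String, Integer> load = new HashMap<>();

    void addNode(String nodeId, int cap) {
        capacity.put(nodeId, cap);
        load.put(nodeId, 0);
        lruOrder.addLast(nodeId);
    }

    /** Returns the least recently used node that can take jobLoad, or null if none can. */
    String assign(int jobLoad) {
        Iterator<String> it = lruOrder.iterator();
        while (it.hasNext()) {
            String node = it.next();
            if (load.get(node) + jobLoad <= capacity.get(node)) {
                it.remove();
                lruOrder.addLast(node);                 // now the most recently used
                load.merge(node, jobLoad, Integer::sum);
                return node;                            // stop iterating after the change
            }
        }
        return null; // no capacity anywhere: the job should stay queued
    }

    /** Frees capacity when a job on nodeId finishes. */
    void release(String nodeId, int jobLoad) {
        load.merge(nodeId, -jobLoad, Integer::sum);
    }

    public static void main(String[] args) {
        LruNodeSelectorSketch s = new LruNodeSelectorSketch();
        s.addNode("node001", 2);
        s.addNode("node002", 1);
        System.out.println(s.assign(1)); // node001
        System.out.println(s.assign(1)); // node002
        System.out.println(s.assign(1)); // node001 (it still has room for one more)
        System.out.println(s.assign(1)); // null: both nodes are full
    }
}
```

Rotating the chosen node to the tail spreads consecutive jobs across nodes instead of piling them onto the first node with free capacity.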
To install the Resource Manager, you need to download a release of the Resource Manager, available
from its home web site. For bleeding-edge features, you can also check out the cas-resource trunk project
from the OODT Subversion repository. You can browse the repository using ViewCVS, located at:
http://oodt.jpl.nasa.gov/vc/svn/
The actual web URL for the repository is located at:
http://oodt.jpl.nasa.gov/repo/
To check out the Resource Manager, use your favorite Subversion client. Several clients are
listed at:
http://oodt.jpl.nasa.gov/wiki/display/oodt/Subversion
The cas-resource project follows the traditional Subversion-style trunk, tags, and branches
format. Trunk corresponds to the latest and greatest development on cas-resource. Tags are official
release versions of the project. Branches correspond to deviations from the trunk large enough to
warrant a separate development tree.
For the purposes of this User Guide, we'll assume you have already downloaded a built release
of the Resource Manager from its web site. If you were building cas-resource from the trunk, a tagged release,
or a branch, the process would be quite similar. To build cas-resource, you need the Apache Maven
software. Maven is an XML-based project management system similar to Apache Ant, but with many extra
bells and whistles. Maven makes cross-platform project development a snap. You can download Maven from:
http://maven.apache.org
The cas-resmgr is constructed to be compatible with the 1.x.x series of Maven, despite
Maven 2.x having already been released. Once you have Maven installed, follow the steps
below to build a fresh copy of the Resource Manager:
-
cd to cas-resource, and then type:
This will perform several tasks, including compiling the source code, downloading
required jar files, running unit tests, and so on. When the command completes, cd
to the target/distributions directory within cas-resource. This directory will
contain the build of the Resource Manager component, in the following form:
cas-resource-${version}.tar.gz
This is a distribution tarball that you would copy to a deployment directory, such as
/usr/local/, and then unpack using tar xvzf. The resulting directory
layout from the unpacked tarball is as follows:
bin/ etc/ logs/ docs/ lib/ policy/ LICENSE.txt CHANGES.txt
-
bin - contains the "resmgr" server script, and the "resmgr-client" client script.
-
etc - contains the logging.properties file for the Resource Manager, and the resource.properties
file used to configure the server options.
-
logs - the default directory for log files to be written to.
-
docs - contains Javadoc documentation, and user guides for using the particular CAS component.
-
lib - the required Java jar files to run the Resource Manager.
-
policy – the default XML-based element and product type policy in
case the user is using the XML Repository Manager and/or the XML Validation
Layer.
-
CHANGES.txt - contains the CHANGES present in this released version of the CAS component.
-
LICENSE.txt - the LICENSE for the Resource Manager project.
To deploy the Resource Manager, you'll need to create an installation directory. Typically this
would be somewhere in /usr/local (on *nix style systems), or C:\Program Files\ (on Windows
style systems). We'll assume that you're installing on a *nix style system, though the Windows
instructions are quite similar.
Follow the process below to deploy the Resource Manager:
-
Copy the binary distribution to the deployment directory
# cp -R cas-resource/trunk/target/distributions/cas-resource-${version}.tar.gz /usr/local/
-
Untar the distribution
# cd /usr/local ; tar xvzf cas-resource-${version}.tar.gz
-
Set up a symlink
# ln -s /usr/local/cas-resource-${version} /usr/local/resmgr
-
edit /usr/local/resmgr/bin/resmgr
-
Set the SERVER_PORT variable to the port on which you'd like to run the
Resource Manager server.
-
Set the JAVA_HOME variable to point to the location of your installed JRE.
-
Set the RUN_HOME variable to point to the location where you'd like the Resource
Manager PID file to be written. Typically this should default to /var/run, but not all
system administrators allow users to write to /var/run.
-
edit /usr/local/resmgr/bin/resmgr-client
-
Set the JAVA_HOME variable to point to the location of your installed JRE.
-
(optional) edit /usr/local/resmgr/etc/logging.properties
-
Set the logging levels for each subsystem to the desired level. The system
defaults are fairly conservative and suppress most console logging below the
INFO level.
-
edit /usr/local/resmgr/etc/resource.properties
-
This Java properties file contains all of the default properties used to
configure the Resource Manager. By default, the Resource Manager is built to use the XML-based
Assignment Monitor, the MemoryJobRepository, the LRUScheduler, and the JobStackJobQueue
extension points. These defaults can be changed easily by changing the factory classes
that are pointed to for each extension point. For example, to use your own home-grown Scheduler
extension point, you would change the resource.scheduler.factory property to
gov.nasa.jpl.oodt.cas.resmgr.scheduler.YourNewSchedulerFactory
-
You need to configure the properties for each of the extension points that you are
using. By default, you would at least need to configure:
-
The paths to the directories where the XML policy files are stored for the
XML Assignment Monitor. A good default location for these files is
/usr/local/resmgr/policy.
Other configuration options are possible: check the API documentation, as well as the comments
within the resource.properties file, to find out the rest of the configurable properties for the
extension points you choose. A full listing of all the extension point factory class names is
provided in the Appendix. Once you have edited resource.properties, you are officially done
configuring the Resource Manager for deployment.
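Because a typo in a factory class name is easy to make when editing resource.properties, a small stand-alone checker can help. The sketch below is a hypothetical helper, not part of the distribution: it loads the file with `java.util.Properties` and reports, for each factory property listed in the Appendix, whether the property is set and whether the named class can be found on the classpath.

```java
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Properties;

// Hypothetical helper (not shipped with the Resource Manager) that sanity-checks
// the extension point factory settings in resource.properties.
public class ResourcePropertiesCheck {

    // Factory property names, as listed in the Appendix of this guide.
    static final String[] FACTORY_KEYS = {
        "resource.batchmgr.factory",
        "resource.monitor.factory",
        "resource.scheduler.factory",
        "resource.jobqueue.factory",
        "resource.jobrepo.factory",
    };

    static Properties load(Reader in) throws IOException {
        Properties props = new Properties();
        props.load(in);
        return props;
    }

    /** Prints one line per factory key and returns the number of problems found. */
    static int report(Properties props) {
        int problems = 0;
        for (String key : FACTORY_KEYS) {
            String className = props.getProperty(key);
            if (className == null || className.isBlank()) {
                System.out.println(key + ": NOT SET");
                problems++;
                continue;
            }
            try {
                // initialize=false: only check that the class exists, don't run its static code
                Class.forName(className.trim(), false, ResourcePropertiesCheck.class.getClassLoader());
                System.out.println(key + ": " + className + " (found)");
            } catch (ClassNotFoundException e) {
                System.out.println(key + ": " + className + " (NOT on classpath)");
                problems++;
            }
        }
        return problems;
    }

    public static void main(String[] args) throws IOException {
        String path = args.length > 0 ? args[0] : "/usr/local/resmgr/etc/resource.properties";
        int problems;
        try (Reader in = new FileReader(path)) {
            problems = report(load(in));
        }
        System.exit(problems == 0 ? 0 : 1);
    }
}
```

After compiling it with javac, you might run it with the distribution's jars on the classpath, e.g. `java -cp "/usr/local/resmgr/lib/*:." ResourcePropertiesCheck /usr/local/resmgr/etc/resource.properties`; a non-zero exit status signals a missing or misspelled factory.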
To run the resmgr, cd to /usr/local/resmgr/bin and type:
This will start up the Resource Manager XML-RPC server interface. Your Resource Manager
is now ready to run! You can test out the Resource Manager by submitting the following example
Job, defined in the XML file below (save the file to a location on your system, such as
/usr/local/resmgr/examples/exJob.xml):
<?xml version="1.0" encoding="UTF-8" ?>
<cas:job xmlns:cas="http://oodt.jpl.nasa.gov/1.0/cas" id="abcd"
name="TestJob">
<instanceClass
name="gov.nasa.jpl.oodt.cas.resource.examples.HelloWorldJob" />
<inputClass
name="gov.nasa.jpl.oodt.cas.resource.structs.NameValueJobInput">
<properties>
<property name="user.name" value="Homer!" />
</properties>
</inputClass>
<queue>quick</queue>
<load>1</load>
</cas:job>
The above job definition tells the Resource Manager to execute
gov.nasa.jpl.oodt.cas.resource.examples.HelloWorldJob, which is one of the example Jobs
shipped with the Resource Manager. The job simply echoes the name provided in the user.name
property back to the screen as part of a hello greeting.
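The job definition format can also be read programmatically. The sketch below uses the JDK's built-in DOM parser (`javax.xml.parsers`) to pull the fields out of a job file like exJob.xml. The `JobXmlReader` class and the flat map keys it returns (e.g. `property.user.name`) are hypothetical choices for illustration, not how the Resource Manager itself parses job files, and the sketch assumes every element shown in the example is present.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical reader for job files such as exJob.xml; illustrative only.
public class JobXmlReader {

    /** Returns the first element with the given tag name under root, or null. */
    private static Element first(Element root, String tag) {
        NodeList nodes = root.getElementsByTagName(tag);
        return nodes.getLength() > 0 ? (Element) nodes.item(0) : null;
    }

    static Map<String, String> parse(InputSource source) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // the root element is the prefixed cas:job
        Document doc = factory.newDocumentBuilder().parse(source);
        Element job = doc.getDocumentElement();

        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("id", job.getAttribute("id"));
        fields.put("name", job.getAttribute("name"));
        fields.put("instanceClass", first(job, "instanceClass").getAttribute("name"));
        fields.put("inputClass", first(job, "inputClass").getAttribute("name"));
        fields.put("queue", first(job, "queue").getTextContent().trim());
        fields.put("load", first(job, "load").getTextContent().trim());

        // Each <property name=".." value=".."/> becomes "property.<name>" -> value.
        NodeList props = job.getElementsByTagName("property");
        for (int i = 0; i < props.getLength(); i++) {
            Element p = (Element) props.item(i);
            fields.put("property." + p.getAttribute("name"), p.getAttribute("value"));
        }
        return fields;
    }

    public static void main(String[] args) throws Exception {
        String path = args.length > 0 ? args[0] : "/usr/local/resmgr/examples/exJob.xml";
        try (InputStream in = new FileInputStream(path)) {
            parse(new InputSource(in)).forEach((k, v) -> System.out.println(k + " = " + v));
        }
    }
}
```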
To run the job, you must first start an XML-RPC batch stub to execute the job on the local node. Let's assume
a default port of 2001:
The command to run the job, assuming that you started the Resource Manager on the default port of
9002, is:
java -Djava.ext.dirs=../lib gov.nasa.jpl.oodt.cas.resource.tools.JobSubmitter \
--rUrl http://localhost:9002 \
--file /usr/local/resmgr/examples/exJob.xml
You should see a response message at the end similar to:
Mar 5, 2008 10:45:26 AM gov.nasa.jpl.oodt.cas.resource.jobqueue.JobStack addJob
INFO: Added Job: [2008-03-05T10:45:26.148-08:00] to queue
Mar 5, 2008 10:45:26 AM gov.nasa.jpl.oodt.cas.resource.tools.JobSubmitter main
INFO: Job Submitted: id: [2008-03-05T10:45:26.148-08:00]
Mar 5, 2008 10:45:27 AM gov.nasa.jpl.oodt.cas.resource.scheduler.LRUScheduler run
INFO: Obtained Job: [2008-03-05T10:45:26.148-08:00] from Queue: Scheduling for execution
Mar 5, 2008 10:45:27 AM gov.nasa.jpl.oodt.cas.resource.scheduler.LRUScheduler schedule
INFO: Assigning job: [TestJob] to node: [node001]
Mar 5, 2008 10:45:27 AM gov.nasa.jpl.oodt.cas.resource.system.extern.XmlRpcBatchStub genericExecuteJob
INFO: stub attempting to execute class: [gov.nasa.jpl.oodt.cas.resource.examples.HelloWorldJob]
Hello world! How are you Homer!!
which means that everything installed okay!
The Resource Manager was built to support several of the capabilities outlined above.
In particular, there were several use cases that we wanted to support, some
of which are described below.
The black numbers in the above Figure correspond to the sequence of steps, and the series of
interactions between the different Resource Manager extension points, involved in performing
the job execution activity. The Job provided to the Resource Manager (labeled Process Manager
in the above diagram) is sent by the Workflow Manager, another CAS component responsible for modeling task
control flow and data flow. In Step 7, the job is provided to the Resource Manager, which uses its
Scheduler extension point in Step 8, along with the Monitor extension point, to determine the appropriate
Resource Node on which to execute the provided Job (Steps 9-11). The information returned to the
Scheduler in Step 11 is then used to determine whether the Job can be executed. Once the Job is determined
to be "ready to run", in Step 12, the Scheduler extension point uses the Batch Manager extension point
(not shown) to submit the Job to the underlying compute cluster nodes, and monitors the Job execution
using the Monitor extension point, as shown in Step 13.
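The core of Steps 8 through 12 reduces to a simple loop: take a job off the queue, ask the Monitor for a node with enough spare capacity for the job's load, and either hand the job to the Batch Manager or put it back in the queue. The sketch below shows one such scheduling pass with hypothetical, in-memory `Monitor` and `BatchManager` stand-ins (Java 16 or later for the record). It illustrates the flow described above and is not the Resource Manager's implementation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical single scheduling pass; all types here are illustrative stand-ins.
public class SchedulerPassSketch {

    record Job(String name, int load) {}

    /** Hypothetical Monitor: tracks used load and capacity per node. */
    static final class Monitor {
        private final Map<String, int[]> nodes = new LinkedHashMap<>(); // id -> {used, capacity}

        void addNode(String id, int capacity) { nodes.put(id, new int[] {0, capacity}); }

        /** First node with room for the given load, or null (Steps 9-11). */
        String findNodeFor(int load) {
            for (Map.Entry<String, int[]> e : nodes.entrySet()) {
                int[] usedCap = e.getValue();
                if (usedCap[0] + load <= usedCap[1]) return e.getKey();
            }
            return null;
        }

        void assign(String id, int load) { nodes.get(id)[0] += load; }
    }

    /** Hypothetical Batch Manager: just records what it was asked to run (Step 12). */
    static final class BatchManager {
        final List<String> submitted = new ArrayList<>();

        void submit(Job job, String node) { submitted.add(job.name() + "@" + node); }
    }

    /** Visits each currently queued job once; returns how many were dispatched. */
    static int schedulePass(Deque<Job> queue, Monitor monitor, BatchManager batch) {
        int dispatched = 0;
        int pending = queue.size();
        for (int i = 0; i < pending; i++) {
            Job job = queue.pollFirst();
            String node = monitor.findNodeFor(job.load());
            if (node == null) {
                queue.addLast(job); // not ready to run: leave it queued
                continue;
            }
            monitor.assign(node, job.load());
            batch.submit(job, node);
            dispatched++;
        }
        return dispatched;
    }

    public static void main(String[] args) {
        Deque<Job> queue = new ArrayDeque<>(List.of(new Job("TestJob", 1), new Job("BigJob", 4)));
        Monitor monitor = new Monitor();
        monitor.addNode("node001", 2);
        BatchManager batch = new BatchManager();

        System.out.println(schedulePass(queue, monitor, batch)); // 1
        System.out.println(batch.submitted);                     // [TestJob@node001]
        System.out.println(queue.size());                        // 1 (BigJob is still queued)
    }
}
```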
Full list of Resource Manager extension point classes and their associated property names from the
resource.properties file:
-
resource.batchmgr.factory - gov.nasa.jpl.oodt.cas.resource.batchmgr.XmlRpcBatchMgrFactory
-
resource.monitor.factory - gov.nasa.jpl.oodt.cas.resource.monitor.XMLAssignmentMonitorFactory
-
resource.scheduler.factory - gov.nasa.jpl.oodt.cas.resource.scheduler.LRUSchedulerFactory
-
resource.jobqueue.factory - gov.nasa.jpl.oodt.cas.resource.jobqueue.JobStackJobQueueFactory
-
resource.jobrepo.factory - gov.nasa.jpl.oodt.cas.resource.jobrepo.MemoryJobRepositoryFactory