User
marcux edited this page 1 year ago

User Documentation

L3Q is an abbreviation for: Light, Light, Lightweight Queue.
L3Q is a lightweight queue system that runs processes and programs
in parallel or in sequence on multiple nodes.

General

L3Q consists of three programs, each with its own purpose.

l3q is the client program that the user uses to interact with the system:
add jobs, view the queue, view the history, check the system status and so on.

l3qd is the central daemon that manages the queue. It assigns jobs
to compute nodes and stores the status of jobs and nodes.
It receives user input from the client program l3q and sends jobs to the node daemon node-l3qd.

node-l3qd is the node daemon that runs on each compute node.
The node daemon receives jobs from the central daemon (l3qd) and executes
the tasks on the local node. It sends status and information about jobs
back to the central daemon.

L3Q Client (l3q)

The L3Q client is the main tool for interacting with the L3Q daemon.
With the client you can view the queue, jobs and tasks, and see the status
of the compute nodes. With the client you can also add jobs, which
the L3Q daemon executes when resources are available. To use the
client, the user must be a member of the l3q group.

Queue

The default queueing system in the L3Q daemon is a FIFO (first in,
first out) queue. The job that was added first is launched first, and
any further queued jobs are processed in turn from the top. As long as
enough resources are available for the next queued job, it is started.
This continues until no queued jobs are left or the resources run out.
Jobs are always processed from top to bottom, and processing stops at
the first queued job that does not fit. Smaller jobs further down in
the queue are not launched, even if they would fit. They have to wait
until all jobs queued before them have been launched.
Jobs in Depend state are not included in the queue until the jobs they
depend on have finished.
To view the current queue:

# Use flag -l or --long for long listing
l3q show queue

# There is a shortcut, use
l3q
l3q --long
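
The launch rule described above can be sketched in a few lines of shell.
This is a simplified model and not part of l3q: the function name
fifo_launch is made up, and resources are reduced to a single count
of free cores, which is an assumption.

```shell
# Hypothetical model of the strict FIFO launch rule; not part of l3q.
# Usage: fifo_launch FREE_CORES NAME:CORES [NAME:CORES ...]
# Prints the names of the jobs that would be launched, in queue order.
fifo_launch() (
    free=$1
    shift
    launched=""
    for job in "$@"; do
        name=${job%%:*}    # text before the colon
        need=${job##*:}    # text after the colon
        # The first job that does not fit blocks everything behind it.
        if [ "$need" -gt "$free" ]; then
            break
        fi
        free=$((free - need))
        launched="$launched $name"
    done
    # Drop the leading space before printing.
    echo "${launched# }"
)

# Job c needs 8 cores and does not fit, so the smaller job d
# behind it is not launched either.
fifo_launch 8 a:4 b:2 c:8 d:1    # prints: a b
```

In this model job d would fit in the two remaining cores, but it
stays queued because job c is ahead of it.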

Jobs

A job is the unit that is queued on the system and contains one
or more tasks. When a new job is added to the system, its initial
status is Queued.
A job can have the following statuses:

  • Queued, the job is in the queue waiting to be started
  • Running, tasks in this job are executing
  • Depend, the job depends on other jobs
    and is waiting for those jobs to terminate
  • Terminated, all tasks in this job have finished execution
  • Cancel, the user has asked the system to cancel the job. The
    system has not yet canceled the job.
  • Canceled, the system has canceled the job and terminated all tasks.
  • Error, the job has terminated but one or more tasks have failed.
  • Node-error, an error has occurred on a node that interferes with the job.

Jobs with status Queued, Running, Depend or Cancel
show up in the queue.
The queue is displayed with the command:

# With flag --long or -l
# for long listing
l3q show queue
# or
l3q

Jobs with status Terminated, Canceled, Error or Node-error
show up in the history. The history is displayed with the command:

# With flag --long or -l
# for long listing
l3q show history
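
The split between queue and history can be written down as a small
lookup. The helper name job_view is hypothetical. It only restates the
two status lists above and does not talk to the L3Q daemon.

```shell
# Hypothetical helper: which listing shows a job with the given status?
job_view() {
    case "$1" in
        Queued|Running|Depend|Cancel)
            echo "queue" ;;
        Terminated|Canceled|Error|Node-error)
            echo "history" ;;
        *)
            echo "unknown"
            return 1 ;;
    esac
}

job_view Cancel      # prints: queue
job_view Canceled    # prints: history
```

Note that Cancel and Canceled end up in different listings: a job
stays in the queue until the system has actually canceled it.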

Jobs can be one of two types:

  • Parallel, tasks are executed in parallel on the specified
    number of compute nodes.
  • Sequence, all tasks are executed in sequence, one after
    another, in the specified order from top to bottom.

Cancel job

To cancel a job run the following command:

# JOBID is the id of the job
# to be found in queue list
l3q cancel job JOBID

A user may only cancel jobs that the user has added. Jobs owned
by other users can only be canceled by the sysadmin.
If other jobs depend on the job being canceled, the cancel
command fails. All jobs that depend on it must be canceled
first.
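
Because dependent jobs have to be canceled first, the order of the
cancel commands matters. The sketch below only prints the commands in
a safe order and does not run them. The function name cancel_plan is
made up, and the caller is assumed to know which jobs depend on the
target job.

```shell
# Hypothetical dry-run helper: print the cancel commands in a safe
# order, dependent jobs first and the target job last.
# Usage: cancel_plan JOBID [DEPENDENT_JOBID ...]
cancel_plan() (
    target=$1
    shift
    for dep in "$@"; do
        echo "l3q cancel job $dep"
    done
    echo "l3q cancel job $target"
)

# Job 5 has two jobs depending on it, 18 and 23.
cancel_plan 5 18 23
```

If the dependent jobs also depend on each other, list them so that
each job comes before the jobs it depends on.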

Task

A task is a command that is executed on a compute node.
A job contains one or more tasks that are executed in sequence
or in parallel on the compute nodes. Tasks are specified in
taskfiles that are supplied with the command:

l3q add -t TASKFILE

Taskfile

A taskfile is required when you add a job to the queue.
The taskfile may contain a special syntax line and must
contain at least one line with a task.
There is one task per line. Empty lines and lines starting with #
are ignored.
Lines containing tasks must have one or two columns. The first
column specifies the path to an executable program and the optional
second column specifies the working directory. Columns are
separated by whitespace. If the second column, workdir, is not
specified, the user's home directory is used as the default workdir.
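
As an illustration of these rules, the sketch below writes a small
taskfile and reads it back. It is an assumption of how the rules
could be applied, not the parser used by l3q. The function name
list_tasks and the paths under /opt/sim and /scratch are made up.

```shell
# Hypothetical reader that mimics the taskfile rules above.
# Prints one "PROGRAM WORKDIR" pair per task line.
list_tasks() {
    while read -r program workdir rest || [ -n "$program" ]; do
        case "$program" in
            ''|'#'*) continue ;;    # skip empty lines and # lines
        esac
        # A missing second column falls back to the home directory.
        echo "$program ${workdir:-$HOME}"
    done < "$1"
}

# Example taskfile with a comment, an empty line and two tasks.
taskfile=$(mktemp)
cat > "$taskfile" <<'EOF'
# prepare the input, then run the simulation

/opt/sim/prepare.sh
/opt/sim/run.sh /scratch/sim
EOF

list_tasks "$taskfile"
rm -f "$taskfile"
```

The first task has no workdir column, so it is listed with the home
directory of the current user. The second task is listed with
/scratch/sim.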

Special syntax in TASKFILE:

#--add-para ...
#--add-seq ...

The special syntax line starts with a token that states whether the job
is sequential or parallel. The command-line arguments can be added on the
same line so that they do not need to be specified on the command line.

The special syntax is required in the file if the job is added with the command:

l3q add -t TASKFILE

If the job is added with the command:

l3q add para ...

The arguments are parsed as follows:
If TASKFILE contains only task lines, both options --cores and --nodes are required.
If TASKFILE contains a special syntax line that specifies the required options, no options are
required on the command line.
If TASKFILE contains a special syntax line and --cores and --nodes are also given on the command
line, the command-line options are used and the special syntax line is ignored.

If the job is added with the command:

l3q add seq ...

The arguments are parsed as follows:
If TASKFILE contains a special syntax line that specifies the job name, this name is used.
If TASKFILE contains a special syntax line and --jobname is also given on the command line, the
command-line option is used and the special syntax line is ignored.
If TASKFILE contains multiple special syntax lines, the last one is used.
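
The precedence rules for l3q add seq can be sketched as follows.
This is a model of the rules and not code from l3q. The function name
effective_jobname is made up, and the sketch assumes that the special
syntax line gives the name as "--jobname NAME", separated by a space.

```shell
# Hypothetical model of the --jobname precedence for sequential jobs.
# Usage: effective_jobname TASKFILE [CLI_JOBNAME]
effective_jobname() (
    # A name given on the command line wins over the taskfile.
    if [ -n "${2:-}" ]; then
        echo "$2"
        exit 0
    fi
    # Otherwise take the last "#--add-seq" line in the taskfile.
    line=$(grep '^#--add-seq' "$1" | tail -n 1)
    set -f          # no filename expansion while splitting the line
    set -- $line    # split the line into words
    # Print the word that follows --jobname, if there is one.
    while [ $# -gt 1 ]; do
        if [ "$1" = "--jobname" ]; then
            echo "$2"
            exit 0
        fi
        shift
    done
)

taskfile=$(mktemp)
cat > "$taskfile" <<'EOF'
#--add-seq --jobname first
#--add-seq --jobname second
/opt/sim/run.sh
EOF

effective_jobname "$taskfile"            # prints: second
effective_jobname "$taskfile" nightly    # prints: nightly
rm -f "$taskfile"
```

If the taskfile has no special syntax line and no name is given on
the command line, the function prints nothing.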

Depend

When a job is added, the --depend flag can be used to specify
that the new job depends on other jobs.
Specify the dependencies when the job is added:

# specify dependencies as a comma 
# separated list of jobids
l3q add -t TASKFILE --depend 5,18,23

Dependencies must be specified as a comma-separated list of
the jobids of other jobs, without any whitespace. This list
is given as the value of the --depend argument.
When a job has dependencies, the jobs it depends on have to
terminate before the job is added to the queue.
If one of the jobs it depends on terminates with an error, this
job also changes status to Error and is not
queued or executed.
The add command has a flag, -i or --jobid, that returns the
jobid assigned to the added job. This makes it
possible to write scripts that set up dependencies as
the jobs are added.

#!/usr/bin/bash

# The last added job will 
# depend on the first two jobs

a=$(l3q add --jobid .....)
b=$(l3q add --jobid .....)
l3q add --depend "$a,$b" .....
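
When a script collects more than a couple of jobids, a small helper
can build the comma-separated list without whitespace. The function
name join_ids is hypothetical. It only formats its arguments and
does not call l3q.

```shell
# Hypothetical helper: join jobids with commas and no whitespace,
# the form that --depend expects.
join_ids() (
    IFS=,
    # "$*" joins all arguments with the first character of IFS.
    echo "$*"
)

join_ids 5 18 23    # prints: 5,18,23

# Possible use with jobids captured from earlier add commands:
#   l3q add -t TASKFILE --depend "$(join_ids "$a" "$b")"
```

The function body runs in a subshell, so the changed IFS does not
leak into the calling script.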

Node status

To see the status of the compute nodes, the nodes that run all jobs,
run the following command:

l3q show status

This displays a list that shows which resources are available and which
resources are in use. The state of each compute node is also shown.
A node can have the following states:

  • Online
  • Offline
  • Maintenance draining, the sysadmin has asked the system
    to take the node offline for maintenance. The node waits
    for all currently running processes to terminate before
    it is put in Maintenance state.
  • Maintenance, the node has no running jobs and can be maintained
    without disturbing the system.
  • Soft Online, the sysadmin has asked the system to set the node
    online again. The node stays in Soft Online until the
    system has confirmed that the node is online, then the status
    changes to Online.
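
The states above form a small cycle, which can be modelled as a
transition table. This is only an illustration: the function name
next_state and the event names are made up, and the sketch assumes
that a node enters Soft Online from Maintenance and that any
unreachable node is set Offline.

```shell
# Hypothetical model of the node states; the event names are made up.
# Usage: next_state STATE EVENT
next_state() {
    case "$1/$2" in
        "Online/maintenance-requested")
            echo "Maintenance draining" ;;
        "Maintenance draining/processes-finished")
            echo "Maintenance" ;;
        "Maintenance/online-requested")
            echo "Soft Online" ;;
        "Soft Online/online-confirmed")
            echo "Online" ;;
        */unreachable)
            echo "Offline" ;;
        *)
            echo "$1" ;;    # any other event leaves the state unchanged
    esac
}

next_state "Online" maintenance-requested    # prints: Maintenance draining
next_state "Soft Online" online-confirmed    # prints: Online
```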

L3Q daemon (l3qd)

l3qd is the central daemon that interacts with all clients
and all node daemons (node-l3qd). Every command entered in
the l3q client is sent to the central daemon and processed, and
the result is returned to the client.
The l3q daemon periodically updates the status of the
compute nodes, sets nodes offline if they are unreachable and
updates the queue. If enough resources are available for a job
to start, the job is sent to node-l3qd for execution.
The l3q daemon receives periodic updates from node-l3qd about the
jobs running on each node. Each node is responsible for sending its
updates to the central daemon.

L3Q Node Daemon (node-l3qd)

The L3Q node daemon is the daemon running on each compute node.
It receives tasks from the central daemon (l3qd) and executes them
on the host in a systemd slice. Status on the tasks is sent
periodically to the central daemon (l3qd).