Datalab User's Guide


 

 

 

 

Getting Started

 

Before you can get to work in datalab, you must identify an array of computers on which datalab may work. These hosts, henceforth known as "Active Storage Units," or ASUs, are computers that will be used for both data storage and remote data processing. The only requirement for an ASU to be part of your datalab environment is that it must run a chirp server on port 9094. Notre Dame's datalab instance already has a large array of ASUs defined in its current configuration.

 

Sets

 

All work in datalab involves sets. A set is just what it sounds like -- a container for data. You can create a set in datalab, then fill it with far more data than your local machine can accommodate. Behind the scenes, datalab will use the array of ASUs at its disposal to locate storage for each object in the set.

 

Creating a Set

 

  create set set_name set_type

o   Example: create set mySet txt

o   set_name is limited to 24 characters and may contain letters, numbers, and underscores

o   set_type should be the extension of the file type with which you will populate the set (e.g., "doc", "dat", or "jpg"). set_type is limited to 24 characters.

 

This command creates an empty set. Of course, now you will want to start filling it with data. NOTE: Remember before doing this that the data processing advantages of datalab are only as great as the size of the ASU array you have defined. More hosts means better performance and more space!

 

Populating a Set

 

  add dir set_name local_dir

o   Example: add dir mySet ./myFiles

o   set_name must be the name of a currently existing set

o   local_dir must be a path to a local directory readable by datalab. Any files in this directory will be added to the set specified -- provided their extension matches the set's type. This action does not recurse into subdirectories.

 

Adding files to a set may take some time. As you add files, they are streamed off the local disk and onto remote storage locations within your ASU cluster.
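Before running "add dir", it can help to preview which files the extension rule described above would pick up. The following is a minimal shell sketch, not part of datalab: the helper name preview_add_dir is hypothetical, and it only approximates the rule (top-level files whose extension matches the set type, no recursion into subdirectories).

```shell
#!/bin/sh
# preview_add_dir (hypothetical helper, not a datalab command):
# list the top-level files in local_dir whose extension matches set_type.
# Usage: preview_add_dir local_dir set_type
preview_add_dir() {
    local_dir=$1
    set_type=$2
    for f in "$local_dir"/*."$set_type"; do
        # Skip subdirectories, and skip the unexpanded pattern
        # that the shell leaves behind when nothing matches.
        [ -f "$f" ] && printf '%s\n' "$f"
    done
    return 0
}
```

For instance, preview_add_dir ./myFiles txt prints one path per line for each .txt file directly inside ./myFiles.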

 

Viewing Your Sets

 

  list sets

  ls

 

This command generates a list of every set in your datalab installation. For each set, you will see the following attributes displayed:

 

o   set name

o   set type

o   number of files in the set

o   parent set*

o   parent job ID*

o   created by (the username of the person who generated the set)

 

*If applicable. Sets created from scratch have no parent set, but those created by a function (see Functions, below) will report this information.

 

Optional Arguments:

  • -b
    • shows the number of inaccessible hosts, if any, on which this set has data stored
  • -t
    • for sets created by functions, shows the runtimes for the fastest and slowest hosts

 

Downloading a Set

 

  get set set_name local_dir

o   set_name must be an existing set

o   local_dir must be a local path to which datalab has write access.

 

Executing this command will cause the entire set to be downloaded to your local directory. Make sure you have enough space before trying this! Naturally, you may create sets that are larger than one machine can store, and you will still want to be able to download parts of the data set. We are looking at ways to make set data more accessible in the future.

 

Functions

 

Overview

 

Distributed storage is only half the appeal of working in datalab. In order to get the most out of what datalab has to offer, you're going to want to start writing functions.

 

Functions are written by you, the user, to certain specifications that datalab is equipped to handle. Functions are designed to work at the file level, but once they are loaded into datalab, they become tools for manipulating large data sets en masse.

 

Function Types

 

Datalab currently supports three kinds of functions, described below:

 

Transform functions -- these are functions that take a single file as input and produce one or more output files. A "transform" function may literally transform a file -- imagine zipping a document or desaturating a photograph -- or it may analyze the input to produce useful experimental data.

 

Select functions -- these are functions that take a single file as input and determine whether or not it meets certain criteria (defined by you, the user). When applied to a set, a select function can be used to define a subset of a larger set. For example, you could create a set of satellite images, then write a function to select a subset containing only those images that feature your house.

 

Compare functions -- these are functions that take two files as input and produce some output describing their relationship. For instance, someone working on iris recognition might apply a compare function to a set of iris images to determine how well the function matches images of the same iris taken at different times.

 

Report Output

 

Most functions are written with the intent of creating a new datalab set with the output. However, sometimes the result of a function is not a new file, but something else, like a numeric value. To facilitate these kinds of functions, transform and compare functions can be written to output their results in a report style. This output type does not create a new set, but rather aggregates the output of each file-level function execution into a single document for you to review. For example, the result of a comparison function may be a simple numeric value, such as a match probability. By defining this function to generate report output, you can easily review the result of each file comparison when the datalab job is done. For details, see Report Output in the Writing Functions section, below.

 

Writing Functions

 

The functions you write for datalab must behave in a very precise way with regard to inputs and outputs. However, we know that not everyone who uses datalab will be a programmer. Our method of function definitions lets you create a "wrapper" around an existing program to force it to comply with our standards.

 

The minimum function definition allowed by datalab consists of a folder containing a single batch file (a shell script with extension .bat). This batch file must share a name with whatever you wish to call the function within datalab. For instance, if you're going to name your function !zip (more on function names later), you will have to name this file "zip.bat".

 

This batch file must conform to the following signature:

 

For transform functions:

./batchName.bat input_file output_file output_file... [parameters]

 

...where the number of output files accepted is equal to the number your function will produce. These outputs must be the ONLY files created by your function, and they must be produced with the exact names specified on the command line. This means, for instance, that the function may not add extensions to output files. By conforming to this standard, you give datalab full control over the inputs and outputs of your function (to answer your unspoken question, datalab will handle the assignment of output extensions at runtime).

If you desire, the batch file can take a final command-line input: a string containing parameters for the function itself. For instance, let's say you defined a function that manipulates images. Perhaps the underlying function supports a score of command-line switches to do things like shrink, expand, and crop images. The batch file can take as its final parameter a string of these switches and pass them along. This way, a function's usefulness does not have to be narrowly defined -- someone using the function in datalab can send parameters to the function that modify its behavior in some way.
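As an illustration of the transform signature, here is a minimal sketch of a wrapper with one output file. The function name sortlines is hypothetical, and the standard sort utility stands in for whatever program you actually want to wrap. The optional parameter string is passed straight through as switches to the underlying program.

```shell
#!/bin/sh
# sortlines.bat (hypothetical): transform wrapper with one output file.
# Usage: ./sortlines.bat input_file output_file [parameters]
sortlines_transform() {
    input_file=$1
    output_file=$2
    params=${3:-}   # optional switches for sort, e.g. "-r" or "-n -u"

    # $params is intentionally unquoted so that a string such as "-n -u"
    # splits into separate switches. The output is written to exactly
    # the name given on the command line, with no extension added.
    sort $params "$input_file" > "$output_file"
}

# Run the transform only when called with datalab-style arguments.
if [ "$#" -ge 2 ]; then
    sortlines_transform "$@"
fi
```

Called as ./sortlines.bat data.txt result "-r", this sketch would write the lines of data.txt to a file named result in reverse-sorted order and create no other files.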

 

For select functions: ./batchName.bat input_file [parameters]

 

This type of function creates no output files. Rather, its output will be based on whether input_file matches your pre-determined criteria (with decision-making programmed into your function, of course). If the file is a match, you must print to standard output the absolute filepath of the input file. If the file is not a match, nothing should be printed. Again, if desired, the final input on the command line can be a string containing parameters for the function.
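A select wrapper following this convention might look like the sketch below. The function name contains, the use of the parameter string as search text, and the default search text "ERROR" are all assumptions made for illustration. The only behavior taken from this guide is the output rule: print the absolute path on a match, print nothing otherwise.

```shell
#!/bin/sh
# contains.bat (hypothetical): select wrapper.
# Usage: ./contains.bat input_file [parameters]
contains_select() {
    input_file=$1
    pattern=${2:-ERROR}   # parameter string is treated as literal search text

    # Resolve the input file to an absolute path.
    dir=$(cd "$(dirname "$input_file")" && pwd) || return 1
    abs_path=$dir/$(basename "$input_file")

    # On a match, print the absolute path; otherwise print nothing.
    if grep -q -F -e "$pattern" "$input_file"; then
        printf '%s\n' "$abs_path"
    fi
    return 0
}

# Run the selection only when called with datalab-style arguments.
if [ "$#" -ge 1 ]; then
    contains_select "$@"
fi
```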

 

For compare functions: ./batchName.bat input_file1 input_file2 [parameters]

 

This type of function takes two input files and produces output describing their relationship, such as a match probability. A compare function can be written to produce report output (see Report Output, above), in which case the result of each file comparison is collected in a single document rather than a new set. Again, if desired, the final input on the command line can be a string containing parameters for the function.

 

Function Wrappers

As described above, the batch file can be used as a function wrapper. That file may call any other program or link to libraries as long as the batch conforms to the prescribed I/O signature and all necessary files are contained within the function directory. In this way, someone with even modest shell scripting ability can write a wrapper around another function to make it datalab-compatible.
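The following sketch shows one way to lay out a function folder from the shell. The folder location, the function name tiff2bmp, the bundled program imageConverter, and the order of arguments passed to it are all assumptions for illustration. Marking the batch file executable is also an assumption, based on the ./batchName.bat invocation style shown above.

```shell
#!/bin/sh
# Create a function folder in a scratch location (path is illustrative).
func_dir=$(mktemp -d)/myFunction
mkdir -p "$func_dir"

# Write the batch file; its name must match the function name.
cat > "$func_dir/tiff2bmp.bat" <<'EOF'
#!/bin/sh
# Usage: ./tiff2bmp.bat input_file output_file [parameters]
# imageConverter is a hypothetical program stored in the same folder;
# its argument order here is assumed, not documented.
here=$(dirname "$0")
"$here/imageConverter" ${3:-} "$1" "$2"
EOF

# Make the wrapper executable so it can be run as ./tiff2bmp.bat
chmod +x "$func_dir/tiff2bmp.bat"
```

Any program or library the wrapper depends on would then be copied into the same folder alongside the batch file.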

 

Defining a Function in Datalab

 

Now that you've got a complete function residing in a local folder, it's time to import that function into datalab. At the datalab prompt:

 

  define !functionName functionType localPath signature

 

o   functionName must match functionName.bat in your function folder (only here you will precede it with an exclamation point).

o   functionType will be either "transform" or "select"

o   localPath is the path to the folder where the function is stored

o   signature describes the inputs and outputs your function takes. More on this below.

 

Function Signature

 

The final value required by the "define" command is the function signature. The function signature defines the number and type of inputs and outputs your function will take. It should be given in the following syntax:

 

(input_type:output_type,output_type,...)

 

For example, a function to convert "tiff" images to "bmp" images would have a signature (tiff:bmp). A function that takes a data file (.dat) and creates a gnuplot file (.gnu) and an output image (.jpg) might have this signature: (dat:gnu,jpg).
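To make the syntax concrete, the sketch below splits a signature string into its input type and output types. It is only an illustration of the format: parse_signature is a hypothetical helper, not something datalab provides, and it performs no validation.

```shell
#!/bin/sh
# parse_signature (hypothetical helper): split "(in:out1,out2,...)"
# into one input type and a list of output types.
parse_signature() {
    sig=$1
    body=${sig#"("}          # drop the leading parenthesis
    body=${body%")"}         # drop the trailing parenthesis
    input_type=${body%%:*}   # text before the colon
    output_types=${body#*:}  # text after the colon

    printf 'input: %s\n' "$input_type"
    old_ifs=$IFS
    IFS=,
    # Unquoted on purpose: split the output list on commas.
    for t in $output_types; do
        printf 'output: %s\n' "$t"
    done
    IFS=$old_ifs
}
```

Each output type listed corresponds to one output_file argument on the batch file's command line, so a function with signature (dat:gnu,jpg) receives one input file and two output files.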

 

A complete function definition example:

 

  define !tiff2bmp transform ./myFunction (tiff:bmp)

 

The local folder ./myFunction contains a file called "tiff2bmp.bat", which calls another program, imageConverter (also in the folder), to convert a tiff file to a bmp file.

 

When the "define" command is executed, a copy of this function is distributed to every ASU in the cluster, waiting for such time as the user wishes to apply it to a set.

 

Applying Functions

 

Finally, it's time to do some work. You have created and populated sets, and now you have functions defined. To apply a function to a set, use the following syntax:

 

  apply !functionName input_set output_set[;output_set;output_set...]

o   Example: apply !tiff2bmp myTiffSet myBMPSet

o   functionName must be an existing function in your datalab environment

o   input_set must be an already-existing set whose type matches the required input of the function and which is populated with data

o   each output_set must be a set that does not yet exist. Datalab will create these sets as a result of the function application.

o   If more than one output is required by the function, separate these output sets with semicolons -- no spaces in between.

 

Executing this command sends a flurry of text to your screen as datalab distributes your job to each ASU, which will execute your job in parallel. You will also be given a job ID as a reference for this particular function application.

 

Listing Functions

 

  list functions

  lf

 

This command shows a list of all functions defined in the system, along with their input/output signatures.

 

NOTE: Any output sets created by a job will be immediately visible in the set listing ("list sets" or "ls"). However, they will have an asterisk noting that their creation is not yet finalized.

 

Jobs

 

Now that you have started a job applying a function to a set, datalab will execute the function in parallel across each ASU hosting data for the input set. You will see your results much faster than if you were processing them in sequence on a single machine, but the job may still take time. Fortunately, you can quit datalab and return at any time to monitor running jobs.

 

Monitoring Jobs

 

  list jobs

 

This command displays a list of running jobs, along with the following information:

 

o   job ID

o   function name

o   input set(s)

o   output set(s)

o   elapsed time since starting job

o   number of hosts returned (how many have completed their work)

o   total number of hosts involved in the job

 

Analyzing Job Performance

 

After a job has completed, you may want to see details on how datalab performed, such as how quickly each ASU handled its work. Future additions to datalab will let you use this information to optimize your data distribution for better performance.

 

  plothosts jobID

 

This command will analyze the runtime for each ASU for the given job, plot the results, and provide you a link to view the generated graph. Note that after jobs disappear from the current job list, job IDs are still available as "parent job" data on the set listing screen.

 

  job jobID

 

This simple command gives much the same information as "list jobs", but it works on a completed job, and provides a few more details about the slowest host, the fastest host, and the average time of all the ASUs involved in the job.

 

In addition, you will be given the option to view a breakdown of job completion times for each host, along with