Project Ideas

The following are rough project ideas for your consideration. Part of your job will be to crystallize the purpose, methods, and scope of your specific project into a project proposal. Note that not all projects involve writing a lot of code, but all involve a thorough quantitative evaluation of a system. Students may undertake projects not listed here, but should consult with the instructor before submitting a proposal.
  • Gigantic Distributed Files. The CCL storage pool gives easy access to lots of different storage devices. However, some users have very large files that cannot fit on any single disk. Build a library for accessing very large (multi-TB) files that are partitioned or striped across multiple disks. The library should have the simple interface of bigfile_open, bigfile_read, bigfile_write, bigfile_lseek, bigfile_close, and bigfile_delete. Explore the performance of this library on both sequential and random-access workloads for both small and large files. Is it possible to get better performance than using a single remote disk?

  • Overflow File System. Users with lots of data know that processing data on a local filesystem (such as /tmp) is much faster than attempting to harness a distributed file system. However, when /tmp fills up, your entire system stops working! To fix this problem, design and build an "overflow" filesystem for the power user. Using the FUSE Toolkit, build a filesystem that stores most files in /tmp until the system is nearly full, and then places additional data on a larger (but slower) disk in the CCL storage pool. Measure the performance of this filesystem on a selection of workloads that need lots of storage.

  • Fast Searchable Filesystem. Exhaustive search of a large filesystem for a particular file name or an email containing some data can be frustratingly slow: without some kind of index, the entire directory structure and every data block must be examined. Design and construct a filesystem using the FUSE Toolkit that automatically indexes files containing text as they are added to the system. Compare the performance of your filesystem to a conventional system that uses find and grep for search.

  • Managing a Cluster of Virtual Machines. The CSE department has built a new cluster (sc-xx) for running virtual machines. This allows researchers to construct their own computing environments without physically purchasing and installing new machines. Build a software system for managing this virtual machine cluster. Make it easy for a user to simply request an image by name (Red Hat 7.3 + MATLAB installed), whereupon the system will identify an available machine, install the disk image, bring up the virtual machine, and inform the user of the hostname and location. The problem is that each virtual machine type requires enormous amounts of data to be moved from place to place. Measure the performance of creating, running, and destroying a virtual machine, and find a way to minimize the overhead by cleverly selecting machines or compressing images.

  • Improving Software Installation with Disk Images. Traditional software installation is very inefficient. Users must download (or copy) a ZIP, TAR, or RPM file to a local disk, then unpack the software by writing lots of fiddly little files to all sorts of directories all across the disk. If you have ever installed Office, you know how long this can take! A potentially more efficient way is to distribute a software package as a single disk image that can be written once sequentially and then "mounted" into the filesystem view. Come up with a system for managing and installing software this way. Measure the performance of installing disk images versus unpacking archives. How do you deal with the problem of managing the user's PATH and similar configuration variables? Can this system scale to 100s or 1000s of software packages?

  • A File System for Lots of Little Files. Many scientific users employ filesystems in an unusual way: they create directories containing thousands or millions of little files, each containing perhaps 10-20 bytes of data. File systems are unusually bad at storing such data -- each file occupies a minimum of one 4KB disk block. However, users continue to work this way, because the data is easy to manipulate with standard commands. Begin by demonstrating a filesystem workload that could be dramatically improved. For example, compare the performance of creating one million 10-byte files versus one million 10-byte records in a single file. Then, design, implement, and evaluate a filesystem tailored to datasets of many small files. Use the FUSE Toolkit to easily build and deploy the system in user space. Compare the performance of this filesystem to a conventional filesystem on a wide variety of workloads.

  • Large Scale File Distribution. A common problem in distributed computing is the need to get a large file out to all nodes of a system. In a previous semester, students created a tool, chirp_distribute, that distributes files on the CCL storage pool using a spanning tree, and demonstrated that this was much faster than either parallel or sequential distribution. However, a remaining problem is the order in which transfers should be made. This affects the performance significantly, especially in a network where transfers are fast within clusters, but potentially slow between them. Design and evaluate an algorithm for sending files to all hosts in a system using a spanning tree. Concentrate on choosing an optimal (or at least better!) order of transfers.

  • User Filesystem Study. Many assumptions about user behavior in operating systems are based on studies that are decades old. (example one, example two.) Produce a new study of how users behave in the ND CSE network. Examine tools such as strace, tcpdump, and fstrace for the purpose of recording logs of filesystem activity. Demonstrate that you can record and analyze a few hours of activity. Then, get permission from Curt and a few of your friends to trace activity on a few workstations for several weeks. Write a comprehensive report on the file access behavior of those people over the semester.

  • Filesystem Performance Comparison. The user of a modern Linux workstation has access to a surprising number of filesystems. Which filesystems are best for different kinds of tasks? Many common filesystem benchmarks simply aren't realistic: they measure things that people don't do, like writing a gigabyte one byte at a time. Create several benchmarks of your own design that measure a variety of disk-intensive tasks that people really do: installing software, booting a complex system, processing image data, and so forth. Compare the performance of these benchmarks on five modern filesystems in Linux. Explain the results and suggest future areas for filesystem development.

  • OS Intrusion Traceback. When an intrusion of some kind has been discovered, administrators want to know: How did this happen? Who logged in and ran what program that modified this vital file? Create a system that makes it easy to answer these questions. Here is one idea: Create a kernel module that logs a message for each important event, such as a process creation, a change in user identity, or a file access. Then, build a tool that, after an intrusion, can reconstruct how a particular file was created or modified. (Example: sshd ran bash as user fred which ran vi and used it to modify the file bankaccount.) Measure the performance overhead of both the logging component and the reconstruction tool.
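
As a starting point for the Gigantic Distributed Files idea, the following is a minimal sketch of how a logical byte range might be mapped onto several disks. It assumes simple round-robin striping with a fixed stripe size; the function name map_stripes and its parameters are hypothetical, and a real bigfile_read or bigfile_write would still have to perform the actual I/O on each piece.

```python
from typing import List, Tuple


def map_stripes(offset: int, length: int, stripe_size: int,
                num_disks: int) -> List[Tuple[int, int, int]]:
    """Split a logical byte range into (disk, disk_offset, chunk_length) pieces.

    Assumption (not from the project description): round-robin striping,
    where stripe k lives on disk k % num_disks, at byte position
    (k // num_disks) * stripe_size within that disk's backing file.
    """
    if offset < 0 or length < 0 or stripe_size <= 0 or num_disks <= 0:
        raise ValueError("offset/length must be >= 0; sizes must be > 0")

    pieces = []
    end = offset + length
    while offset < end:
        stripe = offset // stripe_size       # which logical stripe we are in
        within = offset % stripe_size        # position inside that stripe
        chunk = min(stripe_size - within, end - offset)
        disk = stripe % num_disks
        disk_offset = (stripe // num_disks) * stripe_size + within
        pieces.append((disk, disk_offset, chunk))
        offset += chunk
    return pieces


if __name__ == "__main__":
    # A 10-byte read at offset 0 with 4-byte stripes over 3 disks.
    print(map_stripes(0, 10, 4, 3))
```

The stripe size is one of the parameters worth varying in the evaluation: small stripes spread a single large request over many disks, while large stripes keep small requests on one disk.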
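
For the Lots of Little Files idea, the initial comparison (many 10-byte files versus many 10-byte records in one file) could be sketched roughly as follows. This is only an illustration: the helper names are hypothetical, the default count is far below one million so that it finishes quickly, and because nothing is forced to disk with fsync the timings mostly reflect the cost of system calls and metadata updates in the cache rather than the disk itself.

```python
import os
import tempfile
import time

RECORD = b"0123456789"  # one 10-byte record


def write_many_files(directory: str, count: int) -> None:
    """Create `count` separate files, each holding a single record."""
    for i in range(count):
        with open(os.path.join(directory, f"rec{i:07d}"), "wb") as f:
            f.write(RECORD)


def write_one_file(path: str, count: int) -> None:
    """Append `count` records to a single file."""
    with open(path, "wb") as f:
        for _ in range(count):
            f.write(RECORD)


def timed(fn, *args) -> float:
    """Return the wall-clock seconds taken by fn(*args)."""
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start


def run(count: int = 10000) -> dict:
    """Run both workloads in a scratch directory and report the results."""
    with tempfile.TemporaryDirectory() as root:
        many = os.path.join(root, "many")
        os.mkdir(many)
        single = os.path.join(root, "records.dat")
        files_s = timed(write_many_files, many, count)
        records_s = timed(write_one_file, single, count)
        return {
            "files_s": files_s,
            "records_s": records_s,
            "file_count": len(os.listdir(many)),
            "single_bytes": os.path.getsize(single),
        }


if __name__ == "__main__":
    print(run())
```

A fuller study would repeat each measurement several times, clear or bypass the cache between runs, and also measure reading, listing, and deleting the data.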
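
For the OS Intrusion Traceback idea, the reconstruction step might look something like the sketch below. The Event record and its fields are invented for illustration and do not correspond to any real kernel log format; the sketch also ignores process ID reuse and changes of user identity within a process, both of which a real tool would need to handle.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Event:
    """One hypothetical log record: an 'exec' or a 'write'."""
    kind: str          # "exec" or "write"
    pid: int
    ppid: int = 0      # parent process (exec events only)
    user: str = ""     # user identity (exec events only)
    program: str = ""  # program name (exec events only)
    path: str = ""     # file modified (write events only)


def _ancestry(procs: Dict[int, Event], pid: int) -> List[str]:
    """Walk parent links from pid upward; return outermost ancestor first."""
    chain, seen = [], set()
    while pid in procs and pid not in seen:
        seen.add(pid)
        ev = procs[pid]
        chain.append(f"{ev.program} (user {ev.user})")
        pid = ev.ppid
    chain.reverse()
    return chain


def trace_back(events: Iterable[Event], path: str) -> List[str]:
    """Return the process chain behind the most recent write to `path`."""
    procs: Dict[int, Event] = {}  # pid -> latest exec event seen so far
    chain: List[str] = []
    for ev in events:
        if ev.kind == "exec":
            procs[ev.pid] = ev
        elif ev.kind == "write" and ev.path == path:
            # Capture the ancestry as it stood when the write happened.
            chain = _ancestry(procs, ev.pid)
    return chain


if __name__ == "__main__":
    log = [
        Event("exec", 100, 1, "root", "sshd"),
        Event("exec", 200, 100, "fred", "bash"),
        Event("exec", 300, 200, "fred", "vi"),
        Event("write", 300, path="bankaccount"),
    ]
    print(" -> ".join(trace_back(log, "bankaccount")))
```

This version scans the whole log for every query; the cost of that scan on a realistic log is part of what the reconstruction-tool measurement would quantify.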