Figure 1: Target wireless sensor network deployment with traditional
sensors, our multimedia sensors (Ci), storage bricks
(Sj) and compute hubs (Hk)
The ultimate goal of this project is to develop technologies that can connect
a number of uncoordinated views of a scene in order to deduce the global threat.
Uncertain scene recognition from any single view is checked and validated
against captures from other angles in order to improve recognition accuracy.
Our infrastructure widely deploys a number of high fidelity video sensors and
stores all the captured streams for certain well defined durations. These sensors
are deployed as necessary in an ad hoc fashion. Smarter sensors and compute
hubs can then analyze these stored streams to detect and review emerging threats,
or potentially capture events that are easily missed by simpler, real-time
algorithms. Like the mythical creature Hydra, our system is designed
to be robust against the loss of video sensing and processing components. We
describe a few representative application scenarios enabled by our proposed
infrastructure to further motivate this research:
- retrospective surveillance: The goal of these systems is to review
the captured scenes from other sites in order to validate whether a hint
of a threat detected at the local site is part of a larger pattern. Imagine
our proposed infrastructure deployed to monitor all important landmarks in
the United States (Figure 1). Analyzing the images
from multiple cameras peering into the crowds can allow detection algorithms
to potentially make more reliable identifications of terrorists than
single cameras can. More importantly, we can develop recognition algorithms that,
when triggered by the suspicious activity of one tourist, analyze the stored
streams from other landmarks to see if this same tourist exhibited suspicious
behavior in those other sites; activity which may be missed by each site
locally. Analyzing the streams in concert can also help identify more complex
threat behavioral patterns. One can imagine recognition algorithms that identify
a threat event that involves multiple actors, identified not only because
each actor exhibits similar suspicious behavior but also because
they all scoped out the landmark sites without overlapping
with each other. While one person was identified videotaping the Empire
State Building and the Statue of Liberty in New York and the Sears Tower in
Chicago within a week, another individual was noticed videotaping
the Brooklyn Bridge in New York and Navy Pier in Chicago in the same week.
Note that tourists videotaping landmark sites is not itself the threat;
rather, the specific pattern and choice of sites might give clues to suspicious intent.
Similar recognition algorithms can be developed to identify suspicious
loitering by retrospectively analyzing the behavior over a period of time.
The loitering might involve a single actor or multiple actors, each probing
for vulnerabilities over a period of time. For example, seven miscreants
can coordinate the loitering, each one randomly scoping out the site once.
- event capture for telepostsence: We define telepostsence
as application scenarios that provide a retrospective multimedia experience
of traditional telepresence systems. These systems collate the string of
video events that were part of a particular scene of interest; the key
is to reject events that were not pertinent to that scene. Imagine
our multimedia sensors monitoring a highly structured activity such as a
nuclear power plant control room, combat information center or factory assembly
lines for training and post hoc analysis. Our ability to quickly deploy a
self managing infrastructure is especially attractive in these scenarios
where the use of a videographer per operator can be intrusive. Retrospective
algorithms monitor each operator to learn how each individual responds
to stressful activities. For example, Dell uses video equipment to manually
examine a work team's every movement, looking for any extraneous bends or
wasted twists in order to reduce assembly time. A combat information center
can monitor all of its personnel during a combat operation phase, and the streams
are later analyzed to learn lessons about what to do differently. The system
captures the interactions of the specific operator; as the operator pauses
and looks at the work space, retrospective algorithms detect these pauses,
identify where the operator looked during the particular pause and search
the rest of the cameras deployed in the room to locate the camera that captured
the event that distracted the operator. The end result is a single video
stream that automatically captures the operator and any interactions with
the workspace, without being distracted by the actions of other operators
sharing the same workspace.
- adaptive battlefield sensing: The goal of these systems is to constantly
analyze the battlefield for new threats and use the knowledge gained to adaptively
scan the horizon for future events with near real-time response. Wireless
sensor network deployments in battlefield scenarios use a wide variety of
low fidelity sensors that capture environmental parameters such as temperature,
humidity, light, and motion to detect and respond to potential threats.
However, the primary motivation of the enemy is to evade the sensing and
aggregation algorithms built into the sensor deployment. The enemy might
move too slowly to trigger motion sensors, camouflage their tanks, or use
materials that do not trigger magnetometers. Our
work focuses on deploying multimedia sensors as a last line of defense. These
cameras can watch minute details and catch threat behaviors that were unknown
at the time of the wireless sensor network (WSN) deployment. Widely deploying
a number of high fidelity video sensors allows field commanders and smarter
sensors to analyze the stored streams to detect emerging threats that are easily
missed by simpler, real-time algorithms and unanticipated by the sensor
system developers. Segments from these multimedia sensors can be reviewed
to detect the visual camouflaging technique used. This information is then
used to direct the sensors to peer into the horizon to detect similar tanks
and other emerging threats.
These target applications point to a need to liberally capture the scene.
The captured scenes can either be analyzed in real-time or stored for retrospective
analysis. Real-time processing is inherently not scalable; the system needs
to have been actively processing all the views of interest. On the other hand,
storing the stream allows for the luxury of not having to evaluate and recognize
the threat quickly. The recognition algorithm can take the time to analyze as
many views as necessary. This is especially true for threats such as loitering
that happen over a period of time. Also, video streams generate tremendous amounts
of data; a high fidelity video stream easily consumes 4-5 GB/hour. It is not
feasible to transport all this data out of the deployment for real time processing
(for lack of wireless link capacity, battery resources etc.). There is a need
for in-situ storage so that only interesting scenes need be transported out
of the deployment. Given the scale of these deployments, we advocate a fully
distributed storage approach to allow resiliency to failure of any components.
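The in-situ storage requirement implied by these figures can be estimated with a quick back-of-the-envelope calculation; the sensor count and retention window below are illustrative assumptions, not numbers from the proposal:

```python
# Rough storage sizing for a hypothetical deployment.
GB_PER_HOUR = 4.5          # mid-point of the 4-5 GB/hour figure above
NUM_SENSORS = 20           # assumed deployment size
RETENTION_DAYS = 7         # assumed "well defined duration" for retention

total_gb = GB_PER_HOUR * 24 * RETENTION_DAYS * NUM_SENSORS
print(f"In-situ storage needed: {total_gb:.0f} GB (~{total_gb / 1024:.1f} TB)")
```

Even this modest deployment accumulates on the order of 15 TB per week, which is why the captured data must stay near the sensors rather than be shipped out over wireless links.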
Note that we expect all these interesting recognition tasks outlined to be
extremely resource intensive, both in terms of their computational and network
requirements. Our goal then is to first build the infrastructure that will
allow us to explore the nature and complexity of these recognition tasks.
Prof. Surendar Chandra leads the
research efforts to build the scalable storage infrastructure and Prof. Pat
Flynn will build the recognition applications on this infrastructure.
Our Multimedia Sensor Storage architecture
Figure 2: component functionality
The target system consists of a number of sensors (Ci),
storage bricks (Sj) and compute hubs (Hk),
potentially connected using ad hoc wireless networking technologies for quick
deployment of the system components. We advocate a distributed approach; storage
bricks are freely deployed alongside multimedia sensors to spatially localize
the streams and allow for incrementally scalable storage. The sensors and storage
bricks self-organize such that sensors can identify suitable bricks to which to
replicate and migrate content. The compute hubs use these same location mechanisms
to locate the various streams and build interesting recognition applications.
The minimum functionality required of the sensors, storage bricks and compute
hubs is illustrated in Figure 2.
- multimedia sensors have enough computational and local storage capabilities
to capture, encode and automatically segment the high fidelity stream. Variable
size segments are the logical unit of storage. For our research platform,
we use a VGA resolution (640x480) stream captured at 30 fps. High fidelity
streams are large; we introduce lifetime as a novel abstraction for
systematically managing the storage scalability requirements. The sensors
do not preprocess and reduce the fidelity of the captured streams; our goal
is to use complex compute hubs for further processing. We make no
assumptions about how these sensors are deployed and maintained.
- storage bricks expose their availability and capabilities such that
sensors can choose the storage bricks to replicate the segments. Segments
stored in the storage bricks may be migrated to other storage bricks to manage
the local resources. Segments are also rejuvenated to increase their lifetimes;
each brick can rejuvenate segments without coordination with other replicas.
- compute hubs perform retrospective analysis of the stored streams
to make interesting deductions; not always possible from analyzing a single
stream in real-time. The compute hubs might be deployed in-situ or remote
compute hubs may selectively process stored streams.
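As a rough sketch of the glue between these three components, the per-segment record that a sensor hands to the storage bricks might look like the following; the field names are our own illustration, not the proposal's:

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    """Minimal per-segment metadata implied by the component roles above."""
    segment_id: str     # assigned by the capturing sensor
    sensor_id: str      # which multimedia sensor (Ci) produced it
    start_time: float   # capture timestamp (epoch seconds)
    duration_s: float   # variable-size segments imply variable durations
    lifetime_s: float   # requested storage lifetime (the SLA with the bricks)
    replicas: list = field(default_factory=list)  # storage bricks (Sj) holding copies

# A sensor creates a segment and records the brick chosen for its first replica.
seg = Segment("seg-0001", "C3", 1_700_000_000.0, 12.5, 86_400.0)
seg.replicas.append("S7")
```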
Availability of inexpensive and high capacity storage, wireless networks and
multimedia sensors makes such deployment scenarios increasingly feasible.
Key Research Components and Research Plan
Deploying the system components in our target environments brings its own
set of unique challenges; components can experience transient failures (e.g.,
lack of energy or network connectivity), fail permanently, be moved to a new
location, or be deployed in non-ideal locations. Given the scale of the expected
deployment, managing the infrastructure to provide resilient capture and storage
for the compute hubs is of the utmost importance. We are developing a fully
distributed mechanism; the capture sensors locate the desired bricks that can
provide the requisite resiliency and replicate the captured streams. The bricks
independently manage the stream replication, rejuvenation and migration in
order to manage their local resources. The system must not depend on any single
component for its correct functioning.
Research Challenges Addressed
The Hydra system consists of two important components: a self managing media
capture and storage system that will form the basis for innovative retrospective
analysis algorithms.
Self managing multimedia capture and storage system
We advocate three important mechanisms: a) an abstraction to manage the voluminous
data from media capture; b) a mechanism for the various system components to
locate each other, allowing for efficient media transfers while reducing the
maintenance overhead and c) mechanisms that allow each component to balance its
local resource requirements with global requirements. The size of multimedia
segments drives these policy choices.
- managing storage scalability using a lifetime abstraction: Profuse
capture from a large number of sensors requires equally scalable storage.
Traditionally, storage scalability is achieved by increasing storage capacity,
transcoding to reduce size, or manual reclamation. Our project achieves storage
scalability by systematically restricting the storage lifetimes. The lifetimes
act as service level agreements between the storage bricks and the multimedia
sensors. To account for the inability of the users to precisely specify these
intervals, lifetimes are expressed as a mixture of persistent followed by
probabilistically decaying time intervals. Given the dynamic and transient
nature of the system, it is important to choose the right entity to manage
lifetimes. The key insight is that, for transient components, the unit that
stores the segment is also the best unit to manage its lifetime. We associate
object lifetimes as a first class attribute of each multimedia segment and
unify the object storage with lifetime management into a self-contained abstraction.
The sensors will attach the lifetimes to segments and dynamically locate
the distributed storage bricks to provide the overall segment lifetimes and
achieve storage resiliency. The segments are self managing and do not depend
on the sensor that created the segment to further manage its location and
lifetime. Our recently funded NSF proposal titled ``CAREER: Scalable self
managing multimedia storage'' will primarily focus on developing the lifetime
notions for a scalable storage abstraction. We will leverage the research
results from this effort for project Hydra.
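A minimal sketch of the lifetime abstraction follows; the exponential form of the decaying interval and the parameter names are our assumptions, since the proposal only specifies a persistent interval followed by a probabilistically decaying one:

```python
import math

def retention_probability(age_s: float, persistent_s: float, decay_tau_s: float) -> float:
    """Probability that a segment is still retained at a given age.

    During the persistent interval the segment is guaranteed to be kept;
    afterwards, retention decays probabilistically.  Because the decay is
    memoryless, each brick can evaluate this locally (e.g. once per
    reclamation pass) without coordinating with other replicas.
    """
    if age_s <= persistent_s:
        return 1.0  # hard guarantee: always kept
    return math.exp(-(age_s - persistent_s) / decay_tau_s)
```

For example, with a one-hour persistent interval and a ten-minute decay constant, a replica is certain to survive its first hour and then becomes progressively less likely to be retained on each brick's independent reclamation pass.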
- managing the component location using a distributed overlay management: The
next challenge is the location of the storage relative to the multimedia
sensors. The system needs to identify and use new sensor and storage components
as well as gracefully recover from components leaving the system, either
temporarily or permanently. Media segments are large. There is a need
for storage that is tuned to the requirements of remote deployments.
The key challenge is to choose replica sites that minimize the cost for the
initial replication as well as for retrievals from all the potential compute hubs.
Our preliminary analysis illustrated the benefit of expander graphs for
quickly identifying nodes, reducing the maintenance overhead on each node
as well as for efficiently searching contents. The problem of finding optimal
replication locations is similar to the traditional P-Median problem;
our research also requires us to identify the optimal P. Many of
these problems are at least NP-hard. The challenge is to design mechanisms
that have the desired properties of expander graphs and solve the P-Median problem
in a distributed fashion.
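A centralized greedy sketch of the replica-placement step for a fixed P follows; it is only an approximation of the P-Median formulation, and the proposal additionally requires choosing P itself and solving the problem in a distributed fashion:

```python
def greedy_p_median(costs, p):
    """Greedily open p replica bricks to approximately minimize total cost.

    costs[c][b] is the transfer cost from client c (a sensor or compute hub)
    to candidate brick b.  Each step opens the brick that most reduces the
    total cost of serving every client from its cheapest open brick.
    """
    n_clients, n_bricks = len(costs), len(costs[0])
    chosen = []
    best = [float("inf")] * n_clients  # cheapest open brick per client so far
    for _ in range(p):
        best_brick, best_total = None, None
        for b in range(n_bricks):
            if b in chosen:
                continue
            total = sum(min(best[c], costs[c][b]) for c in range(n_clients))
            if best_total is None or total < best_total:
                best_brick, best_total = b, total
        chosen.append(best_brick)
        best = [min(best[c], costs[c][best_brick]) for c in range(n_clients)]
    return chosen
```

The greedy heuristic gives a known constant-factor approximation for this family of facility-location problems, which makes it a reasonable baseline against which distributed variants can be compared.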
- managing and balancing the node's local as well as peer resource consumption: Multimedia
segments are large; our design calls for the segments to be transferred between
sensors, storage bricks and compute hubs. Our primary focus is on managing
the resources consumed by these transfers. We will use application level
multicast mechanisms such as End System Multicast for efficient transport
between sensors and the chosen set of replica storage bricks. Frequently,
the segments need to be transferred through other components to increase
the wireless network range. The mobile ad hoc networking (MANET) community
provides us with a wealth of technologies that enable the source and the
destination sensors to route the streams through the intermediate sensors.
Previous work on multimedia sensing and streaming in the context of MANETs
has not addressed the resource requirements and management policies of the
intermediate routers. Our preliminary results show that such laissez-faire
approaches can incapacitate intermediate nodes.
There is a need to quantify the resource requirements and effectively
manage the resources consumed by the forwarding traffic. The intermediate
nodes need to manage their own resource consumption as well as provide
predictable end-to-end delays to the transit traffic. Though not our primary
focus, the resource management techniques developed can also be used to
defend against malicious network denial of service attacks (by limiting
the consumed resources) on our sensor environment.
We will develop novel fine grain Operating System resource management mechanisms
for quantifying and managing the forwarding network traffic as well as streaming
mechanisms that help these intermediate sensors manage local resources while
achieving the end to end stream QoS requirements.
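One common way an intermediate node could bound the resources consumed by transit traffic is a token bucket; the sketch below uses illustrative parameters and is not the proposal's mechanism:

```python
class TokenBucket:
    """Token-bucket limiter an intermediate node could apply to forwarding traffic."""

    def __init__(self, rate_bps: float, burst_bytes: float):
        self.rate = rate_bps          # sustained forwarding budget (bytes/sec)
        self.capacity = burst_bytes   # tolerated burst size
        self.tokens = burst_bytes     # start with a full bucket
        self.last = 0.0               # time of the last refill

    def allow(self, now: float, packet_bytes: int) -> bool:
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return True   # forward the packet
        return False      # shed or delay transit traffic
```

The same cap that protects a node's battery and link budget also blunts denial-of-service attempts, since a flood of transit traffic simply exhausts the bucket rather than the node.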
- maintaining segment integrity: Given the nature of our target deployments,
it is likely that they will be exposed to attacks by the adversary. For the
stored data to have utility for forensic analysis, it must be possible to
establish its authenticity and chain of custody. Hence, we will need to explore
mechanisms to guarantee the integrity of segments stored in the system. We
will seek to do this at two levels: the first is by using appropriate cryptographic
primitives, such as keyed hashes, to allow data corruption to be detected.
The second level will utilize replication to guarantee availability in the
face of situations where data segments may be deleted, either due to policy
choices or intrusive behavior. To enable the chain of custody to be established,
an appropriate indexing scheme will be built to allow retrospective analysis
to establish where data originated from and what path it took en route to
being included in the final set.
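The first level could be sketched with Python's standard hmac module; key distribution and management are out of scope here:

```python
import hmac
import hashlib

def tag_segment(key: bytes, segment: bytes) -> bytes:
    """Compute a keyed hash over a stored segment so corruption is detectable."""
    return hmac.new(key, segment, hashlib.sha256).digest()

def verify_segment(key: bytes, segment: bytes, tag: bytes) -> bool:
    # Constant-time comparison avoids leaking tag bytes through timing.
    return hmac.compare_digest(tag_segment(key, segment), tag)

key = b"per-deployment secret"  # illustrative; real key management is a separate problem
tag = tag_segment(key, b"frame data")
```

A brick (or compute hub) that recomputes the tag can detect any bit-level tampering with the stored segment, provided the key itself was not compromised.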
Retrospective analysis algorithms explored:
Building on the capture and storage infrastructure, we will investigate the following algorithms:
- Multicamera moving object tracking: Determination of intent of actors
in video streams requires a robust tracking ability including multi-sensor
integration to manage handoff as well as re-acquisition. We must assume that
sensors have overlapping fields of regard, but it may not be necessary or
desirable to treat the sensor network as a series of stereo rigs. Robust
tracking consists of initial acquisition to initialize trackers, tracker
updates in easy situations (a single moving actor) as well as in ambiguous or
hard situations (crossing actors, temporary occlusion), and permanent departure
(tracker retirement). These tasks will be addressed in the context of a single
camera view. Then, the problems of sensor handoff (one sensor inherits a
tracker from another) can be addressed.
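The tracker lifecycle described above can be summarized as a small state machine; the state and event names below are our own labels, not the proposal's:

```python
# Tracker lifecycle: acquisition, easy/ambiguous updates, handoff, retirement.
TRANSITIONS = {
    ("ACQUIRING", "lock"):       "TRACKING",
    ("TRACKING",  "occluded"):   "AMBIGUOUS",  # crossing actors, temporary occlusion
    ("AMBIGUOUS", "reacquired"): "TRACKING",
    ("TRACKING",  "left_view"):  "HANDOFF",    # another sensor may inherit the tracker
    ("HANDOFF",   "inherited"):  "TRACKING",
    ("AMBIGUOUS", "timeout"):    "RETIRED",    # permanent departure
    ("HANDOFF",   "timeout"):    "RETIRED",
}

def step(state: str, event: str) -> str:
    """Advance the tracker state; unknown events leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```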
- Semantic tagging of actor trajectories: This research is part of
an interesting area known as ``determination of intent''. Like all research
in this area, our ideas are somewhat speculative. The hypothesis is that
there are common motion patterns associated with an actor that can be tagged.
Examples include loitering, casing, and stalking for situations with a low
density of actors, and altercations or violent crowd motion (e.g., fleeing
an explosion) for higher-density situations. A useful line of inquiry would
seem to be to examine the literature on crowd dynamics, and discuss motion
patterns with experts in the surveillance industry, with a goal of developing
motion templates for detection and tagging of motion patterns.
- Integrating multimedia data sets to extract patterns of interest: One
of the important goals for the recognition task is to extract patterns of
interest from the raw video data. Some of these patterns will not be discernible
in individual streams but will be present in the emergent integrated data
set. The process of melding the information can occur at a number of levels
of abstraction. At the most basic level, data from different sensors that
relates to the same scene can be gathered and synthesized into a single representation
of the scene. In the process we can increase the fidelity of the frames beyond
what individual sensors were capable of recording. When particular regions
are deemed to be of interest by preliminary automated analysis, this capability
will be invaluable in homing in on them and checking for false positives. We will
explore the nature of these resolution enhancement and pattern recognition
mechanisms for our system.
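At the most basic level, fusing co-registered views can be sketched as simple pixel averaging, which reduces independent sensor noise by roughly the square root of the number of views; real fusion would first register the frames, a step deliberately omitted in this sketch:

```python
def fuse_frames(frames):
    """Average perfectly aligned frames (each a flat list of pixel values).

    A naive fusion sketch: averaging N independent captures of the same
    scene suppresses per-sensor noise, raising the effective fidelity of
    the combined frame beyond any single sensor's recording.
    """
    n = len(frames)
    return [sum(px) / n for px in zip(*frames)]
```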
Test bed: Continuous multimedia capture and storage
We are building our own multimedia sensor platform to test our innovative
software techniques. Our goal is to deploy these sensors throughout our lab
space to gain valuable practical experience, assembling each multimedia sensor
from off the shelf components. We are currently building
our multimedia sensors using inexpensive commodity components; VIA processor
and mother board (VIA EPIA MII 6000E Fanless Mini-ITX Motherboard with 600
MHz Eden processor for the storage bricks and a VIA EPIA MII12000 with 1.2
GHz processor with similar peripheral resources for the sensing nodes), bluetooth
and compact flash IEEE 802.11b wireless NIC. This particular processor board
was chosen because a) the Mini-ITX motherboards are small, b) the x86 compatible
processor allows us to run our OS of choice, FreeBSD/Linux, c) it supports FireWire/IEEE
1394 for connecting high fidelity image and audio capture devices, d) it has a
compact flash slot on the motherboard itself, and e) it is inexpensive (this
hardware costs around $500, including the case). We built our first prototypes in the summer
of 2004. We are also investigating upcoming Nano-ITX based motherboards as
well as Oxford AV940 multimedia processor boards for our storage and capture nodes.
Presently, these prototypes continuously monitor Surendar's office and the Experimental
Systems lab. Check out mmsensor01, mmsensor02 and mmsensor03.
We gratefully acknowledge generous support from the National Science Foundation
(NSF) and Defense Intelligence Agency (DIA).