Augmented Reality
Apr 15th, 2007 by Ricker
Background
A wearable heads-up display (HUD) and augmented reality system will provide a soldier on the ground with geospatial-intelligence data without requiring the soldier to look at a map or global positioning system (GPS) unit or to look away from the surroundings. The current monocle-type helmet-mounted displays being fielded through PM Land Warrior provide situational awareness to dismounted soldiers, but these displays are essentially miniature computer monitors and are not transparent, so the soldier loses binocular vision, depth perception, and peripheral vision when using them. Commercially available systems are large, obtrusive, and fragile, and they do not interface with geographic information systems (GIS), ruggedized wearable computers, or personal digital assistants (PDAs).
The hardware must be lightweight, rugged, and mounted on a helmet or in protective eyewear. Software will also need to be developed. It will determine the direction the soldier is looking and the distance between the soldier and the object being observed. Through the HUD, the software will display feature data about the surroundings in the soldier’s field of view, such as the names of buildings and streets, GPS waypoints and tracks, locations of friendly forces, and other types of geospatial intelligence, without obstructing the soldier’s view. The system will read geospatial-intelligence data from shapefiles or geodatabases and be compatible with a ruggedized device running the Windows CE operating system.
Tracker accuracy
The Microvision ND2100 HUD has a look-angle of 23° by 17°, or 0.40 radians by 0.30 radians. The screen resolution is 800 by 600 pixels, so each pixel subtends 0.0005 radians (0.40 radians / 800 pixels). The width of one pixel is equal to the tangent of the pixel angle times the distance from the screen.
Figure 1 Calculating pixel width
| Distance (m) | 10 | 100 | 1000 |
| --- | --- | --- | --- |
| Pixel width (m) | 0.005 | 0.05 | 0.5 |
| Relative position accuracy (m) | 0.02 | 0.2 | 2.0 |
Table 1 Pixel size versus distance (in meters)
A typical icon is at least 8×8 pixels and usually 16×16 pixels. The icon must align with the real-world object. For icons of this size, we assume that our tolerance for icon alignment is plus or minus 4 pixels. Our location and orientation must be accurate to within these pixel tolerances. Location can be off by 0.2 m (i.e., 4 × 0.05 m) and still align the icon with a target 100 m away. Orientation must be within 0.002 radians (i.e., 4 × 0.0005 radians), or approximately 0.1 degrees.
For instance, Figure 2 shows a listening post labeled by a 16×16 pixel icon on an 800×600 pixel image. The target is at least 1000 meters away from the camera.
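The arithmetic behind Table 1 and the tolerances above can be reproduced with a short sketch. The field of view, resolution, and 4-pixel tolerance come from the text; the function names are ours and are only illustrative.

```python
import math

FOV_H_RAD = 0.40        # horizontal look-angle from the text (about 23 degrees)
H_PIXELS = 800          # horizontal screen resolution from the text
TOLERANCE_PIXELS = 4    # assumed icon alignment tolerance (plus or minus)

def pixel_angle(fov_rad=FOV_H_RAD, pixels=H_PIXELS):
    """Angle subtended by one pixel, in radians."""
    return fov_rad / pixels

def pixel_width(distance_m, angle_rad):
    """Width in meters covered by one pixel at the given distance."""
    return math.tan(angle_rad) * distance_m

def position_tolerance(distance_m, angle_rad, tol_px=TOLERANCE_PIXELS):
    """Allowed location error in meters for a target at the given distance."""
    return tol_px * pixel_width(distance_m, angle_rad)

def orientation_tolerance(angle_rad, tol_px=TOLERANCE_PIXELS):
    """Allowed orientation error in radians."""
    return tol_px * angle_rad

angle = pixel_angle()
for distance in (10.0, 100.0, 1000.0):
    print(distance, pixel_width(distance, angle), position_tolerance(distance, angle))
print(math.degrees(orientation_tolerance(angle)))
```

Note that the orientation tolerance does not depend on distance, while the location tolerance grows linearly with it, which is why the nearest targets set the GPS requirement.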
Figure 2 Full view with a 16×16 icon
Figure 3 shows the detail around the target. The red box is the displayed icon centered on the target. The blue box represents the icon shifted four pixels right and down from the target. To a human observer, the shifted icon still appears aligned with the physical target.

Figure 3 Icon detail (red) with 4 pixel misalignment (blue)
Labeling challenge
View management, a relatively new area of research in augmented reality applications, is about the spatial layout of two-dimensional virtual annotations in the view plane. The primary task is to place virtual labels that identify information about real counterparts. There are many challenges in labeling for augmented reality, such as the following:
- Labels should avoid overlap
- Labels should not obscure the object that they are labeling
- Labels should be intuitively associated with the object
- There is a finite number of labels that a human can process
University research laboratories are currently investigating these challenges and developing algorithms to solve the augmented reality labeling problem. One of the leaders in this field is Dr. Steven Feiner of Columbia University. He will be serving as a consultant in this phase of research. We anticipate employing the software developed by his laboratory in our solution.
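To make the constraints listed above concrete, here is a minimal greedy placement sketch. It is not the Columbia laboratory's software or algorithm. The `Rect` type, the four candidate positions, the 60×12 pixel label size, and the cap of seven labels are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rect:
    x: float
    y: float
    w: float
    h: float

    def overlaps(self, other):
        """True if the two rectangles share interior area."""
        return (self.x < other.x + other.w and other.x < self.x + self.w and
                self.y < other.y + other.h and other.y < self.y + self.h)

def candidate_positions(box, w, h, gap=2):
    """Label slots next to the object's corners, so the label stays associated with it."""
    return [
        Rect(box.x + box.w + gap, box.y - h - gap, w, h),      # upper right
        Rect(box.x - w - gap, box.y - h - gap, w, h),          # upper left
        Rect(box.x + box.w + gap, box.y + box.h + gap, w, h),  # lower right
        Rect(box.x - w - gap, box.y + box.h + gap, w, h),      # lower left
    ]

def place_labels(objects, label_size=(60, 12), screen=Rect(0, 0, 800, 600), max_labels=7):
    """objects: list of (name, Rect) in priority order. Returns {name: label Rect}."""
    placed = {}
    occupied = [box for _, box in objects]  # labels must not obscure any object
    for name, box in objects:
        if len(placed) >= max_labels:       # cap the number of labels shown
            break
        for cand in candidate_positions(box, *label_size):
            inside = (cand.x >= screen.x and cand.y >= screen.y and
                      cand.x + cand.w <= screen.x + screen.w and
                      cand.y + cand.h <= screen.y + screen.h)
            if inside and not any(cand.overlaps(o) for o in occupied):
                placed[name] = cand
                occupied.append(cand)       # later labels must avoid this one
                break
    return placed

labels = place_labels([("post", Rect(100, 100, 16, 16)), ("gate", Rect(120, 100, 16, 16))])
print(labels)
```

A greedy pass like this is easy to run every frame, but it can leave a low-priority object unlabeled when all four of its slots are taken.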
Transducer Markup Language
Transducer Markup Language (TML) is the emerging open industry standard for describing, normalizing, transporting, storing and processing transducer data from any sensor anywhere in the world. It is also the emerging standard for level one sensor fusion.
The principals of Distributed Instruments have been key developers of TML since its inception. TML was created and successfully demonstrated in small business innovative research (SBIR) grants from the US Air Force Research Laboratory (AFRL) and the Missile Defense Agency. The US National Geospatial Intelligence Agency (NGA) is currently funding the transition of TML to the Open Geospatial Consortium (OGC) [1] to establish TML as an open industry standard. US Special Operations Command (SOCOM) is planning an advanced concept technology demonstration (ACTD) called MASCOT to begin in January 2006 that includes TML as a major component. NGA is also proposing a project named TAPESTRE to begin as soon as possible to expand the deployment of TML.
Transducer Markup Language (TML) is a breakthrough in simplicity for level zero and level one sensor fusion. TML provides a solid, standard platform for enabling higher levels of sensor fusion. TML differs from previous sensor data formats in the following key ways:
- All sensor data is transported at its own sample rate
- Position sensors are treated like any other sensor
- A machine-readable system description defines position dependencies and other dependencies
- The data consumer calculates position and other dependencies through interpolation
- Designed and built for a service-oriented architecture
All sensors sample at an independent rate. Stated another way, different sensors generate a signal at different times and at different intervals. Some sensors such as scanners create a signal of a specific duration, while others are nearly instant. Figure 1 shows conceptually the relationship between the sample times of five different sensors. The signals from the sensors are shown as boxes. The position of the boxes shows the relative occurrence of the signal. The width of the box shows the relative duration of the signal. The distance between boxes on the same row shows the relative sampling rate of the sensor.
In legacy sensor systems, the interdependencies between the sensors are hard coded. Likewise, the differences in the sampling rates of the various sensors are obscured and association is nearly arbitrary. Figure 1 shows conceptually how a legacy sensor data format combines sensor data. The format is an image with metadata in its headers. All time relations have been stripped from the data. In such a legacy format, there is no easy or direct means for interpolating data from the various sensors for higher accuracy. In TML, all of the data, the image and the location information, is sent at its own sample rate with precision time associated with each data packet. The consumer of the data can then use interpolation to make precise calculations of location at a given specific time.
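As an illustration of the consumer-side interpolation described above, the sketch below estimates a position at an image timestamp from position samples taken at a different rate. The `(time, values)` sample layout is an assumption made for this example, not the TML wire format, and linear interpolation is only one possible choice.

```python
from bisect import bisect_right

def interpolate(samples, t):
    """Linearly interpolate a time-stamped series at time t.

    samples: list of (time, values) tuples sorted by time, where values is a
    tuple of floats such as an (x, y, z) position.
    """
    times = [s[0] for s in samples]
    if not times[0] <= t <= times[-1]:
        raise ValueError("time outside the sampled interval")
    i = bisect_right(times, t)          # index of the first sample later than t
    if i == len(times):                 # t coincides with the last sample
        return tuple(samples[-1][1])
    t0, v0 = samples[i - 1]
    t1, v1 = samples[i]
    f = (t - t0) / (t1 - t0)            # fraction of the way from t0 to t1
    return tuple(a + f * (b - a) for a, b in zip(v0, v1))

# Position samples at 1 Hz; an image frame arrives at t = 1.25 s.
gps = [(0.0, (0.0, 0.0, 0.0)), (1.0, (10.0, 0.0, 0.0)), (2.0, (20.0, 5.0, 0.0))]
print(interpolate(gps, 1.25))
```

Because every packet keeps its own precise time, the same function can be applied to any sensor stream. Angles that wrap around would need a different interpolation scheme than the plain linear one used here.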
The transducer system description, called Sensor Modeling Language (SensorML), defines the geometry of a transducer and the kinematics of the interdependencies of the sensors for calculating the position of sensors. For instance, for the system shown in Figure 2, the system description would define the camera’s location and orientation relative to the gimbal, the IMU and the GPS. The position dependencies are expressed as the parameters of Euler transforms. A processor can calculate the interior orientation of the sensor relative to the earth-centered, earth-fixed (ECEF) coordinate reference system using the values of the gimbal, the IMU and the GPS by standard kinematic equations.
Kinematics is the science of motion that treats motion without regard to the forces that cause it. Most work in kinematics is focused on robotics. However, the same calculations and equations are necessary in sensor fusion. It is no surprise that the sensor fusion efforts of the US National Institute of Standards and Technology (NIST) arose from its robotics laboratory. Our objective is to resolve the origin of the sensor energy relative to a standard datum, which we have chosen as earth-centered, earth-fixed (ECEF). The energy is known relative to the sensor’s internal coordinate system. In our example, the sensor of interest is an imaging sensor. We must transform the imaging sensor’s internal coordinate system to the ECEF coordinate system through a series of transformations determined by the dependency graph shown in Figure 5.
There are many ways to transform between coordinate systems. We will standardize on the ZYX Euler angles, as shown in Figure 6. This convention prescribes the following algorithm:
- Rotate around the Z axis at an angle ω (omega).
- Rotate around the resulting Y axis at angle φ (phi).
- Rotate around the resulting X axis at angle κ (kappa).

Figure 6 ZYX Euler angle translation
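The three rotations above can be sketched as a product of elementary rotation matrices. Because each rotation is about the axis produced by the previous one, the combined matrix is Rz(ω)·Ry(φ)·Rx(κ). The function names are ours, and the sign conventions (right-handed axes, counterclockwise positive angles) are assumptions.

```python
import math

def rot_z(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def rot_y(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def rot_x(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def euler_zyx(omega, phi, kappa):
    """Rotate about Z by omega, then the resulting Y by phi, then the resulting X by kappa."""
    return matmul(matmul(rot_z(omega), rot_y(phi)), rot_x(kappa))

def rotate(matrix, vector):
    """Express a vector given in the rotated frame in the original frame."""
    return tuple(sum(matrix[i][k] * vector[k] for k in range(3)) for i in range(3))

# A 90 degree rotation about Z carries the rotated frame's X axis onto the original Y axis.
print(rotate(euler_zyx(math.pi / 2, 0.0, 0.0), (1.0, 0.0, 0.0)))
```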
To determine the origin of the energy, we must transform the energy vector from the image sensor coordinate system T4 to the standard ECEF coordinate system T0. There are seven changing values in these transforms. The rest are constants. The angles (α1, β1, γ1) are the angle readings from the IMU. The distances (a2, b2, d2) are the position readings from the GPS. All of these readings are relative to the ECEF. The angle (α4) is the reading from the rotation encoder. The change in the coordinate system can only occur along this one angle. The gimbal is fixed to the platform, so the angles (α3, β3, γ3) and the distances (a3, b3, d3) are all constants. Similarly, the image sensor is fixed to the gimbal arm, so the distances (a4, b4, d4) are constants as well.
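A minimal sketch of this chain, using 4×4 homogeneous transforms, follows. The composition order, the merging of the IMU rotation and GPS translation into a single platform transform, and the choice of Z as the encoder's rotation axis are assumptions made for illustration. The actual dependency graph would come from the transducer system description.

```python
import math

def euler_zyx(a, b, g):
    """Closed form of Rz(a) * Ry(b) * Rx(g) as a 3x3 nested list."""
    ca, sa = math.cos(a), math.sin(a)
    cb, sb = math.cos(b), math.sin(b)
    cg, sg = math.cos(g), math.sin(g)
    return [
        [ca * cb, ca * sb * sg - sa * cg, ca * sb * cg + sa * sg],
        [sa * cb, sa * sb * sg + ca * cg, sa * sb * cg - ca * sg],
        [-sb, cb * sg, cb * cg],
    ]

def transform(angles=(0.0, 0.0, 0.0), offset=(0.0, 0.0, 0.0)):
    """4x4 homogeneous transform: rotate by ZYX Euler angles, then translate."""
    r = euler_zyx(*angles)
    return [r[0] + [offset[0]], r[1] + [offset[1]], r[2] + [offset[2]],
            [0.0, 0.0, 0.0, 1.0]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def sensor_to_ecef(imu_angles, gps_position, gimbal_angles, gimbal_offset,
                   encoder_angle, sensor_offset):
    """Compose the chain from the image sensor frame to ECEF.

    Seven inputs change at run time: three IMU angles, three GPS distances and
    one encoder angle. The gimbal angles and both offsets are constants.
    """
    t_platform = transform(imu_angles, gps_position)
    t_gimbal = transform(gimbal_angles, gimbal_offset)
    t_sensor = transform((encoder_angle, 0.0, 0.0), sensor_offset)
    return matmul(matmul(t_platform, t_gimbal), t_sensor)

def apply(t, point):
    """Map a point from the image sensor frame into ECEF."""
    v = list(point) + [1.0]
    return tuple(sum(t[i][k] * v[k] for k in range(4)) for i in range(3))

# Platform yawed 90 degrees, sensor mounted 1 m along the gimbal arm's X axis.
t = sensor_to_ecef((math.pi / 2, 0.0, 0.0), (100.0, 200.0, 300.0),
                   (0.0, 0.0, 0.0), (0.0, 0.0, 0.0), 0.0, (1.0, 0.0, 0.0))
print(apply(t, (0.0, 0.0, 0.0)))
```

Keeping the constants and the run-time readings as separate arguments mirrors the split described in the text: only the platform and encoder values need to be interpolated for each frame.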
We calculate the origin of the sensor energy by intersecting the sensor data’s instantaneous field of influence (IFOI) with one or more ambiguity spaces. The space of the origin of the phenomenon is the intersection of the shape space of the ambiguity with the shape space of the IFOI.
In the theory of sensor fusion, all sensors are treated as in situ sensors. A sensor simply measures the response to energy at a particular point in space within a particular IFOI. From the transducer system description and the interpolation of sensor data, we know where the sensor received the energy and we know from which direction the energy came, but we do not know the point from where the energy came. To know from which point the energy came, that is, to resolve the phenomenon to a geo-location, we must intersect the IFOI with one or more ambiguity spaces.
We resolve the energy origin for all sensors the same way. Figure 7 shows an example of how to resolve a camera image. Knowing the IFOI orientation of a pixel, we can intersect the vector with a terrain model. The point of intersection is the origin of the energy.

Figure 7 Resolving the origin of image data and radar data is the same in TML
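The camera case can be sketched as a ray marched against a terrain model. For simplicity, this example assumes a local east/north/up frame rather than ECEF and represents the terrain as a hypothetical `height_at(x, y)` function. The step size, maximum range, and bisection refinement are illustrative choices.

```python
import math

def intersect_terrain(origin, direction, height_at,
                      max_range=5000.0, step=1.0, refine=30):
    """Return the first point where the ray meets the terrain, or None.

    origin, direction: (x, y, z) tuples in a local east/north/up frame.
    height_at: function (x, y) -> terrain height, standing in for a terrain model.
    """
    norm = math.sqrt(sum(c * c for c in direction))
    unit = tuple(c / norm for c in direction)

    def point(t):
        return tuple(o + t * c for o, c in zip(origin, unit))

    def clearance(t):
        x, y, z = point(t)
        return z - height_at(x, y)      # positive while the ray is above ground

    lo, t = 0.0, step
    while t <= max_range:
        if clearance(t) <= 0.0:         # the ray crossed the surface in (lo, t]
            hi = t
            for _ in range(refine):     # bisect to tighten the crossing point
                mid = 0.5 * (lo + hi)
                if clearance(mid) > 0.0:
                    lo = mid
                else:
                    hi = mid
            return point(hi)
        lo = t
        t += step
    return None

# Sensor 10 m above flat ground, looking 45 degrees down toward the east.
print(intersect_terrain((0.0, 0.0, 10.0), (1.0, 0.0, -1.0), lambda x, y: 0.0))
```

Fixed-step marching can skip over terrain features narrower than the step, so a real terrain model would likely need a step tied to its grid spacing.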
Radar is resolved in exactly the same manner, except that the ambiguity space is not a terrain model. The radar system determines location by creating its own ambiguity space. An actuator generates a chirp at a known time. The sensor receives the returned chirp at a later time. The elapsed time defines a sphere, that is, an ambiguity space containing every point to which the chirp could have traveled and returned in that amount of time.
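For the radar case, a minimal sketch follows. It assumes the actuator and the sensor are co-located, so the ambiguity sphere is centered on the sensor and its radius is half the round-trip distance. The function names are ours.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # meters per second

def range_from_round_trip(elapsed_s):
    """Radius of the ambiguity sphere for a chirp that returns after elapsed_s."""
    return SPEED_OF_LIGHT * elapsed_s / 2.0

def resolve_radar_return(origin, direction, elapsed_s):
    """Intersect the IFOI direction with the ambiguity sphere around the sensor."""
    radius = range_from_round_trip(elapsed_s)
    norm = math.sqrt(sum(c * c for c in direction))
    return tuple(o + radius * c / norm for o, c in zip(origin, direction))

# A chirp returning after 2 microseconds places the reflector about 300 m away.
print(range_from_round_trip(2e-6))
print(resolve_radar_return((0.0, 0.0, 0.0), (0.0, 3.0, 4.0), 2e-6))
```

The structure matches the camera case: a direction from the IFOI is intersected with an ambiguity space. Only the shape of that space differs.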
The results of TML for level one sensor fusion are as follows:
- All sensor data is represented and transported the same way
- All sensor data is described in the same machine readable format (XML)
- The geospatial interdependencies (kinematics) of all sensors are described in the same XML format
- The geo-location of all sensor data is calculated in the same process
Successful organisms
George Miller is sometimes credited with creating the field of cognitive psychology with his paper, “The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information.” In that paper, Miller presented the results of several studies on the human ability to discern differences through the senses. In these studies, subjects were asked to discern stimuli such as sounds at different pitches, sounds at different volumes, solutions at different salinity, colors at different hues, colors at different brightness, shapes at different sizes, and so forth.
Whether the sense was touch, taste, sight or sound, the results were remarkably the same. Miller found that, regardless of the input, the subjects could only discern on average seven unique inputs. The mean was 6.5, with one standard deviation including 4 to 10, a remarkably narrow range. If the number of inputs went up beyond seven, errors dramatically increased.
Miller observed,
“There seems to be some limitation built into us either by learning or by the design of our nervous systems, a limit that keeps our channel capacities in this general range. On the basis of the present evidence it seems safe to say that we possess a finite and rather small capacity for making such unidimensional judgments and that this capacity does not vary a great deal from one simple sensory attribute to another.
“I have been careful to say that this magical number seven applies to one-dimensional judgments. Everyday experience teaches us that we can identify accurately any one of several hundred faces, any one of several thousand words, any one of several thousand objects, etc. The story certainly would not be complete if we stopped at this point. We must have some understanding of why the one-dimensional variables we judge in the laboratory give results so far out of line with what we do constantly in our behavior outside the laboratory. A possible explanation lies in the number of independently variable attributes of the stimuli that are being judged. Objects, faces, words, and the like differ from one another in many ways, whereas the simple stimuli we have considered thus far differ from one another in only one respect.”
Miller went on to present the results of research that demonstrated that the human ability to differentiate does increase as the number of dimensions in which the stimuli vary increases. We combine color, shape, position and a number of other inputs to make differentiations. Miller deduced that,
“We might argue that in the course of evolution those organisms were most successful that were responsive to the widest range of stimulus energies in their environment. In order to survive in a constantly fluctuating world, it was better to have a little information about a lot of things than to have a lot of information about a small segment of the environment. If a compromise was necessary, the one we seem to have made is clearly the more adaptive.”
I believe Miller’s analysis has direct implications for sensor fusion. Before the Internet, a sensor signal would exist inside a single isolated machine or a well-defined isolated system. Those systems provided “a lot of information about a small segment of the environment.” With the advent of the Internet, a single processor can access the signals from any number of sensors. Sensor fusion enables our Internet applications to have “a little information about a lot of things.” Perhaps sensor fusion may enable our new systems to have a lot of information about a lot of things.
Miller argued that humans are successful organisms because they are able to fuse the input from several senses to identify a greater number of different phenomena. Under the same reasoning, sensor fusion should provide an increased level of survivability to things like battle fleets and retail firms.
Currently, each sensor, no matter how sophisticated, provides information on only one dimension of the environment. Radar, sonar, SAR and imagery systems on a battleship are each limited in much the same way that the human sense of salinity or color is limited. With sensor fusion, a battleship can combine multiple sensors to identify threats, much as a human combines multiple senses to identify people and words. Miller stated that, “in the course of evolution those organisms were most successful that were responsive to the widest range of stimulus energies in their environment.” We might state that,
In the course of war, battle fleets were most successful that were responsive to the widest range of stimulus energies in their environment.
Quite often, what is true in war is true in other human endeavors. We could just as easily state,
In the course of increasing competition, the retail firms were most successful that were responsive to the widest range of stimulus energies in their environment.
The implications of sensor fusion are quite significant.
Cognitive limitations
Miller was focused on the inherent cognitive limitations of humans. He presented the concept of a communication channel as having inputs and outputs. If we picture two partially overlapping circles, the left representing the input and the right representing the output, then the overlap of the two represents the transmitted information. Miller assessed the human as a communication channel.
“In the experiments on absolute judgment, the observer is considered to be a communication channel. Then the left circle would represent the amount of information in the stimuli, the right circle the amount of information in his responses, and the overlap the stimulus-response correlation as measured by the amount of transmitted information…. If the human observer is a reasonable kind of communication system, then when we increase the amount of input information the transmitted information will increase at first and will eventually level off at some asymptotic value. This asymptotic value we take to be the channel capacity of the observer.”

Figure 8 Miller’s concept of the human as a communication channel
The objective of sensor fusion is to determine that the observations from two or more sensors correspond to the same phenomenon. The purpose of sensor fusion is to improve the human condition by enabling human judgment to be responsive to the widest range of stimulus energies in the environment.
“There is a clear and definite limit to the accuracy with which we can identify absolutely the magnitude of a one-dimensional stimulus variable. I would propose to call this limit the span of absolute judgment, and I maintain that for one-dimensional judgments this span is usually somewhere in the neighborhood of seven.”
Miller recognized that the human mind has many ways of coping with or overcoming this span of absolute judgment, the most significant being what he called chunking and recoding.
“We must recognize the importance of grouping or organizing the input sequence into units or chunks. Since the memory span is a fixed number of chunks, we can increase the number of bits of information that it contains simply by building larger and larger chunks, each chunk containing more information than before.
“A man just beginning to learn radio-telegraphic code hears each dit and dah as a separate chunk. Soon he is able to organize these sounds into letters and then he can deal with the letters as chunks. Then the letters organize themselves as words, which are still larger chunks, and he begins to hear whole phrases. I do not mean that each step is a discrete process, or that plateaus must appear in his learning curve, for surely the levels of organization are achieved at different rates and overlap each other during the learning process. I am simply pointing to the obvious fact that the dits and dahs are organized by learning into patterns and that as these larger chunks emerge the amount of message that the operator can remember increases correspondingly. In the terms I am proposing to use, the operator learns to increase the bits per chunk.
“In the jargon of communication theory, this process would be called recoding. The input is given in a code that contains many chunks with few bits per chunk. The operator recodes the input into another code that contains fewer chunks with more bits per chunk. There are many ways to do this recoding, but probably the simplest is to group the input events, apply a new name to the group, and then remember the new name rather than the original input events.”
If we simply feed the data of thousands or millions of sensors to a human, we will have achieved nothing. Humans are not able to make judgments on such a vast amount of unprocessed input. Sensor fusion does not fulfill its purpose unless it provides or at least enables chunking and recoding for humans.
Humans need a representation of the physical environment that is abstract, chunked and encoded, that accommodates the cognitive limitations and enables judgment. To be truly successful, the geographically-enabled augmented reality for dismounted soldiers developed under this grant must take into account these cognitive limitations.
Technical objectives
Distributed Instruments will perform one or more experiments to determine the feasibility of an unobtrusive HUD system. Basic development of optics, projection hardware, labeling software, and integration with existing GPS-enabled wearable computers or PDAs will be part of the feasibility study. Distributed Instruments will create a report that addresses both hardware and software issues and includes the results of introductory development and descriptions of the techniques used.
In particular, the experiments and report will focus on answering the following questions:
- Can GPS provide or be augmented to provide 0.2 m accuracy?
- Is there an affordable, compact inertial measurement unit (IMU) that can provide 0.002 radian (0.1 degree) accuracy?
- Can the helmet be adequately fixed to the soldier to maintain orientation tolerances?
- Is ±4 pixels a feasible tolerance for icon alignment for the user? Can it be greater or should it be smaller?
- Can the software and hardware refresh icon positioning and display within 1/30th second? Is that rate fast enough for human perception?
- Do the current Transducer Markup Language (TML) and associated sensor network architectures provide the necessary data for augmented reality?
- What labeling software is available to meet the needs of the dismounted soldier? How close is that software to production? What level of effort is necessary to get the labeling software to production?
Related work
Augmented reality
Several universities have been pursuing research in augmented reality. DOD has funded some of these efforts, such as that at Columbia.
- Columbia University, Mobile Augmented Reality System (MARS)
- Advanced Information Technology Branch of Information Technology Division at the Naval Research Laboratory, Battlefield Augmented Reality System (BARS)
- University of North Carolina, The UNC Tracker Project
- University of Southern California
- Colorado School of Mines
Label management
Dr. Feiner is one of the leaders in addressing the challenge of label management.
- B. Bell, S. Feiner, and T. Höllerer, View Management for Virtual and Augmented Reality, In Proc. UIST ‘01, Orlando, FL, November 11-14 2001. pp. 101-110.
- B. Bell, T. Höllerer, and S. Feiner, An Annotated Situation-Awareness Aid for Augmented Reality, In Proc. UIST ‘02, Paris, France, October 27-30 2002. pp. 213-216.
- B. Bell and S. Feiner, Dynamic Space Management for User Interfaces, In Proc. UIST ‘00, San Diego, CA, November 5-8 2000. pp. 239-248.
The laboratory at Columbia University has created labeling software. Distributed Instruments will be evaluating this software in its prototype.
Transducer Markup Language
Distributed Instruments is one of the major contributors to the development of Transducer Markup Language (TML), the enabling technology for sensor data fusion. IRIS Corp created TML on a US Air Force Research Laboratory (AFRL) SBIR grant. It is the only SBIR to go to Phase 3 at Wright-Patterson in the past 15 years. Distributed Instruments has been a subcontractor in the development of TML since the beginning.
TML is on a fast track to become the DOD standard for sensor data fusion. The US National Geospatial Intelligence Agency (NGA) is funding the transition of TML to the Open Geospatial Consortium (OGC) to manage TML as an open industry standard. US Special Operations Command (SOCOM) is initiating in January 2006 an advanced concept technology demonstration (ACTD) of TML called MASCOT.
Distributed Instruments is focused on building hardware and software for TML. On July 1, 2005, we released the alpha version of our Transducer Data Server and our Transducer Reader software. Figure 10 provides a photograph of Distributed Instruments’ initial prototype of a position-enabled video device. The device provides video, location and attitude along a single USB wire. It is designed to stream TML and enable TML-based sensor data fusion. The next version of the system compresses all the components to a single board, removing all of the internal wires.

Figure 10 Initial prototype of Distributed Instruments’ position-enabled video device