GestureCap is a real-time system that uses computer vision and sound synthesis to turn hand gestures into sound controls. It lets users create and shape sounds interactively through their hand movements.
This year, our work on GestureCap focused on improving responsiveness and accuracy across the system. A major part of the effort went into precise latency measurement, since in HCI systems latency is the gap between a user's intended action and the system's audio or visual response. To measure it, we built a setup around a Teensy microcontroller that monitors the audio signal directly through an analog input, letting us capture the delay between a hand gesture making contact with a surface and the resulting sound output. Latency is computed as the difference between two timestamps taken on a shared clock: one from the electrical contact triggered by the gesture itself, and one from the detected audio output.
We also reworked the GestureCap pipeline by moving from a multithreaded to a multiprocessing approach, which makes more efficient use of system resources and reduces processing delays.
In addition, we introduced a calibration system for real-time trigger detection, which ensures accurate and consistent measurements across different test runs. With these optimizations, GestureCap now delivers faster and more reliable gesture-to-sound interaction, especially when paired with a high FPS camera.
This year, we built a new latency measurement setup to get more accurate and reliable results. The Teensy 4.1 directly timestamps both the trigger event (electrical contact) and the audio detection, ensuring that no intermediate communication delays distort the readings.
By monitoring the audio signal either through a direct AUX connection or a microphone input, the system captures the true end-to-end latency from gesture to sound output. This setup removes errors from USB or serial communication timing and gives a clean, consistent measurement.
Combined with the updated multiprocessing pipeline and a calibration system for real-time trigger detection, our latency measurements are now precise and consistent across multiple hardware setups, from high-performance desktops to standard laptops.
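Because both timestamps live on the Teensy's own clock, the host-side computation reduces to a single subtraction. Below is a minimal sketch of reading one measurement over serial; the one-line-per-tap format `<trigger_us>,<audio_us>` and the port path are assumptions for illustration, not the actual output of the Teensy firmware or log_serial.py.

```python
import serial  # pyserial

# Hypothetical serial line format "<trigger_us>,<audio_us>": both timestamps come from
# the Teensy's own clock, so their difference is the gesture-to-sound latency.
with serial.Serial("/dev/ttyACM0", 115200, timeout=1) as port:  # placeholder port path
    line = port.readline().decode().strip()
    if line:
        trigger_us, audio_us = (int(v) for v in line.split(","))
        print(f"latency: {(audio_us - trigger_us) / 1000.0:.2f} ms")
```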
This pipeline sets up a two-process, shared-memory video processing system for low-latency hand-tap detection and OSC triggering.
The producer process captures frames from a FLIR camera, measures acquisition and conversion times, and writes each frame into one of two pre-allocated shared memory buffers, switching between them to avoid overwriting in-use data. It also updates shared timing values and a timestamp indicating when the frame was captured.
The consumer process continuously reads the latest available frame from shared memory, runs the hand pose detector, and applies a tap detection algorithm based on pre-loaded calibration parameters. When a valid tap is detected, it sends an OSC trigger message and logs timing metrics (frame age, camera read time breakdown, and detection time) to a CSV file.
The system uses Python’s multiprocessing shared memory and Value objects for fast, lock-free data transfer, ensuring minimal frame latency between capture and detection. The design allows the camera capture and the pose detection to run in parallel without blocking each other.
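The sketch below illustrates the double-buffered shared-memory exchange in stripped-down form. The frame size, the random stand-in for the camera read, and the dummy detection step are placeholders; the real pipeline adds the FLIR capture, pose detection, tap logic, OSC triggering, and CSV logging.

```python
import time
import numpy as np
from multiprocessing import Process, Value
from multiprocessing import shared_memory

H, W = 480, 640                      # assumed frame size; the real pipeline uses the FLIR resolution
FRAME_BYTES = H * W * 3

def producer(buf_names, active_idx, frame_ts):
    """Capture frames and write them alternately into two shared buffers."""
    bufs = [shared_memory.SharedMemory(name=n) for n in buf_names]
    views = [np.ndarray((H, W, 3), dtype=np.uint8, buffer=b.buf) for b in bufs]
    i = 0
    while True:
        frame = np.random.randint(0, 256, (H, W, 3), dtype=np.uint8)  # stand-in for the camera read
        views[i][:] = frame            # write into the buffer the consumer is not reading
        frame_ts.value = time.time()   # capture timestamp for frame-age measurement
        active_idx.value = i           # publish which buffer now holds the newest frame
        i = 1 - i
        time.sleep(0.005)              # stand-in for the camera frame interval

def consumer(buf_names, active_idx, frame_ts):
    """Always process the most recently published frame."""
    bufs = [shared_memory.SharedMemory(name=n) for n in buf_names]
    views = [np.ndarray((H, W, 3), dtype=np.uint8, buffer=b.buf) for b in bufs]
    while True:
        frame = views[active_idx.value]                 # newest frame, no queueing
        age_ms = (time.time() - frame_ts.value) * 1000  # how stale the frame is
        _ = frame.mean()                                # stand-in for pose + tap detection
        print(f"frame age: {age_ms:.1f} ms")
        time.sleep(0.01)

if __name__ == "__main__":
    buffers = [shared_memory.SharedMemory(create=True, size=FRAME_BYTES) for _ in range(2)]
    names = [b.name for b in buffers]
    active = Value("i", 0)   # index of the buffer with the newest frame
    ts = Value("d", 0.0)     # its capture timestamp
    Process(target=producer, args=(names, active, ts), daemon=True).start()
    Process(target=consumer, args=(names, active, ts), daemon=True).start()
    time.sleep(2)            # run the demo briefly, then clean up
    for b in buffers:
        b.close()
        b.unlink()
```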
We implemented a real-time calibration system to make trigger detection both accurate and consistent. The goal is to ensure the system reacts exactly when the user makes a trigger gesture, without firing early or with noticeable delay.
During calibration, the system:
- Records the average vertical position (y-coordinate) of the hand landmarks when the hand is resting on the surface.
- Measures the natural pixel noise from Mediapipe’s tracking.
- Sets the detection threshold using the formula:
Threshold = Mean Rest Position + (3 × Standard Deviation)
This approach ensures the threshold stays far enough from the resting noise to prevent false positives, while still low enough to trigger instantly when the user’s hand actually makes contact.
This setup is currently used for surface trigger detection, but the same logic can be adapted for other cases, for example, mapping gestures between two points in space for range-based interactions.
The calibration can be repeated anytime and adapts automatically to changes in camera alignment, lighting, or hand position, keeping detection accurate and consistent across sessions and hardware setups.
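As an illustration, the threshold computation and the trigger check can be sketched as below. Using landmarks 17 to 20 (the pinky) follows the calibration procedure described later in this document; the exact comparison direction and data layout in the actual scripts may differ.

```python
import numpy as np

PINKY_LANDMARKS = [17, 18, 19, 20]  # hand landmarks averaged during calibration

def calibrate_threshold(rest_frames):
    """rest_frames: one (21, 2) array of pixel-space landmarks per frame,
    collected for ~1 s while the hand rests on the surface."""
    rest_ys = [frame[PINKY_LANDMARKS, 1].mean() for frame in rest_frames]
    mean_rest = float(np.mean(rest_ys))    # average resting position
    noise = float(np.std(rest_ys))         # natural tracking jitter at rest
    return mean_rest + 3 * noise           # Threshold = mean rest + 3 * std

def crosses_threshold(landmarks, threshold):
    """Trigger check: fires once the averaged pinky y-coordinate passes the
    calibrated threshold (the '>' direction assumes image y grows toward
    the surface; this depends on camera orientation)."""
    return landmarks[PINKY_LANDMARKS, 1].mean() > threshold
```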
With a high-FPS camera and a good GPU, we achieve a median latency of 13 ms. For more details on the latency measurement setup and results, see (link to the other README.md in the latency measurement folder, to be made).
Boost GPU clocks before starting experiments:
sudo nvidia-smi -lgc=3000,3000 && sudo nvidia-smi -lmc 8000,8000
Check GPU clocks:
nvidia-smi -q -d CLOCK
Reset GPU clocks after experiment:
sudo nvidia-smi --reset-memory-clocks && sudo nvidia-smi --reset-gpu-clocks
Run scripts with GPU: Use your system's GPU execution command for all main Python scripts. For NVIDIA Optimus systems:
prime-run python latency_mp.py
Scripts Involved:
- record_flircam.py – camera positioning
- calibration.py – set reference line and calibration distance
- latency_mp.py – latency testing
- log_serial.py – serial logging from Teensy
- join_tables.py – combine latency logs into a CSV
- Connect AUX cable from your computer to the speaker
- Place the microphone sensor close to the speaker membrane
- Connect Teensy pins: GND → Ground, 3V → VCC, A0 → Pin 23 (or whichever analog pin you configure)
- Confirm selected pin matches the Teensy code
- Upload the raw_data_plot code to the Teensy to check values in silent conditions
- Set the threshold in latency.ino comfortably above the silent baseline level
- Test the setup by tapping the microphone and observing the readings
- Use an open-ended AUX wire: Ground terminal → GND on Teensy, Positive terminal → Analog pin (currently Pin 23)
- Repeat the steps for Teensy code and raw_data_plot upload
- Set the threshold in latency.ino comfortably above silent readings
Setting Limits in latency.ino:
- Lower limit: theoretical minimum latency minus a few ms for margin
- Upper limit: theoretical maximum latency plus a few ms
- This prevents spurious readings outside expected bounds
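For reference, the same sanity filter is easy to reproduce when post-processing logged values on the host (the authoritative check lives in latency.ino on the Teensy); the bounds below are placeholders.

```python
# Placeholder bounds in milliseconds: theoretical minimum/maximum latency plus a small margin.
LOWER_MS, UPPER_MS = 5, 60

def is_valid_latency(latency_ms):
    """Reject spurious readings outside the expected latency window."""
    return LOWER_MS <= latency_ms <= UPPER_MS
```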
Current Reference Values:
- Raw analog AUX: threshold = 30
- Microphone + speaker: threshold = 80
Note: These values were determined by observing silent readings with raw_data_plot. Adjust if you notice false positives or missed taps.
- Paste aluminum foil at the edge of a flat surface
- Connect any Teensy GND pin to the foil using an alligator clip/wire
- Connect a wire to the buttonPin (currently Pin 2)
- Attach this wire to the side of your left pinky finger, minimizing obstruction of the back of your hand
- Connect the Teensy to your computer via USB
- Connect the FLIR camera to your computer
- Run camera positioning script:
prime-run python record_flircam.py # or your GPU command
- Adjust camera so foil edge is parallel to the reference line
- Press 'q' to close
Run the calibration script:
prime-run python calibration.py # or your GPU command
- Click twice along the foil edge with maximum precision
- Keep left hand vertical on surface with pinky resting sideways on foil
- The script measures the distance between the average y-coordinate of landmarks 17 to 20 and the reference line for 1 second
- Ensure right hand is not visible during measurement
Run the latency script:
prime-run python latency_mp.py # or your GPU command
Run these tests:
- Hover test: Taps should only trigger when hand is very close to foil (< few mm)
- Rapid taps test: Tap rapidly - false positives should be rare (about 1 in 7 or better)
- Still hand test: Rest hand sideways on surface - no taps should register when stationary
If the tests fail, repeat the calibration and adjust the camera exposure in flircam.py or the room lighting.
- Open PureData and load beep.pd
- Set frequency to 200 Hz
- Go to Audio Settings: select AUX output device, set delay = 3 ms
- Test by pressing button in PureData to confirm sound plays from speaker
Start the measurement:
prime-run python latency_mp.py # or your GPU command
In a separate terminal, start logging:
python log_serial.py
Configuration:
- The script uses config.json with these keys:
  - device: experiment device name
  - baud_rate: serial connection rate (9600 or 115200)
  - method: audio detection method description
  - frequency: audio signal frequency (Hz)
  - threshold: detection threshold value
  - pd_delay: PureData delay (milliseconds)
  - output_method: output method (AUX + speaker-mic, direct AUX, etc.)
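A minimal example of loading such a config, with placeholder values consistent with the settings used in this guide (200 Hz tone, threshold 80, 3 ms PureData delay); the contents of the real config.json may differ.

```python
import json

# Example config.json (placeholder values):
# {
#   "device": "lab-desktop",
#   "baud_rate": 115200,
#   "method": "microphone next to speaker membrane",
#   "frequency": 200,
#   "threshold": 80,
#   "pd_delay": 3,
#   "output_method": "AUX + speaker-mic"
# }

with open("config.json") as f:
    cfg = json.load(f)
print(cfg["device"], cfg["baud_rate"], cfg["threshold"])
```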
Data Collection:
- Re-attach wire to pinky
- Test with a few taps to confirm latencies are logging
- Restart both scripts and begin experiment
- Perform approximately 250 taps to obtain ~200 valid latency samples
prime-run python join_tables.py --tablea path/to/TableA.csv \
--tableb path/to/TableB.csv --out path/to/final.csv --tol_ms 50
Output:
- Two files saved: TableA.csv, TableB.csv
- join_tables.py handles false positives/negatives automatically
- The final CSV contains the total latency and a breakdown of internal latencies for each tap
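For reference, the core of the timestamp matching can be sketched as a nearest-neighbour merge within the --tol_ms window. The column name below is a placeholder, and join_tables.py's actual matching and false-positive handling may differ.

```python
import pandas as pd

def match_within_tolerance(table_a, table_b, tol_ms=50, key="timestamp_ms"):
    """Pair each TableA row with the nearest TableB row whose timestamp lies
    within tol_ms; unmatched rows keep NaN in the right-hand columns."""
    a = pd.read_csv(table_a).sort_values(key)   # merge_asof requires sorted keys
    b = pd.read_csv(table_b).sort_values(key)
    return pd.merge_asof(a, b, on=key, direction="nearest",
                         tolerance=tol_ms, suffixes=("_a", "_b"))
```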
In the latency_mp.py script, set SAVE_FRAMES = True
- Saves the LAST_N_FRAMES frames before tap detection (defaults to 7)
- Wait at least 500 ms between taps (saving 7 frames requires 500 ms)
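A minimal sketch of how the last N frames before a tap can be buffered and written out, assuming OpenCV is used to save images; the actual logic in latency_mp.py may differ.

```python
from collections import deque
import numpy as np
import cv2  # used only to write saved frames to disk

LAST_N_FRAMES = 7
recent = deque(maxlen=LAST_N_FRAMES)   # ring buffer holding the newest N frames

def on_new_frame(frame):
    """Call once per captured frame, before tap detection runs on it."""
    recent.append(frame)

def on_tap(tap_id):
    """Write out the frames that led up to the detected tap for offline inspection."""
    for i, frame in enumerate(recent):
        cv2.imwrite(f"tap{tap_id:03d}_frame{i}.png", frame)

if __name__ == "__main__":
    for _ in range(20):   # stand-in for the capture loop
        on_new_frame(np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8))
    on_tap(0)
```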
- Disconnect the FLIR camera
- Disconnect the Teensy from USB
- Reset GPU clocks to default settings:
sudo nvidia-smi --reset-memory-clocks && sudo nvidia-smi --reset-gpu-clocks
- Codebase: dockerisation
- Codebase: implementing multiprocessing in the main repository
- Optimisation: implementing multiple mediapipe workers
(tbf)