===Models for Video Recognition===
;Basic Video Pipeline
# Extract features
# Learn space-time bag of words
# Train/test BoW classifier
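A minimal sketch of the bag-of-words step, assuming the codebook of visual words has already been learned (e.g. with k-means) and that each video yields a list of local descriptors. The function names and the toy 2-D descriptors are illustrative only.

```python
import math
from collections import Counter

def nearest_codeword(descriptor, codebook):
    """Return the index of the codeword closest to the descriptor (Euclidean)."""
    return min(range(len(codebook)), key=lambda k: math.dist(descriptor, codebook[k]))

def bow_histogram(descriptors, codebook):
    """Quantize each local descriptor and return an L1-normalized word histogram."""
    counts = Counter(nearest_codeword(d, codebook) for d in descriptors)
    total = len(descriptors)
    return [counts[k] / total if total else 0.0 for k in range(len(codebook))]

# Toy codebook with three 2-D "visual words" (assumed to come from k-means).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
# Toy descriptors standing in for the features extracted from one video.
video_descriptors = [(0.1, 0.0), (0.9, 0.1), (1.1, -0.1), (0.0, 0.8)]
histogram = bow_histogram(video_descriptors, codebook)
```

The resulting histogram is the fixed-length video representation that the BoW classifier is trained and tested on.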
;Spatio-temporal Feature Detectors
* Harris3D
* Cuboid
* Hessian
* Dense
;Spatio-temporal Feature Descriptors
* HOG/HOF
* Cuboid
* HOG3D
* ExtendedSURF
;Add trajectories
# Track a keypoint's movement over time.
# Make a ''feature tube'' around the trajectory. Earlier methods used a cube instead of a tube.
# Apply whatever pooling you want (e.g. HOG) inside the tube to get a trajectory descriptor.
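The three steps above can be sketched as follows. This is a toy illustration, not a specific published method: it assumes dense flow fields are given as nested lists of (dx, dy) pairs, uses nearest-pixel lookup, and pools a scalar feature by plain averaging. <code>track_point</code> and <code>pool_along_trajectory</code> are hypothetical names.

```python
def track_point(start, flows):
    """Follow one point through a list of dense flow fields.

    flows[t][row][col] is the (dx, dy) displacement from frame t to frame t + 1.
    Returns the list of (x, y) positions, one per frame.
    """
    x, y = start
    trajectory = [(x, y)]
    for flow in flows:
        height, width = len(flow), len(flow[0])
        # Read the flow at the nearest pixel, clamped to the frame.
        col = min(max(int(round(x)), 0), width - 1)
        row = min(max(int(round(y)), 0), height - 1)
        dx, dy = flow[row][col]
        x, y = x + dx, y + dy
        trajectory.append((x, y))
    return trajectory

def pool_along_trajectory(feature_maps, trajectory, radius=1):
    """Average a per-frame scalar feature in a small window centered on each trajectory point."""
    values = []
    for fmap, (x, y) in zip(feature_maps, trajectory):
        height, width = len(fmap), len(fmap[0])
        cx, cy = int(round(x)), int(round(y))
        for row in range(max(cy - radius, 0), min(cy + radius, height - 1) + 1):
            for col in range(max(cx - radius, 0), min(cx + radius, width - 1) + 1):
                values.append(fmap[row][col])
    return sum(values) / len(values) if values else 0.0

# Toy clip: 3 frames of 5x5 pixels, everything moves one pixel to the right per frame.
size, num_frames = 5, 3
flows = [[[(1, 0) for _ in range(size)] for _ in range(size)]
         for _ in range(num_frames - 1)]
trajectory = track_point((1, 2), flows)

# The feature is 1.0 only at the moving keypoint, 0.0 elsewhere.
feature_maps = [[[1.0 if (col, row) == (1 + t, 2) else 0.0 for col in range(size)]
                 for row in range(size)] for t in range(num_frames)]

tube_value = pool_along_trajectory(feature_maps, trajectory, radius=0)
# A "cube" keeps the same spatial location in every frame.
cube_value = pool_along_trajectory(feature_maps, [(1, 2)] * num_frames, radius=0)
```

In this toy clip the tube stays on the keypoint in every frame, while the cube loses it after the first frame, so the cube-pooled value is diluted.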
* Two-stream ConvNet uses a spatial stream (RGB) and a temporal stream (optical flow).
* Add/Stack Trajectories: Flow should be added on top of the original points.
* Pool along trajectories instead of cubes.
* I3D stacks 8 frames, passes them to a 3D ConvNet, and gets an output.
* Late fusion extracts features per frame and combines them later.
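A small sketch of two of the fusion ideas above, under simplifying assumptions: late fusion is shown as a plain average of per-frame class scores, and the two streams are combined by averaging their softmax outputs. Averaging is only one simple fusion choice; the helper names and the toy scores are made up for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of logits."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def late_fusion(per_frame_scores):
    """Average per-frame class scores into one video-level score vector."""
    num_frames = len(per_frame_scores)
    return [sum(column) / num_frames for column in zip(*per_frame_scores)]

def two_stream_fusion(spatial_logits, temporal_logits):
    """Average the class probabilities of the spatial (RGB) and temporal (flow) streams."""
    spatial = softmax(spatial_logits)
    temporal = softmax(temporal_logits)
    return [(s + t) / 2 for s, t in zip(spatial, temporal)]

# Toy per-frame scores for a 3-frame clip and 2 classes.
video_scores = late_fusion([[0.2, 0.8], [0.6, 0.4], [0.1, 0.9]])

# Toy logits: the RGB stream prefers class 0, the flow stream is undecided.
fused = two_stream_fusion([2.0, 0.0], [0.0, 0.0])
predicted_class = max(range(len(fused)), key=lambda k: fused[k])
```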
===What is an action?===
ActionVLAD
* BoW for actions
* Actions are made up of subactions
** E.g. basketball shoot = dribbling + jump + throw + running + ball
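To make the "BoW for actions" idea concrete, here is a sketch of plain VLAD aggregation with hard assignment: each descriptor adds its residual to the nearest center, where a center loosely plays the role of a subaction. ActionVLAD itself learns the centers end to end with soft assignment, so treat this as a simplified stand-in; the centers and descriptors below are toy values.

```python
import math

def vlad(descriptors, centers):
    """Sum residuals to the nearest center, concatenate per center, then L2-normalize."""
    dim = len(centers[0])
    residuals = [[0.0] * dim for _ in centers]
    for d in descriptors:
        k = min(range(len(centers)), key=lambda i: math.dist(d, centers[i]))
        for j in range(dim):
            residuals[k][j] += d[j] - centers[k][j]
    flat = [value for residual in residuals for value in residual]
    norm = math.sqrt(sum(value * value for value in flat))
    return [value / norm for value in flat] if norm > 0 else flat

# Two toy centers standing in for two "subactions".
centers = [(0.0, 0.0), (10.0, 10.0)]
# Toy descriptors pooled over all frames of one video.
descriptors = [(1.0, 0.0), (0.0, 1.0), (10.0, 13.0)]
encoding = vlad(descriptors, centers)
```

Unlike a BoW histogram, which only counts assignments, the encoding keeps how the descriptors deviate from each center, and its length is (number of centers) x (descriptor dimension).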
Gaussian Temporal Awareness Networks
* Key idea: Not all actions have the same temporal support.
** Depending on frame rate and action speed, actions can take a variable number of frames.
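One way to picture variable temporal support is a Gaussian weighting over frames, where the center marks when the action happens and the width sets how many frames it covers. The sketch below only illustrates that idea; it is not the architecture from the paper, and the function names and toy features are assumptions.

```python
import math

def gaussian_temporal_weights(num_frames, center, sigma):
    """Normalized Gaussian weights over frame indices 0 .. num_frames - 1."""
    raw = [math.exp(-((t - center) ** 2) / (2 * sigma ** 2)) for t in range(num_frames)]
    total = sum(raw)
    return [w / total for w in raw]

def gaussian_pool(frame_features, center, sigma):
    """Weighted average of per-frame feature vectors under a Gaussian temporal window."""
    weights = gaussian_temporal_weights(len(frame_features), center, sigma)
    dim = len(frame_features[0])
    return [sum(w * f[j] for w, f in zip(weights, frame_features)) for j in range(dim)]

# Toy 1-D feature per frame: the feature value equals the frame index.
frame_features = [[float(t)] for t in range(9)]

# A fast action (narrow support) and a slow action (wide support), both centered on frame 4.
short_weights = gaussian_temporal_weights(9, center=4, sigma=1.0)
long_weights = gaussian_temporal_weights(9, center=4, sigma=3.0)
pooled_short = gaussian_pool(frame_features, center=4, sigma=1.0)
```

A small sigma concentrates the weight on a few frames around the center, while a large sigma spreads it over most of the clip.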
Compressed Video Action Recognition
* The idea is to feed P-frames, which are essentially optical flow, directly to the CNN.
R-C3D: Region Convolutional 3D Network for Temporal Activity Detection
* Inspired by the Faster R-CNN architecture
SST: Single-Stream Temporal Action Proposals
* Single-shot proposal network
Action Tubelet Detector for Spatio-Temporal Action Localization
* For every frame, regress how to move the tubelet up or down
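The per-frame regression can be pictured as follows: start from one anchor box repeated over the frames of the tubelet, then shift it in each frame by a predicted offset. This is a simplified sketch that assumes boxes are (x1, y1, x2, y2) tuples and offsets are added directly to the coordinates; the offsets below are hand-written, not network outputs, and the function name is hypothetical.

```python
def apply_tubelet_offsets(anchor_box, per_frame_offsets):
    """Shift one anchor box by a separate (dx1, dy1, dx2, dy2) offset in every frame."""
    x1, y1, x2, y2 = anchor_box
    return [(x1 + dx1, y1 + dy1, x2 + dx2, y2 + dy2)
            for dx1, dy1, dx2, dy2 in per_frame_offsets]

# Anchor box shared by all 3 frames of the tubelet.
anchor_box = (10, 10, 20, 30)
# Toy offsets: the box drifts upward by 2 pixels per frame (y decreases).
per_frame_offsets = [(0, 0, 0, 0), (0, -2, 0, -2), (0, -4, 0, -4)]
tubelet = apply_tubelet_offsets(anchor_box, per_frame_offsets)
```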
Tube Convolutional Neural Network (T-CNN)
===Complementary Approaches===
Pose-based Action Recognition
* Convert a video into pose maps and do classification on the poses
PoTion: Pose MoTion Representation
* Do pose estimation to get joint heatmaps.
* Represent pose position and movement as an image with red=start and green=end.
* Stack this across time.
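A minimal sketch of the red=start, green=end encoding for a single joint, assuming the per-frame heatmaps are already available as nested lists. Each frame's heatmap is weighted by a color that fades linearly from red to green, and the colored frames are summed over time, which is one way to collapse the stack (any normalization is omitted here). <code>potion_two_channel</code> is a hypothetical name.

```python
def potion_two_channel(heatmaps):
    """Collapse T heatmaps of one joint into an H x W x 2 (red, green) image.

    Frame t gets color (1 - t / (T - 1), t / (T - 1)): red at the start, green at the end.
    """
    num_frames = len(heatmaps)
    height, width = len(heatmaps[0]), len(heatmaps[0][0])
    image = [[[0.0, 0.0] for _ in range(width)] for _ in range(height)]
    for t, heatmap in enumerate(heatmaps):
        progress = t / (num_frames - 1) if num_frames > 1 else 0.0
        red, green = 1.0 - progress, progress
        for row in range(height):
            for col in range(width):
                image[row][col][0] += heatmap[row][col] * red
                image[row][col][1] += heatmap[row][col] * green
    return image

# Toy 1x3 heatmaps: the joint moves from the left pixel to the right pixel over 3 frames.
heatmaps = [
    [[1, 0, 0]],
    [[0, 1, 0]],
    [[0, 0, 1]],
]
image = potion_two_channel(heatmaps)
```

In the toy example the left pixel ends up pure red, the right pixel pure green, and the middle pixel an even mix, so a single image encodes both where the joint was and when it was there.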
PA3D: Pose-Action 3D Machine
* Focus on pose to do action recognition
VideoGraph: Recognizing Minutes-Long Activities
==Recognition==
===Context===
==Will be on the exam==