Visual Learning and Recognition



===Models for Video Recognition===
;Basic Video Pipeline
# Extract features
# Learn space-time bag of words
# Train/test BoW classifier
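The three pipeline steps can be sketched roughly as follows. This is a toy illustration, not any specific paper's method: the descriptors are random stand-ins for real spatio-temporal features, the codebook is learned with plain k-means, and a nearest-centroid classifier stands in for the classifier (the notes do not say which one is used). All function names here are hypothetical.

```python
import numpy as np

def sq_dists(a, b):
    """Pairwise squared Euclidean distances between rows of a and rows of b."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)

def kmeans(x, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns a (k, d) codebook of visual words."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)].astype(float)
    for _ in range(iters):
        labels = sq_dists(x, centers).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return centers

def bow_histogram(descriptors, codebook):
    """Quantize each descriptor to its nearest word and return a normalized histogram."""
    words = sq_dists(descriptors, codebook).argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

def fit_centroids(hists, labels):
    """Nearest-centroid 'training': one mean histogram per class (assumed classifier)."""
    classes = np.unique(labels)
    return classes, np.stack([hists[labels == c].mean(axis=0) for c in classes])

def predict(hists, classes, centroids):
    """Assign each video histogram to the class with the closest mean histogram."""
    return classes[sq_dists(hists, centroids).argmin(axis=1)]

# Step 1: "extract features" (here: random 4-D descriptors, 30 per video, 2 classes).
rng = np.random.default_rng(0)
videos = [rng.normal(loc=3.0 * (i % 2), scale=0.1, size=(30, 4)) for i in range(8)]
labels = np.array([i % 2 for i in range(8)])

# Step 2: learn the space-time bag of words (codebook) and encode each video.
codebook = kmeans(np.concatenate(videos), k=4)
hists = np.stack([bow_histogram(v, codebook) for v in videos])

# Step 3: train and test the BoW classifier (tested on the training set for brevity).
classes, centroids = fit_centroids(hists, labels)
pred = predict(hists, classes, centroids)
```

In practice the descriptors would come from one of the detector/descriptor pairs listed below, and the classifier would be evaluated on held-out videos.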
;Spatio-temporal Feature Detectors
* Harris3D
* Cuboid
* Hessian
* Dense
;Spatio-temporal Feature Descriptors
* HOG/HOF
* Cuboid
* HOG3D
* ExtendedSURF
;Add trajectories
# Track a keypoint's movement over time
# Make a ''feature tube'' around the trajectory. Earlier methods used a fixed cube instead of a tube that follows the motion.
# Pool any descriptor you want (e.g. HOG) inside the tube to get a trajectory descriptor.
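A minimal sketch of the trajectory steps above, assuming a dense optical-flow field is already available for every frame pair. The function names, the nearest-pixel flow lookup, and the average pooling are all assumptions made for illustration; real systems typically use interpolated flow and descriptors such as HOG or HOF.

```python
import numpy as np

def track_point(flows, start_xy):
    """Follow one keypoint through time.

    flows: (T-1, H, W, 2) array of per-pixel (dx, dy) displacements.
    Returns a (T, 2) array of (x, y) positions.
    """
    h, w = flows.shape[1:3]
    traj = [np.asarray(start_xy, dtype=float)]
    for flow in flows:
        x, y = traj[-1]
        xi = int(np.clip(np.rint(x), 0, w - 1))
        yi = int(np.clip(np.rint(y), 0, h - 1))
        # Next position = current position + flow sampled at the current position.
        traj.append(traj[-1] + flow[yi, xi])
    return np.stack(traj)

def pool_along_tube(features, traj, radius=1):
    """Average features in a small window that follows the trajectory (the 'tube').

    features: (T, H, W, C) per-frame feature maps.
    """
    t_len, h, w, _ = features.shape
    pooled = []
    for t in range(t_len):
        x, y = traj[t]
        xi = int(np.clip(np.rint(x), radius, w - 1 - radius))
        yi = int(np.clip(np.rint(y), radius, h - 1 - radius))
        patch = features[t, yi - radius:yi + radius + 1, xi - radius:xi + radius + 1]
        pooled.append(patch.mean(axis=(0, 1)))
    return np.mean(pooled, axis=0)

# Toy example: everything moves one pixel to the right per frame.
flows = np.zeros((4, 8, 10, 2))
flows[..., 0] = 1.0
traj = track_point(flows, (2, 3))
features = np.ones((5, 8, 10, 3))
desc = pool_along_tube(features, traj)
```

A cube-based method would keep the window at the starting position for all frames; the tube moves the window with the point.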
* Two-stream ConvNet uses a spatial stream (RGB) and temporal stream (optical flow)
* Add/Stack Trajectories: The flow displacement is added on top of the original point positions, so sampling follows the points as they move.
* Pool Along Trajectories instead of cubes
* I3D stacks 8 frames, passes them through a 3D ConvNet, and produces an output.
* Late fusion extracts features per frame and combines them afterwards.
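The late-fusion and two-stream ideas can be illustrated with class scores alone. The sketch below assumes each frame (or each stream) has already produced a vector of logits, and fuses by averaging softmax probabilities. Averaging is only one possible fusion rule and is an assumption here, as are the function names and the `w_temporal` weight.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def late_fusion(per_frame_logits):
    """(T, num_classes) per-frame logits -> (num_classes,) averaged probabilities."""
    return softmax(per_frame_logits).mean(axis=0)

def two_stream_fusion(spatial_logits, temporal_logits, w_temporal=0.5):
    """Weighted average of the RGB-stream and flow-stream probabilities."""
    return (1.0 - w_temporal) * softmax(spatial_logits) + w_temporal * softmax(temporal_logits)

# Two frames, two classes: the first frame is confident, the second is undecided.
frame_logits = np.array([[2.0, 0.0], [0.0, 0.0]])
fused = late_fusion(frame_logits)

# The flow stream is more confident about class 1 than the RGB stream is about class 0.
spatial = np.array([1.0, 0.0, 0.0])
temporal = np.array([0.0, 3.0, 0.0])
two = two_stream_fusion(spatial, temporal)
```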
===What is an action?===
ActionVLAD
* BoW for actions
* Actions are made up of subactions
** E.g. basketball shoot = dribbling + jump + throw + running + ball
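To make the "BoW for actions" idea concrete, here is a sketch of plain VLAD aggregation with hard assignment: each cluster center plays the role of a sub-action word, and the video descriptor accumulates residuals to those centers. This is a simplification for illustration; it is not the ActionVLAD layer itself, and the function name and toy values are hypothetical.

```python
import numpy as np

def vlad(descriptors, centers):
    """Aggregate (N, d) local descriptors into one (k*d,) L2-normalized vector."""
    k, d = centers.shape
    dist = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    assign = dist.argmin(axis=1)
    v = np.zeros((k, d))
    for j in range(k):
        members = descriptors[assign == j]
        if len(members):
            # Sum of residuals between descriptors and their assigned center.
            v[j] = (members - centers[j]).sum(axis=0)
    v = v.ravel()
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

# Two "sub-action" centers and two local descriptors, one near each center.
centers = np.array([[0.0, 0.0], [10.0, 10.0]])
descriptors = np.array([[1.0, 0.0], [10.0, 12.0]])
video_vec = vlad(descriptors, centers)
```

Unlike a BoW histogram, which only counts assignments, the residuals keep information about how the descriptors differ from each word.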
Gaussian Temporal Awareness Networks
* Key idea: Not all actions have the same temporal support.
** Depending on frame-rate & action speed, actions can take a variable number of frames.
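One way to picture variable temporal support is Gaussian-weighted pooling over frames, where the width `sigma` controls how many frames contribute. The sketch below only illustrates that idea under this assumption; it is not the network from the paper, and the function names are hypothetical.

```python
import numpy as np

def gaussian_weights(num_frames, center, sigma):
    """Normalized Gaussian weights over frame indices 0..num_frames-1."""
    t = np.arange(num_frames)
    w = np.exp(-0.5 * ((t - center) / sigma) ** 2)
    return w / w.sum()

def gaussian_pool(features, center, sigma):
    """Pool (T, C) per-frame features into (C,) with Gaussian temporal weights."""
    return gaussian_weights(len(features), center, sigma) @ features

# A short action (narrow support) and a long action (wide support), both centered at frame 8.
short_w = gaussian_weights(16, center=8, sigma=1.0)
long_w = gaussian_weights(16, center=8, sigma=4.0)
pooled = gaussian_pool(np.ones((16, 5)), center=8, sigma=2.0)
```

A small `sigma` concentrates the weight on a few frames, while a large `sigma` spreads it over most of the clip.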
Compressed Video Action Recognition
* Idea: feed the P-frames, which are essentially optical flow, directly to the CNN.
R-C3D: Region Convolutional 3D Network for Temporal Activity Detection
* Inspired by Faster R-CNN Architecture
SST: Single-Stream Temporal Action Proposals
* Single-shot proposal network
Action Tubelet Detector for Spatio-Temporal Action Localization
* For every frame, regress how to move the tubelet up or down
Tube Convolutional Neural Network (T-CNN)
===Complementary Approaches===
Pose-based Action Recognition
* Convert a video into pose maps and do classifications on poses
PoTion: Pose MoTion Representation
* Do pose estimation to get joint heatmaps.
* Represent pose position & movement as an image with red=start and green=end.
* Stack this across time.
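The colorization step can be sketched as follows, assuming the joint heatmaps are already computed. Each frame's heatmap is multiplied by a (red, green) weight that moves linearly from red at the first frame to green at the last, and the colored maps are then summed over time. Summing is an assumption about how the frames are combined, and the function name is hypothetical.

```python
import numpy as np

def colorize_heatmaps(heatmaps):
    """Time-colorize joint heatmaps.

    heatmaps: (T, J, H, W) array, one heatmap per frame and joint.
    Returns a (J, H, W, 2) array with channels (red, green).
    """
    num_frames = heatmaps.shape[0]
    s = np.linspace(0.0, 1.0, num_frames)   # 0 at the start, 1 at the end
    colors = np.stack([1.0 - s, s], axis=1)  # (T, 2): red fades out, green fades in
    return np.einsum("tjhw,tc->jhwc", heatmaps, colors)

# One joint moving left to right across a 1x3 image over three frames.
heat = np.zeros((3, 1, 1, 3))
heat[0, 0, 0, 0] = 1.0
heat[1, 0, 0, 1] = 1.0
heat[2, 0, 0, 2] = 1.0
potion = colorize_heatmaps(heat)
```

The resulting image shows where the joint started (red), where it ended (green), and a blend of both for the frames in between.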
PA3D: Pose-Action 3D Machine
* Focus on pose to do action recognition
VideoGraph: Recognizing Minutes-Long Activities
==Recognition==
===Context===


==Will be on the exam==