===Dealing with Sparse Data===
====Better Similarity====
====Better Alignment====
E.g. reduce the resolution, apply SIFT-based matching, or warp the images.
 
;SIFT-Flow
Compute dense SIFT descriptors for all regions of the image. 
Then learn a mapping from SIFT vectors to RGB colors; the resulting color images are called ''SIFT flow'' features. 
Regions with similar RGB colors will have similar SIFT feature vectors. 
Then we can learn a transformation <math>T</math> that matches the SIFT flows of two images (i.e. <math>T(F_1) \approx F_2</math>).
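A minimal sketch of computing dense SIFT features and projecting them to RGB for visualization, assuming OpenCV (with SIFT support) and scikit-learn are available; this is an illustration of the idea, not the original SIFT Flow implementation:

<syntaxhighlight lang="python">
# Sketch: dense SIFT features plus an RGB projection for visualization.
# Assumes OpenCV with SIFT support and scikit-learn; not the original SIFT Flow code.
import cv2
import numpy as np
from sklearn.decomposition import PCA

def dense_sift(gray, step=8):
    """Compute SIFT descriptors on a regular grid ("dense SIFT")."""
    sift = cv2.SIFT_create()
    h, w = gray.shape
    grid_y = range(step // 2, h, step)
    grid_x = range(step // 2, w, step)
    keypoints = [cv2.KeyPoint(float(x), float(y), float(step)) for y in grid_y for x in grid_x]
    _, desc = sift.compute(gray, keypoints)
    return desc.reshape(len(grid_y), len(grid_x), 128)

def sift_to_rgb(desc_grid):
    """Project 128-D descriptors down to 3-D and rescale to [0, 255] for display."""
    rows, cols, d = desc_grid.shape
    flat = desc_grid.reshape(-1, d).astype(np.float32)
    rgb = PCA(n_components=3).fit_transform(flat)
    rgb = (rgb - rgb.min(0)) / (rgb.max(0) - rgb.min(0) + 1e-8) * 255
    return rgb.reshape(rows, cols, 3).astype(np.uint8)

gray1 = cv2.imread("image1.jpg", cv2.IMREAD_GRAYSCALE)
gray2 = cv2.imread("image2.jpg", cv2.IMREAD_GRAYSCALE)
F1, F2 = dense_sift(gray1), dense_sift(gray2)  # the features that T should align: T(F1) ~ F2
</syntaxhighlight>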
 
;Non-parametric Scene Parsing (CVPR 2009)
If you have a good scene alignment algorithm, you can simply transfer the segmentation (label) map from an aligned, already-labeled image instead of training a parametric model.
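A toy sketch of the label-transfer idea, assuming a dense correspondence field <code>flow</code> from the query to an already-labeled neighbor image has been computed (e.g. with SIFT Flow); the function name and data layout are illustrative:

<syntaxhighlight lang="python">
# Toy label transfer: copy each pixel's label from its corresponding pixel
# in an aligned, already-labeled neighbor image. Illustrative only.
import numpy as np

def transfer_labels(flow, neighbor_labels):
    """flow[y, x] = (dx, dy) maps query pixel (x, y) to the neighbor image."""
    h, w = neighbor_labels.shape
    labels = np.zeros((h, w), dtype=neighbor_labels.dtype)
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y, x]
            ny = int(np.clip(y + dy, 0, h - 1))
            nx = int(np.clip(x + dx, 0, w - 1))
            labels[y, x] = neighbor_labels[ny, nx]  # copy the aligned neighbor's label
    return labels
</syntaxhighlight>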
 
====Use sub-images (primitives) to match====
Matching sub-images instead of whole images allows a query to borrow matches from multiple images.
 
;Mid-level primitives
Bag of visual words: 
# Take some features (e.g. SIFT) from every image in your dataset.
# Apply clustering (e.g. k-means) to these features to get k clusters. These k clusters are your visual words.
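A small sketch of these two steps, assuming OpenCV SIFT and scikit-learn k-means; the function names and parameters are illustrative:

<syntaxhighlight lang="python">
# Minimal bag-of-visual-words sketch. Assumes OpenCV with SIFT and scikit-learn.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(image_paths, k=100):
    """Cluster SIFT descriptors from all images into k visual words."""
    sift = cv2.SIFT_create()
    all_descriptors = []
    for path in image_paths:
        gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = sift.detectAndCompute(gray, None)
        if desc is not None:
            all_descriptors.append(desc)
    kmeans = KMeans(n_clusters=k, n_init=10).fit(np.vstack(all_descriptors))
    return kmeans  # the cluster centers are the visual words

def bow_histogram(gray, kmeans):
    """Represent an image as a normalized histogram over the visual words."""
    sift = cv2.SIFT_create()
    _, desc = sift.detectAndCompute(gray, None)
    words = kmeans.predict(desc)
    hist, _ = np.histogram(words, bins=np.arange(kmeans.n_clusters + 1))
    return hist / hist.sum()
</syntaxhighlight>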
 
The challenge with matching patches is deciding which patches to match. 
Ideally, we want patches which are both representative and discriminative. 
Representative means the patch is found often in the target image set, i.e. it covers the target concept. 
Discriminative means the patch is not found in non-target image sets, i.e. it is distinct from other concepts. 
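A hypothetical way to score a candidate patch on these two criteria, in the spirit of mid-level discriminative patch mining; the descriptor distances and threshold are illustrative assumptions, not a specific paper's procedure:

<syntaxhighlight lang="python">
# Hypothetical scoring of candidate patches as "representative and discriminative".
import numpy as np

def patch_score(patch_desc, target_descs, other_descs, threshold=0.5):
    """patch_desc: (d,) descriptor; target_descs / other_descs: (n, d) descriptor sets."""
    d_target = np.linalg.norm(target_descs - patch_desc, axis=1)
    d_other = np.linalg.norm(other_descs - patch_desc, axis=1)
    representative = np.mean(d_target < threshold)        # how often it fires on the target set
    discriminative = 1.0 - np.mean(d_other < threshold)   # how rarely it fires on other sets
    return representative * discriminative
</syntaxhighlight>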
 
====Understanding simple stuff first====
E.g. from a video, find one frame in which the pose is easy to detect, then use optical-flow methods to propagate that pose to adjacent frames.
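A minimal sketch of propagating detections from an "easy" frame to the next frame with OpenCV's pyramidal Lucas-Kanade tracker; <code>easy_keypoints</code> stands in for whatever pose keypoints were detected on the easy frame:

<syntaxhighlight lang="python">
# Sketch: track pose keypoints from an easy frame into the next frame via optical flow.
import cv2
import numpy as np

def propagate_pose(easy_frame, next_frame, easy_keypoints):
    """easy_keypoints: (n, 2) float32 array of (x, y) joint locations on easy_frame."""
    gray0 = cv2.cvtColor(easy_frame, cv2.COLOR_BGR2GRAY)
    gray1 = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
    pts0 = easy_keypoints.reshape(-1, 1, 2).astype(np.float32)
    pts1, status, _ = cv2.calcOpticalFlowPyrLK(gray0, gray1, pts0, None)
    return pts1.reshape(-1, 2), status.ravel()  # tracked joints and per-joint success flags
</syntaxhighlight>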
 
====Looking beyond the k-NN method====
Use the data itself to make connections between examples, rather than relying only on nearest-neighbor lookups. 
 
;Visual Memex Knowledge Graph
(Malisiewicz and Efros 2009) 
Build a visual knowledge graph of entities, where edges are either context edges or similarity edges. 
Embed a new image into the graph and copy information over from its graph neighbors.
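A toy illustration of such a graph using networkx; the node attributes, edge kinds, and the label-voting scheme are assumptions for the sketch, not the paper's actual model:

<syntaxhighlight lang="python">
# Toy Memex-style graph with similarity and context edges. Illustrative only.
import networkx as nx

G = nx.Graph()
G.add_node("car_patch_1", category="car")
G.add_node("car_patch_2", category="car")
G.add_node("road_patch_1", category="road")

# Similarity edges connect visually similar entities; context edges connect co-occurring ones.
G.add_edge("car_patch_1", "car_patch_2", kind="similarity", weight=0.9)
G.add_edge("car_patch_1", "road_patch_1", kind="context", weight=0.7)

def transfer_info(G, query_node):
    """Vote for a category using the query's graph neighbors, weighted by edge strength."""
    votes = {}
    for nbr, attrs in G[query_node].items():
        cat = G.nodes[nbr]["category"]
        votes[cat] = votes.get(cat, 0.0) + attrs["weight"]
    return max(votes, key=votes.get)
</syntaxhighlight>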
 
;Manifolds in Vision
These days, we can assume that deep learning features lie on a reasonably well-behaved manifold, so distances in that feature space are meaningful.
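For example, a short sketch of nearest-neighbor retrieval in a deep feature space, assuming torchvision's pretrained ResNet-18 as the feature extractor (any reasonable network would do):

<syntaxhighlight lang="python">
# Sketch: treat deep features as the embedding space for nearest-neighbor retrieval.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()  # drop the classifier; keep the 512-D feature
resnet.eval()

preprocess = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor(),
                        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])

@torch.no_grad()
def embed(path):
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    return resnet(x).squeeze(0)

# Nearest neighbor in deep-feature space (cosine similarity).
query = embed("query.jpg")
gallery = {p: embed(p) for p in ["a.jpg", "b.jpg", "c.jpg"]}
best = max(gallery, key=lambda p: torch.nn.functional.cosine_similarity(query, gallery[p], dim=0).item())
</syntaxhighlight>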


==ConvNets and Architectures==