Visual Learning and Recognition

* 4 Texture Related


===Deformable Part Models (DPM)===
Lecture (Oct 6-8, 2020)

Felzenszwalb et al (2009) <ref name="felzenszwalb2009dpm"></ref>
'''Important: Read this paper'''
* <math>w_{ij}^T \in \mathbb{R}^5</math> is the deformation parameter between parts i and j


The total number of configurations is <math>10^{4N}</math>: in a <math>100 \times 100</math> image each part location <math>p_i</math> can take <math>100 \times 100 = 10^4</math> values, so <math>N</math> parts give <math>(10^4)^N</math> configurations.


The trick is to use dynamic programming and a tree-based model: with a tree (e.g. star) structure, the best placement of each part can be computed independently given the root location (see the sketch below).
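A minimal NumPy sketch of the star-model version of this dynamic program: for every root location, each part's best placement (filter response minus quadratic deformation cost) is maximized independently. All score maps, anchors, and deformation weights below are made up for illustration; the real DPM replaces the brute-force inner maximization with a generalized distance transform.

<syntaxhighlight lang="python">
import numpy as np

def best_part_placement(part_score, anchor, w_def):
    """For every candidate root location, return the best score of one part:
    max over part locations of (filter response - quadratic deformation cost).
    Brute force O((HW)^2); DPM uses a generalized distance transform instead."""
    H, W = part_score.shape
    ys, xs = np.mgrid[0:H, 0:W]                      # all candidate part locations
    best = np.full((H, W), -np.inf)
    for ry in range(H):
        for rx in range(W):
            dy = ys - (ry + anchor[0])               # displacement from the ideal anchor
            dx = xs - (rx + anchor[1])
            cost = w_def[0] * dy ** 2 + w_def[1] * dx ** 2
            best[ry, rx] = np.max(part_score - cost)
    return best

# Toy example: a root filter plus two parts on a small score grid.
rng = np.random.default_rng(0)
H, W = 20, 20
root_score = rng.normal(size=(H, W))
parts = [(rng.normal(size=(H, W)), (2, -3), (0.1, 0.1)),   # (score map, anchor, deformation weights)
         (rng.normal(size=(H, W)), (-4, 1), (0.1, 0.1))]

total = root_score.copy()
for score, anchor, w_def in parts:
    total += best_part_placement(score, anchor, w_def)    # parts are independent given the root
print("best root location:", np.unravel_index(np.argmax(total), total.shape))
</syntaxhighlight>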


===TextonBoost===
Shotton ''et al.'' <ref name="shotton2009texton"></ref>
Incorporates texture-layout, color, location, and edge features in a conditional random field, jointly considering appearance, shape, and context.
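Schematically, the per-pixel and pairwise terms combine into a CRF energy of roughly this form (my paraphrase of the general structure, not the exact notation from the paper):

<math>
E(\mathbf{c} \mid \mathbf{x}) = \sum_i \big[ \psi_i(c_i, \mathbf{x}) + \pi_i(c_i, \mathbf{x}) + \lambda_i(c_i) \big] + \sum_{(i,j)} \phi_{ij}(c_i, c_j, \mathbf{x})
</math>

where <math>\psi</math> is the boosted texture-layout potential, <math>\pi</math> the color potential, <math>\lambda</math> the location prior, and <math>\phi</math> the contrast-sensitive edge (pairwise) potential.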


==Semantic Segmentation==


==Region-based Object Detection Systems==
 
Lecture (Oct 15, 2020)
===R-CNN===
;R-CNN at test time


===Approaches for Pose Estimation===
Lecture (Oct 27, 2020)
* Top-down approaches
** Do person detection then do pose estimation.


Primitives
* Depth - not normalized (which makes it hard to use), has discontinuities, and does not represent objects
* Surface normals - can be computed from the gradient of depth


===Scene Intrinsics===
Shape carving:
* Assume everything is made of cuboids and remove cubes.
Without segmentation masks:
* Train an autoencoder on 3D voxel shapes.
* Train an encoder for rendered chairs.
* At test time, images go through the rendered-image encoder and out the 3D voxel decoder (see the sketch below).
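A schematic PyTorch sketch of that setup (all architectures and sizes are illustrative placeholders, not the networks from the lecture): a voxel autoencoder plus an image encoder trained into the same latent space, so that at test time an image can be decoded into voxels.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

latent = 128
voxel_enc = nn.Sequential(nn.Flatten(), nn.Linear(32 ** 3, 512), nn.ReLU(), nn.Linear(512, latent))
voxel_dec = nn.Sequential(nn.Linear(latent, 512), nn.ReLU(), nn.Linear(512, 32 ** 3), nn.Sigmoid())
image_enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 512), nn.ReLU(), nn.Linear(512, latent))

voxels = torch.rand(4, 1, 32, 32, 32)   # toy occupancy grids
images = torch.rand(4, 3, 64, 64)       # toy rendered views of the same objects

# Training signals: reconstruct voxels, and pull the image embedding toward the voxel embedding.
recon_loss = ((voxel_dec(voxel_enc(voxels)) - voxels.flatten(1)) ** 2).mean()
embed_loss = ((image_enc(images) - voxel_enc(voxels).detach()) ** 2).mean()

# Test-time path: image -> image encoder -> voxel decoder.
pred_voxels = voxel_dec(image_enc(images)).reshape(-1, 32, 32, 32)
print(pred_voxels.shape)  # torch.Size([4, 32, 32, 32])
</syntaxhighlight>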
==GANs and VAEs==
===Pixel-RNN/CNN===
* Fully-visible belief network
* Explicit density model:
** Each pixel depends on all previous pixels
** <math>P_{\theta}(x) = \prod_{i=1}^{n} P_{\theta}(x_i | x_1, ..., x_{i-1})</math>
** You need to define what counts as ''previous pixels'' (e.g. all pixels above and to the left); see the masked-convolution sketch below.
* Then maximize likelihood of training data
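A minimal PyTorch sketch of how the ''previous pixels'' constraint can be enforced with masked convolutions, as in PixelCNN (the layer sizes here are arbitrary):

<syntaxhighlight lang="python">
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedConv2d(nn.Conv2d):
    """Convolution whose receptive field only covers pixels above and to the left.
    Mask type 'A' also hides the center pixel (used for the first layer);
    type 'B' keeps it (used for later layers)."""
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        _, _, kH, kW = self.weight.shape
        mask = torch.ones(kH, kW)
        mask[kH // 2, kW // 2 + (mask_type == 'B'):] = 0   # center row: right of center (and center for 'A')
        mask[kH // 2 + 1:, :] = 0                          # all rows below the center
        self.register_buffer("mask", mask[None, None])

    def forward(self, x):
        return F.conv2d(x, self.weight * self.mask, self.bias,
                        self.stride, self.padding, self.dilation, self.groups)

conv = MaskedConv2d('A', in_channels=1, out_channels=16, kernel_size=7, padding=3)
print(conv(torch.randn(1, 1, 28, 28)).shape)  # torch.Size([1, 16, 28, 28])
</syntaxhighlight>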
;Pros:
* Can explicitly compute P(x)
* Explicit P(x) gives good evaluation metric
;Cons:
* Sequence generation is slow
* Optimizing P(x) is hard.
Types of ''previous pixels'' connections:
* PixelCNN uses masked convolutions and has a bounded receptive field over previous pixels (fastest)
* Row LSTM has a triangular receptive field (slow)
* Diagonal LSTM
* Diagonal BiLSTM has a full dependency field (slowest)
;Multi-scale PixelRNN
* Takes subsampled pixels as additional input pixels
* Can capture better global information
* Slightly better results
===Generative Adversarial Networks (GANs)===
* Generator generates images
* Discriminator classifies real or fake
* Loss: <math>\min_{G} \max_{D} E_x[\log D(x)] + E_z[\log(1-D(G(z)))]</math>
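A minimal PyTorch sketch of the alternating optimization of this objective; the MLP generator/discriminator and hyperparameters are placeholders, and the generator update uses the common non-saturating form (maximize <math>\log D(G(z))</math>) rather than the min-max form literally:

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

z_dim, x_dim = 64, 784
G = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, x_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(x_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(x_real):
    b = x_real.size(0)
    # Discriminator step: push D(x_real) -> 1 and D(G(z)) -> 0.
    x_fake = G(torch.randn(b, z_dim)).detach()
    loss_d = bce(D(x_real), torch.ones(b, 1)) + bce(D(x_fake), torch.zeros(b, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Generator step: push D(G(z)) -> 1.
    loss_g = bce(D(G(torch.randn(b, z_dim))), torch.ones(b, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()

print(train_step(torch.randn(8, x_dim)))  # toy batch standing in for real images
</syntaxhighlight>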
;Image-to-image Conditional GANS
* Add an image encoder which outputs z
;pix2pix
* Add L1 loss to the loss function
* UNet generator
* PatchGAN discriminator
** PatchGAN outputs N×N real/fake predictions, one per patch (i.e. a limited receptive field)
* Requires paired samples
;CycleGAN
* Unpaired image-to-image translation
* Cycle-consistency loss
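The cycle-consistency term has the standard form from the CycleGAN paper, with generators <math>G: X \to Y</math> and <math>F: Y \to X</math>, added on top of the adversarial losses:

<math>
\mathcal{L}_{cyc}(G, F) = E_{x}[\Vert F(G(x)) - x \Vert_1] + E_{y}[\Vert G(F(y)) - y \Vert_1]
</math>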
;BicycleGAN
* First learn to reconstruct images with a nice latent code representation in between (cVAE-GAN)
* The main difference is that we have a many-to-many mapping (multi-modal image-to-image) between the two domains.
;MUNIT
* Multimodal UNsupervised Image-to-image Translation
* Maps images from each domain into a shared content space and domain-specific style space.
===Training Problems with GANs===
* Instability
* Difficult to keep generator and discriminator in sync.
** Discriminator cannot be too good or too bad. Same with generator.
** Tricks: LR scheduling, keep discriminator small, update generator more frequently.
* Mode collapse
Mode collapse happens when the generator only covers some modes of the data distribution, so many latent codes map to very similar outputs.
;DCGAN architecture guidelines
* Use strided conv instead of pooling for discriminator.
* Use batchnorm in generator and discriminator.
* Remove FC hidden layers.
* Use ReLU for hidden layers and tanh for the output layer of the generator.
* Use LeakyReLU in the discriminator.
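An illustrative generator following these guidelines (strided transposed convolutions, batchnorm, no fully-connected hidden layers, ReLU inside, tanh output); the channel counts and 64×64 output size are assumptions roughly mirroring the DCGAN paper:

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

G = nn.Sequential(
    nn.ConvTranspose2d(100, 512, 4, 1, 0, bias=False), nn.BatchNorm2d(512), nn.ReLU(True),
    nn.ConvTranspose2d(512, 256, 4, 2, 1, bias=False), nn.BatchNorm2d(256), nn.ReLU(True),
    nn.ConvTranspose2d(256, 128, 4, 2, 1, bias=False), nn.BatchNorm2d(128), nn.ReLU(True),
    nn.ConvTranspose2d(128, 64, 4, 2, 1, bias=False),  nn.BatchNorm2d(64),  nn.ReLU(True),
    nn.ConvTranspose2d(64, 3, 4, 2, 1, bias=False),    nn.Tanh(),           # image in [-1, 1]
)
z = torch.randn(16, 100, 1, 1)   # latent codes treated as 1x1 feature maps
print(G(z).shape)                # torch.Size([16, 3, 64, 64])
</syntaxhighlight>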
LSGAN, WGAN have tricks to mitigate mode collapse.
===Evaluation of GANs===
* Turing test (User study)
* Inception score
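For reference, the Inception score evaluates samples with a pretrained classifier's predictive distribution <math>p(y|x)</math>:

<math>
IS = \exp\Big( E_{x \sim p_G}\big[ D_{KL}( p(y|x) \Vert p(y) ) \big] \Big)
</math>

A high score requires confident per-sample predictions (low-entropy <math>p(y|x)</math>) and diverse classes overall (high-entropy marginal <math>p(y)</math>).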
===Variational Auto-encoders (VAEs)===
;Training a VAE
* Data likelihood: <math>P(x) = \int P(x|z) P(z) dz</math>
* Approximate with samples of z during training: <math>P(x) \approx \frac{1}{n} \sum_{i=1}^{n} P(x | z_i)</math>
* This is impractical because for most randomly drawn z, <math>P(x|z)</math> is nearly zero, so the estimate needs far too many samples.
Instead, assume we can learn a distribution <math>Q(z|x)</math> such that sampling <math>z \sim Q(z|x)</math> tends to give <math>P(x|z) > 0</math>. 
How do we relate <math>P(x)</math> to <math>E_{z \sim Q(z|x)}[\log P(x|z)]</math>? 
<math>
\begin{aligned}
D_{KL}[Q(z|x) \Vert P(z|x)] &= E_{z \sim Q}[\log Q(z|x) - \log P(z)] - E_{z \sim Q}[\log P(x|z)] + \log P(x)\\
&= D_{KL}[Q(z|x) \Vert P(z)] - E_{z \sim Q}[\log P(x|z)] + \log P(x)
\end{aligned}
</math> 
Rearranging we get: 
<math>
\begin{aligned}
&\log P(x) - D_{KL}[Q(z|x) \Vert P(z|x)] = E_{z \sim Q}[\log P(x|z)] - D_{KL}[Q(z|x) \Vert P(z)]\\
\implies &\log P(x) \geq E_{z \sim Q(z|x)}[\log P(x|z)] - D_{KL}[Q(z|x) \Vert P(z)]
\end{aligned}
</math> 
This is known as the variational lower bound or ''ELBO''.
* We first have the encoder output a mean <math>\mu_{z|x}</math> and a diagonal covariance matrix <math>\Sigma_{z|x}</math>. 
* For ELBO we want to optimize <math>E_{z \sim Q(z|x)}[\log P(x|z)] - D_{KL}[Q(z|x) \Vert P(z)]</math>. 
* Our first loss is <math>D_{KL}(N(\mu_{z|x}, \Sigma_{z|x}) \Vert N(0, I))</math>, which has a closed form (given after this list). 
* We sample z from <math>N(\mu_{z|x}, \Sigma_{z|x})</math> and pass it to the decoder which outputs <math>\mu_{x|z}, \Sigma_{x|z}</math>. 
* Sample <math>\hat{x}</math> from the distribution and have reconstruction loss <math>\Vert x - \hat{x} \Vert^2</math>. 
* Most blog posts will forget to sample from <math>P(x|z)</math>.
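The KL loss above has a standard closed form when <math>Q(z|x) = N(\mu_{z|x}, \operatorname{diag}(\sigma^2))</math> and the prior is <math>N(0, I)</math>:

<math>
D_{KL}\big(N(\mu, \operatorname{diag}(\sigma^2)) \Vert N(0, I)\big) = \frac{1}{2} \sum_j \left( \mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 \right)
</math>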
;Modeling P(x|z)
Let f(z) be the network output.
* Assume <math>P(x|z)</math> is iid Gaussian.
* <math>\hat{x} = f(z) + \eta</math> where <math>\eta \sim N(0,1)</math>
* Simplifies to an L2 loss <math>\Vert x - f(z) \Vert^2</math>
Importance weighted VAE uses N samples for the loss.
;Reparameterization trick
To sample from the latent space, you do <math>z = \mu + \sigma \varepsilon</math> where <math>\varepsilon \sim N(0,1)</math>. 
This way, you can backprop through the sampling step.
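Putting the pieces together, a minimal PyTorch sketch of the negative ELBO with the reparameterization trick; the MLP encoder/decoder are placeholders, and the decoder is a unit-variance Gaussian so the reconstruction term reduces to an L2 loss:

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

x_dim, z_dim = 784, 20
enc = nn.Sequential(nn.Linear(x_dim, 400), nn.ReLU(), nn.Linear(400, 2 * z_dim))  # -> [mu, log sigma^2]
dec = nn.Sequential(nn.Linear(z_dim, 400), nn.ReLU(), nn.Linear(400, x_dim))

def neg_elbo(x):
    mu, logvar = enc(x).chunk(2, dim=1)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)        # reparameterization: z = mu + sigma * eps
    x_hat = dec(z)                                                  # mean of P(x|z)
    recon = ((x - x_hat) ** 2).sum(dim=1)                           # -log P(x|z) up to constants (unit-variance Gaussian)
    kl = 0.5 * (mu ** 2 + logvar.exp() - logvar - 1).sum(dim=1)     # closed-form KL to N(0, I)
    return (recon + kl).mean()

print(neg_elbo(torch.randn(8, x_dim)))  # toy batch
</syntaxhighlight>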
;Conditional VAE
Just input the condition into the encoder and decoder.
;Pros
* Principled approach to generative models
* Allows inference of <math>q(z|x)</math> which can be used as a feature representation
;Cons
* Only maximizes a lower bound on the likelihood (the ELBO), not the likelihood itself.
* Samples are blurrier than GANs
Why are samples blurry?
* Often the outputs shown are not blurry but noisy: people display the mean/expected value <math>\mu_{x|z}</math> instead of an actual sample from <math>P(x|z)</math>.
* The L2 reconstruction loss encourages averaging over plausible outputs.
===Flow-based Models===
Flow-based models learn an invertible mapping between data <math>x</math> and latent <math>z</math> and minimize the exact negative log-likelihood.
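The exact likelihood comes from the change-of-variables formula (standard result, added here for reference), where <math>f</math> is the invertible map from <math>x</math> to <math>z</math>:

<math>
\log p_X(x) = \log p_Z(f(x)) + \log \left| \det \frac{\partial f(x)}{\partial x} \right|
</math>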
==Attribute-based Representation==
;Motivation
Typically in recognition, we only predict the class of the image. 
From the category we can guess some attributes, but the category alone provides only limited information. 
The network also cannot make predictions for unseen new classes. 
Handling such failures gracefully is referred to as ''graceful degradation''.
;Goal
Learn intermediate structure with object categories.
;Should we care about attributes in DL?
;Why are attributes not simply supervised recognition?
;Benefits
* Dealing with inevitable failure.
* We can infer things about unseen categories.
* We can make comparison between objects or categories.
;Datasets
* a-Pascal
* a-Yahoo
* CORE
* COCO Attributes
Deep networks should be able to learn attributes implicitly. 
However, you don't know whether they have actually been learned.
==Extra Topics==
===Fine-grained Recognition===
===Few-shot Recognition===
* Metric learning methods
* Meta-learning methods
* Data Augmentation Methods
* Semantics
===Zero-shot Recognition===
The goal is to train a classifier without having seen a single labeled example of the target classes. 
The information comes from external knowledge, e.g. a knowledge graph or word embeddings.
===Beyond Labelled Datasets===
* Semi-supervised: We have both labelled and unlabeled training samples.
* Weakly-supervised: The labels are weak, noisy, and not necessarily for the task we want.
* Learning from the Web: Download data from the internet


==Will be on the exam==
* Softmax, sigmoid, cross entropy
* RCNN vs Fast-RCNN vs Faster-RCNN
* DPM
* Selective search vs RPN
* ELBO
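As a quick reference for the first topic (standard definitions, not taken from the lecture slides):

<math>
\operatorname{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}, \qquad \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad CE(y, \hat{y}) = -\sum_i y_i \log \hat{y}_i
</math>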
Final exam:
* Friday Dec 4, 2020
* 4pm-6pm on Gradescope
* Will have multiple choice, fill-in-the-blank, question answering, open-ended
** Generally simple questions; either you know or you don't
* No practice exams
* Only need to know major names (RCNN, Fast(er)-RCNN)
* Only covers lecture material.
* 1 letter page of open notes, both sides allowed (honor system)
Homework 2
* Released Nov 30, 2020
* Take SLIC superpixels, extract deep features, and classify them (see the sketch after this list).
* There will be 2 bonus-credit options; you must pick one:
** Use features from multiple layers (multi-scale)
** Use multiple-levels of SLIC in input (SLIC feature pyramid)
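A hypothetical outline of that pipeline using scikit-image's SLIC; here mean RGB per superpixel stands in for the deep features the assignment actually asks for:

<syntaxhighlight lang="python">
import numpy as np
from skimage import data
from skimage.segmentation import slic

image = data.astronaut()                          # placeholder image
segments = slic(image, n_segments=200, compactness=10)

# Pool one feature vector per superpixel (mean color as a stand-in for deep features).
features = np.stack([image[segments == s].mean(axis=0) for s in np.unique(segments)])
print(features.shape)                             # (num_superpixels, 3)

# Each row of `features` would then be fed to a per-superpixel classifier.
</syntaxhighlight>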
Final Project
* Presentations Dec 10
* Recorded videos with presentations
* Final reports Dec 18
[https://docs.google.com/document/d/1BKmpBWBWuEEywDyBw9CsHgOPB6DH7I0oKEs8zXS7XQw/edit?usp=sharing My Exam Cheat Sheet]


==Project Notes==
* Challenges
* What methods worked and didn't work.


==References==
<ref name="torralba2008tinyimages">Antonio Torralba, Rob Fergus and William T. Freeman (2008). 80 million tiny images: a large dataset for
<ref name="torralba2008tinyimages">Antonio Torralba, Rob Fergus and William T. Freeman (2008). 80 million tiny images: a large dataset for
non-parametric object and scene recognition (PAMI 2008) [https://people.csail.mit.edu/torralba/publications/80millionImages.pdf Link]</ref>
non-parametric object and scene recognition (PAMI 2008) [https://people.csail.mit.edu/torralba/publications/80millionImages.pdf Link]</ref>
<ref name="standing1973learning">Lionel Standing (1973). Learning 10000 pictures. ''Journal
<ref name="standing1973learning">Lionel Standing (1973). Learning 10000 pictures. ''Journal Quarterly Journal of Experimental Psychology'' [https://www.tandfonline.com/doi/abs/10.1080/14640747308400340 Link]</ref>
Quarterly Journal of Experimental Psychology'' [https://www.tandfonline.com/doi/abs/10.1080/14640747308400340 Link]</ref>
<ref name="brady2008visual">Timothy F. Brady, Talia Konkle, George A. Alvarez, and Aude Oliva (2008). Visual long-term memory has a massive storage capacity for object details. [http://olivalab.mit.edu/MM/pdfs/BradyKonkleAlvarezOliva2008.pdf Link].</ref>
<ref name="brady2008visual">Timothy F. Brady, Talia Konkle, George A. Alvarez, and Aude Oliva (2008). Visual long-term memory has a massive storage capacity for object details. [http://olivalab.mit.edu/MM/pdfs/BradyKonkleAlvarezOliva2008.pdf Link].</ref>
<ref name="torralba2011unbiased>Antonio Torralba, Alexei A. Efros (2011). Unbiased Look at Dataset Bias (CVPR 2011) [https://people.csail.mit.edu/torralba/publications/datasets_cvpr11.pdf Link]</ref>
<ref name="torralba2011unbiased>Antonio Torralba, Alexei A. Efros (2011). Unbiased Look at Dataset Bias (CVPR 2011) [https://people.csail.mit.edu/torralba/publications/datasets_cvpr11.pdf Link]</ref>