Awesome-Video-Generation
A curated list of awesome work (currently 257 papers) on video generation and video representation learning, and related topics (such as RL). Feel free to contribute or email me if I've missed your paper off the list : ]
They are ordered by year (new to old). I provide a link to the paper as well as to the GitHub repo where available.
2020
Disentangling multiple features in video sequences using Gaussian processes in variational autoencoders. Bhagat, Uppal, Yin, Lim https://arxiv.org/abs/2001.02408
Generative adversarial networks for spatio-temporal data: a survey. Gao, Xue, Shao, Zhao, Qin, Prabowo, Rahaman, Salim https://arxiv.org/pdf/2008.08903.pdf
Deep state-space generative model for correlated time-to-event predictions. Xue, Zhou, Du, Dai, Xu, Zhang, Cui https://dl.acm.org/doi/abs/10.1145/3394486.3403206
Toward discriminating and synthesizing motion traces using deep probabilistic generative models. Zhou, Liu, Zhang, Trajcevski https://ieeexplore.ieee.org/abstract/document/9165954/
Sample-efficient robot motion learning using Gaussian process latent variable models. Delgado-Guerrero, Colome, Torras http://www.iri.upc.edu/files/scidoc/2320-Sample-efficient-robot-motion-learning-using-Gaussian-process-latent-variable-models.pdf
Sequence prediction using spectral RNNs. Wolter, Gall, Yao https://www.researchgate.net/profile/Moritz_Wolter2/publication/329705630_Sequence_Prediction_using_Spectral_RNNs/links/5f36b9d892851cd302f44a57/Sequence-Prediction-using-Spectral-RNNs.pdf
Self-supervised video representation learning by pace prediction. Wang, Jiao, Liu https://arxiv.org/pdf/2008.05861.pdf
RhyRNN: Rhythmic RNN for recognizing events in long and complex videos. Yu, Li, Li http://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123550137.pdf
4D forecasting: sequential forecasting of 100,000 points. Weng, Wang, Levine, Kitani, Rhinehart http://www.xinshuoweng.com/papers/SPF2_eccvw/camera_ready.pdf
Multimodal deep generative models for trajectory prediction: a conditional variational autoencoder approach. Ivanovic, Leung, Schmerling, Pavone https://arxiv.org/pdf/2008.03880.pdf
Memory-augmented dense predictive coding for video representation learning. Han, Xie, Zisserman https://arxiv.org/pdf/2008.01065.pdf
SeCo: exploring sequence supervision for unsupervised representation learning. Yao, Zhang, Qiu, Pan, Mei https://arxiv.org/pdf/2008.00975.pdf
PDE-driven spatiotemporal disentanglement. Dona, Franceschi, Lamprier, Gallinari https://arxiv.org/pdf/2008.01352.pdf
Dynamics generalization via information bottleneck in deep reinforcement learning. Lu, Lee, Abbeel, Tiomkin https://arxiv.org/pdf/2008.00614.pdf
Latent space roadmap for visual action planning. Lippi, Poklukar, Welle, Varava, Yin, Marino, Kragic https://rss2020vlrrm.github.io/papers/3_CameraReadySubmission_RSS_workshop_latent_space_roadmap.pdf
Weakly-supervised learning of human dynamics. Zell, Rosenhahn, Wandt https://arxiv.org/pdf/2007.08969.pdf
Deep variational Luenberger-type observer for stochastic video prediction. Wang, Zhou, Yan, Yao, Liu, Ma, Lu https://arxiv.org/pdf/2003.00835.pdf
NewtonianVAE: proportional control and goal identification from pixels via physical latent spaces. Jaques, Burke, Hospedales https://arxiv.org/pdf/2006.01959.pdf
Constrained variational autoencoder for improving EEG based speech recognition systems. Krishna, Tran, Carnahan, Tewfik https://arxiv.org/pdf/2006.02902.pdf
Latent video transformer. Rakhimov, Volkhonskiy https://arxiv.org/pdf/2006.10704.pdf
Beyond exploding and vanishing gradients: analysing RNN training using attractors and smoothness. Ribeiro, Tiels, Aguirre, Schon http://proceedings.mlr.press/v108/ribeiro20a/ribeiro20a.pdf
Towards recurrent autoregressive flow models. Mern, Morales, Kochenderfer https://arxiv.org/pdf/2006.10096.pdf
Learning to combine top-down and bottom-up signals in recurrent neural networks with attention over modules. Mittal, Lamb, Goyal, Voleti et al. https://www.cs.colorado.edu/~mozer/Research/Selected%20Publications/reprints/Mittaletal2020.pdf
Unmasking the inductive biases of unsupervised object representations for video sequences. Weis, Chitta, Sharma et al. https://arxiv.org/pdf/2006.07034.pdf
G3AN: disentangling appearance and motion for video generation. Wang, Bilinski, Bremond, Dantcheva http://openaccess.thecvf.com/content_CVPR_2020/papers/Wang_G3AN_Disentangling_Appearance_and_Motion_for_Video_Generation_CVPR_2020_paper.pdf
Learning dynamic relationships for 3D human motion prediction. Cui, Sun, Yang http://openaccess.thecvf.com/content_CVPR_2020/papers/Cui_Learning_Dynamic_Relationships_for_3D_Human_Motion_Prediction_CVPR_2020_paper.pdf
Joint training of variational auto-encoder and latent energy-based model. Han, Nijkamp, Zhou, Pang, Zhu, Wu http://openaccess.thecvf.com/content_CVPR_2020/papers/Han_Joint_Training_of_Variational_Auto-Encoder_and_Latent_Energy-Based_Model_CVPR_2020_paper.pdf
Learning invariant representations for reinforcement learning without reconstruction. Zhang, McAllister, Calandra, Gal, Levine https://arxiv.org/pdf/2006.10742.pdf
Variational inference for sequential data with future likelihood estimates. Kim, Jang, Yang, Kim http://ailab.kaist.ac.kr/papers/pdfs/KJYK2020.pdf
Video prediction via example guidance. Xu, Xu, Ni, Yang, Darrell https://arxiv.org/pdf/2007.01738.pdf
Hierarchical patch VAE-GAN: generating diverse videos from a single sample. Gur, Benaim, Wolf https://arxiv.org/pdf/2006.12226.pdf
Dynamic facial expression generation on Hilbert Hypersphere with conditional Wasserstein Generative adversarial nets. Otberdout, Daoudi, Kacem, Ballihi, Berretti https://arxiv.org/abs/1907.10087
HAF-SVG: hierarchical stochastic video generation with aligned features. Lin, Yuan, Li https://www.ijcai.org/Proceedings/2020/0138.pdf
Improving generative imagination in object-centric world models. Lin, Wu, Peri, Fu, Jiang, Ahn https://proceedings.icml.cc/static/paper_files/icml/2020/4995-Paper.pdf
Deep generative video compression with temporal autoregressive transforms. Yang, Yang, Marino, Yang, Mandt https://joelouismarino.github.io/files/papers/2020/seq_flows_compression/seq_flows_compression.pdf
Spatially structured recurrent modules. Rahaman, Goyal, Gondal, Wuthrich, Bauer, Sharma, Bengio, Scholkopf https://arxiv.org/pdf/2007.06533.pdf
Unsupervised object-centric video generation and decomposition in 3D. Henderson, Lampert https://arxiv.org/pdf/2007.06705.pdf
Planning from images with deep latent gaussian process dynamics. Bosch, Achterhold, Leal-Taixe, Stuckler https://arxiv.org/pdf/2005.03770.pdf
Planning to explore via self-supervised world models. Sekar, Rybkin, Daniilidis, Abbeel, Hafner, Pathak https://arxiv.org/pdf/2005.05960.pdf
Mutual information maximization for robust plannable representations. Ding, Clavera, Abbeel https://arxiv.org/pdf/2005.08114.pdf
Supervised contrastive learning. Khosla, Teterwak, Wang, Sarna https://arxiv.org/pdf/2004.11362.pdf
Blind source extraction based on multi-channel variational autoencoder and x-vector-based speaker selection trained with data augmentation. Gu, Liao, Lu https://arxiv.org/pdf/2005.07976.pdf
BiERU: bidirectional emotional recurrent unit for conversational sentiment analysis. Li, Shao, Ji, Cambria https://arxiv.org/pdf/2006.00492.pdf
S3VAE: self-supervised sequential VAE for representation disentanglement and data generation. Zhu, Min, Kadav, Graf https://arxiv.org/pdf/2005.11437.pdf
Probably approximately correct vision-based planning using motion primitives. Veer, Majumdar https://arxiv.org/abs/2002.12852
MoVi: a large multipurpose motion and video dataset. Ghorbani, Mahdaviani, Thaler, Kording, Cook, Blohm, Troje https://arxiv.org/abs/2003.01888
Temporal convolutional attention-based network for sequence modeling. Hao, Wang, Xia, Shen, Zhao https://arxiv.org/abs/2002.12530
Neuroevolution of self-interpretable agents. Tang, Nguyen, Ha https://arxiv.org/abs/2003.08165
Attentional adversarial variational video generation via decomposing motion and content. Talafha, Rekabdar, Ekenna, Mousas https://ieeexplore.ieee.org/document/9031476
Imputer: sequence modelling via imputation and dynamic programming. Chan, Saharia, Hinton, Norouzi, Jaitly https://arxiv.org/abs/2002.08926
Variational conditioning of deep recurrent networks for modeling complex motion dynamics. Buckchash, Raman https://ieeexplore.ieee.org/document/9055015?denied=
Training of deep neural networks for the generation of dynamic movement primitives. Pahic, Ridge, Gams, Morimoto, Ude https://www.sciencedirect.com/science/article/pii/S0893608020301301
PreCNet: next frame video prediction based on predictive coding. Straka, Svoboda, Hoffmann https://arxiv.org/pdf/2004.14878.pdf
Dimensionality reduction of movement primitives in parameter space. Tosatto, Stadtmuller, Peters https://arxiv.org/abs/2003.02634
Disentangling physical dynamics from unknown factors for unsupervised video prediction. Le Guen, Thome https://arxiv.org/abs/2003.01460
A real-robot dataset for assessing transferability of learned dynamics models. Agudelo-Espana, Zadaianchuk, Wenk, Garg, Akpo et al https://www.is.mpg.de/uploads_file/attachment/attachment/589/ICRA20_1157_FI.pdf
Hierarchical decomposition of nonlinear dynamics and control for system identification and policy distillation. Abdulsamad, Peters https://arxiv.org/pdf/2005.01432.pdf
Occlusion resistant learning of intuitive physics from videos. Riochet, Sivic, Laptev, Dupoux https://arxiv.org/pdf/2005.00069.pdf
Scalable learning in latent state sequence models. Aicher https://digital.lib.washington.edu/researchworks/bitstream/handle/1773/45550/Aicher_washington_0250E_21152.pdf?sequence=1
How useful is self-supervised pretraining for visual tasks? Newell, Deng https://arxiv.org/pdf/2003.14323.pdf
q-VAE for disentangled representation learning and latent dynamical systems. Kobayashi https://arxiv.org/pdf/2003.01852.pdf
Variational recurrent models for solving partially observable control tasks. Han, Doya, Tani https://openreview.net/forum?id=r1lL4a4tDB
Stochastic latent residual video prediction. Franceschi, Delasalles, Chen, Lamprier, Gallinari https://arxiv.org/pdf/2002.09219.pdf https://sites.google.com/view/srvp
Disentangled speech embeddings using cross-modal self-supervision. Nagrani, Chung, Albanie, Zisserman https://arxiv.org/abs/2002.08742
TwoStreamVAM: improving motion modeling in video generation. Sun, Xu, Saenko https://arxiv.org/abs/1812.01037
Variational hyper RNN for sequence modeling. Deng, Cao, Chang, Sigal, Mori, Brubaker https://arxiv.org/abs/2002.10501
Exploring spatial-temporal multi-frequency analysis for high-fidelity and temporal-consistency video prediction. Jin, Hu, Tang, Niu, Shi, Han, Li https://arxiv.org/abs/2002.09905
2019
Representing closed transformation paths in encoded network latent space. Connor, Rozell https://arxiv.org/pdf/1912.02644.pdf
Animating arbitrary objects via deep motion transfer. Siarohin, Lathuiliere, Tulyakov, Ricci, Sebe https://arxiv.org/abs/1812.08861
Feedback recurrent autoencoder. Yang, Sautiere, Ryu, Cohen https://arxiv.org/abs/1911.04018
First order motion model for image animation. Siarohin, Lathuiliere, Tulyakov, Ricci, Sebe https://papers.nips.cc/paper/8935-first-order-motion-model-for-image-animation
Point-to-point video generation. Wang, Cheng, Lin, Chen, Sun https://arxiv.org/pdf/1904.02912.pdf
Learning deep controllable and structured representations for image synthesis, structured prediction and beyond. Yan https://deepblue.lib.umich.edu/handle/2027.42/153334
Decoupling feature extraction from policy learning: assessing benefits of state representation learning in goal based robotics. Raffin, Hill, Traore, Lesort, Diaz-Rodriguez, Filliat https://arxiv.org/abs/1901.08651
Task-Conditioned variational autoencoders for learning movement primitives. Noseworthy, Paul, Roy, Park, Roy https://groups.csail.mit.edu/rrg/papers/noseworthy_corl_19.pdf
Spatio-temporal alignments: optimal transport through space and time. Janati, Cuturi, Gramfort https://arxiv.org/pdf/1910.03860.pdf
Action Genome: actions as composition of spatio-temporal scene graphs. Ji, Krishna, Fei-Fei, Niebles https://arxiv.org/pdf/1912.06992.pdf
Video-to-video translation for visual speech synthesis. Doukas, Sharmanska, Zafeiriou https://arxiv.org/pdf/1905.12043.pdf
Predictive coding, variational autoencoders, and biological connections. Marino https://openreview.net/pdf?id=SyeumQYUUH
Single Headed Attention RNN: stop thinking with your head. Merity https://arxiv.org/pdf/1911.11423.pdf
Hamiltonian neural networks. Greydanus, Dzamba, Yosinski https://arxiv.org/pdf/1906.01563.pdf https://github.com/greydanus/hamiltonian-nn
Learning what you can do before doing anything. Rybkin, Pertsch, Derpanis, Daniilidis, Jaegle https://openreview.net/pdf?id=SylPMnR9Ym https://daniilidis-group.github.io/learned_action_spaces
Deep Lagrangian networks: using physics as model prior for deep learning. Lutter, Ritter, Peters https://arxiv.org/pdf/1907.04490.pdf
A general framework for structured learning of mechanical systems. Gupta, Menda, Manchester, Kochenderfer https://arxiv.org/pdf/1902.08705.pdf https://github.com/sisl/machamodlearn
Learning predictive models from observation and interaction. Schmeckpeper, Xie, Rybkin, Tian, Daniilidis, Levine, Finn https://arxiv.org/pdf/1912.12773.pdf
A multigrid method for efficiently training video models. Wu, Girshick, He, Feichtenhofer, Krahenbuhl https://arxiv.org/pdf/1912.00998.pdf
Deep variational Koopman models: inferring Koopman observations for uncertainty-aware dynamics modeling and control. Morton, Witherden, Kochenderfer https://arxiv.org/pdf/1902.09742.pdf
Symplectic ODE-NET: learning Hamiltonian dynamics with control. Zhong, Dey, Chakraborty https://arxiv.org/pdf/1909.12077.pdf
Hamiltonian graph networks with ODE integrators. Sanchez-Gonzalez, Bapst, Cranmer, Battaglia https://arxiv.org/pdf/1909.12790.pdf
Neural ordinary differential equations. Chen, Rubanova, Bettencourt, Duvenaud https://arxiv.org/pdf/1806.07366.pdf https://github.com/rtqichen/torchdiffeq
Variational autoencoder trajectory primitives and discrete latent codes. Osa, Ikemoto https://arxiv.org/pdf/1912.04063.pdf
Newton vs the machine: solving the chaotic three-body problem using deep neural networks. Breen, Foley, Boekholt, Zwart https://arxiv.org/pdf/1910.07291.pdf
Learning dynamical systems from partial observations. Ayed, de Bezenac, Pajot, Brajard, Gallinari https://arxiv.org/pdf/1902.11136.pdf
GP-VAE: deep probabilistic time series imputation. Fortuin, Baranchuk, Ratsch, Mandt https://arxiv.org/pdf/1907.04155.pdf https://github.com/ratschlab/GP-VAE
Ghost hunting in the nonlinear dynamic machine. Butner, Munion, Baucom, Wong https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0226572
Faster attend-infer-repeat with tractable probabilistic models. Stelzner, Peharz, Kersting http://proceedings.mlr.press/v97/stelzner19a/stelzner19a.pdf https://github.com/stelzner/supair
Tree-structured recurrent switching linear dynamical systems for multi-scale modeling. Nassar, Linderman, Bugallo, Park https://arxiv.org/pdf/1811.12386.pdf
DynaNet: neural Kalman dynamical model for motion estimation and prediction. Chen, Lu, Wang, Trigoni, Markham https://arxiv.org/pdf/1908.03918.pdf
Disentangled behavioral representations. Dezfouli, Ashtiani, Ghattas, Nock, Dayan, Ong https://papers.nips.cc/paper/8497-disentangled-behavioural-representations.pdf
Structured object-aware physics prediction for video modeling and planning. Kossen, Stelzner, Hussing, Voelcker, Kersting https://arxiv.org/pdf/1910.02425.pdf https://github.com/jlko/STOVE
Recurrent attentive neural process for sequential data. Qin, Zhu, Qin, Wang, Zhao https://arxiv.org/pdf/1910.09323.pdf https://kasparmartens.rbind.io/post/np/
DeepMDP: learning continuous latent space models for representation learning. Gelada, Kumar, Buckman, Nachum, Bellemare https://arxiv.org/pdf/1906.02736.pdf
Genesis: generative scene inference and sampling with object-centric latent representations. Engelcke, Kosiorek, Jones, Posner https://arxiv.org/pdf/1907.13052.pdf https://github.com/applied-ai-lab/genesis
Deep conservation: a latent dynamics model for exact satisfaction of physical conservation laws. Lee, Carlberg https://arxiv.org/pdf/1909.09754.pdf
Switching linear dynamics for variational bayes filtering. Becker-Ehmck, Peters, van der Smagt https://arxiv.org/pdf/1905.12434.pdf
Approximate Bayesian inference in spatial environments. Mirchev, Kayalibay, Soelch, van der Smagt, Bayer https://arxiv.org/pdf/1805.07206.pdf
beta-DVBF: learning state-space models for control from high dimensional observations. Das, Karl, Becker-Ehmck, van der Smagt https://arxiv.org/pdf/1911.00756.pdf
SSA-GAN: End-to-end time-lapse video generation with spatial self-attention. Horita, Yanai http://img.cs.uec.ac.jp/pub/conf19/191126horita_0.pdf
Learning energy-based spatial-temporal generative convnets for dynamic patterns. Xie, Zhu, Wu https://arxiv.org/pdf/1909.11975.pdf http://www.stat.ucla.edu/~jxie/STGConvNet/STGConvNet.html
Multiplicative interactions and where to find them. Anon https://openreview.net/pdf?id=rylnK6VtDH
Time-series generative adversarial networks. Yoon, Jarrett, van der Schaar https://papers.nips.cc/paper/8789-time-series-generative-adversarial-networks.pdf
Explaining and interpreting LSTMs. Arras, Arjona-Medina, Widrich, Montavon, Gillhofer, Muller, Hochreiter, Samek https://arxiv.org/pdf/1909.12114.pdf
Gating revisited: deep multi-layer RNNs that can be trained. Turkoglu, D'Aronco, Wegner, Schindler https://arxiv.org/pdf/1911.11033.pdf
Re-examination of the role of latent variables in sequence modeling. Lai, Dai, Yang, Yoo https://arxiv.org/pdf/1902.01388.pdf
Improving sequential latent variable models with autoregressive flows. Marino, Chen, He, Mandt https://openreview.net/pdf?id=HklvmlrKPB
Learning stable and predictive structures in kinetic systems: benefits of a causal approach. Pfister, Bauer, Peters https://arxiv.org/pdf/1810.11776.pdf
Learning to disentangle latent physical factors for video prediction. Zhu, Munderloh, Rosenhahn, Stuckler https://link.springer.com/chapter/10.1007/978-3-030-33676-9_42
Adversarial video generation on complex datasets. Clark, Donahue, Simonyan https://arxiv.org/pdf/1907.06571.pdf
Learning to predict without looking ahead: world models without forward prediction. Freeman, Metz, Ha https://arxiv.org/pdf/1910.13038.pdf
Learning video representations using contrastive bidirectional transformer. Sun, Baradel, Murphy, Schmid https://arxiv.org/pdf/1906.05743.pdf
STCN: stochastic temporal convolution networks. Aksan, Hilliges https://arxiv.org/pdf/1902.06568.pdf http://jacobcwalker.com/DTP/DTP.html https://ait.ethz.ch/projects/2019/stcn/
Zero-shot generation of human-object interaction videos. Nawhal, Zhai, Lehrmann, Sigal https://arxiv.org/pdf/1912.02401.pdf http://www.sfu.ca/~mnawhal/projects/zs_hoi_generation.html
Learning a generative model for multi-step human-object interactions from videos. Wang, Pirk, Yumer, Kim, Sener, Sridhar, Guibas http://www.pirk.info/papers/Wang.etal-2019-LearningInteractions.pdf http://www.pirk.info/projects/learning_interactions/index.html
Dream to control: learning behaviors by latent imagining. Hafner, Lillicrap, Ba, Norouzi https://arxiv.org/pdf/1912.01603.pdf
Multistage attention network for multivariate time series prediction. Hu, Zheng https://www.sciencedirect.com/science/article/abs/pii/S0925231219316625
Predicting video-frames using encoder-convLSTM combination. Mukherjee, Ghosh, Ghosh, Kumar, Roy https://ieeexplore.ieee.org/document/8682158
A variational auto-encoder model for stochastic point processes. Mehrasa, Jyothi, Durand, He, Sigal, Mori https://arxiv.org/pdf/1904.03273.pdf
Unsupervised speech representation learning using WaveNet encoders. Chorowski, Weiss, Bengio, van den Oord https://arxiv.org/pdf/1901.08810.pdf
Local aggregation for unsupervised learning of visual embeddings. Zhuang, Zhai, Yamins http://openaccess.thecvf.com/content_ICCV_2019/papers/Zhuang_Local_Aggregation_for_Unsupervised_Learning_of_Visual_Embeddings_ICCV_2019_paper.pdf
Hamiltonian generative networks. Toth, Rezende, Jaegle, Racaniere, Botev, Higgins https://arxiv.org/pdf/1909.13789.pdf
VideoBERT: a joint model for video and language representations learning. Sun, Myers, Vondrick, Murphy, Schmid https://arxiv.org/pdf/1904.01766.pdf
Video representation learning by dense predictive coding. Han, Xie, Zisserman http://openaccess.thecvf.com/content_ICCVW_2019/papers/HVU/Han_Video_Representation_Learning_by_Dense_Predictive_Coding_ICCVW_2019_paper.pdf https://github.com/TengdaHan/DPC
Unsupervised state representation learning in Atari. Anand, Racah, Ozair, Bengio, Cote, Hjelm https://arxiv.org/pdf/1906.08226.pdf
Temporal cycle-consistency learning. Dwibedi, Aytar, Tompson, Sermanet, Zisserman http://openaccess.thecvf.com/content_CVPR_2019/papers/Dwibedi_Temporal_Cycle-Consistency_Learning_CVPR_2019_paper.pdf
Self-supervised learning by cross-modal audio-video clustering. Alwassel, Mahajan, Torresani, Ghanem, Tran https://arxiv.org/pdf/1911.12667.pdf
Human action recognition with deep temporal pyramids. Mazari, Sahbi https://arxiv.org/pdf/1905.00745.pdf
Evolving losses for unlabeled video representation learning. Piergiovanni, Angelova, Ryoo https://arxiv.org/pdf/1906.03248.pdf
MoGlow: probabilistic and controllable motion synthesis using normalizing flows. Henter, Alexanderson, Beskow https://arxiv.org/pdf/1905.06598.pdf https://www.youtube.com/watch?v=lYhJnDBWyeo
High fidelity video prediction with large stochastic recurrent neural networks. Villegas, Pathak, Kannan, Erhan, Le, Lee https://arxiv.org/pdf/1911.01655.pdf https://sites.google.com/view/videopredictioncapacity
Spatiotemporal pyramid network for video action recognition. Wang, Long, Wan, Yu https://arxiv.org/pdf/1903.01038.pdf
Attentive temporal pyramid network for dynamic scene classification. Huang, Cao, Zhen, Han https://www.aaai.org/ojs/index.php/AAAI/article/view/5184
Disentangling video with independent prediction. Whitney, Fergus https://arxiv.org/pdf/1901.05590.pdf
Disentangling state space representations. Miladinovic, Gondal, Scholkopf, Buhmann, Bauer https://arxiv.org/pdf/1906.03255.pdf
Cycle-SUM: cycle-consistent adversarial LSTM networks for unsupervised video summarization. Yuan, Tay, Li, Zhou, Feng https://arxiv.org/pdf/1904.08265.pdf
Unsupervised learning from video with deep neural embeddings. Zhuang, Andonian, Yamins https://arxiv.org/pdf/1905.11954.pdf
Scaling and benchmarking self-supervised visual representation learning. Goyal, Mahajan, Gupta, Misra https://arxiv.org/pdf/1905.01235.pdf
Self-supervised visual feature learning with deep neural networks: a survey. Jing, Tian https://arxiv.org/pdf/1902.06162.pdf
Unsupervised learning of object structure and dynamics from videos. Minderer, Sun, Villegas, Cole, Murphy, Lee https://arxiv.org/pdf/1906.07889.pdf
Learning correspondence from the cycle-consistency of time. Wang, Jabri, Efros https://arxiv.org/pdf/1903.07593.pdf https://ajabri.github.io/timecycle/
DistInit: learning video representations without a single labeled video. Girdhar, Tran, Torresani, Ramanan https://arxiv.org/pdf/1901.09244.pdf
VideoFlow: a flow-based generative model for video. Kumar, Babaeizadeh, Erhan, Finn, Levine, Dinh, Kingma https://arxiv.org/pdf/1903.01434.pdf (code available in the tensor2tensor library)
Learning latent dynamics for planning from pixels. Hafner, Lillicrap, Fischer, Villegas, Ha, Lee, Davidson https://arxiv.org/pdf/1811.04551.pdf https://github.com/google-research/planet
View-LSTM: Novel-view video synthesis through view decomposition. Lakhal, Lanz, Cavallaro http://openaccess.thecvf.com/content_ICCV_2019/papers/Lakhal_View-LSTM_Novel-View_Video_Synthesis_Through_View_Decomposition_ICCV_2019_paper.pdf
Likelihood contribution based multi-scale architecture for generative flows. Das, Abbeel, Spanos https://arxiv.org/pdf/1908.01686.pdf
Adaptive online planning for continual lifelong learning. Lu, Mordatch, Abbeel https://arxiv.org/pdf/1912.01188.pdf
Exploiting video sequences for unsupervised disentangling in generative adversarial networks. Tuesca, Uzal https://arxiv.org/pdf/1910.11104.pdf
Memory in memory: a predictive neural network for learning higher-order non-stationarity from spatiotemporal dynamics. Wang, Zhang, Zhe, Long, Wang, Yu https://arxiv.org/pdf/1811.07490.pdf
Improved conditional VRNNs for video prediction. Castrejon, Ballas, Courville https://arxiv.org/pdf/1904.12165.pdf
Temporal difference variational auto-encoder. Gregor, Papamakarios, Besse, Buesing, Weber https://arxiv.org/pdf/1806.03107.pdf
Time-agnostic prediction: predicting predictable video frames. Jayaraman, Ebert, Efros, Levine https://arxiv.org/pdf/1808.07784.pdf https://sites.google.com/view/ta-pred
Variational tracking and prediction with generative disentangled state-space models. Akhundov, Soelch, Bayer, van der Smagt https://arxiv.org/pdf/1910.06205.pdf
Self-supervised spatiotemporal learning via video clip order prediction. Xu, Xiao, Zhao, Shao, Xie, Zhuang https://pdfs.semanticscholar.org/558a/eb7aa38cfcf8dd9951bfd24cf77972bd09aa.pdf https://github.com/xudejing/VCOP
Self-supervised spatio-temporal representation learning for videos by predicting motion and appearance statistics. Wang, Jiao, Bao, He, Liu, Liu http://openaccess.thecvf.com/content_CVPR_2019/papers/Wang_Self-Supervised_Spatio-Temporal_Representation_Learning_for_Videos_by_Predicting_Motion_and_CVPR_2019_paper.pdf
Spatio-temporal associative representation for video person re-identification. Wu, Zhu, Gong http://www.eecs.qmul.ac.uk/~sgg/papers/WuEtAl_BMVC2019.pdf
Object segmentation using pixel-wise adversarial loss. Durall, Pfreundt, Kothe, Keuper https://arxiv.org/pdf/1909.10341.pdf
2018
The dreaming variational autoencoder for reinforcement learning environments. Andersen, Goodwin, Granmo https://arxiv.org/pdf/1810.01112v1.pdf
MT-VAE: Learning Motion Transformations to Generate Multimodal Human Dynamics. Yan, Rastogi, Villegas, Sunkavalli, Shechtman, Hadap, Yumer, Lee http://openaccess.thecvf.com/content_ECCV_2018/html/Xinchen_Yan_Generating_Multimodal_Human_ECCV_2018_paper.html
Deep learning for universal linear embeddings of nonlinear dynamics. Lusch, Kutz, Brunton https://www.nature.com/articles/s41467-018-07210-0
Variational attention for sequence-to-sequence models. Bahuleyan, Mou, Vechtomova, Poupart https://arxiv.org/pdf/1712.08207.pdf https://github.com/variational-attention/tf-var-attention
Understanding image motion with group representations. Jaegle, Phillips, Ippolito, Daniilidis https://openreview.net/forum?id=SJLlmG-AZ
Relational neural expectation maximization: unsupervised discovery of objects and their interactions. van Steenkiste, Chang, Greff, Schmidhuber https://arxiv.org/pdf/1802.10353.pdf https://sites.google.com/view/r-nem-gifs https://github.com/sjoerdvansteenkiste/Relational-NEM
A general method for amortizing variational filtering. Marino, Cvitkovic, Yue https://arxiv.org/pdf/1811.05090.pdf https://github.com/joelouismarino/amortized-variational-filtering
Deep learning for physical processes: incorporating prior scientific knowledge. de Bezenac, Pajot, Gallinari https://arxiv.org/pdf/1711.07970.pdf https://github.com/emited/flow
Probabilistic recurrent state-space models. Doerr, Daniel, Schiegg, Nguyen-Tuong, Schaal, Toussaint, Trimpe https://arxiv.org/pdf/1801.10395.pdf https://github.com/boschresearch/PR-SSM
TGANv2: efficient training of large models for video generation with multiple subsampling layers. Saito, Saito https://arxiv.org/abs/1811.09245
Towards high resolution video generation with progressive growing of sliced Wasserstein GANs. Acharya, Huang, Paudel, Gool https://arxiv.org/abs/1810.02419
Representation learning with contrastive predictive coding. van den Oord, Li, Vinyals https://arxiv.org/pdf/1807.03748.pdf
Deconfounding reinforcement learning in observational settings. Lu, Scholkopf, Hernandez-Lobato https://arxiv.org/pdf/1812.10576.pdf
Flow-grounded spatial-temporal video prediction from still images. Li, Fang, Yang, Wang, Lu, Yang https://arxiv.org/pdf/1807.09755.pdf
Adaptive skip intervals: temporal abstractions for recurrent dynamical models. Neitz, Parascandolo, Bauer, Scholkopf https://arxiv.org/pdf/1808.04768.pdf
Disentangled sequential autoencoder. Li, Mandt https://arxiv.org/abs/1803.02991 https://github.com/yatindandi/Disentangled-Sequential-Autoencoder
Video jigsaw: unsupervised learning of spatiotemporal context for video action recognition. Ahsan, Madhok, Essa https://arxiv.org/pdf/1808.07507.pdf
Iterative reorganization with weak spatial constraints: solving arbitrary jigsaw puzzles for unsupervised representation learning. Wei, Xie, Ren, Xia, Su, Liu, Tian, Yuille https://arxiv.org/pdf/1812.00329.pdf
Stochastic adversarial video prediction. Lee, Zhang, Ebert, Abbeel, Finn, Levine https://arxiv.org/pdf/1804.01523.pdf https://alexlee-gk.github.io/video_prediction/
Stochastic variational video prediction. Babaeizadeh, Finn, Erhan, Campbell, Levine https://arxiv.org/pdf/1710.11252.pdf https://github.com/alexlee-gk/video_prediction
Folded recurrent neural networks for future video prediction. Oliu, Selva, Escalera https://arxiv.org/pdf/1712.00311.pdf
PredRNN++: Towards a resolution of the deep-in-time dilemma in spatiotemporal predictive learning. Wang, Gao, Long, Wang, Yu https://arxiv.org/pdf/1804.06300.pdf https://github.com/Yunbo426/predrnn-pp
Stochastic video generation with a learned prior. Denton, Fergus https://arxiv.org/pdf/1802.07687.pdf https://sites.google.com/view/svglp
Unsupervised learning from videos using temporal coherency deep networks. Redondo-Cabrera, Lopez-Sastre https://arxiv.org/pdf/1801.08100.pdf
Time-contrastive networks: self-supervised learning from video. Sermanet, Lynch, Chebotar, Hsu, Jang, Schaal, Levine https://arxiv.org/pdf/1704.06888.pdf
Learning to decompose and disentangle representations for video prediction. Hsieh, Liu, Huang, Fei-Fei, Niebles https://arxiv.org/pdf/1806.04166.pdf https://github.com/jthsieh/DDPAE-video-prediction
Probabilistic video generation using holistic attribute control. He, Lehrmann, Marino, Mori, Sigal https://arxiv.org/pdf/1803.08085.pdf
Interpretable intuitive physics model. Ye, Wang, Davidson, Gupta https://arxiv.org/pdf/1808.10002.pdf https://github.com/tianye95/interpretable-intuitive-physics-model
Video synthesis from a single image and motion stroke. Hu, Walchli, Portenier, Zwicker, Favaro https://arxiv.org/pdf/1812.01874.pdf
Graph networks as learnable physics engines for inference and control. Sanchez-Gonzalez, Heess, Springenberg, Merel, Riedmiller, Hadsell, Battaglia https://arxiv.org/pdf/1806.01242.pdf https://drive.google.com/file/d/14eYTWoH15T53a7qejvCkDLItOOE9Ve7S/view
Deep dynamical modeling and control of unsteady fluid flows. Morton, Witherden, Jameson, Kochenderfer https://arxiv.org/pdf/1805.07472.pdf https://github.com/sisl/deep_flow_control
Sequential attend, infer, repeat: generative modelling of moving objects. Kosiorek, Kim, Posner, Teh https://arxiv.org/pdf/1806.01794.pdf https://github.com/akosiorek/sqair https://www.youtube.com/watch?v=-IUNQgSLE0c&feature=youtu.be
Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. Xiong, Luo, Ma, Liu, Luo https://arxiv.org/pdf/1709.07592.pdf
Integrated accounts of behavioral and neuroimaging data using flexible recurrent neural network models. Dezfouli, Morris, Ramos, Dayan, Balleine https://papers.nips.cc/paper/7677-integrated-accounts-of-behavioral-and-neuroimaging-data-using-flexible-recurrent-neural-network-models.pdf
2017
Autoregressive attention for parallel sequence modeling. Laird, Irvin https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1174/reports/2755456.pdf
Physics informed deep learning: data-driven solutions of nonlinear partial differential equations. Raissi, Perdikaris, Karniadakis https://arxiv.org/pdf/1711.10561.pdf https://github.com/maziarraissi/PINNs
Unsupervised real-time control through variational empowerment. Karl, Soelch, Becker-Ehmck, Benbouzid, van der Smagt, Bayer https://arxiv.org/pdf/1710.05101.pdf https://github.com/tessavdheiden/Empowerment
Z-forcing: training stochastic recurrent networks. Goyal, Sordoni, Cote, Ke, Bengio https://arxiv.org/abs/1711.05411 https://github.com/ujjax/z-forcing
View synthesis by appearance flow. Zhou, Tulsiani, Sun, Malik, Efros https://arxiv.org/pdf/1605.03557.pdf
Learning to see physics via visual de-animation. Wu, Lu, Kohli, Freeman, Tenenbaum https://jiajunwu.com/papers/vda_nips.pdf https://github.com/pulkitag/pyphy-engine
Deep predictive coding networks for video prediction and unsupervised learning. Lotter, Kreiman, Cox https://arxiv.org/pdf/1605.08104.pdf
The predictron: end-to-end learning and planning. Silver, van Hasselt, Hessel, Schaul, Guez, Harley, Dulac-Arnold, Reichert, Rabinowitz, Barreto, Degris https://arxiv.org/pdf/1612.08810.pdf
Recurrent ladder networks. Premont-Schwarz, Ilin, Hao, Rasmus, Boney, Valpola https://arxiv.org/pdf/1707.09219.pdf
A disentangled recognition and nonlinear dynamics model for unsupervised learning. Fraccaro, Kamronn, Paquet, Winther https://arxiv.org/pdf/1710.05741.pdf
MoCoGAN: decomposing motion and content for video generation. Tulyakov, Liu, Yang, Kautz https://arxiv.org/pdf/1707.04993.pdf
Temporal generative adversarial nets with singular value clipping. Saito, Matsumoto, Saito https://arxiv.org/pdf/1611.06624.pdf
Multi-task self-supervised visual learning. Doersch, Zisserman https://arxiv.org/pdf/1708.07860.pdf
Prediction under uncertainty with error-encoding networks. Henaff, Zhao, LeCun https://arxiv.org/pdf/1711.04994.pdf https://github.com/mbhenaff/EEN
Unsupervised learning of disentangled representations from video. Denton, Birodkar https://papers.nips.cc/paper/7028-unsupervised-learning-of-disentangled-representations-from-video.pdf https://github.com/ap229997/DRNET
Self-supervised visual planning with temporal skip connections. Ebert, Finn, Lee, Levine https://arxiv.org/pdf/1710.05268.pdf
Unsupervised learning of disentangled and interpretable representations from sequential data. Hsu, Zhang, Glass https://papers.nips.cc/paper/6784-unsupervised-learning-of-disentangled-and-interpretable-representations-from-sequential-data.pdf https://github.com/wnhsu/FactorizedHierarchicalVAE https://github.com/wnhsu/ScalableFHVAE
Decomposing motion and content for natural video sequence prediction. Villegas, Yang, Hong, Lin, Lee https://arxiv.org/pdf/1706.08033.pdf
Unsupervised video summarization with adversarial LSTM networks. Mahasseni, Lam, Todorovic http://web.engr.oregonstate.edu/~sinisa/research/publications/cvpr17_summarization.pdf
Deep variational Bayes filters: unsupervised learning of state space models from raw data. Karl, Soelch, Bayer, van der Smagt https://arxiv.org/pdf/1605.06432.pdf https://github.com/sisl/deep_flow_control
A compositional object-based approach to learning physical dynamics. Chang, Ullman, Torralba, Tenenbaum https://arxiv.org/pdf/1612.00341.pdf https://github.com/mbchang/dynamics
Bayesian learning and inference in recurrent switching linear dynamical systems. Linderman, Johnson, Miller, Adams, Blei, Paninski http://proceedings.mlr.press/v54/linderman17a/linderman17a.pdf https://github.com/slinderman/recurrent-slds
SE3-Nets: learning rigid body motion using deep neural networks. Byravan, Fox https://arxiv.org/pdf/1606.02378.pdf
2016
Beyond temporal pooling: recurrence and temporal convolutions for gesture recognition in video. Pigou, van den Oord, Dieleman, Van Herreweghe, Dambre https://arxiv.org/abs/1506.01911
Dynamic filter networks. De Brabandere, Jia, Tuytelaars, Gool https://arxiv.org/pdf/1605.09673.pdf
Dynamic movement primitives in latent space of time-dependent variational autoencoders. Chen, Karl, van der Smagt https://ieeexplore.ieee.org/document/7803340
Learning physical intuition of block towers by example. Lerer, Gross, Fergus https://arxiv.org/pdf/1603.01312.pdf
Structured inference networks for nonlinear state space models. Krishnan, Shalit, Sontag https://arxiv.org/pdf/1609.09869.pdf https://github.com/clinicalml/structuredinference
A recurrent latent variable model for sequential data. Chung, Kastner, Dinh, Goel, Courville, Bengio https://arxiv.org/pdf/1506.02216.pdf https://github.com/jych/nips2015_vrnn
Recognizing micro-actions and reactions from paired egocentric videos. Yonetani, Kitani, Sato http://www.cs.cmu.edu/~kkitani/pdf/YKS-CVPR16.pdf
Anticipating visual representations from unlabeled video. Vondrick, Pirsiavash, Torralba https://www.zpascal.net/cvpr2016/Vondrick_Anticipating_Visual_Representations_CVPR_2016_paper.pdf https://github.com/chiawen/activity-anticipation
Deep multi-scale video prediction beyond mean square error. Mathieu, Couprie, LeCun https://arxiv.org/pdf/1511.05440.pdf
Generating videos with scene dynamics. Vondrick, Pirsiavash, Torralba https://papers.nips.cc/paper/6194-generating-videos-with-scene-dynamics.pdf
Disentangling space and time in video with hierarchical variational auto-encoders. Grathwohl, Wilson https://arxiv.org/pdf/1612.04440.pdf
Understanding visual concepts with continuation learning. Whitney, Chang, Kulkarni, Tenenbaum https://arxiv.org/pdf/1602.06822.pdf
Contextual RNN-GANs for abstract reasoning diagram generation. Ghosh, Kulharia, Mukerjee, Namboodiri, Bansal https://arxiv.org/pdf/1609.09444.pdf
Interaction networks for learning about objects, relations and physics. Battaglia, Pascanu, Lai, Rezende, Kavukcuoglu https://arxiv.org/pdf/1612.00222.pdf https://github.com/jsikyoon/Interaction-networks_tensorflow https://github.com/higgsfield/interaction_network_pytorch https://github.com/ToruOwO/InteractionNetwork-pytorch
An uncertain future: forecasting from static images using variational autoencoders. Walker, Doersch, Gupta, Hebert https://arxiv.org/pdf/1606.07873.pdf
Unsupervised learning for physical interaction through video prediction. Finn, Goodfellow, Levine https://arxiv.org/pdf/1605.07157.pdf
Sequential neural models with stochastic layers. Fraccaro, Sonderby, Paquet, Winther https://arxiv.org/pdf/1605.07571.pdf https://github.com/marcofraccaro/srnn
Learning visual predictive models of physics for playing billiards. Fragkiadaki, Agrawal, Levine, Malik https://arxiv.org/pdf/1511.07404.pdf
Attend, infer, repeat: fast scene understanding with generative models. Eslami, Heess, Weber, Tassa, Szepesvari, Kavukcuoglu, Hinton https://arxiv.org/pdf/1603.08575.pdf http://akosiorek.github.io/ml/2017/09/03/implementing-air.html https://github.com/akosiorek/attend_infer_repeat
Synthesizing robotic handwriting motion by learning from human demonstrations. Yin, Alves-Oliveira, Melo, Billard, Paiva https://pdfs.semanticscholar.org/951e/14dbef0036fddbecb51f1577dd77c9cd2cf3.pdf
2015
Learning stochastic recurrent networks. Bayer, Osendorfer https://arxiv.org/pdf/1411.7610.pdf https://github.com/durner/STORN-keras
Deep Kalman filters. Krishnan, Shalit, Sontag https://arxiv.org/pdf/1511.05121.pdf https://github.com/k920049/Deep-Kalman-Filter
Unsupervised learning of visual representations using videos. Wang, Gupta https://arxiv.org/pdf/1505.00687.pdf
Embed to control: a locally linear latent dynamics model for control from raw images. Watter, Springenberg, Riedmiller, Boedecker https://arxiv.org/pdf/1506.07365.pdf https://github.com/ericjang/e2c
2014
Seeing the arrow of time. Pickup, Pan, Wei, Shih, Zhang, Zisserman, Scholkopf, Freeman https://www.robots.ox.ac.uk/~vgg/publications/2014/Pickup14/pickup14.pdf
2012
Activity forecasting. Kitani, Ziebart, Bagnell, Hebert http://www.cs.cmu.edu/~kkitani/pdf/KZBH-ECCV12.pdf
2006
Information flows in causal networks. Ay, Polani https://sfi-edu.s3.amazonaws.com/sfi-edu/production/uploads/sfi-com/dev/uploads/filer/45/5f/455fd460-b6b0-4008-9de1-825a5e2b9523/06-05-014.pdf
2002
Slow feature analysis. Wiskott, Sejnowski http://www.cnbc.cmu.edu/~tai/readings/learning/wiskott_sejnowski_2002.pdf
n.d.
Learning variational latent dynamics: towards model-based imitation and control. Yin, Melo, Billard, Paiva https://pdfs.semanticscholar.org/40af/a07f86a6f7c3ec2e4e02665073b1e19652bc.pdf