Awesome-Video-Generation

A curated list of awesome work (currently 257 papers) on video generation and video representation learning, and related topics (such as RL). Feel free to contribute or email me if I've left your paper off the list : ]

They are ordered by year (new to old). I provide a link to the paper as well as to the GitHub repo where available.

2020

Disentangling multiple features in video sequences using Gaussian processes in variational autoencoders. Bhagat, Uppal, Yin, Lim https://arxiv.org/abs/2001.02408

Generative adversarial networks for spatio-temporal data: a survey. Gao, Xue, Shao, Zhao, Qin, Prabowo, Rahaman, Salim https://arxiv.org/pdf/2008.08903.pdf

Deep state-space generative model for correlated time-to-event predictions. Xue, Zhou, Du, Dai, Xu, Zhang, Cui https://dl.acm.org/doi/abs/10.1145/3394486.3403206

Toward discriminating and synthesizing motion traces using deep probabilistic generative models. Zhou, Liu, Zhang, Trajcevski https://ieeexplore.ieee.org/abstract/document/9165954/

Sample-efficient robot motion learning using Gaussian process latent variable models. Delgado-Guerrero, Colome, Torras http://www.iri.upc.edu/files/scidoc/2320-Sample-efficient-robot-motion-learning-using-Gaussian-process-latent-variable-models.pdf

Sequence prediction using spectral RNNs. Wolter, Gall, Yao https://www.researchgate.net/profile/Moritz_Wolter2/publication/329705630_Sequence_Prediction_using_Spectral_RNNs/links/5f36b9d892851cd302f44a57/Sequence-Prediction-using-Spectral-RNNs.pdf

Self-supervised video representation learning by pace prediction. Wang, Jiao, Liu https://arxiv.org/pdf/2008.05861.pdf

RhyRNN: Rhythmic RNN for recognizing events in long and complex videos. Yu, Li, Li http://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123550137.pdf

4D forecasting: sequential forecasting of 100,000 points. Weng, Wang, Levine, Kitani, Rhinehart http://www.xinshuoweng.com/papers/SPF2_eccvw/camera_ready.pdf

Multimodal deep generative models for trajectory prediction: a conditional variational autoencoder approach. Ivanovic, Leung, Schmerling, Pavone https://arxiv.org/pdf/2008.03880.pdf

Memory-augmented dense predictive coding for video representation learning. Han, Xie, Zisserman https://arxiv.org/pdf/2008.01065.pdf

SeCo: exploring sequence supervision for unsupervised representation learning. Yao, Zhang, Qiu, Pan, Mei https://arxiv.org/pdf/2008.00975.pdf

PDE-driven spatiotemporal disentanglement. Dona, Franceschi, Lamprier, Gallinari https://arxiv.org/pdf/2008.01352.pdf

Dynamics generalization via information bottleneck in deep reinforcement learning. Lu, Lee, Abbeel, Tiomkin https://arxiv.org/pdf/2008.00614.pdf

Latent space roadmap for visual action planning. Lippi, Poklukar, Welle, Varava, Yin, Marino, Kragic https://rss2020vlrrm.github.io/papers/3_CameraReadySubmission_RSS_workshop_latent_space_roadmap.pdf

Weakly-supervised learning of human dynamics. Zell, Rosenhahn, Wandt https://arxiv.org/pdf/2007.08969.pdf

Deep variational Luenberger-type observer for stochastic video prediction. Wang, Zhou, Yan, Yao, Liu, Ma, Lu https://arxiv.org/pdf/2003.00835.pdf

NewtonianVAE: proportional control and goal identification from pixels via physical latent spaces. Jaques, Burke, Hospedales https://arxiv.org/pdf/2006.01959.pdf

Constrained variational autoencoder for improving EEG based speech recognition systems. Krishna, Tran, Carnahan, Tewfik https://arxiv.org/pdf/2006.02902.pdf

Latent video transformer. Rakhimov, Volkhonskiy https://arxiv.org/pdf/2006.10704.pdf

Beyond exploding and vanishing gradients: analysing RNN training using attractors and smoothness. Ribeiro, Tiels, Aguirre, Schon http://proceedings.mlr.press/v108/ribeiro20a/ribeiro20a.pdf

Towards recurrent autoregressive flow models. Mern, Morales, Kochenderfer https://arxiv.org/pdf/2006.10096.pdf

Learning to combine top-down and bottom-up signals in recurrent neural networks with attention over modules. Mittal, Lamb, Goyal, Voleti et al. https://www.cs.colorado.edu/~mozer/Research/Selected%20Publications/reprints/Mittaletal2020.pdf

Unmasking the inductive biases of unsupervised object representations for video sequences. Weis, Chitta, Sharma et al. https://arxiv.org/pdf/2006.07034.pdf

G3AN: disentangling appearance and motion for video generation. Wang, Bilinski, Bremond, Dantcheva http://openaccess.thecvf.com/content_CVPR_2020/papers/Wang_G3AN_Disentangling_Appearance_and_Motion_for_Video_Generation_CVPR_2020_paper.pdf

Learning dynamic relationships for 3D human motion prediction. Cui, Sun, Yang http://openaccess.thecvf.com/content_CVPR_2020/papers/Cui_Learning_Dynamic_Relationships_for_3D_Human_Motion_Prediction_CVPR_2020_paper.pdf

Joint training of variational auto-encoder and latent energy-based model. Han, Nijkamp, Zhou, Pang, Zhu, Wu http://openaccess.thecvf.com/content_CVPR_2020/papers/Han_Joint_Training_of_Variational_Auto-Encoder_and_Latent_Energy-Based_Model_CVPR_2020_paper.pdf

Learning invariant representations for reinforcement learning without reconstruction. Zhang, McAllister, Calandra, Gal, Levine https://arxiv.org/pdf/2006.10742.pdf

Variational inference for sequential data with future likelihood estimates. Kim, Jang, Yang, Kim http://ailab.kaist.ac.kr/papers/pdfs/KJYK2020.pdf

Video prediction via example guidance. Xu, Xu, Ni, Yang, Darrell https://arxiv.org/pdf/2007.01738.pdf

Hierarchical path VAE-GAN: generating diverse videos from a single sample. Gur, Benaim, Wolf https://arxiv.org/pdf/2006.12226.pdf

Dynamic facial expression generation on Hilbert hypersphere with conditional Wasserstein generative adversarial nets. Otberdout, Daoudi, Kacem, Ballihi, Berretti https://arxiv.org/abs/1907.10087

HAF-SVG: hierarchical stochastic video generation with aligned features. Lin, Yuan, Li https://www.ijcai.org/Proceedings/2020/0138.pdf

Improving generative imagination in object-centric world models. Lin, Wu, Peri, Fu, Jiang, Ahn https://proceedings.icml.cc/static/paper_files/icml/2020/4995-Paper.pdf

Deep generative video compression with temporal autoregressive transforms. Yang, Yang, Marino, Yang, Mandt https://joelouismarino.github.io/files/papers/2020/seq_flows_compression/seq_flows_compression.pdf

Spatially structured recurrent modules. Rahaman, Goyal, Gondal, Wuthrich, Bauer, Sharma, Bengio, Scholkopf https://arxiv.org/pdf/2007.06533.pdf

Unsupervised object-centric video generation and decomposition in 3D. Henderson, Lampert https://arxiv.org/pdf/2007.06705.pdf

Planning from images with deep latent Gaussian process dynamics. Bosch, Achterhold, Leal-Taixe, Stuckler https://arxiv.org/pdf/2005.03770.pdf

Planning to explore via self-supervised world models. Sekar, Rybkin, Daniilidis, Abbeel, Hafner, Pathak https://arxiv.org/pdf/2005.05960.pdf

Mutual information maximization for robust plannable representations. Ding, Clavera, Abbeel https://arxiv.org/pdf/2005.08114.pdf

Supervised contrastive learning. Khosla, Teterwak, Wang, Sarna https://arxiv.org/pdf/2004.11362.pdf

Blind source extraction based on multi-channel variational autoencoder and x-vector-based speaker selection trained with data augmentation. Gu, Liao, Lu https://arxiv.org/pdf/2005.07976.pdf

BiERU: bidirectional emotional recurrent unit for conversational sentiment analysis. Li, Shao, Ji, Cambria https://arxiv.org/pdf/2006.00492.pdf

S3VAE: self-supervised sequential VAE for representation disentanglement and data generation. Zhu, Min, Kadav, Graf https://arxiv.org/pdf/2005.11437.pdf

Probably approximately correct vision-based planning using motion primitives. Veer, Majumdar https://arxiv.org/abs/2002.12852

MoVi: a large multipurpose motion and video dataset. Ghorbani, Mahdaviani, Thaler, Kording, Cook, Blohm, Troje https://arxiv.org/abs/2003.01888

Temporal convolutional attention-based network for sequence modeling. Hao, Wang, Xia, Shen, Zhao https://arxiv.org/abs/2002.12530

Neuroevolution of self-interpretable agents. Tang, Nguyen, Ha https://arxiv.org/abs/2003.08165

Attentional adversarial variational video generation via decomposing motion and content. Talafha, Rekabdar, Ekenna, Mousas https://ieeexplore.ieee.org/document/9031476

Imputer: sequence modelling via imputation and dynamic programming. Chan, Saharia, Hinton, Norouzi, Jaitly https://arxiv.org/abs/2002.08926

Variational conditioning of deep recurrent networks for modeling complex motion dynamics. Buckchash, Raman https://ieeexplore.ieee.org/document/9055015?denied=

Training of deep neural networks for the generation of dynamic movement primitives. Pahic, Ridge, Gams, Morimoto, Ude https://www.sciencedirect.com/science/article/pii/S0893608020301301

PreCNet: next frame video prediction based on predictive coding. Straka, Svoboda, Hoffmann https://arxiv.org/pdf/2004.14878.pdf

Dimensionality reduction of movement primitives in parameter space. Tosatto, Stadtmuller, Peters https://arxiv.org/abs/2003.02634

Disentangling physical dynamics from unknown factors for unsupervised video prediction. Le Guen, Thome https://arxiv.org/abs/2003.01460

A real-robot dataset for assessing transferability of learned dynamics models. Agudelo-Espana, Zadaianchuk, Wenk, Garg, Akpo et al https://www.is.mpg.de/uploads_file/attachment/attachment/589/ICRA20_1157_FI.pdf

Hierarchical decomposition of nonlinear dynamics and control for system identification and policy distillation. Abdulsamad, Peters https://arxiv.org/pdf/2005.01432.pdf

Occlusion resistant learning of intuitive physics from videos. Riochet, Sivic, Laptev, Dupoux https://arxiv.org/pdf/2005.00069.pdf

Scalable learning in latent state sequence models. Aicher https://digital.lib.washington.edu/researchworks/bitstream/handle/1773/45550/Aicher_washington_0250E_21152.pdf?sequence=1

How useful is self-supervised pretraining for visual tasks? Newell, Deng https://arxiv.org/pdf/2003.14323.pdf

q-VAE for disentangled representation learning and latent dynamical systems. Kobayashi https://arxiv.org/pdf/2003.01852.pdf

Variational recurrent models for solving partially observable control tasks. Han, Doya, Tani https://openreview.net/forum?id=r1lL4a4tDB

Stochastic latent residual video prediction. Franceschi, Delasalles, Chen, Lamprier, Gallinari https://arxiv.org/pdf/2002.09219.pdf https://sites.google.com/view/srvp

Disentangled speech embeddings using cross-modal self-supervision. Nagrani, Chung, Albanie, Zisserman https://arxiv.org/abs/2002.08742

TwoStreamVAM: improving motion modeling in video generation. Sun, Xu, Saenko https://arxiv.org/abs/1812.01037

Variational hyper RNN for sequence modeling. Deng, Cao, Chang, Sigal, Mori, Brubaker https://arxiv.org/abs/2002.10501

Exploring spatial-temporal multi-frequency analysis for high-fidelity and temporal-consistency video prediction. Jin, Hu, Tang, Niu, Shi, Han, Li https://arxiv.org/abs/2002.09905

2019

Representing closed transformation paths in encoded network latent space. Connor, Rozell https://arxiv.org/pdf/1912.02644.pdf

Animating arbitrary objects via deep motion transfer. Siarohin, Lathuiliere, Tulyakov, Ricci, Sebe https://arxiv.org/abs/1812.08861

Feedback recurrent autoencoder. Yang, Sautiere, Ryu, Cohen https://arxiv.org/abs/1911.04018

First order motion model for image animation. Siarohin, Lathuiliere, Tulyakov, Ricci, Sebe https://papers.nips.cc/paper/8935-first-order-motion-model-for-image-animation

Point-to-point video generation. Wang, Cheng, Lin, Chen, Sun https://arxiv.org/pdf/1904.02912.pdf

Learning deep controllable and structured representations for image synthesis, structured prediction and beyond. Yan https://deepblue.lib.umich.edu/handle/2027.42/153334

Decoupling feature extraction from policy learning: assessing benefits of state representation learning in goal based robotics. Raffin, Hill, Traore, Lesort, Diaz-Rodriguez, Filliat https://arxiv.org/abs/1901.08651

Task-Conditioned variational autoencoders for learning movement primitives. Noseworthy, Paul, Roy, Park, Roy https://groups.csail.mit.edu/rrg/papers/noseworthy_corl_19.pdf

Spatio-temporal alignments: optimal transport through space and time. Janati, Cuturi, Gramfort https://arxiv.org/pdf/1910.03860.pdf

Action Genome: actions as composition of spatio-temporal scene graphs. Ji, Krishna, Fei-Fei, Niebles https://arxiv.org/pdf/1912.06992.pdf

Video-to-video translation for visual speech synthesis. Doukas, Sharmanska, Zafeiriou https://arxiv.org/pdf/1905.12043.pdf

Predictive coding, variational autoencoders, and biological connections. Marino https://openreview.net/pdf?id=SyeumQYUUH

Single Headed Attention RNN: stop thinking with your head. Merity https://arxiv.org/pdf/1911.11423.pdf

Hamiltonian neural networks. Greydanus, Dzamba, Yosinski https://arxiv.org/pdf/1906.01563.pdf https://github.com/greydanus/hamiltonian-nn

Learning what you can do before doing anything. Rybkin, Pertsch, Derpanis, Daniilidis, Jaegle https://openreview.net/pdf?id=SylPMnR9Ym https://daniilidis-group.github.io/learned_action_spaces

Deep Lagrangian networks: using physics as model prior for deep learning. Lutter, Ritter, Peters https://arxiv.org/pdf/1907.04490.pdf

A general framework for structured learning of mechanical systems. Gupta, Menda, Manchester, Kochenderfer https://arxiv.org/pdf/1902.08705.pdf https://github.com/sisl/machamodlearn

Learning predictive models from observation and interaction. Schmeckpeper, Xie, Rybkin, Tian, Daniilidis, Levine, Finn https://arxiv.org/pdf/1912.12773.pdf

A multigrid method for efficiently training video models. Wu, Girshick, He, Feichtenhofer, Krahenbuhl https://arxiv.org/pdf/1912.00998.pdf

Deep variational Koopman models: inferring Koopman observations for uncertainty-aware dynamics modeling and control. Morton, Witherden, Kochenderfer https://arxiv.org/pdf/1902.09742.pdf

Symplectic ODE-Net: learning Hamiltonian dynamics with control. Zhong, Dey, Chakraborty https://arxiv.org/pdf/1909.12077.pdf

Hamiltonian graph networks with ODE integrators. Sanchez-Gonzalez, Bapst, Cranmer, Battaglia https://arxiv.org/pdf/1909.12790.pdf

Neural ordinary differential equations. Chen, Rubanova, Bettencourt, Duvenaud https://arxiv.org/pdf/1806.07366.pdf https://github.com/rtqichen/torchdiffeq

Variational autoencoder trajectory primitives and discrete latent codes. Osa, Ikemoto https://arxiv.org/pdf/1912.04063.pdf

Newton vs the machine: solving the chaotic three-body problem using deep neural networks. Breen, Foley, Boekholt, Zwart https://arxiv.org/pdf/1910.07291.pdf

Learning dynamical systems from partial observations. Ayed, de Bezenac, Pajot, Brajard, Gallinari https://arxiv.org/pdf/1902.11136.pdf

GP-VAE: deep probabilistic time series imputation. Fortuin, Baranchuk, Ratsch, Mandt https://arxiv.org/pdf/1907.04155.pdf https://github.com/ratschlab/GP-VAE

Ghost hunting in the nonlinear dynamic machine. Butner, Munion, Baucom, Wong https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0226572

Faster attend-infer-repeat with tractable probabilistic models. Stelzner, Peharz, Kersting http://proceedings.mlr.press/v97/stelzner19a/stelzner19a.pdf https://github.com/stelzner/supair

Tree-structured recurrent switching linear dynamical systems for multi-scale modeling. Nassar, Linderman, Bugallo, Park https://arxiv.org/pdf/1811.12386.pdf

DynaNet: neural Kalman dynamical model for motion estimation and prediction. Chen, Lu, Wang, Trigoni, Markham https://arxiv.org/pdf/1908.03918.pdf

Disentangled behavioral representations. Dezfouli, Ashtiani, Ghattas, Nock, Dayan, Ong https://papers.nips.cc/paper/8497-disentangled-behavioural-representations.pdf

Structured object-aware physics prediction for video modeling and planning. Kossen, Stelzner, Hussing, Voelcker, Kersting https://arxiv.org/pdf/1910.02425.pdf https://github.com/jlko/STOVE

Recurrent attentive neural process for sequential data. Qin, Zhu, Qin, Wang, Zhao https://arxiv.org/pdf/1910.09323.pdf https://kasparmartens.rbind.io/post/np/

DeepMDP: learning continuous latent space models for representation learning. Gelada, Kumar, Buckman, Nachum, Bellemare https://arxiv.org/pdf/1906.02736.pdf

Genesis: generative scene inference and sampling with object-centric latent representations. Engelcke, Kosiorek, Jones, Posner https://arxiv.org/pdf/1907.13052.pdf https://github.com/applied-ai-lab/genesis

Deep conservation: a latent dynamics model for exact satisfaction of physical conservation laws. Lee, Carlberg https://arxiv.org/pdf/1909.09754.pdf

Switching linear dynamics for variational bayes filtering. Becker-Ehmck, Peters, van der Smagt https://arxiv.org/pdf/1905.12434.pdf

Approximate Bayesian inference in spatial environments. Mirchev, Kayalibay, Soelch, van der Smagt, Bayer https://arxiv.org/pdf/1805.07206.pdf

beta-DVBF: learning state-space models for control from high dimensional observations. Das, Karl, Becker-Ehmck, van der Smagt https://arxiv.org/pdf/1911.00756.pdf

SSA-GAN: End-to-end time-lapse video generation with spatial self-attention. Horita, Yanai http://img.cs.uec.ac.jp/pub/conf19/191126horita_0.pdf

Learning energy-based spatial-temporal generative convnets for dynamic patterns. Xie, Zhu, Wu https://arxiv.org/pdf/1909.11975.pdf http://www.stat.ucla.edu/~jxie/STGConvNet/STGConvNet.html

Multiplicative interactions and where to find them. Anon https://openreview.net/pdf?id=rylnK6VtDH

Time-series generative adversarial networks. Yoon, Jarrett, van der Schaar https://papers.nips.cc/paper/8789-time-series-generative-adversarial-networks.pdf

Explaining and interpreting LSTMs. Arras, Arjona-Medina, Widrich, Montavon, Gillhofer, Muller, Hochreiter, Samek https://arxiv.org/pdf/1909.12114.pdf

Gating revisited: deep multi-layer RNNs that can be trained. Turkoglu, D'Aronco, Wegner, Schindler https://arxiv.org/pdf/1911.11033.pdf

Re-examination of the role of latent variables in sequence modeling. Lai, Dai, Yang, Yoo https://arxiv.org/pdf/1902.01388.pdf

Improving sequential latent variable models with autoregressive flows. Marino, Chen, He, Mandt https://openreview.net/pdf?id=HklvmlrKPB

Learning stable and predictive structures in kinetic systems: benefits of a causal approach. Pfister, Bauer, Peters https://arxiv.org/pdf/1810.11776.pdf

Learning to disentangle latent physical factors for video prediction. Zhu, Munderloh, Rosenhahn, Stuckler https://link.springer.com/chapter/10.1007/978-3-030-33676-9_42

Adversarial video generation on complex datasets. Clark, Donahue, Simonyan https://arxiv.org/pdf/1907.06571.pdf

Learning to predict without looking ahead: world models without forward prediction. Freeman, Metz, Ha https://arxiv.org/pdf/1910.13038.pdf

Learning video representations using contrastive bidirectional transformer. Sun, Baradel, Murphy, Schmid https://arxiv.org/pdf/1906.05743.pdf

STCN: stochastic temporal convolution networks. Aksan, Hilliges https://arxiv.org/pdf/1902.06568.pdf http://jacobcwalker.com/DTP/DTP.html https://ait.ethz.ch/projects/2019/stcn/

Zero-shot generation of human-object interaction videos. Nawhal, Zhai, Lehrmann, Sigal https://arxiv.org/pdf/1912.02401.pdf http://www.sfu.ca/~mnawhal/projects/zs_hoi_generation.html

Learning a generative model for multi-step human-object interactions from videos. Wang, Pirk, Yumer, Kim, Sener, Sridhar, Guibas http://www.pirk.info/papers/Wang.etal-2019-LearningInteractions.pdf http://www.pirk.info/projects/learning_interactions/index.html

Dream to control: learning behaviors by latent imagining. Hafner, Lillicrap, Ba, Norouzi https://arxiv.org/pdf/1912.01603.pdf

Multistage attention network for multivariate time series prediction. Hu, Zheng https://www.sciencedirect.com/science/article/abs/pii/S0925231219316625

Predicting video-frames using encoder-convLSTM combination. Mukherjee, Ghosh, Ghosh, Kumar, Roy https://ieeexplore.ieee.org/document/8682158

A variational auto-encoder model for stochastic point processes. Mehrasa, Jyothi, Durand, He, Sigal, Mori https://arxiv.org/pdf/1904.03273.pdf

Unsupervised speech representation learning using WaveNet encoders. Chorowski, Weiss, Bengio, van den Oord https://arxiv.org/pdf/1901.08810.pdf

Local aggregation for unsupervised learning of visual embeddings. Zhuang, Zhai, Yamins http://openaccess.thecvf.com/content_ICCV_2019/papers/Zhuang_Local_Aggregation_for_Unsupervised_Learning_of_Visual_Embeddings_ICCV_2019_paper.pdf

Hamiltonian generative networks. Toth, Rezende, Jaegle, Racaniere, Botev, Higgins https://arxiv.org/pdf/1909.13789.pdf

VideoBERT: a joint model for video and language representation learning. Sun, Myers, Vondrick, Murphy, Schmid https://arxiv.org/pdf/1904.01766.pdf

Video representation learning by dense predictive coding. Han, Xie, Zisserman http://openaccess.thecvf.com/content_ICCVW_2019/papers/HVU/Han_Video_Representation_Learning_by_Dense_Predictive_Coding_ICCVW_2019_paper.pdf https://github.com/TengdaHan/DPC

Unsupervised state representation learning in Atari. Anand, Racah, Ozair, Bengio, Cote, Hjelm https://arxiv.org/pdf/1906.08226.pdf

Temporal cycle-consistency learning. Dwibedi, Aytar, Tompson, Sermanet, Zisserman http://openaccess.thecvf.com/content_CVPR_2019/papers/Dwibedi_Temporal_Cycle-Consistency_Learning_CVPR_2019_paper.pdf

Self-supervised learning by cross-modal audio-video clustering. Alwassel, Mahajan, Torresani, Ghanem, Tran https://arxiv.org/pdf/1911.12667.pdf

Human action recognition with deep temporal pyramids. Mazari, Sahbi https://arxiv.org/pdf/1905.00745.pdf

Evolving losses for unlabeled video representation learning. Piergiovanni, Angelova, Ryoo https://arxiv.org/pdf/1906.03248.pdf

MoGlow: probabilistic and controllable motion synthesis using normalizing flows. Henter, Alexanderson, Beskow https://arxiv.org/pdf/1905.06598.pdf https://www.youtube.com/watch?v=lYhJnDBWyeo

High fidelity video prediction with large stochastic recurrent neural networks. Villegas, Pathak, Kannan, Erhan, Le, Lee https://arxiv.org/pdf/1911.01655.pdf https://sites.google.com/view/videopredictioncapacity

Spatiotemporal pyramid network for video action recognition. Wang, Long, Wan, Yu https://arxiv.org/pdf/1903.01038.pdf

Attentive temporal pyramid network for dynamic scene classification. Huang, Cao, Zhen, Han https://www.aaai.org/ojs/index.php/AAAI/article/view/5184

Disentangling video with independent prediction. Whitney, Fergus https://arxiv.org/pdf/1901.05590.pdf

Disentangling state space representations. Miladinovic, Gondal, Scholkopf, Buhmann, Bauer https://arxiv.org/pdf/1906.03255.pdf

Cycle-SUM: cycle-consistent adversarial LSTM networks for unsupervised video summarization. Yuan, Tay, Li, Zhou, Feng https://arxiv.org/pdf/1904.08265.pdf

Unsupervised learning from video with deep neural embeddings. Zhuang, Andonian, Yamins https://arxiv.org/pdf/1905.11954.pdf

Scaling and benchmarking self-supervised visual representation learning. Goyal, Mahajan, Gupta, Misra https://arxiv.org/pdf/1905.01235.pdf

Self-supervised visual feature learning with deep neural networks: a survey. Jing, Tian https://arxiv.org/pdf/1902.06162.pdf

Unsupervised learning of object structure and dynamics from videos. Minderer, Sun, Villegas, Cole, Murphy, Lee https://arxiv.org/pdf/1906.07889.pdf

Learning correspondence from the cycle-consistency of time. Wang, Jabri, Efros https://arxiv.org/pdf/1903.07593.pdf https://ajabri.github.io/timecycle/

DistInit: learning video representations without a single labeled video. Girdhar, Tran, Torresani, Ramanan https://arxiv.org/pdf/1901.09244.pdf

VideoFlow: a flow-based generative model for video. Kumar, Babaeizadeh, Erhan, Finn, Levine, Dinh, Kingma https://arxiv.org/pdf/1903.01434.pdf (code is available in the tensor2tensor library)

Learning latent dynamics for planning from pixels. Hafner, Lillicrap, Fischer, Villegas, Ha, Lee, Davidson https://arxiv.org/pdf/1811.04551.pdf https://github.com/google-research/planet

View-LSTM: Novel-view video synthesis through view decomposition. Lakhal, Lanz, Cavallaro http://openaccess.thecvf.com/content_ICCV_2019/papers/Lakhal_View-LSTM_Novel-View_Video_Synthesis_Through_View_Decomposition_ICCV_2019_paper.pdf

Likelihood contribution based multi-scale architecture for generative flows. Das, Abbeel, Spanos https://arxiv.org/pdf/1908.01686.pdf

Adaptive online planning for continual lifelong learning. Lu, Mordatch, Abbeel https://arxiv.org/pdf/1912.01188.pdf

Exploiting video sequences for unsupervised disentangling in generative adversarial networks. Tuesca, Uzal https://arxiv.org/pdf/1910.11104.pdf

Memory in memory: a predictive neural network for learning higher-order non-stationarity from spatiotemporal dynamics. Wang, Zhang, Zhu, Long, Wang, Yu https://arxiv.org/pdf/1811.07490.pdf

Improved conditional VRNNs for video prediction. Castrejon, Ballas, Courville https://arxiv.org/pdf/1904.12165.pdf

Temporal difference variational auto-encoder. Gregor, Papamakarios, Besse, Buesing, Weber https://arxiv.org/pdf/1806.03107.pdf

Time-agnostic prediction: predicting predictable video frames. Jayaraman, Ebert, Efros, Levine https://arxiv.org/pdf/1808.07784.pdf https://sites.google.com/view/ta-pred

Variational tracking and prediction with generative disentangled state-space models. Akhundov, Soelch, Bayer, van der Smagt https://arxiv.org/pdf/1910.06205.pdf

Self-supervised spatiotemporal learning via video clip order prediction. Xu, Xiao, Zhao, Shao, Xie, Zhuang https://pdfs.semanticscholar.org/558a/eb7aa38cfcf8dd9951bfd24cf77972bd09aa.pdf https://github.com/xudejing/VCOP

Self-supervised spatio-temporal representation learning for videos by predicting motion and appearance statistics. Wang, Jiao, Bao, He, Liu, Liu http://openaccess.thecvf.com/content_CVPR_2019/papers/Wang_Self-Supervised_Spatio-Temporal_Representation_Learning_for_Videos_by_Predicting_Motion_and_CVPR_2019_paper.pdf

Spatio-temporal associative representation for video person re-identification. Wu, Zhu, Gong http://www.eecs.qmul.ac.uk/~sgg/papers/WuEtAl_BMVC2019.pdf

Object segmentation using pixel-wise adversarial loss. Durall, Pfreundt, Kothe, Keuper https://arxiv.org/pdf/1909.10341.pdf

2018

The dreaming variational autoencoder for reinforcement learning environments. Andersen, Goodwin, Granmo https://arxiv.org/pdf/1810.01112v1.pdf

MT-VAE: Learning Motion Transformations to Generate Multimodal Human Dynamics. Yan, Rastogi, Villegas, Sunkavalli, Shechtman, Hadap, Yumer, Lee http://openaccess.thecvf.com/content_ECCV_2018/html/Xinchen_Yan_Generating_Multimodal_Human_ECCV_2018_paper.html

Deep learning for universal linear embeddings of nonlinear dynamics. Lusch, Kutz, Brunton https://www.nature.com/articles/s41467-018-07210-0

Variational attention for sequence-to-sequence models. Bahuleyan, Mou, Vechtomova, Poupart https://arxiv.org/pdf/1712.08207.pdf https://github.com/variational-attention/tf-var-attention

Understanding image motion with group representations. Jaegle, Phillips, Ippolito, Daniilidis https://openreview.net/forum?id=SJLlmG-AZ

Relational neural expectation maximization: unsupervised discovery of objects and their interactions. van Steenkiste, Chang, Greff, Schmidhuber https://arxiv.org/pdf/1802.10353.pdf https://sites.google.com/view/r-nem-gifs https://github.com/sjoerdvansteenkiste/Relational-NEM

A general method for amortizing variational filtering. Marino, Cvitkovic, Yue https://arxiv.org/pdf/1811.05090.pdf https://github.com/joelouismarino/amortized-variational-filtering

Deep learning for physical processes: incorporating prior scientific knowledge. de Bezenac, Pajot, Gallinari https://arxiv.org/pdf/1711.07970.pdf https://github.com/emited/flow

Probabilistic recurrent state-space models. Doerr, Daniel, Schiegg, Nguyen-Tuong, Schaal, Toussaint, Trimpe https://arxiv.org/pdf/1801.10395.pdf https://github.com/boschresearch/PR-SSM

TGANv2: efficient training of large models for video generation with multiple subsampling layers. Saito, Saito https://arxiv.org/abs/1811.09245

Towards high resolution video generation with progressive growing of sliced Wasserstein GANs. Acharya, Huang, Paudel, Gool https://arxiv.org/abs/1810.02419

Representation learning with contrastive predictive coding. van den Oord, Li, Vinyals https://arxiv.org/pdf/1807.03748.pdf

Deconfounding reinforcement learning in observational settings. Lu, Scholkopf, Hernandez-Lobato https://arxiv.org/pdf/1812.10576.pdf

Flow-grounded spatial-temporal video prediction from still images. Li, Fang, Yang, Wang, Lu, Yang https://arxiv.org/pdf/1807.09755.pdf

Adaptive skip intervals: temporal abstractions for recurrent dynamical models. Neitz, Parascandolo, Bauer, Scholkopf https://arxiv.org/pdf/1808.04768.pdf

Disentangled sequential autoencoder. Li, Mandt https://arxiv.org/abs/1803.02991 https://github.com/yatindandi/Disentangled-Sequential-Autoencoder

Video jigsaw: unsupervised learning of spatiotemporal context for video action recognition. Ahsan, Madhok, Essa https://arxiv.org/pdf/1808.07507.pdf

Iterative reorganization with weak spatial constraints: solving arbitrary jigsaw puzzles for unsupervised representation learning. Wei, Xie, Ren, Xia, Su, Liu, Tian, Yuille https://arxiv.org/pdf/1812.00329.pdf

Stochastic adversarial video prediction. Lee, Zhang, Ebert, Abbeel, Finn, Levine https://arxiv.org/pdf/1804.01523.pdf https://alexlee-gk.github.io/video_prediction/

Stochastic variational video prediction. Babaeizadeh, Finn, Erhan, Campbell, Levine https://arxiv.org/pdf/1710.11252.pdf https://github.com/alexlee-gk/video_prediction

Folded recurrent neural networks for future video prediction. Oliu, Selva, Escalera https://arxiv.org/pdf/1712.00311.pdf

PredRNN++: Towards a resolution of the deep-in-time dilemma in spatiotemporal predictive learning. Wang, Gao, Long, Wang, Yu https://arxiv.org/pdf/1804.06300.pdf https://github.com/Yunbo426/predrnn-pp

Stochastic video generation with a learned prior. Denton, Fergus https://arxiv.org/pdf/1802.07687.pdf https://sites.google.com/view/svglp

Unsupervised learning from videos using temporal coherency deep networks. Redondo-Cabrera, Lopez-Sastre https://arxiv.org/pdf/1801.08100.pdf

Time-contrastive networks: self-supervised learning from video. Sermanet, Lynch, Chebotar, Hsu, Jang, Schaal, Levine https://arxiv.org/pdf/1704.06888.pdf

Learning to decompose and disentangle representations for video prediction. Hsieh, Liu, Huang, Fei-Fei, Niebles https://arxiv.org/pdf/1806.04166.pdf https://github.com/jthsieh/DDPAE-video-prediction

Probabilistic video generation using holistic attribute control. He, Lehrmann, Marino, Mori, Sigal https://arxiv.org/pdf/1803.08085.pdf

Interpretable intuitive physics model. Ye, Wang, Davidson, Gupta https://arxiv.org/pdf/1808.10002.pdf https://github.com/tianye95/interpretable-intuitive-physics-model

Video synthesis from a single image and motion stroke. Hu, Walchli, Portenier, Zwicker, Favaro https://arxiv.org/pdf/1812.01874.pdf

Graph networks as learnable physics engines for inference and control. Sanchez-Gonzalez, Heess, Springenberg, Merel, Riedmiller, Hadsell, Battaglia https://arxiv.org/pdf/1806.01242.pdf https://drive.google.com/file/d/14eYTWoH15T53a7qejvCkDLItOOE9Ve7S/view

Deep dynamical modeling and control of unsteady fluid flows. Morton, Witherden, Jameson, Kochenderfer https://arxiv.org/pdf/1805.07472.pdf https://github.com/sisl/deep_flow_control

Sequential attend, infer, repeat: generative modelling of moving objects. Kosiorek, Kim, Posner, Teh https://arxiv.org/pdf/1806.01794.pdf https://github.com/akosiorek/sqair https://www.youtube.com/watch?v=-IUNQgSLE0c&feature=youtu.be

Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. Xiong, Luo, Ma, Liu, Luo https://arxiv.org/pdf/1709.07592.pdf

Integrated accounts of behavioral and neuroimaging data using flexible recurrent neural network models. Dezfouli, Morris, Ramos, Dayan, Balleine https://papers.nips.cc/paper/7677-integrated-accounts-of-behavioral-and-neuroimaging-data-using-flexible-recurrent-neural-network-models.pdf

2017

Autoregressive attention for parallel sequence modeling. Laird, Irvin https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1174/reports/2755456.pdf

Physics informed deep learning: data-driven solutions of nonlinear partial differential equations. Raissi, Perdikaris, Karniadakis https://arxiv.org/pdf/1711.10561.pdf https://github.com/maziarraissi/PINNs

Unsupervised real-time control through variational empowerment. Karl, Soelch, Becker-Ehmck, Benbouzid, van der Smagt, Bayer https://arxiv.org/pdf/1710.05101.pdf https://github.com/tessavdheiden/Empowerment

Z-forcing: training stochastic recurrent networks. Goyal, Sordoni, Cote, Ke, Bengio https://arxiv.org/abs/1711.05411 https://github.com/ujjax/z-forcing

View synthesis by appearance flow. Zhou, Tulsiani, Sun, Malik, Efros https://arxiv.org/pdf/1605.03557.pdf

Learning to see physics via visual de-animation. Wu, Lu, Kohli, Freeman, Tenenbaum https://jiajunwu.com/papers/vda_nips.pdf https://github.com/pulkitag/pyphy-engine

Deep predictive coding networks for video prediction and unsupervised learning. Lotter, Kreiman, Cox https://arxiv.org/pdf/1605.08104.pdf

The predictron: end-to-end learning and planning. Silver, van Hasselt, Hessel, Schaul, Guez, Harley, Dulac-Arnold, Reichert, Rabinowitz, Barreto, Degris https://arxiv.org/pdf/1612.08810.pdf

Recurrent ladder networks. Premont-Schwarz, Ilin, Hao, Rasmus, Boney, Valpola https://arxiv.org/pdf/1707.09219.pdf

A disentangled recognition and nonlinear dynamics model for unsupervised learning. Fraccaro, Kamronn, Paquet, Winther https://arxiv.org/pdf/1710.05741.pdf

MoCoGAN: decomposing motion and content for video generation. Tulyakov, Liu, Yang, Kautz https://arxiv.org/pdf/1707.04993.pdf

Temporal generative adversarial nets with singular value clipping. Saito, Matsumoto, Saito https://arxiv.org/pdf/1611.06624.pdf

Multi-task self-supervised visual learning. Doersch, Zisserman https://arxiv.org/pdf/1708.07860.pdf

Prediction under uncertainty with error-encoding networks. Henaff, Zhao, LeCun https://arxiv.org/pdf/1711.04994.pdf https://github.com/mbhenaff/EEN

Unsupervised learning of disentangled representations from video. Denton, Birodkar https://papers.nips.cc/paper/7028-unsupervised-learning-of-disentangled-representations-from-video.pdf https://github.com/ap229997/DRNET

Self-supervised visual planning with temporal skip connections. Ebert, Finn, Lee, Levine https://arxiv.org/pdf/1710.05268.pdf

Unsupervised learning of disentangled and interpretable representations from sequential data. Hsu, Zhang, Glass https://papers.nips.cc/paper/6784-unsupervised-learning-of-disentangled-and-interpretable-representations-from-sequential-data.pdf https://github.com/wnhsu/FactorizedHierarchicalVAE https://github.com/wnhsu/ScalableFHVAE

Decomposing motion and content for natural video sequence prediction. Villegas, Yang, Hong, Lin, Lee https://arxiv.org/pdf/1706.08033.pdf

Unsupervised video summarization with adversarial LSTM networks. Mahasseni, Lam, Todorovic http://web.engr.oregonstate.edu/~sinisa/research/publications/cvpr17_summarization.pdf

Deep variational Bayes filters: unsupervised learning of state space models from raw data. Karl, Soelch, Bayer, van der Smagt https://arxiv.org/pdf/1605.06432.pdf https://github.com/sisl/deep_flow_control

A compositional object-based approach to learning physical dynamics. Chang, Ullman, Torralba, Tenenbaum https://arxiv.org/pdf/1612.00341.pdf https://github.com/mbchang/dynamics

Bayesian learning and inference in recurrent switching linear dynamical systems. Linderman, Johnson, Miller, Adams, Blei, Paninski http://proceedings.mlr.press/v54/linderman17a/linderman17a.pdf https://github.com/slinderman/recurrent-slds

SE3-Nets: learning rigid body motion using deep neural networks. Byravan, Fox https://arxiv.org/pdf/1606.02378.pdf

2016

Beyond temporal pooling: recurrence and temporal convolutions for gesture recognition in video. Pigou, van den Oord, Dieleman, Van Herreweghe, Dambre https://arxiv.org/abs/1506.01911

Dynamic filter networks. De Brabandere, Jia, Tuytelaars, Van Gool https://arxiv.org/pdf/1605.09673.pdf

Dynamic movement primitives in latent space of time-dependent variational autoencoders. Chen, Karl, van der Smagt https://ieeexplore.ieee.org/document/7803340

Learning physical intuition of block towers by example. Lerer, Gross, Fergus https://arxiv.org/pdf/1603.01312.pdf

Structured inference networks for nonlinear state space models. Krishnan, Shalit, Sontag https://arxiv.org/pdf/1609.09869.pdf https://github.com/clinicalml/structuredinference

A recurrent latent variable model for sequential data. Chung, Kastner, Dinh, Goel, Courville, Bengio https://arxiv.org/pdf/1506.02216.pdf https://github.com/jych/nips2015_vrnn

Recognizing micro-actions and reactions from paired egocentric videos. Yonetani, Kitani, Sato http://www.cs.cmu.edu/~kkitani/pdf/YKS-CVPR16.pdf

Anticipating visual representations from unlabeled video. Vondrick, Pirsiavash, Torralba https://www.zpascal.net/cvpr2016/Vondrick_Anticipating_Visual_Representations_CVPR_2016_paper.pdf https://github.com/chiawen/activity-anticipation

Deep multi-scale video prediction beyond mean square error. Mathieu, Couprie, LeCun https://arxiv.org/pdf/1511.05440.pdf

Generating videos with scene dynamics. Vondrick, Pirsiavash, Torralba https://papers.nips.cc/paper/6194-generating-videos-with-scene-dynamics.pdf

Disentangling space and time in video with hierarchical variational auto-encoders. Grathwohl, Wilson https://arxiv.org/pdf/1612.04440.pdf

Understanding visual concepts with continuation learning. Whitney, Chang, Kulkarni, Tenenbaum https://arxiv.org/pdf/1602.06822.pdf

Contextual RNN-GANs for abstract reasoning diagram generation. Ghosh, Kulharia, Mukerjee, Namboodiri, Bansal https://arxiv.org/pdf/1609.09444.pdf

Interaction networks for learning about objects, relations and physics. Battaglia, Pascanu, Lai, Rezende, Kavukcuoglu https://arxiv.org/pdf/1612.00222.pdf https://github.com/jsikyoon/Interaction-networks_tensorflow https://github.com/higgsfield/interaction_network_pytorch https://github.com/ToruOwO/InteractionNetwork-pytorch

An uncertain future: forecasting from static images using variational autoencoders. Walker, Doersch, Gupta, Hebert https://arxiv.org/pdf/1606.07873.pdf

Unsupervised learning for physical interaction through video prediction. Finn, Goodfellow, Levine https://arxiv.org/pdf/1605.07157.pdf

Sequential neural models with stochastic layers. Fraccaro, Sonderby, Paquet, Winther https://arxiv.org/pdf/1605.07571.pdf https://github.com/marcofraccaro/srnn

Learning visual predictive models of physics for playing billiards. Fragkiadaki, Agrawal, Levine, Malik https://arxiv.org/pdf/1511.07404.pdf

Attend, infer, repeat: fast scene understanding with generative models. Eslami, Heess, Weber, Tassa, Szepesvari, Kavukcuoglu, Hinton https://arxiv.org/pdf/1603.08575.pdf http://akosiorek.github.io/ml/2017/09/03/implementing-air.html https://github.com/akosiorek/attend_infer_repeat

Synthesizing robotic handwriting motion by learning from human demonstrations. Yin, Alves-Oliveira, Melo, Billard, Paiva https://pdfs.semanticscholar.org/951e/14dbef0036fddbecb51f1577dd77c9cd2cf3.pdf?_ga=2.78226524.958697415.1583668154-397935340.1548854421

2015

Learning stochastic recurrent networks. Bayer, Osendorfer https://arxiv.org/pdf/1411.7610.pdf https://github.com/durner/STORN-keras

Deep Kalman filters. Krishnan, Shalit, Sontag https://arxiv.org/pdf/1511.05121.pdf https://github.com/k920049/Deep-Kalman-Filter

Unsupervised learning of visual representations using videos. Wang, Gupta https://arxiv.org/pdf/1505.00687.pdf

Embed to control: a locally linear latent dynamics model for control from raw images. Watter, Springenberg, Riedmiller, Boedecker https://arxiv.org/pdf/1506.07365.pdf https://github.com/ericjang/e2c

2014

Seeing the arrow of time. Pickup, Pan, Wei, Shih, Zhang, Zisserman, Scholkopf, Freeman https://www.robots.ox.ac.uk/~vgg/publications/2014/Pickup14/pickup14.pdf

2012

Activity forecasting. Kitani, Ziebart, Bagnell, Hebert http://www.cs.cmu.edu/~kkitani/pdf/KZBH-ECCV12.pdf

2006

Information flows in causal networks. Ay, Polani https://sfi-edu.s3.amazonaws.com/sfi-edu/production/uploads/sfi-com/dev/uploads/filer/45/5f/455fd460-b6b0-4008-9de1-825a5e2b9523/06-05-014.pdf

2002

Slow feature analysis. Wiskott, Sejnowski http://www.cnbc.cmu.edu/~tai/readings/learning/wiskott_sejnowski_2002.pdf

n.d.

Learning variational latent dynamics: towards model-based imitation and control. Yin, Melo, Billard, Paiva https://pdfs.semanticscholar.org/40af/a07f86a6f7c3ec2e4e02665073b1e19652bc.pdf