Home

Awesome

transdim

MIT License Python 3.7 repo size GitHub stars

<h6 align="center">Made by Xinyu Chen • :globe_with_meridians: <a href="https://xinychen.github.io">https://xinychen.github.io</a></h6>

logo

Transportation data imputation (a.k.a., transdim).

Machine learning models make important developments in the field of spatiotemporal data modeling - like how to forecast near-future traffic states of road networks. But what happens when these models are built on incomplete data commonly collected from real-world systems (e.g., transportation system)?

<br>

Table of Content

<br>

About this Project

In the transdim project, we develop machine learning models to help address some of the toughest challenges of spatiotemporal data modeling - from missing data imputation to time series prediction. The strategic aim of this project is creating accurate and efficient solutions for spatiotemporal traffic data imputation and prediction tasks.

In a hurry? Please check out our contents as follows.

<br>

Tasks and Challenges

Missing data are there, whether we like them or not. The really interesting question is how to deal with incomplete data.

<p align="center"> <img align="middle" src="https://github.com/xinychen/transdim/blob/master/images/missing.png" width="800" /> </p> <p align = "center"> <b>Figure 1</b>: Two classical missing patterns in a spatiotemporal setting. </p>

We create three missing data mechanisms on real-world data.

<p align="center"> <img src="https://github.com/xinychen/transdim/blob/master/images/framework.png" alt="drawing" width="800"/> </p> <p align = "center"> <b>Figure 2</b>: Tensor completion framework for spatiotemporal missing traffic data imputation. </p> <p align="center"> <img align="middle" src="https://github.com/xinychen/transdim/blob/master/images/predictor-explained.png" width="700" /> </p> <p align = "center"> <b>Figure 3</b>: Illustration of our proposed Low-Rank Autoregressive Tensor Completion (LATC) imputer/predictor with a prediction window Ï„ (green nodes: observed values; white nodes: missing values; red nodes/panel: prediction; blue panel: training data to construct the tensor). </p> <br>

Implementation

Open data

In this project, we have adapted some publicly available data sets into our experiments. The original links for these data are summarized as follows,

For example, if you want to view or use these data sets, please download them at the ../datasets/ folder in advance, and then run the following codes in your Python console:

import scipy.io

tensor = scipy.io.loadmat('../datasets/Guangzhou-data-set/tensor.mat')
tensor = tensor['tensor']

In particular, if you are interested in large-scale traffic data, we recommend PeMS-4W/8W/12W and UTD19. For PeMS data, you can download the data from Zenodo and place them at the folder of datasets (data path example: ../datasets/California-data-set/pems-4w.csv). Then you can use Pandas to open data:

import pandas as pd

data = pd.read_csv('../datasets/California-data-set/pems-4w.csv', header = None)

For model evaluation, we mask certain entries of the "observed" data as missing values and then perform imputation for these "missing" values.

Model implementation

In our experiments, we implemented some machine learning models mainly on Numpy, and written these Python codes with Jupyter Notebook. If you want to evaluate these models, please download and run these notebooks directly (prerequisite: download the data sets in advance). In the following implementation, we have improved Python codes (in Jupyter Notebook) in terms of both readiability and efficiency.

Our proposed models are highlighted in bold fonts.

NotebookGuangzhouBirminghamHangzhouSeattleLondonNYCPacific
BPMF✅✅✅✅✅🔶🔶
TRMF✅🔶✅✅✅🔶🔶
BTRMF✅🔶✅✅✅🔶🔶
BTMF✅✅✅✅✅🔶🔶
BGCP✅✅✅✅✅✅✅
BATF✅✅✅✅✅✅✅
BTTF🔶🔶🔶🔶🔶✅✅
HaLRTC✅🔶✅✅✅✅✅
LRTC-TNN✅🔶✅✅🔶🔶🔶
NotebookGuangzhouBirminghamHangzhouSeattleLondonNYCPacific
TRMF✅🔶✅✅✅🔶🔶
BTRMF✅🔶✅✅✅🔶🔶
BTRTF🔶🔶🔶🔶🔶✅✅
BTMF✅🔶✅✅✅✅✅
BTTF🔶🔶🔶🔶🔶✅✅

For the implementation of these models, we use both dense_mat and sparse_mat (or dense_tensor and sparse_tensor) as inputs. However, it is not necessary by doing so if you do not hope to see the imputation/prediction performance in the iterative process, you can remove dense_mat (or dense_tensor) from the inputs of these algorithms.

Imputation/Prediction performance

example (a) Time series of actual and estimated speed within two weeks from August 1 to 14.

example (b) Time series of actual and estimated speed within two weeks from September 12 to 25.

The imputation performance of BGCP (CP rank r=15 and missing rate α=30%) under the fiber missing scenario with third-order tensor representation, where the estimated result of road segment #1 is selected as an example. In the both two panels, red rectangles represent fiber missing (i.e., speed observations are lost in a whole day).

example

example

example

<br>

Quick Start

This is an imputation example of Low-Rank Tensor Completion with Truncated Nuclear Norm minimization (LRTC-TNN). One notable thing is that unlike the complex equations in our paper, our Python implementation is extremely easy to work with.

import numpy as np
from numpy.linalg import inv as inv
def ten2mat(tensor, mode):
    return np.reshape(np.moveaxis(tensor, mode, 0), (tensor.shape[mode], -1), order = 'F')
def mat2ten(mat, tensor_size, mode):
    index = list()
    index.append(mode)
    for i in range(tensor_size.shape[0]):
        if i != mode:
            index.append(i)
    return np.moveaxis(np.reshape(mat, list(tensor_size[index]), order = 'F'), 0, mode)
def svt_tnn(mat, tau, theta):
    [m, n] = mat.shape
    if 2 * m < n:
        u, s, v = np.linalg.svd(mat @ mat.T, full_matrices = 0)
        s = np.sqrt(s)
        idx = np.sum(s > tau)
        mid = np.zeros(idx)
        mid[:theta] = 1
        mid[theta:idx] = (s[theta:idx] - tau) / s[theta:idx]
        return (u[:,:idx] @ np.diag(mid)) @ (u[:,:idx].T @ mat)
    elif m > 2 * n:
        return svt_tnn(mat.T, tau, theta).T
    u, s, v = np.linalg.svd(mat, full_matrices = 0)
    idx = np.sum(s > tau)
    vec = s[:idx].copy()
    vec[theta:] = s[theta:] - tau
    return u[:,:idx] @ np.diag(vec) @ v[:idx,:]
def compute_rmse(var, var_hat):
    return np.sqrt(np.sum((var - var_hat) ** 2) / var.shape[0])

def compute_mape(var, var_hat):
    return np.sum(np.abs(var - var_hat) / var) / var.shape[0]
def LRTC(dense_tensor, sparse_tensor, alpha, rho, theta, epsilon, maxiter):
    """Low-Rank Tensor Completion with Truncated Nuclear Norm, LRTC-TNN."""
    
    dim = np.array(sparse_tensor.shape)
    pos_missing = np.where(sparse_tensor == 0)
    pos_test = np.where((dense_tensor != 0) & (sparse_tensor == 0))
    dense_test = dense_tensor[pos_test]
    del dense_tensor
    
    X = np.zeros(np.insert(dim, 0, len(dim))) # \boldsymbol{\mathcal{X}}
    T = np.zeros(np.insert(dim, 0, len(dim))) # \boldsymbol{\mathcal{T}}
    Z = sparse_tensor.copy()
    last_tensor = sparse_tensor.copy()
    snorm = np.sqrt(np.sum(sparse_tensor ** 2))
    it = 0
    while True:
        rho = min(rho * 1.05, 1e5)
        for k in range(len(dim)):
            X[k] = mat2ten(svt_tnn(ten2mat(Z - T[k] / rho, k), alpha[k] / rho, np.int(np.ceil(theta * dim[k]))), dim, k)
        Z[pos_missing] = np.mean(X + T / rho, axis = 0)[pos_missing]
        T = T + rho * (X - np.broadcast_to(Z, np.insert(dim, 0, len(dim))))
        tensor_hat = np.einsum('k, kmnt -> mnt', alpha, X)
        tol = np.sqrt(np.sum((tensor_hat - last_tensor) ** 2)) / snorm
        last_tensor = tensor_hat.copy()
        it += 1
        if (it + 1) % 50 == 0:
            print('Iter: {}'.format(it + 1))
            print('MAPE: {:.6}'.format(compute_mape(dense_test, tensor_hat[pos_test])))
            print('RMSE: {:.6}'.format(compute_rmse(dense_test, tensor_hat[pos_test])))
            print()
        if (tol < epsilon) or (it >= maxiter):
            break

    print('Imputation MAPE: {:.6}'.format(compute_mape(dense_test, tensor_hat[pos_test])))
    print('Imputation RMSE: {:.6}'.format(compute_rmse(dense_test, tensor_hat[pos_test])))
    print()
    
    return tensor_hat
import scipy.io

import scipy.io
import numpy as np
np.random.seed(1000)

dense_tensor = scipy.io.loadmat('../datasets/Guangzhou-data-set/tensor.mat')['tensor']
dim = dense_tensor.shape
missing_rate = 0.2 # Random missing (RM)
sparse_tensor = dense_tensor * np.round(np.random.rand(dim[0], dim[1], dim[2]) + 0.5 - missing_rate)
import time
start = time.time()
alpha = np.ones(3) / 3
rho = 1e-5
theta = 0.30
epsilon = 1e-4
maxiter = 200
tensor_hat = LRTC(dense_tensor, sparse_tensor, alpha, rho, theta, epsilon, maxiter)
end = time.time()
print('Running time: %d seconds'%(end - start))

This example is from ../imputer/LRTC-TNN.ipynb, you can check out this Jupyter Notebook for details.

<br>

Documentation

  1. Intuitive understanding of randomized singular value decomposition. July 1, 2020.
  2. Generating random numbers and arrays in Matlab and Numpy. October 9, 2021.
  3. Reduced-rank vector autoregressive model for high-dimensional time series forecasting. October 16, 2021.
  4. Dynamic mode decomposition for spatiotemporal traffic speed time series in Seattle freeway. October 29, 2021.
  5. Analyzing missing data problem in Uber movement speed data. February 14, 2022.
  6. Using conjugate gradient to solve matrix equations. February 23, 2022.
  7. Inpainting fluid dynamics with tensor decomposition (NumPy). March 15, 2022.
  8. Temporal matrix factorization for multivariate time series forecasting. March 20, 2022.
  9. Forecasting multivariate time series with nonstationary temporal matrix factorization. April 25, 2022.
  10. Implementing Kronecker product decomposition with NumPy. June 20, 2022.
  11. Tensor autoregression: A multidimensional time series model. September 3, 2022.
  12. Reproducing dynamic mode decomposition on fluid flow data in Python. September 6, 2022.
  13. Convolution nuclear norm minimization for time series modeling. October 3, 2022.
  14. Reinforce matrix factorization for time series modeling: Probabilistic sequential matrix factorization. October 5, 2022.
  15. Discrete convolution and fast Fourier transform explained and implemented step by step. October 19, 2022.
  16. Matrix factorization for image inpainting in Python. December 8, 2022.
  17. Circulant matrix nuclear norm minimization for image inpainting in Python. December 9, 2022.
  18. Low-rank Laplacian convolution model for time series imputation and image inpainting. December 10, 2022.
  19. Low-rank Laplacian convolution model for color image inpainting. December 17, 2022.
  20. Intuitive understanding of tensors in machine learning. January 20, 2023.
  21. Low-rank matrix and tensor factorization for speed field reconstruction. March 9, 2023.
  22. Bayesian vector autoregression forecasting
  23. Structured low-rank matrix completion
<br>

Publications

<br>

Collaborators

<table> <tr> <td align="center"><a href="https://github.com/xinychen"><img src="https://github.com/xinychen.png?size=80" width="80px;" alt="Xinyu Chen"/><br /><sub><b>Xinyu Chen</b></sub></a><br /><a href="https://github.com/xinychen/transdim/commits?author=xinychen" title="Code">💻</a></td> <td align="center"><a href="https://github.com/Vadermit"><img src="https://github.com/Vadermit.png?size=80" width="80px;" alt="Jinming Yang"/><br /><sub><b>Jinming Yang</b></sub></a><br /><a href="https://github.com/xinychen/transdim/commits?author=Vadermit" title="Code">💻</a></td> <td align="center"><a href="https://github.com/yxnchen"><img src="https://github.com/yxnchen.png?size=80" width="80px;" alt="Yixian Chen"/><br /><sub><b>Yixian Chen</b></sub></a><br /><a href="https://github.com/xinychen/transdim/commits?author=yxnchen" title="Code">💻</a></td> <td align="center"><a href="https://github.com/MengyingLei"><img src="https://github.com/MengyingLei.png?size=80" width="80px;" alt="Mengying Lei"/><br /><sub><b>Mengying Lei</b></sub></a><br /><a href="https://github.com/xinychen/transdim/commits?author=MengyingLei" title="Code">💻</a></td> <!-- <td align="center"><a href="https://github.com/lijunsun"><img src="https://github.com/lijunsun.png?size=80" width="80px;" alt="Lijun Sun"/><br /><sub><b>Lijun Sun</b></sub></a><br /><a href="https://github.com/xinychen/transdim/commits?author=lijunsun" title="Code">💻</a></td> <td align="center"><a href="https://github.com/HanTY"><img src="https://github.com/HanTY.png?size=80" width="80px;" alt="Tianyang Han"/><br /><sub><b>Tianyang Han</b></sub></a><br /><a href="https://github.com/xinychen/transdim/commits?author=HanTY" title="Code">💻</a></td> --> <!-- </tr> <tr> <td align="center"><a href="https://github.com/xxxx"><img src="https://github.com/xxxx.png?size=100" width="100px;" alt="xxxx"/><br /><sub><b>xxxx</b></sub></a><br /><a href="https://github.com/xinychen/transdim/commits?author=xxxx" title="Code">💻</a></td> --> </tr> </table> <table> <tr> <td align="center"><a href="https://github.com/lijunsun"><img src="https://github.com/lijunsun.png?size=80" width="80px;" alt="Lijun Sun"/><br /><sub><b>Lijun Sun</b></sub></a><br /><a href="https://github.com/xinychen/transdim/commits?author=lijunsun" title="Code">💻</a></td> <td align="center"><a href="https://github.com/nsaunier"><img src="https://github.com/nsaunier.png?size=80" width="80px;" alt="Nicolas Saunier"/><br /><sub><b>Nicolas Saunier</b></sub></a><br /><a href="https://github.com/xinychen/transdim/commits?author=nsaunier" title="Code">💻</a></td> </tr> </table>

See the list of contributors who participated in this project.

<br>

Supported by

<a href="https://ivado.ca/en"> <img align="middle" src="https://github.com/xinychen/tracebase/blob/main/graphics/ivado_logo.jpeg" alt="drawing" height="70" hspace="50"> </a> <a href="https://www.cirrelt.ca/"> <img align="middle" src="https://github.com/xinychen/tracebase/blob/main/graphics/cirrelt_logo.png" alt="drawing" height="50"> </a> <br>

License

This work is released under the MIT license.