
TensorFlow White Paper Notes

Features

To-do list

White Paper available at this link


TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

Abstract


1 Introduction


2 Programming Model and Basic Concepts

Operations and Kernels

| Category | Examples |
| --- | --- |
| Element-wise mathematical operations | Add, Sub, Mul, Div, Exp, Log, Greater, Less, Equal |
| Array operations | Concat, Slice, Split, Constant, Rank, Shape, Shuffle |
| Matrix operations | MatMul, MatrixInverse, MatrixDeterminant |
| Stateful operations | Variable, Assign, AssignAdd |
| Neural-net building blocks | SoftMax, Sigmoid, ReLU, Convolution2D, MaxPool |
| Checkpointing operations | Save, Restore |
| Queue and synchronization operations | Enqueue, Dequeue, MutexAcquire, MutexRelease |
| Control flow operations | Merge, Switch, Enter, Leave, NextIteration |

Check out this directory in the TensorFlow repository for kernel implementations
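As a rough illustration (my own, not from the paper), a few of the operations listed above can be composed into a graph using the 0.x-era Python API these notes assume; note that nothing runs until a Session executes the graph:

	import tensorflow as tf
	
	# Constant ops produce fixed tensors; Add and MatMul are operations from the table above
	a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
	b = tf.constant([[1.0, 1.0], [1.0, 1.0]])
	
	elementwise_sum = tf.add(a, b)    # element-wise Add
	matrix_product = tf.matmul(a, b)  # MatMul
	
	# Nothing has executed yet: these calls only add nodes to the default graph
	print [op.name for op in tf.get_default_graph().get_operations()]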

Sessions

Variables

See the official How-To to learn more about TensorFlow Variables
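For a quick feel of how Variables behave, here is a minimal sketch of my own (not the paper's), assuming the 0.x API with explicit initialization:

	import tensorflow as tf
	
	# A Variable is a stateful op whose value persists across Session.run() calls
	counter = tf.Variable(0, name="counter")
	increment = tf.assign_add(counter, 1)
	
	init = tf.initialize_all_variables()
	
	with tf.Session() as sess:
		sess.run(init)                # Variables must be initialized before use
		for _ in xrange(3):
			print sess.run(increment) # 1, 2, 3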


3 Implementation

Devices

Tensors

| Data type | Python type | Description |
| --- | --- | --- |
| DT_FLOAT | tf.float32 | 32-bit floating point |
| DT_DOUBLE | tf.float64 | 64-bit floating point |
| DT_INT8 | tf.int8 | 8-bit signed integer |
| DT_INT16 | tf.int16 | 16-bit signed integer |
| DT_INT32 | tf.int32 | 32-bit signed integer |
| DT_INT64 | tf.int64 | 64-bit signed integer |
| DT_UINT8 | tf.uint8 | 8-bit unsigned integer |
| DT_STRING | tf.string | Variable-length byte array; each element of a Tensor is a byte array |
| DT_BOOL | tf.bool | Boolean |
| DT_COMPLEX64 | tf.complex64 | Complex number made of two 32-bit floating point numbers: real and imaginary parts |
| DT_QINT8 | tf.qint8 | 8-bit signed integer used in quantized Ops |
| DT_QINT32 | tf.qint32 | 32-bit signed integer used in quantized Ops |
| DT_QUINT8 | tf.quint8 | 8-bit unsigned integer used in quantized Ops |
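A brief example of my own (not from the paper) showing how these data types appear in the Python API; tf.cast is used here just to illustrate converting between them:

	import tensorflow as tf
	
	ints = tf.constant([1, 2, 3], dtype=tf.int64)   # DT_INT64
	floats = tf.cast(ints, tf.float32)              # convert to DT_FLOAT
	
	print ints.dtype    # <dtype: 'int64'>
	print floats.dtype  # <dtype: 'float32'>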

3.1 Single-Device Execution

NOTE: To reiterate: in this context, "single device" means a single CPU core or a single GPU, not a single machine. Similarly, "multi-device" does not refer to multiple machines, but to multiple CPU cores and/or GPUs. See "3.3 Distributed Execution" for the multiple-machine discussion.

3.2 Multi-Device Execution

Node Placement

NOTE: At the moment, node placement is done by a simple_placer class which only considers explicit placement requirements provided by the user and implicit colocation constraints based on node type (see the documentation comments for details)

Cross-Device Communication

3.3 Distributed Execution

Fault Tolerance


4 Extensions

The following subsections describe advanced features and extensions of the programming model introduced in Section 2

4.1 Gradient Computation

4.2 Partial Execution

4.3 Device Constraints

Aside: I'm not sure whether this functionality is available in the open source implementation of TensorFlow yet; so far I can only find information about placing nodes on specific devices. Read more about manual device placement here. Let me know if you can find documentation for this feature! It is possible to provide partial constraints (https://www.tensorflow.org/versions/r0.11/how_tos/variables/index.html#device-placement), e.g. with tf.device("/job:ps/task:7") or with tf.device("/gpu:0").
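A minimal sketch (mine, not the paper's) of how those placement directives look in code; the parameter-server job name and GPU index are simply the examples quoted above:

	import tensorflow as tf
	
	# Partial constraint: pin the parameters to a parameter-server task
	with tf.device("/job:ps/task:7"):
		weights = tf.Variable(tf.zeros([784, 100]), name="weights")
	
	# Partial constraint: place the heavy math on the first GPU
	with tf.device("/gpu:0"):
		x = tf.placeholder(tf.float32, shape=[None, 784], name="x")
		logits = tf.matmul(x, weights)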

4.4 Control Flow

4.5 Input Operations

4.6 Queues

4.7 Containers


5 Optimizations

This section describes certain performance/resource usage optimizations used in the implementation of TensorFlow

5.1 Common Subexpression Elimination

5.2 Controlling Data Communication and Memory Usage

5.3 Asynchronous Kernels

5.4 Optimized Libraries for Kernel Implementations

5.5 Lossy Compression


6 Status and Experience

Advice and Lessons Learned

The following are "words of wisdom" drawn from the experience of porting Google's Inception neural network to TensorFlow. After doing so successfully, the team was rewarded with a 6-fold improvement in training time over DistBelief's implementation. This advice will hopefully be useful to others as they build their own models.

  1. Build tools to gain insight into the exact number of parameters in a given model
    • This can help you catch subtle flaws in a complex network architecture, such as operations and variables instantiated incorrectly
  2. Start small and scale up
    • The TensorFlow team started by importing a small network used by DistBelief
    • Debugging a small network gave insight into the edge cases of certain operations; tracking down the same issues in a larger network would have been nearly impossible
  3. Always ensure that the objective (loss function) matches between machine learning systems when learning is turned off
    • By setting the learning rate to zero (i.e. turning off learning), the TensorFlow team was able to identify unexpected behavior stemming from randomly initialized variables in the model
  4. Make a single machine implementation match before debugging a distributed implementation
    • This helped the TensorFlow team separate and debug differences in training performance between DistBelief and TensorFlow
    • Once the single machine implementation worked, they were able to find bugs related to race conditions and non-atomic operations in the distributed model
  5. Guard against numerical errors
    • Different numerical libraries handle non-finite floating point numbers differently
    • Checking for non-finite floating point values lets you detect errors in real time and guard against numerical instability (see the sketch after this list)
  6. Analyze pieces of a network and understand the magnitude of numerical error
    • By running subsections of the neural network on both DistBelief and TensorFlow in parallel, the team was able to ensure that the implemented algorithms were indeed identical
    • Note that because the networks used floating point numbers, there is a given amount of numerical error that should be expected and taken into account when comparing the two systems
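As a rough sketch of point 5 (my own example, not the paper's), TensorFlow's tf.check_numerics op raises an error as soon as a tensor contains NaN or Inf values; the placeholder shape and message below are illustrative:

	import tensorflow as tf
	
	x = tf.placeholder(tf.float32, shape=[None, 10], name="x")
	logits = tf.log(x)  # yields -inf/NaN for zero or negative inputs
	
	# Fails fast if logits contains any non-finite values
	checked_logits = tf.check_numerics(logits, "logits contained NaN or Inf")
	
	with tf.Session() as sess:
		sess.run(checked_logits, feed_dict={x: [[0.5] * 10]})   # runs fine
		# sess.run(checked_logits, feed_dict={x: [[0.0] * 10]}) # raises InvalidArgumentError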

7 Common Programming Idioms

This section describes how TensorFlow's basic dataflow graphs can be used to speed up training neural network models on large datasets using techniques developed by the TensorFlow team.

The techniques presented here assume that the model is using stochastic gradient descent with mini-batches of around 100-1000 examples

Data Parallel Training
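A rough sketch of the synchronous variant (my own illustration, not the paper's code): the parameters are shared, each device computes gradients on its slice of the mini-batch, and the averaged gradient is applied once per step. The two-GPU split and the toy linear model are arbitrary choices:

	import tensorflow as tf
	
	# Shared parameters for a toy linear model
	W = tf.Variable(tf.zeros([784, 10]), name="W")
	
	# One replica ("tower") per device, each fed a slice of the mini-batch
	tower_grads = []
	for i, device in enumerate(["/gpu:0", "/gpu:1"]):
		with tf.device(device):
			x = tf.placeholder(tf.float32, [None, 784], name="x_%d" % i)
			y = tf.placeholder(tf.float32, [None, 10], name="y_%d" % i)
			loss = tf.reduce_mean(tf.square(tf.matmul(x, W) - y))
			tower_grads.append(tf.gradients(loss, [W])[0])
	
	# Synchronous update: average the per-tower gradients, apply a single SGD step
	mean_grad = tf.add_n(tower_grads) / len(tower_grads)
	train_op = tf.assign_sub(W, 0.01 * mean_grad)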

Model Parallel Training

Concurrent Steps for Model Computation Pipelining


8 Performance

Stay tuned for future versions of the TensorFlow white paper, which will include performance evaluations for single machine and distributed implementations of TensorFlow


9 Tools

This section discusses additional tools, developed by the TensorFlow team, that work alongside the graph modeling and execution features described above.

9.1 TensorBoard: Visualization of Graph Structures and Summary Statistics

TensorBoard was designed to help users visualize the structure of their graphs, as well as understand the behavior of their models
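A minimal sketch (mine, not from the paper) of exporting a graph and scalar summaries for TensorBoard, assuming the 0.x-era summary API (tf.scalar_summary, tf.train.SummaryWriter); the log directory is an arbitrary choice:

	import tensorflow as tf
	
	x = tf.placeholder(tf.float32, name="x")
	loss = tf.square(x)
	
	# Summary ops record scalar statistics that TensorBoard plots over time
	tf.scalar_summary("loss", loss)
	merged = tf.merge_all_summaries()
	
	with tf.Session() as sess:
		# Passing the graph lets TensorBoard draw its structure
		writer = tf.train.SummaryWriter("/tmp/tf_logs", sess.graph)
		for step in xrange(10):
			summary, _ = sess.run([merged, loss], feed_dict={x: float(step)})
			writer.add_summary(summary, step)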

Visualization of Computation Graphs

Visualization of Summary Data

9.2 Performance tracing

The following is a brief overview of what EEG does under the hood

Please see pages 14 and 15 of the November 2015 white paper for a specific example of EEG visualization, along with descriptions of the current UI


10 Future Work

This section lists areas of improvement and extension for TensorFlow that the TensorFlow team has identified for future consideration

Extensions:

Improvements:


11 Related Work

Open source, single machine systems with portions of similar functionality

Systems designed primarily for neural networks:

Systems that support symbolic differentiation:

Systems with a core written in C++:

Comparisons with DistBelief and Project Adam

Similarities shared with DistBelief and Project Adam:

Differences between TensorFlow and DistBelief/Project Adam:

Comparison with the Halide image processing system

Related distributed dataflow graph systems

Systems that represent complex workflows as dataflow graphs

Systems that support data-dependent control flow

Systems optimized for accessing the same data repeatedly

Systems that execute dataflow graphs across heterogeneous devices, including GPUs

Features that TensorFlow incorporates from the above distributed systems

Feature implementations that are most similar to TensorFlow are listed after the feature


12 Conclusions


Figures

Figure 1: Example TensorFlow code fragment

	import tensorflow as tf
	
	# 100-d vector, init to zeros
	b = tf.Variable(tf.zeros([100]))
	
	# 784x100 matrix with random values
	W = tf.Variable(tf.random_uniform([784,100], -1, 1))
	
	# Placeholder for input
	x = tf.placeholder(name="x")
	
	# Rectified linear unit of (W*x +b)
	relu = tf.nn.relu(tf.matmul(W, x) + b)
	
	# Cost computed as a function of relu
	C = [...]
	
	# Instantiate a Session
	s = tf.Session()
	
	for step in xrange(0, 10):
		# Create a 100-d vector for input
		input = ...construct 100-D input array ...
		
		# Find the cost, using the constructed vector as the placeholder input
		result = s.run(C, feed_dict = {x: input})
		print step, result

Figure 2: Corresponding computation graph for Figure 1

<img src="http://cdn.rawgit.com/samjabrahams/tensorflow-white-pages-notes/master/img/figure2.svg" id="figure2" style="max-height: 300px"></img>

Figure 3: Single machine (left) and distributed system (right) structure

<img src="http://cdn.rawgit.com/samjabrahams/tensorflow-white-pages-notes/master/img/figure3.svg" id="figure3" style="max-height: 300px"></img>

Figure 4: Before and after insertion of Send/Receive nodes

<img src="http://cdn.rawgit.com/samjabrahams/tensorflow-white-pages-notes/master/img/figure4.svg" id="figure4" style="max-height: 300px"></img>

Figure 5: Gradients computed for graph in figure 2

<img src="http://cdn.rawgit.com/samjabrahams/tensorflow-white-pages-notes/master/img/figure5.svg" id="figure5" style="max-height: 300px"></img>

Figure 6: Before and after graph transformation for partial execution

<img src="http://cdn.rawgit.com/samjabrahams/tensorflow-white-pages-notes/master/img/figure6.svg" id="figure6" style="max-height: 300px"></img>

Figure 7: Synchronous and asynchronous data parallel training

<img src="http://cdn.rawgit.com/samjabrahams/tensorflow-white-pages-notes/master/img/figure7.svg" id="figure7" style="max-height: 300px"></img>

Figure 8: Model parallel training

<img src="http://cdn.rawgit.com/samjabrahams/tensorflow-white-pages-notes/master/img/figure8.svg" id="figure8" style="max-height: 300px"></img>

Figure 9: Concurrent steps

<img src="http://cdn.rawgit.com/samjabrahams/tensorflow-white-pages-notes/master/img/figure9.svg" id="figure9" style="max-height: 300px"></img>