<br/> <p align="center"> <img src="https://raw.githubusercontent.com/donnemartin/data-science-ipython-notebooks/master/images/README_1200x800.gif"> </p> <p align="center"> <img src="https://raw.githubusercontent.com/donnemartin/data-science-ipython-notebooks/master/images/coversmall_alt.png"> <br/> </p>

data-science-ipython-notebooks

Index

<br/> <p align="center"> <img src="http://i.imgur.com/ZhKXrKZ.png"> </p>

deep-learning

IPython Notebook(s) demonstrating deep learning functionality.

<br/> <p align="center"> <img src="https://avatars0.githubusercontent.com/u/15658638?v=3&s=100"> </p>

tensor-flow-tutorials

Additional TensorFlow tutorials:

| Notebook | Description |
|---|---|
| tsf-basics | Learn basic operations in TensorFlow, a library for various kinds of perceptual and language understanding tasks from Google. |
| tsf-linear | Implement linear regression in TensorFlow. |
| tsf-logistic | Implement logistic regression in TensorFlow. |
| tsf-nn | Implement nearest neighbors in TensorFlow. |
| tsf-alex | Implement AlexNet in TensorFlow. |
| tsf-cnn | Implement convolutional neural networks in TensorFlow. |
| tsf-mlp | Implement multilayer perceptrons in TensorFlow. |
| tsf-rnn | Implement recurrent neural networks in TensorFlow. |
| tsf-gpu | Learn about basic multi-GPU computation in TensorFlow. |
| tsf-gviz | Learn about graph visualization in TensorFlow. |
| tsf-lviz | Learn about loss visualization in TensorFlow. |
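
To make the `tsf-linear` entry concrete, here is a minimal plain-Python sketch of what fitting a linear regression by gradient descent involves. It does not use TensorFlow; the function name `fit_line`, the learning rate, the step count, and the toy data are all assumptions chosen for illustration and are not taken from the notebook.

```python
def fit_line(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y ~ w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Residuals of the current line on every training point.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of mean((w * x + b - y) ** 2) with respect to w and b.
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical toy data lying exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

w, b = fit_line(xs, ys)
print("w=%.3f b=%.3f" % (w, b))
```

A framework such as TensorFlow computes the two gradients automatically; writing them out by hand here is only meant to show what the optimizer is doing.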

tensor-flow-exercises

| Notebook | Description |
|---|---|
| tsf-not-mnist | Learn simple data curation by creating a pickle with formatted datasets for training, development and testing in TensorFlow. |
| tsf-fully-connected | Progressively train deeper and more accurate models using logistic regression and neural networks in TensorFlow. |
| tsf-regularization | Explore regularization techniques by training fully connected networks to classify notMNIST characters in TensorFlow. |
| tsf-convolutions | Create convolutional neural networks in TensorFlow. |
| tsf-word2vec | Train a skip-gram model over Text8 data in TensorFlow. |
| tsf-lstm | Train an LSTM character model over Text8 data in TensorFlow. |

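
The skip-gram model mentioned for `tsf-word2vec` is trained on (center word, context word) pairs taken from a sliding window over the text. The following is a rough plain-Python sketch of that pair generation, assuming a symmetric window; `skipgram_pairs` and the sample sentence are hypothetical, and the notebook's own batching and sampling may well differ.

```python
def skipgram_pairs(tokens, window=1):
    """Return (center, context) pairs for every token and its neighbors."""
    pairs = []
    for i, center in enumerate(tokens):
        # Clamp the window to the sequence boundaries.
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical stand-in for a slice of the Text8 corpus.
tokens = "the quick brown fox".split()
pairs = skipgram_pairs(tokens, window=1)
for center, context in pairs:
    print("%s -> %s" % (center, context))
```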
<br/> <p align="center"> <img src="http://www.deeplearning.net/software/theano/_static/theano_logo_allblue_200x46.png"> </p>

theano-tutorials

| Notebook | Description |
|---|---|
| theano-intro | Intro to Theano, which allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. It can use GPUs and perform efficient symbolic differentiation. |
| theano-scan | Learn scans, a mechanism to perform loops in a Theano graph. |
| theano-logistic | Implement logistic regression in Theano. |
| theano-rnn | Implement recurrent neural networks in Theano. |
| theano-mlp | Implement multilayer perceptrons in Theano. |

<br/> <p align="center"> <img src="http://i.imgur.com/L45Q8c2.jpg"> </p>

keras-tutorials

| Notebook | Description |
|---|---|
| keras | Keras is an open source neural network library written in Python. It is capable of running on top of either TensorFlow or Theano. |
| setup | Learn about the tutorial goals and how to set up your Keras environment. |
| intro-deep-learning-ann | Get an intro to deep learning with Keras and Artificial Neural Networks (ANN). |
| theano | Learn about Theano by working with weights matrices and gradients. |
| keras-otto | Learn about Keras by looking at the Kaggle Otto challenge. |
| ann-mnist | Review a simple implementation of ANN for MNIST using Keras. |
| conv-nets | Learn about Convolutional Neural Networks (CNNs) with Keras. |
| conv-net-1 | Recognize handwritten digits from MNIST using Keras - Part 1. |
| conv-net-2 | Recognize handwritten digits from MNIST using Keras - Part 2. |
| keras-models | Use pre-trained models such as VGG16, VGG19, ResNet50, and Inception v3 with Keras. |
| auto-encoders | Learn about Autoencoders with Keras. |
| rnn-lstm | Learn about Recurrent Neural Networks (RNNs) with Keras. |
| lstm-sentence-gen | Learn about RNNs using Long Short Term Memory (LSTM) networks with Keras. |

deep-learning-misc

| Notebook | Description |
|---|---|
| deep-dream | Caffe-based computer vision program which uses a convolutional neural network to find and enhance patterns in images. |

<br/> <p align="center"> <img src="https://raw.githubusercontent.com/donnemartin/data-science-ipython-notebooks/master/images/scikitlearn.png"> </p>

scikit-learn

IPython Notebook(s) demonstrating scikit-learn functionality.

| Notebook | Description |
|---|---|
| intro | Intro notebook to scikit-learn, a Python machine learning library built on NumPy and SciPy that offers tools for classification, regression, clustering, and dimensionality reduction. |
| knn | Implement k-nearest neighbors in scikit-learn. |
| linear-reg | Implement linear regression in scikit-learn. |
| svm | Implement support vector machine classifiers with and without kernels in scikit-learn. |
| random-forest | Implement random forest classifiers and regressors in scikit-learn. |
| k-means | Implement k-means clustering in scikit-learn. |
| pca | Implement principal component analysis in scikit-learn. |
| gmm | Implement Gaussian mixture models in scikit-learn. |
| validation | Implement validation and model selection in scikit-learn. |

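
As a rough illustration of the idea behind the `knn` notebook, the sketch below classifies a point by majority vote among its k nearest training points. It is plain Python rather than scikit-learn code; `euclidean`, `knn_predict`, and the toy points are hypothetical names and data made up for this example.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(points, labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    ranked = sorted(zip(points, labels), key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-cluster training set.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 5.5), (6.5, 7.0)]
labels = ["a", "a", "a", "b", "b", "b"]

prediction = knn_predict(points, labels, (2.0, 2.0))
print(prediction)
```

A library implementation adds faster neighbor search and tie handling; the voting logic is the part this sketch is meant to show.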
<br/> <p align="center"> <img src="https://raw.githubusercontent.com/donnemartin/data-science-ipython-notebooks/master/images/scipy.png"> </p>

statistical-inference-scipy

IPython Notebook(s) demonstrating statistical inference with SciPy functionality.

| Notebook | Description |
|---|---|
| scipy | SciPy is a collection of mathematical algorithms and convenience functions built on the NumPy extension of Python. It adds significant power to the interactive Python session by providing the user with high-level commands and classes for manipulating and visualizing data. |
| effect-size | Explore statistics that quantify effect size by analyzing the difference in height between men and women. Uses data from the Behavioral Risk Factor Surveillance System (BRFSS) to estimate the mean and standard deviation of height for adult women and men in the United States. |
| sampling | Explore random sampling by analyzing the average weight of men and women in the United States using BRFSS data. |
| hypothesis | Explore hypothesis testing by analyzing the difference of first-born babies compared with others. |

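
One widely used effect-size statistic is Cohen's d, the difference between two group means divided by a pooled standard deviation. The sketch below is a minimal version using the Python 3 `statistics` module and the pooled sample variance; whether the `effect-size` notebook uses exactly this formula is an assumption, and the height values are invented for illustration, not BRFSS data.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Difference in means divided by the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    mean_diff = statistics.mean(group_a) - statistics.mean(group_b)
    return mean_diff / math.sqrt(pooled_var)


# Invented heights in centimeters, for illustration only.
men = [170.0, 175.0, 180.0, 185.0, 178.0]
women = [160.0, 165.0, 158.0, 170.0, 163.0]

d = cohens_d(men, women)
print("Cohen's d: %.2f" % d)
```

Because d is expressed in units of standard deviation, it can be compared across measurements that use different scales.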
<br/> <p align="center"> <img src="https://raw.githubusercontent.com/donnemartin/data-science-ipython-notebooks/master/images/pandas.png"> </p>

pandas

IPython Notebook(s) demonstrating pandas functionality.

| Notebook | Description |
|---|---|
| pandas | Software library written for data manipulation and analysis in Python. Offers data structures and operations for manipulating numerical tables and time series. |
| github-data-wrangling | Learn how to load, clean, merge, and feature engineer by analyzing GitHub data from the Viz repo. |
| Introduction-to-Pandas | Introduction to Pandas. |
| Introducing-Pandas-Objects | Learn about Pandas objects. |
| Data Indexing and Selection | Learn about data indexing and selection in Pandas. |
| Operations-in-Pandas | Learn about operating on data in Pandas. |
| Missing-Values | Learn about handling missing data in Pandas. |
| Hierarchical-Indexing | Learn about hierarchical indexing in Pandas. |
| Concat-And-Append | Learn about combining datasets: concat and append in Pandas. |
| Merge-and-Join | Learn about combining datasets: merge and join in Pandas. |
| Aggregation-and-Grouping | Learn about aggregation and grouping in Pandas. |
| Pivot-Tables | Learn about pivot tables in Pandas. |
| Working-With-Strings | Learn about vectorized string operations in Pandas. |
| Working-with-Time-Series | Learn about working with time series in Pandas. |
| Performance-Eval-and-Query | Learn about high-performance Pandas with eval() and query(). |

<br/> <p align="center"> <img src="https://raw.githubusercontent.com/donnemartin/data-science-ipython-notebooks/master/images/matplotlib.png"> </p>

matplotlib

IPython Notebook(s) demonstrating matplotlib functionality.

| Notebook | Description |
|---|---|
| matplotlib | Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. |
| matplotlib-applied | Apply matplotlib visualizations to Kaggle competitions for exploratory data analysis. Learn how to create bar plots, histograms, subplot2grid, normalized plots, scatter plots, subplots, and kernel density estimation plots. |
| Introduction-To-Matplotlib | Introduction to Matplotlib. |
| Simple-Line-Plots | Learn about simple line plots in Matplotlib. |
| Simple-Scatter-Plots | Learn about simple scatter plots in Matplotlib. |
| Errorbars.ipynb | Learn about visualizing errors in Matplotlib. |
| Density-and-Contour-Plots | Learn about density and contour plots in Matplotlib. |
| Histograms-and-Binnings | Learn about histograms, binnings, and density in Matplotlib. |
| Customizing-Legends | Learn about customizing plot legends in Matplotlib. |
| Customizing-Colorbars | Learn about customizing colorbars in Matplotlib. |
| Multiple-Subplots | Learn about multiple subplots in Matplotlib. |
| Text-and-Annotation | Learn about text and annotation in Matplotlib. |
| Customizing-Ticks | Learn about customizing ticks in Matplotlib. |
| Settings-and-Stylesheets | Learn about customizing Matplotlib: configurations and stylesheets. |
| Three-Dimensional-Plotting | Learn about three-dimensional plotting in Matplotlib. |
| Geographic-Data-With-Basemap | Learn about geographic data with basemap in Matplotlib. |
| Visualization-With-Seaborn | Learn about visualization with Seaborn. |

<br/> <p align="center"> <img src="https://raw.githubusercontent.com/donnemartin/data-science-ipython-notebooks/master/images/numpy.png"> </p>

numpy

IPython Notebook(s) demonstrating NumPy functionality.

| Notebook | Description |
|---|---|
| numpy | Adds Python support for large, multi-dimensional arrays and matrices, along with a large library of high-level mathematical functions to operate on these arrays. |
| Introduction-to-NumPy | Introduction to NumPy. |
| Understanding-Data-Types | Learn about data types in Python. |
| The-Basics-Of-NumPy-Arrays | Learn about the basics of NumPy arrays. |
| Computation-on-arrays-ufuncs | Learn about computations on NumPy arrays: universal functions. |
| Computation-on-arrays-aggregates | Learn about aggregations: min, max, and everything in between in NumPy. |
| Computation-on-arrays-broadcasting | Learn about computation on arrays: broadcasting in NumPy. |
| Boolean-Arrays-and-Masks | Learn about comparisons, masks, and boolean logic in NumPy. |
| Fancy-Indexing | Learn about fancy indexing in NumPy. |
| Sorting | Learn about sorting arrays in NumPy. |
| Structured-Data-NumPy | Learn about structured data: NumPy's structured arrays. |

<br/> <p align="center"> <img src="https://raw.githubusercontent.com/donnemartin/data-science-ipython-notebooks/master/images/python.png"> </p>

python-data

IPython Notebook(s) demonstrating Python functionality geared towards data analysis.

| Notebook | Description |
|---|---|
| data structures | Learn Python basics with tuples, lists, dicts, sets. |
| data structure utilities | Learn Python operations such as slice, range, xrange, bisect, sort, sorted, reversed, enumerate, zip, list comprehensions. |
| functions | Learn about more advanced Python features: functions as objects, lambda functions, closures, `*args`, `**kwargs`, currying, generators, generator expressions, itertools. |
| datetime | Learn how to work with Python dates and times: datetime, strftime, strptime, timedelta. |
| logging | Learn about Python logging with RotatingFileHandler and TimedRotatingFileHandler. |
| pdb | Learn how to debug in Python with the interactive source code debugger. |
| unit tests | Learn how to test in Python with Nose unit tests. |

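
Several of the utilities named in the table above fit into one short standard-library sketch. The variable names and sample values below are made up for illustration and do not come from the notebooks.

```python
import bisect
import itertools
from datetime import datetime, timedelta

# bisect keeps a sorted list sorted while inserting and searching.
scores = [10, 20, 30, 40]
bisect.insort(scores, 25)
position = bisect.bisect_left(scores, 30)  # index of the first value >= 30

# zip pairs two sequences; sorted with a key orders the pairs by age.
names = ["ada", "bob", "cy"]
ages = [36, 25, 41]
by_age = sorted(zip(names, ages), key=lambda pair: pair[1])

# enumerate inside a list comprehension attaches an index to each item.
indexed = [(i, name) for i, name in enumerate(names)]


def squares():
    """Generator producing 0, 1, 4, 9, ... without building a list."""
    n = 0
    while True:
        yield n * n
        n += 1


# itertools.islice takes a finite prefix of the infinite generator.
first_squares = list(itertools.islice(squares(), 5))

# strptime parses text into a datetime, timedelta shifts it, strftime formats it.
start = datetime.strptime("2015-03-01", "%Y-%m-%d")
deadline = (start + timedelta(days=45)).strftime("%Y-%m-%d")

print(scores, position, by_age, indexed, first_squares, deadline)
```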
<br/> <p align="center"> <img src="https://raw.githubusercontent.com/donnemartin/data-science-ipython-notebooks/master/images/kaggle.png"> </p>

kaggle-and-business-analyses

IPython Notebook(s) used in Kaggle competitions and business analyses.

| Notebook | Description |
|---|---|
| titanic | Predict survival on the Titanic. Learn data cleaning, exploratory data analysis, and machine learning. |
| churn-analysis | Predict customer churn. Exercise logistic regression, gradient boosting classifiers, support vector machines, random forests, and k-nearest-neighbors. Includes discussions of confusion matrices, ROC plots, feature importances, prediction probabilities, and calibration/discrimination. |

<br/> <p align="center"> <img src="https://raw.githubusercontent.com/donnemartin/data-science-ipython-notebooks/master/images/spark.png"> </p>

spark

IPython Notebook(s) demonstrating Spark and HDFS functionality.

| Notebook | Description |
|---|---|
| spark | In-memory cluster computing framework that can be up to 100 times faster than disk-based MapReduce for certain applications and is well suited to machine learning algorithms. |
| hdfs | Reliably stores very large files across machines in a large cluster. |

<br/> <p align="center"> <img src="https://raw.githubusercontent.com/donnemartin/data-science-ipython-notebooks/master/images/mrjob.png"> </p>

mapreduce-python

IPython Notebook(s) demonstrating Hadoop MapReduce with mrjob functionality.

| Notebook | Description |
|---|---|
| mapreduce-python | Runs MapReduce jobs in Python, executing jobs locally or on Hadoop clusters. Demonstrates Hadoop Streaming in Python code with a unit test and an mrjob config file to analyze Amazon S3 bucket logs on Elastic MapReduce. Disco is another Python-based alternative. |

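
The MapReduce pattern itself can be sketched without a cluster. The example below is plain Python and deliberately does not use the mrjob API; `mapper`, `reducer`, `run_job`, and the sample log lines are hypothetical stand-ins that only show the map, shuffle, and reduce phases.

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def reducer(word, counts):
    """Reduce phase: combine all counts emitted for one word."""
    return word, sum(counts)


def run_job(lines):
    """Run map, shuffle (group values by key), and reduce in one process."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())


# Hypothetical, heavily simplified request log lines.
lines = ["GET /index.html", "GET /about.html", "PUT /index.html"]
counts = run_job(lines)
print(counts)
```

On Hadoop the shuffle step is performed by the framework across machines; here a dictionary plays that role.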
<br/> <p align="center"> <img src="https://raw.githubusercontent.com/donnemartin/data-science-ipython-notebooks/master/images/aws.png"> </p>

aws

IPython Notebook(s) demonstrating Amazon Web Services (AWS) and AWS tools functionality.

Also check out:

| Notebook | Description |
|---|---|
| boto | Official AWS SDK for Python. |
| s3cmd | Interacts with S3 through the command line. |
| s3distcp | Combines smaller files and aggregates them together by taking in a pattern and target file. S3DistCp can also be used to transfer large volumes of data from S3 to your Hadoop cluster. |
| s3-parallel-put | Uploads multiple files to S3 in parallel. |
| redshift | Acts as a fast data warehouse built on top of massively parallel processing (MPP) technology. |
| kinesis | Streams data in real time with the ability to process thousands of data streams per second. |
| lambda | Runs code in response to events, automatically managing compute resources. |

<br/> <p align="center"> <img src="https://raw.githubusercontent.com/donnemartin/data-science-ipython-notebooks/master/images/commands.png"> </p>

commands

IPython Notebook(s) demonstrating various command lines for Linux, Git, etc.

| Notebook | Description |
|---|---|
| linux | Unix-like and mostly POSIX-compliant computer operating system. Covers disk usage, splitting files, grep, sed, curl, viewing running processes, terminal syntax highlighting, and Vim. |
| anaconda | Distribution of the Python programming language for large-scale data processing, predictive analytics, and scientific computing that aims to simplify package management and deployment. |
| ipython notebook | Web-based interactive computational environment where you can combine code execution, text, mathematics, plots and rich media into a single document. |
| git | Distributed revision control system with an emphasis on speed, data integrity, and support for distributed, non-linear workflows. |
| ruby | Used to interact with the AWS command line and for Jekyll, a blog framework that can be hosted on GitHub Pages. |
| jekyll | Simple, blog-aware, static site generator for personal, project, or organization sites. Renders Markdown or Textile and Liquid templates, and produces a complete, static website ready to be served by Apache HTTP Server, Nginx or another web server. |
| pelican | Python-based alternative to Jekyll. |
| django | High-level Python Web framework that encourages rapid development and clean, pragmatic design. It can be useful to share reports/analyses and for blogging. Lighter-weight alternatives include Pyramid, Flask, Tornado, and Bottle. |

misc

IPython Notebook(s) demonstrating miscellaneous functionality.

| Notebook | Description |
|---|---|
| regex | Regular expression cheat sheet useful in data wrangling. |
| algorithmia | Algorithmia is a marketplace for algorithms. This notebook showcases 4 different algorithms: Face Detection, Content Summarizer, Latent Dirichlet Allocation and Optical Character Recognition. |
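
As a small taste of regular expressions in data wrangling, the sketch below pulls structured fields out of messy text rows with the standard `re` module. The pattern, the field names, and the sample rows are invented for illustration and are not taken from the `regex` notebook.

```python
import re

# Named groups make the extracted fields self-describing.
ROW = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<user>\w+)\s+\$(?P<amount>\d+(?:\.\d{2})?)"
)

# Hypothetical raw rows with uneven spacing and one unparseable entry.
raw_rows = ["2015-01-03  alice  $12.50", "bad row", "2015-01-04 bob $7"]

records = []
for row in raw_rows:
    match = ROW.search(row)
    if match:  # skip rows that do not fit the expected shape
        records.append(
            (match.group("date"), match.group("user"), float(match.group("amount")))
        )

# Collapse runs of whitespace, a common clean-up step before parsing.
cleaned = re.sub(r"\s+", " ", "  too   many   spaces ").strip()

print(records)
print(cleaned)
```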

notebook-installation

anaconda

Anaconda is a free distribution of the Python programming language for large-scale data processing, predictive analytics, and scientific computing that aims to simplify package management and deployment.

Follow instructions to install Anaconda or the more lightweight miniconda.

dev-setup

For detailed instructions, scripts, and tools to set up your development environment for data analysis, check out the dev-setup repo.

running-notebooks

To view interactive content or to modify elements within the IPython notebooks, you must first clone or download the repository then run the notebook. More information on IPython Notebooks can be found here.

$ git clone https://github.com/donnemartin/data-science-ipython-notebooks.git
$ cd data-science-ipython-notebooks
$ jupyter notebook

Notebooks tested with Python 2.7.x.
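
A notebook file is a JSON document, so its contents can also be inspected without starting a server. The sketch below lists the source of each code cell, assuming the nbformat 4 layout with a top-level `cells` list; older notebooks may nest cells differently, and both `code_cells` and the sample notebook are hypothetical.

```python
import json


def code_cells(notebook_json):
    """Return the source text of every code cell in an nbformat 4 notebook."""
    notebook = json.loads(notebook_json)
    sources = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            # "source" may be a single string or a list of line strings.
            sources.append("".join(cell["source"]))
    return sources


# Hypothetical minimal notebook, inlined so the example is self-contained.
sample = json.dumps({
    "nbformat": 4,
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"]},
        {"cell_type": "code", "source": ["import math\n", "math.pi"]},
        {"cell_type": "code", "source": "print('hi')"},
    ],
})

cells = code_cells(sample)
for source in cells:
    print(source)
```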

credits

contributing

Contributions are welcome! For bug reports or requests please submit an issue.

contact-info

Feel free to contact me to discuss any issues, questions, or comments.

license

This repository contains a variety of content; some developed by Donne Martin, and some from third-parties. The third-party content is distributed under the license provided by those parties.

The content developed by Donne Martin is distributed under the following license:

I am providing code and resources in this repository to you under an open source license. Because this is my personal repository, the license you receive to my code and resources is from me and not my employer (Facebook).

Copyright 2015 Donne Martin

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.