
Awesome

An Aggregation of a Few Decent Data Science Quick References

<a name='glossary'/>

Glossary

Copyright (c) 2016 by H2O.ai team

Source: https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/tutorials/glossary.md

| Term | Definition |
| --- | --- |
| Autoencoder | An extension of the Deep Learning framework. Can be used to compress input features (similar to PCA). Sparse autoencoders are simple extensions that can increase accuracy. Use autoencoders for:<br>- generic dimensionality reduction (pre-processing for any algorithm)<br>- anomaly detection (comparing the reconstructed signal with the original to find differences that may be anomalies)<br>- layer-by-layer pre-training (using stacked autoencoders) |
| Backpropagation | Uses a known, desired output for each input value to calculate the loss function gradient for training. If enabled, performed after each training sample in Deep Learning. |
| Balanced classes | Oversampling the minority classes to balance the distribution. |
| Beta constraints | Supplied maximum and minimum values for the predictor (beta) parameters, typically for GLMs. |
| Binary | A variable with only two possible outcomes. Refer to binomial. |
| <a name="Binomial"></a>Binomial | A variable with the value 0 or 1. Binomial variables assigned 0 indicate that an event has not occurred or that the observation lacks a feature, while 1 indicates occurrence or presence of an attribute. |
| Bins | Bins are linear-sized from the observed min to max for the subset that is being split. Large bins are enforced for shallow tree depths. Based on the tree decisions, as the tree gets deeper, the bins are distributed symmetrically over the reduced range of each subset. |
| <a name="Categorical"></a>Categorical | A qualitative, unordered variable (for example, A, B, AB, and O would be values for the category blood type); synonym for enumerator or factor. |
| <a name="Classification"></a>Classification | A model where the output is a prediction of categorical values (class labels); the opposite of a regression model. |
| <a name="Cloud"></a>Cloud | Synonym for a linked cluster of computers. Refer to the definition for cluster. |
| <a name="Cluster"></a>Cluster | 1. A group of computing nodes that work together; when a job is submitted to a cluster, all the nodes in the cluster work on a portion of the job. Synonym for cloud.<br>2. In statistics, a group of observations in a data set identified as similar according to a particular clustering algorithm. |
| Confusion matrix | Table that depicts the performance of the algorithm using the true positive, true negative, false positive, and false negative rates (see the worked example after this table). |
| <a name="Continuous"></a>Continuous | A variable that can take on all or nearly all values along an interval on the real number line (for example, height or weight). The opposite of a discrete variable, which can only take on certain numerical values (for example, the number of patients treated). |
| CSV file | CSV is an acronym for comma-separated values. A CSV file stores data in a plain text format. |
| <a name="DL"></a>Deep Learning | Uses a composition of multiple non-linear transformations to model high-level abstractions in data. See also LeCun, Bengio, and Hinton (2015). |
| <a name="Dependent"></a>Dependent variable | The response column in a data set; what you are trying to measure, observe, or predict. The opposite of an independent variable. |
| Deviance | The difference between an expected value and an observed value. It plays a critical role in defining GLM models. |
| <a name="DistKV"></a>Distributed key/value (DKV) | Distributed key/value store. Refer also to key/value store. |
| <a name="Discrete"></a>Discrete | A variable that can only take on certain numerical values (for example, the number of patients treated). The opposite of a continuous variable. |
| <a name="Enum"></a>Enumerator/enum | A data type where the value is one of a defined set of named values known as "elements", "members", or "enumerators". For example, cat, dog, and mouse are enumerators of the enumerated type animal. |
| <a name="Epoch"></a>Epoch | A round or iteration of model training or testing. Refer also to iteration. |
| <a name="Factor"></a>Factor | A data type where the value is one of a defined set of categories. Refer to enum and categorical. |
| Family | The distribution options available for predictive modeling in GLM. |
| Feature | Synonym for attribute, predictor, or independent variable. Usually refers to the observed values given in the columns of a data set. |
| Feed-forward | A network architecture in which information moves in one direction, from input to output; associates inputs with outputs for pattern recognition. |
| Gzipped (gz) file | Gzip is a commonly used type of file compression. |
| HEX format | Records made up of hexadecimal numbers representing machine language code or constant data. |
| Hit ratio | (Multinomial only) The number of times the prediction was correct out of the total number of predictions (see the worked example after this table). |
| <a name="Independent"></a>Independent variable | A factor that can be manipulated or controlled (also known as a predictor). The opposite of a dependent variable. |
| Integer | A whole number (can be negative but cannot be a fraction). |
| <a name="Iteration"></a>Iteration | A round or instance of model testing or training. Also known as an epoch. |
| JVM | Java virtual machine. |
| Key/value pair | A type of data that associates a particular key index with a certain datum. |
| <a name="KVstore"></a>Key/value store | A tool that allows storage of schema-less data. Data usually consists of a string that represents the key and the data itself, which is the value. Refer also to distributed key/value. |
| L1 regularization | A regularization method that constrains the absolute value of the weights and has the net effect of dropping some values (setting them to zero) from a model to reduce complexity and avoid overfitting (see the penalty formulas after this table). |
| L2 regularization | A regularization method that constrains the sum of the squared weights. This method introduces bias into parameter estimates but frequently produces substantial gains in modeling as estimate variance is reduced (see the penalty formulas after this table). |
| Link function | A user-specified option in GLM; the function that links the linear predictor to the mean of the response distribution. |
| Loss function | The function minimized in order to achieve a desired estimator; synonymous with objective function and criterion function. For example, linear regression defines the best parameter estimates as the set that minimizes the sum of the squared errors, where the errors are the differences between the predicted and observed values (see the formulas after this table). |
| MSE | Mean squared error; the average of the squared errors, the differences between the observed and predicted values (see the formulas after this table). |
| Multinomial | A variable where the value can be one of more than two possible outcomes (for example, blood type). |
| N-folds | The user-defined number of cross-validation folds (models). |
| Node | In distributed computing systems, nodes include clients, servers, or peers. In statistics, a node is a decision or terminal point in a classification tree. |
| Numeric | A column type containing real numbers, small integers, or booleans. |
| Offset | A parameter that compensates for differences in units of observation (for example, different populations or geographic sizes) to make sure the outcome is proportional. |
| Parse | Analysis of a string of symbols or data that converts a set of information from a human-readable format to a machine-readable format. |
| POJO | Plain Old Java Object; a way to export a model and implement it in a Java application. |
| <a name="Regression"></a>Regression | A model where the input is numerical and the output is a prediction of numerical values. Also known as "quantitative"; the opposite of a classification model. |
| <a name="Response"></a>Response column | The dependent variable in a model or data set. |
| Real | A number that may include a fractional part. |
| ROC curve | Graph of the true positive rate versus the false positive rate. |
| Scoring history | Represents the error rate of the model as it is built. |
| Seed | A starting point for randomization. A seed is specified when machine learning models have a random component; it allows users to recreate the exact "random" conditions used in a model at a later time. |
| Separator | The character that separates entries in the data set; usually a comma, semicolon, or tab. |
| Sparse | A data set where many of the rows contain blank values or "NA" instead of data. |
| Standard deviation | The standard deviation of the data in the column, defined as the square root of the sum of the squared deviations of the observed values from the mean, divided by the number of elements in the column minus one (see the formula after this table). Abbreviated sd. |
| Standardization | Transformation of a variable so that it is centered at a mean of 0 and scaled by the standard deviation; helps prevent precision problems (see the formula after this table). |
| Supervised learning | Model type where the input data is labeled so that the algorithm can identify it and learn from it. |
| Unsupervised learning | Model type where the input data is not labeled. |
| Validation | An analysis of how well the model fits. |
| Variable importance | Represents the statistical significance of each variable in the data in terms of its effect on the model. |
| Weights | A parameter that specifies certain outcomes as more significant (for example, if you are trying to identify incidence of disease, one "positive" result can be more meaningful than 50 "negative" responses). Higher values indicate more importance. |
| XLS file | A Microsoft Excel 2003-2007 spreadsheet file format. |
| YARN | Yet Another Resource Negotiator; used to manage Hadoop clusters. |

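
As a point of reference for the loss function and MSE entries above, the usual squared-error formulas (standard textbook definitions, not specific to any H2O implementation) are:

$$
\mathrm{SSE} = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2,
\qquad
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2
$$

where $y_i$ is the observed value and $\hat{y}_i$ is the predicted value. Linear regression chooses the parameter estimates that minimize the SSE.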
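
Similarly, the L1 and L2 regularization entries describe penalty terms added to the loss function. In their common form, with $\lambda$ controlling the penalty strength and $\beta_j$ the model coefficients:

$$
\text{L1 penalty} = \lambda \sum_{j} \lvert \beta_j \rvert,
\qquad
\text{L2 penalty} = \lambda \sum_{j} \beta_j^{2}
$$

The L1 term tends to drive some coefficients exactly to zero, while the L2 term shrinks them toward zero without eliminating them, matching the behavior described in the two entries.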
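
The standard deviation and standardization entries describe their formulas in words; written out for a column with values $x_1, \dots, x_n$ and mean $\bar{x}$:

$$
s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}},
\qquad
z_i = \frac{x_i - \bar{x}}{s}
$$

where $s$ is the sample standard deviation and $z_i$ is the standardized value, centered at 0 and scaled by $s$.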
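
For the confusion matrix and hit ratio entries, the sketch below shows how the four counts are tallied for a binary classifier and how accuracy (the binary analogue of the multinomial hit ratio) follows from them. This is a minimal illustration in plain Python; the function name, labels, and data are hypothetical and not part of any H2O API.

```python
def confusion_counts(actual, predicted, positive=1):
    """Tally the four confusion-matrix cells for a binary classifier."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1   # predicted positive, actually positive
        elif p == positive and a != positive:
            fp += 1   # predicted positive, actually negative
        elif p != positive and a != positive:
            tn += 1   # predicted negative, actually negative
        else:
            fn += 1   # predicted negative, actually positive
    return tp, fp, tn, fn

# Hypothetical labels: 1 = event occurred, 0 = event did not occur
actual    = [1, 0, 1, 1, 0, 0]
predicted = [1, 0, 0, 1, 1, 0]

tp, fp, tn, fn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / len(actual)   # fraction of predictions that were correct

print(tp, fp, tn, fn)       # 2 1 2 1
print(round(accuracy, 3))   # 0.667
```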

<a name='best-practices'/>

Best Practices

Copyright (c) 2016 by SAS Institute

Source: https://github.com/sassoftware/enlighten-apply/tree/master/ML_tables

(Image: machine learning best-practices quick-reference table; see the source link above.)

<a name='algos-1'/>

Algorithms: Part 1

Copyright (c) 2016 by SAS Institute

Source: https://github.com/sassoftware/enlighten-apply/tree/master/ML_tables

(Image: machine learning algorithms quick-reference table, part 1; see the source link above.)

<a name='algos-2'/>

Algorithms: Part 2

Copyright (c) 2016 by SAS Institute

Source: https://github.com/sassoftware/enlighten-apply/tree/master/ML_tables

(Image: machine learning algorithms quick-reference table, part 2; see the source link above.)