
Video of our AI workshop conducted on Aug 30, 2020

A/B testing

A statistical way of comparing two (or more) techniques, typically an incumbent against a new rival. A/B testing aims not only to determine which technique performs better but also to establish whether the difference is statistically significant. A/B testing usually considers only two techniques using one measurement, but it can be applied to any finite number of techniques and measures.
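One common way to check significance when the measurement is a success rate is a two-proportion z-test. The sketch below is illustrative only: the function name and the conversion counts are hypothetical, and other tests may suit other measurements better.

```python
import math


def two_proportion_z_test(successes_a, trials_a, successes_b, trials_b):
    """Return (z, two_sided_p_value) for the difference between two success rates."""
    rate_a = successes_a / trials_a
    rate_b = successes_b / trials_b
    # Pooled rate under the null hypothesis that A and B perform equally well.
    pooled = (successes_a + successes_b) / (trials_a + trials_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / trials_a + 1 / trials_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal CDF, expressed with erf.
    normal_cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - normal_cdf)
    return z, p_value


# Hypothetical counts: incumbent A converts 200 of 2000, rival B converts 250 of 2000.
z, p = two_proportion_z_test(200, 2000, 250, 2000)
```

A small p-value (for example, below a threshold chosen in advance such as 0.05) suggests the observed difference is unlikely to be due to chance alone.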

accuracy

The fraction of predictions that a classification model got right.

action

In reinforcement learning, the mechanism by which the agent transitions between states of the environment. The agent chooses the action by using a policy.

activation function

A function (for example, ReLU or sigmoid) that takes in the weighted sum of all of the inputs from the previous layer and then generates and passes an output value (typically nonlinear) to the next layer.
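A minimal sketch of the two activation functions named above, applied to a weighted sum of inputs. The `neuron_output` helper and its `bias` argument are assumptions added for illustration.

```python
import math


def relu(x):
    """Rectified linear unit: passes positive values, zeroes out the rest."""
    return max(0.0, x)


def sigmoid(x):
    """Squashes any real value into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron_output(inputs, weights, bias, activation):
    """Apply an activation function to the weighted sum of the inputs."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)
```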

active learning

A training approach in which the algorithm chooses some of the data it learns from. Active learning is particularly valuable when labeled examples are scarce or expensive to obtain. Instead of blindly seeking a diverse range of labeled examples, an active learning algorithm selectively seeks the particular range of examples it needs for learning.

AdaGrad

A sophisticated gradient descent algorithm that rescales the gradients of each parameter, effectively giving each parameter an independent learning rate.
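The rescaling can be sketched as follows, assuming the commonly described form of AdaGrad in which each parameter accumulates its own squared gradients. The function name, default learning rate, and `epsilon` value are illustrative assumptions.

```python
import math


def adagrad_step(params, grads, accum, learning_rate=0.1, epsilon=1e-8):
    """Perform one AdaGrad update; returns (new_params, new_accum)."""
    new_params, new_accum = [], []
    for p, g, a in zip(params, grads, accum):
        # Running sum of squared gradients for this parameter only.
        a = a + g * g
        # Dividing by sqrt(a) gives each parameter its own effective learning rate.
        new_params.append(p - learning_rate * g / (math.sqrt(a) + epsilon))
        new_accum.append(a)
    return new_params, new_accum
```

On the first step, a parameter with gradient 10.0 and one with gradient 0.1 both move by roughly the learning rate, which shows how the rescaling evens out very different gradient magnitudes.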

agent

In reinforcement learning, the entity that uses a policy to maximize expected return gained from transitioning between states of the environment.

agglomerative clustering

Agglomerative clustering first assigns every example to its own cluster, and iteratively merges the closest clusters to create a hierarchical tree.

AR

Abbreviation for augmented reality.

area under the PR curve

Area under the interpolated precision-recall curve, obtained by plotting (recall, precision) points for different values of the classification threshold. Depending on how it's calculated, PR AUC may be equivalent to the average precision of the model.

area under the ROC curve

An evaluation metric that considers all possible classification thresholds.
The Area Under the ROC curve is the probability that a classifier will be more confident that a randomly chosen positive example is actually positive than that a randomly chosen negative example is positive.
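The probabilistic interpretation above can be computed directly by comparing every positive example with every negative example. This is a brute-force sketch for small datasets; the function name is hypothetical, and counting ties as half a win is a common convention assumed here.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5  # Assumed convention: a tie counts as half a win.
    return wins / (len(positives) * len(negatives))
```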

augmented reality

A technology that superimposes a computer-generated image on a user's view of the real world, thus providing a composite view.

automation bias

When a human decision maker favors recommendations made by an automated decision-making system over information made without automation, even when the automated decision-making system makes errors.

average precision

A metric for summarizing the performance of a ranked sequence of results. Average precision is calculated by taking the average of the precision values for each relevant result (each result in the ranked list where the recall increases relative to the previous result).
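A minimal sketch of this calculation, assuming the ranked results are given as a list of booleans (True for a relevant result) and that every relevant result appears somewhere in the list. The function name is hypothetical.

```python
def average_precision(relevance):
    """Average the precision measured at each relevant result in a ranked list."""
    hits = 0
    precisions = []
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision at this rank: relevant results so far / results so far.
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0
```

For the ranking [relevant, not relevant, relevant, not relevant], the precisions at the two relevant positions are 1/1 and 2/3, so the average precision is 5/6.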

backpropagation

The primary algorithm for performing gradient descent on neural networks. First, the output values of each node are calculated (and cached) in a forward pass. Then, the partial derivative of the error with respect to each parameter is calculated in a backward pass through the graph.

bag of words

A representation of the words in a phrase or passage, irrespective of order. For example, bag of words represents the following three phrases identically:
the dog jumps
jumps the dog
dog jumps the
Each word is mapped to an index in a sparse vector, where the vector has an index for every word in the vocabulary. For example, the phrase "the dog jumps" is mapped into a feature vector with nonzero values at the three indices corresponding to the words "the", "dog", and "jumps". The nonzero value can be any of the following:
A 1 to indicate the presence of a word.
A count of the number of times a word appears in the bag. For example, if the phrase were "the maroon dog is a dog with maroon fur", then both "maroon" and "dog" would be represented as 2, while the other words would be represented as 1.
Some other value, such as the logarithm of the count of the number of times a word appears in the bag.
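The count-based variant can be sketched as follows. The sparse vector is represented here as a dictionary from vocabulary index to count, and the function name and whitespace tokenization are simplifying assumptions.

```python
from collections import Counter


def bag_of_words(phrase, vocabulary):
    """Map a phrase to a sparse {index: count} vector, ignoring word order."""
    counts = Counter(phrase.lower().split())
    # Only indices of words that actually occur are stored (sparse representation).
    return {vocabulary[word]: n for word, n in counts.items() if word in vocabulary}


vocabulary = {"the": 0, "dog": 1, "jumps": 2}
same = bag_of_words("the dog jumps", vocabulary) == bag_of_words("jumps the dog", vocabulary)
```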

baseline

A model used as a reference point for comparing how well another model (typically, a more complex one) is performing. For example, a logistic regression model might serve as a good baseline for a deep model.
For a particular problem, the baseline helps model developers quantify the minimal expected performance that a new model must achieve for the new model to be useful.

batch

The set of examples used in one iteration (that is, one gradient update) of model training.

batch normalization

Normalizing the input or output of the activation functions in a hidden layer. Batch normalization can provide the following benefits:
Make neural networks more stable by protecting against outlier weights.
Enable higher learning rates.
Reduce overfitting.

batch size

The number of examples in a batch. For example, the batch size of SGD is 1, while the batch size of a mini-batch is usually between 10 and 1000. Batch size is usually fixed during training and inference; however, TensorFlow does permit dynamic batch sizes.

Bayesian neural network

A probabilistic neural network that accounts for uncertainty in weights and outputs. A standard neural network regression model typically predicts a scalar value; for example, a model predicts a house price of 853,000. By contrast, a Bayesian neural network predicts a distribution of values; for example, a model predicts a house price of 853,000 with a standard deviation of 67,200. A Bayesian neural network relies on Bayes' Theorem to calculate uncertainties in weights and predictions. A Bayesian neural network can be useful when it is important to quantify uncertainty, such as in models related to pharmaceuticals. Bayesian neural networks can also help prevent overfitting.

Bellman equation

An identity satisfied by the optimal Q-function in reinforcement learning. Reinforcement learning algorithms apply the Bellman equation to create Q-learning.
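For reference, the identity is commonly written as shown below. The notation is an assumption, since the text above does not define any symbols: s and a are the current state and action, r(s, a) is the expected reward, \gamma is the discount factor, and s' and a' are the next state and action.

```latex
Q(s, a) = r(s, a) + \gamma \, \mathbb{E}_{s' \mid s, a}\left[ \max_{a'} Q(s', a') \right]
```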

bias (ethics/fairness)

1. Stereotyping, prejudice or favoritism towards some things, people, or groups over others. These biases can affect collection and interpretation of data, the design of a system, and how users interact with a system. Forms of this type of bias include:
automation bias
confirmation bias
experimenter’s bias
group attribution bias
implicit bias
in-group bias
out-group homogeneity bias
2. Systematic error introduced by a sampling or reporting procedure. Forms of this type of bias include:
coverage bias
nonresponse bias
participation bias
reporting bias
sampling bias
selection bias
Not to be confused with the bias term in machine learning models or prediction bias.

bias (math)

An intercept or offset from an origin. Bias (also known as the bias term) is referred to as b or w0 in machine learning models.

bigram

An N-gram in which N=2.

binary classification

A type of classification task that outputs one of two mutually exclusive classes. For example, a machine learning model that evaluates email messages and outputs either "spam" or "not spam" is a binary classifier.

binning

Converting a (usually continuous) feature into multiple binary features called buckets or bins, typically based on value range. For example, instead of representing temperature as a single continuous floating-point feature, you could chop ranges of temperatures into discrete bins. Given temperature data sensitive to a tenth of a degree, all temperatures between 0.0 and 15.0 degrees could be put into one bin, 15.1 to 30.0 degrees could be a second bin, and 30.1 to 50.0 degrees could be a third bin.
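The temperature example can be sketched as a function that turns one continuous value into three binary features. The function name is hypothetical, and the sketch assumes readings are already rounded to a tenth of a degree.

```python
def temperature_bin(temp):
    """Return one binary feature per bin; exactly one is 1 for in-range readings."""
    bins = [(0.0, 15.0), (15.1, 30.0), (30.1, 50.0)]
    return [1 if low <= temp <= high else 0 for low, high in bins]
```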

boosting

A machine learning technique that iteratively combines a set of simple and not very accurate classifiers (referred to as "weak" classifiers) into a classifier with high accuracy (a "strong" classifier) by upweighting the examples that the model is currently misclassifying.

bounding box

In an image, the (x, y) coordinates of a rectangle around an area of interest.

broadcasting

Expanding the shape of an operand in a matrix math operation to dimensions compatible for that operation. For instance, linear algebra requires that the two operands in a matrix addition operation must have the same dimensions. Consequently, you can't add a matrix of shape (m, n) to a vector of length n. Broadcasting enables this operation by virtually expanding the vector of length n to a matrix of shape (m, n) by replicating the same values down each column.
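The effect of broadcasting can be sketched in plain Python, where the vector is reused for every row instead of being copied into a full matrix. The function name is hypothetical; libraries such as NumPy apply this expansion automatically.

```python
def add_broadcast(matrix, vector):
    """Add a length-n vector to every row of an (m, n) matrix."""
    n = len(vector)
    if any(len(row) != n for row in matrix):
        raise ValueError("every matrix row must have the same length as the vector")
    # The vector is "virtually" replicated: each row is paired with the same values.
    return [[cell + v for cell, v in zip(row, vector)] for row in matrix]
```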

bucketing

Synonym for binning.

calibration layer

A post-prediction adjustment, typically to account for prediction bias. The adjusted predictions and probabilities should match the distribution of an observed set of labels.

candidate generation

The initial set of recommendations chosen by a recommendation system. For example, consider a bookstore that offers 100,000 titles. The candidate generation phase creates a much smaller list of suitable books for a particular user, say 500. But even 500 books is way too many to recommend to a user. Subsequent, more expensive, phases of a recommendation system (such as scoring and reranking) whittle down those 500 to a much smaller, more useful set of recommendations.

candidate sampling

A training-time optimization in which a probability is calculated for all the positive labels, using, for example, softmax, but only for a random sample of negative labels. For example, if we have an example labeled beagle and dog, candidate sampling computes the predicted probabilities and corresponding loss terms for the beagle and dog class outputs in addition to a random subset of the remaining classes (cat, lollipop, fence). The idea is that the negative classes can learn from less frequent negative reinforcement as long as positive classes always get proper positive reinforcement, and this is indeed observed empirically. The motivation for candidate sampling is a computational efficiency win from not computing predictions for all negatives.

categorical data

Features having a discrete set of possible values. For example, consider a categorical feature named house style, which has a discrete set of three possible values: Tudor, ranch, colonial. By representing house style as categorical data, the model can learn the separate impacts of Tudor, ranch, and colonial on house price.
Sometimes, values in the discrete set are mutually exclusive, and only one value can be applied to a given example. For example, a car maker categorical feature would probably permit only a single value (Toyota) per example. Other times, more than one value may be applicable. A single car could be painted more than one different color, so a car color categorical feature would likely permit a single example to have multiple values (for example, red and white).
Categorical features are sometimes called discrete features.
Contrast with numerical data.

centroid

The center of a cluster as determined by a k-means or k-median algorithm. For instance, if k is 3, then the k-means or k-median algorithm finds 3 centroids.

centroid-based clustering

A category of clustering algorithms that organizes data into nonhierarchical clusters. k-means is the most widely used centroid-based clustering algorithm.
Contrast with hierarchical clustering algorithms.

checkpoint

Data that captures the state of the variables of a model at a particular time. Checkpoints enable exporting model weights, as well as performing training across multiple sessions. Checkpoints also enable training to continue past errors (for example, job preemption). Note that the graph itself is not included in a checkpoint.

class

One of a set of enumerated target values for a label. For example, in a binary classification model that detects spam, the two classes are spam and not spam. In a multiclass classification model that identifies dog breeds, the classes would be poodle, beagle, pug, and so on.

classification model

A type of machine learning model for distinguishing among two or more discrete classes. For example, a natural language processing classification model could determine whether an input sentence was in French, Spanish, or Italian. Compare with regression model.

classification threshold

A scalar-value criterion that is applied to a model's predicted score in order to separate the positive class from the negative class. Used when mapping logistic regression results to binary classification. For example, consider a logistic regression model that determines the probability of a given email message being spam. If the classification threshold is 0.9, then logistic regression values above 0.9 are classified as spam and those below 0.9 are classified as not spam.

class-imbalanced dataset

A binary classification problem in which the labels for the two classes have significantly different frequencies. For example, a disease dataset in which 0.0001 of examples have positive labels and 0.9999 have negative labels is a class-imbalanced problem, but a football game predictor in which 0.51 of examples label one team winning and 0.49 label the other team winning is not a class-imbalanced problem.

clipping

A technique for handling outliers. Specifically, reducing feature values that are greater than a set maximum value down to that maximum value. Also, increasing feature values that are less than a specific minimum value up to that minimum value.
For example, suppose that only a few feature values fall outside the range 40–60. In this case, you could do the following:
Clip all values over 60 to be exactly 60.
Clip all values under 40 to be exactly 40.
In addition to bringing input values within a designated range, clipping can also be used to force gradient values within a designated range during training.
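The 40–60 example above can be sketched as follows. The function name and default bounds are illustrative.

```python
def clip(values, minimum=40.0, maximum=60.0):
    """Raise values below the minimum and lower values above the maximum."""
    return [min(max(v, minimum), maximum) for v in values]
```

The same function could be applied to a list of gradient values with bounds such as -1.0 and 1.0.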

Cloud TPU

A specialized hardware accelerator designed to speed up machine learning workloads on Google Cloud Platform.

clustering

Grouping related examples, particularly during unsupervised learning. Once all the examples are grouped, a human can optionally supply meaning to each cluster.
Many clustering algorithms exist. For example, the k-means algorithm clusters examples based on their proximity to a centroid.

co-adaptation

When neurons predict patterns in training data by relying almost exclusively on outputs of specific other neurons instead of relying on the network's behavior as a whole. When the patterns that cause co-adaptation are not present in validation data, then co-adaptation causes overfitting. Dropout regularization reduces co-adaptation because dropout ensures neurons cannot rely solely on specific other neurons.

collaborative filtering

Making predictions about the interests of one user based on the interests of many other users. Collaborative filtering is often used in recommendation systems.

confirmation bias

The tendency to search for, interpret, favor, and recall information in a way that confirms one's preexisting beliefs or hypotheses. Machine learning developers may inadvertently collect or label data in ways that influence an outcome supporting their existing beliefs. Confirmation bias is a form of implicit bias.
Experimenter's bias is a form of confirmation bias in which an experimenter continues training models until a preexisting hypothesis is confirmed.

confusion matrix

An NxN table that summarizes how successful a classification model's predictions were; that is, the correlation between the label and the model's classification. One axis of a confusion matrix is the label that the model predicted, and the other axis is the actual label. N represents the number of classes. In a binary classification problem, N=2. For example, here is a sample confusion matrix for a binary classification problem:
                     Tumor (predicted)    Non-Tumor (predicted)
Tumor (actual)       18                   1
Non-Tumor (actual)   6                    452
The preceding confusion matrix shows that of the 19 samples that actually had tumors, the model correctly classified 18 as having tumors (18 true positives), and incorrectly classified 1 as not having a tumor (1 false negative). Similarly, of 458 samples that actually did not have tumors, 452 were correctly classified (452 true negatives) and 6 were incorrectly classified (6 false positives).
The confusion matrix for a multiclass classification problem can help you determine mistake patterns. For example, a confusion matrix could reveal that a model trained to recognize handwritten digits tends to mistakenly predict 9 instead of 4, or 1 instead of 7.
Confusion matrices contain sufficient information to calculate a variety of performance metrics, including precision and recall.
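As a worked example, the counts in the sample confusion matrix above yield the following metrics. The function name is hypothetical.

```python
def binary_metrics(tp, fn, fp, tn):
    """Compute accuracy, precision, and recall from the four confusion-matrix cells."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)  # Of everything predicted positive, how much was right.
    recall = tp / (tp + fn)     # Of everything actually positive, how much was found.
    return accuracy, precision, recall


# Counts from the tumor example: 18 TP, 1 FN, 6 FP, 452 TN.
accuracy, precision, recall = binary_metrics(tp=18, fn=1, fp=6, tn=452)
```

Here precision is 18/24 = 0.75, recall is 18/19 (about 0.947), and accuracy is 470/477 (about 0.985).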

continuous feature

A floating-point feature with an infinite range of possible values. Contrast with discrete feature.

convenience sampling

Using a dataset not gathered scientifically in order to run quick experiments. Later on, it's essential to switch to a scientifically gathered dataset.

convergence

Informally, often refers to a state reached during training in which training loss and validation loss change very little or not at all with each iteration after a certain number of iterations. In other words, a model reaches convergence when additional training on the current data will not improve the model. In deep learning, loss values sometimes stay constant or nearly so for many iterations before finally descending, temporarily producing a false sense of convergence.
See also early stopping.
See also Boyd and Vandenberghe, Convex Optimization.

convex function

A function in which the region above the graph of the function is a convex set. The prototypical convex function is shaped something like the letter U. A strictly convex function has at most one local minimum point, which, when it exists, is also the global minimum point. The classic U-shaped functions are strictly convex functions. However, some convex functions (for example, straight lines) are not U-shaped.
A lot of the common loss functions, including the following, are convex functions:
L2 loss
Log Loss
L1 regularization
L2 regularization
Many variations of gradient descent are guaranteed to find a point close to the minimum of a strictly convex function. Similarly, many variations of stochastic gradient descent have a high probability (though, not a guarantee) of finding a point close to the minimum of a strictly convex function.
The sum of two convex functions (for example, L2 loss + L1 regularization) is a convex function.
Deep models are never convex functions. Remarkably, algorithms designed for convex optimization tend to find reasonably good solutions on deep networks anyway, even though those solutions are not guaranteed to be a global minimum.

convex optimization

The process of using mathematical techniques such as gradient descent to find the minimum of a convex function. A great deal of research in machine learning has focused on formulating various problems as convex optimization problems and in solving those problems more efficiently.
For complete details, see Boyd and Vandenberghe, Convex Optimization.

convex set

A subset of Euclidean space such that a line drawn between any two points in the subset remains completely within the subset.

convolution

In mathematics, casually speaking, a mixture of two functions. In machine learning, a convolution mixes the convolutional filter and the input matrix in order to train weights.
The term "convolution" in machine learning is often a shorthand way of referring to either convolutional operation or convolutional layer.
Without convolutions, a machine learning algorithm would have to learn a separate weight for every cell in a large tensor. For example, a machine learning algorithm training on 2K x 2K images would be forced to find 4M separate weights. Thanks to convolutions, a machine learning algorithm only has to find weights for every cell in the convolutional filter, dramatically reducing the memory needed to train the model. When the convolutional filter is applied, it is simply replicated across cells such that each is multiplied by the filter.

convolutional filter

One of the two actors in a convolutional operation. (The other actor is a slice of an input matrix.) A convolutional filter is a matrix having the same rank as the input matrix, but a smaller shape. For example, given a 28x28 input matrix, the filter could be any 2D matrix smaller than 28x28.
In photographic manipulation, all the cells in a convolutional filter are typically set to a constant pattern of ones and zeroes. In machine learning, convolutional filters are typically seeded with random numbers and then the network trains the ideal values.

convolutional layer

A layer of a deep neural network in which a convolutional filter passes along an input matrix.

convolutional neural network (CNN)

A neural network in which at least one layer is a convolutional layer. A typical convolutional neural network consists of some combination of the following layers:
convolutional layers
pooling layers
dense layers
Convolutional neural networks have had great success in certain kinds of problems, such as image recognition.

convolutional operation

The following two-step mathematical operation:
Element-wise multiplication of the convolutional filter and a slice of an input matrix. (The slice of the input matrix has the same rank and size as the convolutional filter.)
Summation of all the values in the resulting product matrix.
A convolutional layer consists of a series of convolutional operations, each acting on a different slice of the input matrix.
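The two steps, and the way a layer repeats them over different slices, can be sketched as follows. The function names are hypothetical, and the sketch assumes a 2D input, a stride of 1, and no padding.

```python
def convolve_slice(conv_filter, matrix_slice):
    """One convolutional operation: element-wise multiply, then sum."""
    return sum(
        f * s
        for filter_row, slice_row in zip(conv_filter, matrix_slice)
        for f, s in zip(filter_row, slice_row)
    )


def convolve2d(matrix, conv_filter):
    """Slide the filter over every slice of the matrix (stride 1, no padding)."""
    fh, fw = len(conv_filter), len(conv_filter[0])
    output = []
    for i in range(len(matrix) - fh + 1):
        row = []
        for j in range(len(matrix[0]) - fw + 1):
            # The slice has the same shape as the filter.
            matrix_slice = [r[j:j + fw] for r in matrix[i:i + fh]]
            row.append(convolve_slice(conv_filter, matrix_slice))
        output.append(row)
    return output
```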

cost

Synonym for loss.

counterfactual fairness

A fairness metric that checks whether a classifier produces the same result for one individual as it does for another individual who is identical to the first, except with respect to one or more sensitive attributes. Evaluating a classifier for counterfactual fairness is one method for surfacing potential sources of bias in a model.

coverage bias

A form of selection bias. Selection bias refers to errors in conclusions drawn from sampled data due to a selection process that generates systematic differences between samples observed in the data and those not observed. The following forms of selection bias exist:
coverage bias: The population represented in the dataset does not match the population that the machine learning model is making predictions about.
sampling bias: Data is not collected randomly from the target group.
nonresponse bias (also called participation bias): Users from certain groups opt out of surveys at different rates than users from other groups.
For example, suppose you are creating a machine learning model that predicts people's enjoyment of a movie. To collect training data, you hand out a survey to everyone in the front row of a theater showing the movie. Offhand, this may sound like a reasonable way to gather a dataset; however, this form of data collection may introduce the following forms of selection bias:
coverage bias: By sampling from a population who chose to see the movie, your model's predictions may not generalize to people who did not already express that level of interest in the movie.
sampling bias: Rather than randomly sampling from the intended population (all the people at the movie), you sampled only the people in the front row. It is possible that the people sitting in the front row were more interested in the movie than those in other rows.
nonresponse bias: In general, people with strong opinions tend to respond to optional surveys more frequently than people with mild opinions. Since the movie survey is optional, the responses are more likely to form a bimodal distribution than a normal (bell-shaped) distribution.

crash blossom

A sentence or phrase with an ambiguous meaning. Crash blossoms present a significant problem in natural language understanding. For example, the headline Red Tape Holds Up Skyscraper is a crash blossom because an NLU model could interpret the headline literally or figuratively.

critic

Synonym for Deep Q-Network.

cross-entropy

A generalization of Log Loss to multiclass classification problems. Cross-entropy quantifies the difference between two probability distributions. See also perplexity.
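A minimal sketch of the calculation, assuming both distributions are given as lists of probabilities over the same classes. The function name and the `epsilon` guard against log(0) are illustrative assumptions.

```python
import math


def cross_entropy(true_dist, predicted_dist, epsilon=1e-12):
    """Return -sum(t * log(p)) over classes, using the natural logarithm."""
    return -sum(
        t * math.log(max(p, epsilon))
        for t, p in zip(true_dist, predicted_dist)
    )
```

With a one-hot true distribution, only the predicted probability of the correct class contributes, so with two classes the result reduces to Log Loss.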

cross-validation

A mechanism for estimating how well a model will generalize to new data by testing the model against one or more non-overlapping data subsets withheld from the training set.
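One widely used variant is k-fold cross-validation, sketched below as an assumption rather than the only approach. The function name is hypothetical, and in practice the examples would usually be shuffled first.

```python
def k_fold_splits(examples, k):
    """Yield (training, held_out) pairs; each example is held out exactly once."""
    # Deal examples into k non-overlapping folds.
    folds = [examples[i::k] for i in range(k)]
    for i in range(k):
        held_out = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, held_out
```

A model would be trained on each `training` list and evaluated on the matching `held_out` list, and the k evaluation scores would then be averaged.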

custom Estimator

An Estimator that you write yourself.
Contrast with premade Estimators.

data analysis

Obtaining an understanding of data by considering samples, measurement, and visualization. Data analysis can be particularly useful when a dataset is first received, before one builds the first model. It is also crucial in understanding experiments and debugging problems with the system.

data augmentation

Artificially boosting the range and number of training examples by transforming existing examples to create additional examples. For example, suppose images are one of your features, but your dataset doesn't contain enough image examples for the model to learn useful associations. Ideally, you'd add enough labeled images to your dataset to enable your model to train properly. If that's not possible, data augmentation can rotate, stretch, and reflect each image to produce many variants of the original picture, possibly yielding enough labeled data to enable excellent training.

DataFrame

A popular datatype for representing datasets in pandas. A DataFrame is analogous to a table. Each column of the DataFrame has a name (a header), and each row is identified by a number.

data set or dataset

A collection of examples or observations.

Dataset API (tf.data)

A high-level TensorFlow API for reading data and transforming it into a form that a machine learning algorithm requires. A tf.data.Dataset object represents a sequence of elements, in which each element contains one or more Tensors. A tf.data.Iterator object provides access to the elements of a Dataset.

decision boundary

The separator between classes learned by a model in binary or multiclass classification problems.

decision threshold

Synonym for classification threshold.

decision tree

A model represented as a sequence of branching statements. Machine learning can generate deep decision trees.

deep model

A type of neural network containing multiple hidden layers.
Contrast with wide model.

deep neural network

Synonym for deep model.

Deep Q-Network (DQN)

In Q-learning, a deep neural network that predicts Q-functions.
Critic is a synonym for Deep Q-Network.

demographic parity

A fairness metric that is satisfied if the results of a model's classification are not dependent on a given sensitive attribute.
For example, if both Lilliputians and Brobdingnagians apply to Glubbdubdrib University, demographic parity is achieved if the percentage of Lilliputians admitted is the same as the percentage of Brobdingnagians admitted, irrespective of whether one group is on average more qualified than the other.
Contrast with equalized odds and equality of opportunity, which permit classification results in aggregate to depend on sensitive attributes, but do not permit classification results for certain specified ground-truth labels to depend on sensitive attributes.

dense feature

A feature in which most values are nonzero, typically a Tensor of floating-point values. Contrast with sparse feature.

dense layer

Synonym for fully connected layer.

depth

The number of layers (including any embedding layers) in a neural network that learn weights. For example, a neural network with 5 hidden layers and 1 output layer has a depth of 6.

depthwise separable convolutional neural network (sepCNN)

A convolutional neural network architecture based on Inception, but where Inception modules are replaced with depthwise separable convolutions. Also known as Xception.
A depthwise separable convolution (also abbreviated as separable convolution) factors a standard 3D convolution into two separate convolution operations that are more computationally efficient: first, a depthwise convolution, with a depth of 1 (n ✕ n ✕ 1), and then second, a pointwise convolution, with length and width of 1 (1 ✕ 1 ✕ n).

device

A category of hardware that can run a TensorFlow session, including CPUs, GPUs, and TPUs.

dimension reduction

Decreasing the number of dimensions used to represent a particular feature in a feature vector, typically by converting to an embedding.

dimensions

Overloaded term having any of the following definitions:
The number of levels of coordinates in a Tensor. For example:
A scalar has zero dimensions; for example, ["Hello"].
A vector has one dimension; for example, [3, 5, 7, 11].
A matrix has two dimensions; for example, [[2, 4, 18], [5, 7, 14]].
You can uniquely specify a particular cell in a one-dimensional vector with one coordinate; you need two coordinates to uniquely specify a particular cell in a two-dimensional matrix.
The number of entries in a feature vector.
The number of elements in an embedding layer.

discrete feature

A feature with a finite set of possible values. For example, a feature whose values may only be animal, vegetable, or mineral is a discrete (or categorical) feature. Contrast with continuous feature.

discriminative model

A model that predicts labels from a set of one or more features. More formally, discriminative models define the conditional probability of an output given the features and weights; that is:
p(output | features, weights)
For example, a model that predicts whether an email is spam from features and weights is a discriminative model.
The vast majority of supervised learning models, including classification and regression models, are discriminative models.
Contrast with generative model.

discriminator

A system that determines whether examples are real or fake.
Alternatively, the subsystem within a generative adversarial network that determines whether the examples created by the generator are real or fake.

disparate impact

Making decisions about people that impact different population subgroups disproportionately. This usually refers to situations where an algorithmic decision-making process harms or benefits some subgroups more than others.
For example, suppose an algorithm that determines a Lilliputian's eligibility for a miniature-home loan is more likely to classify them as “ineligible” if their mailing address contains a certain postal code. If Big-Endian Lilliputians are more likely to have mailing addresses with this postal code than Little-Endian Lilliputians, then this algorithm may result in disparate impact.
Contrast with disparate treatment, which focuses on disparities that result when subgroup characteristics are explicit inputs to an algorithmic decision-making process.

disparate treatment

Factoring subjects' sensitive attributes into an algorithmic decision-making process such that different subgroups of people are treated differently.
For example, consider an algorithm that determines Lilliputians’ eligibility for a miniature-home loan based on the data they provide in their loan application. If the algorithm uses a Lilliputian’s affiliation as Big-Endian or Little-Endian as an input, it is enacting disparate treatment along that dimension.
Contrast with disparate impact, which focuses on disparities in the societal impacts of algorithmic decisions on subgroups, irrespective of whether those subgroups are inputs to the model.

divisive clustering

A form of hierarchical clustering, the category of clustering algorithms that create a tree of clusters. Hierarchical clustering is well-suited to hierarchical data, such as botanical taxonomies. There are two types of hierarchical clustering algorithms:
Agglomerative clustering first assigns every example to its own cluster, and iteratively merges the closest clusters to create a hierarchical tree.
Divisive clustering first groups all examples into one cluster and then iteratively divides the cluster into a hierarchical tree.
Contrast with centroid-based clustering.

downsampling

Overloaded term that can mean either of the following:
Reducing the amount of information in a feature in order to train a model more efficiently. For example, before training an image recognition model, downsampling high-resolution images to a lower-resolution format.
Training on a disproportionately low percentage of overrepresented class examples in order to improve model training on underrepresented classes. For example, in a class-imbalanced dataset, models tend to learn a lot about the majority class and not enough about the minority class. Downsampling helps balance the amount of training on the majority and minority classes.

DQN

Abbreviation for Deep Q-Network.

dropout regularization

A form of regularization useful in training neural networks. Dropout regularization works by removing a random selection of a fixed number of the units in a network layer for a single gradient step. The more units dropped out, the stronger the regularization. This is analogous to training the network to emulate an exponentially large ensemble of smaller networks. For full details, see Dropout: A Simple Way to Prevent Neural Networks from Overfitting.
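As a rough illustration of the masking step described above, the sketch below zeroes a fixed number of randomly chosen units in one layer's output. The function name `drop_units` and the sample activations are hypothetical. Practical implementations commonly also rescale the surviving activations, which is omitted here.

```python
import random

def drop_units(activations, num_to_drop, rng=random):
    """Return a copy of `activations` with `num_to_drop` random units set to 0."""
    # Pick the indices of the units to remove for this gradient step.
    dropped = set(rng.sample(range(len(activations)), num_to_drop))
    return [0.0 if i in dropped else a for i, a in enumerate(activations)]

# Hypothetical layer output; a new random mask would be drawn at every step.
layer_output = [0.5, 1.2, 0.3, 0.9, 0.7, 1.1]
masked = drop_units(layer_output, num_to_drop=2, rng=random.Random(0))
```

Dropping more units per step corresponds to stronger regularization.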

dynamic model

A model that is trained online in a continuously updating fashion. That is, data is continuously entering the model.

eager execution

A TensorFlow programming environment in which operations run immediately. By contrast, operations called in graph execution don't run until they are explicitly evaluated. Eager execution is an imperative interface, much like the code in most programming languages. Eager execution programs are generally far easier to debug than graph execution programs.

early stopping

A method for regularization that involves ending model training before training loss finishes decreasing. In early stopping, you end model training when the loss on a validation dataset starts to increase, that is, when generalization performance worsens.
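The stopping rule can be sketched as a scan over per-epoch validation losses. This is a minimal sketch, not a definitive implementation: the `patience` parameter (how many non-improving epochs to tolerate) and the loss values are assumptions added for illustration.

```python
def early_stopping_epoch(validation_losses, patience=1):
    """Return the index of the epoch at which training would stop.

    Training stops once validation loss has failed to improve on its best
    value for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    # Validation loss never stopped improving; train through the last epoch.
    return len(validation_losses) - 1

# Hypothetical validation losses: they fall, then start to rise at epoch 4.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.58, 0.61]
stop_epoch = early_stopping_epoch(losses, patience=1)
```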

embeddings

A categorical feature represented as a continuous-valued feature. Typically, an embedding is a translation of a high-dimensional vector into a low-dimensional space. For example, you can represent the words in an English sentence in either of the following two ways:
As a million-element (high-dimensional) sparse vector in which all elements are integers. Each cell in the vector represents a separate English word; the value in a cell represents the number of times that word appears in a sentence. Since a single English sentence is unlikely to contain more than 50 words, nearly every cell in the vector will contain a 0. The few cells that aren't 0 will contain a low integer (usually 1) representing the number of times that word appeared in the sentence.
As a several-hundred-element (low-dimensional) dense vector in which each element holds a floating-point value between 0 and 1. This is an embedding.
In TensorFlow, embeddings are trained by backpropagating loss just like any other parameter in a neural network.

embedding space

The d-dimensional vector space that features from a higher-dimensional vector space are mapped to. Ideally, the embedding space contains a structure that yields meaningful mathematical results; for example, in an ideal embedding space, addition and subtraction of embeddings can solve word analogy tasks.
The dot product of two embeddings is a measure of their similarity.
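To make the similarity measure concrete, here is a small sketch that compares embeddings by dot product. The three-dimensional vectors are invented for illustration and do not come from a trained model.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical embeddings; real ones would be learned during training.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

sim_cat_dog = dot(embeddings["cat"], embeddings["dog"])
sim_cat_car = dot(embeddings["cat"], embeddings["car"])
# With these made-up vectors, "cat" scores as more similar to "dog" than to "car".
```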

empirical risk minimization (ERM)

Choosing the function that minimizes loss on the training set. Contrast with structural risk minimization.

ensemble

A merger of the predictions of multiple models. You can create an ensemble via one or more of the following:
different initializations
different hyperparameters
different overall structure
Deep and wide models are a kind of ensemble.

environment

In reinforcement learning, the world that contains the agent and allows the agent to observe that world's state. For example, the represented world can be a game like chess, or a physical world like a maze. When the agent applies an action to the environment, then the environment transitions between states.

episode

In reinforcement learning, each of the repeated attempts by the agent to learn an environment.

epoch

A full training pass over the entire dataset such that each example has been seen once. Thus, an epoch represents N/batch size training iterations, where N is the total number of examples.
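The iteration count can be computed directly. The sketch below rounds up when the dataset size is not a multiple of the batch size, which assumes the final, smaller batch is still processed; some training setups drop it instead.

```python
import math

def iterations_per_epoch(num_examples, batch_size):
    """Number of training iterations needed to see every example once."""
    # Assumption: a final partial batch counts as one more iteration.
    return math.ceil(num_examples / batch_size)

even_split = iterations_per_epoch(1000, 50)
uneven_split = iterations_per_epoch(1000, 64)
```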

epsilon greedy policy

In reinforcement learning, a policy that either follows a random policy with epsilon probability or a greedy policy otherwise. For example, if epsilon is 0.9, then the policy follows a random policy 90% of the time and a greedy policy 10% of the time.
Over successive episodes, the algorithm reduces epsilon’s value in order to shift from following a random policy to following a greedy policy. By shifting the policy, the agent first randomly explores the environment and then greedily exploits the results of random exploration.
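A minimal sketch of the selection rule and of reducing epsilon over episodes follows. The Q-values, the function names, and the exponential decay schedule are assumptions for illustration; the text above does not prescribe a particular schedule.

```python
import random

def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

def decayed_epsilon(initial_epsilon, decay_rate, episode, minimum=0.0):
    """Hypothetical schedule: shrink epsilon exponentially per episode."""
    return max(minimum, initial_epsilon * decay_rate ** episode)

q_values = [0.1, 0.7, 0.3]  # hypothetical estimated value of each action
greedy_action = epsilon_greedy_action(q_values, epsilon=0.0)
random_action = epsilon_greedy_action(q_values, epsilon=1.0, rng=random.Random(0))
```

With epsilon at 0 the policy is purely greedy, and with epsilon at 1 it is purely random.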

equality of opportunity

A fairness metric that checks whether, for a preferred label (one that confers an advantage or benefit to a person) and a given attribute, a classifier predicts that preferred label equally well for all values of that attribute. In other words, equality of opportunity measures whether the people who should qualify for an opportunity are equally likely to do so regardless of their group membership.
For example, suppose Glubbdubdrib University admits both Lilliputians and Brobdingnagians to a rigorous mathematics program. Lilliputians’ secondary schools offer a robust curriculum of math classes, and the vast majority of students are qualified for the university program. Brobdingnagians’ secondary schools don’t offer math classes at all, and as a result, far fewer of their students are qualified. Equality of opportunity is satisfied for the preferred label of "admitted" with respect to nationality (Lilliputian or Brobdingnagian) if qualified students are equally likely to be admitted irrespective of whether they're a Lilliputian or a Brobdingnagian.
For example, let's say 100 Lilliputians and 100 Brobdingnagians apply to Glubbdubdrib University, and admissions decisions are made as follows:
Table 1. Lilliputian applicants (90% are qualified)
            Qualified   Unqualified
Admitted    45          3
Rejected    45          7
Total       90          10
Percentage of qualified students admitted: 45/90 = 50%
Percentage of unqualified students rejected: 7/10 = 70%
Total percentage of Lilliputian students admitted: (45+3)/100 = 48%
Table 2. Brobdingnagian applicants (10% are qualified)
            Qualified   Unqualified
Admitted    5           9
Rejected    5           81
Total       10          90
Percentage of qualified students admitted: 5/10 = 50%
Percentage of unqualified students rejected: 81/90 = 90%
Total percentage of Brobdingnagian students admitted: (5+9)/100 = 14%
The preceding examples satisfy equality of opportunity for acceptance of qualified students because qualified Lilliputians and Brobdingnagians both have a 50% chance of being admitted.
Note: While equality of opportunity is satisfied, the following two fairness metrics are not satisfied:
demographic parity: Lilliputians and Brobdingnagians are admitted to the university at different rates; 48% of Lilliputian students are admitted, but only 14% of Brobdingnagian students are admitted.
equalized odds: While qualified Lilliputian and Brobdingnagian students both have the same chance of being admitted, the additional constraint that unqualified Lilliputians and Brobdingnagians both have the same chance of being rejected is not satisfied. Unqualified Lilliputians have a 70% rejection rate, whereas unqualified Brobdingnagians have a 90% rejection rate.
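The three checks can be reproduced from the counts in Tables 1 and 2. The sketch below is illustrative only; the function and key names are hypothetical.

```python
import math

def admission_rates(admitted_qualified, admitted_unqualified,
                    rejected_qualified, rejected_unqualified):
    """Compute the rates used by the three fairness metrics for one group."""
    qualified = admitted_qualified + rejected_qualified
    unqualified = admitted_unqualified + rejected_unqualified
    return {
        "qualified_admit_rate": admitted_qualified / qualified,
        "unqualified_reject_rate": rejected_unqualified / unqualified,
        "overall_admit_rate": (admitted_qualified + admitted_unqualified)
                              / (qualified + unqualified),
    }

lilliputians = admission_rates(45, 3, 45, 7)     # counts from Table 1
brobdingnagians = admission_rates(5, 9, 5, 81)   # counts from Table 2

def same(key):
    """True if both groups have the same value for the given rate."""
    return math.isclose(lilliputians[key], brobdingnagians[key])

equality_of_opportunity = same("qualified_admit_rate")
equalized_odds = equality_of_opportunity and same("unqualified_reject_rate")
demographic_parity = same("overall_admit_rate")
```

Only `equality_of_opportunity` holds for these tables, matching the note above.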

equalized odds

A fairness metric that checks if, for any particular label and attribute, a classifier predicts that label equally well for all values of that attribute.
For example, suppose Glubbdubdrib University admits both Lilliputians and Brobdingnagians to a rigorous mathematics program. Lilliputians' secondary schools offer a robust curriculum of math classes, and the vast majority of students are qualified for the university program. Brobdingnagians' secondary schools don’t offer math classes at all, and as a result, far fewer of their students are qualified. Equalized odds is satisfied provided that, whether an applicant is a Lilliputian or a Brobdingnagian, they are equally likely to be admitted to the program if they are qualified and equally likely to be rejected if they are not.
Let’s say 100 Lilliputians and 100 Brobdingnagians apply to Glubbdubdrib University, and admissions decisions are made as follows:
Table 3. Lilliputian applicants (90% are qualified)
            Qualified   Unqualified
Admitted    45          2
Rejected    45          8
Total       90          10
Percentage of qualified students admitted: 45/90 = 50%
Percentage of unqualified students rejected: 8/10 = 80%
Total percentage of Lilliputian students admitted: (45+2)/100 = 47%
Table 4. Brobdingnagian applicants (10% are qualified)
            Qualified   Unqualified
Admitted    5           18
Rejected    5           72
Total       10          90
Percentage of qualified students admitted: 5/10 = 50%
Percentage of unqualified students rejected: 72/90 = 80%
Total percentage of Brobdingnagian students admitted: (5+18)/100 = 23%
Equalized odds is satisfied because qualified Lilliputian and Brobdingnagian students both have a 50% chance of being admitted, and unqualified Lilliputian and Brobdingnagian students both have an 80% chance of being rejected.
Note: While equalized odds is satisfied here, demographic parity is not satisfied. Lilliputian and Brobdingnagian students are admitted to Glubbdubdrib University at different rates; 47% of Lilliputian students are admitted, and 23% of Brobdingnagian students are admitted.
Equalized odds is formally defined in "Equality of Opportunity in Supervised Learning" as follows: "predictor Ŷ satisfies equalized odds with respect to protected attribute A and outcome Y if Ŷ and A are independent, conditional on Y."
Note: Contrast equalized odds with the more relaxed equality of opportunity metric.
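For a binary predictor, the conditional independence in the formal definition can be written out as below. The group labels a and b are notation introduced here for illustration.

```latex
\[
P(\hat{Y} = 1 \mid A = a,\ Y = y) \;=\; P(\hat{Y} = 1 \mid A = b,\ Y = y)
\qquad \text{for all groups } a, b \text{ and } y \in \{0, 1\}.
\]
```

The case y = 1 requires equal admission rates among qualified applicants, which is the equality of opportunity condition. The case y = 0 requires equal admission rates (and therefore equal rejection rates) among unqualified applicants. In Tables 3 and 4, both groups admit 50% of qualified applicants and 20% of unqualified applicants.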

Estimator

An instance of the tf.estimator.Estimator class, which encapsulates logic that builds a TensorFlow graph and runs a TensorFlow session. You may create your own custom Estimators or instantiate premade Estimators created by others.

example

One row of a dataset. An example contains one or more features and possibly a label. See also labeled example and unlabeled example.

experience replay

In reinforcement learning, a DQN technique used to reduce temporal correlations in training data. The agent stores state transitions in a replay buffer, and then samples transitions from the replay buffer to create training data.
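A replay buffer can be sketched with a bounded queue. The class below is a hypothetical minimal version: the fixed capacity and uniform random sampling are assumptions for illustration, not part of the definition above.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores state transitions and samples them at random for training."""

    def __init__(self, capacity):
        # Assumption: once full, the oldest transitions are discarded.
        self.transitions = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.transitions.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling breaks up the temporal order of the transitions.
        return rng.sample(list(self.transitions), batch_size)

buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.add(state=step, action=0, reward=1.0, next_state=step + 1, done=False)
batch = buffer.sample(batch_size=2, rng=random.Random(0))
```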

experimenter's bias

A form of confirmation bias in which an experimenter continues training models until a preexisting hypothesis is confirmed.
Confirmation bias is the tendency to search for, interpret, favor, and recall information in a way that confirms one's preexisting beliefs or hypotheses. Machine learning developers may inadvertently collect or label data in ways that influence an outcome supporting their existing beliefs. Confirmation bias is a form of implicit bias.

exploding gradient problem

The tendency for gradients in deep neural networks (especially recurrent neural networks) to become surprisingly steep (high). Steep gradients result in very large updates to the weights of each node in a deep neural network.
Models suffering from the exploding gradient problem become difficult or impossible to train. Gradient clipping can mitigate this problem.
Compare to vanishing gradient problem.
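One common form of gradient clipping rescales the gradient vector whenever its norm exceeds a threshold. The sketch below assumes clipping by L2 norm; the function name and the example values are hypothetical.

```python
import math

def clip_by_norm(gradients, max_norm):
    """Scale `gradients` down so that their L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm:
        return list(gradients)   # already small enough; leave unchanged
    scale = max_norm / norm      # shrink while preserving the direction
    return [g * scale for g in gradients]

# A steep gradient with norm 50 is scaled down to norm 5.
clipped = clip_by_norm([30.0, 40.0], max_norm=5.0)
```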

fairness constraint

Applying a constraint to an algorithm to ensure one or more definitions of fairness are satisfied. Examples of fairness constraints include:
Postprocessing your model's output.
Altering the loss function to incorporate a penalty for violating a fairness metric.
Directly adding a mathematical constraint to an optimization problem.

fairness metric

A mathematical definition of “fairness” that is measurable. Some commonly used fairness metrics include:
equalized odds
predictive parity
counterfactual fairness
demographic parity
Many fairness metrics are mutually exclusive; see incompatibility of fairness metrics.

false negative (FN)

An example in which the model mistakenly predicted the negative class. For example, the model inferred that a particular email message was not spam (the negative class), but that email message actually was spam.

false positive (FP)

An example in which the model mistakenly predicted the positive class. For example, the model inferred that a particular email message was spam (the positive class), but that email message was actually not spam.

false positive rate (FPR)

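The x-axis in an ROC curve. The false positive rate is the fraction of actual negative examples that the model mistakenly classifies as positive: FPR = FP / (FP + TN), where TN is the number of true negatives.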
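As an illustration, the false positive rate, FP / (FP + TN), can be computed from labels and predictions. The data below is made up, with 1 as the positive class and 0 as the negative class.

```python
def false_positive_rate(labels, predictions):
    """FPR = FP / (FP + TN), computed over the actual negative examples."""
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    return fp / (fp + tn)

labels      = [0, 0, 0, 0, 1, 1]   # hypothetical ground truth
predictions = [1, 0, 0, 0, 1, 0]   # hypothetical model output
fpr = false_positive_rate(labels, predictions)  # one of four negatives is misclassified
```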

feature

An input variable used in making predictions.

Feature column (tf.feature_column)

A function that specifies how a model should interpret a particular feature. A list that collects the output returned by calls to such functions is a required parameter to all Estimator constructors.
The tf.feature_column functions enable models to easily experiment with different representations of input features. For details, see the Feature Columns chapter in the TensorFlow Programmer's Guide.
"Feature column" is Google-specific terminology. A feature column is referred to as a "namespace" in the VW system (at Yahoo/Microsoft), or a field.

feature cross

A synthetic feature formed by crossing (taking a Cartesian product of) individual binary features obtained from categorical data or from continuous features via bucketing. Feature crosses help represent nonlinear relationships.
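A cross of two one-hot (binary) features can be sketched as the product of every pair of their elements. The bucketed latitude and longitude features below are hypothetical examples.

```python
from itertools import product

def feature_cross(one_hot_a, one_hot_b):
    """Cross two one-hot vectors by multiplying every pair of elements."""
    return [a * b for a, b in product(one_hot_a, one_hot_b)]

# Hypothetical bucketed features: the example falls in latitude bucket 1
# and longitude bucket 2.
latitude_bucket = [0, 1, 0]
longitude_bucket = [0, 0, 1]
crossed = feature_cross(latitude_bucket, longitude_bucket)
```

The crossed vector has one element per (latitude bucket, longitude bucket) pair, so a model can learn a separate weight for each combination.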

feature engineering

The process of determining which features might be useful in training a model, and then converting raw data from log files and other sources into said features. In TensorFlow, feature engineering often means converting raw log file entries to tf.Example protocol buffers. See also tf.Transform.
Feature engineering is sometimes called feature extraction.

feature extraction

Overloaded term having either of the following definitions:
Retrieving intermediate feature representations calculated by an unsupervised or pretrained model (for example, hidden layer values in a neural network) for use in another model as input.
Synonym for feature engineering.

feature set

The group of features your machine learning model trains on. For example, postal code, property size, and property condition might comprise a simple feature set for a model that predicts housing prices.

feature spec

Describes the information required to extract features data from the tf.Example protocol buffer. Because the tf.Example protocol buffer is just a container for data, you must specify the following:
the data to extract (that is, the keys for the features)
the data type (for example, float or int)
the length (fixed or variable)
The Estimator API provides facilities for producing a feature spec from a list of FeatureColumns.

feature vector

The list of feature values representing an example passed into a model.

federated learning

A distributed machine learning approach that trains machine learning models using decentralized examples residing on devices such as smartphones. In federated learning, a subset of devices downloads the current model from a central coordinating server. The devices use the examples stored on the devices to make improvements to the model. The devices then upload the model improvements (but not the training examples) to the coordinating server, where they are aggregated with other updates to yield an improved global model. After the aggregation, the model updates computed by devices are no longer needed, and can be discarded.
Since the training examples are never uploaded, federated learning follows the privacy principles of focused data collection and data minimization.
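The aggregation step on the coordinating server might look like the following sketch. It assumes each device uploads a list of weight deltas and that the server takes an unweighted average; real systems may weight devices differently, and all names and values here are hypothetical.

```python
def aggregate_updates(global_weights, device_updates):
    """Apply the average of the devices' weight deltas to the global model."""
    num_devices = len(device_updates)
    averaged = [
        sum(update[i] for update in device_updates) / num_devices
        for i in range(len(global_weights))
    ]
    return [w + delta for w, delta in zip(global_weights, averaged)]

global_weights = [1.0, 2.0]
# Each device uploads only its model improvement, never its training examples.
device_updates = [[0.2, -0.4], [0.4, 0.0], [0.0, -0.2]]
new_global_weights = aggregate_updates(global_weights, device_updates)
```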

feedback loop

In machine learning, a situation in which a model's predictions influence the training data for the same model or another model. For example, a model that recommends movies will influence the movies that people see, which will then influence subsequent movie recommendation models.

feedforward neural network (FFN)

A neural network without cyclic or recursive connections. For example, traditional deep neural networks are feedforward neural networks. Contrast with recurrent neural networks, which are cyclic.

few-shot learning

A machine learning approach, often used for object classification, designed to learn effective classifiers from only a small number of training examples.
See also one-shot learning.

fine tuning

Performing a secondary optimization to adjust the parameters of an already trained model to fit a new problem. Fine tuning often refers to refitting the weights of a trained unsupervised model to a supervised model.

forget gate

The portion of a Long Short-Term Memory cell that regulates the flow of information through the cell. Forget gates maintain context by deciding which information to discard from the cell state.

full softmax

A function that provides probabilities for each possible class in a multiclass classification model. The probabilities add up to exactly 1.0. For example, softmax might determine that the probability of a particular image being a dog is 0.9, a cat 0.08, and a horse 0.02. Contrast with candidate sampling.
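A minimal sketch of the softmax computation over every class follows. The input scores (logits) are invented for illustration, and subtracting the maximum score is a common numerical-stability step that does not change the result.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    max_logit = max(logits)  # subtract the max to avoid overflow in exp
    exps = [math.exp(z - max_logit) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for the classes dog, cat, and horse.
probabilities = softmax([2.0, 1.0, 0.1])
```

In floating-point arithmetic the probabilities sum to 1.0 up to rounding error.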

fully connected layer

A hidden layer in which each node is connected to every node in the subsequent hidden layer.
A fully connected layer is also known as a dense layer.


Research material on AI

