I need to choose a topic from the reading, for example:
- The notion of deep learning.
- Why are large benchmark evaluations important? Describe the MNIST evaluation.
- Deep neural network architectures.
- Activation functions.
- Validation, cross-validation and hyperparameter tuning in training and evaluating deep networks.
- What are convolutional neural networks?
Data Mining: Practical Machine Learning Tools and Techniques
Slides for Chapter 10, Deep learning, of Data Mining by I. H. Witten, E. Frank, M. A. Hall, and C. J. Pal

Introducing Deep Learning
• In recent years, so-called "deep learning" approaches to machine learning have had a major impact on speech recognition and computer vision
• Other disciplines, such as natural language processing, are also starting to see benefits
• A critical ingredient is the use of much larger quantities of data than have heretofore been available
• Recent successes have arisen in settings involving high-capacity models, i.e. models with many parameters
• Deep learning methods create flexible models that exploit information buried in massive datasets far more effectively than traditional machine learning techniques that use hand-engineered features

Views on machine learning
• One way to view machine learning is in terms of three general approaches:
 1. Classical machine learning techniques, which make predictions directly from a set of features that have been pre-specified by the user;
 2. Representation learning techniques, which transform features into some intermediate representation prior to mapping them to final predictions; and
 3. Deep learning techniques, a form of representation learning that uses multiple transformation steps to create very complex features

The neural network renaissance and deep learning revolution
• The term "renaissance" captures a massive resurgence of interest in neural networks and deep learning techniques
• Many high-profile media outlets (e.g. The New York Times) have documented the striking successes of deep learning techniques on key benchmark problems
• Starting around 2012, impressive results were achieved on long-standing problems in speech recognition and computer vision, and in competitive challenges such as the ImageNet Large Scale Visual Recognition Challenge and the Labeled Faces in the Wild evaluation

GPUs, graphs and tensors
• The easy availability of high-speed computation in the form of graphics processing units (GPUs) has been critical to the success of deep learning techniques
• When formulated in matrix-vector form, computation can be accelerated using optimized graphics libraries and hardware
• This is why we will study backpropagation in matrix-vector form
 – Readers unfamiliar with manipulating functions that have matrix arguments, and their derivatives, are advised to consult Appendix A.1 for a summary of some useful background
• As network models become more complex, some quantities can only be represented using multidimensional arrays of numbers
 – Such arrays are sometimes referred to as tensors, a generalization of matrices that permits an arbitrary number of indices
• Software for deep learning that supports computation graphs and tensors is therefore invaluable for accelerating the creation of complex network structures and making it easier to learn them

Key developments
The following developments have played a crucial role in the resurgence of neural network methods:
• the proper evaluation of machine learning methods;
• vastly increased amounts of data;
• deeper and larger network architectures; and
• accelerated training using GPU techniques

Modified National Institute of Standards and Technology (MNIST)
• MNIST is a database and evaluation setup for handwritten digit recognition
• It contains 60,000 training and 10,000 test instances of hand-written digits, encoded as 28×28-pixel grayscale images
• The data is a re-mix of an earlier NIST dataset in which adults generated the training data and high school students generated the test set
• Let's compare the performance of different methods
[Table: performance of different methods on MNIST]
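To make the evaluation setup concrete, here is a minimal Python sketch of obtaining the standard MNIST split. It is not from the slides: it assumes scikit-learn's fetch_openml mirror of the dataset, and all variable names are ours.

```python
# Minimal sketch (ours, not the book's): load MNIST via scikit-learn's
# OpenML mirror. Assumes scikit-learn is installed and a network
# connection is available on first use.
import numpy as np
from sklearn.datasets import fetch_openml

mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X = mnist.data.astype(np.float32) / 255.0  # 70,000 x 784 pixel intensities in [0, 1]
y = mnist.target.astype(int)               # digit labels 0..9

# The conventional split: the first 60,000 instances form the training
# set and the remaining 10,000 the test set.
X_train, X_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]

# Each row is a flattened 28x28 grayscale image.
first_image = X_train[0].reshape(28, 28)
```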
Losses and regularization
• Logistic regression can be viewed as a simple neural network with no hidden units
• The underlying optimization criterion for predicting labels y_i, i = 1, …, N, from features x_i, with parameters θ consisting of a matrix of weights W and a vector of biases b, can be written as

 \sum_{i=1}^{N} -\log p(y_i \mid x_i; W, b) + \lambda \sum_{j=1}^{M} w_j^2 = \sum_{i=1}^{N} L(f_i(x_i; \theta), y_i) + \lambda R(\theta)

• where the first term, \sum_{i=1}^{N} L(f_i(x_i; \theta), y_i), is the negative conditional log-likelihood or loss, and
• the second term, \lambda R(\theta), is a weighted regularizer used to prevent overfitting

Empirical risk minimization
• This formulation as a loss- and regularizer-based objective function gives us the freedom to choose either probabilistic losses or other loss functions
• Using the average loss over the training data, called the empirical risk, leads to the following formulation of the optimization problem: minimize the empirical risk plus a regularization term, i.e.

 \operatorname{argmin}_{\theta} \left[ \frac{1}{N} \sum_{i=1}^{N} L(f_i(x_i; \theta), y_i) + \lambda R(\theta) \right]

 (a short code sketch of this objective follows the next two slides)
• Note that the factor N must be accounted for if one relates the regularization weight here to the corresponding parameter derived from a formal probabilistic model for a distribution on parameters

In practice
• In deep learning we are often interested in examining learning curves, which plot the loss or some other performance metric as a function of the number of passes the algorithm has made over the data
• It is much easier to compare the average loss over a training set with the average loss over a validation set on the same graph, because dividing by N puts both on the same scale

Common losses for neural networks
• The final output function of a neural network typically has the form f_k(x) = f(a_k(x)), where a_k(x) is the k-th element of the vector function a(x) = W h(x) + b
• Commonly used output loss functions, output activation functions, and the underlying distributions from which they derive are shown below
[Table: common output losses, output activation functions, and the distributions they derive from]
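As a concrete illustration of the objective above, the following NumPy sketch computes the regularized empirical risk for multinomial logistic regression (the no-hidden-unit network from the "Losses and regularization" slide). The helper names softmax and regularized_empirical_risk are ours, and leaving the biases unpenalized is a common convention rather than something the slides specify.

```python
import numpy as np

def softmax(A):
    # Numerically stable row-wise softmax over the class dimension.
    A = A - A.max(axis=1, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def regularized_empirical_risk(W, b, X, y, lam):
    """(1/N) * sum_i L(f(x_i; theta), y_i) + lambda * R(theta).

    X: N x D feature matrix; y: length-N integer class labels;
    W: D x K weight matrix; b: length-K bias vector; lam: lambda.
    """
    N = X.shape[0]
    P = softmax(X @ W + b)                    # N x K class probabilities
    nll = -np.log(P[np.arange(N), y]).mean()  # empirical risk: average loss
    return nll + lam * np.sum(W ** 2)         # L2 regularizer: sum_j w_j^2
```

Dividing the loss by N, as in the empirical-risk formulation, is what makes training and validation losses directly comparable on the same learning-curve plot.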
Deep neural network architectures
• Deep networks compose the computations performed by many layers
• Denoting the output of hidden layer l by h^{(l)}(x), the computation for a network with L hidden layers is

 f(x) = f\left(a^{(L+1)}\left(h^{(L)}\left(a^{(L)}\left(\cdots h^{(2)}\left(a^{(2)}\left(h^{(1)}\left(a^{(1)}(x)\right)\right)\right)\cdots\right)\right)\right)\right)

• where the pre-activation functions a^{(l)}(x) are typically linear, of the form a^{(l)}(x) = W^{(l)} x + b^{(l)}, with weight matrix W^{(l)} and bias vector b^{(l)}
• This formulation can be expressed using a single parameter matrix θ^{(l)} per layer with the trick of defining x̂ as x with a 1 appended to the end of the vector; we then have a^{(l)}(x) = θ^{(l)} x̂ with θ^{(l)} = [W^{(l)} b^{(l)}]

Deep feedforward networks
[Figure: a feedforward network with inputs x_1, …, x_D, hidden layers h^{(1)}, …, h^{(L)}, constant bias units, and outputs y_1, …, y_K]
• Unlike in Bayesian networks, the hidden units here are intermediate deterministic computations, not random variables, which is why they are not represented as circles
• However, the output variables y_k are drawn as circles because they can be formulated probabilistically

Activation functions
• Activation functions h^{(l)}(x) generally operate on the pre-activation vectors in an element-wise fashion
• While sigmoid functions have been popular, the hyperbolic tangent function is sometimes preferred, partly because it has a steady state at 0
• More recently, the rectify() function, giving rectified linear units (ReLUs), has been found to yield superior results in many different settings (see the forward-pass sketch at the end of this section)
 – Since ReLUs are 0 for negative argument values, some units in the model will yield activations that are 0, giving a sparseness property that is useful in many contexts
 – The gradient is particularly simple: either 0 or 1
 – This helps address the vanishing gradient problem
• A number of software packages make it easy to use a variety of activation functions, determining gradients automatically using symbolic computations

Activation functions
[Figure: plots of common activation functions]

Bibliographic Notes & Further Reading
• The backpropagation algorithm has been known in close to its current form since the PhD thesis of Werbos (1974)
• In his extensive literature review of deep learning, Schmidhuber (2015) traces key elements of the algorithm back even further
 – He also traces the idea of "deep networks" back to the work of Ivakhnenko and Lapa (1965)
• The popularity of neural network techniques has gone through several cycles; while some factors are social, there are important technical reasons behind the trends
• A single-layer neural network cannot solve the XOR problem, a failing that was derided by Minsky and Papert (1969) and that stymied neural network development in the following decades
• It is well known that networks with one additional layer can approximate any function (Cybenko, 1989; Hornik, 1991), and the influential work of Rumelhart et al. (1986) re-popularized neural network methods for a while
• By the early 2000s, neural network methods had fallen out of favor again
 – Kernel methods like SVMs yielded state-of-the-art results on many problems, and their training problems are convex
• Indeed, the organizers of NIPS, the Neural Information Processing Systems conference, which was (and still is) widely considered the premier forum for neural network research, found that the presence of the term "neural networks" in a title was highly correlated with the paper's rejection!
 – This is underscored by citation analysis of key neural network papers from this period
• In this context, the recent resurgence of interest in deep learning really does feel like a "revolution"
• It is known that most complex Boolean functions require an exponential number of two-step logic gates for their representation (Wegener, 1987)
• The solution appears to be greater depth: according to Bengio (2014), the evidence strongly suggests that "functions that can be compactly represented with a depth-k architecture could require a very large number of elements in order to be represented by a shallower architecture"
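Before revisiting backpropagation, here is a minimal NumPy sketch of the layered forward computation f(x) = f(a^{(L+1)}(h^{(L)}(… a^{(1)}(x) …))) described above, with element-wise activations such as ReLU. The forward function and the (W, b, h) layer representation are our own illustrative choices, not an API from the book.

```python
import numpy as np

def relu(a):
    # Rectified linear unit: 0 for negative arguments, identity otherwise.
    return np.maximum(0.0, a)

def softmax(a):
    # Numerically stable softmax for a 1-D pre-activation vector.
    e = np.exp(a - a.max())
    return e / e.sum()

def forward(x, layers, output_fn=softmax):
    """Compute f(x) by alternating linear pre-activations and activations.

    `layers` is a list of (W, b, h) triples, one per layer; h is the
    element-wise activation, with h=None for the final pre-activation
    a^(L+1), to which the output function f (here softmax) is applied.
    """
    for W, b, h in layers:
        x = W @ x + b      # linear pre-activation a^(l)(x) = W^(l) x + b^(l)
        if h is not None:
            x = h(x)       # element-wise activation h^(l)
    return output_fn(x)

# Example: 2 inputs, one hidden layer of 3 ReLUs, 2 output classes.
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(3, 2)), np.zeros(3), relu),
          (rng.normal(size=(2, 3)), np.zeros(2), None)]
print(forward(np.array([1.0, -2.0]), layers))  # a probability vector
```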
Backpropagation revisited in matrix-vector form

Backpropagation in matrix-vector form
• Backpropagation is based on the chain rule of calculus
• Consider the loss for a single-layer network with a softmax output (which corresponds exactly to the model for multinomial logistic regression)
• We use multinomial vectors y, with a single dimension y_k = 1 for the corresponding class label and all other dimensions 0
• Define