Deep Learning Glossary

    A

    Activation


    B

    Batch

    BP



    C

    CNN




    D

    Dropout

    Dropout is a recently introduced regularization method that has been very successful with feed-forward neural networks.
    It is a regularization technique for neural networks that prevents overfitting: it keeps neurons from co-adapting by randomly setting a fraction of them to 0 at each training iteration.
    Here is the Python (Theano) code:


    import theano.tensor as T
    from theano.tensor.shared_randomstreams import RandomStreams

    trng = RandomStreams(seed=1234)  # random stream used to sample the dropout mask

    def dropout_layer(state_prev, dropout_percent):
        # keep each unit with probability (1 - dropout_percent), zero out the rest
        return T.switch(trng.binomial(size=state_prev.shape, p=1 - dropout_percent), state_prev, 0)
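
    A brief usage sketch (the hidden-layer variable hidden and the rate of 0.5 below are assumptions for illustration): the mask is applied only at training time; at test time all units are kept and the activations are scaled by the keep probability instead.

    # Hypothetical usage: drop half of the units of a hidden activation during training
    dropout_percent = 0.5
    hidden_train = dropout_layer(hidden, dropout_percent)
    # at test time keep all units, but scale them by the keep probability
    hidden_test = hidden * (1 - dropout_percent)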




    DQN







    Reference:
    Recurrent Neural Network Regularization

    F

    float32 (theano)

    The default floating-point data type is float64; however, data must be converted to float32 to be stored on the GPU.

    • convert to float32
    epsilon = np.float32(0.01)
    • use a shared variable
    import numpy as np
    import theano

    w = theano.shared(np.random.randn(input_dimension, output_dimension).astype('float32'), name='w')
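
    Alternatively, Theano's configurable default float type theano.config.floatX can be used (set floatX=float32 via THEANO_FLAGS or .theanorc), which keeps the code device-independent. A minimal sketch, with illustrative dimensions:

    import numpy as np
    import theano

    input_dimension, output_dimension = 784, 10   # illustrative sizes
    # astype(theano.config.floatX) yields float32 when floatX=float32 is configured
    w = theano.shared(np.random.randn(input_dimension, output_dimension)
                        .astype(theano.config.floatX), name='w')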

    L

    Loss Function

    zero-one loss

    The objective of training a classifier is to minimize the number of errors on unseen examples. The zero-one loss is a very simple loss: it returns 0 if the prediction is correct and 1 if the prediction is wrong. Formally,

    \ell_{0,1} = \sum_{i=0}^{|\mathcal{D}|} I_{f(x^{(i)}) \neq y^{(i)}}

    where the prediction f is the class that maximizes the probability of seeing y when the input x and the parameters \theta are given:

    f(x) = \mathrm{argmax}_k P(Y = k \mid x, \theta)

    In python:

    # T.neq returns the result of the logical inequality (x != y)
    zero_one_loss = T.sum(T.neq(T.argmax(p_y_given_x), y))

    negative log likelihood loss

    The zero-one loss is not differentiable, and optimizing it directly for large models is computationally expensive. In practice, the negative log-likelihood loss is widely used instead and has proven to be effective:

    NLL(\theta, \mathcal{D}) = -\sum_{i=0}^{|\mathcal{D}|} \log P(Y = y^{(i)} \mid x^{(i)}, \theta)

    In python:

    # This one-liner hides a lot, so look carefully at what happens inside.
    # T.log(p_y_given_x) is the element-wise log of p_y_given_x, noted as LP.
    # y.shape[0] returns the number of training samples.
    # LP[T.arange(y.shape[0]), y] is a vector containing LP[0,y[0]] ... LP[n-1,y[n-1]];
    # finally we sum them all up and negate the result.
    NLL = -T.sum(T.log(p_y_given_x)[T.arange(y.shape[0]), y])

    This is similar to the one-hot vector indexing trick described below, except that here we select a single element (the one in column y[n]) from each row n.
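
    A tiny worked example of this fancy indexing (the probabilities below are made up for illustration):

    import numpy as np

    # LP plays the role of T.log(p_y_given_x); 3 samples, 3 classes (made-up values)
    LP = np.log(np.array([[0.7, 0.2, 0.1],
                          [0.1, 0.8, 0.1],
                          [0.3, 0.3, 0.4]]))
    y = np.array([0, 1, 2])                   # correct class of each sample
    picked = LP[np.arange(y.shape[0]), y]     # [LP[0,0], LP[1,1], LP[2,2]]
    nll = -picked.sum()                       # negative log-likelihood of the batch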


    LSTM

    LSTM (Long Short-Term Memory) is a recurrent neural network, a special type of RNN that is able to learn long-term dependencies.
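
    A minimal sketch of a single LSTM time step in Theano (the parameter names W, U, b, the stacked-gate layout, and the hidden size n are assumptions for illustration, not a fixed API):

    import theano.tensor as T

    def lstm_step(x_t, h_prev, c_prev, W, U, b, n):
        # W maps the input, U maps the previous hidden state; both project onto
        # the four gates at once, each of width n (the hidden size)
        pre = T.dot(x_t, W) + T.dot(h_prev, U) + b
        i = T.nnet.sigmoid(pre[:, 0 * n:1 * n])   # input gate
        f = T.nnet.sigmoid(pre[:, 1 * n:2 * n])   # forget gate
        o = T.nnet.sigmoid(pre[:, 2 * n:3 * n])   # output gate
        g = T.tanh(pre[:, 3 * n:4 * n])           # candidate values
        c_t = f * c_prev + i * g                  # cell state carries long-term information
        h_t = o * T.tanh(c_t)                     # hidden state exposed to the next layer
        return h_t, c_t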



    M

    MNIST dataset

    The MNIST dataset is a universally used dataset for handwritten digit recognition; its characteristics can be summarized as follows:

    1. training set: 50,000 examples, validation set: 10,000, test set: 10,000
    2. 28 x 28 pixels (the input x of each training example is represented as a 1-dimensional array of length 784, while the output label y is a scalar ranging from 0 to 9)

    Now, we open the dataset in Python and store it in a form that can take advantage of GPU acceleration.

    import cPickle, gzip, numpy
    import theano
    import theano.tensor as T

    # Load the dataset
    f = gzip.open('mnist.pkl.gz', 'rb')
    train_set, valid_set, test_set = cPickle.load(f)
    f.close()

    # Next, store the data in GPU memory using Theano shared variables
    def share_dataset(data_xy):
        data_x, data_y = data_xy
        shared_x = theano.shared(numpy.asarray(data_x, dtype=theano.config.floatX))
        shared_y = theano.shared(numpy.asarray(data_y, dtype=theano.config.floatX))
        # The following syntax also works:
        #   shared_x = theano.shared(data_x.astype('float32'))
        #   shared_y = theano.shared(data_y.astype('float32'))
        # Since the labels should be integers, not floats, we cast them
        return shared_x, T.cast(shared_y, 'int32')

    # Now try it!
    test_set_x, test_set_y = share_dataset(test_set)
    valid_set_x, valid_set_y = share_dataset(valid_set)
    train_set_x, train_set_y = share_dataset(train_set)

    It is very common to use minibatch gradient descent (see later) rather than the whole dataset at once, since using the whole dataset is computationally heavy.

    # Use a batch_size of 500 and select the third batch
    batch_size = 500
    data = train_set_x[2 * batch_size: 3 * batch_size]
    label = train_set_y[2 * batch_size: 3 * batch_size]

    O

    one-hot vector

    One-hot vectors are very important in natural language processing; they are often used as neural-network inputs and act as a form of indexing. So how do we build such a matrix in practice? Consider a small dataset first, for example data labeled with two classes, 0 and 1.
    A one-hot vector is a term from NLP: as its name indicates, it is a vector in which exactly one element is 1 and all the others are 0. Suppose we have a vocabulary of 4,000 words for text generation; then there are 4,000 unique one-hot vectors, one per word. Different tasks initialize these vectors in different ways.

    • classification
      Suppose there are only 2 classes, 0 and 1. The two one-hot vectors are [1,0] and [0,1]. Suppose we have six training samples whose labels are stored in an array [0,1,0,1,1,0]; we first build an identity (eye) matrix and then let the label array index into it, so that each sample selects the vector it belongs to, producing a matrix that contains all samples.
    >>> import numpy as np
    >>> x = np.eye(2)                  # the two one-hot vectors
    >>> y = np.array([0,1,0,1,1,0])    # classes
    >>> x
    array([[ 1.,  0.],
           [ 0.,  1.]])
    >>> y
    array([0, 1, 0, 1, 1, 0])
    >>> x[y]                           # by indexing, we generate a matrix for learning
    array([[ 1.,  0.],
           [ 0.,  1.],
           [ 1.,  0.],
           [ 0.,  1.],
           [ 0.,  1.],
           [ 1.,  0.]])

    R

    Regularization

    L1 and L2 regularization

    L1 and L2 regularization consist of adding an extra term to the loss function in order to penalize large parameter values and prevent overfitting. Formally, the regularized loss is

    E(\theta, \mathcal{D}) = \mathrm{NLL}(\theta, \mathcal{D}) + \lambda R(\theta)

    where

    R(\theta) = \left( \sum_{j} |\theta_j|^p \right)^{1/p}

    which is the L_p norm of \theta; p = 1 gives L1 regularization and p = 2 gives L2 regularization (in practice the squared L2 norm is used, as in the code below).
    In python:

    L1 = T.sum(abs(param))
    L2 = T.sum(param ** 2)
    loss = NLL + lambda_1 * L1 + lambda_2 * L2

    early-stopping

    Early-stopping combats overfitting by monitoring the model’s performance on a validation set. If the model’s performance ceases to improve sufficiently on the validation set, or even degrades with further optimization, we stop training, as sketched below.
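
    A minimal patience-based sketch, assuming the model parameters params and hypothetical helpers train_one_epoch() and validation_error() that update and evaluate the model:

    best_valid_error = float('inf')
    patience = 5                      # epochs to wait for an improvement before stopping
    epochs_without_gain = 0

    for epoch in range(100):          # upper bound on the number of epochs
        train_one_epoch(params)                   # hypothetical training helper
        valid_error = validation_error(params)    # hypothetical validation helper
        if valid_error < best_valid_error:
            best_valid_error = valid_error
            best_params = [p.copy() for p in params]   # remember the best model so far
            epochs_without_gain = 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:
                break                 # validation performance stopped improving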

    S

    Stochastic Gradient Descent

    gradient descent

    Ordinary gradient descent refers to the training method in which we repeatedly take small steps downward on an error surface defined by a loss function of some parameters.
    Pseudocode:

    while True:
        loss = f(params)
        d_loss_wrt_params = ...  # compute gradient
        params -= learning_rate * d_loss_wrt_params
        if <stopping condition is met>:
            return params

    stochastic gradient descent

    Stochastic gradient descent works just like gradient descent but proceeds more quickly by estimating the gradient from just a few examples at a time instead of the entire training set.
    Pseudocode:

    for (x_i, y_i) in training_set:
        loss = f(params, x_i, y_i)
        d_loss_wrt_params = ...  # compute gradient
        params -= learning_rate * d_loss_wrt_params
        if <stopping condition is met>:
            return params

    minibatch SGD

    In practice, it is recommended to use minibatches rather than just one training example at a time; this reduces the variance of the gradient estimate while remaining cheap to compute.
    Pseudocode:

    for (x_batch, y_batch) in train_batches:
        loss = f(params, x_batch, y_batch)
        d_loss_wrt_params = ...  # compute gradient
        params -= learning_rate * d_loss_wrt_params
        if <stopping condition is met>:
            return params

    Note: the minibatch size depends on the model, the dataset, and the hardware, and typically ranges from 1 to several hundred. In the deeplearning.net tutorial it is set to 20.
    Warning: if we are training for a fixed number of epochs, the minibatch size becomes important, because it determines how many parameter updates are performed per epoch.

    In Theano, the general form of minibatch gradient descent is as follows:

    # Symbolic description of the parameters and functions
    d_loss_wrt_params = T.grad(loss, params)
    updates = [(params, params - learning_rate * d_loss_wrt_params)]
    MSGD = theano.function([x_batch, y_batch], loss, updates=updates)

    for (x_batch, y_batch) in train_batches:
        print("current loss is ", MSGD(x_batch, y_batch))
        if stopping_condition_is_met:
            return params
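
    As a concrete sketch of the schematic code above (the layer sizes, learning rate, and variable names are illustrative assumptions), a logistic-regression MSGD step for MNIST-sized inputs can be compiled like this:

    import numpy as np
    import theano
    import theano.tensor as T

    x = T.matrix('x')     # a minibatch of flattened images (batch_size x 784)
    y = T.ivector('y')    # the corresponding integer labels

    # Logistic-regression parameters: 784 inputs, 10 classes (illustrative sizes)
    W = theano.shared(np.zeros((784, 10), dtype=theano.config.floatX), name='W')
    b = theano.shared(np.zeros(10, dtype=theano.config.floatX), name='b')

    p_y_given_x = T.nnet.softmax(T.dot(x, W) + b)
    NLL = -T.mean(T.log(p_y_given_x)[T.arange(y.shape[0]), y])

    learning_rate = 0.1
    g_W, g_b = T.grad(NLL, [W, b])
    updates = [(W, W - learning_rate * g_W), (b, b - learning_rate * g_b)]

    # One compiled minibatch SGD step: feed it a batch, it returns the loss and updates W, b
    MSGD = theano.function([x, y], NLL, updates=updates)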



    Others





    Reference

    1. http://deeplearning.net/tutorial/
    2. http://blog.csdn.net/klaas/article/details/50765062
    3. 刘昕 深度学习大讲堂 https://mp.weixin.qq.com/s?__biz=MzI1NTE4NTUwOQ==&mid=2650324652&idx=1&sn=229a7f0498aca9...

