In our previous tutorials, we explored different machine-learning tools to solve problems like predicting house prices or classifying emails. But what if we want computers to understand the world more like humans do, such as recognizing different cat breeds in a photo or translating languages fluently? That’s where Deep Learning comes in! Deep learning is an advanced and dynamic subset of machine learning that has revolutionized the field of artificial intelligence. While machine learning provides valuable insights and predictions based on data, deep learning goes a step further by using multi-layer neural networks to learn useful representations automatically and make decisions from them.
It allows computers to learn directly from data, becoming better at complex tasks than traditional methods. Let’s dive in and discover the exciting world of Deep Learning!
Prerequisites:
- Python, NumPy, scikit-learn (sklearn), Pandas and Matplotlib.
- Familiarity with TensorFlow and Keras
- Linear Algebra For Machine Learning.
- Statistics And Probability Theory.
- All of our previous machine-learning tutorials
Introduction To Deep Learning: What Is Deep Learning?
Imagine yourself teaching a child the intricacies of the world. You start with fundamental shapes, colours, and textures, slowly building their understanding layer by layer. But as their curiosity grows, so too does the complexity of what they can grasp. Traditional machine learning algorithms, like decision trees and support vector machines, are excellent at handling these well-defined structures. However, they often struggle with more complex tasks or perceptual tasks in domains like computer vision, natural language processing, speech recognition etc.
This is where deep learning enters the scene, a powerful tool inspired by the very structure and function of the human brain. It is just an inspiration and not the exact representation of our neurons. At its core, deep learning relies on artificial neural networks (ANNs), interconnected layers of processing units that learn and improve through experience. These networks, like their biological counterparts, can automatically extract intricate features and relationships from vast amounts of data, enabling them to tackle challenges that were previously insurmountable.
In the 1940s, the foundational concept of artificial neurons was introduced, laying the groundwork for what we now know as deep learning. However, progress was slow due to limitations in both computing power and theoretical understanding during that time.
Fast forward to the 1980s: interest resurged with the development of a crucial training algorithm called backpropagation. This algorithm played a key role in optimizing neural networks, but deep learning still faced challenges.
It wasn’t until the 2010s that deep learning truly flourished. This period saw significant advancements in computing power, the explosion of Big Data, and the emergence of improved algorithms. Together, more powerful hardware, vast amounts of data, and better algorithms fueled the growth of deep learning and led to breakthroughs in various fields, transforming the way we approach and solve complex problems across different domains.
Let’s look at how other people define it so that we can get a better understanding of what exactly it is:
- Andrew Ng: “Deep learning is a part of machine learning that uses artificial neural networks, learning algorithms inspired by the human brain, to automatically learn and extract features from data.”
- François Chollet: “Deep learning is a set of algorithms that attempt to learn in a way that is similar to how humans learn.”
- Yann LeCun: “Deep learning is like building a child a Lego set with no instructions, and the child ends up building a spaceship.”
- Jeff Hawkins: “Deep learning is about giving computers the ability to learn the way we do, by building models of the world based on the data we experience.”
- Google AI: “Deep learning is a powerful tool for identifying and understanding complex patterns in data, enabling computers to make accurate predictions and decisions on a wide range of tasks.”
- OpenAI: “Deep learning allows computers to learn from large amounts of data and improve their performance over time without being explicitly programmed.”
Deep learning is a specific subset of machine learning that approaches representation learning differently. It places a strong emphasis on learning successive layers of representations from data. The term “deep” in deep learning doesn’t signify a deeper understanding achieved by the approach; rather, it refers to the idea of learning these layers of representations one after another.
The depth of the model is determined by how many layers contribute to its representation of the data. Alternative names for the field could have been “layered representations learning” or “hierarchical representations learning.” In modern deep learning, models often involve tens or even hundreds of successive layers of representations, all learned automatically from exposure to training data. In contrast, other machine learning approaches tend to focus on learning only one or two layers of representations, earning them the label “shallow learning.”
Source: Deep Learning with Python (book)
Deep Learning vs. Linear Models:
Remember our tutorials on linear models like linear regression and classifiers? They were great for tackling simple problems, but what happens when we encounter more complex functions? Like a straight key in a curved keyhole, linear models simply don’t fit. Three key differences set deep learning apart from the limitations of linear models:
1. Representation Power:
- Linear models: Think of them like straight lines, unable to capture the curves and bends of real-world data. This limits their ability to represent complex functions like XOR or even a simple parabola.
- Deep learning: Imagine a network of interconnected nodes, each performing simple calculations. By stacking these layers upon layers, deep learning models can build intricate representations, mimicking even the most complex functions. It’s like having a toolbox with tools for every shape, not just straight lines!
2. Feature Engineering:
- Linear models: Finding the right features for complex problems can be like searching for a needle in a haystack, requiring expert knowledge and intuition. It’s time-consuming and often relies on trial and error.
- Deep learning: Deep learning models automatically learn these features directly from the data! There is little need for manual engineering: the model itself discovers the hidden patterns and relationships within the data, building its own “toolbox” as it learns.
3. Flexibility and Scalability:
- Linear models: With limited representation power, they struggle with large or diverse datasets. Adding more features can become cumbersome and computationally expensive.
- Deep learning: The more layers a deep learning model has, the more complex functions it can represent. This allows it to handle large datasets and diverse problems efficiently, scaling its power as needed. It’s like having a model that can grow and adapt to your challenges.
Feature | Classical Machine Learning | Deep Learning |
---|---|---|
Feature Representation | Manual | Automatic |
Model Complexity | Shallow | Deep |
Learning Process | Explicit | Implicit |
Strengths | Interpretable, Efficient for small datasets, Easily explainable decisions | Can handle complex patterns and relationships, Excels with large datasets, Automatic feature learning |
Weaknesses | May struggle with complex patterns, Limited scalability, Requires significant feature engineering effort | Can be computationally expensive, Less interpretable (“black box”), Prone to overfitting with small datasets |
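To make the representation-power point concrete, here is a minimal sketch in plain Python. No single linear model can represent XOR, but a tiny network with one hidden layer and a ReLU activation can. The weights below are hand-picked for illustration (an assumption, not values learned by training), and the function name `xor_network` is hypothetical.

```python
def relu(z):
    """Rectified linear unit: returns z if positive, otherwise 0."""
    return max(0.0, z)


def xor_network(x1, x2):
    """Two-layer network with hand-picked (not learned) weights."""
    h1 = relu(1.0 * x1 + 1.0 * x2 + 0.0)   # hidden neuron 1
    h2 = relu(1.0 * x1 + 1.0 * x2 - 1.0)   # hidden neuron 2
    return 1.0 * h1 - 2.0 * h2             # linear output neuron


for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "->", xor_network(x1, x2))
```

The hidden layer bends the input space so that the output neuron, which is itself linear, can separate the two classes. A trained network would find weights like these on its own.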
So, How Do Deep Learning Models Solve A Problem?
Deep learning is like learning about something step by step. Imagine you want a computer to recognize handwritten digits. A deep learning model does this by breaking down the task into different steps. Don’t worry about the technical stuff; just think of it like layers in the model learning various things from the digit picture. They work one after the other, figuring out different aspects, and in the end, they give you an answer about which digit the picture shows. It’s a way to learn data representation in multiple layers.
Another example to understand how deep learning models solve a problem: Imagine you want to train a model to recognize different dog breeds in photos. A deep learning model wouldn’t need you to manually define features like “floppy ears” or “short legs.” Instead, it would analyze millions of dog pictures, automatically identifying and combining various features (e.g., ear shape, fur texture, body proportions) to create its own internal representation of different breeds. This allows the model to recognize those breeds in photos it has never seen before, showcasing its pattern recognition prowess.
Deep learning models aren’t just about identifying features; they can also learn intricate relationships between them. Imagine you want to predict housing prices. A simple model might only consider factors like square footage and number of bedrooms. But a deep learning model could go further, analyzing factors like neighbourhood demographics, proximity to amenities, and even historical market trends. By considering these complex relationships, the model can make more accurate and nuanced predictions.
One more important thing about deep learning is that models trained on large and diverse datasets tend to generalize well to new data. They can also be trained on a specific task (e.g., recognizing dogs) and then reuse their learned knowledge on similar tasks (e.g., recognizing cats, or identifying other animals) with minimal adjustments, a technique known as transfer learning. This adaptability makes them valuable for real-world applications where data may vary significantly.
Limitations:
Until now, we haven’t delved into the math behind how deep learning models work. We’ll get there soon, but for now, you’ve got a grasp of how deep learning sets itself apart from traditional machine learning methods and where it excels. Now, let’s shift gears and discuss where it falls short:
- Black box: Sometimes, even the experts can’t explain how a deep learning model arrived at its decision. This lack of transparency can be a concern in areas like healthcare or finance.
- Data dependence: Like a picky eater, deep learning models need a lot of data to perform well. If you don’t have enough data, or if it’s biased, your model might end up making some seriously flawed decisions.
- Computational cost: Training deep learning models can be expensive and time-consuming, especially if you’re using a potato for a computer (don’t do that).
- Ethical Concerns: Deep learning applications raise ethical concerns related to biases in data, transparency, and fairness in decision-making processes.
Artificial Neural Networks (ANNs)
ANNs are the core of deep learning. They are inspired by biological neural connections, but they are not exactly the same. All deep learning networks that we will study are some form of ANN with a different architecture, and different architectures are useful for different tasks. We will study each network type in its own tutorial, but first, let’s have an overview of the types of ANNs that exist. Don’t worry about the terms. We will learn each concept one by one.
1. Feedforward Neural Networks (FNNs):
- How they work: FNNs consist of multiple interconnected layers of neurons (nodes) arranged successively. Information flows from the input layer through hidden layers (if any) to the output layer.
- Strengths: Simple, versatile, and efficient for various tasks. Able to learn complex relationships between input and output data.
- Other names: Multi-Layer Perceptrons (MLPs)
- Applications: Image recognition, spam filtering, classification and regression tasks.
We will study FNNs (MLPs) in this tutorial and other architectures in future tutorials.
2. Convolutional Neural Networks (CNNs):
- How they work: CNNs utilize a specialized architecture with convolutional layers that extract features from input data (typically images or videos) through convolutions and pooling operations. These features are then processed by fully connected layers for classification or regression tasks.
- Strengths: Highly effective for image and video processing tasks due to their ability to learn spatial features and hierarchical representations.
- Other names: ConvNets
- Applications: Self-driving cars, medical image analysis, image captioning, object detection, facial recognition.
3. Recurrent Neural Networks (RNNs):
- How they work: RNNs are designed to handle sequential data like text and speech. They incorporate an internal memory mechanism that allows them to process information one step at a time, taking into account the context from previous inputs.
- Strengths: Able to use context from earlier steps in a sequence, making them suitable for tasks like machine translation and language modelling. Plain RNNs struggle with very long-term dependencies, which LSTMs and GRUs were designed to address.
- Other names: Simple RNNs
- Applications: Machine translation, sentiment analysis, text generation, music generation.
4. Long Short-Term Memory (LSTMs):
- How they work: LSTMs are a variant of RNNs specifically designed to address the vanishing gradient problem in RNNs. They utilize special memory cells that can store information for longer periods, allowing them to learn long-term dependencies more effectively.
- Strengths: Improved memory capabilities compared to vanilla RNNs, making them ideal for tasks with long-term dependencies like natural language processing and video captioning.
- Related architectures: Gated Recurrent Units (GRUs) are similar, with slight variations.
- Applications: Machine translation, natural language processing, video captioning, speech recognition.
5. Gated Recurrent Units (GRUs):
- How they work: GRUs are a variant of RNNs similar to LSTMs, but with a simpler architecture. They utilize update gates and reset gates to control the flow of information within the memory cell, allowing them to learn long-term dependencies.
- Strengths: Offer similar capabilities to LSTMs in handling long-term dependencies, but with fewer parameters and potentially faster training times.
- Applications: Similar to LSTMs, including machine translation, natural language processing, speech recognition, and text generation.
6. Transformers:
- How they work: Transformers utilize an attention mechanism that allows them to focus on specific parts of the input data, leading to more accurate and nuanced results. This attention mechanism can be applied to various tasks, including machine translation and text summarization.
- Strengths: Revolutionizing tasks like machine translation and summarization with their powerful attention mechanism. Can handle long-range dependencies effectively.
- Other names: Attention-based models
- Applications: Machine translation, text summarization, question answering, speech recognition.
7. Generative Adversarial Networks (GANs):
- How they work: GANs consist of two competing networks: a generator and a discriminator. The generator creates new data (e.g., images, text), while the discriminator tries to distinguish the generated data from real data. This adversarial training process allows the generator to create increasingly realistic and complex outputs.
- Strengths: Able to generate realistic and creative data, making them suitable for tasks like image generation, music composition, and text style transfer.
- Other names: None widely used, but often described by their specific application (e.g., StyleGAN for image generation).
- Applications: Creating realistic images and videos, composing music, generating art styles, drug discovery.
There are other network types, but these are the most widely used. Which architecture to use depends on the kind of problem you are working on. Now that we have a basic understanding of deep learning, let’s learn each architecture one by one in depth. In this tutorial, I will only explain FNNs and their implementation; the rest will follow in future tutorials.
Feedforward Neural Networks (FNNs) In Depth: Concept And Application
The Feedforward Neural Network (FNN), also known as a Multi-Layer Perceptron (MLP), is a fundamental building block in deep learning. Its simplicity and versatility have made it a mainstay in various tasks like image recognition, classification, and regression. An FNN is like a simple decision-making system.
Imagine it as a stack (layers) of interconnected nodes (neurons), each handling a specific piece of information. Information flows through these layers in one direction, from the input to the output, without any loops or cycles. It’s like a conveyor belt for information.
An FNN consists of inputs, which are passed through successive layers to give you the output. There are various components to FNN architecture:
- Neurons: The basic processing units, inspired by biological neurons, that receive input, perform calculations, and generate output.
- Layers: Arranged sequentially, with information flowing from the input layer through hidden layers (if any) to the output layer. Each layer transforms the data using mathematical functions.
- Activation Functions: Introduce non-linearity into the network, allowing it to learn complex relationships. Common examples include sigmoid, ReLU, and tanh functions.
- Weights and Biases: Adjustable parameters in each neuron that determine how strongly inputs influence the output. These are optimized during training to improve the network’s performance.
To understand these terms in a much better way, let’s see what they look like in mathematical representation:
Neurons:
Imagine a neuron in a neural network as a tiny decision-maker. It’s like a little brain cell that receives input, processes it, and produces an output. The components of a neuron are:
- Inputs: Information or signals coming into the neuron. For example, In predicting a student’s grade, inputs could be the number of study hours and attendance.
- Weights: They determine the importance of each input, like the coefficient w in the equation y = wx + b.
- Weighted Sum: In a neural network, each neuron receives multiple inputs from other neurons or the input layer. Each input has an associated weight, which represents the strength or importance of that input’s influence on the neuron’s output. The weighted sum is calculated by multiplying each input by its corresponding weight and then summing all these products together. This gives a single value that captures the combined influence of all the inputs on the neuron. Let’s say a neuron receives n inputs, denoted as $x_1, x_2, \dots, x_n$, and their corresponding weights are $w_1, w_2, \dots, w_n$. The bias of the neuron is denoted by $b$. Then, the weighted sum is calculated as: $w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b$. This equation essentially represents a linear combination of the inputs, where each input’s contribution is scaled by its weight.
  - Imagine a simple neural network with one hidden layer and one output layer. Let’s say the input layer has 3 neurons, the hidden layer has 2 neurons, and the output layer has 1 neuron. Here’s how the weighted sum would be calculated at each layer:
  - Input Layer: No weighted sum calculation happens in the input layer, as it simply receives the raw input data.
  - Hidden Layer:
    - For the first neuron in the hidden layer:
      - Weighted sum = $w_{11} x_1 + w_{12} x_2 + w_{13} x_3 + b_1$
      - Here, $w_{11}$, $w_{12}$, and $w_{13}$ are the weights connecting the first neuron in the hidden layer to the 3 neurons in the input layer, and $b_1$ is the bias of the first hidden neuron.
    - Similarly, calculate the weighted sum for the second neuron in the hidden layer using its corresponding weights and bias.
  - Output Layer: The output neuron receives inputs from the hidden layer neurons.
    - Weighted sum = $w_{21} h_1 + w_{22} h_2 + b_2$
    - Here, $w_{21}$ and $w_{22}$ are the weights connecting the output neuron to the two hidden layer neurons, $h_1$ and $h_2$ are the outputs of the hidden layer neurons (calculated using their weighted sums and activation functions), and $b_2$ is the bias of the output neuron.
- Activation Function: After calculating the weighted sum, an activation function is applied to introduce non-linearity and allow the neuron to learn complex relationships. The choice of activation function depends on the specific task and network architecture. Some common activation functions include:
- Sigmoid: Outputs a value between 0 and 1.
- ReLU (Rectified Linear Unit): Outputs the input directly if positive, otherwise outputs 0.
- tanh (Hyperbolic tangent): Outputs a value between -1 and 1.
- The activation function is applied to the weighted sum to produce the final output of the neuron.
- This is a simplified example, and real-world neural networks can have many more layers and neurons. However, the concept of the weighted sum and its role in activation functions remains the same. We will see an end-to-end example.
- Bias: Imagine adding an extra, constant value to the weighted sum of inputs before sending it to the activation function. That constant value is the bias. It acts like an offset that allows the neuron to shift its activation function to the left or right, adjusting its sensitivity to different input combinations. Without a bias, the weighted sum is always zero when all inputs are zero, and a neuron whose weighted inputs are mostly negative might never be activated, effectively shutting it down. The bias term helps prevent this by providing a positive or negative offset, allowing the neuron to be activated even when the weighted inputs are negative. Bias also allows the neuron to learn relationships that are not possible with weighted inputs alone. For example, imagine a neuron trying to classify images containing cats or dogs. Without a bias, its decision boundary is forced to pass through the origin, so it might struggle to learn a suitable threshold for distinguishing the two classes. Introducing a bias shifts the activation function and creates a more appropriate decision boundary for classification.
- Output: A single numerical value calculated based on the processed information.
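The pieces above (inputs, weights, weighted sum, bias, activation, output) can be sketched for the 3-2-1 network described earlier. The input values, weights, and biases are made-up illustrative numbers, and sigmoid is assumed as the activation for every neuron; in a real network these parameters are learned during training.

```python
import math


def sigmoid(z):
    """Squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus the bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


# Illustrative values only (assumptions, not taken from the tutorial).
x = [1.0, 2.0, 3.0]                                # 3 input features

h1 = neuron(x, [0.2, -0.1, 0.4], 0.1, sigmoid)     # first hidden neuron
h2 = neuron(x, [-0.3, 0.2, 0.1], -0.2, sigmoid)    # second hidden neuron
y = neuron([h1, h2], [0.5, -0.6], 0.05, sigmoid)   # output neuron

print("hidden outputs:", h1, h2)
print("network output:", y)
```

The input layer does no computation here: `x` is passed straight to the hidden neurons, which matches the description of the input layer above.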
What does it mean when we say neurons are “fired”?
Imagine the neuron receives input values, applies weights to them, calculates the weighted sum, adds the bias term, and then passes this value through the activation function. If the resulting output from the activation function is greater than the threshold, the neuron is considered “fired.” A “fired” neuron in deep learning doesn’t necessarily mean it’s sending signals to other neurons like our brain cells (neurons).
Instead, it signifies that the neuron’s processing of the input has resulted in a non-zero or significant output value, indicating potential activation of its downstream connections. The “firing” behaviour of neurons is crucial for learning in artificial neural networks. The network learns by adjusting the weights based on the errors between its outputs and the desired outputs. If neurons don’t “fire” at all, they won’t contribute to the learning process.
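As a rough illustration of why a neuron that never fires cannot learn, the sketch below assumes a single-input neuron with a ReLU activation and compares the gradient of its output with respect to its weight when the neuron is active and when it is not. The helper name and the numbers are hypothetical.

```python
def neuron_output_and_weight_grad(x, w, b):
    """ReLU neuron with one input: returns (output, d(output)/d(w))."""
    z = w * x + b                       # weighted sum plus bias
    output = max(0.0, z)                # ReLU activation
    relu_slope = 1.0 if z > 0 else 0.0  # derivative of ReLU (taken as 0 at z = 0)
    return output, relu_slope * x       # chain rule: d(output)/dz * dz/dw


print(neuron_output_and_weight_grad(x=2.0, w=0.5, b=0.0))   # fired: z = 1.0
print(neuron_output_and_weight_grad(x=2.0, w=-0.5, b=0.0))  # silent: z = -1.0
```

When the neuron is silent, the gradient with respect to its weight is zero, so a gradient-based update would leave that weight unchanged for this input.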
Choosing the optimal number of neurons:
Now, you must be wondering how to choose the optimal number of neurons in each layer. So, the simple answer is to start with a simple model and then increase model complexity based on the score. It may depend on:
1. Data Complexity:
- Simple data: Fewer neurons might suffice for tasks like classifying basic shapes.
- Complex data: More neurons are often needed to capture intricate patterns in images, text, or other high-dimensional data.
2. Network Capacity:
- More neurons generally increase network capacity: The ability to learn and represent complex relationships.
- Too many neurons can lead to overfitting, where the network memorizes the training data too well and fails to generalize to unseen data.
3. Computational Resources:
- More neurons: Require more memory and processing power to train and run the network.
- Limited resources: Might necessitate using fewer neurons or optimizing network architecture for efficiency.
4. Layer Purpose:
- Input Layer: The number of neurons typically matches the number of input features in the data. Each input neuron in a neural network corresponds to a single feature in your input data. For example, if you’re working with images represented by 28×28 pixels, your input layer would have 28*28 = 784 neurons, one for each pixel value. Similarly, if your data is text represented as word embeddings with 300 dimensions, you would have 300 input neurons. Unlike hidden and output layer neurons, input layer neurons don’t have weights or biases and therefore don’t “learn” during training. They simply pass the raw input data to the network. The quality of the input representation has a significant impact on the network’s performance. Choosing appropriate features and preprocessing techniques is crucial for capturing meaningful information in the input layer.
- Hidden Layers: Finding the optimal number is often an iterative process through experimentation and evaluation. A common starting point is the average of the input and output layer sizes.
- Output Layer: Depends on the task:
- Single neuron for binary classification (e.g., cat/not cat).
- Multiple neurons for multi-class classification (e.g., different types of flowers).
- One neuron per output dimension for regression tasks (e.g., predicting house prices).
Different activation functions can impact the number of neurons needed. For example, ReLU activations might require fewer neurons compared to sigmoid activations. We will talk about these functions shortly.
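One way to get a feel for how layer sizes affect memory and compute is to count trainable parameters. The sketch below assumes fully connected layers, where every neuron has one weight per input plus one bias. The hidden sizes (128 and 512) and the 10 output classes are illustrative assumptions; only the 784 inputs come from the 28×28 image example above.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network.

    layer_sizes lists the number of neurons per layer, input layer first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out   # weight matrix + one bias per neuron
    return total


n_inputs = 28 * 28                              # 784 pixel values
print(count_parameters([n_inputs, 128, 10]))    # 101770
print(count_parameters([n_inputs, 512, 10]))    # 407050
```

Quadrupling the hidden layer roughly quadruples the parameter count, which is why layer width is usually increased gradually while watching validation performance.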
I hope this clarifies any doubts about neurons. Now let’s talk about layers.
Layers:
Layers play a crucial role in organizing and processing information. There are typically three types of layers: the input layer, hidden layers, and the output layer. The choice of the number of layers and neurons in each layer depends on the complexity of the task and the characteristics of the data. For simple tasks with linear relationships in the data, you can use fewer layers, called shallow networks (Linear regression problems). For complex tasks with non-linear patterns, use deep networks (Image recognition, natural language processing).
Hidden layers act as intermediate stages between the input and output layers, allowing the network to learn complex, non-linear relationships within the data. Hidden layers can be chosen based on performance evaluation, starting with one or two hidden layers and then experimenting based on performance. How many neurons to choose in a hidden layer? A common practice is to have a decreasing number of neurons in subsequent layers. It’s all about trial and error. We will see in practice.
The output layer is the final layer in a neural network, responsible for producing the network’s predictions or outputs based on the processed information from the previous layers. The structure and characteristics of the output layer depend on the nature of the task the neural network is designed to solve. For binary classification tasks (e.g., yes/no, 0/1), a single neuron in the output layer is typically used. For tasks with multiple classes, each class is represented by a neuron in the output layer. In regression tasks (predicting numerical values), the output layer usually consists of a single neuron.
The output layer also has an activation function, and the choice depends on the task at hand. For binary classification, the sigmoid (logistic) function is used to squash the output to a range between 0 and 1. For multiclass classification, the softmax activation function is often used. It converts raw scores into probabilities, making it easy to identify the most likely class. For regression, no activation function or a linear activation function may be used, allowing the network to output a wide range of numerical values.
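The two classification cases can be sketched in plain Python using the standard definitions of sigmoid and softmax. The raw scores are made-up values for illustration, and subtracting the maximum score inside `softmax` is a common numerical-stability trick that does not change the result.

```python
import math


def sigmoid(z):
    """Binary classification: maps a raw score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    """Multi-class classification: maps raw scores to probabilities summing to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]   # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


print(sigmoid(0.0))                   # 0.5: the model is undecided

probs = softmax([2.0, 1.0, 0.1])      # hypothetical scores for 3 classes
print(probs, sum(probs))
print("predicted class:", probs.index(max(probs)))
```

For regression, the raw weighted sum would simply be returned as is, which corresponds to the linear (or no) activation mentioned above.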
Role of Activation Functions:
We have mentioned activation functions several times but what do they do exactly? Imagine a neural network without activation functions. Each layer would simply perform a linear transformation of the input, essentially multiplying by weights and adding a bias. Stacking multiple linear layers wouldn’t lead to anything more complex than a single linear transformation, severely limiting the network’s ability to learn intricate patterns in data.
The real world is rarely linear. Data often exhibits complex, non-linear relationships that linear models cannot capture. Activation functions introduce non-linearity into the network, allowing it to model these intricate relationships and perform tasks like image recognition, language understanding, and more. Activation functions act as “gates” within the network, determining whether a neuron should “fire” or not based on the weighted sum of its inputs. This creates decision boundaries that separate different classes or patterns in the data, allowing the network to make accurate predictions.
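The claim that stacked linear layers collapse into a single linear transformation can be checked with a small sketch. It uses one-input, one-output "layers" with arbitrary illustrative weights (assumptions), and then inserts a ReLU between them to show the difference.

```python
def linear(x, w, b):
    """A one-neuron linear layer: w * x + b."""
    return w * x + b


w1, b1 = 2.0, 1.0     # first layer (illustrative values)
w2, b2 = -3.0, 0.5    # second layer


def two_linear_layers(x):
    return linear(linear(x, w1, b1), w2, b2)


def single_linear_layer(x):
    # Same function, with combined weight w2*w1 and combined bias w2*b1 + b2.
    return linear(x, w2 * w1, w2 * b1 + b2)


def with_relu(x):
    # A ReLU between the layers breaks the collapse.
    return linear(max(0.0, linear(x, w1, b1)), w2, b2)


for x in [-2.0, 0.0, 1.5]:
    print(x, two_linear_layers(x), single_linear_layer(x), with_relu(x))
```

The first two columns always agree, so the extra linear layer adds no expressive power. With the ReLU in between, the output for negative pre-activations differs, which is the non-linearity the text describes.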
Common Activation Functions:
I have explained these activation functions in the notes below but first, let’s see which ones we use. Also, don’t get confused about how to use them, I will show you shortly!
Highly Popular and Widely Used:
- ReLU (Rectified Linear Unit): Simple, efficient, avoids vanishing gradients, promotes sparsity. Used in diverse tasks, especially for deep networks.
- Leaky ReLU: Mitigates dying ReLU problem, and retains efficiency benefits of ReLU. A popular choice for tasks with frequent negative inputs.
- Softmax: Outputs probability distribution across multiple classes. Essential for multi-class classification tasks.
Other Popular Options:
- ELU (Exponential Linear Unit): Smooth, avoids vanishing gradients, addresses dying ReLU. Used in image classification, natural language processing, and time series forecasting.
- Swish: Smooth, non-monotonic, avoids vanishing gradients. Recent development, showing promise in various tasks.
- Mish: Similar to Swish (smooth and non-monotonic), though somewhat more expensive to compute. A newer option, gaining traction in diverse tasks.
Less Common but Promising:
- SELU (Scaled Exponential Linear Unit): Self-normalizing, potentially faster convergence. Used in image recognition, natural language processing, and time series forecasting.
- Parametric ReLU (PReLU): Flexible, captures complex relationships. Used in image recognition, natural language processing, and speech recognition.
- Maxout: Increased representational power, but computationally expensive. Used in variational autoencoders, image recognition, and natural language processing.
Choosing the Right One:
- Task: ReLU, Leaky ReLU, and Softmax are often default choices for specific roles (e.g., ReLU for hidden layers, Softmax for the output layer in multi-class classification).
- Data characteristics: Consider input ranges and distributions when choosing (e.g., sigmoid and Softmax saturate when their inputs are very large in magnitude, which slows learning).
- Network architecture: Deeper networks benefit from options that mitigate vanishing gradients, like ReLU or Leaky ReLU.
- Computational resources: ReLU and Leaky ReLU are generally more efficient than complex options like PReLU or Maxout.
We have already seen the sigmoid function in depth in the logistic regression tutorial, now let’s see other activation functions:
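Here is a minimal sketch of a few of the activation functions listed above, written in plain Python from their standard definitions. The default `alpha` values are common choices and are assumed here for illustration; deep learning libraries provide their own tuned implementations.

```python
import math


def relu(z):
    """Outputs z if positive, otherwise 0."""
    return max(0.0, z)


def leaky_relu(z, alpha=0.01):
    """Like ReLU, but keeps a small slope alpha for negative inputs."""
    return z if z > 0 else alpha * z


def elu(z, alpha=1.0):
    """Smooth for negative inputs, approaching -alpha as z decreases."""
    return z if z > 0 else alpha * (math.exp(z) - 1.0)


def swish(z):
    """z * sigmoid(z): smooth and non-monotonic."""
    return z / (1.0 + math.exp(-z))


def mish(z):
    """z * tanh(softplus(z)); fine for small demo inputs."""
    return z * math.tanh(math.log1p(math.exp(z)))


for z in [-2.0, 0.0, 2.0]:
    print(z, relu(z), leaky_relu(z), elu(z), math.tanh(z), swish(z), mish(z))
```

Printing the values side by side shows the main difference between them: how each one treats negative inputs.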
Loss Functions:
Now, let’s move on to our next topic of loss functions that we are already familiar with. In deep learning, the primary goal is to teach a model to make accurate predictions or decisions based on input data. The loss function quantifies how far off the model’s predictions are from the actual values (ground truth) in the training data. The objective during training is to minimize this loss, effectively guiding the model towards making better predictions. I have defined common loss functions in the document below:
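As a small illustration of what "quantifying how far off the predictions are" means, the sketch below implements two widely used losses from their standard definitions: mean squared error for regression and binary cross-entropy for binary classification. The sample targets and predictions are made up, and the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average negative log-likelihood for 0/1 labels and predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


print(mean_squared_error([3.0, 5.0], [2.5, 5.5]))    # 0.25

confident = binary_cross_entropy([1, 0], [0.9, 0.2])  # good predictions
unsure = binary_cross_entropy([1, 0], [0.6, 0.6])     # worse predictions
print(confident, unsure)
```

Better predictions give a lower loss, which is the quantity training tries to minimize.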
Training Deep Learning Models: Backpropagation
The remarkable ability of deep learning models to learn complex patterns and achieve impressive results stems from a powerful technique called backpropagation. We already know about SGD and deep learning optimizers from the tutorial on SGD. Let’s see what backpropagation is and how we use it to train our model.
A Historical Perspective:
- Early Days (1950s – 1970s):
  - The concept of learning in artificial neural networks existed, but training them effectively was a major challenge.
  - Perceptrons, simple single-layer networks, faced limitations in representing complex functions.
  - Gradient descent, a fundamental optimization algorithm, was explored, but there was no practical way to compute gradients for the hidden layers of multi-layer networks.
- The Backpropagation Breakthrough (1970s – 1980s):
  - Researchers like David Rumelhart, Geoffrey Hinton, and Ronald Williams popularized backpropagation in 1986, an efficient method for computing the gradients that gradient descent needs in multi-layer networks.
  - This breakthrough enabled training deeper networks, paving the way for the resurgence of deep learning.
- The Deep Learning Explosion (2000s – Present):
  - Advancements in computing power and the availability of large datasets fueled the widespread adoption of deep learning.
  - Backpropagation remains at the core of training deep learning models, with various optimizations and improvements developed along the way.
So, how does it work? Imagine you’re trying to teach a computer how to recognize cats in pictures. You start by showing it many cat images and telling it, “This is a cat.” The computer tries its best to identify cats, but initially, it might make mistakes.
1. Forward Pass:
- You give the computer a picture and let it make a guess—like, “Is this a cat?”
- The computer’s guess might be wrong, but that’s okay for now.
2. Calculate Error:
- You then tell the computer, “Actually, this is a cat,” and it realizes how wrong it was.
- You calculate the difference between its guess and the correct answer. This difference is the “error.”
3. Backward Pass (Backpropagation):
- Now, here’s where the magic happens. Backpropagation is like helping the computer learn from its mistakes.
- You look at each part of the computer’s brain (neural network) that contributed to the wrong guess.
- For each part, you ask, “How much did you contribute to the mistake?”
- Imagine it’s like telling the computer, “The tail in the picture was important, but you missed that. Pay more attention to tails next time.”
4. Adjust Weights:
- The computer then adjusts the “weights” in its brain. These weights determine how much importance it gives to different features (like tails, ears, etc.).
- If the tail is crucial and it misses it, the computer increases the importance of the tail by adjusting its weights.
5. Repeat:
- You repeat this process many times—showing pictures, calculating errors, and helping the computer learn from its mistakes.
- With each iteration, the computer gets better at recognizing cats because it fine-tunes its brain based on the errors it made.
In simple terms, backpropagation is like a learning loop where the computer gradually gets better at understanding what features matter in identifying cats. It’s a way of teaching the computer to adjust its “thinking” based on the mistakes it makes, so it becomes more accurate over time. In more formal terms, these are the steps followed:
- Forward Pass: The input data flows through the network’s layers, producing an output prediction.
- Loss Calculation: The loss function compares the predicted output with the actual value, generating a numerical error score.
- Backward Pass: The error is propagated backwards through the network, calculating the gradient of the loss with respect to each parameter.
- Parameter Update: The optimizer uses the gradients to adjust the network’s weights in a direction that minimizes the loss.
- Repeat: The four steps above are repeated for each batch of training examples, over many passes through the data (epochs), gradually improving the model’s performance.
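The steps above can be sketched end to end in plain NumPy. This is a minimal illustration rather than how a framework implements it. It assumes a tiny 2 → 4 → 1 network with a tanh hidden layer, a sigmoid output, binary cross-entropy loss, and full-batch updates on the XOR problem. All sizes, the seed, and the learning rate are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: XOR, which a single-layer perceptron cannot represent.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0], [1.0], [1.0], [0.0]])

# Parameters of a 2 -> 4 -> 1 network.
W1 = rng.normal(0.0, 1.0, size=(2, 4))
b1 = np.zeros((1, 4))
W2 = rng.normal(0.0, 1.0, size=(4, 1))
b2 = np.zeros((1, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

learning_rate = 0.5
losses = []
for step in range(2000):
    # 1. Forward pass
    h = np.tanh(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)

    # 2. Loss calculation (binary cross-entropy)
    p_safe = np.clip(p, 1e-12, 1.0 - 1e-12)
    loss = -np.mean(y * np.log(p_safe) + (1.0 - y) * np.log(1.0 - p_safe))
    losses.append(loss)

    # 3. Backward pass: apply the chain rule from the output back to the input
    dz2 = (p - y) / len(X)            # gradient w.r.t. the pre-sigmoid output
    dW2 = h.T @ dz2
    db2 = dz2.sum(axis=0, keepdims=True)
    dh = dz2 @ W2.T
    dz1 = dh * (1.0 - h ** 2)         # tanh'(z) = 1 - tanh(z)^2
    dW1 = X.T @ dz1
    db1 = dz1.sum(axis=0, keepdims=True)

    # 4. Parameter update (plain gradient descent)
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
print("predictions:", np.round(p.ravel(), 3))
```

Frameworks such as TensorFlow compute the backward pass automatically, so in practice you only write the forward pass and choose a loss and an optimizer.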
Backpropagation can be computationally expensive, especially for large and deep networks, so techniques like mini-batch training and efficient gradient calculations are crucial for practical applications. In deep networks, gradients can also become very small or very large as they propagate backwards through many layers; these problems are known as vanishing and exploding gradients. Techniques like ReLU activation functions and normalization layers help mitigate these issues.
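A back-of-the-envelope sketch shows why the choice of activation matters here. During the backward pass, the gradient is multiplied by one activation derivative per layer. The simplified example below ignores the weights and assumes the same pre-activation value `z` at every layer, which is an illustrative simplification and not how a real network behaves.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

depth = 20
z = 0.5  # assumed pre-activation at every layer (illustration only)

# Derivative of sigmoid is s * (1 - s), which never exceeds 0.25.
sigmoid_grad = sigmoid(z) * (1.0 - sigmoid(z))
# Derivative of ReLU is exactly 1 for positive inputs.
relu_grad = 1.0 if z > 0 else 0.0

print("sigmoid factor after", depth, "layers:", sigmoid_grad ** depth)
print("ReLU factor after", depth, "layers:   ", relu_grad ** depth)
```

The sigmoid factor shrinks geometrically with depth, while the ReLU factor stays at 1 for active units, which is the intuition behind preferring ReLU-style activations in deep networks.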
We already know about the optimizers, but what exactly do they do during the training process? Most optimizers build on gradient descent. This involves calculating the gradient of the loss function with respect to each parameter. Because the gradient points in the direction of steepest increase, the optimizer updates the model’s parameters by taking a step in the opposite direction. The size of this step is determined by the learning rate, another important hyperparameter that influences the training process.
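The update rule can be written as w ← w − learning_rate × gradient. The minus sign matters: the step goes against the gradient, towards lower loss. Here is a minimal sketch on a one-parameter toy loss, (w − 3)², chosen purely for illustration.

```python
def gradient(w):
    # Derivative of the toy loss (w - 3)^2, whose minimum is at w = 3.
    return 2.0 * (w - 3.0)

def gradient_descent(learning_rate, steps=50, w=0.0):
    for _ in range(steps):
        w = w - learning_rate * gradient(w)  # step against the gradient
    return w

w_good = gradient_descent(learning_rate=0.1)  # small steps: converges
w_bad = gradient_descent(learning_rate=1.1)   # too large: overshoots and diverges

print("learning_rate=0.1 ->", w_good)
print("learning_rate=1.1 ->", w_bad)
```

With the smaller learning rate, the parameter settles near the minimum at 3. With the larger one, each step overshoots further than the last, which is why the learning rate usually needs tuning.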
Selecting the optimal optimizer depends on several factors, including:
- The type of data you are working with: Some optimizers perform better with specific data types or problem settings.
- The complexity of your model: Deeper models might benefit from adaptive optimizers like Adam or RMSprop.
- The computational resources available: Some optimizers are more computationally expensive than others.
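To illustrate what "adaptive" means, the sketch below compares plain gradient descent with the standard Adam update (bias-corrected running averages of the gradient and its square) on a toy quadratic whose two parameters have very different curvature. The loss, starting point, and hyperparameters are assumptions chosen for illustration, and the helper names are hypothetical.

```python
import numpy as np

def loss(w):
    # Steep along w[0], shallow along w[1].
    return 0.5 * (100.0 * w[0] ** 2 + w[1] ** 2)

def grad(w):
    return np.array([100.0 * w[0], w[1]])

def sgd_minimize(w0, learning_rate=0.01, steps=200):
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        w -= learning_rate * grad(w)  # one global step size for all parameters
    return w

def adam_minimize(w0, learning_rate=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=200):
    w = np.array(w0, dtype=float)
    m = np.zeros_like(w)  # running mean of gradients
    v = np.zeros_like(w)  # running mean of squared gradients
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g ** 2
        m_hat = m / (1.0 - beta1 ** t)  # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        w -= learning_rate * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter step
    return w

start = [1.0, 1.0]
w_sgd = sgd_minimize(start)
w_adam = adam_minimize(start)
print("start loss:", loss(np.array(start)))
print("SGD  loss :", loss(w_sgd))
print("Adam loss :", loss(w_adam))
```

Plain gradient descent has to keep its single learning rate small enough for the steep direction, which slows progress along the shallow one. Adam rescales each parameter's step separately. Which optimizer works better still depends on the problem, so treat this as a starting point for experiments.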
Experiment with different optimizers and hyperparameter settings to find the best combination for your specific deep-learning task. I have shared the calculations below; if you want to visualize them, visit the source link mentioned in the document:
Now let’s see how we can apply MLPs to real-world applications using TensorFlow and Keras:
FNN In TensorFlow:
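As a starting point, here is a minimal sketch of a feedforward network (MLP) in Keras. It assumes TensorFlow 2.x and uses randomly generated data with 20 features and 3 classes purely as a placeholder, so the accuracy will be close to chance. The layer sizes, epoch count, and batch size are arbitrary choices; swap in a real dataset to get meaningful results.

```python
import numpy as np
import tensorflow as tf

# Placeholder data (an assumption for illustration): 500 samples, 20 features, 3 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20)).astype("float32")
y = rng.integers(0, 3, size=500)

# Two ReLU hidden layers followed by a softmax output layer.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),
])

# The optimizer and loss are set here; backpropagation is handled by the framework.
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

history = model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2, verbose=0)

# Each row of the prediction is a probability distribution over the 3 classes.
probs = model.predict(X[:5], verbose=0)
print(probs.shape)
```

Calling `model.summary()` prints the layers and parameter counts, and `history.history` holds the per-epoch loss and accuracy if you want to plot the training curves with Matplotlib.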
Further Readings: