Before I start, I would like to mention that these 19 basic machine learning questions have been taken from the amazing book Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow by Aurélien Géron. I am trying to put the answers to all the basic questions in one place so that anyone can revisit this post to get an overall idea or refresh their concepts.
Table of Contents
1. What Is Machine Learning?
Ans: Machine Learning is the science of programming computers so that they can learn from given data. (Comment your definitions as well 😅)
A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. – Tom Mitchell, 1997
2. What Kind Of Problems Can Be Solved With Machine Learning?
Ans: Machine learning is most helpful for problems that are either too complex for traditional programming approaches or have no known algorithmic solution. For example, email spam filtering or voice recognition is hard to solve with a traditional rule-based approach, whereas with ML the program is shorter, easier to maintain, and likely to be more accurate. Additionally, machine learning models can adapt to new data and can offer deeper insight into the data or the problem at hand based on the patterns they learn.
3. What Is A Labeled Training Set?
Ans: A training set that includes the desired solution for each instance. These solutions are called labels. Such training sets are used in supervised learning problems like classification (e.g., logistic regression) and regression (predicting a numeric value).
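As a minimal sketch, assuming scikit-learn is available, here is what a labeled training set looks like in code, using the toy iris dataset bundled with the library:

```python
# A labeled training set: a feature matrix X plus the desired solutions y.
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target  # X: features, y: labels (the desired solutions)
print(X.shape)  # 150 instances, 4 features each
print(y[:5])    # the labels for the first five instances
```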
4. What Are The Types Of Machine Learning Systems?
Ans: Depending on the amount and type of supervision during the training phase, ML systems can be classified as:
- Supervised Learning
- Unsupervised Learning
- Semisupervised Learning
- Reinforcement Learning
5. What Are The Two Most Common Supervised Tasks?
Ans: Classification And Regression problems.
6. What Are The Most Important Supervised Algorithms?
- k-Nearest Neighbors
- Linear Regression
- Logistic Regression
- Support Vector Machines (SVMs)
- Decision Trees and Random Forests
- Neural networks (Some can be unsupervised and semisupervised too)
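As a hedged sketch of the supervised workflow shared by the algorithms above, here is one of them (logistic regression) trained and evaluated on the iris toy dataset; the split ratio and `max_iter` value are arbitrary choices for illustration:

```python
# Supervised learning in a nutshell: fit on labeled data, score on held-out data.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = LogisticRegression(max_iter=1000)  # raise max_iter so the solver converges
clf.fit(X_train, y_train)                # learn from the labeled examples
print(clf.score(X_test, y_test))         # accuracy on unseen, labeled test data
```

Swapping `LogisticRegression` for `KNeighborsClassifier`, `SVC`, or `DecisionTreeClassifier` follows the same fit/score pattern.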
7. What Is Unsupervised Learning?
Ans: In unsupervised learning, the training data passed to the model is unlabeled, unlike in supervised learning.
8. What Are The Most Important Unsupervised Algorithms?
Ans: The most important unsupervised algorithms are:
- Clustering: (K-Means, DBSCAN, Hierarchical Cluster Analysis (HCA)) Application: Detecting groups of similar visitors on your blog etc.
- Anomaly detection and novelty detection: (One-class SVM, Isolation Forest) Application: Detecting unusual credit card transactions to prevent fraud. Models are trained on normal instances to detect anomalies.
- Visualization and dimensionality reduction: (Principal Component Analysis (PCA), Kernel PCA, Locally-Linear Embedding (LLE), t-distributed Stochastic Neighbor Embedding (t-SNE)) Application: Simplify the data without losing too much information by merging several correlated features into one.
- Association rule learning: (Apriori, Eclat) Application: Models dig into large amounts of data and discover interesting relations between attributes.
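As a minimal sketch of two of the tasks above, assuming scikit-learn is available, here is clustering with K-Means and dimensionality reduction with PCA on the iris toy data (note the labels are deliberately ignored):

```python
# Unsupervised learning: no labels are passed to either algorithm.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)  # discard the labels: unsupervised setting

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(kmeans.labels_[:10])  # cluster assignments for the first ten instances

pca = PCA(n_components=2)   # merge correlated features into two components
X_2d = pca.fit_transform(X)
print(X_2d.shape)           # the same 150 instances, reduced to 2 features
```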
9. What Is Semisupervised Learning?
Ans: Models are trained on a lot of unlabeled data together with a small amount of labeled data. Most semisupervised learning algorithms are combinations of unsupervised and supervised algorithms.
10. What Is Reinforcement Learning?
Ans: It involves the ideas of agents, environments, rewards, penalties, and policies. An agent (the learning system) observes its environment (such as a game screen), selects and performs actions, and receives rewards or penalties (negative rewards) in return. It then learns the best strategy, called a policy, to maximize its rewards over time.
11. What Is the Difference Between Batch Learning And Online Learning?
- Batch Learning (Offline Learning): The model is trained offline on the full dataset and then launched into production, where it applies only what it learned during training. For the model to learn about new data, it must be retrained from scratch on both the old and the new data. Training can be done periodically (say, every 24 hours or once a week), but it can be a costly affair if we have huge amounts of data.
- Online Learning: Models are trained sequentially on incoming streams of data (either one instance at a time or in mini-batches). The model learns incrementally from new data, but remember: garbage in, garbage out. Bad data will produce bad results.
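Online learning can be sketched with scikit-learn's `partial_fit`, which updates a model incrementally. This is a hedged illustration: the data stream below is a made-up linear signal (y = 3x + 1), and the batch size and stream length are arbitrary:

```python
# Online learning: update the model batch by batch instead of retraining offline.
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(42)
model = SGDRegressor(random_state=42)

for _ in range(300):                     # simulate a stream of mini-batches
    X_batch = rng.uniform(-1, 1, (32, 1))
    y_batch = 3 * X_batch.ravel() + 1    # hypothetical signal: y = 3x + 1
    model.partial_fit(X_batch, y_batch)  # learn incrementally from this batch

print(model.coef_, model.intercept_)     # should approach 3 and 1
```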
12. What Is The Difference Between Instance-Based Versus Model-Based Learning?
- Instance-Based Learning: The system learns the examples by heart, then generalizes to new cases by comparing them to the learned examples (or a subset of them), using a similarity measure.
- Model-based learning: Creating a model based on a given set of examples and then using the model to make predictions. Basically, it’s studying the data, choosing a model, training the model and making predictions on new data.
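The contrast above can be sketched on the same tiny dataset: k-nearest neighbors memorizes the examples and predicts from the most similar ones, while linear regression learns a parameter (the slope) and generalizes through it. The data here is a made-up toy example (y = 2x):

```python
# Instance-based vs model-based learning on the same toy data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # toy data following y = 2x
y = np.array([2.0, 4.0, 6.0, 8.0])

# Instance-based: stores the examples; predicts by averaging similar ones.
knn = KNeighborsRegressor(n_neighbors=2).fit(X, y)

# Model-based: fits a parameterized model; predicts from the learned line.
lin = LinearRegression().fit(X, y)

print(knn.predict([[2.5]]))  # average of neighbors y=4 and y=6 -> 5.0
print(lin.predict([[2.5]]))  # from the learned line y = 2x -> 5.0
```

Both predict 5.0 here, but they arrive at it very differently: one by comparison with stored instances, the other via learned parameters.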
13. What Are The Main Challenges Of Machine Learning?
Ans: The main challenges can be:
- Insufficient Quantity of Training Data: Try to add more data for better learning.
- Nonrepresentative Training Data: Try to make the training set more representative of the cases you want to generalize to. Watch out for sampling noise (nonrepresentative data by chance, when the sample is too small) and sampling bias (a flawed sampling method, which can skew even very large samples).
- Poor-Quality Data: Try to spend more time cleaning up the data.
- Irrelevant Features: Try to use better features using feature engineering.
14. What Is Feature Engineering?
Ans: In simple words, feature engineering is the process of coming up with a good set of features for training your model so that it can make relevant predictions. The process involves:
- Feature selection: selecting the most useful features to train on among existing features.
- Feature extraction: combining existing features to produce a more useful one.
- Creating new features by gathering new data.
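As a minimal sketch of feature extraction, two raw counts are combined into a ratio that is often more predictive than either one alone. The column names and values here are hypothetical (inspired by housing-style data):

```python
# Feature extraction: combine two raw features into a more useful one.
import numpy as np

# Hypothetical raw features: total rooms and households per district.
total_rooms = np.array([1200.0, 3000.0, 800.0])
households = np.array([400.0, 600.0, 200.0])

# The extracted ratio captures dwelling size better than either raw count.
rooms_per_household = total_rooms / households
print(rooms_per_household)  # [3. 5. 4.]
```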
15. What Is Generalization Error?
Ans: The error rate on new cases (e.g., on the test set) is called the generalization error.
16. What Is Overfitting And Its Possible Solution?
Ans: If the training error is low but the generalization error is high then the model is said to be overfitting the training data. Overfitting means that the model performs well on the training data, but it does not generalize well. Here is what Geron recommends:
- Simplify the model by selecting one with fewer parameters (e.g., a linear model rather than a high-degree polynomial model), by reducing the number of attributes in the training data or by constraining the model
- Gather more training data.
- Reduce the noise in the training data (e.g., fix data errors and remove outliers)
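Constraining a model (the first recommendation above) is called regularization. Here is a hedged sketch: the same high-degree polynomial features are fit once without constraints and once with Ridge's L2 penalty; the degree, noise level, and `alpha` are arbitrary illustrative choices:

```python
# Regularization as a cure for overfitting: Ridge constrains the weights.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, (30, 1)), axis=0)
y = 0.5 * X.ravel() ** 2 + rng.normal(scale=1.0, size=30)  # noisy quadratic

# Unconstrained degree-10 polynomial: free to fit the noise.
overfit = make_pipeline(PolynomialFeatures(10), LinearRegression()).fit(X, y)
# Same features, but Ridge penalizes large weights (constrains the model).
ridge = make_pipeline(PolynomialFeatures(10), Ridge(alpha=1.0)).fit(X, y)

print(overfit.score(X, y))  # higher training score (some of it is noise)
print(ridge.score(X, y))    # no higher on training data, typically generalizes better
```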
17. What Is Underfitting And Its Possible Solution?
Ans: Underfitting happens when the model is too simple to learn the underlying structure of the data. In simple terms, it shows up as a high error rate on both the training and the test data. Possible solutions:
- Select a more powerful model, with more parameters.
- Feed better features to the learning algorithm (feature engineering)
- Reduce the constraints on the model (e.g., reduce the regularization hyperparameter).
18. What Is Validation Set?
Ans: When we are comparing models (or the same model with different hyperparameters), we hold out a fraction of the training data to evaluate the candidates and pick the best one. This held-out set is called the validation set (or dev set, or development set). The winning model is then retrained on the full training set to get the final model. If the validation set is too small, the evaluations will be imprecise; if it is too large, the remaining training data may be too small to train on properly. In those cases, perform cross-validation with many small validation sets to choose the best model.
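A minimal sketch of this selection process, assuming scikit-learn is available: two candidate models are compared via 5-fold cross-validation on the iris toy data, so every instance serves in a validation fold once. The candidates and fold count are arbitrary choices for illustration:

```python
# Model selection: score each candidate on several validation folds.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

candidates = [LogisticRegression(max_iter=1000),
              DecisionTreeClassifier(random_state=42)]
for model in candidates:
    scores = cross_val_score(model, X, y, cv=5)  # 5 validation folds
    print(type(model).__name__, scores.mean())   # mean validation accuracy
```

The model with the best mean validation score would then be retrained on the full training set.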
19. What Is No Free Lunch Theorem?
Ans: The theorem states that if you make no assumptions at all about the data, then no model is guaranteed to be a better fit than any other. You could evaluate every possible model, but in practice you save time by making some reasonable assumptions about the data and evaluating only a few sensible models for your problem.
Well, thanks for reading. Please note: in a few cases, I have borrowed the exact definitions or wording used by the author due to their simplicity and clarity. Make sure to leave your thoughts and let me know what I missed.