What is the Curse of Dimensionality?

The Curse of Dimensionality refers to the non-intuitive properties of data that emerge when working in high-dimensional space, specifically around the usability and interpretation of distances and volumes. This is one of my favourite topics in Machine Learning and Statistics: it has broad applications (it is not specific to any one machine learning method), it is deeply counter-intuitive and hence awe-inspiring, it has profound implications for any analytics technique, and it has a 'cool', scary name, like some Egyptian curse!

For a quick grasp, consider this: say you dropped a coin somewhere on a 100-meter line. How do you find it? Simple: just walk along the line and search. But what if it is a 100 x 100 sq. m field?
It is already getting tough: you are now searching (roughly) a football ground for a single coin. And what if it is a 100 x 100 x 100 cu. m space? That is a football ground with a thirty-story building stacked on top of it.
Good luck finding a coin there! That, in essence, is the "curse of dimensionality".

Many ML methods use a distance measure

Most segmentation and clustering methods rely on computing distances between observations. The well-known k-Means algorithm assigns points to their nearest center; DBSCAN and hierarchical clustering also require distance metrics.
Distribution- and density-based outlier detection algorithms also make use of distances relative to other distances to mark outliers. Supervised classification methods like k-Nearest Neighbours use the distance between observations to assign a class to an unknown observation, and Support Vector Machines transform observations around selected kernels based on the distance between each observation and the kernel. Common forms of recommendation systems rely on distance-based similarity between user and item attribute vectors. Even when other forms of distance are used, the number of dimensions plays a role in the analytic design.

One of the most common distance metrics is the Euclidean distance, which is simply the straight-line distance between two points in multi-dimensional hyperspace. The Euclidean distance between point i and point j in n-dimensional space can be computed as:

d(i, j) = √[ (xi1 − xj1)² + (xi2 − xj2)² + … + (xin − xjn)² ]

where xik is the value of the k-th dimension for point i.

Distance plays havoc in high dimensions

Consider the simple process of data sampling. Suppose the black outer box in Fig. 1 is the data universe, with data points distributed uniformly across its whole volume, and that we want to sample 1% of the observations, as enclosed by the red inner box.
The black box is a hypercube in multi-dimensional space, with each side representing the range of values in that dimension. For the simple 3-dimensional example in Fig. 1, we might have ranges like x∈(0,100) and y∈(0,500), with a similar range for z.

Figure 1: Sampling

What proportion of each range should we sample to obtain that 1% sample?
In 2 dimensions, sampling 10% of each range achieves an overall 1% sample, so we may select x∈(0,10) and y∈(0,50) and expect to capture 1% of all observations. This is because 10% × 10% = 1%. Do you expect this proportion to be higher or lower in 3 dimensions?

Even though our search now extends in an additional direction, the proportion actually increases to 21.5%, since 0.01^(1/3) ≈ 0.215. It does not just increase: for one additional dimension, it more than doubles! We have to cover more than one-fifth of each dimension just to capture one-hundredth of the whole. In 10 dimensions this proportion is 63%, and in 100 dimensions, which is not an uncommon number of dimensions in real-life machine learning, one has to sample 95% of the range along each dimension just to sample 1% of the observations!
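If you want to convince yourself of these numbers, a few lines of Python are enough: the fraction of each dimension's range needed to capture a fraction p of a uniformly distributed hypercube is p^(1/d).

```python
import numpy as np

# Fraction of each dimension's range needed to capture 1% of a uniformly
# distributed hypercube: p ** (1 / d) for target fraction p and d dimensions.
target_fraction = 0.01
for d in [1, 2, 3, 10, 100]:
    side = target_fraction ** (1.0 / d)
    print(f"{d:>3} dimensions: sample {side:.1%} of each dimension's range")
```

Running this prints roughly 1%, 10%, 21.5%, 63% and 95% for 1, 2, 3, 10 and 100 dimensions, matching the figures above.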
This mind-bending result occurs because, in high dimensions, the spread of data points becomes larger even when they are uniformly distributed. It has consequences for the design of experiments and for sampling: the process becomes very computationally expensive, to the extent that the sampled range asymptotically approaches the population range even while the sample size remains much smaller than the population.

Consider another huge consequence of high dimensionality.
Many algorithms measure the distance between two data points to define some sort of nearness (DBSCAN, kernel methods, k-Nearest Neighbours) with reference to a pre-defined distance threshold. In 2 dimensions, we can imagine that two points are near if one falls within a certain radius of the other. Consider the left image in Fig. 2: what share of uniformly spaced points within the black square falls inside the red circle? That is about π/4 ≈ 78.5%.

Figure 2: Near-ness

So if you fit the biggest circle possible inside the square, you cover about 78% of the square. Yet the biggest sphere possible inside a cube covers only π/6 ≈ 52.4% of the volume. This ratio shrinks exponentially, down to about 0.24% in just 10 dimensions!
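These ratios follow directly from the formula for the volume of a d-dimensional ball; a quick check using only Python's standard library reproduces them:

```python
import math

# Volume of the largest hypersphere inscribed in a unit hypercube (radius 0.5),
# as a fraction of the cube's volume: pi^(d/2) * 0.5^d / Gamma(d/2 + 1).
for d in [2, 3, 10]:
    ball = math.pi ** (d / 2) * 0.5 ** d / math.gamma(d / 2 + 1)
    print(f"{d:>2}-D: inscribed ball fills {ball:.2%} of the cube")
```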
What this essentially means is that, in a high-dimensional world, every data point sits near a corner and nothing really occupies the center of the volume; the central volume shrinks to (almost) nothing. This has huge consequences for distance-based clustering algorithms: all distances start to look the same, and any distance being a bit more or less than another reflects random fluctuation in the data more than any real measure of dissimilarity!

Fig. 3 shows randomly generated 2-D data and the corresponding all-to-all distances. The coefficient of variation of the distances, computed as standard deviation divided by mean, is 45.9%. The corresponding figure for similarly generated 5-D data is 26.5%, and for 10-D data it is 19.1%.
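As a rough sketch of that experiment (assuming NumPy and SciPy are available), you can generate uniform random points and compute the coefficient of variation of all pairwise distances yourself; the exact numbers will differ from those above because they depend on the random sample, but the downward trend with dimension should hold:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(42)
n_points = 1000

# Coefficient of variation (std / mean) of all pairwise Euclidean distances
# between uniformly random points, for increasing dimensionality.
for d in [2, 5, 10, 100]:
    X = rng.uniform(size=(n_points, d))
    distances = pdist(X)                      # all-to-all Euclidean distances
    cv = distances.std() / distances.mean()
    print(f"{d:>3}-D: coefficient of variation = {cv:.1%}")
```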
Admittedly this is just one simulation, but the trend supports the conclusion that in high dimensions every distance is about the same, and nothing is really near or far!

Figure 3: Distance Clustering

High dimensionality affects other things too

Apart from distances and volumes, the number of dimensions creates other practical problems. Solution run-time and memory requirements often escalate non-linearly with the number of dimensions. Because the space of feasible solutions grows exponentially, many optimization methods cannot reach a global optimum and have to make do with a local one. Further, instead of a closed-form solution, optimization must often use search-based algorithms such as gradient descent, genetic algorithms, and simulated annealing.
More dimensions also introduce the possibility of correlation among features, which can make parameter estimation difficult in regression approaches.

Dealing with high dimensionality

This deserves a separate blog post in itself, but correlation analysis, clustering, information value, variance inflation factor, and principal component analysis are some of the ways in which the number of dimensions can be reduced. As a recap, the number of variables or features that make up a data point is called the dimension of the data: any point in physical space, for instance, can be represented using the 3 coordinates of length, breadth, and height, and so has 3 dimensions.
Welcome to Part 2 of our tour through modern machine learning algorithms. In this part, we’ll cover methods for Dimensionality Reduction, further broken into Feature Selection and Feature Extraction.
In general, these tasks are rarely performed in isolation. Instead, they're often preprocessing steps that support other tasks. If you missed Part 1, you can go back and read it first: it explains our methodology for categorizing algorithms, and it covers the "Big 3" machine learning tasks: Regression, Classification, and Clustering. In this part, we'll cover Feature Selection and Feature Extraction. We will cover other tasks, such as Density Estimation and Anomaly Detection, in dedicated guides in the future.
The Curse of Dimensionality

In machine learning, "dimensionality" simply refers to the number of features (i.e. input variables) in your dataset. When the number of features is very large relative to the number of observations in your dataset, certain algorithms struggle to train effective models. This is called the "Curse of Dimensionality," and it's especially relevant for algorithms that rely on distance calculations.

A user has provided an excellent analogy for the Curse of Dimensionality, which we'll borrow here: let's say you have a straight line 100 yards long and you dropped a penny somewhere on it.
It wouldn't be too hard to find. You walk along the line and it takes two minutes.

Now let's say you have a square 100 yards on each side and you dropped a penny somewhere on it. It would be pretty hard, like searching across two football fields stuck together. It could take days.

Now a cube 100 yards across. That's like searching a 30-story building the size of a football stadium. Ugh.

The difficulty of searching through the space gets a lot harder as you add more dimensions. In this guide, we'll look at the 2 primary methods for reducing dimensionality: Feature Selection and Feature Extraction.
Feature Selection

Feature selection is for filtering irrelevant or redundant features from your dataset. The key difference between feature selection and feature extraction is that feature selection keeps a subset of the original features, while feature extraction creates brand new ones.

To be clear, some supervised algorithms already have built-in feature selection, such as Regularized Regression and Random Forests. Typically, we recommend starting with these algorithms if they fit your task.
They're covered in Part 1 of this series. As a stand-alone task, feature selection can be unsupervised (e.g. Variance Thresholds) or supervised (e.g. Genetic Algorithms). You can also combine multiple methods if needed.
4.1. Variance Thresholds

Variance thresholds remove features whose values don't change much from observation to observation (i.e. their variance falls below a threshold). These features provide little value. For example, if you had a public health dataset where 96% of observations were for 35-year-old men, then the 'Age' and 'Gender' features could be eliminated without a major loss of information. Because variance is dependent on scale, you should always normalize your features first.

Strengths: Applying variance thresholds is based on solid intuition: features that don't change much also don't add much information. This is an easy and relatively safe way to reduce dimensionality at the start of your modeling process.
Weaknesses: If your problem does require dimensionality reduction, applying variance thresholds is rarely sufficient. Furthermore, you must manually set or tune a variance threshold, which could be tricky. We recommend starting with a conservative (i.e. lower) threshold.
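As a minimal sketch of how this might look with scikit-learn's VarianceThreshold (the toy data, column meanings, and the 0.05 threshold are invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import VarianceThreshold

n = 500
# Toy data echoing the public-health example above: 'age' and 'gender' are the
# same for almost every observation, while 'income' varies freely.
age    = np.full(n, 35.0);  age[:10]    = 50.0     # 98% identical
gender = np.zeros(n);       gender[:10] = 1.0      # 98% identical
income = np.linspace(30_000, 120_000, n)           # spread across its range
X = np.column_stack([age, gender, income])

# Variance depends on scale, so bring every feature onto [0, 1] first.
X_scaled = MinMaxScaler().fit_transform(X)
print(np.round(X_scaled.var(axis=0), 3))   # near-constant columns show tiny variance

selector = VarianceThreshold(threshold=0.05)   # threshold chosen for this toy data
X_reduced = selector.fit_transform(X_scaled)
print(selector.get_support())              # [False False  True] -> only 'income' kept
print(X_reduced.shape)                     # (500, 1)
```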
4.2. Correlation Thresholds

Correlation thresholds remove features that are highly correlated with others (i.e. their values change very similarly to another feature's). These features provide redundant information. For example, if you had a real-estate dataset with 'Floor Area (Sq. Ft.)' and 'Floor Area (Sq. Meters)' as separate features, you can safely remove one of them.

Which one should you remove? Well, you'd first calculate all pair-wise correlations. Then, if the correlation between a pair of features is above a given threshold, you'd remove the one that has the larger mean absolute correlation with the other features.

Strengths: Applying correlation thresholds is also based on solid intuition: similar features provide redundant information. Some algorithms are not robust to correlated features, so removing them can boost performance.
Weaknesses: Again, you must manually set or tune a correlation threshold, which can be tricky to do. Plus, if you set your threshold too low, you risk dropping useful information. Whenever possible, we prefer algorithms with built-in feature selection over correlation thresholds. Even for algorithms without built-in feature selection, Principal Component Analysis (PCA) is often a better alternative.
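Here is one possible sketch of a correlation-threshold filter using pandas. For simplicity it drops the second member of each highly correlated pair rather than comparing mean absolute correlations as described above; the data and the 0.95 threshold are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200

# Toy real-estate data: the two floor-area columns are (nearly) perfectly correlated.
sq_m = rng.uniform(40, 200, n)
df = pd.DataFrame({
    "floor_area_sq_m":  sq_m,
    "floor_area_sq_ft": sq_m * 10.764 + rng.normal(0, 1, n),  # same info, different units
    "n_rooms":          rng.integers(1, 6, n),
    "year_built":       rng.integers(1950, 2020, n),
})

corr = df.corr().abs()                                   # pair-wise absolute correlations
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # upper triangle only

threshold = 0.95                                         # assumed threshold
to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
print(to_drop)                                           # ['floor_area_sq_ft']
df_reduced = df.drop(columns=to_drop)
print(df_reduced.columns.tolist())
```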
4.3. Genetic Algorithms (GA)

Genetic algorithms (GA) are a broad class of algorithms that can be adapted to different purposes. They are search algorithms inspired by evolutionary biology and natural selection, combining mutation and cross-over to efficiently traverse large solution spaces.
In machine learning, GAs have two main uses. The first is optimization, such as finding the best weights for a neural network. The second is supervised feature selection. In this use case, 'genes' represent individual features and the 'organism' represents a candidate set of features. Each organism in the 'population' is graded on a fitness score, such as model performance on a hold-out set. The fittest organisms survive and reproduce, repeating until the population converges on a solution some generations later.

Strengths: Genetic algorithms can efficiently select features from very high-dimensional datasets, where exhaustive search is infeasible.
When you need to preprocess data for an algorithm that doesn't have built-in feature selection (e.g. nearest neighbors) and when you must preserve the original features (i.e. no PCA allowed), GAs are likely your best bet. These situations can arise in business/client settings that require a transparent and interpretable solution.

Weaknesses: GAs add a higher level of complexity to your implementation, and they aren't worth the hassle in most cases. If possible, it's faster and simpler to use PCA or to directly use an algorithm with built-in feature selection.
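To make the mechanics concrete, here is a toy, from-scratch sketch of GA-based feature selection. The population size, mutation rate, and synthetic dataset are arbitrary choices for illustration; in practice you would likely reach for a dedicated GA library rather than rolling your own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=40, n_informative=8,
                           n_redundant=4, random_state=0)

def fitness(mask):
    """Cross-validated accuracy of a model trained on the selected features."""
    if mask.sum() == 0:
        return 0.0
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X[:, mask], y, cv=3).mean()

# GA hyperparameters (illustrative values)
pop_size, n_generations, mutation_rate = 20, 15, 0.05

# Each 'organism' is a boolean mask over the 40 features ('genes').
population = rng.random((pop_size, X.shape[1])) < 0.5

for gen in range(n_generations):
    scores = np.array([fitness(ind) for ind in population])
    order = np.argsort(scores)[::-1]
    survivors = population[order[: pop_size // 2]]       # keep the fittest half

    children = []
    while len(children) < pop_size - len(survivors):
        a, b = survivors[rng.integers(len(survivors), size=2)]
        crossover = rng.random(X.shape[1]) < 0.5          # uniform cross-over
        child = np.where(crossover, a, b)
        mutate = rng.random(X.shape[1]) < mutation_rate   # random bit flips
        children.append(np.where(mutate, ~child, child))
    population = np.vstack([survivors, children])

best = population[np.argmax([fitness(ind) for ind in population])]
print("selected features:", np.flatnonzero(best))
```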
4.4. Honorable Mention: Stepwise Search

Stepwise search is a supervised feature selection method based on sequential search, and it has two flavors: forward and backward. For forward stepwise search, you start without any features. Then you'd train a 1-feature model using each of your candidate features and keep the version with the best performance.
You'd continue adding features, one at a time, until your performance improvements stall.

Backward stepwise search is the same process, just reversed: start with all the features in your model and then remove one at a time until performance starts to drop substantially.

We note this algorithm purely for historical reasons. Despite many textbooks listing stepwise search as a valid option, it almost always underperforms other supervised methods such as regularization. Stepwise search has several flaws, one of the most fatal being that it's a greedy algorithm that can't account for the future effects of each change. We don't recommend this method.
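Purely for illustration (not as a recommendation), forward selection can be sketched with scikit-learn's SequentialFeatureSelector; here we stop at a fixed number of features rather than "until improvements stall":

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)      # 30 candidate features
X = StandardScaler().fit_transform(X)

# Forward stepwise search: greedily add the single best feature at each step.
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=5,
                                direction="forward", cv=3)
sfs.fit(X, y)
print(sfs.get_support(indices=True))            # indices of the selected features
```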
Feature Extraction

Feature extraction is for creating a new, smaller set of features that still captures most of the useful information. Again, feature selection keeps a subset of the original features, while feature extraction creates new ones.

As with feature selection, some algorithms already have built-in feature extraction. The best example is Deep Learning, which extracts increasingly useful representations of the raw input data through each hidden neural layer. As a stand-alone task, feature extraction can be unsupervised (e.g. PCA) or supervised (e.g. LDA).

5.1. Principal Component Analysis (PCA)

Principal component analysis (PCA) is an unsupervised algorithm that creates linear combinations of the original features.
The new features are orthogonal, which means that they are uncorrelated. Furthermore, they are ranked in order of their 'explained variance.' The first principal component (PC1) explains the most variance in your dataset, PC2 explains the second-most variance, and so on. Therefore, you can reduce dimensionality by limiting the number of principal components to keep based on cumulative explained variance. For example, you might decide to keep only as many principal components as needed to reach a cumulative explained variance of 90%.

You should always normalize your dataset before performing PCA because the transformation is dependent on scale. If you don't, the features that are on the largest scale will dominate your new principal components.
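As a minimal sketch with scikit-learn (using its built-in breast cancer dataset purely as an example), passing a float to n_components keeps just enough components to reach that cumulative explained variance:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_breast_cancer(return_X_y=True)      # 30 original features

# Normalize first, since PCA is dependent on scale.
X_scaled = StandardScaler().fit_transform(X)

# Keep as many components as needed for 90% cumulative explained variance.
pca = PCA(n_components=0.90)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                                    # far fewer than 30 columns
print(np.cumsum(pca.explained_variance_ratio_).round(3))  # cumulative explained variance
```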
Strengths: PCA is a versatile technique that works well in practice. It's fast and simple to implement, which means you can easily test algorithms with and without PCA to compare performance. In addition, PCA offers several variations and extensions (e.g. kernel PCA, sparse PCA, etc.) to tackle specific roadblocks.

Weaknesses: The new principal components are not interpretable, which may be a deal-breaker in some settings.
In addition, you must still manually set or tune a threshold for cumulative explained variance.

5.2. Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA), not to be confused with latent Dirichlet allocation, also creates linear combinations of your original features. However, unlike PCA, LDA doesn't maximize explained variance; instead, it maximizes the separability between classes. LDA is therefore a supervised method that can only be used with labeled data.

So which is better: LDA or PCA? Well, results will vary from problem to problem, and the same 'No Free Lunch' theorem from Part 1 applies. The LDA transformation is also dependent on scale, so you should normalize your dataset first.
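A minimal sketch with scikit-learn, using the built-in wine dataset as an example (note that LDA can produce at most one fewer component than the number of classes):

```python
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_wine(return_X_y=True)               # 13 features, 3 classes

# LDA is also scale-dependent, so normalize first.
X_scaled = StandardScaler().fit_transform(X)

# LDA is supervised: it uses the class labels y, and with 3 classes it can
# produce at most 2 components.
lda = LinearDiscriminantAnalysis(n_components=2)
X_reduced = lda.fit_transform(X_scaled, y)
print(X_reduced.shape)                           # (178, 2)
```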
Strengths: LDA is supervised, which can (but doesn't always) improve the predictive performance of the extracted features. Furthermore, LDA offers variations (e.g. quadratic LDA) to tackle specific roadblocks.

Weaknesses: As with PCA, the new features are not easily interpretable, and you must still manually set or tune the number of components to keep. LDA also requires labeled data, which makes it more situational.
5.3. Autoencoders

Autoencoders are neural networks that are trained to reconstruct their original inputs. For example, image autoencoders are trained to reproduce the original images rather than to classify each image as a dog or a cat.

So how is this helpful? Well, the key is to structure the hidden layer to have fewer neurons than the input/output layers. That hidden layer is then forced to learn a smaller representation of the original image.
Because you use the input image as the target output, autoencoders are considered unsupervised. They can be used directly (e.g. image compression) or stacked in sequence (e.g. deep learning).

Strengths: Autoencoders are neural networks, which means they perform well for certain types of data, such as image and audio data.

Weaknesses: Autoencoders are neural networks, which means they require more data to train, and they are not used as general-purpose dimensionality reduction algorithms.
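As a minimal sketch, assuming TensorFlow/Keras is available (the 64-feature random data and the 8-unit bottleneck are placeholders for illustration):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Toy data: assume X is an (n_samples, 64) array scaled to [0, 1].
X = np.random.rand(1000, 64).astype("float32")

# Encoder: 64 -> 8 bottleneck; decoder: 8 -> 64 reconstruction.
inputs = keras.Input(shape=(64,))
encoded = layers.Dense(8, activation="relu")(inputs)
decoded = layers.Dense(64, activation="sigmoid")(encoded)

autoencoder = keras.Model(inputs, decoded)
encoder = keras.Model(inputs, encoded)           # used later to extract the 8-D codes

autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=20, batch_size=32, verbose=0)  # input == target

X_compressed = encoder.predict(X)                # (1000, 8) reduced representation
print(X_compressed.shape)
```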
Parting Words

We've just taken a whirlwind tour through modern algorithms for Dimensionality Reduction, broken into Feature Selection and Feature Extraction. We'll leave you with the same parting advice as in Part 1: practice, practice, practice.
Grab a dataset and strike while the iron is hot. Master the fundamentals. For example, it's more fruitful to first understand the differences between PCA and LDA than to dive into the nuances of LDA versus quadratic-LDA.
Remember, better data beats fancier algorithms. We repeat this a lot, but it's the honest darn truth!