Dimensionality Reduction Techniques - Part I

Ismailouahbi
5 min read · Mar 27, 2023



Dealing with large quantities of data, especially unstructured data, can be hard if we don't choose the right technique: one that preserves the information contained in the data while reducing its size to speed up computations.

In this article, I will share some dimensionality reduction methods that help you preserve the maximum amount of information while reducing the size of your data.

Questions to answer:

I will be covering dimensionality reduction methods applied to structured data and unstructured data (images).

Before we start, here are some questions to keep in mind:

  • How do we deal with large quantities of data?
  • How can we preserve most of the information in the data while reducing its size?
  • Which technique should we use when dealing with large quantities of data?
  • How can we decide that technique X is the best fit for our business problem?

Dimensionality Reduction:


Dimensionality reduction, or dimension reduction, is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data.

Working in high-dimensional spaces can be undesirable for many reasons; raw data are often sparse as a consequence of the curse of dimensionality, and analyzing the data is usually computationally intractable (hard to control or deal with).

Dimensionality reduction. (2023, February 2). In Wikipedia. https://en.wikipedia.org/wiki/Dimensionality_reduction

What’s next?

As I said in my previous article, I will try to explain each topic in a beginner-friendly way, so that even non-technical readers can follow along.

How to deal with big quantities of data?

While simple tasks may only require a small portion of the data to gain insights and answer business questions, in data science work we often deal with very large datasets. We therefore need to be familiar with big data processing methods and how to start using them.

This is a non-exhaustive list, but it gives an overall view of the process:

  1. Use a distributed file system: If you have a lot of data, it may not fit on a single machine. A distributed file system such as Hadoop Distributed File System (HDFS) or Amazon S3 can help you store and access your data across multiple machines.
  2. Employ parallel processing: Parallel processing can help you process large quantities of data more quickly by spreading the workload across multiple processors or machines. Tools like Apache Spark, Apache Hadoop, and Apache Flink are popular choices for distributed data processing.
  3. Employ data sampling: Instead of processing the entire dataset, you can use statistical sampling to select a representative subset of the data for analysis. This can be a faster and more cost-effective way to analyze large datasets (see the short sketch after this list).
  4. Apply dimensionality reduction: Techniques such as PCA can help simplify a dataset and make it easier to work with. This is particularly useful when many variables or features are correlated.
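
As a quick illustration of suggestion 3, here is a minimal sampling sketch with pandas; the file path and the 10% fraction are arbitrary assumptions made for the example:

# minimal sampling sketch (hypothetical dataset path, illustrative fraction)
import pandas as pd

df = pd.read_csv('./data/large_dataset.csv')
# keep a random 10% of the rows; random_state makes the sample reproducible
sample_df = df.sample(frac=0.1, random_state=42)
print(sample_df.shape)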

As you can see, suggestion number 4 relies on a dimensionality reduction method to deal with large amounts of data, and that is the topic covered throughout this series of articles.

PCA


Principal component analysis (PCA) is a popular technique for analyzing large datasets containing a high number of dimensions/features per observation, increasing the interpretability of data while preserving the maximum amount of information, and enabling the visualization of multidimensional data.

Principal component analysis. (2023, March 10). In Wikipedia. https://en.wikipedia.org/wiki/Principal_component_analysis
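
To make "preserving the maximum amount of information" concrete, here is a minimal sketch on synthetic, correlated data (the data is invented purely for illustration) showing how much of the total variance each principal component retains:

# minimal sketch on synthetic correlated data (invented for illustration)
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
signal = rng.normal(size=500)
# three features driven by the same underlying signal, plus a little noise
X = np.column_stack([
    signal,
    2 * signal + rng.normal(scale=0.1, size=500),
    -signal + rng.normal(scale=0.1, size=500),
])

pca = PCA(n_components=2)
pca.fit(X)
# fraction of the total variance captured by each component
print(pca.explained_variance_ratio_)

Because the three features are highly correlated, the first component alone captures almost all of the variance; that redundancy is exactly what PCA exploits.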

Some Use cases of PCA

  • Facial Recognition
  • Image Compression (a toy sketch follows this list)
  • Finding patterns in high-dimensional data
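
As a rough illustration of the image compression use case, here is a toy sketch in which a random array stands in for a real grayscale image (the image, its size, and the number of components are all assumptions made for the example):

# toy PCA image-compression sketch (random array standing in for a grayscale image)
import numpy as np
from sklearn.decomposition import PCA

image = np.random.rand(256, 256)  # hypothetical 256x256 grayscale image

# treat each of the 256 rows as a sample with 256 features and keep 32 components
pca = PCA(n_components=32)
compressed = pca.fit_transform(image)               # shape (256, 32)
reconstructed = pca.inverse_transform(compressed)   # approximate (256, 256) image

Storing the 32-component representation (plus the fitted components) takes far less space than the full image, at the cost of a lossy reconstruction.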

Let’s jump to coding


In the following chunks of code, my goal is to demonstrate PCA as a method, not to cover data analysis and cleaning steps; for those, please refer to this article, in which I explain data cleaning step by step.

(Note: in this demonstration I'll be using the famous IRIS dataset to perform a basic PCA; in upcoming articles I'll cover different dimensionality reduction techniques applied to other datasets.)

# Import necessary libraries

#numpy for numerical computing
import numpy as np
#sklearn for machine learning
from sklearn.decomposition import PCA
#pandas for data manipulation and analysis
import pandas as pd

# Load the dataset
path = './data/IRIS.csv'
data = pd.read_csv(path)

# shape of data
data.shape

## output
# (150, 5): 150 rows and 5 columns

# data overview (first 5 rows)
data.head()

## output
# first 5 rows of the dataset: four measurement columns plus the species label
# Split the data into features and labels
X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values

# Standardize the features
from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit_transform(X)

This is an important step: it ensures that each feature has a mean of 0 and a standard deviation of 1, which matters because PCA is sensitive to the scale of the features.
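
As an optional sanity check (a small addition to the original walkthrough), you can confirm the standardization directly:

# optional sanity check: means should be ~0 and standard deviations ~1
print(np.round(X.mean(axis=0), 6))
print(np.round(X.std(axis=0), 6))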

# Create a PCA object with the desired number of components
# here we're using 2 components
pca = PCA(n_components=2)

# Fit the PCA model on the standardized features
principalComponents = pca.fit_transform(X)

# Create a new dataframe to store the principal components and the labels
principalDf = pd.DataFrame(data = principalComponents, columns = ['principal component 1', 'principal component 2'])
finalDf = pd.concat([principalDf, data[['species']]], axis = 1)

# preview final data frame
finalDf.head()

## output
# first 5 rows: the two principal component columns plus the species label
# Visualize the results
import matplotlib.pyplot as plt
fig = plt.figure(figsize = (8,8))
ax = fig.add_subplot(1,1,1)
ax.set_xlabel('Principal Component 1', fontsize = 15)
ax.set_ylabel('Principal Component 2', fontsize = 15)
ax.set_title('2 component PCA', fontsize = 20)
targets = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
colors = ['r', 'g', 'b']
for target, color in zip(targets, colors):
    indicesToKeep = finalDf['species'] == target
    ax.scatter(finalDf.loc[indicesToKeep, 'principal component 1'],
               finalDf.loc[indicesToKeep, 'principal component 2'],
               c=color,
               s=50)
ax.legend(targets)
ax.grid()
plt.show()

In this example, we create a scatter plot of the first two principal components, with different colors for each label.
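
To quantify how much information the two components retain, you can also inspect the explained variance ratio of the fitted PCA object (a small addition to the original walkthrough):

# fraction of the total variance captured by each of the two components
print(pca.explained_variance_ratio_)
# total variance retained by the 2-component representation
print(pca.explained_variance_ratio_.sum())

For the standardized IRIS features, the two components together typically retain roughly 95% of the total variance.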

Conclusion

As seen in this article, we've performed a basic, easy-to-understand PCA and explained how it affects the final outcome.
Since this is a series of articles, my next article will cover PCA hyperparameter tuning as well as Kernel PCA (KPCA) for further analysis.

Thanks.
