Clustering is a form of unsupervised learning in which groups of data points are formed according to a similarity measure, so that similar items end up in the same group. The most commonly used clustering methods are k-means and hierarchical clustering. These algorithms use a distance, such as the Euclidean or Manhattan distance, as the similarity measure. Feature subset selection becomes a challenging task in unsupervised learning because of the unavailability of prior class label information. This approach, however, has not been applied to a variety of datasets, and more investigation is required before the strategy can be used in generic scenarios.
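To make the distance-based grouping described above concrete, the following is a minimal k-means sketch in plain NumPy (the function and toy data are illustrative; in practice a library implementation such as scikit-learn's would be used):

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Minimal k-means: assign each point to its nearest centroid
    (Euclidean distance), then recompute centroids, for n_iter rounds."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Euclidean distance from every point to every centroid;
        # for Manhattan distance one would use np.abs(...).sum(axis=2) instead.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each centroid to the mean of the points assigned to it.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated blobs: k-means should place them in different clusters.
X = np.vstack([np.zeros((5, 2)), np.ones((5, 2)) * 10.0])
labels, centroids = kmeans(X, k=2)
```

The same loop accommodates other similarity measures by swapping the distance computation, which is why the choice of measure matters for the resulting groups.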
The table below provides a comparative analysis of the four dimensionality reduction schemes introduced in this chapter, summarizing the advantages and disadvantages of each scheme for comparison.
Scheme: Feature Evaluation and Ranking
  Advantages: Achieves high classification accuracy.
  Disadvantages: High computation time, since each feature must be evaluated with respect to the class.

Scheme: Linear DR (e.g. PCA)
  Advantages: Well suited for unsupervised learning and datasets with a very large number of dimensions; low noise sensitivity.
  Disadvantages: Expensive to evaluate the covariance matrix; assumes that features with maximum variance are the most important ones; depends upon the scaling of the data.

Scheme: Non-linear DR
  Advantages: Reduces dimensionality efficiently for datasets having non-linear structure.
  Disadvantages: High complexity of algorithms; large computational time.

Scheme: Clustering-based
  Advantages: Identifies a feature subset; low computational time; works for both supervised and unsupervised learning.
  Disadvantages: Applied only to a limited number of datasets; does not work well for datasets having non-linear structure.
The main objective of this chapter is to identify the various schemes used to reduce the dimensionality of high-dimensional datasets in order to improve the accuracy and time complexity of machine learning algorithms such as classification and clustering. This chapter broadly categorizes DR methods into four categories. The first method, using feature evaluation and ranking algorithms, is observed to achieve high classification accuracy but takes high computation time, since each feature must be evaluated with respect to the class. Hence, this scheme needs improvement in terms of reducing the computation time.
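The per-feature evaluation that drives this cost can be sketched in a few lines of NumPy. The scoring criterion here (absolute correlation with the class label) is only an illustrative choice, not the specific criterion studied in the chapter; any per-feature relevance measure fits the same loop, which is why the cost grows with the number of features:

```python
import numpy as np

def rank_features(X, y, k):
    """Score every feature against the class label y and keep the top k.
    One score is computed per feature, so cost scales with X.shape[1]."""
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                       for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]   # indices of the k best features

# Feature 0 tracks the class almost exactly; feature 1 is pure noise,
# so the informative feature should rank first.
rng = np.random.default_rng(1)
y = np.array([0, 1] * 20)
X = np.column_stack([y + 0.01 * rng.standard_normal(40),
                     rng.standard_normal(40)])
top = rank_features(X, y, k=1)
```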
The second scheme uses linear DR algorithms such as PCA. The major issues with PCA are that the principal components are not easy to interpret and that PCA depends on the scaling of the data; the features providing maximal variance are assumed to be the most important ones. The third scheme uses non-linear DR methods to reduce the dimensions of datasets having an inherently non-linear structure. Studies have shown that for such datasets, non-linear methods outperform linear methods on various performance measures, and potential subclasses within the data can also be identified using non-linear methods. The drawbacks of non-linear methods are the high complexity of the algorithms and the large computational time. It has also been observed that for a few non-linear datasets, PCA gives better accuracy than non-linear methods.
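Both the maximal-variance assumption and the scaling dependence of PCA can be seen in a bare-bones sketch built directly from the covariance matrix (illustrative only; production code would use a library implementation):

```python
import numpy as np

def pca(X, n_components):
    """Bare-bones PCA: centre the data, build the covariance matrix,
    and keep the eigenvectors with the largest eigenvalues, i.e. the
    directions of maximum variance, which PCA assumes matter most."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)        # the costly covariance step
    vals, vecs = np.linalg.eigh(cov)      # eigh: covariance is symmetric
    order = np.argsort(vals)[::-1][:n_components]
    return Xc @ vecs[:, order]            # project onto top components

# The same data in different units yields different components,
# illustrating PCA's dependence on the scaling of the data.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
X_scaled = X * np.array([1.0, 100.0, 1.0])  # blow up one feature's scale
Z = pca(X, 2)
Z_scaled = pca(X_scaled, 2)
```

After rescaling, the first component is dominated by the inflated feature, even though the underlying data are the same, which is why standardization is usually applied before PCA.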
The last category uses clustering algorithms for dimensionality reduction. It has been experimentally shown that feature clustering using hierarchical clustering can significantly reduce dimensions while achieving high accuracy and low computational time. This method, however, has not been applied to a variety of datasets. It is concluded that, in order to select a dimensionality reduction scheme, we should consider the type of dataset and the specific requirements of the machine learning algorithm; the comparison table above can be referred to for this purpose. A combination of schemes may also be used so that the advantages of one scheme offset the disadvantages of another.
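To make the feature-clustering idea concrete, here is a small pure-NumPy sketch. The greedy single-linkage merge and the correlation-based distance are illustrative choices under stated assumptions, not the specific algorithm evaluated in the chapter:

```python
import numpy as np

def cluster_features(X, n_groups):
    """Group similar (here: highly correlated) features by agglomerative
    single-linkage clustering, then keep one representative feature per
    group, shrinking dimensionality from X.shape[1] to n_groups."""
    corr = np.corrcoef(X, rowvar=False)
    dist = np.clip(1.0 - np.abs(corr), 0.0, None)  # correlated -> close
    clusters = [[j] for j in range(X.shape[1])]
    while len(clusters) > n_groups:
        # Find and merge the pair of clusters with the smallest
        # single-linkage (minimum pairwise) distance.
        best, best_d = (0, 1), np.inf
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist[p, q] for p in clusters[i] for q in clusters[j])
                if d < best_d:
                    best_d, best = d, (i, j)
        i, j = best
        clusters[i] += clusters.pop(j)
    reps = sorted(c[0] for c in clusters)  # one representative per cluster
    return X[:, reps], reps

rng = np.random.default_rng(0)
a = rng.standard_normal(200)
b = rng.standard_normal(200)
# Four features but only two underlying signals: copies of a and of b.
X = np.column_stack([a, a + 0.01 * rng.standard_normal(200), b, b * 2])
X_red, reps = cluster_features(X, n_groups=2)
```

Because near-duplicate features collapse into one cluster, the reduced dataset keeps one column per underlying signal, which is the sense in which feature clustering reduces dimensionality cheaply.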