A Framework for High Dimensional Data Reduction,
Selectivity Estimation
and NN search
With the increased abilities for automated data collection made possible
by modern technology, the typical sizes of data collections have continued
to grow in recent years. In such cases, it may be desirable to store the
data in a reduced format in order to improve the storage, transfer time,
and processing requirements on the data.
One of the challenges of designing effective data compression techniques is
to be able to preserve the ability to use the reduced format directly for a
wide range of database and data mining applications. In this paper, we propose
the novel idea of hierarchical subspace sampling in order to create a reduced
representation of the data. The method is naturally able to estimate the local
implicit dimensionalities of each point very effectively, and thereby create
a variable dimensionality reduced representation of the data. Such a technique
has the advantage that it is very adaptive about adjusting its representation
depending upon the behavior of the immediate locality of a data point. An
interesting property of the subspace sampling is that unlike all other data
reduction techniques, the overall efficiency of compression {improves} with
increasing database size. This is a highly desirable property for any data
reduction system since the problem itself is motivated by the large size
of data sets. Because of its sampling approach, the procedure is extremely
fast and scales linearly both with data set size and dimensionality.
Furthermore, the subspace sampling technique is able to reveal important
local subspace characteristics of high dimensional data, which can be harnessed
for effective solutions to problems such as selectivity estimation and approximate nearest
neighbor search.