Keyword Analysis of Very Large Databases
This is an analysis of, and commentary on, the keywords in VLDB paper
titles since 1975. Each keyword's per-year counts form a vector; these vectors
are clustered by shape, highlighting the rise and fall of many topics in our area.
Each light line is the count per year for a keyword.
The dark line is the mean of the light lines.
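The counting behind those lines is simple enough to sketch. Below is a minimal, hypothetical Python version -- the sample titles, the `counts_per_year` and `mean_line` names, and the substring matching are all illustrative assumptions, not the actual analysis code:

```python
from collections import Counter

# Hypothetical (year, title) pairs standing in for the real VLDB corpus.
papers = [
    (1996, "Parallel Query Optimization for Data Warehouses"),
    (1996, "Materialized Views in Data Warehouses"),
    (1997, "Indexing Multidimensional Data Warehouses"),
]

def counts_per_year(papers, keyword):
    """One 'light line': how many titles mention `keyword` in each year."""
    counts = Counter()
    for year, title in papers:
        if keyword in title.lower():
            counts[year] += 1
    return counts

def mean_line(lines):
    """The 'dark line': average several per-year count series, year by year."""
    years = sorted({y for line in lines for y in line})
    return {y: sum(line.get(y, 0) for line in lines) / len(lines) for y in years}

light = [counts_per_year(papers, k) for k in ("warehouse", "parallel")]
dark = mean_line(light)
```

Missing years count as zero when averaging, so a keyword that vanishes drags the cluster's dark line down rather than being silently skipped.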
The analysis and data are all on GitHub.
See the old analysis at vldb.html.
The Birth (1980s)
Our field was born amid the impending downfall of CODASYL.
During this time, researchers were designing the abstractions,
specifications and interfaces
that laid the groundwork for a nascent field.
Growing Pains (1990)
Early on, we were still trying to understand the right
model for data representation (spoiler: not object-oriented).
This era also introduced classic topics like
rule-based optimization, transactions,
and the Query-vs-SQL question.
Big Iron (1995)
Moving forward, databases have been adopted by major organizations
that want performance. This is the era of warehouses
containing multidimensional data.
Computers are cheap enough that DeWitt et al. can think about
parallel execution across multiple servers.
All Business (2000)
Warehousing continues into the early 2000s, and it's
business time for technologies like OLAP.
We now have more than one server, so maintenance and
client-server caching become issues.
The Internet hits the scene and will fundamentally change everything.
This cluster is a mish-mash of classic and stable
topics like databases, data management, queries and systems,
but introduces searching over your data.
Databases are fundamentally a business tool, and
businesses love XML, XQuery and Oracle.
This era also brought
forward Surajit's automated database tuning,
stream processing like STREAM, TelegraphCQ, and Borealis,
and peer-to-peer systems like Gnutella, Chord and others.
The author worked on his first stream processing paper
during this time.
The internet and humans mean dirty data. Trio hoped to use
probabilistic methods for uncertain data to deal
with these issues.
We also see data mining techniques like
pattern matching, top-k, and similarity.
Scalability hints at the next "revolution".
SoCC was founded to deal with the cloud (+mapreduce).
Partitioning is a classic trick to make things go fast.
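As a purely illustrative sketch of the trick -- the rows and the `partition` helper here are made up -- hash partitioning assigns each row to one of N buckets by hashing its key, so each server only scans its own share:

```python
def partition(rows, key, n_parts):
    """Hash-partition rows into n_parts buckets by a key function."""
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        # Same key always hashes to the same bucket within a run.
        parts[hash(key(row)) % n_parts].append(row)
    return parts

# Illustrative rows: ten orders keyed by user id.
rows = [{"user": f"u{i % 3}", "amount": i} for i in range(10)]
parts = partition(rows, key=lambda r: r["user"], n_parts=4)
```

Because all rows sharing a key land in the same bucket, a per-key aggregate (say, total amount per user) can run on each partition independently, with no cross-server traffic.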
Differential privacy is explored as a way of securing data
in the cloud.
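A minimal sketch of the core idea, the Laplace mechanism for a counting query -- the function names and parameter choices here are illustrative, not taken from any particular system:

```python
import math
import random

def laplace_noise(scale, rng=random):
    """Draw one sample from a zero-mean Laplace distribution via inverse CDF."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(true_count, epsilon, sensitivity=1.0):
    """Release a count with epsilon-differential privacy.

    A counting query has sensitivity 1 (one person changes the count by
    at most 1), so noise with scale sensitivity/epsilon suffices.
    """
    return true_count + laplace_noise(sensitivity / epsilon)

noisy = private_count(1000, epsilon=0.5)
```

Smaller epsilon means more noise and stronger privacy; the released count is still accurate in expectation, since the noise has mean zero.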
We finally have the infrastructure and social data
to work on real graph problems --
many of the best paper awards from this time focused on them.
As a young researcher, what's the bandwagon to jump on?
The upward trending terms hint at several topics.
Performance and scaling with new hardware like GPUs
is one direction. Data analysis -- such as frequent
itemset mining, diversifying results, and in general
helping users discover interesting results -- is hugely
important in the coming years.
Or work on the crowd! That still has a few years left in it.