Keyword Analysis of Very Large Databases

This is an analysis of, and commentary on, the keywords in VLDB paper titles since 1975. The per-year count vector for each keyword is clustered by shape, highlighting the rise and fall of many topics in our area.
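
As a rough sketch of what "clustered by shape" means here: take each keyword's per-year title counts, z-score them so only the trend's shape matters, and run a standard clustering algorithm. The snippet below is illustrative only; the `counts` table (keywords as rows, years as columns) is a hypothetical stand-in, not the actual code in the repo.

```python
# Illustrative sketch only: `counts` is a hypothetical DataFrame whose rows
# are keywords and whose columns are years; each cell holds how many VLDB
# titles that year contain the keyword. The real pipeline is on GitHub.
import numpy as np
from sklearn.cluster import KMeans

def cluster_by_shape(counts, n_clusters=10, seed=0):
    X = counts.to_numpy(dtype=float)
    # Z-score each keyword's time series so clusters group keywords by the
    # shape of their trend (when they rise and fall), not by raw volume.
    X = (X - X.mean(axis=1, keepdims=True)) / (X.std(axis=1, keepdims=True) + 1e-9)
    return KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(X)
```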

Each light line is the count per year for a keyword.
The dark line is the mean of the light lines.
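
A minimal sketch of how one of these cluster plots could be drawn, assuming the same hypothetical `counts` table and the `labels` array from the clustering sketch above:

```python
import matplotlib.pyplot as plt

def plot_cluster(counts, labels, cluster_id):
    members = counts[labels == cluster_id]
    years = members.columns
    for _, series in members.iterrows():
        plt.plot(years, series.values, color="lightgray")  # light line: one keyword's count per year
    plt.plot(years, members.mean(axis=0), color="black")   # dark line: mean of the light lines
    plt.xlabel("year")
    plt.ylabel("titles containing keyword")
    plt.title("cluster %d" % cluster_id)
    plt.show()
```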

The analysis and data are all on GitHub.
See the old analysis at vldb.html.

The Birth (1980s)

Our field was born amid the impending downfall of CODASYL. During this time, researchers were designing the abstractions, specifications, and interfaces that laid the groundwork for a nascent field.
Keywords: designs, file, abstraction, paper, invited, codasyl, specifications, conceptual, interface, office

Growing Pains (1990)

Early on, we were still trying to understand the right model for data representation (spoiler: not object-oriented). This era also introduced classic topics like rule-based optimization, transactions, and the query-language-versus-SQL question.
Keywords: object, model, oriented, transaction, language, rules, activity, panel, methods, extended

Big Iron (1995)

Moving forward, databases are adopted by major organizations that want performance. This is the era of warehouses containing multidimensional data. Computers are cheap enough that DeWitt et al. can think about parallel execution across multiple servers.
Keywords: parallel, server, heterogeneity, materialization, level, warehousing, associative, load, multidimensional, universal

All Business (1995)

Warehousing continues into the early 2000s, and it's business time for technologies like OLAP. We have more than one server, so maintenance and client-server caching become issues. The Internet hits the scene and will fundamentally change the future of databases.
Keywords: service, cached, technology, warehouse, internet, olap, site, business, space, maintenance

Stable Topics

This cluster is a mishmash of classic, stable topics like databases, data management, queries, and systems, but it also introduces searching over your data.
Keywords: data, queries, database, systems, efficient, web, based, search, processing, management

Sensors+Web (2000)

Databases are fundamentally a business tool, and businesses love XML, XQuery, and Oracle. This era also brought Surajit's automated database tuning, stream-processing systems like STREAM, TelegraphCQ, and Borealis, and peer-to-peer systems like Gnutella, Chord, and others. This author worked on his first stream processing paper during this time.
Keywords: xml, streams, engineering, xquery, peer, dimensional, oracle, cube, automatic, platforms

Integration (2007)

The internet and humans mean dirty data. TRIO hoped to use uncertain, probabilistic methods to deal with these issues.
We also see data-mining techniques like pattern matching, top-k, and similarity. Scalability hints at the next "revolution".
Keywords: scalability, match, uncertain, probabilistic, similarity, analysis, top, awareness, pattern, mapping

Cloud (2010)

SoCC was founded to deal with the cloud (and MapReduce). Partitioning is a classic trick to make things go fast. Differential privacy is explored as a way of securing data in the cloud.
We finally have the infrastructure and social data to work on real graph problems; many of the best-paper awards from this period focused on graphs.
Keywords: graphs, mapreduce, social, cloud, fast, results, distance, analytic, partitioning, differential

The Future

As a young researcher, what's the bandwagon to jump on? The upward-trending terms hint at several topics. Performance and scaling with new hardware like GPUs is one direction. Data analysis, such as frequent itemset mining, result diversification, and in general helping users discover interesting results, is hugely important in the big-data world.
Or work on the crowd! It still has a few years of life left.
Keywords: scale, crowd, labeling, points, frequent, discovering, rdf, comparison, gpus, diversification