Data virtualization is a method of accessing data. The virtualization in this case refers to the way the data is handled rather than the data itself. In a virtual data system, multiple databases, repositories and data storehouses are accessed simultaneously, and the information is combined into a single report before it is handed over to the user. The process of virtualizing data is a complex one. As a result, data virtualization is used only occasionally within a company.
The basic idea behind data virtualization is deceptively simple. A masking layer, called the data virtualization layer, is placed between the user and the data storage systems. When the user asks for data, such as an invoice number, the query goes through the virtual layer and out to multiple storage systems. The layer queries each data storage system with the number and brings all of the results back. Inside the layer, the system compiles all the information into a single report, which it then gives to the user.
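The fan-out-and-combine flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three dictionaries stand in for separate storage systems, and the names `SOURCES`, `query_source` and `virtual_query` are hypothetical, not part of any real data virtualization product.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for three separate storage systems.
ORDERS = {"INV-1001": {"customer": "Acme Corp", "total": 250.0}}
MARKETING = {"INV-1001": {"campaign": "spring-sale"}}
STOCK = {"INV-1001": {"items_in_stock": 12}}

SOURCES = {"orders": ORDERS, "marketing": MARKETING, "stock": STOCK}


def query_source(name, store, invoice_number):
    """Ask one storage system for what it holds about the invoice."""
    return name, store.get(invoice_number)


def virtual_query(invoice_number):
    """Send the same query to every source, then compile one report."""
    with ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(query_source, name, store, invoice_number)
            for name, store in SOURCES.items()
        ]
        results = [future.result() for future in futures]

    # Combine the per-source answers into a single report for the user.
    report = {"invoice": invoice_number}
    for name, record in results:
        if record is not None:
            report[name] = record
    return report


print(virtual_query("INV-1001"))
```

The user calls only `virtual_query`; which systems were consulted, and how many, stays hidden behind the layer. A source that has nothing for the invoice is simply left out of the report.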
On the user side, data virtualization isn’t any different from a standard database query. The user asks for information, and a few seconds later, it comes up on the screen. The data provided for the invoice query may include marketing information about the order, order histories for the purchaser and stock information about the items purchased, all from different databases. This wealth of data gives the user a well-rounded view of the query and supplies context for the search.
Setting up a data virtualization system is very complex. Bringing up a pile of disconnected data about an invoice number is simple, but the user would have to manually sort through the information to find important data. This would slow the user down and end up hurting productivity. Instead, the data virtualization layer needs to sort the information and present it in a clear manner.
In order to set up the system, connections need to be made between key datasets. Before the connections are made, the data needs proper formatting and indexing. After the data is deemed usable, connections are created between compatible data. Computers often have a difficult time understanding how the data fits together. Since the connections are often more contextual than direct, a human typically performs much of the task.
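As a rough sketch of the format, index and connect steps, the Python below normalizes invoice identifiers that two sources store differently, indexes each source by the cleaned key, and then links records that share a key. The sample records, field names and the `INV-` key format are assumptions made for illustration. Knowing that `invoice_no` in one source and `InvoiceID` in the other refer to the same thing is the contextual part a person would usually supply.

```python
def normalize_invoice_id(raw):
    """Format step: reduce differently written IDs to one canonical form."""
    digits = "".join(ch for ch in str(raw) if ch.isdigit())
    if not digits:
        raise ValueError(f"no invoice number found in {raw!r}")
    return f"INV-{int(digits)}"


def build_index(records, key_field):
    """Index step: map each canonical invoice ID to its record."""
    return {normalize_invoice_id(r[key_field]): r for r in records}


def connect(left_index, right_index):
    """Connect step: merge records that share a canonical key."""
    shared = left_index.keys() & right_index.keys()
    return {key: {**left_index[key], **right_index[key]} for key in shared}


# Hypothetical sources that name and format the same key differently.
billing = [{"invoice_no": "inv 1001", "total": 250.0}]
shipping = [{"InvoiceID": 1001, "carrier": "ground"}]

# A person decides which field in each source holds the invoice number.
billing_index = build_index(billing, "invoice_no")
shipping_index = build_index(shipping, "InvoiceID")

linked = connect(billing_index, shipping_index)
print(linked)
```

Real datasets rarely line up this neatly, which is why the connection work tends to be slow and manual.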
Since the formatting and connection work is both time-consuming and difficult, data virtualization is rarely done on a large scale. A single company may set up virtualization among its own systems, but virtualizing outside vendors or data warehouse systems is very difficult. Among smaller businesses, the sheer wealth of data sources is typically absent, meaning they don’t need the systems at all.