What Is Big Data?

A manager asked: “I have read your article on ‘Big Data’ but still do not understand exactly what it means. We collect a lot of data for our company and store it in our database. Is that Big Data? Why is it important today? Please advise.”

Answer: There is a difference between “Big Data” and “lots of data,” and people often confuse the two. For example, banks and financial companies process a lot of data, but all of it is well defined and well structured: customer accounts, amounts of money, types of loans, credits, debits, and so on. This is NOT Big Data, only “lots of data.” Companies can store such data in databases and use Business Intelligence (BI) software to analyze it and provide reports to management.
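As a small illustration of why well-structured data is easy to report on, here is a minimal Python sketch. The account records, the field names, and the balance_by_loan_type function are all made up for this example; they do not come from any real bank or BI product.

```python
from collections import defaultdict

# Hypothetical, well-structured records: every row has the same fields.
accounts = [
    {"customer": "A-001", "loan_type": "mortgage", "balance": 250000.0},
    {"customer": "A-002", "loan_type": "auto", "balance": 18000.0},
    {"customer": "A-003", "loan_type": "mortgage", "balance": 180000.0},
]


def balance_by_loan_type(rows):
    """Sum balances per loan type, the kind of summary a BI report shows."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["loan_type"]] += row["balance"]
    return dict(totals)


report = balance_by_loan_type(accounts)
```

Because every record has the same shape, a few lines are enough to group and total the data. That is the situation described above as “lots of data.”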

To qualify as “Big Data,” the data must meet criteria called the three “V”s: Volume, Variety, and Velocity. Volume means the data are extremely large, measured in petabytes or zettabytes. Variety means the data are both structured and unstructured, or well defined and undefined. For example, some data may be textual while other data are images or video, such as medical images or YouTube videos. Velocity means the data arrive very fast and change constantly, such as streaming video or Twitter feed messages.
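The three criteria can be written down as a simple checklist. The Python sketch below is only an illustration: the DatasetProfile class, the one-petabyte threshold, and the rule that only "table" counts as a structured format are assumptions chosen for this example, not an official definition.

```python
from dataclasses import dataclass

PETABYTE = 10**15  # bytes; assumed threshold for "extremely large"
STRUCTURED_FORMATS = {"table"}  # assumption: only tables count as structured


@dataclass
class DatasetProfile:
    """Hypothetical summary of a data collection."""
    size_bytes: int
    formats: set  # e.g. {"table", "text", "image", "video"}
    arrives_as_stream: bool  # True if data arrives fast and keeps changing


def three_v_check(profile, volume_threshold=PETABYTE):
    """Return which of the three V criteria the profile meets."""
    has_structured = bool(profile.formats & STRUCTURED_FORMATS)
    has_unstructured = bool(profile.formats - STRUCTURED_FORMATS)
    return {
        "volume": profile.size_bytes >= volume_threshold,
        "variety": has_structured and has_unstructured,
        "velocity": profile.arrives_as_stream,
    }


def is_big_data(profile):
    """Big Data only if Volume, Variety, and Velocity are all met."""
    return all(three_v_check(profile).values())


# A bank's account database: large, but structured and not streaming.
bank = DatasetProfile(5 * 10**12, {"table"}, False)
# A video platform: petabytes of mixed formats arriving continuously.
media = DatasetProfile(3 * PETABYTE, {"table", "text", "video"}, True)
```

Under these assumptions, is_big_data(bank) is False and is_big_data(media) is True. The bank fails all three tests, which matches the “lots of data” case from the answer above.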

Because of these characteristics, current relational databases cannot store such data (too large and too unorganized), and current software cannot process them (too large, too unorganized, and changing too fast). That is why Big Data opens up a completely new challenge for information technology people.

In the past, when all data were well defined and structured, they could be stored in large files and retrieved and updated regardless of how big the files were or how many there were. In that case, Business Intelligence (BI) software could sort through the data, collect the necessary information, analyze it, and create reports for different levels of management. Today the data are huge and include both structured and unstructured content, some of it text and some of it pictures or videos. Such data cannot be stored in organized files; they need different types of files, along with new software and new algorithms that can combine and store the data so that they can be organized, analyzed, and reported on. Some of these data also change extremely fast, and some are time-dependent information such as news videos, movies, and pictures. They require a new approach, a new way of organizing them, and new algorithms that can process and correlate the data and deal with more variables than previous tools could.

Sources

  • Blogs of Prof. John Vu, Carnegie Mellon University
