Big Data and Data Mining

A student wrote to me: “It seems to me that Big Data is a new name for Data Mining, just process more data. Am I correct? What is the difference between Big Data and Data Mining? Please explain.”

Answer: There is a difference between Big Data and Data Mining. Many people believe that mining a massive amount of data is Big Data, and that is NOT correct. Let’s start with simple definitions of Data Mining and Big Data.

Data Mining is the process of analyzing data to identify correlations or patterns among several types of data STORED in a database, then summarizing them into useful information. For example, when an owner looks at his company’s business, he may see revenue, costs, and profits. But with data mining, he sees much more. He knows which of the thousands of products he sells are the best sellers. He also knows what customers want, based on their purchasing patterns. Based on the data mining analytic report, he might learn that if he reduces the price by 5%, he could increase sales by 45% and earn 25% more profit than before. Basically, data mining allows the owner to use existing information to reveal additional trends that he could take advantage of.
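To make the idea of finding best sellers and purchasing patterns concrete, here is a minimal sketch in Python. The sales records, product names, and the function names best_sellers and frequent_pairs are all made up for illustration; real data mining tools use far more sophisticated methods, but the principle of counting patterns in stored records is the same.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sales records: each inner list is one customer's purchase.
transactions = [
    ["laptop", "mouse", "laptop bag"],
    ["laptop", "mouse"],
    ["phone", "phone case"],
    ["laptop", "laptop bag"],
    ["mouse", "keyboard"],
]


def best_sellers(transactions, top_n=3):
    """Return the top_n most frequently purchased products with their counts."""
    counts = Counter(item for basket in transactions for item in basket)
    return counts.most_common(top_n)


def frequent_pairs(transactions, min_count=2):
    """Return product pairs bought together in at least min_count purchases."""
    pair_counts = Counter()
    for basket in transactions:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}


print(best_sellers(transactions))
print(frequent_pairs(transactions))
```

With this toy data, the owner would see that laptops and mice sell best, and that customers who buy a laptop often buy a mouse or a laptop bag with it.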

Today, Data Mining is widely used in retail, financial, communication, and marketing companies. It allows them to determine correlations between “internal” factors such as price, products, and costs and “external” factors such as customers, competition, and economic trends. Based on this additional information, companies can determine the impact on their sales, customer buying habits, and corporate profits. With Data Mining, a retailer could use sales records of customer purchases to send advertising promotions based on an individual’s purchase history. For example, I always buy books at Amazon.com, so each week the company sends me a list of new books, mostly computer books, for me to purchase. They never send a list of romance, fiction, or architecture books because they know that I often buy technology books. Their Data Mining software already knows my book-buying habits.

However, with Data Mining, all data must be STRUCTURED and DEFINED before it can be stored in the database. Special Data Mining tools are used to collect this data from the database, analyze it to identify patterns, and generate reports for management. In other words, if the data is stored in a database and structured in rows and columns, then regardless of how big it is, it is the domain of Data Mining.
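As a rough illustration of what “structured and defined” means, the sketch below uses Python’s built-in sqlite3 module. The sales table, its columns, the customer names, and the favorite_category function are assumptions invented for this example. The point is that the columns must be defined before any row can be stored, and only then can a query find a pattern such as a customer’s favorite book category.

```python
import sqlite3

# The structure (columns and their types) is defined BEFORE any data is stored.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, category TEXT, amount REAL)")

# Hypothetical purchase records; every row fits the same rows-and-columns shape.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("alice", "technology", 40.0),
        ("alice", "technology", 55.0),
        ("alice", "fiction", 12.0),
        ("bob", "architecture", 30.0),
        ("bob", "architecture", 25.0),
    ],
)


def favorite_category(conn, customer):
    """Return the category the customer buys most often, or None if unknown."""
    row = conn.execute(
        "SELECT category, COUNT(*) AS purchases FROM sales "
        "WHERE customer = ? GROUP BY category "
        "ORDER BY purchases DESC, category ASC LIMIT 1",
        (customer,),
    ).fetchone()
    return row[0] if row else None


print(favorite_category(conn, "alice"))
print(favorite_category(conn, "bob"))
```

A retailer could use a result like this to decide which list of new books to send to each customer.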

Today, there are other types of data that are NOT DEFINED and NOT STRUCTURED, and they are scattered all over the place.

Examples include data from the Internet, from millions of websites and social networks: Facebook photos, stock market graphics, tweets from Twitter, personal data from LinkedIn, electronic medical records from hospitals, economic trend data from research institutes, weather data, business data, emails, personal pictures, videos from YouTube, downloaded movies and music, etc. This data CANNOT be collected or stored by typical database tools. More than that, each minute this data changes or grows in size quickly. It adds up as billions upon billions, trillions upon trillions of things happen in the “virtual world”. This data is also very valuable for determining patterns or trends. When you combine the massive volume of data, the variety of types of data, and the speed at which they change, then you are dealing with the domain of Big Data.

Big Data has exceeded traditional database concepts. Its patterns and trends are on such a large scale that they are difficult to see. The relationships among all the different types of data are too complex to observe. AND the data keeps changing at the speed of the Internet, so it is hard to identify what it reveals. Basically, the concepts and tools of current databases and data mining will NOT work anymore. That is why Big Data needs new concepts, new tools, and new algorithms, and that is why Big Data is the new thing today.

Sources

  • Blogs of Prof. John Vu, Carnegie Mellon University