The term "big data" refers to data that is so large, fast, or complex that it is difficult to impossible to process using traditional data methods. While data collection and analyzation has been around for a long time, the concept of big data gained momentum in the early 2000s when industry analyst Doug Laney offered a new mainstream definition of big data. This defined big data across volume, velocity, and variety.
Three V's of Big Data
Since the original definition, and with an increased understanding of big data, two more V's have been added: value and veracity. In this expanded definition, data, although it has intrinsic value, is of no use until that value is discovered. Likewise, how truthful the data is has become increasingly important, especially as data has become a form of capital for large companies, with part of the value those companies offer coming from data that is constantly being produced and analyzed for products.
The increase in the amount of data available presents both opportunities and problems. Having more data on customers, or potential customers, can allow companies to better develop products and marketing efforts, increasing user satisfaction and repeat business, and companies collecting large amounts of data have the opportunity to conduct deeper and richer analysis. Examples of the increase in the amount of data include Google's 3.5 billion daily searches during 2020, which represented 87 percent of the search engine market and amounted to roughly 1.2 trillion searches yearly, or more than 40,000 search queries per second, and the 65 billion messages sent daily over WhatsApp in 2020 across the application's user base of 2 billion people.
Big data can also create overload and noise, reducing its usefulness. As companies handle larger volumes of data, they have to determine which data represents signal and which represents noise, and the decision over what data is relevant becomes a key factor. The nature and format of the data can also require special handling before it is acted upon. Structured data, consisting of numeric values, can be easily stored and sorted, whereas unstructured data, such as emails, videos, and text documents, requires more sophisticated techniques before it can be made useful.
Big data can be further categorized as unstructured or structured. Structured data consists of information already managed by the organization in databases and spreadsheets and is frequently numeric in nature. Unstructured data is information that is unorganized and does not fall into a predetermined model or format. This kind of data can be gathered from social media sources, publicly shared comments from social networks and websites, or from personal electronics and applications, through questionnaires, product purchases, and electronic check-ins. The use of sensors and other inputs in connected devices can allow for greater amounts of data across a broad spectrum of situations and circumstances.
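As a rough illustration of the difference, the short Python sketch below sorts a handful of structured rows directly, while an unstructured text snippet has to be tokenized and counted before it can be queried at all; the field names, values, and review text are invented purely for the example.

```python
import csv
import io
import re
from collections import Counter

# Structured data: rows with a fixed schema can be stored and sorted directly.
structured_rows = io.StringIO("customer_id,purchase_total\n1001,25.40\n1002,91.10\n")
purchases = sorted(csv.DictReader(structured_rows), key=lambda r: float(r["purchase_total"]))

# Unstructured data: free text needs extra processing (tokenizing, counting)
# before it yields anything a query or report can use.
review = "Great product, fast shipping. Would buy this product again."
tokens = re.findall(r"[a-z]+", review.lower())
word_counts = Counter(tokens)

print(purchases[0]["customer_id"])      # lowest-spending customer
print(word_counts.most_common(3))       # most frequent words in the review
```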
Unstructured data is cited by 95 percent of companies as a problem for their business, with the cost to manage and analyze it a particular concern. With an estimated 80 to 90 percent of generated data being unstructured, this can represent a significant problem for companies, especially when it comes to the product offerings for their customers. This problem, along with poor data quality, costs businesses worldwide an estimated 9.7 million to 14.2 million yearly. Poor data quality and poor data analysis are also considered causes of poor decision making and poor business strategies, in turn causing low productivity and mistrust between customers and a brand.
While there are different approaches and strategies for dealing with big data, depending on a company's systems and what it expects to get from the data, there is a general set of activities shared among users of big data: the ingestion of data into a system, the persistence of the data in storage, the computing and analysis of the data, and a visualization of the results.
Big data brings together data from many disparate sources and applications, and ingesting and integrating the data into a company's system are important to the functions of big data in an organization. Traditional data integration mechanisms, such as extract, transform, and load (ETL), are generally not up to the task. Rather, big data requires new strategies and technologies to analyze data sets at the terabyte or petabyte scale.
New technologies, such as Apache Sqoop, and new tools have been developed to ingest and integrate this data, including tools to aggregate and normalize the output at the end of the ingestion pipeline. During the ingestion process, some level of analysis, sorting, and labelling takes place. While these operations are traditionally associated with ETL frameworks, with the ubiquity of big data they have been modified for the volume and types of data involved.
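As a simplified sketch of the extract, transform, and load pattern described above (not the behavior of Apache Sqoop or any specific tool), the following Python example reads raw records, normalizes one field, and loads the result into a small queryable store; the CSV feed and SQLite database are stand-ins chosen to keep the example self-contained.

```python
import csv
import io
import sqlite3

# Extract: read raw records from a source (a CSV string stands in for a feed or export).
raw = io.StringIO("id,signup_date,country\n1,2020-03-14,us\n2,2020-03-15,DE\n")
records = list(csv.DictReader(raw))

# Transform: normalize values so downstream analysis sees a consistent format.
for rec in records:
    rec["country"] = rec["country"].upper()

# Load: persist the cleaned records into a queryable store (SQLite as a stand-in).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signups (id INTEGER, signup_date TEXT, country TEXT)")
conn.executemany("INSERT INTO signups VALUES (:id, :signup_date, :country)", records)
print(conn.execute("SELECT country, COUNT(*) FROM signups GROUP BY country").fetchall())
```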
Big data requires storage. The storage can be in the cloud, on premises, or a mixture of both. The data can be stored in any form an organization wants or needs, with the desired processing requirements and processing engines brought to the data sets on an on-demand basis. Organizations can choose a storage solution according to where the data currently resides; however, cloud computing and storage are gaining popularity because they can support current compute requirements and allow an organization to spin up resources as needed.
The data also needs a distributed file system, such as Apache Hadoop's HDFS, which allows large quantities of data to be written across multiple nodes in a cluster. This type of storage system lets the data be accessed by compute resources, loaded into the cluster's RAM for in-memory operations, and survive component failures. Other types of databases, such as distributed or NoSQL databases, are often used for big data storage because they can handle heterogeneous data and are designed to be fault tolerant, allowing organizations to choose a database based on organizational needs.
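A minimal sketch of writing to and reading from a distributed file system is shown below, assuming a reachable HDFS cluster and the pyarrow library with libhdfs available locally; the namenode host, port, and file paths are placeholders, not values from any particular deployment.

```python
import pyarrow.fs as pafs

# Connect to an HDFS namenode; host, port, and paths here are placeholders
# and assume a reachable cluster with libhdfs installed on the client.
hdfs = pafs.HadoopFileSystem(host="namenode.example.internal", port=8020)

# Write a block of data; HDFS splits and replicates it across datanodes,
# which is what lets the cluster tolerate the loss of individual machines.
with hdfs.open_output_stream("/data/events/2020-06-01.csv") as out:
    out.write(b"user_id,event\n1001,login\n1002,purchase\n")

# Read it back on demand from any node that can reach the cluster.
with hdfs.open_input_stream("/data/events/2020-06-01.csv") as src:
    print(src.read().decode())
```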
Once the data is available, it can offer organizations an understanding, clarity, and visual analysis of the data sets, providing the information they need and uncovering connections that had not been made previously. With the inclusion of data models built with machine learning and artificial intelligence, organizations can pull out further patterns and insights.
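As one hedged example of this kind of pattern discovery, the sketch below fits a simple clustering model to a toy customer table using scikit-learn; the features and numbers are invented, and a real pipeline would draw on far larger and richer data sets.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy feature matrix: one row per customer, with columns such as visit count
# and total spend. The numbers are invented purely to illustrate the idea.
features = np.array([
    [2, 15.0], [3, 22.5], [40, 310.0], [38, 295.0], [5, 30.0], [42, 330.0],
])

# A clustering model groups similar customers, a simple stand-in for the kind
# of pattern discovery the text describes.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
print(model.labels_)           # which segment each customer falls into
print(model.cluster_centers_)  # the "typical" customer in each segment
```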
The compute and analyze layer is also perhaps one of the most diverse parts of the big data lifecycle. The requirements and approaches for computing and analyzing the data depend on the types of information the organization wants. Data is often processed multiple times, by either a single tool or multiple tools, to surface different forms of information.
One method of doing this is batch processing, which involves breaking the datasets into smaller pieces, scheduling each piece on an individual machine, reshuffling the data based on intermediate results, and then calculating and assembling the final result. This is one of the more useful compute methods when dealing with very large or compute-intensive datasets.
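A small, self-contained Python sketch of this batch pattern is shown below: the data set is split into pieces, each piece is counted by a separate worker process, and the partial results are combined into a final total. The toy word-count workload stands in for whatever computation a real job would run across a cluster.

```python
from collections import Counter
from multiprocessing import Pool

def count_words(chunk):
    """Map step: each worker counts the words in its own piece of the data."""
    counts = Counter()
    for line in chunk:
        counts.update(line.split())
    return counts

if __name__ == "__main__":
    # Stand-in dataset; a real job would read chunks from distributed storage.
    lines = ["big data big clusters", "data pipelines", "big pipelines data"] * 1000
    chunks = [lines[i::4] for i in range(4)]  # split the work into four pieces

    # Schedule each piece on a separate process, then reduce the partial
    # results into a final total, mirroring the batch pattern described above.
    with Pool(processes=4) as pool:
        partial_counts = pool.map(count_words, chunks)

    total = sum(partial_counts, Counter())
    print(total.most_common(3))
```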
Due to the qualities of big data, individual computers are often inadequate for handling the data at most or all stages. To address the storage and computational needs of big data, computer clusters are often used. Cluster software combines the resources of many smaller machines to offer resource pooling, better availability, and scalability. Clusters combine the available storage space to hold data and pool compute and memory capacity to process large datasets that require large amounts of all three resources.
Compute clusters can provide varying levels of fault tolerance and guarantees to prevent hardware or software failures from affecting access to data and processing, which becomes more important with the use of real-time analytics. Clusters also make it easy to scale horizontally by adding machines to the group, allowing the system to react to changes in resource requirements without expanding the physical resources of any single machine. Using compute clusters also requires software for resource sharing, the allocation of those resources, and scheduling work on individual nodes.
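As a hedged sketch of how a cluster framework hides this distribution from the analyst, the example below uses Apache Spark's Python API to group and count records; the application name, input path, and column names are placeholders, and running it assumes an available Spark installation and, for the HDFS path, a reachable cluster.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start (or attach to) a Spark session; on a real cluster the session would be
# configured to point at the cluster manager, and the input path would live on
# shared storage such as HDFS. The values below are placeholders.
spark = SparkSession.builder.appName("event-counts").getOrCreate()

# The dataframe is partitioned across the cluster's machines, so the grouping
# and counting below run in parallel on whatever nodes are available.
events = spark.read.csv("hdfs:///data/events/*.csv", header=True)
counts = events.groupBy("event").agg(F.count("*").alias("occurrences"))

counts.show()
spark.stop()
```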
Big data has use cases across a range of industries, from customer experience to analytics, with different purposes and results depending on business needs.
Use cases for big data
Although big data as a concept is relatively new, dating back to the early 2000s, the origins of large datasets that could be recognized as big data go back to the 1960s and 1970s, when the first data centers were being built and relational databases were being developed. Around 2005, there was a realization of how much data users generated through Facebook, YouTube, and other online services. Hadoop was developed in the same year, and NoSQL gained popularity around the same time.
The development of open-source frameworks, such as Hadoop, has been essential to the growth of big data, as they make the data easier to work with and cheaper to store. Since the development of Hadoop, the volume of data has increased drastically. With the advent of the Internet of Things (IoT), which brought more devices and objects connecting to the internet and gathering data on customer usage patterns and product performance, the use of data has increased again. And with the emergence of machine learning and the use of datasets for training, the need for and use of data continues to grow.
While big data has come far from where it started, its usefulness is only growing as understanding improves. Cloud computing has expanded the possibilities and use cases for big data, as the cloud offers truly elastic scalability and lets developers simply spin up ad hoc clusters to test subsets of data. Graph databases are becoming increasingly important as well, with the ability to display massive amounts of data in a way that makes analytics fast and comprehensive.