Thursday, August 21, 2014

Boundary of BIG(data)




The word "Bigdata" is very familiar in the software industry nowadays. Articles, blogs, brainstorming sessions, research papers, news, and presentations keep bringing updates about it. For students, it is in the syllabus. Scientists are introducing new methodologies and startups are bringing new ideas. Seeing all this, we can conclude that Bigdata is going to rule this industry for the next few years. Apparently, some people believe that Bigdata is just a mirage of data. If so, what is the point of making all this fuss? Why is there so much confusion? I don't know who introduced the term, but all the confusion is around the word "Bigdata" itself. Here I would like to add my understanding of Bigdata.

One thing is certain: the size of data has become huge, whether in the data center or in in-house applications. Obviously, as data size increases, it needs better storage as well as a lot more performance tuning. This is the first 'V' (Volume). The growing number of sources has brought diversity in the 'types' and 'formats' of data, which introduces the 2nd V (Variety). Data grows every millisecond, or even faster, through news, tweets, Facebook feeds/comments, emails, blogs, etc. This constant change creates the problem of capturing data as soon as it arrives: the 3rd V (Velocity). Another problem is the relevance or accuracy of data, that is, whether it is a fact, an opinion, or spam: the 4th V (Veracity). This is why some people say Bigdata is not 3V but 4V.

Turning unstructured data into structured data is another hot area in the industry (for example image analytics, natural language processing, and voice to text). Here the structure of the data is unpredictable. Previously, the software industry dealt with applications where inputs and outputs were predictable. Now we can say that it depends on the data.

In the 90s, given a post containing a photograph, technologists were thinking about storing that photo so it could be retrieved and viewed again. Nowadays they are interested in who posted it, when, how (through mobile/computer), what is in the photo, where it was taken, what the relations are between the people/objects in the picture, and what the connection is between the person who posted it and the photo. Basically, the amount of metadata taken into consideration has increased. Using all this metadata, data scientists are now trying to infer new information. Aggregating this data and its metadata can give useful insights.
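
To make the photo example a bit more concrete, here is a minimal Python sketch of a post that carries its metadata along with it, and of a simple aggregation over several posts. The `PhotoPost` class, its field names, the `summarize` helper, and the sample values are all my own assumptions for illustration; they do not come from any real platform.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class PhotoPost:
    """A photo post plus its metadata (all field names are hypothetical)."""
    posted_by: str                                  # who posted it
    posted_at: datetime                             # when
    device: str                                     # how: "mobile" or "computer"
    place: str                                      # where the photo was taken
    tags: List[str] = field(default_factory=list)   # what is in the photo


def summarize(posts):
    """Aggregate metadata across posts into simple counts."""
    return {
        "by_device": Counter(p.device for p in posts),
        "by_place": Counter(p.place for p in posts),
        "by_tag": Counter(tag for p in posts for tag in p.tags),
    }


# Made-up sample data, only to exercise the sketch.
posts = [
    PhotoPost("alice", datetime(2014, 8, 1, 9, 30), "mobile", "Kochi", ["beach", "family"]),
    PhotoPost("bob", datetime(2014, 8, 2, 18, 5), "computer", "Bangalore", ["office"]),
    PhotoPost("alice", datetime(2014, 8, 3, 7, 45), "mobile", "Kochi", ["beach"]),
]

summary = summarize(posts)
print(summary["by_device"])
print(summary["by_tag"].most_common(1))
```

Even these plain counts hint at the idea: none of the individual photos says "this user mostly posts beach pictures from a mobile phone", but the aggregated metadata does.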


In these cases we are crossing the boundary of data. Engineers are now attentive to the birth, life, and death of a data object. During its lifetime it is related to other data, which in turn gives birth to new data (insights), and with time it becomes history (because of trends). Maybe we are now trying to view data from different angles and with different perspectives. The same data object, viewed from different angles, can generate a diversity of insights.

Apart from the 4 Vs, I consider three more points in tagging this as 'Bigdata' rather than just calling it data.
Here, metadata and its relations are as important as the data itself; the accumulation of data is exponential with time; and the Internet of Things seems to be a platform that enables the fusion of data from all sides.
