Scientia

Sunday, January 24, 2016

Life of data

'Data' is the most mysterious face I have ever met in my life. It maintains a relationship with every one of us, and the strength of each relationship varies with its relevance. A popular song lives on with the name of its director for years, a groundbreaking invention can change the future of a company, a controversial statement against a political leader can end in life-threatening phone calls, and sometimes data can save thousands of lives from an earthquake. Basically, data has a role in everything that happens in this world.

It is not just tweets or Facebook posts. It can be the fingerprint on your mouse scroll button, or an increase in the view count of this article. Even a thought or an idea formed in your mind is data. It is generated all around us every second; some of it gets noticed, but much of it stays behind the wall. We don't know when or how it will come out and play its role.

A practically infinite amount of data is born every second. Less than one percent of it gets noticed, and only a small part of that becomes very popular. At birth, the chance of becoming popular is the same for all of it. As time moves on, some of it gets noticed very soon, the rest waits for its right time, and some waits for decades before being noticed, like fossils. What makes data popular?

The popularity of data depends on multiple factors: the chance of its being captured, the type of data, the way it is captured, its visibility, the time of its birth (think of timely jokes), the popularity of its creator, the popularity or importance of related data, and so on. Some data compete with each other, for example trending stories on news channels.

If you watch the life of data closely, you can see the ups and downs of its life cycle.

Wednesday, October 29, 2014

Emotions are not just expressive but generative too

Writers express their emotions in letters, stories, or news articles, for example "I love to be in the US". Writers can show their own feelings and, at the same time, spark emotions in the reader's mind. Established writers predict (to some extent) their readers' views or responses before publishing an article. How do they do this? What are the key factors they consider for a better response from the reader? Here I would like to share my thoughts on predicting the reader's emotions. Let's look at some cases where posts (articles, books, tweets, or Facebook feeds) trigger the reader's emotions.

i, If the reader feels that his/her emotions are in sync with the emotions expressed in a post, there is obviously a higher chance of positive feelings towards the writer and the post. At the same time, conflicting ideas or intellectual challenges may bring out the wrath of the reader.

ii, Here the reader doesn't know much about the subject of the post; he/she is totally blank. If the post is about some tragic incident, for example, it lands on a fresh canvas and can directly cross the reader's emotional boundary.

iii, Consider a novel, which may generate a wave of emotions from beginning to end. In the end it leaves the reader with one conclusive emotion.

iv, If the writer is a famous influencer (or even a notorious one), he/she can easily attract more people. In this case fame is the factor.

v, If the content of the post is itself influential among the crowd, it can also cross the emotional barriers of the mind. Here the popularity of the post plays a big role.

All the above cases depend on certain factors: the reader's view, the emotions expressed in the post, the reader's emotion towards the writer, the fame of the writer, the popularity of the post, and the reader's emotion towards the topic of the post. Predicting the reader's mind and influencing it is one of the biggest challenges of all time.

Predicting the reader's emotion computationally is very complex. But if it is possible, we can calculate a public influence rate for every tweet, Facebook post, news article and blog, based on the gender, location, profession and interests of the reader.

Thursday, August 21, 2014

Boundary of BIG(data)

The word Bigdata is very familiar in the software industry nowadays. Updates about Bigdata arrive through articles, blogs, brainstorming sessions, research papers, news, presentations and so on. For students it is in the syllabus. Scientists are introducing new methodologies and startups are bringing new ideas; seeing all this, we can conclude that Bigdata is going to rule this industry for the next few years. Yet some people believe that Bigdata is just a mirage of data. If so, what is the point of making all this fuss? Why all this confusion? I don't know who introduced the term Bigdata, but all the confusion is around the word "Bigdata". Here I would like to add my understanding of Bigdata.

One thing is certain: the size of data has become huge, whether in the data center or in in-house applications. As data size increases, it needs better storage as well as a lot more performance tuning. There comes the first 'V' (Volume). The increased number of sources has brought diversity in the 'types' and 'formats' of data, which introduces the 2nd V (Variety). Data grows every millisecond or less through news, tweets, Facebook feeds/comments, emails, blogs etc. This constant change creates the problem of capturing data as soon as it arrives, the 3rd V (Velocity). Another problem is the relevance or accuracy of data, whether it is a fact, an opinion or spam, the 4th V (Veracity). Some people say that Bigdata is not 3V but 4V.

Turning unstructured data into structured data is another hot area in the industry (image analytics, natural language processing, voice to text). Here the structure of the data is unpredictable. Previously the software industry dealt with applications whose inputs and outputs were predictable. Now we can say that they depend on the data.

In the 90s, given a post containing a photograph, technologists thought about storing that photo so it could be retrieved and viewed again. Nowadays they are interested in who posted it, when, how (through mobile/computer), what is in the photo, where it was taken, what the relations are between the people/objects inside the picture, and what the connection is between the photo and the person who posted it. Basically, the amount of metadata taken into consideration has increased. Using all this metadata, data scientists are now trying to infer new information. Aggregating the data and its metadata can give useful insights.

In these cases we are crossing the boundary of data. Engineers are now attentive to the birth, life and death of a data object. During its lifetime it is related to other data, which in turn gives birth to new data (insights), and it becomes history with time (because of trends). Maybe we are now trying to view data from a different angle, with a different perspective. The same data object viewed from different angles can generate a diversity of insights.

Apart from the 4Vs, I consider three more points when tagging this as 'bigdata' rather than just data: metadata and its relations are as important as the data itself, the accumulation of data is exponential with time, and the Internet of Things seems to be a platform that enables the fusion of data from all sides.

Tuesday, November 19, 2013

Opinion Mining

The goal of opinion mining is to identify the parts of a text that express emotions. In other words, it is sentiment analysis. Opinion mining is applied in decision-making processes: it converts people's voices into a statistical table that is useful for an entrepreneur. Nowadays the area of sentiment analysis is flourishing, with lots of research activity.

Relevance

According to surveys, 81% of Internet users have researched a product online at least once. Among readers of online reviews of restaurants, hotels, and various services, between 73% and 87% report that the reviews had a significant influence on their purchase. Consumers are ready to pay more for a higher-rated item than for a lower-rated one. 32% have provided a rating through an online ratings system, and 30% have posted an online comment or review.

Because of the diversity of sources, it is not easy to get all the review text from the Web. Some sources require authentication; in other cases opinions are buried in forum posts and blogs. It is very difficult for a human reader to go through the relevant sources, collect the text, extract pertinent sentences, and analyze, summarize, classify and organize them into a usable form. In this situation an automated opinion mining tool acts as a help desk for the customer.

Early work

What do people think about ...? Researchers are trying to answer this question using opinion mining. Identifying the polarity of opinion words and classifying whole documents as positive or negative were some of the initial work done in this area. In fact, this is not exactly what a feature-based opinion mining process needs. Take a review of a phone, for example: a customer might like its screen but dislike its battery. So researchers started working on mining opinions on individual product features. This task is known as feature-level opinion mining.

Challenges

Let’s look at some challenges that we faced during our OM process.

Figuring out the proper link between an emotion and its topic is a really thought-provoking task.

For example: “I'm looking for a good twitter app for my apple ipad”.

Here there are 2 possible heads (twitter app, apple ipad) and a single adjective (good). The proper link is between “good” and “twitter app”, as in “good twitter app”.

Finding the emotion in a sarcastic sentence is not easy, whether the sentence itself has a sarcastic meaning or external knowledge is needed to determine the emotion.

Sometimes people comment sarcastically, either by adding sarcastic smileys or through a sarcastic meaning.

For example: “I like their product verrrrrrry much ….. ;) ;)”. It may be a sarcastic review. To determine this we need some external knowledge about the person or his/her previous comments.

In other cases the topic is in the previous sentence and is referred to with a pronoun such as “it”, “he” or “they”.

Look at this example,

“I showed it to Tom and Mary. He also liked”

Here “He also liked” is the opinion part, and normally the head would be taken as “He”. But the actual head is Tom. Pronouns are not proper heads.
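The idea of replacing a pronoun head with a name from the previous sentence can be sketched roughly as below. This is only an illustration, not the resolver we used: the tiny name gazetteer (`NAME_GENDER`), the pronoun table and the function `resolve_head` are all made up for the example, and it simply picks the closest earlier name whose gender matches the pronoun.

```python
import re

# Hypothetical gazetteers, invented for this sketch.
NAME_GENDER = {"tom": "m", "mary": "f"}
PRONOUN_GENDER = {"he": "m", "she": "f"}


def resolve_head(previous_sentence, pronoun):
    """Return the closest earlier name that agrees with the pronoun, or None."""
    wanted = PRONOUN_GENDER.get(pronoun.lower())
    if wanted is None:
        return None  # pronouns like "it" or "they" need more than this sketch
    words = re.findall(r"[A-Za-z]+", previous_sentence)
    # Walk backwards so that the most recent matching name wins.
    for word in reversed(words):
        if NAME_GENDER.get(word.lower()) == wanted:
            return word
    return None


print(resolve_head("I showed it to Tom and Mary.", "He"))
```

Real text needs much more than gender agreement, so this only shows where such a step would sit in the process.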

Nowadays people use shorthand such as “U” instead of “You”, smileys etc.

For example: “I lve ma ipad”. People widely use shorthand in comments, and it is very difficult to resolve shorthand words such as

“lve” = “love”, “ma” = “my” and “2moro” = “tomorrow” etc…
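A simple way to attack shorthand is a lookup table applied before any other processing. The sketch below assumes a small hand-made table (`SHORTHAND`) and a hypothetical `normalize` function; it also squeezes letters repeated three or more times, which is a crude rule and will damage some words.

```python
import re

# Hypothetical lookup table; a real one would need to be far larger.
SHORTHAND = {
    "u": "you",
    "lve": "love",
    "ma": "my",
    "2moro": "tomorrow",
}


def normalize(text):
    """Expand known shorthand and squeeze letters repeated 3+ times."""
    def fix(match):
        word = match.group(0)
        # "verrrrrrry" -> "very"; crude, since it keeps only one letter.
        squeezed = re.sub(r"(.)\1{2,}", r"\1", word)
        return SHORTHAND.get(squeezed.lower(), squeezed)

    return re.sub(r"\w+", fix, text)


print(normalize("I lve ma ipad"))
print(normalize("I like their product verrrrrrry much"))
```

The table has to be maintained by hand, which is exactly why shorthand stays a hard problem: new forms appear faster than anyone can list them.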

In some scenarios a combination of words creates an emotion of its own,

For example: damn beauty

Here “damn” carries a negative emotion and “beauty” a positive one, while “damn beauty” as a whole is positive. Another example is “deep shit”.
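One way to handle such combinations is to check a phrase list before falling back to single words. The sketch below assumes two small hand-made lexicons (`WORD_POLARITY` and `PHRASE_POLARITY`) and a hypothetical `score` function; the polarity values are illustrative only.

```python
# Hypothetical lexicons: +1 is positive, -1 is negative.
WORD_POLARITY = {"damn": -1, "shit": -1, "beauty": 1, "good": 1}
PHRASE_POLARITY = {("damn", "beauty"): 1, ("deep", "shit"): -1}


def score(text):
    """Sum polarities, letting a known two-word phrase override its words."""
    tokens = text.lower().split()
    total = 0
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in PHRASE_POLARITY:
            total += PHRASE_POLARITY[pair]
            i += 2  # both words are consumed by the phrase
        else:
            total += WORD_POLARITY.get(tokens[i], 0)
            i += 1
    return total


print(score("damn beauty"))
print(score("deep shit"))
```

Without the phrase list, "damn beauty" would add up to zero, because the two word-level polarities cancel each other.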

How do we get the data? We can pull only 20-30% of user reviews from the World Wide Web, using connectors to social media websites such as Twitter, Facebook, YouTube and DIGG.

What are the available methods?

Basically there are 2 approaches, supervised and unsupervised. We get more accurate results with the machine learning (supervised) approach, but the challenge is getting training data, and its scope is always limited. Languages and their usage are very flexible, so even if we build training sets they are soon outdated. Many tools are available for topic extraction and sentiment analysis. Some of them are listed below.

Tools

There are some tools available for this purpose, like KEA, MAHOUT, MALLET, MAUI, WEKA, SmILE, SentiWordNet and RapidMiner.

Almost all the topic extraction tools are based on machine learning. They are useful for document-level extraction and classification, which is not exactly what we need for feature-level opinion mining. SentiWordNet is a good one for finding the emotion of a word.

Development frameworks

To develop this kind of application there are development frameworks such as GATE, UIMA and NLTK. We can choose any one of them according to our use and development criteria. All of these are open source, and they allow different types of plugins that are useful for this type of task.

Mining process (Unsupervised)

Every opinion has at least two parts: a Head (the topic) and a Sentiword (the word that describes the emotion).

For proper identification of the opinion parts (Head and Sentiword), an excellent POS tagger and gazetteers (lists of commonly used nouns, phrases, sentiwords and smileys) are needed. A topic can be a person, an object or a term, and sentiwords are basically categorized as positive or negative. A sentiword is linked to its proper head based on constraints that we define. The opinion text should be understandable and meaningful. Feature-based classification of opinions is an additional task for opinion mining.
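A very small sketch of this Head and Sentiword linking is given below. It is not our actual pipeline: the gazetteers (`HEADS`, `SENTIWORDS`) are invented for the example and stand in for a real POS tagger, and the only linking constraint assumed here is "attach each sentiword to the nearest head".

```python
import re

# Hypothetical gazetteers, standing in for a POS tagger and real word lists.
HEADS = ["twitter app", "apple ipad", "screen", "battery"]
SENTIWORDS = {"good": "positive", "great": "positive",
              "poor": "negative", "bad": "negative"}


def find_heads(tokens):
    """Return (start, end, phrase) for every gazetteer head in the tokens."""
    found = []
    for phrase in HEADS:
        words = phrase.split()
        for i in range(len(tokens) - len(words) + 1):
            if tokens[i:i + len(words)] == words:
                found.append((i, i + len(words), phrase))
    return found


def distance(pos, head):
    """Token distance between a sentiword position and a head span."""
    start, end, _ = head
    return start - pos if start > pos else pos - end + 1


def mine_opinions(sentence):
    """Link each sentiword to its nearest head: (head, sentiword, polarity)."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    heads = find_heads(tokens)
    opinions = []
    for pos, token in enumerate(tokens):
        if token in SENTIWORDS and heads:
            nearest = min(heads, key=lambda head: distance(pos, head))
            opinions.append((nearest[2], token, SENTIWORDS[token]))
    return opinions


print(mine_opinions("I'm looking for a good twitter app for my apple ipad"))
print(mine_opinions("The screen is great but the battery is poor"))
```

Grouping the resulting tuples by head gives a feature-level view, for example one opinion for the screen and another for the battery of the same phone.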

Accuracy

How can we measure the accuracy of an opinion mining application? In the opinion mining process, agreement between humans is only around 85%; with some training we can raise it above 90%. But agreement between human and system is considerably lower than this. It can be measured with precision and recall. Correlation indicates how close the predicted values are to the expected ones. Benchmarking tools are available for this type of measurement.
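As a rough illustration of these measures, the sketch below computes precision and recall of system output against a human-labelled set, and a simple agreement rate between two annotators. The function names and the sample data are made up for the example.

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted opinions against human labels."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def agreement(labels_a, labels_b):
    """Fraction of items on which two annotators gave the same label."""
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Illustrative data only: (head, polarity) pairs.
gold = {("screen", "positive"), ("battery", "negative"), ("camera", "positive")}
system = {("screen", "positive"), ("battery", "positive")}

precision, recall = precision_recall(system, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
print(agreement(["pos", "neg", "pos", "neg"], ["pos", "neg", "neg", "neg"]))
```

Here the system found one of the three human-labelled opinions and got one of its two answers right, so precision is 1/2 and recall is 1/3.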

Uses

Today it has a wide range of applications such as brand monitoring, buzz monitoring, online anthropology and online consumer intelligence; in other words, social media monitoring. Opinion mining helps in the decision-making process and is useful for individuals as well as organizations. Summarized opinions let consumers make informed and valid decisions. Opinion mining applications are becoming an essential part of businesses and organizations. For example, how consumers receive its products and those of its competitors is always critical information for a product manufacturer. This information is useful not only for marketing but also for product design and development.

External References

http://www.cs.uic.edu/~liub/FBS/opinion-mining.pdf

http://en.wikipedia.org/wiki/Sentiment_analysis

http://www.cs.cornell.edu/home/llee/omsa/omsa.pdf