What is text analytics?
Text analytics combines machine learning, statistical, and linguistic techniques to process large volumes of unstructured text (text without a predefined format) and derive insights and patterns. It enables businesses, governments, researchers, and media to make use of the enormous content at their disposal when making crucial decisions. Text analytics draws on a variety of techniques, including sentiment analysis, topic modeling, named entity recognition, term frequency analysis, and event extraction.
What’s the difference between text mining and text analytics?
Text mining and text analytics are often used interchangeably. Generally, though, text mining refers to deriving qualitative insights from unstructured text, while text analytics produces quantitative results.
For example, text mining can be used to identify whether customers are satisfied with a product by analyzing their reviews and surveys. Text analytics is used for deeper insights, such as identifying a pattern or trend in the unstructured text. For example, text analytics can be used to understand a sudden spike in negative customer experience or a drop in a product's popularity.
The results of text analytics can then be used with data visualization techniques for easier understanding and prompt decision making.
What’s the relevance of text analytics in today’s world?
As of 2020, around 4.57 billion people have access to the internet, roughly 59 percent of the world's population. Of those, about 49 percent are active on social media. An enormous amount of text data is generated every day in the form of blogs, tweets, reviews, forum discussions, and surveys. In addition, most customer interactions are now digital, which creates another huge source of text data.
Most of the text data is unstructured and scattered around the web. If this text data is gathered, collated, structured, and analyzed correctly, valuable knowledge can be derived from it. Organizations can use these insights to take actions that enhance profitability, customer satisfaction, research, and even national security.
Benefits of text analytics
There are many ways that text analytics can help businesses, organizations, and even social movements:
- Helps businesses understand customer trends, product performance, and service quality. This results in faster decision making, better business intelligence, increased productivity, and cost savings.
- Helps researchers explore a great deal of pre-existing literature in a short time, extracting what is relevant to their study. This accelerates scientific breakthroughs.
- Assists in understanding general trends and opinions in society, enabling governments and political bodies to make decisions.
- Helps search engines and information retrieval systems improve their performance, thereby providing a faster user experience.
- Refines content recommendation systems by categorizing related content.
Text analytics techniques and use cases
There are several techniques for analyzing unstructured text, each suited to different use cases.
Sentiment analysis
Sentiment analysis is used to identify the emotions conveyed by unstructured text. The input text includes product reviews, customer interactions, social media posts, forum discussions, or blogs. There are different types of sentiment analysis. Polarity analysis identifies whether the text expresses positive or negative sentiment, while categorization enables a more fine-grained analysis of emotions such as confusion, disappointment, or anger.
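As a minimal illustration, here is one way polarity analysis might look in Python, using NLTK's rule-based VADER scorer (one tool among many; the sample reviews are invented):

```python
# A minimal polarity-analysis sketch using NLTK's VADER lexicon.
# Run nltk.download("vader_lexicon") once to fetch the lexicon.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

reviews = [
    "Absolutely love this product, it works perfectly.",
    "Terrible build quality, and the battery died in a day.",
]
for review in reviews:
    scores = sia.polarity_scores(review)  # neg/neu/pos plus a compound score
    label = "positive" if scores["compound"] >= 0 else "negative"
    print(f"{label}: {scores['compound']:+.2f}  {review}")
```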
Use cases of sentiment analysis:
- Measure customer response to a product or a service
- Understand audience trends towards a brand
- Understand new trends in consumer space
- Prioritize customer service issues based on the severity
- Track how customer sentiment evolves over time
Topic modeling
This technique is used to find the major themes or topics in a massive volume of text or a set of documents. Topic modeling picks out the keywords used in a text to determine its subject.
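As a toy illustration, the sketch below runs Latent Dirichlet Allocation (one common topic modeling algorithm) over a handful of invented snippets using scikit-learn; real corpora require far more documents:

```python
# A toy topic-modeling sketch with scikit-learn's LDA implementation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "The match ended with a late goal by the striker",
    "Parliament debated the new tax bill on Tuesday",
    "The coach praised the team after the championship win",
    "Senators voted against the proposed budget",
]

vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs)  # document-term matrix of word counts
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)

# Show the top keywords that characterize each discovered topic
terms = vectorizer.get_feature_names_out()
for i, weights in enumerate(lda.components_):
    top = [terms[j] for j in weights.argsort()[-5:][::-1]]
    print(f"Topic {i}: {', '.join(top)}")
```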
Use cases of topic modeling:
- Large law firms use topic modeling to examine hundreds of documents during large litigations.
- Online media uses topic modeling to pick up trending topics across the web.
- Researchers use topic modeling for exploratory literature review.
- Businesses can determine which of their products are successful.
- Topic modeling helps anthropologists to determine the emergent issues and trends in a society based on the content people share on the web.
Named entity recognition (NER)
NER is a text analytics technique used for identifying named entities like people, places, organizations, and events in unstructured text. NER locates proper nouns and noun phrases in the text and classifies them into predefined entity categories.
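For example, a pre-trained NER model such as the one in spaCy's small English pipeline (assumed to be installed here) can tag entities out of the box:

```python
# A minimal NER sketch using spaCy.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple opened a new store in Berlin, and Tim Cook attended the launch.")

for ent in doc.ents:
    print(ent.text, "->", ent.label_)
# Typical output: Apple -> ORG, Berlin -> GPE, Tim Cook -> PERSON
```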
Use cases of named entity recognition:
- NER is used to classify news content based on the people, places, and organizations featured in it.
- Search and recommendation engines use NER for information retrieval.
- For large chain companies, NER is used to sort customer service requests and assign them to a specific city or outlet.
- Hospitals can use NER to automate the analysis of lab reports.
Term frequency – inverse document frequency (TF-IDF)
TF-IDF is used to determine how often a term appears in a large text or group of documents, and therefore that term's importance to the document. This technique uses an inverse document frequency factor to filter out frequently occurring yet uninformative words such as articles, prepositions, and conjunctions.
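As a brief sketch, scikit-learn's TfidfVectorizer computes these weights directly; the toy documents below are invented:

```python
# A short TF-IDF sketch: ubiquitous words like "the" get low weights,
# while terms distinctive to a document score higher.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock prices fell sharply on Monday",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

# Show the highest-weighted term in each document
terms = vectorizer.get_feature_names_out()
for i, row in enumerate(tfidf.toarray()):
    print(f"doc {i}: top term = {terms[row.argmax()]!r} (weight {row.max():.2f})")
```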
Event extraction
This text analytics technique is an advancement over named entity extraction. Event extraction recognizes events mentioned in text content, for example, mergers, acquisitions, political moves, or important meetings. Event extraction requires an advanced understanding of the semantics of the text. Advanced algorithms strive to recognize not only the events themselves but also the venue, participants, date, and time wherever applicable. Event extraction is a beneficial technique with multiple uses across fields.
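As a rough, pattern-based sketch of the idea (real event extraction systems are far more sophisticated), the example below uses spaCy's dependency parse to pull out a hypothetical "acquisition" event with its participants and date:

```python
# A simplified event-extraction sketch: find "acquire" verbs and read
# off the acquirer (subject), target (object), and any DATE entities.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed

def extract_acquisitions(text):
    doc = nlp(text)
    events = []
    for token in doc:
        if token.pos_ == "VERB" and token.lemma_ == "acquire":
            subjects = [w.text for w in token.children if w.dep_ == "nsubj"]
            objects = [w.text for w in token.children if w.dep_ == "dobj"]
            dates = [e.text for e in doc.ents if e.label_ == "DATE"]
            if subjects and objects:
                events.append(
                    {"acquirer": subjects[0], "target": objects[0], "dates": dates}
                )
    return events

print(extract_acquisitions("Microsoft acquired GitHub in 2018."))
# Expected: [{'acquirer': 'Microsoft', 'target': 'GitHub', 'dates': ['2018']}]
```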
Use cases of event extraction:
- Link analysis: This is a technique to understand “who met whom and when” through event extraction from communication over social media. This is used by law enforcement agencies to predict possible threats to national security.
- Geospatial analysis: When events are extracted along with their locations, they can be overlaid on a map. This is helpful in the geospatial analysis of events.
- Business risk monitoring: Large organizations deal with multiple partner companies and suppliers. Event extraction techniques allow businesses to monitor the web to find out if any of their partners, like suppliers or vendors, are dealing with adverse events like lawsuits or bankruptcy.
Steps involved with text analytics
Text analytics is a sophisticated technique that involves several preliminary steps to gather and cleanse the unstructured text. Text analytics can be performed in different ways; the following is an example of a model workflow.
- Data gathering - Text data is often scattered around the internal databases of an organization, including in customer chats, emails, product reviews, service tickets and Net Promoter Score surveys. Users also generate external data in the form of blog posts, news, reviews, social media posts and web forum discussions. While the internal data is readily available for analytics, the external data needs to be gathered.
- Preparation of data - Once the unstructured text data is available, it needs to go through several preparatory steps before machine learning algorithms can analyze it. In most text analytics software, this step happens automatically. Text preparation involves several natural language processing techniques, illustrated in the combined sketch after this list:
- Tokenization: In this step, the text analysis algorithms break the continuous string of text data into tokens, smaller units that make up whole words or phrases. For instance, character tokens could be each individual letter in this word: F-I-S-H. Or the word can be broken into subword tokens: Fish-ing. Tokens are the basis of all natural language processing. This step also discards unwanted content such as extra white space.
- Part-of-speech tagging: In this step, each token in the data is assigned a grammatical category such as noun, verb, adjective, or adverb.
- Parsing: Parsing is the process of understanding the syntactical structure of the text. Dependency parsing and constituency parsing are two popular techniques used to derive syntactical structure.
- Lemmatization and stemming: These two processes reduce tokens to a base form by removing suffixes and affixes. Stemming trims word endings heuristically, while lemmatization maps each token to its dictionary form, or lemma.
- Stopword removal: In this phase, tokens that occur frequently but add no value to the text analytics are removed. These include words such as 'and', 'the', and 'a'.
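A combined sketch of these preparation steps, using spaCy as one possible toolkit (the sample sentence is invented):

```python
# One pass through spaCy covers the preparation steps described above:
# tokenization, part-of-speech tagging, parsing, lemmatization, and
# stopword removal. Assumes the en_core_web_sm model is installed.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The customers were praising the new phones enthusiastically.")

prepared = []
for token in doc:                              # tokenization happens in nlp(...)
    print(token.text, token.pos_, token.dep_)  # POS tag and dependency relation
    if token.is_stop or not token.is_alpha:
        continue                               # drop stopwords and punctuation
    prepared.append(token.lemma_.lower())      # keep the lemma (dictionary form)

print(prepared)  # e.g. ['customer', 'praise', 'new', 'phone', 'enthusiastically']
```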
- Text analytics - After the unstructured text data is prepared, text analytics techniques can be applied to derive insights. Several techniques are used for text analytics; prominent among them are text classification and text extraction.
Text classification: This technique is also known as text categorization or tagging. In this step, certain tags are assigned to the text based on its meaning. For example, while analyzing customer reviews, tags like “positive” or “negative” are assigned. Text classification is often done using rule-based systems or machine learning-based systems. In rule-based systems, humans define the association between a language pattern and a tag: “good” may indicate a positive review, while “bad” may identify a negative one.
Machine learning systems use past examples, or training data, to assign tags to a new set of data. The training data and its volume are crucial, as larger data sets help the machine learning algorithms give more accurate tagging results. The main algorithms used in text classification are Support Vector Machines (SVM), the Naive Bayes (NB) family of algorithms, and deep learning algorithms.
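As a toy sketch of the machine learning approach, the example below trains a Naive Bayes classifier on a handful of invented, labelled reviews with scikit-learn; real systems need far larger training sets:

```python
# A toy text-classification sketch: TF-IDF features feeding a
# Multinomial Naive Bayes model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "great product, works well",
    "excellent quality, very happy",
    "terrible, broke after a week",
    "awful service, bad experience",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["the quality is excellent"]))    # expected: ['positive']
print(model.predict(["bad product, broke quickly"]))  # expected: ['negative']
```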
Text extraction: This is the process of extracting recognizable and structured information from the unstructured input text. This information includes keywords, names of people, places, and events. One simple method for text extraction is regular expressions, although these become complicated to maintain as the complexity of the input data increases. Conditional Random Fields (CRF) is a statistical method used in text extraction; CRF is a sophisticated but effective way of extracting vital information from unstructured text.
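As a minimal sketch of the regular-expression approach (the ticket text and patterns below are invented for illustration):

```python
# A minimal text-extraction sketch: regular expressions pulling
# structured fields (email, date, order ID) out of free text.
import re

ticket = "Contact jane.doe@example.com before 2024-03-15 about order #4521."

emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", ticket)
dates = re.findall(r"\d{4}-\d{2}-\d{2}", ticket)
order_ids = re.findall(r"#(\d+)", ticket)

print(emails)     # ['jane.doe@example.com']
print(dates)      # ['2024-03-15']
print(order_ids)  # ['4521']
```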
What happens after text analytics?
Once the text analytics methods have processed the unstructured data, the output information can be fed into data visualization systems. The results can then be visualized as charts, plots, tables, infographics, or dashboards. This visual data enables businesses to quickly spot trends and make decisions.