Tuesday, April 22, 2014

How do NLP engines work, in a nutshell

Part 1 - Introduction

First, I'll start with a little introduction about myself: I'm a senior back-end software engineer at a major international technology company, part of a group that develops a personal assistant.

One of the biggest challenges we had, and still have, is extracting meaning from the user's natural language. This challenge was given to me early in the development stage, when we found that our regular expression model was not sufficient for one of the input sources we examined.

In order to tackle this problem I did extensive research on open source NLP engines, while our business development team scouted for companies that might provide a solution to our problem. In the end no such company was found, and I was given the mandate to develop an engine myself.

In this article I've summed up what I've learned on the subject, while trying to keep it as simple as I can: without going into the mathematics behind the NLP statistical models, and focusing only on what is needed to understand the building blocks that make an NLP engine work.

So let's begin: 
Natural Language Processing (NLP) is a family of text and speech analysis frameworks and tools that enable us to extract machine-understandable meaning from human language. This is done using statistical models for pattern recognition, sentence structure analysis and semantic layering (if you're not familiar with these terms, they will all be explained later in this article).

There are many implementations of NLP engines; each is designed to solve a different set of problems, using models that evaluate and analyze different types of input. My experience is based on evaluating short text inputs (sentences) using NLP engines such as Stanford CoreNLP and Apache UIMA, so I'll focus here on NLP engines with a similar structure, which are designed to analyze textual input.

NLP is becoming more and more popular these days, and it is used in fields such as research, data mining, machine learning, speech processing and AI. Implementations of it can be found in web search engines, mobile personal assistants, automated web crawlers and many more.

Even so, and not surprisingly, it is still one of the most complex problems in computer science: extracting meaning from natural text is a challenging task. Although there are some incredible open source libraries and tools to work with, and extensive research is being done in the field, some of it is still at the POC stage and requires intensive work to make it stable. I recommend going through the following checklist before you start your development, in order to avoid working with tools that are not designed, or not mature enough, to solve your problem:
  1. Does the engine support your preferred development language?
  2. Does it have a live and kicking community?
  3. What kind of license does it have?
  4. Does it ship independent API jars, or is it tied to a third-party API?
  5. What is the documentation quality? 
So let's examine some of the most common building blocks and tools that compose text-based NLP engines:

Tokenizers:
A core model for breaking the text down into tokens (a token usually refers to a word) using delimiters (such as space, comma and so on).
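To illustrate the idea only (real tokenizers, such as Stanford CoreNLP's PTBTokenizer, handle abbreviations, contractions and many other edge cases), here is a naive Java sketch that splits on whitespace and punctuation delimiters:

    public class NaiveTokenizer {
        public static void main(String[] args) {
            String text = "My brother, Joe, is a software engineer.";
            // Split on runs of whitespace, commas, periods and question/exclamation marks
            String[] tokens = text.split("[\\s,.!?]+");
            for (String token : tokens) {
                System.out.println(token); // My, brother, Joe, is, a, software, engineer
            }
        }
    }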

Statistical and Annotation Engines:
  1. Part Of Speech annotation (POS) - A model that annotates every word in the sentence with the grammatical part of speech it has, such as: verbs, nouns, pronouns, adverbs, adjectives and so on (see the pipeline sketch after this list).
    For example, a tagged sentence such as "My brother Joe is a software engineer" comes out as:
    My/PRP$ brother/NN Joe/NNP is/VBZ a/DT software/NN engineer/NN
    Where:
    PRP$ = Possessive pronoun
    NN = Noun
    VBZ = Verb (3rd person singular present)
    NNP = Proper noun
    DT = Determiner
  2. Named Entity Recognition (NER) - A model that annotates words that carry a semantic layer, based on a statistical model or regular expression rules. Common NER annotations are: Location, Organization, Person, Money, Percent, Time and Date…

  3. Creating new Annotations - A very useful technique when analyzing text is using your own annotations to highlight and give meaning to words or phrases that are specific to your program's context. Most NLP engines give you the infrastructure to create and define your own annotations, which will run in the NLP engine's text analysis cycle (this is part of the training described in section 2 of "Customizing NLP Engine for your own needs" below).
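To make the POS and NER models above concrete, here is a minimal sketch using Stanford CoreNLP's pipeline API (it assumes the CoreNLP jars and models are on the classpath; the sample sentence is only an illustration):

    import java.util.Properties;
    import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.PartOfSpeechAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.util.CoreMap;

    public class AnnotationDemo {
        public static void main(String[] args) {
            // Build a pipeline: tokenizer, sentence splitter, POS tagger and NER
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

            Annotation document = new Annotation("Joe works for Google in California");
            pipeline.annotate(document);

            for (CoreMap sentence : document.get(SentencesAnnotation.class)) {
                for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
                    System.out.println(token.word()
                            + "/" + token.get(PartOfSpeechAnnotation.class)     // e.g. NNP, VBZ
                            + "/" + token.get(NamedEntityTagAnnotation.class)); // e.g. PERSON, ORGANIZATION
                }
            }
        }
    }

For this sentence the NER model should tag Joe as PERSON, Google as ORGANIZATION and California as LOCATION, while tokens with no entity get the default O tag.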
Semantic Graph / Relationship extraction:
An important NLP model that maps the connections between all sentence entities as a graph object. This enables us to traverse the produced graph and find meaningful connections.

For example:       
"My brother Joe is a software engineer"
"Joe My brother is a software engineer"
"A software engineer is Joe my brother"

Using the semantic graph we can extract the same meaning from all of the above sentences:
Joe <=>  My brother
Joe <=>  software engineer
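In Stanford CoreNLP, for example, these relations are exposed as a SemanticGraph of typed dependencies once the parser annotator has run; a minimal sketch, reusing the sentence above:

    import java.util.Properties;
    import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.semgraph.SemanticGraph;
    import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation;
    import edu.stanford.nlp.util.CoreMap;

    public class DependencyDemo {
        public static void main(String[] args) {
            Properties props = new Properties();
            // "parse" produces both constituency and dependency structures
            props.setProperty("annotators", "tokenize, ssplit, pos, lemma, parse");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

            Annotation document = new Annotation("My brother Joe is a software engineer");
            pipeline.annotate(document);

            for (CoreMap sentence : document.get(SentencesAnnotation.class)) {
                SemanticGraph graph = sentence.get(CollapsedCCProcessedDependenciesAnnotation.class);
                // Prints the typed dependencies, e.g. the apposition between "brother" and "Joe"
                System.out.println(graph);
            }
        }
    }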

Dictionaries:
A regular expression based model that holds a set of words or regular expressions. These words should have the same meaning or translation, and should be handled / annotated in the same way once found in a text. Some dictionaries are provided by the NLP engine and used in the NER model or as other predefined annotations (in the core annotation engine, depending on the NLP engine used), while others will be created by us when we customize the engine (as part of the training described in section 2 of "Customizing NLP Engine for your own needs" below).

An example of a dictionary can be all the cities in a country or all the degrees in some university; other cases can be synonyms, like road being a synonym for alley, avenue, street, boulevard etc...
Once the model recognizes a dictionary word it will annotate it with an annotation you predefined; for example, ROAD_SYN can be the annotation for all the road synonyms above (see the sketch below).
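In Stanford CoreNLP, for instance, one way to plug in such a dictionary is the regexner annotator, which reads a tab-separated mapping file (the file name roads.txt and the ROAD_SYN label here are only illustrative):

    import java.util.Properties;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;

    public class DictionaryDemo {
        public static void main(String[] args) {
            // roads.txt (tab-separated: pattern, then annotation), e.g.:
            //   alley        ROAD_SYN
            //   avenue       ROAD_SYN
            //   street       ROAD_SYN
            //   boulevard    ROAD_SYN
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, regexner");
            // Every token matching an entry in the mapping file gets the ROAD_SYN annotation
            props.setProperty("regexner.mapping", "roads.txt");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
            // ... annotate documents as in the earlier sketches
        }
    }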

Some Useful Add-on Models and Algorithms:
  1. Stemming (e.g. Snowball) and lemmatization algorithms - extract the root or base form of a token / word; for example: "stemmer", "stemming" and "stemmed" will all result in the same base form, "stem". This can be very useful when adding the semantic layer on a predefined verb: there is no need to know all the forms it can have, just the root one (see the sketch after this list).
  2. TrueCase - Recognizes the true case of tokens in text where this information was lost, e.g. all upper or lower case text. This can be useful since NLP statistical models give more accurate results when the text is well formed (grammar, punctuation, spelling and casing), and the input won't always arrive with well-formed casing; for example, speech to text engines will commonly output text in lower case.
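A quick sketch of lemmatization with Stanford CoreNLP's lemma annotator (the sentence is only an illustration):

    import java.util.Properties;
    import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.util.CoreMap;

    public class LemmaDemo {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

            Annotation document = new Annotation("He stemmed the tokens while the stemmer was running");
            pipeline.annotate(document);

            for (CoreMap sentence : document.get(SentencesAnnotation.class)) {
                for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
                    // e.g. "stemmed" -> "stem", "was" -> "be"
                    System.out.println(token.word() + " -> " + token.get(LemmaAnnotation.class));
                }
            }
        }
    }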
Customizing NLP Engine for your own needs:
The last and most important part of evaluating text is tuning and configuring the engine to best suit our needs. There are many NLP engines, and each has its own models for training on data; there are two common training techniques:
  1. Training the core engine data - This training technique is used to train the core data to give more accurate results, based on new input fed offline to the engine (very similar to machine learning); using this training model you change the core statistical model.
    To use this training we create a document (preferably a problematic one, for which the engine produces wrong output or didn't recognize entities we would like recognized) and re-annotate it correctly ourselves; we then feed it to the statistical model, which recalibrates using the new information. This eventually gives more accurate results when, at run time, the engine analyzes text similar to the training document (see the sketch after this list).
  2. Creating and training new data for the engine to consider - This training is done using newly created models and algorithms (such as those described above) that are dedicated to our program's needs, for example: new annotations, dictionaries, regular expressions and semantic layering that help recognize phrases or patterns in the analyzed text, in order to give them meaning in our program's context.
    This training does not change the core engine's statistical model; it only adds new features on top of it, which are needed to make the engine suit our needs.
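As one concrete instance of technique 1, Stanford's NER model (a CRF) can be retrained from a hand-annotated, tab-separated file, roughly along these lines (the file and model names are hypothetical; the properties are from the Stanford NER documentation):

    import java.util.Properties;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;

    public class RetrainedNerDemo {
        public static void main(String[] args) {
            // Step 1: re-annotate a problematic document by hand as train.tsv,
            // one token per line: word <TAB> label, e.g.
            //   My          O
            //   brother     O
            //   Joe         PERSON
            // Step 2: train a new serialized CRF model from it on the command line:
            //   java -cp stanford-corenlp.jar edu.stanford.nlp.ie.crf.CRFClassifier
            //        -trainFile train.tsv -serializeTo my-ner-model.ser.gz -map "word=0,answer=1"
            // Step 3: point the pipeline at the retrained model instead of the default one:
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
            props.setProperty("ner.model", "my-ner-model.ser.gz");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        }
    }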
That sums up the main features and functionality of the most common out-of-the-box NLP engines.

In the next part I'll give code examples of how to use Stanford CoreNLP (an open source Java library) to start writing and configuring your own NLP engine.
