March 12, 2019 - John Thuma | Big Data Ecosystem

The Data Science Behind Natural Language Processing

Human communication is one of the most fascinating attributes of being sentient. As humans, we know how complex our interactions can be: we often send and receive the wrong messages, or our messages are misinterpreted by others. Every day we take for granted our ability to convey meaning to our coworkers and family members. We communicate in a variety of ways, including speech and written symbols, and human communication can be as simple as a glance across a room. Chris Manning, a professor of machine learning at Stanford University, describes human language as “a discrete, symbolic, categorical signaling system.” What does this mean? I think it is our senses, such as sight, touch, hearing, and even scent, that enable our communication. This brings me to the topic of this blog: What happens when we bring computing into the fold? What is natural language processing, and how does it work?

Natural language processing (NLP) is a discipline within computer science and artificial intelligence. NLP covers the communication between people and machines: interpreting our meaning and constructing valid responses. The field has been around since the 1950s, and you may have heard of the “Turing Test” developed by Alan Turing. The Turing Test measures how well a computer responds to written questions from a human: if an independent judge cannot tell the difference between the person and the machine, the computing system is considered intelligent. We have come a long way since the 1950s, with many advances in the fields of data science and linguistics. The remainder of this article details some of the basic capabilities of natural language processing algorithms, with code examples in Python.


Tokenization

To get started with natural language processing, we begin with some very simple text parsing. Tokenization is the process of taking a stream of text, such as a sentence, and breaking it down into its most basic units. For instance, take the following sentence: “The red fox jumps over the moon.” Each word represents a token, of which there are seven.

To tokenize a sentence using Python:

myText = 'The red fox jumps over the moon.'
myLowerText = myText.lower()
myTextList = myLowerText.split()
print(myTextList)
['the', 'red', 'fox', 'jumps', 'over', 'the', 'moon.']
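Notice that split() leaves the period attached to “moon.” A regex-based tokenizer is one simple way to separate words from punctuation without any NLP library; this is just a sketch, not how NLTK does it internally:

```python
import re

myText = 'The red fox jumps over the moon.'

# \w+ matches runs of word characters; [^\w\s] matches single punctuation
# marks, so words and punctuation come back as separate tokens.
tokens = re.findall(r"\w+|[^\w\s]", myText.lower())
print(tokens)
# ['the', 'red', 'fox', 'jumps', 'over', 'the', 'moon', '.']
```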

Parts of Speech

Parts of speech are used to determine a word’s syntactic function. In the English language, the main parts of speech are adjective, pronoun, noun, verb, adverb, preposition, conjunction, and interjection. Part-of-speech tagging is used to infer the intent of a word based on its use. For example, the word PERMIT can be both a noun and a verb. Verb use: “I permit you to go to the dance.” Noun use: “Did you get the permit from the county?”
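Before turning to NLTK, the PERMIT ambiguity can be illustrated with a toy rule that looks only at the preceding word. This is a deliberately naive sketch (real taggers learn such context statistically); the function name and word lists are my own, not NLTK’s:

```python
# Toy disambiguator: after a pronoun like "I" or "you", "permit" acts as a
# verb; after a determiner like "the" or "a", it acts as a noun.
def tag_permit(sentence):
    words = sentence.lower().rstrip('.?').split()
    i = words.index('permit')
    prev = words[i - 1] if i > 0 else ''
    return 'VB' if prev in {'i', 'you', 'we', 'they'} else 'NN'

print(tag_permit('I permit you to go to the dance.'))         # VB
print(tag_permit('Did you get the permit from the county?'))  # NN
```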

To tag parts of speech using Python, use the NLTK library.

You may have to install NLTK, which is a Python library for natural language processing (pip install nltk).

import nltk
# Requires the 'punkt' and 'averaged_perceptron_tagger' data packages:
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
myText = nltk.word_tokenize('the red fox jumps over the moon.')
print('Parts of Speech: ', nltk.pos_tag(myText))
Parts of Speech:  [('the', 'DT'), ('red', 'JJ'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('moon', 'NN'), ('.', '.')]

So you can see how NLTK breaks a sentence into tokens and interprets the part of speech of each one, for instance ('fox', 'NN'):

NN = noun, singular: 'fox'

Stop Word Removal

Many sentences and paragraphs include words that carry very little meaning or value, such as “a,” “and,” “an,” and “the.” Stop word removal is the process of removing these words from a sentence or stream of words.

To perform stop word removal using Python and NLTK:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = 'a red fox is an animal that is able to jump over the moon.'
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)
filtered_sentence = [w for w in word_tokens if w not in stop_words]
print(filtered_sentence)

['red', 'fox', 'animal', 'able', 'jump', 'moon', '.']
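The stop list itself is just a set of words, so it is easy to see what is happening without NLTK at all. A minimal sketch with a hand-picked stop list (NLTK’s English list is much longer than this):

```python
# Hand-picked stop list for illustration only; NLTK's english list has
# well over a hundred entries.
stop_words = {'a', 'an', 'and', 'the', 'is', 'that', 'to', 'over'}

sentence = 'a red fox is an animal that is able to jump over the moon'
filtered = [w for w in sentence.split() if w not in stop_words]
print(filtered)
# ['red', 'fox', 'animal', 'able', 'jump', 'moon']
```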


Stemming

Stemming is the process of reducing noise in a word and is otherwise referred to as lexicon normalization: it reduces inflection. For example, the word “fishing” has the stem “fish.” Stemming is used to simplify a word down to its base meaning. Another good example is “like,” which is the stem of words such as “likes,” “liked,” and “likely.” Search engines use stemming for this reason: in many situations it is useful for a search on one of these words to return documents that contain another word in the set.

To perform stemming using Python and the NLTK library:

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

ps = PorterStemmer()

words = ["likes", "likely", "likes", "liking"]

for w in words:
    print(w, ' : ', ps.stem(w))

likes  :  like
likely  :  like
likes  :  like
liking  :  like
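The search-engine use case described above can be sketched by keying an inverted index on stems rather than surface forms, so a query for “liked” also matches a document containing “likes.” The toy suffix-stripper below is my own illustration; the real Porter algorithm is far more careful:

```python
# Toy stemmer: strip a few common suffixes, keeping at least a 3-letter stem.
# The Porter algorithm applies many more rules; this is only for illustration.
def stem(word):
    for suffix in ('ing', 'ly', 'ed', 'es', 's'):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

docs = {1: 'she likes fishing', 2: 'he liked the fish', 3: 'red fox'}

# Inverted index keyed by stem rather than by the exact word.
index = {}
for doc_id, text in docs.items():
    for word in text.split():
        index.setdefault(stem(word), set()).add(doc_id)

print(sorted(index[stem('liked')]))  # [1, 2] -- matches "likes" and "liked"
```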


Lemmatization

Stemming and lemmatization are very similar in that both get you to a root word. This is called word normalization, and the two can generate the same output. However, they work very differently. Stemming attempts to chop the ends off words, whereas lemmatization considers whether the word is a noun, verb, or another part of speech. Take the word “saw”: stemming will bring back “saw,” while lemmatization could bring back “see” or “saw” depending on the part of speech. Lemmatization usually returns a readable word, where stemming may not. See below for an example showing the difference.

Let’s take a look at a Python example which compares stemming to lemmatization:

from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
# Requires the WordNet data package: nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

ps = PorterStemmer()

words = ["corpora", "constructing", "better", "done", "worst", "pony"]

for w in words:
    print(w, ' STEMMING : ', ps.stem(w), ' LEMMATIZATION ', lemmatizer.lemmatize(w, pos='v'))

corpora  STEMMING :  corpora  LEMMATIZATION  corpora
constructing  STEMMING :  construct  LEMMATIZATION  construct
better  STEMMING :  better  LEMMATIZATION  better
done  STEMMING :  done  LEMMATIZATION  do
worst  STEMMING :  worst  LEMMATIZATION  worst
pony  STEMMING :  poni  LEMMATIZATION  pony


Linguistics is the study of language: morphology, syntax, phonetics, and semantics. The field, together with data science and computing, has blown up over the past 60 years. We just explored some very simple text analytics capabilities in NLP. Google, Bing, and other search engines leverage this technology to help you find information on the world wide web. Think of how easy it is to have Alexa play your favorite song, or how Siri helps you with directions. It is all because of NLP. Natural language in computing is not a gimmick or a toy; NLP is the future of seamless computing in our lives.

Arcadia Data just released version 5.0, which includes our natural language query capability, Search Based BI. It uses some of the data science and text analytics described above. Check out our video on the Search Based BI tool to learn more.
