Text Preprocessing PART-01


Natural Language Processing (NLP) is the field where we try to make machines understand and analyze human language. Preprocessing text plays a pivotal role on the path to meaningful insights: effective preprocessing techniques transform raw textual data into a format that is amenable to analysis and modeling. In this blog, we will explore various text processing techniques:

Text Cleaning

Before diving into sophisticated techniques, let's begin with the basics: text cleaning. It involves converting all the words in a text to a consistent form and removing unnecessary words and punctuation. Let's think of an example: CAR, car, and Car have the same meaning and can all just be written as car. While working with text data, treating them as three different words introduces redundancy, so we need to convert all text to the same case. Lowercasing is one of the most common practices in NLP projects.

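Here is a minimal sketch; the sample string is my own illustration, not the one from the original post:

```python
# Hypothetical sample text (the original post used its own example)
text = "<p>My CAR is faster than your Car, but every car needs fuel.</p>"

# Python's built-in str.lower() converts every character to lowercase
lowercased = text.lower()
print(lowercased)
# <p>my car is faster than your car, but every car needs fuel.</p>
```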

Using Python's built-in lower() method, we can convert text data into lowercase. Notice that the sample text above contains HTML tags. Much of the text in NLP projects is scraped from websites, so we need to remove HTML tags from our text, as those tags do not add any value to our data. We can use regular expressions to remove them.

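A simple sketch with a regular expression; note that a regex like this is a rough heuristic, not a full HTML parser:

```python
import re

# Hypothetical sample text containing HTML tags
text = "<p>my car is faster than your car, but every car needs fuel.</p>"

# Remove anything that looks like an HTML tag: "<" ... ">"
cleaned = re.sub(r"<.*?>", "", text)
print(cleaned)
# my car is faster than your car, but every car needs fuel.
```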

Moreover, there are cases where we need to remove punctuation from text, and regular expressions help here as well.

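A minimal sketch using Python's string.punctuation as the set of characters to strip (the sample sentence is my own):

```python
import re
import string

text = "hello, world! is this... preprocessing? yes: it is."

# Delete every character that appears in string.punctuation
no_punct = re.sub(f"[{re.escape(string.punctuation)}]", "", text)
print(no_punct)
# hello world is this preprocessing yes it is
```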

Tokenization

Token is just a fancy word for a symbol. Usually, a token is a word that carries meaning and cannot be broken into smaller units. Tokenization means splitting a sentence into a sequence of such tokens. Let's give an example:

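For instance (an illustration of my own, in place of the original figure), the sentence "I love natural language processing" becomes the token sequence ["I", "love", "natural", "language", "processing"].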

One of the most common techniques for text tokenization is whitespace tokenization, in which we split the sentence on the spaces between words.

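A minimal sketch using Python's built-in str.split() (my own sample sentence); the second call shows the multiple-space pitfall discussed next:

```python
sentence = "I love natural language processing"

# Whitespace tokenization: split on the space character
tokens = sentence.split(" ")
print(tokens)
# ['I', 'love', 'natural', 'language', 'processing']

# Pitfall: consecutive spaces produce empty tokens
messy = "I love  natural language processing"  # note the double space
print(messy.split(" "))
# ['I', 'love', '', 'natural', 'language', 'processing']
```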

However, splitting sentences on whitespace can be troublesome: as shown above, multiple spaces between words produce empty tokens, and handling that requires extra code. So far we have been using Python's built-in functions to process text, but some of these tasks are much easier with a library like NLTK. We can use the word_tokenize function from nltk.tokenize to convert our text into tokens.

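A minimal sketch with NLTK's word_tokenize (the sample sentence is my own); note that the tokenizer models must be downloaded once:

```python
import nltk
nltk.download("punkt")  # one-time download of the tokenizer models

from nltk.tokenize import word_tokenize

text = "NLTK handles punctuation, too! Isn't that nice?"
print(word_tokenize(text))
# ['NLTK', 'handles', 'punctuation', ',', 'too', '!', 'Is', "n't", 'that', 'nice', '?']
```

Notice that, unlike whitespace splitting, word_tokenize separates punctuation into its own tokens and even splits contractions like "Isn't".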

Like the word tokenizer, the NLTK package also provides a sentence-level tokenizer that separates a text into sentences.

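A minimal sketch with NLTK's sent_tokenize (again, the sample text is my own):

```python
import nltk
nltk.download("punkt")  # one-time download of the tokenizer models

from nltk.tokenize import sent_tokenize

text = "Text cleaning comes first. Tokenization comes next. More steps follow!"
print(sent_tokenize(text))
# ['Text cleaning comes first.', 'Tokenization comes next.', 'More steps follow!']
```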

Text processing ain’t done yet!! Check out my next blog for more on text processing!! Find the code file here.