Introduction to natural language processing

What is NLP?

Don’t confuse it with neuro-linguistic programming.

Natural language processing allows computers to access unstructured data expressed as speech or text. Speech or text data does involve linguistic structure. Linguistic structures vary depending on the language.

Bender, 2019

NLP is a class of tasks (and the algorithms that solve them) for working with text in natural languages, for example named entity recognition (NER), part-of-speech (POS) tagging, text categorization, and coreference resolution.

NLP vs NLU

Image source: Understanding Natural Language Understanding

See paperswithcode and nlpprogress for a bigger taxonomy of tasks.

Getting started

I’m not an expert in machine learning (yet), but I know something about developer experience, so I will show how to get started with NLP fast and comfortably.

We will use:

  • Docker (with Jupyter Docker Stacks)
  • Jupyter Notebook
  • spaCy

There are a lot of tools in this field, but these seem approachable and modern to me.

Setup

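Create a Dockerfile:

# Start from the Jupyter Docker Stacks data science image, pinned to a specific tag
FROM jupyter/datascience-notebook:1386e2046833
# Install spaCy and its small English model
RUN pip install spacy
RUN python -m spacy download en_core_web_sm

The base image comes from the awesome Jupyter Docker Stacks.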

Add docker-compose.yml:

version: "3"
services:
  web:
    build: .
    ports:
      - "8888:8888"
    volumes:
      - ./work:/home/jovyan/work

Run

Run (in the terminal, in the same folder where you created the files):

docker-compose up

This command will download, build, and start the development environment. You will see text like this:

To access the notebook, copy and paste one of these URLs:
    http://127.0.0.1:8888/?token=...

  • Open the URL in a browser
  • Navigate to the “work” folder
  • Click “New” in the top right corner and select “Python 3” from the dropdown

Your notebook is ready for work.

A Jupyter notebook is a mix of a runtime environment for experiments and a scientific journal.
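Before the first experiment, it may be worth running a quick sanity check in the first cell. This is only a sketch: it assumes the image was built from the Dockerfile above, and it uses `spacy.util.is_package` to ask whether the `en_core_web_sm` model is installed as a package. The variable name `model_installed` is mine.

```python
import spacy
from spacy.util import is_package

# Show which spaCy version ended up in the image.
print("spaCy version:", spacy.__version__)

# True if the model from the Dockerfile is installed as a Python package.
model_installed = is_package("en_core_web_sm")
print("en_core_web_sm installed:", model_installed)
```

If the second line prints False, the model download step in the Dockerfile probably did not run, and rebuilding the image should fix it.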

First experiment: POS

POS stands for part-of-speech tagging: for each word in a given text, we need to identify its part of speech, for example noun or verb.

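import spacy
from spacy import displacy

# Load the small English pipeline installed in the Dockerfile
nlp = spacy.load("en_core_web_sm")
doc1 = nlp("This is a sentence.")
# The "dep" style draws the dependency parse with a POS tag under each word
displacy.render([doc1], style="dep", page=True)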

Type in the program and click “Run”.

Here is the list of all tags.

Second experiment: NER

NER stands for named entity recognition. This task is about picking out specific entities in a text, for example people's names that consist of more than one part (Siddhartha Gautama), country names (U.K.), or amounts of money ($1 billion).

import spacy
from spacy import displacy

text = "When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously."
# Load the small English pipeline and run it over the text
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
# The "ent" style highlights each recognized entity with its label
displacy.render(doc, style="ent")

Type in the program and click “Run”.

Here is the list of all entity types.

Save your work

Rename your notebook (click “Untitled”) to something more meaningful, for example, “experiments”. Click the “Save” button.

Create a .gitignore file:

work/.ipynb_checkpoints

Run (in the terminal, in the same folder where you created the files):

git init
git add .
git commit -m "first commit"

Your work is now saved in Git.

Tutorial

The purpose of these experiments was to show how easy it is to get started. If you actually want to learn NLP, you can use this tutorial.

Good luck!

PS

Check out the spaCy Universe for more cool projects. spaCy is just one of the tools; you can use any alternative you like, for example NLTK or Stanford CoreNLP.

Except where otherwise noted, content on this site is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0