Seminar: Machine Learning for Distributional Semantics

Algorithms and machine learning
Advanced studies
Year  Semester  Date           Period  Language  In charge
2015  spring    11.02-29.04.   3-4     English   Roman Yangarber


Time       Room  Lecturer         Date
Wed 10-12  C222  Roman Yangarber  11.02.2015-29.04.2015


Semantics, the problem of capturing meaning in linguistic data, is central to
a number of Natural Language Processing (NLP) tasks.  For example, the task of
information extraction (IE) is to transform data from free-form text into
structured form (e.g., a database).  The input to IE is plain text; the output
is a set of quantifiable "facts" found in the text.  Facts may represent
entities with properties (real-world objects such as persons and
organizations) and relations between entities (corporate acquisitions,
financial fraud, etc.).  IE is used to monitor events in a subject domain, to
populate databases (e.g., gathering information about gene expression from
scientific papers), or to generate quantitative data for downstream
processing (e.g., data mining).

Many IE problems (like other NLP problems) may be viewed in terms of semantic
analysis.  Common approaches are rule- and ontology-based: they match text
against patterns with lexical, syntactic and semantic constraints, where the
semantic representations are based on ontologies.  Ontology-based approaches
suffer from low coverage, due to the high cost of building and maintaining
ontologies; and when coverage is increased, specificity drops below the
threshold of minimal utility.
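
To make the rule-based approach concrete, here is a minimal sketch of
pattern-based extraction for a single relation type.  The pattern, the field
names and the example sentences are all hypothetical and chosen only for
illustration; real IE systems use much richer lexical, syntactic and semantic
constraints.

```python
import re

# Hypothetical hand-written rule for one relation: "<Buyer> acquired <Target>".
# A name is approximated as a run of capitalized tokens (an assumption).
NAME = r"[A-Z][\w&-]*(?: [A-Z][\w&-]*)*"
ACQUISITION = re.compile(
    rf"(?P<buyer>{NAME}) (?:acquired|bought) (?P<target>{NAME})"
)

def extract_acquisitions(text):
    """Return one structured record per match of the rule in the text."""
    return [
        {
            "relation": "acquisition",
            "buyer": m.group("buyer"),
            "target": m.group("target"),
        }
        for m in ACQUISITION.finditer(text)
    ]

text = ("Acme Corp acquired Widget Works last year. "
        "Later, Globex purchased Initech.")
facts = extract_acquisitions(text)
for fact in facts:
    print(fact)
```

The second sentence states the same kind of fact, but the rule misses it
because "purchased" is not among the listed verbs.  This is the coverage
problem in miniature: every new wording needs another hand-written rule.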

Recently, attention has been shifting rapidly toward methods for automatically
constructing semantic representations of linguistic units (word, phrase,
sentence, etc.) by modeling their distributional properties in large
amounts of text data.  The methods are based on unsupervised machine learning,
linear algebra, statistics, information theory, etc.
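
As a rough illustration of the distributional idea, the sketch below
represents each word by counts of the words that occur near it and compares
words by cosine similarity.  The toy corpus, the window size and the function
names are assumptions made for this example; the methods discussed in the
seminar operate on far larger corpora and usually add weighting and
dimensionality reduction.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Map each word to counts of the words within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy corpus, invented for illustration only.
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the dog",
    "stocks fell on the market",
]
vectors = cooccurrence_vectors(corpus)
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_stocks = cosine(vectors["cat"], vectors["stocks"])
print(sim_cat_dog > sim_cat_stocks)
```

On this corpus "cat" and "dog" share several context words, while "cat" and
"stocks" share none, so the first pair comes out as more similar.  No labels
or hand-built ontology are involved; the representation comes entirely from
the text.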

In the seminar, we will explore research papers on methods for capturing
semantics and on evaluating these methods on NLP tasks.  We will use papers
from recent NLP and machine-learning conferences.
Appropriate topics include (but are not limited to):

- vector-space and latent models (SVD, PCA, etc.),
- topic modeling (LDA, etc.),
- clustering, dimensionality reduction,
- bootstrapping,
- word embeddings,
- deep learning,
- Web-scale methods.

We may occasionally have invited guest speakers from outside the class
presenting their own research.



Prerequisites

- General understanding of machine learning;
- fundamentals of NLP, or permission from the instructor.

Completing the course

Each participant presents two papers of her/his choice to the audience and
answers questions from the audience.

The grade is based on the presentations (60%), active participation in the
presentations of others (30%), and attendance (10%).

Literature and material

Suggested readings/paper selection will be posted on the Course Wiki.