Keyword extraction from Language Science Press publications

Posted on 23. June 2021 by Noël Simmel

Keyword (or keyterm) extraction is the task of automatically identifying the terms that best describe the contents of a document. It is the most important step in generating the Subject Indexes that you can find in the back of many of our publications.

Up until now we’ve been using Sketch Engine to accomplish this. Sketch Engine is proprietary software that offers many pre-compiled corpora to compare your own texts against. For example, we would upload the raw text of a new book and compare it against the British National Corpus, an extensive collection of various English language texts from the late 20th century. Since linguistic publications usually have a vocabulary that is quite different from everyday language, Sketch Engine did a good job of picking out the terms that were characteristic of our book, resulting in a list of keyterms such as language acquisition or vowel harmony.
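To make the idea of comparing a text against a reference corpus more concrete, here is a minimal sketch of a frequency-ratio keyness score. It is an illustration only, not Sketch Engine's implementation and not the software described in this post: the function names, the restriction to two-word terms, and the smoothing value of 1.0 are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercased word tokens; punctuation is dropped.
    return re.findall(r"\w+", text.lower())


def bigrams(tokens):
    # Adjacent word pairs as candidate terms. As a simplification,
    # pairs may span sentence boundaries.
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def keyness(focus_text, reference_text, smoothing=1.0, top=5):
    """Rank two-word terms by how much more frequent they are in
    focus_text than in reference_text (hypothetical helper)."""
    focus_counts = Counter(bigrams(tokenize(focus_text)))
    ref_counts = Counter(bigrams(tokenize(reference_text)))
    focus_total = max(sum(focus_counts.values()), 1)
    ref_total = max(sum(ref_counts.values()), 1)

    scores = {}
    for term, count in focus_counts.items():
        # Frequencies per million keep corpora of different sizes comparable.
        focus_pm = count / focus_total * 1_000_000
        ref_pm = ref_counts[term] / ref_total * 1_000_000
        # The smoothing constant avoids division by zero for terms
        # that never occur in the reference corpus.
        scores[term] = (focus_pm + smoothing) / (ref_pm + smoothing)

    # Highest score first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top]
```

Calling `keyness(book_text, reference_text)` returns the highest-scoring candidate terms together with their scores. A term that is equally frequent in both texts scores 1.0, while a term that is absent from the reference text scores far higher.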

So why fix something that isn’t broken? Because we are committed to using free and open software wherever possible, and also to making things easier wherever possible. We wanted our own, local, open source keyword extraction software. As an intern with Language Science Press, I was tasked with developing that software. It should work just as well while being easier to use and possibly more tailored to the linguistic domain.

Keyword extraction models

The following paragraphs will demonstrate different approaches to keyword extraction. For illustration purposes we’re going to look at terms extracted from a forthcoming volume of papers from the 49th Annual Conference on African Linguistics. This book represents a wide variety of linguistic disciplines while adhering to the latest LangSci standards, and since I was involved in typesetting I’m somewhat familiar with its contents and possible good key terms.

Sketch Engine

First, let’s take a look at our baseline, Sketch Engine:

Since our books are written using the typesetting system TeX, one important thing to consider is whether to feed the software the raw TeX code, including all the formatting commands and linguistic examples, or a so-called “detexed” version that ideally only contains the actual text.

The current workflow needs a detexed version, so we use the detex software that comes with TeX Live. We upload the resulting text file to the Sketch Engine web interface and compare it against a reference corpus with a few clicks. The resulting list can be exported to a spreadsheet format.

noun class
matrix clause
head noun
dā gbáŋ
amba rc
object agreement
vowel system
focus particle
focus marker
proleptic object
buy book
countable mass
subject position
direct object
atr harmony
external argument
tex localbibliography
sentence final focus
final focus
grammatical role

As the list above shows, the top 20 results are generally good. Some terms need manual editing (e.g. completing countable mass to countable mass noun, count noun and mass noun; see chapter 14), while others should be removed, namely parts of linguistic examples (dā gbáŋ and buy book) and TeX commands (tex localbibliography) that appear even though we detexed the document!
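Part of this manual cleanup could be pre-screened automatically. The following is a rough sketch under stated assumptions, not part of our actual workflow: the blocklist `TEX_RESIDUE` and the function `screen` are hypothetical, and the non-ASCII check is only a crude heuristic for spotting object-language examples.

```python
# Hypothetical blocklist of tokens that indicate leftover TeX commands.
TEX_RESIDUE = {"tex", "localbibliography"}


def screen(terms, residue=TEX_RESIDUE):
    """Split candidate terms into (kept, flagged) lists.

    A term is flagged for manual review if it contains a blocklisted
    token or any non-ASCII character.
    """
    kept, flagged = [], []
    for term in terms:
        has_residue = any(word in residue for word in term.split())
        has_non_ascii = not term.isascii()
        if has_residue or has_non_ascii:
            flagged.append(term)
        else:
            kept.append(term)
    return kept, flagged
```

Applied to the list above, this would flag dā gbáŋ and tex localbibliography, but not buy book, which looks like an ordinary English term. It would also flag legitimate keyterms that happen to contain diacritics, so flagged items still need a human decision.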

Continue reading →

Collecting reader feedback with PaperHive, docLoop and GitHub

Posted on 14. September 2020 by Sebastian Nordhoff

Back in 2017, we wrote a blog post on fluid publication. It described how a book can be developed by the author together with the readership, reusing techniques well known from software development.

The author 1) starts with a draft version and collects feedback from colleagues. The book then passes through the stages of 2) (open) review, 3) acceptance, 4) community proofreading and finally 5) publication of the first edition. A history of the different versions is kept on GitHub. GitHub also provides functionality for managing lists of open issues that still have to be addressed before the next stage can begin.

As detailed in various posts on this blog, we use PaperHive for community proofreading. Today, we can showcase docLoop, which allows us to transform the community comments into todo lists on GitHub, closing the loop from author to reader and back from the reader to the author.

Let’s look at an example: Voice at the interfaces: The syntax, semantics, and morphology of the Hebrew verb by Itamar Kastner. We can see the progress of this book on its GitHub page. The book was started in June 2018 and finalised in June 2020. Between the setup of the project and the publication, we count 259 different versions. Besides the author itamarkast, who provided 227 improved versions, kopeckyf and Glottotopia from the LangSci team provided 24 and 3 commits, respectively.

The first commit of this repository, on May 17, 2018

Continue reading →

Conversion of lexical databases into printable books

Posted on 18. July 2017 by Sebastian Nordhoff

We have recently published two dictionaries in our series African Language Grammars and Dictionaries which were automatically converted from the FLEX lexical database. These two books are The Ik language and A dictionary and grammatical outline of Chakali.

In this post, I will detail how structured lexical data as found in FLEX can be converted to *tex, which can be compiled into a LangSci book. I will complement this with some observations about conversions from the XLingPaper format.

Continue reading →

Books in EPUB, HTML and XML formats

Posted on 20. May 2016 by Mathias Schenner

As mentioned in a previous post, we are working on producing electronic books in formats other than PDF. In order to give you an impression of our recent advances, here are HTML, XML and EPUB versions of the first book in the Conceptual Foundations of Language Science series, Natural Causes of Language by N.J. Enfield:

HTML Version (XHTML5)
XML Version (TEI-based)
EPUB Version

All of these formats were produced from the original LaTeX sources of the book using a development version of the texhs converter, with only some minimal styling applied.

Continue reading →

Vancouver, August 2015: PKP Scholarly Publishing Conference

Posted on 7. September 2015 by Carola Fanselow

PKP, the Public Knowledge Project, is a non-profit research initiative that focuses on how publicly funded research can be made freely available through open access policies. One of PKP’s projects is Open Monograph Press (OMP), open source software that let us set up our web site and back-office management swiftly and at minimal cost. The 5th PKP conference took place from August 11 to 14 in Vancouver, Canada. Here are my impressions.

Continue reading →

Electronic book formats

Posted on 16. March 2015 by Mathias Schenner

Language Science Press currently offers electronic versions of published books only in PDF format. This blog post describes our plans for providing additional book formats in the near future and our recent progress.

Continue reading →

Language Science Press Blog

Category Archives: development
