Keyword (or keyterm) extraction is the task of automatically identifying the terms that best describe the contents of a document. It is the most important step in generating the Subject Indexes that you can find in the back of many of our publications.
Up until now we’ve been using Sketch Engine to accomplish this. Sketch Engine is proprietary software that offers many pre-compiled corpora to compare your own texts against. For example, we would upload the raw text of a new book and compare it against the British National Corpus, an extensive collection of various English language texts from the late 20th century. Since linguistic publications usually have a vocabulary that is quite different from everyday language, Sketch Engine did a good job at filtering out the terms that were unique to our book – resulting in a list of keyterms such as language acquisition or vowel harmony.
So why fix something that isn’t broken? Because we are committed to using free and open software wherever possible, and also to making things easier wherever possible. We wanted our own, local, open source keyword extraction software. As an intern with Language Science Press, I was tasked with developing that software. It should work just as well while being easier to use and possibly more tailored to the linguistic domain.
Keyword extraction models
The following paragraphs will demonstrate different approaches to keyword extraction. For illustration purposes we’re going to look at terms extracted from a forthcoming volume of papers from the 49th Annual Conference on African Linguistics. This book represents a wide variety of linguistic disciplines while adhering to the latest LangSci standards, and since I was involved in typesetting I’m somewhat familiar with its contents and possible good key terms.
First, let’s take a look at our baseline, Sketch Engine:
Since our books are written using the typesetting system TeX, one important thing to consider is whether to feed the software the raw TeX code, including all the formatting commands and linguistic examples, or a so-called “detexed” version that ideally only contains the actual text.
The current workflow needs a detexed version, so we use the detex software that comes with TeX Live. We upload the resulting text file to the Sketch Engine web interface and compare it against a reference corpus with a few clicks. The resulting list can be exported to a spreadsheet format.
sentence final focus
As you can see on the right, the top 20 results are generally good. Some terms need manual editing (e.g. complete countable mass to countable mass noun, count noun and mass noun – see chapter 14) while others should be removed, namely parts of linguistic examples (dā gbáŋ and buy book) and TeX commands (tex localbibliography) that appear even though we detex’ed the document!
Back in 2017, we wrote a blog post on fluid publication. This explained the development of a book by the author together with the readership, reusing techniques well-known from software development.
The author 1) starts with a draft version, collects feedback from colleagues, and then the stages of 2) (open) review, 3) acceptance, 4) community proofreading and finally 5) publication of the first edition follow. A history of the different versions is kept on GitHub. GitHub also provides functionalities to manage lists of open issues which still have to be addressed before the next stage can be initiated.
As detailed in various posts on this blog, we use PaperHive for community proofreading. Today, we can showcase docLoop, which allows us to transform the community comments into todo lists on GitHub, closing the loop from author to reader and back from the reader to the author.
Let’s look at an example, Voice at the interfaces: The syntax, semantics, and morphology of the Hebrew verb by Itamar Kastner. We can see the progress of this book on its GitHub page. The book was started in June 2018 and finalised in June 2020. Between the setup of the project and the publication, we count 259 different versions. Next to the author itamarkast, who provided 227 improved versions, kopeckyf and Glottotopia from the LangSci team provided 24 and 3 commits, respectively.
In this post, I will detail how structured lexical data as found in FLEX can be converted to *tex, which can be compiled into a LangSci book. I will complement this with some observations about conversions from the XLingPaper format.
PKP – the Public Knowledge Project – is a non-profit research initiative that focuses on how publicly funded research can be made freely available through open access policies. One of PKP’s projects is Open Monograph Press (OMP), an open source software that let us set up our web site and back-office management swiftly and with only minimal costs. The 5th PKP conference took place from August 11-14 in Vancouver, Canada. Here are my impressions.
Language Science Press currently offers electronic versions of published books only in PDF format. This blog post describes our plans for providing additional book formats in the near future and our recent progress.