Keyword extraction from Language Science Press publications

Keyword (or keyterm) extraction is the task of automatically identifying the terms that best describe the contents of a document. It is the most important step in generating the Subject Indexes that you can find in the back of many of our publications.

Up until now we’ve been using Sketch Engine to accomplish this. Sketch Engine is proprietary software that offers many pre-compiled corpora to compare your own texts against. For example, we would upload the raw text of a new book and compare it against the British National Corpus, an extensive collection of various English language texts from the late 20th century. Since linguistic publications usually have a vocabulary that is quite different from everyday language, Sketch Engine did a good job of singling out the terms that were specific to our book – resulting in a list of keyterms such as language acquisition or vowel harmony.
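To give a rough idea of what this corpus comparison amounts to, here is a minimal Python sketch of frequency-based keyness scoring: a term ranks highly when it is much more frequent (per million tokens) in the focus text than in the reference corpus. This is only an illustration of the general idea: Sketch Engine’s actual tokenisation, multi-word term detection, and scoring are more sophisticated, and all names, the toy data, and the smoothing constant below are our own assumptions.

    from collections import Counter

    def keyness_ranking(focus_tokens, reference_tokens, smoothing=1.0):
        # Rank terms by how specific they are to the focus text.
        focus, ref = Counter(focus_tokens), Counter(reference_tokens)
        n_focus, n_ref = len(focus_tokens), len(reference_tokens)

        def fpm(count, total):
            # Normalise raw counts to frequency per million tokens,
            # so corpora of different sizes become comparable.
            return count / total * 1_000_000

        scores = {
            term: (fpm(f, n_focus) + smoothing) / (fpm(ref[term], n_ref) + smoothing)
            for term, f in focus.items()
        }
        return sorted(scores, key=scores.get, reverse=True)

    # Toy data: "harmony" is frequent in the book but absent from the
    # reference corpus, so it comes out on top.
    book = ["vowel", "harmony", "the", "of", "harmony", "system"]
    reference = ["the", "of", "the", "a", "system", "of"]
    print(keyness_ranking(book, reference)[:3])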

So why fix something that isn’t broken? Because we are committed to using free and open software wherever possible, and also to making things easier wherever possible. We wanted our own local, open-source keyword extraction software. As an intern with Language Science Press, I was tasked with developing that software: it should work just as well as Sketch Engine while being easier to use and possibly more tailored to the linguistic domain.

Keyword extraction models

The following paragraphs demonstrate different approaches to keyword extraction. For illustration purposes, we’re going to look at terms extracted from a forthcoming volume of papers from the 49th Annual Conference on African Linguistics. This book represents a wide variety of linguistic subdisciplines while adhering to the latest LangSci standards, and since I was involved in typesetting it, I’m somewhat familiar with its contents and with what good key terms would look like.

Sketch Engine

First, let’s take a look at our baseline, Sketch Engine:

Since our books are written using the typesetting system TeX, one important thing to consider is whether to feed the software the raw TeX code, including all the formatting commands and linguistic examples, or a so-called “detexed” version that ideally only contains the actual text.

The current workflow needs a detexed version, so we use the detex software that comes with TeX Live. We upload the resulting text file to the Sketch Engine web interface and compare it against a reference corpus with a few clicks. The resulting list can be exported to a spreadsheet format.
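For readers who want to reproduce the detex step locally, here is a small sketch (assuming TeX Live’s detex binary is on the PATH; the file names are hypothetical):

    import subprocess

    def detex_file(tex_path="chapter.tex", txt_path="chapter.txt"):
        # detex strips TeX commands and writes the plain text to stdout.
        plain = subprocess.run(
            ["detex", tex_path],
            capture_output=True, text=True, check=True,
        ).stdout
        with open(txt_path, "w", encoding="utf-8") as f:
            f.write(plain)

As the results below show, this stripping is not perfect: some TeX material survives and resurfaces among the extracted terms.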


The top 20 key terms extracted by Sketch Engine:

noun class
matrix clause
head noun
dā gbáŋ
amba rc
object agreement
vowel system
focus particle
focus marker
proleptic object
buy book
countable mass
subject position
direct object
atr harmony
external argument
tex localbibliography
sentence final focus
final focus
grammatical role

As you can see in the list above, the top 20 results are generally good. Some terms need manual editing (e.g. completing countable mass to countable mass noun, count noun and mass noun – see chapter 14), while others should be removed, namely parts of linguistic examples (dā gbáŋ and buy book) and TeX commands (tex localbibliography) that appear even though we detexed the document!
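Part of this cleanup can be automated. Here is a hedged sketch of a post-processing filter; the residue list and the heuristics are illustrative assumptions, not what Sketch Engine (or our final tool) actually does:

    import re

    TEX_RESIDUE = {"tex", "localbibliography"}  # illustrative, not exhaustive

    def is_tex_residue(term):
        # Flag terms containing tokens that are clearly TeX leftovers.
        return any(token in TEX_RESIDUE for token in term.split())

    def is_example_material(term):
        # Combining diacritics and other non-ASCII characters often
        # betray object-language words from glossed examples.
        return bool(re.search(r"[^\x00-\x7f]", term))

    def clean(terms):
        return [t for t in terms
                if not is_tex_residue(t) and not is_example_material(t)]

    print(clean(["noun class", "dā gbáŋ", "tex localbibliography"]))
    # prints: ['noun class']

Such blunt heuristics have obvious limits: English-looking example fragments like buy book slip through, and legitimate terms containing special characters would be discarded, which is why a manual editing pass remains necessary.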


Prepublication of chapters in edited volumes

Edited volumes and time to publication

Edited volumes are an important type of publication in linguistics, and Language Science Press has published about 50 edited volumes up to now. However, time to publication is often an issue for edited volumes, given the larger number of contributors: in a project with 12 contributors, chances are that at least one of them will not hand in their chapter, revision, or proofs on time, which delays the whole project.

This leads to a vicious circle: everybody knows that the volume will probably be late anyway, so authors adjust their priorities and focus on other work where the timing is more critical. This delays the volume even more, so everybody adjusts their priorities again, and so on. When I was still actively publishing as a linguist, I had a paper in a volume which took 5 years to come out!

This whole problem arises because the book and all its chapters will only be published once the last chapter is in. Otherwise, page numbers cannot be assigned and cross-references will not work.

Our solution

Our series Open Slavic Linguistics was unhappy with this state of affairs and suggested that book chapters be published ahead of the volume, as so-called prepublications. This means that any chapter which has been reviewed, revised, proofread, and typeset can be made available on our website. The setup of our LaTeX class allows the same chapter to be compiled either individually or as part of a book. As for the pagination problem, we simply set the page numbers to roman numerals and add a note to the footer that the page numbering is preliminary. That way, readers are not led to mistakenly cite Smith (2020: 12) for a chapter whose final pagination will be much higher than 12.

Footer of the prepublished version of a chapter with information about the volume the chapter will appear in and a note on pagination


What it means to be open and community-based: The Unicode cookbook as a showcase

We are happy to announce the publication of The Unicode Cookbook for linguists: Managing writing systems using orthography profiles by Steven Moran and Michael Cysouw. Besides being a very insightful and valuable book for all linguists dealing with character encoding issues (most if not all linguists?), this publication also points the way forward in a number of domains central to the future of academic publishing in linguistics. This blog post discusses the innovative aspects we see manifested in this book.

Multiple authors

The book has not one, but two authors, who have each contributed their own perspective and expertise. While we see multiple editors for edited volumes on a regular basis, multiple authors for a monograph are much less common. This certainly has to do with the fact that a monograph is much less amenable to “chunking” than an edited volume. To make sure that the authors do not interfere with each other’s work, a clear separation of tasks is necessary, as is version control.

Version control

The LaTeX source code of the project is available on GitHub at https://github.com/unicode-cookbook/cookbook. The authors started on 2015-03-29 with this version, and all historical files are still available. To date, 310 updates have been made to the book, 174 of them by Moran and 121 by Cysouw. The full history of the project can be seen at https://github.com/unicode-cookbook/cookbook/commits/master. In order to have clearly designated versions for reference, the authors have created releases.
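Counts like these can be extracted from the repository itself. A minimal sketch (assuming a local clone of the repository in a cookbook directory):

    import subprocess
    from collections import Counter

    def commits_per_author(repo_path="cookbook"):
        # '%an' makes git log print one author name per commit.
        names = subprocess.run(
            ["git", "-C", repo_path, "log", "--format=%an"],
            capture_output=True, text=True, check=True,
        ).stdout.splitlines()
        return Counter(names)

    for author, n in commits_per_author().most_common():
        print(f"{n:4d}  {author}")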

Workflow for edited volumes

As of today, we have published 11 edited volumes. We have found that edited volumes demand much more work from all sides, and that the procedure for publishing edited volumes with Language Science Press seems to cause more astonishment than the process for monographs. In this blog post, I will describe some differences in the setup between monographs and edited volumes and try to explicate in more detail what volume editors can expect. Remember that, technically, submissions have to be in LaTeX. We will offer assistance to the best of our capacities if you have chapters submitted in Word, but this depends on our current workload.

Difference between monographs and edited volumes

Monograph authors directly benefit from adherence to the guidelines. They are in direct contact with the coordinator and generally understand how particular technical subtleties impact their book when explained. Their efforts will directly translate into an improvement of a work which is 100% theirs, so normally, they are eager to comply. Furthermore, they are usually responsible for any delays themselves and hence try to minimise them.

Conversion of lexical databases into printable books

We have recently published two dictionaries in our series African Language Grammars and Dictionaries which were automatically converted from FLEx lexical databases. These two books are The Ik language and A dictionary and grammatical outline of Chakali.

In this post, I will detail how structured lexical data as found in FLEx can be converted to TeX, which can then be compiled into a LangSci book. I will complement this with some observations about conversions from the XLingPaper format.
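As a first taste of what such a conversion involves, here is a minimal sketch: FLEx can export its lexicon as LIFT XML, and each entry can then be mapped to a TeX macro. The XML snippet and the \entry macro are illustrative assumptions, not the actual LangSci setup.

    import xml.etree.ElementTree as ET

    LIFT_SNIPPET = """
    <lift>
      <entry>
        <lexical-unit><form lang="ikx"><text>ama</text></form></lexical-unit>
        <sense>
          <grammatical-info value="n"/>
          <gloss lang="en"><text>person</text></gloss>
        </sense>
      </entry>
    </lift>
    """

    def entry_to_tex(entry):
        # Map one LIFT entry to a hypothetical \entry TeX macro.
        headword = entry.findtext("lexical-unit/form/text")
        sense = entry.find("sense")
        pos = sense.find("grammatical-info").get("value")
        gloss = sense.findtext("gloss/text")
        return rf"\entry{{{headword}}}{{{pos}}}{{{gloss}}}"

    root = ET.fromstring(LIFT_SNIPPET)
    for entry in root.iter("entry"):
        print(entry_to_tex(entry))
    # prints: \entry{ama}{n}{person}

A real conversion additionally has to deal with variant forms, multiple senses, examples, and cross-references, but the overall pattern of walking the XML tree and emitting one macro per entry stays the same.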


Document lifecycles and fluid publication

Traditionally, once a book was published, there was little you could do to change its content afterwards. Works in linguistics which see a second edition are few and far between. The most you could hope for was a sheet of errata distributed with the book itself. One consequence of this was, incidentally, that many works were withheld for a long time to make sure they were absolutely perfect before being released to the printer’s, which would make the content immutable.

Errata. CC-BY-SA Sage Ross.

With electronic publication, things are different: the publication of a new version is comparatively cheap. In this blog post, I will detail the lifecycle of a document at Language Science Press and show how we work together with PaperHive to get a document from its initial stage to the first (and subsequent) editions. At the end, I will put this approach into a wider perspective on academic publishing.


Open Authoring as the obvious next step in open publishing

When it comes to writing, reviewing, and proofreading scientific publications and textbooks (for university students), I am convinced that a radical wisdom-of-the-crowd paradigm does not apply, mostly because the crowds are too small and likely also too fragmented. However, the principles of open access definitely allow larger communities to contribute suggestions, ideas, and corrections to publications, simply because the hurdles and the fuss brought about by copyright restrictions are removed. In this post, I propose that there is much more potential to unleash in the writing and editing process by borrowing concepts and adopting technologies from open source software development.
