Keyword (or keyterm) extraction is the task of automatically identifying the terms that best describe the contents of a document. It is the most important step in generating the Subject Indexes that you can find in the back of many of our publications.
Up until now we’ve been using Sketch Engine to accomplish this. Sketch Engine is proprietary software that offers many pre-compiled corpora to compare your own texts against. For example, we would upload the raw text of a new book and compare it against the British National Corpus, an extensive collection of various English language texts from the late 20th century. Since linguistic publications usually have a vocabulary that is quite different from everyday language, Sketch Engine did a good job at filtering out the terms that were unique to our book – resulting in a list of keyterms such as language acquisition or vowel harmony.
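To make the idea of comparing a text against a reference corpus more concrete, here is a minimal sketch in Python. It is not Sketch Engine's actual implementation: the function names `tokenize` and `keyness`, the `smoothing` parameter, and the scoring formula (a smoothed ratio of per-million frequencies) are assumptions chosen for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return runs of letters (Unicode-aware)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def keyness(focus_text, reference_text, smoothing=1.0):
    """Rank words of the focus text by a smoothed relative-frequency ratio.

    Hypothetical scoring for illustration: words that are frequent in the
    focus text but rare in the reference text receive high scores.
    """
    focus = Counter(tokenize(focus_text))
    reference = Counter(tokenize(reference_text))
    focus_total = sum(focus.values()) or 1
    reference_total = sum(reference.values()) or 1

    scores = {}
    for word, count in focus.items():
        # Frequencies per million tokens make corpora of different sizes comparable.
        focus_pm = count / focus_total * 1_000_000
        reference_pm = reference[word] / reference_total * 1_000_000
        # The smoothing constant avoids division by zero for unseen words.
        scores[word] = (focus_pm + smoothing) / (reference_pm + smoothing)

    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

This sketch scores single words only. Multi-word keyterms such as language acquisition would need the same counting applied to word sequences, and a realistic reference corpus would be far larger than the book itself.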
So why fix something that isn’t broken? Because we are committed to using free and open software wherever possible, and also to making things easier wherever possible. We wanted our own, local, open source keyword extraction software. As an intern with Language Science Press, I was tasked with developing that software. It should work just as well while being easier to use and possibly more tailored to the linguistic domain.
Keyword extraction models
The following paragraphs will demonstrate different approaches to keyword extraction. For illustration purposes we’re going to look at terms extracted from a forthcoming volume of papers from the 49th Annual Conference on African Linguistics. This book represents a wide variety of linguistic disciplines while adhering to the latest LangSci standards, and since I was involved in typesetting I’m somewhat familiar with its contents and possible good key terms.
First, let’s take a look at our baseline, Sketch Engine:
Since our books are written using the typesetting system TeX, one important thing to consider is whether to feed the software the raw TeX code, including all the formatting commands and linguistic examples, or a so-called “detexed” version that ideally only contains the actual text.
The current workflow needs a detexed version, so we use the detex software that comes with TeX Live. We upload the resulting text file to the Sketch Engine web interface and compare it against a reference corpus with a few clicks. The resulting list can be exported to a spreadsheet format.
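To give a rough idea of what detexing involves, here is a deliberately crude sketch in Python. It is not the detex program and does not reproduce its behaviour. The function name `crude_detex` and the regular expressions are assumptions made for illustration only.

```python
import re


def crude_detex(source):
    """Very rough approximation of detexing (hypothetical helper, not detex)."""
    # Drop comments: an unescaped % up to the end of the line.
    text = re.sub(r"(?<!\\)%.*", "", source)
    # Drop environment markers such as \begin{itemize} and \end{itemize}.
    # This must run before the generic command rule, or "itemize" would remain.
    text = re.sub(r"\\(begin|end)\{[^}]*\}", " ", text)
    # Drop command names and optional [..] arguments, but keep {..} arguments.
    text = re.sub(r"\\[a-zA-Z]+\*?(\[[^\]]*\])?", " ", text)
    # Remove the remaining braces and normalise whitespace.
    text = text.replace("{", "").replace("}", "")
    return re.sub(r"\s+", " ", text).strip()
```

Because this sketch keeps the argument of every command, a call like `crude_detex(r"\input{chapters/intro}")` returns the file name as if it were running text. That is one way non-prose material could end up in a term list, although the real detex handles many such cases more carefully.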
Top 20 keyterms extracted by Sketch Engine (e.g. sentence final focus)
As you can see on the right, the top 20 results are generally good. Some terms need manual editing (e.g. complete countable mass to countable mass noun, count noun and mass noun – see chapter 14) while others should be removed, namely parts of linguistic examples (dā gbáŋ and buy book) and TeX commands (tex localbibliography) that appear even though we detex’ed the document!
Edited volumes are an important type of publication in linguistics, and Language Science Press has published about 50 edited volumes up to now. However, the time to publication is normally an issue for edited volumes given the larger number of contributors. In a project with 12 contributors, chances are that at least one of them will not hand in the chapter, the revision or the proofs in time, which delays the whole project.
This leads to a vicious circle: everybody knows that the volume will probably be late anyway, so authors adjust their priorities and focus on other work where the timing is more critical. This delays the work even more, so everybody adjusts their priorities again, and so on. When I was still actively publishing as a linguist, I had a paper in a volume which took 5 years to come out!
This whole problem arises because the book and all its chapters will only be published once the last chapter is in. Otherwise, page numbers cannot be assigned and cross-references will not work.
Our series Open Slavic Linguistics was unhappy with this state of affairs and suggested that book chapters could be published ahead of the volume, as so-called prepublications. This means that all chapters which have been reviewed, revised, proofread, and typeset can be made available on our website. The setup of our LaTeX class allows us to compile the same chapter either individually or as part of a book. As for the pagination problem, we simply set the page numbers to roman and add a note to the footer that the page numbering is preliminary. That way, readers are not led to mistakenly cite Smith (2020: 12) for a chapter which will have a final pagination much higher than 12.
Footer of the prepublished version of a chapter with information about the volume the chapter will appear in and a note on pagination
The book has not one, but two authors. Both have contributed their respective perspectives and expertise. While we see multiple editors for edited volumes on a regular basis, multiple authors for a monograph are much less common. This certainly has to do with the fact that a monograph is much less amenable to “chunking” than an edited volume. In order to make sure that the authors do not interfere with each other’s work, a clear separation of tasks is necessary, as is version control.
As of today, we have published 11 edited volumes. We have found that edited volumes demand much more work from all sides, and that the procedure for publishing edited volumes with Language Science Press seems to cause more astonishment than the process for monographs. In this blog post, I will describe some differences in the setup between monographs and edited volumes and try to explain in more detail what volume editors can expect. Remember that, technically, submissions have to be in LaTeX. We will offer assistance to the best of our capacities if you have chapters submitted in Word, but this depends on our current workload.
Difference between monographs and edited volumes
Monograph authors directly benefit from adherence to the guidelines. They are in direct contact with the coordinator and generally understand how particular technical subtleties impact their book when explained. Their efforts will directly translate into an improvement of a work which is 100% theirs, so normally, they are eager to comply. Furthermore, they are usually responsible for any delays themselves and hence try to minimise them.
In this post, I will detail how structured lexical data as found in FLEX can be converted to *tex, which can be compiled into a LangSci book. I will complement this with some observations about conversions from the XLingPaper format.
Traditionally, once a book was published, there was little you could do to change its content afterwards. Works in linguistics which see a second edition are few and far between. The most you could hope for is a sheet of errata distributed with the book itself. One consequence of this was, incidentally, that many works were withheld for a long time to make sure they were absolutely perfect before releasing them to the printer’s, which would make the content immutable.
Errata. CC-BY-SA Sage Ross.
With electronic publication, things are different. The publication of a new version is comparatively cheap. In this blog post, I will detail the lifecycle of a document at Language Science Press and show how we work together with PaperHive to get the document from the initial stage to the first (and subsequent) editions. At the end, I will put this approach into a wider perspective on academic publishing.
When it comes to writing, reviewing, and proofreading scientific publications and textbooks (for university students), I am convinced that a radical wisdom-of-the-crowd paradigm does not apply, mostly because the crowds are too small and likely also too fragmented. However, the principles of open access definitely allow larger communities to contribute suggestions, ideas, and corrections to publications, simply because the hurdles and the fuss brought about by copyright restrictions are removed. In this post, I propose that there is much more potential to unleash for the writing and editing process by borrowing concepts and adopting technologies from open source software development.
This blog post deals with two parts of the Language Science Press workflow: open reviewing and proofreading. All Language Science Press publications are reviewed by at least two external reviewers to ensure the highest quality. The reviews may be open, that …
To get an idea of the workflow of a manuscript in Language Science Press, you can always read the manuals, but now you can also have a look at our brand new screencast. Other screencasts on single steps of the workflow will follow shortly.
Introduction
The review process of edited volumes is more complex than the review process of articles or books. The contributions to edited volumes are heterogeneous with regard to the quality of research and writing. The scientific merit of each contribution …