Keyword extraction from Language Science Press publications

Keyword (or keyterm) extraction is the task of automatically identifying the terms that best describe the contents of a document. It is the most important step in generating the Subject Indexes that you can find in the back of many of our publications.

Up until now we’ve been using Sketch Engine to accomplish this. Sketch Engine is proprietary software that offers many pre-compiled corpora to compare your own texts against. For example, we would upload the raw text of a new book and compare it against the British National Corpus, an extensive collection of various English language texts from the late 20th century. Since linguistic publications usually have a vocabulary that is quite different from everyday language, Sketch Engine did a good job at filtering out the terms that were unique to our book – resulting in a list of keyterms such as language acquisition or vowel harmony.

So why fix something that isn’t broken? Because we are committed to using free and open software wherever possible, and also to making things easier wherever possible. We wanted our own, local, open source keyword extraction software. As an intern with Language Science Press, I was tasked with developing that software. It should work just as well while being easier to use and possibly more tailored to the linguistic domain.

Keyword extraction models

The following paragraphs will demonstrate different approaches to keyword extraction. For illustration purposes we’re going to look at terms extracted from a forthcoming volume of papers from the 49th Annual Conference on African Linguistics. This book represents a wide variety of linguistic disciplines while adhering to the latest LangSci standards, and since I was involved in typesetting I’m somewhat familiar with its contents and possible good key terms.

Sketch Engine

First, let’s take a look at our baseline, Sketch Engine:

Since our books are written using the typesetting system TeX, one important thing to consider is whether to feed the software the raw TeX code, including all the formatting commands and linguistic examples, or a so-called “detexed” version that ideally only contains the actual text.

The current workflow needs a detexed version, so we use the detex software that comes with TeX Live. We upload the resulting text file to the Sketch Engine web interface and compare it against a reference corpus with a few clicks. The resulting list can be exported to a spreadsheet format.


noun class
matrix clause
head noun
dā gbáŋ
amba rc
object agreement
vowel system
focus particle
focus marker
proleptic object
buy book
countable mass
subject position
direct object
atr harmony
external argument
tex localbibliography
sentence final focus
final focus
grammatical role

As you can see in the list above, the top 20 results are generally good. Some terms need manual editing (e.g. completing countable mass as countable mass noun, count noun and mass noun – see chapter 14), while others should be removed, namely parts of linguistic examples (dā gbáŋ and buy book) and TeX commands (tex localbibliography) that appear even though we detexed the document!

YAKE!

YAKE! (Yet Another Keyword Extractor) is a keyword extraction algorithm developed by Campos et al. in 2018. There is a web interface and a Python package, among others, and using it couldn’t be simpler. Just paste your text on the website or type a few lines of code and out comes your keyterm list.

It theoretically works on both TeX code and detexed text, but using it on TeX code obviously results in a very cluttered keywords list. Therefore I compiled a stop words list that filters out the most common TeX commands.
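To illustrate the idea behind such a stop words list, here is a minimal sketch in plain Python. It is not YAKE and does no scoring at all; it only shows how candidate bigrams that contain a TeX command name could be discarded. The command names in TEX_STOPWORDS and both function names are made up for this example and are not the list or code actually used.

```python
import re

# Hypothetical stop list: a handful of common TeX command names.
# This is an assumption for illustration, not the real list.
TEX_STOPWORDS = {
    "textit", "textbf", "begin", "end", "label", "ref",
    "citet", "item", "section",
}

# One token = a run of letters, optionally joined by hyphens.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def tokenize(text):
    """Lowercase word tokens; TeX command names turn into ordinary tokens."""
    return [token.lower() for token in TOKEN_RE.findall(text)]


def bigrams_without_stopwords(text, stopwords=TEX_STOPWORDS):
    """Return adjacent token pairs in which neither token is a stop word."""
    tokens = tokenize(text)
    return [
        f"{first} {second}"
        for first, second in zip(tokens, tokens[1:])
        if first not in stopwords and second not in stopwords
    ]


sample = r"\section{Vowel harmony} The \textit{focus marker} follows the verb."
print(bigrams_without_stopwords(sample))
```

Bigrams such as "section vowel" or "textit focus" are dropped, while "vowel harmony" and "focus marker" survive. A real stop list would also contain ordinary function words, which this sketch leaves out.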

Below you can see the top 20 keyterms extracted from the raw TeX code (not detexed), using the default settings and my custom stop words list.

YAKE is biased towards capitalized words, so there are quite a few language names among our keyterms (the Language Index is generated separately and not of interest here), as well as personal names (Asouk and Bisilki).

YAKE did a better job distinguishing countable vs. mass nouns than Sketch Engine, but the word/lemma noun is in there just too often. So there is definitely room for improvement.

However, note that this approach works exceptionally well on raw TeX code, so no need for that extra step here. Also, it operates directly on the given input text and doesn’t need a reference corpus, which makes it a good solution for any kind of document.


language
focus
languages
Focus marking
subject
mass nouns
countable mass nouns
English
Bantu languages
focus marker
sentence final focus
Swahili
Asouk
noun class
True mass nouns
Bisilki
Pre-negative subject position
noun
IPA
nouns

TF-IDF

This cryptic acronym stands for term frequency–inverse document frequency. I won’t go into the mathematical details here (see Wikipedia or a myriad of articles detailing the Python implementation for more info), but basically TF-IDF counts how many times each term occurs in a document as compared to other terms in the same document (term frequency) and weighs that number by how many other documents in the corpus also contain that term (inverse document frequency). Thus, a term that appears many times within our book but is rarely found in the reference corpus would receive a high TF-IDF score.
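For readers who prefer code to formulas, the calculation can be sketched in a few lines of plain Python. This is a toy version working on bigrams, not the scikit-learn implementation; the function names are invented for this example, and the exact IDF formula (natural logarithm plus one, with the document itself counted as part of the corpus) is just one common variant chosen as an assumption.

```python
import math
from collections import Counter


def bigrams(text):
    """Split on whitespace and return all adjacent word pairs."""
    tokens = text.lower().split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def tf_idf(document, reference_docs):
    """Score every bigram of `document` against `reference_docs`."""
    doc_terms = bigrams(document)
    counts = Counter(doc_terms)
    total = len(doc_terms)
    reference_sets = [set(bigrams(doc)) for doc in reference_docs]
    n_docs = len(reference_docs) + 1  # reference documents plus our own

    scores = {}
    for term, count in counts.items():
        tf = count / total  # share of this term among all bigrams
        # Number of documents containing the term (our own always does).
        df = 1 + sum(term in terms for terms in reference_sets)
        idf = math.log(n_docs / df) + 1  # assumed variant, never zero
        scores[term] = tf * idf
    return scores


book = "vowel harmony vowel harmony the verb"
reference = ["the verb is here", "the verb again"]
scores = tf_idf(book, reference)
for term in sorted(scores, key=scores.get, reverse=True):
    print(f"{scores[term]:.3f}  {term}")
```

In this toy corpus "vowel harmony" ends up at the top because it is frequent in the book and absent from the reference documents, while "the verb" occurs everywhere and is scored lowest.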

Here’s the result (using scikit-learn’s TfidfVectorizer with parameters adjusted to our needs, since the default values are useless here):

The results at the very top are somewhat similar to YAKE; below that, some new keyterms like vowel system and object agreement are introduced. Interestingly, YAKE seems to focus on syntactic terms while TF-IDF favors phonological terms in this particular example.

The terms including non-alphabetical characters are filtered out in postprocessing, so no need to worry about those. The “geographical” terms come from leftover TeX commands, but we decided against including them in the stop words list because they might be of interest in a different context.


buy book
focus marking
mass nouns
Proto Bantu
3cm 3cm
noun class
controls south
matrix clause
indigenous languages
high vowels
3sg buy
south west
count nouns
vowel system
vowel systems
head noun
mid high
sentence final
mid low
object agreement

Simple maths

The last method I’m going to introduce is the benign-sounding simple maths method. Developed by the late Adam Kilgarriff in 2009, it is the basis for Sketch Engine’s keyword extraction model. It uses simple frequency ratios, i.e. how common a term is in one document versus another, but demands a computationally intensive normalization step that involves keeping track of all appearances of a term within a document.
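The core ratio can be sketched as follows. This follows the formula as commonly described for the simple maths method (frequencies normalized per million words, a smoothing constant added to both sides, then divided); the function name, the parameter names and the default smoothing value of 1 are assumptions for this example, and the sketch leaves out everything needed to count terms in a large corpus.

```python
def simple_maths_score(focus_count, focus_size, ref_count, ref_size, n=1.0):
    """Keyness of a term: smoothed ratio of per-million frequencies.

    focus_count / focus_size: occurrences and total words in our document
    ref_count / ref_size:     occurrences and total words in the reference corpus
    n:                        smoothing constant (assumed default of 1)
    """
    focus_per_million = focus_count / focus_size * 1_000_000
    ref_per_million = ref_count / ref_size * 1_000_000
    return (focus_per_million + n) / (ref_per_million + n)


# Invented numbers: a term found 50 times in a 200,000-word book
# and 10 times in a 100-million-word reference corpus.
print(simple_maths_score(50, 200_000, 10, 100_000_000))
```

A larger smoothing constant shifts the ranking towards more frequent terms, a smaller one towards rare terms; adding it also avoids a division by zero for terms that never occur in the reference corpus.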

Since first experiments with this approach did not yield more promising results than the models described above, I decided to leave it to the Sketch Engine servers and focus on the more efficient YAKE and TF-IDF.

Corpus

Before I go into details about the software I developed, I’d like to introduce the corpora that were generated to train the TF-IDF model. We decided to exclude some publications like dictionaries for which the keyword extraction method is generally not suited, as well as some non-English books. This selection is not final and may well be improved in the future.

The result is a compressed file ready for use with the software, as well as a detexed version of it, and can be found here.

The original detex software was quickly replaced with a custom version that is capable of mostly filtering out linguistic examples and tables. Though not perfect (it tends to “swallow” some paragraphs), it proved good enough for our purposes. Those interested can find the source code here.
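To give a rough idea of what such filtering involves, here is a heavily simplified sketch in Python. It is not the custom detex tool mentioned above. It assumes that tables sit in table or tabular environments and that numbered examples are delimited by \ea and \z; both are assumptions made for this example, as is the function name.

```python
import re


def rough_detex(tex):
    """Very rough plain-text extraction from TeX source (illustration only)."""
    # Remove comments: an unescaped % up to the end of the line.
    text = re.sub(r"(?<!\\)%.*", "", tex)
    # Remove whole table/tabular environments, including their content.
    text = re.sub(
        r"\\begin\{(table|tabular)\*?\}.*?\\end\{\1\*?\}",
        " ",
        text,
        flags=re.DOTALL,
    )
    # Remove linguistic examples, assumed to run from \ea to \z.
    text = re.sub(r"\\ea\b.*?\\z\b", " ", text, flags=re.DOTALL)
    # Remove command names and optional arguments, keep text in braces.
    text = re.sub(r"\\[A-Za-z]+\*?(\[[^\]]*\])?", " ", text)
    text = re.sub(r"[{}]", " ", text)
    # Collapse the leftover whitespace.
    return re.sub(r"\s+", " ", text).strip()


sample = r"""Focus marking is common. % a comment
\begin{table}
\begin{tabular}{ll} a & b \end{tabular}
\end{table}
\ea Kofi buy book \z
The \textit{focus marker} follows."""
print(rough_detex(sample))
```

Because the sketch keeps whatever stands in braces, arguments of commands such as labels or citation keys would still leak into the output. That is the kind of noise a more careful tool has to deal with.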

The langscikw software

With the appropriate models selected, my task now was to piece them together to form usable software. I came up with a three-step model:

  1. Since YAKE provided the best results in and of itself, it should be the first step. This way, it is possible to use the software without a training corpus.
  2. In a second step, the results of a TF-IDF model are added. We now have twice the number of keywords.
  3. Finally, a different TF-IDF model trained on the detexed corpus extracts even more terms. These are compared against a corpus of 10,000 keywords that appeared in published Language Science Press books and are thus assumed to be reliable, since each keyword list is edited by the author or volume editor. Only those terms that have been selected as key terms before are kept.

The software then deduplicates the list and performs some cleaning steps like removing non-alphabetical results. This works so well that despite extracting many more terms than we need, the returned list is approximately the desired length. We usually extract a list of about 300 keywords, which is manually reduced to about 100–200 terms. This is especially important here because the software deliberately only extracts bigram keyterms – in our experience single keywords are best determined by human intelligence.
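The three steps and the cleaning can be pictured roughly like this. The sketch below is not the langscikw code: the function names and the inputs are hypothetical, the three extractors are replaced by ready-made lists, and the rule for what counts as "alphabetical" is an assumption.

```python
import re


def is_alphabetical(term):
    """True if the term consists only of letters joined by spaces or hyphens."""
    return re.fullmatch(r"[^\W\d_]+(?:[ -][^\W\d_]+)*", term) is not None


def combine_keyterms(yake_terms, tfidf_terms, detexed_tfidf_terms, known_keywords):
    """Merge the output of the three steps into one cleaned, sorted list."""
    known = {keyword.lower() for keyword in known_keywords}
    # Step 3: keep only terms that were used as keywords before.
    vetted = [term for term in detexed_tfidf_terms if term.lower() in known]
    # Steps 1 and 2 are taken as they are; a set removes exact duplicates.
    merged = set(yake_terms) | set(tfidf_terms) | set(vetted)
    # Cleaning: drop anything containing digits or other symbols.
    return sorted(term for term in merged if is_alphabetical(term))


terms = combine_keyterms(
    yake_terms=["focus marker", "Bantu languages"],
    tfidf_terms=["focus marker", "3cm 3cm", "noun class"],
    detexed_tfidf_terms=["vowel harmony", "made child"],
    known_keywords=["vowel harmony"],
)
print(terms)
```

In this toy run, "3cm 3cm" is removed by the cleaning step, "made child" is dropped because it never appeared in an existing keyword list, and the duplicate "focus marker" is kept only once.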

Now we have an open-source Python package as well as an easy-to-use command-line interface that operate locally, don’t require registration or a license, can be used on raw TeX code (in fact, no preprocessing is necessary at all), and extract keyterms from a 200,000-word document in about two minutes.

Evaluation

langscikw outputs an alphabetically sorted list for technical reasons, so we can’t show any top results. Instead, I told both programs to extract 300 keywords and manually edited the lists like we do when constructing the Subject Index.

I ended up with 92 keywords for Sketch Engine (31%) vs. 107 keywords for langscikw (34% – I received 313 terms). Both lists had 46 terms in common (52% and 45%, respectively).
You can view the spreadsheet with the original and cleaned lists at the end of this post.

While Sketch Engine introduces more noise in the form of linguistic examples (up to 25%), it actually requires less editing because these are easy for humans to spot and remove. langscikw oftentimes requires the user to supply antonyms (e.g. nonfinite clauses when finite clauses is given), whereas Sketch Engine finds these quite reliably. Also, Sketch Engine has a habit of including unigram, bigram and trigram versions of the same term, which I avoided by only working with bigrams.

Of course, my models could be tweaked to achieve even better results. For example, some sort of part-of-speech tagging could ensure that only noun phrases are selected as keyterms, as there is still quite a bit of noise and nonsensical results.
In the future it would be nice to have a formal evaluation of the model performance and perhaps more fine-tuning of the parameters. An improvement of the detex software or internal pre- and postprocessing could also deliver better results.

All in all, I believe langscikw is a worthwhile alternative to Sketch Engine for our purposes.

You can find the published package on GitHub and PyPI.

Appendix: Key terms extracted via the different approaches

| # | Sketch Engine | Sketch Engine (cleaned) | langscikw | langscikw (cleaned) |
| 1 | accessibility hierarchy | accessibility hierarchy | Ada P-market | accessibility hierarchy |
| 2 | affecting mass communication | affirmative sentence | African languages | adjective |
| 3 | affirmative sentence | Afranaph Project | Agree relation | affirmative sentence |
| 4 | afranaph id | agreement | Akan SVCS | affix |
| 5 | agreement marker | ancestral cult | Akan tense | affix hopping |
| 6 | agreement morphology | applicative suffix | Akie documentation | agreement |
| 7 | amba rc | argument licensing | Akie language | anti-locality |
| 8 | amba relative | associative marker | Akie people | applicative suffix |
| 9 | ancestral cult | atr harmony | Akie traditions | argument position |
| 10 | applicative suffix | bad news delivery | American English | aspect |
| 11 | argument licensing | binary-neg structure | Asibi bought | associative marker |
| 12 | assoc-1pro aug | canonical verb | Asouk friends | back vowels |
| 13 | associative marker | clausal spine | Asouk remember | base generation |
| 14 | associative morphology | clause-initial position | Asouk thought | bound variable |
| 15 | associative-marked subject | cleft strategy | AspP Asp | canonical subject |
| 16 | atr harmony | complementary distribution | Bantu Lexical | clausal spine |
| 17 | attitude holder | complementizer system | Bantu language | code-switching |
| 18 | audio recorder | contrastive focus | Bongo-Bagirmi languages | complementizer system |
| 19 | baabi ara | count noun | CP domain | count noun |
| 20 | bad news | countability distinction | Chinese Japanese | countable mass noun |
| 21 | bad news delivery | countable mass noun | DP CP | CP |
| 22 | bad news management | direct object | DP edge | cross-linguistic |
| 23 | bare count | downward merger | Dhaisu speakers | demonstrative |
| 24 | bare countable mass | DP | Dhaisu vowel | direct object |
| 25 | bare form | endangered language | ECL speakers | discourse entities |
| 26 | base-generated prolepsis | escape hatch | EPP position | DP |
| 27 | binary-neg structure | existential quantifier | Endangered Languages | embedded subject |
| 28 | book prt | ex-situ focus | English translation | endangered language |
| 29 | book tomorrow | external argument | English type | EPP position |
| 30 | breaking bad news | final focus | Focus marking | escape hatch |
| 31 | buy book | finite clause | Fragment answer | existential quantifier |
| 32 | c1-c1-own sm | finite-nonfinite distinction | Gbaya languages | ex-situ focus |
| 33 | c1-c1-person wh-c1-om | focal constituent | Helsinki Corpus | external argument |
| 34 | c1-like-fv c1-c1-own | focus construction | In argue | feature percolation |
| 35 | c1-pst-pay-fv c1-c1-own | focus marker | In situ | final focus |
| 36 | c1-refl-say-fv c1-that | focus marking | In-situ focus | finite clauses |
| 37 | c1-say-fv c1-that | focus marking asymmetry | It called | focus marker |
| 38 | c1-that c1-c1-own | focus-sensitive particle | It important | focus marking |
| 39 | c1-that c1-c1-own sm | folktale narration | John house | focus marking asymmetry |
| 40 | canonical clause | fragment answer | Kenya English | focus position |
| 41 | canonical verb | future marker | Kenya Kiswahili | fragment answer |
| 42 | car foc | future tense | Kenya language | front vowels |
| 43 | ɔítɛ kɛɛn | grammatical role | Kinande Linker | future tense |
| 44 | class pronoun | head movement | Kofi buy | grammatical role |
| 45 | clausal spine | head noun | Kofi re-to | head movement |
| 46 | clause-initial position | in-situ focus | Kwa languages | head noun |
| 47 | cleft strategy | left periphery | Language choice | high vowels |
| 48 | communicative difficulty | licensing | Leipzig Glossing | H-tone |
| 49 | comp envy | lingua franca | Linguistica analysis | human rights |
| 50 | comp envy taro | long-distance licensing | Linker head | indefinite noun |
| 51 | comp food | mass communication | Long distance | indigenous languages |
| 52 | complementary distribution | mass noun | Lubukusu object | initial position |
| 53 | complementizer system | matrix argument | Mabia focus | in-situ focus |
| 54 | containing phase | matrix clause | Marker position | intermediate vowels |
| 55 | contrastive focus | matrix verb | Multiple Agree | island effects |
| 56 | controlled element | morphology | Ngbugu vowel | language vitality |
| 57 | cook thing-thing | negation | Niger Congo | lingua franca |
| 58 | cook thing-thing neg2 | negative liberty | Noun denotations | Linguistica |
| 59 | count noun | non-divisive noun | Parallel Agree | Linker |
| 60 | countability distinction | nonfinite complement | Pre-negative subject | locality constraints |
| 61 | countable mass | noun | Proto Bantu | long-distance movement |
| 62 | countable mass noun | noun phrase | Question marker | marking devices |
| 63 | dà gbáŋ | NP | Stem counts | mass noun |
| 64 | dà gbǎŋ | object marker | Sudanic languages | matrix subject |
| 65 | dā gbáŋ | partial reduplication | Swahili morphology | mid-low vowels |
| 66 | ɗe ɲɛ | past tense | Swahili tensed | mood |
| 67 | delivering bad news | phase edge | TAM markers | moribund languages |
| 68 | dīg jāab-jāab | positive liberty | Tanzania Kiswahili | morpheme breaks |
| 69 | direct object | prefix | Tanzania Language | morphology |
| 70 | diseased individual | present tense | Tense Marker | Multiple Agree |
| 71 | documentary corpus | prolepsis | Tense morphology | negation |
| 72 | documentation agenda | proleptic object | Thagicu languages | negative liberty |
| 73 | documentation project | pronominal subject | This illustrated | nonfinite clauses |
| 74 | documentation team | proto phoneme | Ubangian languages | NPIs |
| 75 | downward merger | question particle | Unlocking theory | null pronoun |
| 76 | dp edge | reflexive marker | VP DP | null subject |
| 77 | effect of grammatical role | relative clause | VP sout | object marker |
| 78 | endangered language | relative marker | VSO word | oral genres |
| 79 | envy taro | restrictive clause | Volkswagen Foundation | past tense |
| 80 | escape hatch | sentence-final focus | We assume | phase edge |
| 81 | exact nature | subject marker | We leave | plural marker |
| 82 | example ex | suffix | West African | positive liberty |
| 83 | existential quantifier | syntactic licensing | Zulu nouns | pragmatic focus |
| 84 | external argument | tense | Zulu passives | prefix |
| 85 | extra matrix | unary-neg structure | a-li Bill | present tense |
| 86 | ɛnnɛ hwɛ | Unlocking theory | a-li George | prolepsis |
| 87 | ɛno ara | velar stop | accessibility hierarchy | proleptic object |
| 88 | ɛnyɛ yie | voiced pause | adjectives demonstratives | quantized noun |
| 89 | ɛte sɛn | vowel change | affirmative sentence | question marker |
| 90 | ɛyɛ deɛn | vowel merger | affix hopping | question particle |
| 91 | ɛyɛ duabɔ | vowel system | agreement morphology | reduplicated noun |
| 92 | ɛyɛ nkrɔfoɔ | word order | anti locality | relative clause |
| 93 | final focus | | applicative suffix | relative marker |
| 94 | final focus particle | | argument position | spoken language |
| 95 | final vowel | | aspect features | strong pronouns |
| 96 | finite clause | | aspects linguistic | subject marker |
| 97 | finite-nonfinite distinction | | associative marker | TAM markers |
| 98 | focal constituent | | attitude holder | tense |
| 99 | focus assignment | | audio video | Unlocking theory |
| 100 | focus construction | | back vowels | velar stop |
| 101 | focus marker | | bare countable | vowel change |
| 102 | focus marker ka | | bare form | vowel merger |
| 103 | focus marking | | base generation | vowel system |
| 104 | focus marking system | | basic properties | VP |
| 105 | focus particle | | bend left | weak pronouns |
| 106 | focus system | | book tomorrow | wh-movement |
| 107 | focus-sensitive particle | | bought book | word order |
| 108 | folktale narration | | bound variable | |
| 109 | folktale narration session | | buy book | |
| 110 | following utterance | | called associative | |
| 111 | form complementizer | | canonical subject | |
| 112 | fragment answer | | center north | |
| 113 | fut buy | | central vowel | |
| 114 | fut buy book | | cl cl | |
| 115 | future marker | | class language | |
| 116 | goat bit | | class prefixes | |
| 117 | government business | | clausal spine | |
| 118 | grammatical role | | clause matrix | |
| 119 | harmony system | | closest relative | |
| 120 | head movement | | code switching | |
| 121 | head noun | | common ground | |
| 122 | ho hyehye | | community members | |
| 123 | ho hyehye yɛn | | complementizer system | |
| 124 | hyehye yɛn | | constituent focus | |
| 125 | indefinite noun class | | control subject | |
| 126 | information status | | count noun | |
| 127 | information structure | | countable mass | |
| 128 | in-situ strategy | | cross linguistic | |
| 129 | intergenerational transmission | | de se | |
| 130 | internal argument | | denoting nouns | |
| 131 | izingane amavuvuzela | | direct object | |
| 132 | khúú wt | | discourse entities | |
| 133 | khúú wt n | | diseased individuals | |
| 134 | lán fú-nī | | documentation project | |
| 135 | language documentation | | dominant languages | |
| 136 | language endangerment | | edge embedded | |
| 137 | language use | | education system | |
| 138 | left periphery | | embedded subject | |
| 139 | lgd bhúŋwànì | | empty nodes | |
| 140 | lingua franca | | endangered language | |
| 141 | linguistic meaning | | escape hatch | |
| 142 | ln nzì | | ethnic groups | |
| 143 | ln nzì khúú | | evaluated object | |
| 144 | ln nzì khúú wt | | ex-situ strategy | |
| 145 | ln nzì khúú wt n | | existential quantifier | |
| 146 | long-distance licensing | | external argument | |
| 147 | low tone | | familiar language | |
| 148 | m ln | | feature percolation | |
| 149 | m ln nzì | | final focus | |
| 150 | m ln nzì khúú | | final particle | |
| 151 | m ln nzì khúú wt | | finite Swahili | |
| 152 | managing bad news | | finite clauses | |
| 153 | marker ka | | floating tone | |
| 154 | marker prefix | | focus function | |
| 155 | marking asymmetry | | focus languages | |
| 156 | marking focus | | focus marker | |
| 157 | mass noun | | focus position | |
| 158 | matrix argument | | focus system | |
| 159 | matrix clause | | focused element | |
| 160 | matrix nominal | | food cooked | |
| 161 | matrix verb | | foreign language | |
| 162 | menhia obiara | | front vowels | |
| 163 | menhia obiara mmoa | | future tense | |
| 164 | mewɔ m | | grammatical role | |
| 165 | mixed behavior | | head movement | |
| 166 | morpheme ka | | head nouns | |
| 167 | morpheme kà | | high mid | |
| 168 | morphological realization | | high vowels | |
| 169 | name foc | | human rights | |
| 170 | narration session | | indefinite noun | |
| 171 | national language | | indigenous languages | |
| 172 | native speaker | | information focus | |
| 173 | nc-i nc-i | | initial position | |
| 174 | nc-i nc-i sfp | | intermediate vowels | |
| 175 | nc-i sfp | | involving English | |
| 176 | ndugu yake | | island effects | |
| 177 | neg think | | ki ki | |
| 178 | neg1 cook | | ko ko | |
| 179 | negative liberty | | language morphology | |
| 180 | news bearer | | language spoken | |
| 181 | news delivery | | language vitality | |
| 182 | next section | | left edge | |
| 183 | ni oo | | lingua franca | |
| 184 | non-divisive noun | | linguistic resources | |
| 185 | nonfinite complement | | local language | |
| 186 | non-restrictive clause | | local subjects | |
| 187 | non-restrictive rc | | locality constraints | |
| 188 | non-subject constituent | | long ago | |
| 189 | non-subject focus | | low tone | |
| 190 | non-subject split | | made child | |
| 191 | noun class | | managing bad | |
| 192 | noun class marker | | mark focus | |
| 193 | noun class pronoun | | marker prefix | |
| 194 | noun denotation | | marking asymmetry | |
| 195 | noun phrase | | marking devices | |
| 196 | numeral modification | | mass nouns | |
| 197 | nur ā | | matrix embedded | |
| 198 | nur nur | | matrix subject | |
| 199 | nya wāā | | mid high | |
| 200 | nyinaa yɛn | | mid low | |
| 201 | nyinaa yɛn ho | | mid-low vowels | |
| 202 | nyinaa yɛn ho hyehye | | modified numeral | |
| 203 | nyinaa yɛn ho hyehye yɛn | | moribund languages | |
| 204 | nzì khúú | | morpheme breaks | |
| 205 | nzì khúú wt | | morphological analysis | |
| 206 | nzì khúú wt n | | movement based | |
| 207 | obiara mmoa | | multiple verbs | |
| 208 | object agreement | | na ta | |
| 209 | object marker | | narrow focus | |
| 210 | object np | | national language | |
| 211 | oral vowel | | native speaker | |
| 212 | overt subject | | negative positive | |
| 213 | p evening | | neighboring language | |
| 214 | partial reduplication | | ni li | |
| 215 | particular constituent | | ni na | |
| 216 | person person | | ni tu | |
| 217 | person person sfp | | nice empty | |
| 218 | person sfp | | node midway | |
| 219 | pfv cl-food | | node style | |
| 220 | phase edge | | nominal NPIs | |
| 221 | phonemic status | | non-restrictive RCs | |
| 222 | plural suffix | | northern Ghana | |
| 223 | plural-marked count | | noun classes | |
| 224 | positive liberty | | nouns countable | |
| 225 | pre-negative subject position | | nouns modified | |
| 226 | previous section | | null pronoun | |
| 227 | proleptic object | | null subject | |
| 228 | pronominal subject | | number affixes | |
| 229 | propositional attitude | | number words | |
| 230 | proto phoneme | | object RCs | |
| 231 | question particle | | object matrix | |
| 232 | rc length | | observed subject | |
| 233 | rc variation | | occurs clause | |
| 234 | recursive possession | | official language | |
| 235 | reflexive marker | | oral genres | |
| 236 | relative clause | | order language | |
| 237 | relative marker | | outer aspect | |
| 238 | relative type | | overt pronominal | |
| 239 | restrictive clause | | overt subjects | |
| 240 | restrictive rc | | part collection | |
| 241 | sentence final focus | | past tense | |
| 242 | sentence final focus particle | | person person | |
| 243 | sentence-final particle | | phase edge | |
| 244 | sentential negation | | phonemic status | |
| 245 | sɛ ɛyɛ deɛn | | place focus | |
| 246 | sɛ mete | | plural marker | |
| 247 | sg-marked countable mass | | plural noun | |
| 248 | shift transition | | position subject | |
| 249 | singulative suffix | | positive liberty | |
| 250 | speaker population | | post verbal | |
| 251 | subject agreement | | potential stem | |
| 252 | subject extraction | | pragmatic focus | |
| 253 | subject focus | | present analysis | |
| 254 | subject marker | | primary language | |
| 255 | subject marker prefix | | proleptic object | |
| 256 | subject position | | pronominal agreement | |
| 257 | such extraction | | property holds | |
| 258 | swahili amba | | propositional attitude | |
| 259 | swahili rc | | quantized noun | |
| 260 | swahili rc variation | | question particle | |
| 261 | syntactic displacement | | recording equipment | |
| 262 | syntactic distribution | | reduplicated noun | |
| 263 | syntactic licensing | | related languages | |
| 264 | syntactic position | | relative marker | |
| 265 | system vowel | | requires Agree | |
| 266 | tense marker | | restrictive restrictive | |
| 267 | tex localbibliography | | role head | |
| 268 | thing-thing neg2 | | rural areas | |
| 269 | three-way countability | | scale node | |
| 270 | three-way countability distinction | | sentence final | |
| 271 | tomorrow buy | | sentential negation | |
| 272 | tomorrow buy book | | shares properties | |
| 273 | tomorrow fut | | single head | |
| 274 | tomorrow fut buy | | small Swahili | |
| 275 | tomorrow fut buy book | | south bend | |
| 276 | traditional court | | south south | |
| 277 | true mass | | speaker population | |
| 278 | unary-neg structure | | speech communities | |
| 279 | unlocking theory | | status head | |
| 280 | upcoming bad news | | strong pronouns | |
| 281 | velar stop | | style center | |
| 282 | voiced pause | | subject marker | |
| 283 | vowel change | | subject subject | |
| 284 | vowel harmony | | suffix shown | |
| 285 | vowel merger | | syntactic position | |
| 286 | vowel system | | system vowel | |
| 287 | vowel system vowel | | target group | |
| 288 | wà dà | | ten languages | |
| 289 | wāā-ī ā | | tense aspect | |
| 290 | wɔn ayɛ | | tense morpheme | |
| 291 | word order | | tensed restrictive | |
| 292 | wt n | | term focus | |
| 293 | yɛn ho | | text part | |
| 294 | yɛn ho hyehye | | top signature | |
| 295 | yɛn ho hyehye yɛn | | top ten | |
| 296 | yɛn nyinaa | | topic topic | |
| 297 | yɛn nyinaa yɛn | | transcribed annotated | |
| 298 | yɛn nyinaa yɛn ho | | true mass | |
| 299 | yɛn nyinaa yɛn ho hyehye | | turn Asouk | |
| 300 | yór ń-dàn | | urban areas | |
| 301 | | | velar stop | |
| 302 | | | verb constructions | |
| 303 | | | verb stems | |
| 304 | | | video recordings | |
| 305 | | | vowel change | |
| 306 | | | vowel merger | |
| 307 | | | vowel systems | |
| 308 | | | wa na | |
| 309 | | | west south | |
| 310 | | | wh movement | |
| 311 | | | wh situ | |
| 312 | | | words corpus | |
| 313 | | | younger generation | |
