Keyword (or keyterm) extraction is the task of automatically identifying the terms that best describe the contents of a document. It is the most important step in generating the Subject Indexes that you can find in the back of many of our publications.
Up until now we’ve been using Sketch Engine to accomplish this. Sketch Engine is proprietary software that offers many pre-compiled corpora to compare your own texts against. For example, we would upload the raw text of a new book and compare it against the British National Corpus, an extensive collection of various English language texts from the late 20th century. Since linguistic publications usually have a vocabulary that is quite different from everyday language, Sketch Engine did a good job of picking out the terms that were distinctive of our book, resulting in a list of keyterms such as language acquisition or vowel harmony.
So why fix something that isn’t broken? Because we are committed to using free and open software wherever possible, and also to making things easier wherever possible. We wanted our own, local, open source keyword extraction software. As an intern with Language Science Press, I was tasked with developing that software. It should work just as well while being easier to use and possibly more tailored to the linguistic domain.
Keyword extraction models
The following paragraphs will demonstrate different approaches to keyword extraction. For illustration purposes we’re going to look at terms extracted from a forthcoming volume of papers from the 49th Annual Conference on African Linguistics. This book represents a wide variety of linguistic disciplines while adhering to the latest LangSci standards, and since I was involved in typesetting I’m somewhat familiar with its contents and possible good key terms.
Sketch Engine
First, let’s take a look at our baseline, Sketch Engine:
Since our books are written using the typesetting system TeX, one important thing to consider is whether to feed the software the raw TeX code, including all the formatting commands and linguistic examples, or a so-called “detexed” version that ideally only contains the actual text.
The current workflow needs a detexed version, so we use the detex software that comes with TeX Live. We upload the resulting text file to the Sketch Engine web interface and compare it against a reference corpus with a few clicks. The resulting list can be exported to a spreadsheet format.
noun class
matrix clause
head noun
dā gbáŋ
amba rc
object agreement
vowel system
focus particle
focus marker
proleptic object
buy book
countable mass
subject position
direct object
atr harmony
external argument
tex localbibliography
sentence final focus
final focus
grammatical role
As the list above shows, the top 20 results are generally good. Some terms need manual editing (e.g. completing countable mass to countable mass noun, count noun and mass noun; see chapter 14), while others should be removed, namely parts of linguistic examples (dā gbáŋ and buy book) and TeX commands (tex localbibliography), which appear even though we detexed the document!
YAKE!
YAKE! (Yet Another Keyword Extractor) is a keyword extraction algorithm developed by Campos et al. in 2018. There is a web interface and a Python package, among others, and using it couldn’t be simpler. Just paste your text on the website or type a few lines of code and out comes your keyterm list.
It theoretically works on both TeX code and detexed text, but using it on TeX code obviously results in a very cluttered keywords list. Therefore I compiled a stop words list that filters out the most common TeX commands.
The list below shows the top 20 keyterms extracted from the raw TeX code (not detexed), using the default settings and my custom stop words list.
YAKE is biased towards capitalized words, so there are quite a few language names among our keyterms (the Language Index is generated separately and not of interest here), as well as personal names (Asouk and Bisilki).
YAKE did a better job distinguishing countable vs. mass nouns than Sketch Engine, but the word/lemma noun is in there just too often. So there is definitely room for improvement.
However, note that this approach works exceptionally well on raw TeX code, so no need for that extra step here. Also, it operates directly on the given input text and doesn’t need a reference corpus, which makes it a good solution for any kind of document.
language
focus
languages
Focus marking
subject
mass nouns
countable mass nouns
English
Bantu languages
focus marker
sentence final focus
Swahili
Asouk
noun class
True mass nouns
Bisilki
Pre-negative subject position
noun
IPA
nouns
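The stop word filtering mentioned above can be illustrated with a small, self-contained sketch. It is not YAKE's internal logic and not the stop word list used for the results above: the list `TEX_STOPWORDS` and the helper names `tokenize` and `candidate_bigrams` are hypothetical and only show the general idea of discarding candidate terms that contain a TeX command name.

```python
import re

# Hypothetical, heavily abbreviated stop word list (an assumption for this
# sketch; the list actually used for the results above is not reproduced here).
TEX_STOPWORDS = {"begin", "end", "textit", "textbf", "label", "ref", "citet", "item"}


def tokenize(text):
    """Return runs of letters, which drops backslashes, braces, digits and punctuation."""
    return re.findall(r"[^\W\d_]+", text)


def candidate_bigrams(text, stopwords=TEX_STOPWORDS):
    """Return adjacent word pairs in which neither word is a stop word."""
    tokens = tokenize(text)
    return [
        f"{first} {second}"
        for first, second in zip(tokens, tokens[1:])
        if first.lower() not in stopwords and second.lower() not in stopwords
    ]


sample = r"\textit{Vowel harmony} is discussed in \ref{sec:harmony}."
print(candidate_bigrams(sample))
```

Pairs such as "textit Vowel" are dropped because they contain a command name, while "Vowel harmony" survives. Ordinary function words are not filtered in this sketch, so a real list would also need common English stop words.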
TF-IDF
This cryptic acronym stands for term frequency–inverse document frequency. I won’t go into the mathematical details here (see Wikipedia or a myriad of articles detailing the Python implementation for more info), but basically TF-IDF counts how many times each term occurs in a document as compared to other terms in the same document (term frequency) and weights that number down the more other documents in the corpus also contain that term (inverse document frequency). Thus, a term that appears many times within our book but is rarely found in the reference corpus would receive a high TF-IDF score.
Here’s the result (using scikit-learn’s TfidfVectorizer with parameters adjusted to our needs, since the default values are useless here):
The results at the very top are somewhat similar to YAKE’s; below that, some new keyterms like vowel system and object agreement are introduced. Interestingly, YAKE seems to focus on syntactic terms while TF-IDF favors phonological terms in this particular example.
The terms including non-alphabetical characters are filtered out in postprocessing, so no need to worry about those. The “geographical” terms come from leftover TeX commands, but we decided against including them in the stop words list because they might be of interest in a different context.
buy book
focus marking
mass nouns
Proto Bantu
3cm 3cm
noun class
controls south
matrix clause
indigenous languages
high vowels
3sg buy
south west
count nouns
vowel system
vowel systems
head noun
mid high
sentence final
mid low
object agreement
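To make the weighting described in this section concrete, here is a minimal pure-Python sketch of TF-IDF over bigrams. It uses the textbook formula tf × log(N / df) rather than scikit-learn's TfidfVectorizer, which applies a smoothed variant and normalisation, so absolute scores will differ from the ones behind the list above. The function names and the toy texts are assumptions made for illustration.

```python
import math
from collections import Counter


def bigrams(text):
    """Lower-case the text and return all adjacent word pairs."""
    words = text.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]


def tfidf_scores(target, reference_docs):
    """Score each bigram of `target` against a list of reference documents."""
    corpus = [target] + list(reference_docs)
    doc_terms = [set(bigrams(doc)) for doc in corpus]
    counts = Counter(bigrams(target))
    total = sum(counts.values())
    scores = {}
    for term, count in counts.items():
        tf = count / total  # share of this bigram within the target text
        df = sum(term in terms for terms in doc_terms)  # documents containing it (at least 1)
        idf = math.log(len(corpus) / df)  # 0 if every document contains the term
        scores[term] = tf * idf
    return scores


# Invented toy texts, not taken from any Language Science Press book.
book = "vowel harmony in bantu languages and vowel harmony in kwa languages"
reference = ["there is harmony in the choir", "the weather in london"]

scores = tfidf_scores(book, reference)
for term, score in sorted(scores.items(), key=lambda item: -item[1])[:3]:
    print(f"{term}: {score:.3f}")
```

Both "vowel harmony" and "harmony in" occur twice in the toy book, but only the former is absent from the reference texts, so it receives the higher score.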
Simple maths
The last method I’m going to introduce is the benign-sounding simple maths method. Developed by the late Adam Kilgarriff in 2009, it is the basis for Sketch Engine’s keyword extraction model. It uses simple frequency ratios, i.e. how common a term is in one document versus another, but demands a computationally intensive normalization step that involves keeping track of all appearances of a term within a document.
Since first experiments with this approach did not yield more promising results than the models described above, I decided to leave it to the Sketch Engine servers and focus on the more efficient YAKE and TF-IDF.
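For reference, the frequency ratio at the heart of simple maths can be sketched in a few lines. This follows the formula from Kilgarriff's description (frequencies normalised per million words, a smoothing constant added to both sides, then the ratio). The function name, the default smoothing value and all counts below are assumptions for illustration, and the sketch starts from ready-made counts instead of performing the expensive counting step mentioned above.

```python
def simple_maths_score(focus_count, focus_size, ref_count, ref_size, n=1.0):
    """Ratio of smoothed per-million frequencies in a focus text vs. a reference corpus."""
    focus_fpm = focus_count * 1_000_000 / focus_size
    ref_fpm = ref_count * 1_000_000 / ref_size
    # The constant n avoids division by zero and controls how strongly
    # rare terms are favoured over common ones.
    return (focus_fpm + n) / (ref_fpm + n)


# Invented counts: a specialised term that is frequent in a 200,000-word book
# but rare in a 100-million-word reference corpus ...
specialised = simple_maths_score(40, 200_000, 50, 100_000_000)
# ... and a common word that is equally frequent in both.
common = simple_maths_score(2_000, 200_000, 1_000_000, 100_000_000)

print(specialised, common)
```

A score close to 1 means the term is about as frequent in the book as in the reference corpus; the higher the score, the more characteristic the term is of the book.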
Corpus
Before I go into details about the software I developed, I’d like to introduce the corpora that were generated to train the TF-IDF model. We decided to exclude some publications like dictionaries for which the keyword extraction method is generally not suited, as well as some non-English books. This selection is not final and may well be improved in the future.
The result is a compressed file ready for use with the software, as well as a detexed version of it, and can be found here.
The original detex software was quickly replaced with a custom version that is capable of mostly filtering out linguistic examples and tables. Though not perfect (it tends to “swallow” some paragraphs), it proved good enough for our purposes. Those interested can find the source code here.
The langscikw software
With the appropriate models selected, my task now was to piece them together to form usable software. I came up with a three-step model:
- Since YAKE provided the best results in and of itself, it should be the first step. This way, it is possible to use the software without a training corpus.
- In a second step, the results of a TF-IDF model are added. We now have twice the number of keywords.
- Finally, a different TF-IDF model trained on the detexed corpus extracts even more terms. These are compared against a corpus of 10,000 keywords that appeared in published Language Science Press books and are thus assumed to be reliable, since each keywords list is edited by the author or volume editor. Only those terms that have been selected as key terms before are kept.
The software then deduplicates the list and performs some cleaning steps like removing non-alphabetical results. This works so well that despite extracting many more terms than we need, the returned list is approximately the desired length. We usually extract a list of about 300 keywords, which is manually reduced to about 100–200 terms. This is especially important here because the software deliberately only extracts bigram keyterms – in our experience single keywords are best determined by human intelligence.
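The three steps and the clean-up described above can be sketched roughly as follows. This is not the langscikw source code: the function names, the case-insensitive deduplication and the exact definition of "alphabetical" are assumptions, and the three input lists stand in for the output of the YAKE model and the two TF-IDF models.

```python
import re


def is_alphabetical(term):
    """True if the term consists only of letters joined by single spaces or hyphens."""
    return re.fullmatch(r"[^\W\d_]+(?:[ -][^\W\d_]+)*", term) is not None


def combine_keyterms(yake_terms, tfidf_terms, detexed_tfidf_terms, known_keywords):
    """Merge the three candidate lists, vetting the third against known keywords."""
    known = {keyword.lower() for keyword in known_keywords}
    # Step 3: keep only candidates that were selected as key terms before.
    vetted = [term for term in detexed_tfidf_terms if term.lower() in known]
    merged = {}
    for term in [*yake_terms, *tfidf_terms, *vetted]:
        if is_alphabetical(term):
            # Assumption: deduplicate case-insensitively and keep the first spelling seen.
            merged.setdefault(term.lower(), term)
    return sorted(merged.values(), key=str.lower)


# Invented inputs that reuse a few terms from the lists shown earlier.
result = combine_keyterms(
    yake_terms=["noun class", "Focus marking"],
    tfidf_terms=["focus marking", "3cm 3cm", "vowel system"],
    detexed_tfidf_terms=["head noun", "buy book"],
    known_keywords=["head noun", "vowel harmony"],
)
print(result)
```

In this toy run, "3cm 3cm" is removed by the alphabetical check, "buy book" is dropped because it is not among the known keywords, and the two spellings of "focus marking" collapse into one entry.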
Now we have an open-source Python package as well as an easy-to-use command-line interface that operate locally, don’t require registration or a license, can be used on raw TeX code (in fact, no preprocessing is necessary at all), and extract keyterms from a 200,000-word document in about two minutes.
Evaluation
langscikw outputs an alphabetically sorted list for technical reasons, so we can’t show any top results. Instead, I told both programs to extract 300 keywords and manually edited the lists like we do when constructing the Subject Index.
I ended up with 92 keywords for Sketch Engine (31%) vs. 107 keywords for langscikw (34% – I received 313 terms). Both lists had 46 terms in common (52% and 45%, respectively).
You can view the spreadsheet with the original and cleaned lists at the end of this post.
While Sketch Engine introduces more noise in the form of linguistic examples (up to 25%), it actually requires less editing because these are easy for humans to spot and remove. langscikw oftentimes requires the user to supply antonyms (e.g. nonfinite clauses when finite clauses is given), whereas Sketch Engine finds these quite reliably. Also, Sketch Engine has a habit of including unigram, bigram and trigram versions of the same term, which I avoided by only working with bigrams.
Of course, my models could be tweaked to achieve even better results. For example, there could be some sort of part-of-speech tagging that ensures that only noun phrases are selected as keyterms, as there is quite a bit of noise and there are some nonsensical results.
In the future it would be nice to have a formal evaluation of the model performance and perhaps more fine-tuning of the parameters. An improvement of the detex software or internal pre- and postprocessing could also deliver better results.
All in all, I believe langscikw is a worthwhile alternative to Sketch Engine for our purposes.
You can find the published package on GitHub and PyPI.
Appendix: Key terms extracted via the different approaches
No. | Sketch Engine | Sketch Engine (cleaned) | langscikw | langscikw (cleaned) |
1 | accessibility hierarchy | accessibility hierarchy | Ada P-market | accessibility hierarchy |
2 | affecting mass communication | affirmative sentence | African languages | adjective |
3 | affirmative sentence | Afranaph Project | Agree relation | affirmative sentence |
4 | afranaph id | agreement | Akan SVCS | affix |
5 | agreement marker | ancestral cult | Akan tense | affix hopping |
6 | agreement morphology | applicative suffix | Akie documentation | agreement |
7 | amba rc | argument licensing | Akie language | anti-locality |
8 | amba relative | associative marker | Akie people | applicative suffix |
9 | ancestral cult | atr harmony | Akie traditions | argument position |
10 | applicative suffix | bad news delivery | American English | aspect |
11 | argument licensing | binary-neg structure | Asibi bought | associative marker |
12 | assoc-1pro aug | canonical verb | Asouk friends | back vowels |
13 | associative marker | clausal spine | Asouk remember | base generation |
14 | associative morphology | clause-initial position | Asouk thought | bound variable |
15 | associative-marked subject | cleft strategy | AspP Asp | canonical subject |
16 | atr harmony | complementary distribution | Bantu Lexical | clausal spine |
17 | attitude holder | complementizer system | Bantu language | code-switching |
18 | audio recorder | contrastive focus | Bongo-Bagirmi languages | complementizer system |
19 | baabi ara | count noun | CP domain | count noun |
20 | bad news | countability distinction | Chinese Japanese | countable mass noun |
21 | bad news delivery | countable mass noun | DP CP | CP |
22 | bad news management | direct object | DP edge | cross-linguistic |
23 | bare count | downward merger | Dhaisu speakers | demonstrative |
24 | bare countable mass | DP | Dhaisu vowel | direct object |
25 | bare form | endangered language | ECL speakers | discourse entities |
26 | base-generated prolepsis | escape hatch | EPP position | DP |
27 | binary-neg structure | existential quantifier | Endangered Languages | embedded subject |
28 | book prt | ex-situ focus | English translation | endangered language |
29 | book tomorrow | external argument | English type | EPP position |
30 | breaking bad news | final focus | Focus marking | escape hatch |
31 | buy book | finite clause | Fragment answer | existential quantifier |
32 | c1-c1-own sm | finite-nonfinite distinction | Gbaya languages | ex-situ focus |
33 | c1-c1-person wh-c1-om | focal constituent | Helsinki Corpus | external argument |
34 | c1-like-fv c1-c1-own | focus construction | In argue | feature percolation |
35 | c1-pst-pay-fv c1-c1-own | focus marker | In situ | final focus |
36 | c1-refl-say-fv c1-that | focus marking | In-situ focus | finite clauses |
37 | c1-say-fv c1-that | focus marking asymmetry | It called | focus marker |
38 | c1-that c1-c1-own | focus-sensitive particle | It important | focus marking |
39 | c1-that c1-c1-own sm | folktale narration | John house | focus marking asymmetry |
40 | canonical clause | fragment answer | Kenya English | focus position |
41 | canonical verb | future marker | Kenya Kiswahili | fragment answer |
42 | car foc | future tense | Kenya language | front vowels |
43 | ɔítɛ kɛɛn | grammatical role | Kinande Linker | future tense |
44 | class pronoun | head movement | Kofi buy | grammatical role |
45 | clausal spine | head noun | Kofi re-to | head movement |
46 | clause-initial position | in-situ focus | Kwa languages | head noun |
47 | cleft strategy | left periphery | Language choice | high vowels |
48 | communicative difficulty | licensing | Leipzig Glossing | H-tone |
49 | comp envy | lingua franca | Linguistica analysis | human rights |
50 | comp envy taro | long-distance licensing | Linker head | indefinite noun |
51 | comp food | mass communication | Long distance | indigenous languages |
52 | complementary distribution | mass noun | Lubukusu object | initial position |
53 | complementizer system | matrix argument | Mabia focus | in-situ focus |
54 | containing phase | matrix clause | Marker position | intermediate vowels |
55 | contrastive focus | matrix verb | Multiple Agree | island effects |
56 | controlled element | morphology | Ngbugu vowel | language vitality |
57 | cook thing-thing | negation | Niger Congo | lingua franca |
58 | cook thing-thing neg2 | negative liberty | Noun denotations | Linguistica |
59 | count noun | non-divisive noun | Parallel Agree | Linker |
60 | countability distinction | nonfinite complement | Pre-negative subject | locality constraints |
61 | countable mass | noun | Proto Bantu | long-distance movement |
62 | countable mass noun | noun phrase | Question marker | marking devices |
63 | dà gbáŋ | NP | Stem counts | mass noun |
64 | dà gbǎŋ | object marker | Sudanic languages | matrix subject |
65 | dā gbáŋ | partial reduplication | Swahili morphology | mid-low vowels |
66 | ɗe ɲɛ | past tense | Swahili tensed | mood |
67 | delivering bad news | phase edge | TAM markers | moribund languages |
68 | dīg jāab-jāab | positive liberty | Tanzania Kiswahili | morpheme breaks |
69 | direct object | prefix | Tanzania Language | morphology |
70 | diseased individual | present tense | Tense Marker | Multiple Agree |
71 | documentary corpus | prolepsis | Tense morphology | negation |
72 | documentation agenda | proleptic object | Thagicu languages | negative liberty |
73 | documentation project | pronominal subject | This illustrated | nonfinite clauses |
74 | documentation team | proto phoneme | Ubangian languages | NPIs |
75 | downward merger | question particle | Unlocking theory | null pronoun |
76 | dp edge | reflexive marker | VP DP | null subject |
77 | effect of grammatical role | relative clause | VP sout | object marker |
78 | endangered language | relative marker | VSO word | oral genres |
79 | envy taro | restrictive clause | Volkswagen Foundation | past tense |
80 | escape hatch | sentence-final focus | We assume | phase edge |
81 | exact nature | subject marker | We leave | plural marker |
82 | example ex | suffix | West African | positive liberty |
83 | existential quantifier | syntactic licensing | Zulu nouns | pragmatic focus |
84 | external argument | tense | Zulu passives | prefix |
85 | extra matrix | unary-neg structure | a-li Bill | present tense |
86 | ɛnnɛ hwɛ | Unlocking theory | a-li George | prolepsis |
87 | ɛno ara | velar stop | accessibility hierarchy | proleptic object |
88 | ɛnyɛ yie | voiced pause | adjectives demonstratives | quantized noun |
89 | ɛte sɛn | vowel change | affirmative sentence | question marker |
90 | ɛyɛ deɛn | vowel merger | affix hopping | question particle |
91 | ɛyɛ duabɔ | vowel system | agreement morphology | reduplicated noun |
92 | ɛyɛ nkrɔfoɔ | word order | anti locality | relative clause |
93 | final focus | | applicative suffix | relative marker |
94 | final focus particle | | argument position | spoken language |
95 | final vowel | | aspect features | strong pronouns |
96 | finite clause | | aspects linguistic | subject marker |
97 | finite-nonfinite distinction | | associative marker | TAM markers |
98 | focal constituent | | attitude holder | tense |
99 | focus assignment | | audio video | Unlocking theory |
100 | focus construction | | back vowels | velar stop |
101 | focus marker | | bare countable | vowel change |
102 | focus marker ka | | bare form | vowel merger |
103 | focus marking | | base generation | vowel system |
104 | focus marking system | | basic properties | VP |
105 | focus particle | | bend left | weak pronouns |
106 | focus system | | book tomorrow | wh-movement |
107 | focus-sensitive particle | | bought book | word order |
108 | folktale narration | | bound variable | |
109 | folktale narration session | | buy book | |
110 | following utterance | | called associative | |
111 | form complementizer | | canonical subject | |
112 | fragment answer | | center north | |
113 | fut buy | | central vowel | |
114 | fut buy book | | cl cl | |
115 | future marker | | class language | |
116 | goat bit | | class prefixes | |
117 | government business | | clausal spine | |
118 | grammatical role | | clause matrix | |
119 | harmony system | | closest relative | |
120 | head movement | | code switching | |
121 | head noun | | common ground | |
122 | ho hyehye | | community members | |
123 | ho hyehye yɛn | | complementizer system | |
124 | hyehye yɛn | | constituent focus | |
125 | indefinite noun class | | control subject | |
126 | information status | | count noun | |
127 | information structure | | countable mass | |
128 | in-situ strategy | | cross linguistic | |
129 | intergenerational transmission | | de se | |
130 | internal argument | | denoting nouns | |
131 | izingane amavuvuzela | | direct object | |
132 | khúú wt | | discourse entities | |
133 | khúú wt n | | diseased individuals | |
134 | lán fú-nī | | documentation project | |
135 | language documentation | | dominant languages | |
136 | language endangerment | | edge embedded | |
137 | language use | | education system | |
138 | left periphery | | embedded subject | |
139 | lgd bhúŋwànì | | empty nodes | |
140 | lingua franca | | endangered language | |
141 | linguistic meaning | | escape hatch | |
142 | ln nzì | | ethnic groups | |
143 | ln nzì khúú | | evaluated object | |
144 | ln nzì khúú wt | | ex-situ strategy | |
145 | ln nzì khúú wt n | | existential quantifier | |
146 | long-distance licensing | | external argument | |
147 | low tone | | familiar language | |
148 | m ln | | feature percolation | |
149 | m ln nzì | | final focus | |
150 | m ln nzì khúú | | final particle | |
151 | m ln nzì khúú wt | | finite Swahili | |
152 | managing bad news | | finite clauses | |
153 | marker ka | | floating tone | |
154 | marker prefix | | focus function | |
155 | marking asymmetry | | focus languages | |
156 | marking focus | | focus marker | |
157 | mass noun | | focus position | |
158 | matrix argument | | focus system | |
159 | matrix clause | | focused element | |
160 | matrix nominal | | food cooked | |
161 | matrix verb | | foreign language | |
162 | menhia obiara | | front vowels | |
163 | menhia obiara mmoa | | future tense | |
164 | mewɔ m | | grammatical role | |
165 | mixed behavior | | head movement | |
166 | morpheme ka | | head nouns | |
167 | morpheme kà | | high mid | |
168 | morphological realization | | high vowels | |
169 | name foc | | human rights | |
170 | narration session | | indefinite noun | |
171 | national language | | indigenous languages | |
172 | native speaker | | information focus | |
173 | nc-i nc-i | | initial position | |
174 | nc-i nc-i sfp | | intermediate vowels | |
175 | nc-i sfp | | involving English | |
176 | ndugu yake | | island effects | |
177 | neg think | | ki ki | |
178 | neg1 cook | | ko ko | |
179 | negative liberty | | language morphology | |
180 | news bearer | | language spoken | |
181 | news delivery | | language vitality | |
182 | next section | | left edge | |
183 | ni oo | | lingua franca | |
184 | non-divisive noun | | linguistic resources | |
185 | nonfinite complement | | local language | |
186 | non-restrictive clause | | local subjects | |
187 | non-restrictive rc | | locality constraints | |
188 | non-subject constituent | | long ago | |
189 | non-subject focus | | low tone | |
190 | non-subject split | | made child | |
191 | noun class | | managing bad | |
192 | noun class marker | | mark focus | |
193 | noun class pronoun | | marker prefix | |
194 | noun denotation | | marking asymmetry | |
195 | noun phrase | | marking devices | |
196 | numeral modification | | mass nouns | |
197 | nur ā | | matrix embedded | |
198 | nur nur | | matrix subject | |
199 | nya wāā | | mid high | |
200 | nyinaa yɛn | | mid low | |
201 | nyinaa yɛn ho | | mid-low vowels | |
202 | nyinaa yɛn ho hyehye | | modified numeral | |
203 | nyinaa yɛn ho hyehye yɛn | | moribund languages | |
204 | nzì khúú | | morpheme breaks | |
205 | nzì khúú wt | | morphological analysis | |
206 | nzì khúú wt n | | movement based | |
207 | obiara mmoa | | multiple verbs | |
208 | object agreement | | na ta | |
209 | object marker | | narrow focus | |
210 | object np | | national language | |
211 | oral vowel | | native speaker | |
212 | overt subject | | negative positive | |
213 | p evening | | neighboring language | |
214 | partial reduplication | | ni li | |
215 | particular constituent | | ni na | |
216 | person person | | ni tu | |
217 | person person sfp | | nice empty | |
218 | person sfp | | node midway | |
219 | pfv cl-food | | node style | |
220 | phase edge | | nominal NPIs | |
221 | phonemic status | | non-restrictive RCs | |
222 | plural suffix | | northern Ghana | |
223 | plural-marked count | | noun classes | |
224 | positive liberty | | nouns countable | |
225 | pre-negative subject position | | nouns modified | |
226 | previous section | | null pronoun | |
227 | proleptic object | | null subject | |
228 | pronominal subject | | number affixes | |
229 | propositional attitude | | number words | |
230 | proto phoneme | | object RCs | |
231 | question particle | | object matrix | |
232 | rc length | | observed subject | |
233 | rc variation | | occurs clause | |
234 | recursive possession | | official language | |
235 | reflexive marker | | oral genres | |
236 | relative clause | | order language | |
237 | relative marker | | outer aspect | |
238 | relative type | | overt pronominal | |
239 | restrictive clause | | overt subjects | |
240 | restrictive rc | | part collection | |
241 | sentence final focus | | past tense | |
242 | sentence final focus particle | | person person | |
243 | sentence-final particle | | phase edge | |
244 | sentential negation | | phonemic status | |
245 | sɛ ɛyɛ deɛn | | place focus | |
246 | sɛ mete | | plural marker | |
247 | sg-marked countable mass | | plural noun | |
248 | shift transition | | position subject | |
249 | singulative suffix | | positive liberty | |
250 | speaker population | | post verbal | |
251 | subject agreement | | potential stem | |
252 | subject extraction | | pragmatic focus | |
253 | subject focus | | present analysis | |
254 | subject marker | | primary language | |
255 | subject marker prefix | | proleptic object | |
256 | subject position | | pronominal agreement | |
257 | such extraction | | property holds | |
258 | swahili amba | | propositional attitude | |
259 | swahili rc | | quantized noun | |
260 | swahili rc variation | | question particle | |
261 | syntactic displacement | | recording equipment | |
262 | syntactic distribution | | reduplicated noun | |
263 | syntactic licensing | | related languages | |
264 | syntactic position | | relative marker | |
265 | system vowel | | requires Agree | |
266 | tense marker | | restrictive restrictive | |
267 | tex localbibliography | | role head | |
268 | thing-thing neg2 | | rural areas | |
269 | three-way countability | | scale node | |
270 | three-way countability distinction | | sentence final | |
271 | tomorrow buy | | sentential negation | |
272 | tomorrow buy book | | shares properties | |
273 | tomorrow fut | | single head | |
274 | tomorrow fut buy | | small Swahili | |
275 | tomorrow fut buy book | | south bend | |
276 | traditional court | | south south | |
277 | true mass | | speaker population | |
278 | unary-neg structure | | speech communities | |
279 | unlocking theory | | status head | |
280 | upcoming bad news | | strong pronouns | |
281 | velar stop | | style center | |
282 | voiced pause | | subject marker | |
283 | vowel change | | subject subject | |
284 | vowel harmony | | suffix shown | |
285 | vowel merger | | syntactic position | |
286 | vowel system | | system vowel | |
287 | vowel system vowel | | target group | |
288 | wà dà | | ten languages | |
289 | wāā-ī ā | | tense aspect | |
290 | wɔn ayɛ | | tense morpheme | |
291 | word order | | tensed restrictive | |
292 | wt n | | term focus | |
293 | yɛn ho | | text part | |
294 | yɛn ho hyehye | | top signature | |
295 | yɛn ho hyehye yɛn | | top ten | |
296 | yɛn nyinaa | | topic topic | |
297 | yɛn nyinaa yɛn | | transcribed annotated | |
298 | yɛn nyinaa yɛn ho | | true mass | |
299 | yɛn nyinaa yɛn ho hyehye | | turn Asouk | |
300 | yór ń-dàn | | urban areas | |
301 | | | velar stop | |
302 | | | verb constructions | |
303 | | | verb stems | |
304 | | | video recordings | |
305 | | | vowel change | |
306 | | | vowel merger | |
307 | | | vowel systems | |
308 | | | wa na | |
309 | | | west south | |
310 | | | wh movement | |
311 | | | wh situ | |
312 | | | words corpus | |
313 | | | younger generation | |