Keyword extraction from Language Science Press publications

Keyword (or keyterm) extraction is the task of automatically identifying the terms that best describe the contents of a document. It is the most important step in generating the Subject Indexes that you can find in the back of many of our publications.

Up until now we’ve been using Sketch Engine to accomplish this. Sketch Engine is proprietary software that offers many pre-compiled corpora to compare your own texts against. For example, we would upload the raw text of a new book and compare it against the British National Corpus, an extensive collection of various English language texts from the late 20th century. Since linguistic publications usually have a vocabulary that is quite different from everyday language, Sketch Engine did a good job at filtering out the terms that were unique to our book – resulting in a list of keyterms such as language acquisition or vowel harmony.

So why fix something that isn’t broken? Because we are committed to using free and open software wherever possible, and also to making things easier wherever possible. We wanted our own, local, open source keyword extraction software. As an intern with Language Science Press, I was tasked with developing that software. It should work just as well while being easier to use and possibly more tailored to the linguistic domain.

Keyword extraction models

The following paragraphs will demonstrate different approaches to keyword extraction. For illustration purposes we’re going to look at terms extracted from a forthcoming volume of papers from the 49th Annual Conference on African Linguistics. This book represents a wide variety of linguistic disciplines while adhering to the latest LangSci standards, and since I was involved in typesetting I’m somewhat familiar with its contents and possible good key terms.

Sketch Engine

First, let’s take a look at our baseline, Sketch Engine:

Since our books are written using the typesetting system TeX, one important thing to consider is whether to feed the software the raw TeX code, including all the formatting commands and linguistic examples, or a so-called “detexed” version that ideally only contains the actual text.

The current workflow needs a detexed version, so we use the detex software that comes with TeX Live. We upload the resulting text file to the Sketch Engine web interface and compare it against a reference corpus with a few clicks. The resulting list can be exported to a spreadsheet format.

noun class
matrix clause
head noun
dā gbáŋ
amba rc
object agreement
vowel system
focus particle
focus marker
proleptic object
buy book
countable mass
subject position
direct object
atr harmony
external argument
tex localbibliography
sentence final focus
final focus
grammatical role

As the list above shows, the top 20 results are generally good. Some terms need manual editing (e.g. complete countable mass to countable mass noun, count noun and mass noun – see chapter 14) while others should be removed, namely parts of linguistic examples (dā gbáŋ and buy book) and TeX commands (tex localbibliography) that appear even though we detexed the document!

YAKE!

YAKE! (Yet Another Keyword Extractor) is a keyword extraction algorithm developed by Campos et al. in 2018. There is a web interface and a Python package, among others, and using it couldn’t be simpler. Just paste your text on the website or type a few lines of code and out comes your keyterm list.

It theoretically works on both TeX code and detexed text, but using it on TeX code obviously results in a very cluttered keywords list. Therefore I compiled a stop words list that filters out the most common TeX commands.
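As a rough illustration of the idea, here is a minimal sketch of how such a stop words list could be assembled and handed to YAKE. The helper names, the cutoff of 50 commands and the sample sentence are assumptions made for this example, not the list used for our books. The yake package has to be installed separately, and the (keyword, score) return order is what recent versions of it produce.

```python
import re
from collections import Counter

# A TeX command is a backslash followed by letters, e.g. \textbf or \section.
TEX_COMMAND = re.compile(r"\\([A-Za-z]+)")

def common_tex_commands(tex_source, top_n=50):
    """Return the top_n most frequent TeX command names, lowercased."""
    counts = Counter(name.lower() for name in TEX_COMMAND.findall(tex_source))
    return [name for name, _ in counts.most_common(top_n)]

def extract_keyterms(tex_source, stopwords, top=20):
    """Run YAKE on raw TeX with a custom stop words list (needs `pip install yake`)."""
    import yake  # third-party; imported lazily so the helper above works without it
    extractor = yake.KeywordExtractor(lan="en", n=3, top=top, stopwords=stopwords)
    # extract_keywords yields (keyword, score) pairs; lower scores mean better terms.
    return [keyword for keyword, _ in extractor.extract_keywords(tex_source)]

# Hypothetical sample input, only to show what the helper collects.
sample = r"\section{Focus} The \textbf{focus marker} and the \textbf{noun class} \emph{system}."
tex_stopwords = common_tex_commands(sample)
```

Depending on the YAKE version, a custom list may replace the built-in English stop words instead of extending them, so in practice the two lists probably need to be merged before calling extract_keyterms.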

At the end of this section you can see the top 20 keyterms extracted from the raw TeX code (not detexed), using the default settings and my custom stop words list.

YAKE is biased towards capitalized words, so there are quite a few language names among our keyterms (the Language Index is generated separately and not of interest here), as well as personal names (Asouk and Bisilki).

YAKE did a better job distinguishing countable vs. mass nouns than Sketch Engine, but the word/lemma noun is in there just too often. So there is definitely room for improvement.

However, note that this approach works exceptionally well on raw TeX code, so no need for that extra step here. Also, it operates directly on the given input text and doesn’t need a reference corpus, which makes it a good solution for any kind of document.

language
focus
languages
Focus marking
subject
mass nouns
countable mass nouns
English
Bantu languages
focus marker
sentence final focus
Swahili
Asouk
noun class
True mass nouns
Bisilki
Pre-negative subject position
noun
IPA
nouns

TF-IDF

This cryptic acronym stands for term frequency–inverse document frequency. I won’t go into the mathematical details here (see Wikipedia or a myriad of articles detailing the Python implementation for more info), but basically TF-IDF counts how often each term occurs in a document relative to the other terms in that document (term frequency) and scales that number down the more documents in the corpus also contain the term (inverse document frequency). Thus, a term that appears many times within our book but is rarely found in the reference corpus receives a high TF-IDF score.
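To make the idea concrete, here is a minimal sketch of the textbook TF-IDF formula applied to bigrams, using only the standard library. The toy texts and function names are invented for this example. scikit-learn's TfidfVectorizer uses a smoothed and normalized variant, so its scores will differ from these.

```python
import math
from collections import Counter

def bigrams(text):
    """Split a text into lowercase two-word terms."""
    tokens = text.lower().split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

def tfidf(target, reference_docs):
    """Score each bigram of `target` against a list of reference documents."""
    docs = [bigrams(target)] + [bigrams(doc) for doc in reference_docs]
    counts = Counter(docs[0])
    total = sum(counts.values())
    scores = {}
    for term, count in counts.items():
        tf = count / total                           # share of the target's bigrams
        df = sum(1 for doc in docs if term in doc)   # documents containing the term
        idf = math.log(len(docs) / df)               # rare across documents -> large
        scores[term] = tf * idf
    return scores

# Toy data: one "book" and a two-document "reference corpus".
book = "vowel harmony in the language and vowel harmony across the dialect"
reference = ["the cat sat in the garden", "rain fell in the morning"]
scores = tfidf(book, reference)
best = max(scores, key=scores.get)
```

In this toy setup vowel harmony comes out on top because it is frequent in the book and absent from the reference texts, while in the scores zero because every document contains it.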

Here’s the result (using scikit-learn’s TfidfVectorizer with parameters adjusted to our needs, since the default values are useless here):

The results at the very top are somewhat similar to YAKE’s; below that, some new keyterms like vowel system and object agreement are introduced. Interestingly, YAKE seems to focus on syntactic terms while TF-IDF favors phonological terms in this particular example.

The terms including non-alphabetical characters are filtered out in postprocessing, so no need to worry about those. The “geographical” terms come from leftover TeX commands, but we decided against including them in the stop words list because they might be of interest in a different context.

buy book
focus marking
mass nouns
Proto Bantu
3cm 3cm
noun class
controls south
matrix clause
indigenous languages
high vowels
3sg buy
south west
count nouns
vowel system
vowel systems
head noun
mid high
sentence final
mid low
object agreement

Simple maths

The last method I’m going to introduce is the benign-sounding simple maths method. Developed by the late Adam Kilgarriff in 2009, it is the basis for Sketch Engine’s keyword extraction model. It uses simple frequency ratios, i.e. how common a term is in one document versus another, but demands a computationally intensive normalization step that involves keeping track of all appearances of a term within a document.
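The core ratio from Kilgarriff's paper can be sketched in a few lines: both frequencies are normalized to occurrences per million words, and a smoothing constant is added to each before dividing. The function name and the counts below are hypothetical and only serve to show the arithmetic.

```python
def simple_maths(freq_focus, size_focus, freq_ref, size_ref, n=1.0):
    """Keyness score: smoothed ratio of per-million frequencies.

    freq_* are raw counts of the term, size_* are corpus sizes in words,
    and n is the smoothing constant (larger n favors more common terms).
    """
    fpm_focus = freq_focus * 1_000_000 / size_focus
    fpm_ref = freq_ref * 1_000_000 / size_ref
    return (fpm_focus + n) / (fpm_ref + n)

# Hypothetical counts: a domain term that is frequent in a 200,000-word book
# but rare in a 100-million-word reference corpus ...
domain_term = simple_maths(120, 200_000, 50, 100_000_000)
# ... and a function word that is equally common in both.
function_word = simple_maths(12_000, 200_000, 6_000_000, 100_000_000)
```

With these made-up numbers the domain term scores about 400, while the function word scores exactly 1, i.e. it is no more typical of the book than of the reference corpus.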

Since first experiments with this approach did not yield more promising results than the models described above, I decided to leave it to the Sketch Engine servers and focus on the more efficient YAKE and TF-IDF.

Corpus

Before I go into details about the software I developed, I’d like to introduce the corpora that were generated to train the TF-IDF model. We decided to exclude some publications like dictionaries for which the keyword extraction method is generally not suited, as well as some non-English books. This selection is not final and may well be improved in the future.

The result is a compressed file ready for use with the software, along with a detexed version of it; both can be found here.

The original detex software was quickly replaced with a custom version that is capable of mostly filtering out linguistic examples and tables. Though not perfect (it tends to “swallow” some paragraphs), it proved good enough for our purposes. Those interested can find the source code here.
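Purely as an illustration of the general idea, and not of the linked implementation, a very crude detex step could look like the sketch below. The function name and the list of skipped environments are assumptions, and real books use many more commands and example environments than this handles.

```python
import re

def crude_detex(tex, skip_envs=("table", "tabular", "exe")):
    """Very rough TeX-to-text conversion that drops whole environments."""
    text = re.sub(r"(?<!\\)%.*", "", tex)                   # strip comments
    for env in skip_envs:
        name = re.escape(env)
        block = r"\\begin\{" + name + r"\*?\}.*?\\end\{" + name + r"\*?\}"
        text = re.sub(block, " ", text, flags=re.DOTALL)    # drop tables, examples
    text = re.sub(r"\\(?:begin|end)\{[^}]*\}", " ", text)   # leftover env markers
    text = re.sub(r"\\[A-Za-z]+\*?", " ", text)             # command names
    text = re.sub(r"[{}\[\]]", " ", text)                   # grouping characters
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical snippet with a comment, a table and two formatting commands.
sample = r"""The \textbf{noun class} system % editorial note
\begin{table}\begin{tabular}{ll} a & b \end{tabular}\end{table}
is common in \emph{Bantu} languages."""
plain = crude_detex(sample)
```

Note that this keeps the arguments of all commands, so citation keys and labels would still end up in the output; a regex-only approach like this is also one plausible reason why paragraphs can get "swallowed" when environments are nested or unbalanced.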

The langscikw software

With the appropriate models selected, my task now was to piece them together to form usable software. I came up with a three-step model:

Since YAKE provided the best results in and of itself, it should be the first step. This way, it is possible to use the software without a training corpus.
In a second step, the results of a TF-IDF model are added. We now have twice the number of keywords.
Finally, a different TF-IDF model trained on the detexed corpus extracts even more terms. These are compared against a corpus of 10,000 keywords that appeared in published Language Science Press books and are thus assumed to be reliable, since each keywords list is edited by the author or volume editor. Only those terms that have been selected as key terms before are kept.

The software then deduplicates the list and performs some cleaning steps like removing non-alphabetical results. This works so well that despite extracting many more terms than we need, the returned list is approximately the desired length. We usually extract a list of about 300 keywords, which is manually reduced to about 100–200 terms. This is especially important here because the software deliberately only extracts bigram keyterms – in our experience single keywords are best determined by human intelligence.
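The three steps and the cleanup can be sketched roughly as follows. This is not the langscikw code: the function and variable names are made up, the three input lists stand in for the output of the actual models, and case-insensitive deduplication is an assumption.

```python
import re

# Exactly two words made of letters, with optional hyphens inside a word.
WORD = r"[^\W\d_]+(?:-[^\W\d_]+)*"
BIGRAM = re.compile(rf"^{WORD} {WORD}$")

def combine(yake_terms, tfidf_terms, tfidf_detexed_terms, known_keywords):
    """Merge the three candidate lists into one cleaned, sorted bigram list."""
    known = {k.lower() for k in known_keywords}
    # Step 3: keep only candidates that were used as keywords before.
    vetted = [t for t in tfidf_detexed_terms if t.lower() in known]
    seen, result = set(), []
    for term in yake_terms + tfidf_terms + vetted:
        key = term.lower()
        if key in seen or not BIGRAM.match(term):
            continue  # duplicate, not a bigram, or contains digits/symbols
        seen.add(key)
        result.append(term)
    return sorted(result)  # plain sort: capitalized terms come first

# Hypothetical model outputs, loosely based on the lists shown above.
keyterms = combine(
    yake_terms=["noun class", "Focus marking", "IPA"],
    tfidf_terms=["noun class", "3cm 3cm", "vowel system"],
    tfidf_detexed_terms=["head noun", "buy book", "vowel system"],
    known_keywords={"head noun", "vowel system"},
)
```

Here IPA is dropped because it is a single word, 3cm 3cm because it contains digits, and buy book because it never appeared in a published keyword list.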

Now we have an open-source Python package as well as an easy-to-use command-line interface that operate locally, don’t require registration or a license, can be used on raw TeX code (in fact, no preprocessing is necessary at all), and extract keyterms from a 200,000 word document in about two minutes.

Evaluation

langscikw outputs an alphabetically sorted list for technical reasons, so we can’t show any top results. Instead, I told both programs to extract 300 keywords and manually edited the lists like we do when constructing the Subject Index.

I ended up with 92 keywords for Sketch Engine (31%) vs. 107 keywords for langscikw (34% – I received 313 terms). Both lists had 46 terms in common (52% and 45%, respectively).
You can view the spreadsheet with the original and cleaned lists at the end of this post.
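For anyone who wants to repeat this kind of comparison, the overlap between two cleaned lists can be computed as in the sketch below. The function name is made up, and the two sample lists are just the first ten cleaned entries of each column in the appendix, so the resulting figures say nothing about the full lists.

```python
def overlap_report(list_a, list_b):
    """Return (shared count, share of list_a, share of list_b), ignoring case."""
    a = {term.lower() for term in list_a}
    b = {term.lower() for term in list_b}
    shared = a & b
    return len(shared), len(shared) / len(a), len(shared) / len(b)

# First ten cleaned entries per tool, taken from the appendix.
sketch_engine = [
    "accessibility hierarchy", "affirmative sentence", "Afranaph Project",
    "agreement", "ancestral cult", "applicative suffix", "argument licensing",
    "associative marker", "atr harmony", "bad news delivery",
]
langscikw = [
    "accessibility hierarchy", "adjective", "affirmative sentence", "affix",
    "affix hopping", "agreement", "anti-locality", "applicative suffix",
    "argument position", "aspect",
]
shared, share_se, share_lkw = overlap_report(sketch_engine, langscikw)
```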

While Sketch Engine introduces more noise in the form of linguistic examples (up to 25%), it actually requires less editing because these are easy for humans to spot and remove. langscikw oftentimes requires the user to supply antonyms (e.g. nonfinite clauses when finite clauses is given), whereas Sketch Engine finds these quite reliably. Also, Sketch Engine has a habit of including unigram, bigram and trigram versions of the same term, which I avoided by only working with bigrams.

Of course, my models could be tweaked to achieve even better results. For example, some sort of part-of-speech tagging could ensure that only noun phrases are selected as keyterms, since there is still quite a bit of noise and some nonsensical results.
In the future it would be nice to have a formal evaluation of the model performance and perhaps more fine-tuning of the parameters. An improvement of the detex software or internal pre- and postprocessing could also deliver better results.

All in all, I believe langscikw is a worthwhile alternative to Sketch Engine for our purposes.

You can find the published package on GitHub and PyPI.

Appendix: Key terms extracted via the different approaches

	Sketch Engine	Sketch Engine (cleaned)	langscikw	langscikw (cleaned)
1	accessibility hierarchy	accessibility hierarchy	Ada P-market	accessibility hierarchy
2	affecting mass communication	affirmative sentence	African languages	adjective
3	affirmative sentence	Afranaph Project	Agree relation	affirmative sentence
4	afranaph id	agreement	Akan SVCS	affix
5	agreement marker	ancestral cult	Akan tense	affix hopping
6	agreement morphology	applicative suffix	Akie documentation	agreement
7	amba rc	argument licensing	Akie language	anti-locality
8	amba relative	associative marker	Akie people	applicative suffix
9	ancestral cult	atr harmony	Akie traditions	argument position
10	applicative suffix	bad news delivery	American English	aspect
11	argument licensing	binary-neg structure	Asibi bought	associative marker
12	assoc-1pro aug	canonical verb	Asouk friends	back vowels
13	associative marker	clausal spine	Asouk remember	base generation
14	associative morphology	clause-initial position	Asouk thought	bound variable
15	associative-marked subject	cleft strategy	AspP Asp	canonical subject
16	atr harmony	complementary distribution	Bantu Lexical	clausal spine
17	attitude holder	complementizer system	Bantu language	code-switching
18	audio recorder	contrastive focus	Bongo-Bagirmi languages	complementizer system
19	baabi ara	count noun	CP domain	count noun
20	bad news	countability distinction	Chinese Japanese	countable mass noun
21	bad news delivery	countable mass noun	DP CP	CP
22	bad news management	direct object	DP edge	cross-linguistic
23	bare count	downward merger	Dhaisu speakers	demonstrative
24	bare countable mass	DP	Dhaisu vowel	direct object
25	bare form	endangered language	ECL speakers	discourse entities
26	base-generated prolepsis	escape hatch	EPP position	DP
27	binary-neg structure	existential quantifier	Endangered Languages	embedded subject
28	book prt	ex-situ focus	English translation	endangered language
29	book tomorrow	external argument	English type	EPP position
30	breaking bad news	final focus	Focus marking	escape hatch
31	buy book	finite clause	Fragment answer	existential quantifier
32	c1-c1-own sm	finite-nonfinite distinction	Gbaya languages	ex-situ focus
33	c1-c1-person wh-c1-om	focal constituent	Helsinki Corpus	external argument
34	c1-like-fv c1-c1-own	focus construction	In argue	feature percolation
35	c1-pst-pay-fv c1-c1-own	focus marker	In situ	final focus
36	c1-refl-say-fv c1-that	focus marking	In-situ focus	finite clauses
37	c1-say-fv c1-that	focus marking asymmetry	It called	focus marker
38	c1-that c1-c1-own	focus-sensitive particle	It important	focus marking
39	c1-that c1-c1-own sm	folktale narration	John house	focus marking asymmetry
40	canonical clause	fragment answer	Kenya English	focus position
41	canonical verb	future marker	Kenya Kiswahili	fragment answer
42	car foc	future tense	Kenya language	front vowels
43	ɔítɛ kɛɛn	grammatical role	Kinande Linker	future tense
44	class pronoun	head movement	Kofi buy	grammatical role
45	clausal spine	head noun	Kofi re-to	head movement
46	clause-initial position	in-situ focus	Kwa languages	head noun
47	cleft strategy	left periphery	Language choice	high vowels
48	communicative difficulty	licensing	Leipzig Glossing	H-tone
49	comp envy	lingua franca	Linguistica analysis	human rights
50	comp envy taro	long-distance licensing	Linker head	indefinite noun
51	comp food	mass communication	Long distance	indigenous languages
52	complementary distribution	mass noun	Lubukusu object	initial position
53	complementizer system	matrix argument	Mabia focus	in-situ focus
54	containing phase	matrix clause	Marker position	intermediate vowels
55	contrastive focus	matrix verb	Multiple Agree	island effects
56	controlled element	morphology	Ngbugu vowel	language vitality
57	cook thing-thing	negation	Niger Congo	lingua franca
58	cook thing-thing neg2	negative liberty	Noun denotations	Linguistica
59	count noun	non-divisive noun	Parallel Agree	Linker
60	countability distinction	nonfinite complement	Pre-negative subject	locality constraints
61	countable mass	noun	Proto Bantu	long-distance movement
62	countable mass noun	noun phrase	Question marker	marking devices
63	dà gbáŋ	NP	Stem counts	mass noun
64	dà gbǎŋ	object marker	Sudanic languages	matrix subject
65	dā gbáŋ	partial reduplication	Swahili morphology	mid-low vowels
66	ɗe ɲɛ	past tense	Swahili tensed	mood
67	delivering bad news	phase edge	TAM markers	moribund languages
68	dīg jāab-jāab	positive liberty	Tanzania Kiswahili	morpheme breaks
69	direct object	prefix	Tanzania Language	morphology
70	diseased individual	present tense	Tense Marker	Multiple Agree
71	documentary corpus	prolepsis	Tense morphology	negation
72	documentation agenda	proleptic object	Thagicu languages	negative liberty
73	documentation project	pronominal subject	This illustrated	nonfinite clauses
74	documentation team	proto phoneme	Ubangian languages	NPIs
75	downward merger	question particle	Unlocking theory	null pronoun
76	dp edge	reflexive marker	VP DP	null subject
77	effect of grammatical role	relative clause	VP sout	object marker
78	endangered language	relative marker	VSO word	oral genres
79	envy taro	restrictive clause	Volkswagen Foundation	past tense
80	escape hatch	sentence-final focus	We assume	phase edge
81	exact nature	subject marker	We leave	plural marker
82	example ex	suffix	West African	positive liberty
83	existential quantifier	syntactic licensing	Zulu nouns	pragmatic focus
84	external argument	tense	Zulu passives	prefix
85	extra matrix	unary-neg structure	a-li Bill	present tense
86	ɛnnɛ hwɛ	Unlocking theory	a-li George	prolepsis
87	ɛno ara	velar stop	accessibility hierarchy	proleptic object
88	ɛnyɛ yie	voiced pause	adjectives demonstratives	quantized noun
89	ɛte sɛn	vowel change	affirmative sentence	question marker
90	ɛyɛ deɛn	vowel merger	affix hopping	question particle
91	ɛyɛ duabɔ	vowel system	agreement morphology	reduplicated noun
92	ɛyɛ nkrɔfoɔ	word order	anti locality	relative clause
93	final focus		applicative suffix	relative marker
94	final focus particle		argument position	spoken language
95	final vowel		aspect features	strong pronouns
96	finite clause		aspects linguistic	subject marker
97	finite-nonfinite distinction		associative marker	TAM markers
98	focal constituent		attitude holder	tense
99	focus assignment		audio video	Unlocking theory
100	focus construction		back vowels	velar stop
101	focus marker		bare countable	vowel change
102	focus marker ka		bare form	vowel merger
103	focus marking		base generation	vowel system
104	focus marking system		basic properties	VP
105	focus particle		bend left	weak pronouns
106	focus system		book tomorrow	wh-movement
107	focus-sensitive particle		bought book	word order
108	folktale narration		bound variable
109	folktale narration session		buy book
110	following utterance		called associative
111	form complementizer		canonical subject
112	fragment answer		center north
113	fut buy		central vowel
114	fut buy book		cl cl
115	future marker		class language
116	goat bit		class prefixes
117	government business		clausal spine
118	grammatical role		clause matrix
119	harmony system		closest relative
120	head movement		code switching
121	head noun		common ground
122	ho hyehye		community members
123	ho hyehye yɛn		complementizer system
124	hyehye yɛn		constituent focus
125	indefinite noun class		control subject
126	information status		count noun
127	information structure		countable mass
128	in-situ strategy		cross linguistic
129	intergenerational transmission		de se
130	internal argument		denoting nouns
131	izingane amavuvuzela		direct object
132	khúú wt		discourse entities
133	khúú wt n		diseased individuals
134	lán fú-nī		documentation project
135	language documentation		dominant languages
136	language endangerment		edge embedded
137	language use		education system
138	left periphery		embedded subject
139	lgd bhúŋwànì		empty nodes
140	lingua franca		endangered language
141	linguistic meaning		escape hatch
142	ln nzì		ethnic groups
143	ln nzì khúú		evaluated object
144	ln nzì khúú wt		ex-situ strategy
145	ln nzì khúú wt n		existential quantifier
146	long-distance licensing		external argument
147	low tone		familiar language
148	m ln		feature percolation
149	m ln nzì		final focus
150	m ln nzì khúú		final particle
151	m ln nzì khúú wt		finite Swahili
152	managing bad news		finite clauses
153	marker ka		floating tone
154	marker prefix		focus function
155	marking asymmetry		focus languages
156	marking focus		focus marker
157	mass noun		focus position
158	matrix argument		focus system
159	matrix clause		focused element
160	matrix nominal		food cooked
161	matrix verb		foreign language
162	menhia obiara		front vowels
163	menhia obiara mmoa		future tense
164	mewɔ m		grammatical role
165	mixed behavior		head movement
166	morpheme ka		head nouns
167	morpheme kà		high mid
168	morphological realization		high vowels
169	name foc		human rights
170	narration session		indefinite noun
171	national language		indigenous languages
172	native speaker		information focus
173	nc-i nc-i		initial position
174	nc-i nc-i sfp		intermediate vowels
175	nc-i sfp		involving English
176	ndugu yake		island effects
177	neg think		ki ki
178	neg1 cook		ko ko
179	negative liberty		language morphology
180	news bearer		language spoken
181	news delivery		language vitality
182	next section		left edge
183	ni oo		lingua franca
184	non-divisive noun		linguistic resources
185	nonfinite complement		local language
186	non-restrictive clause		local subjects
187	non-restrictive rc		locality constraints
188	non-subject constituent		long ago
189	non-subject focus		low tone
190	non-subject split		made child
191	noun class		managing bad
192	noun class marker		mark focus
193	noun class pronoun		marker prefix
194	noun denotation		marking asymmetry
195	noun phrase		marking devices
196	numeral modification		mass nouns
197	nur ā		matrix embedded
198	nur nur		matrix subject
199	nya wāā		mid high
200	nyinaa yɛn		mid low
201	nyinaa yɛn ho		mid-low vowels
202	nyinaa yɛn ho hyehye		modified numeral
203	nyinaa yɛn ho hyehye yɛn		moribund languages
204	nzì khúú		morpheme breaks
205	nzì khúú wt		morphological analysis
206	nzì khúú wt n		movement based
207	obiara mmoa		multiple verbs
208	object agreement		na ta
209	object marker		narrow focus
210	object np		national language
211	oral vowel		native speaker
212	overt subject		negative positive
213	p evening		neighboring language
214	partial reduplication		ni li
215	particular constituent		ni na
216	person person		ni tu
217	person person sfp		nice empty
218	person sfp		node midway
219	pfv cl-food		node style
220	phase edge		nominal NPIs
221	phonemic status		non-restrictive RCs
222	plural suffix		northern Ghana
223	plural-marked count		noun classes
224	positive liberty		nouns countable
225	pre-negative subject position		nouns modified
226	previous section		null pronoun
227	proleptic object		null subject
228	pronominal subject		number affixes
229	propositional attitude		number words
230	proto phoneme		object RCs
231	question particle		object matrix
232	rc length		observed subject
233	rc variation		occurs clause
234	recursive possession		official language
235	reflexive marker		oral genres
236	relative clause		order language
237	relative marker		outer aspect
238	relative type		overt pronominal
239	restrictive clause		overt subjects
240	restrictive rc		part collection
241	sentence final focus		past tense
242	sentence final focus particle		person person
243	sentence-final particle		phase edge
244	sentential negation		phonemic status
245	sɛ ɛyɛ deɛn		place focus
246	sɛ mete		plural marker
247	sg-marked countable mass		plural noun
248	shift transition		position subject
249	singulative suffix		positive liberty
250	speaker population		post verbal
251	subject agreement		potential stem
252	subject extraction		pragmatic focus
253	subject focus		present analysis
254	subject marker		primary language
255	subject marker prefix		proleptic object
256	subject position		pronominal agreement
257	such extraction		property holds
258	swahili amba		propositional attitude
259	swahili rc		quantized noun
260	swahili rc variation		question particle
261	syntactic displacement		recording equipment
262	syntactic distribution		reduplicated noun
263	syntactic licensing		related languages
264	syntactic position		relative marker
265	system vowel		requires Agree
266	tense marker		restrictive restrictive
267	tex localbibliography		role head
268	thing-thing neg2		rural areas
269	three-way countability		scale node
270	three-way countability distinction		sentence final
271	tomorrow buy		sentential negation
272	tomorrow buy book		shares properties
273	tomorrow fut		single head
274	tomorrow fut buy		small Swahili
275	tomorrow fut buy book		south bend
276	traditional court		south south
277	true mass		speaker population
278	unary-neg structure		speech communities
279	unlocking theory		status head
280	upcoming bad news		strong pronouns
281	velar stop		style center
282	voiced pause		subject marker
283	vowel change		subject subject
284	vowel harmony		suffix shown
285	vowel merger		syntactic position
286	vowel system		system vowel
287	vowel system vowel		target group
288	wà dà		ten languages
289	wāā-ī ā		tense aspect
290	wɔn ayɛ		tense morpheme
291	word order		tensed restrictive
292	wt n		term focus
293	yɛn ho		text part
294	yɛn ho hyehye		top signature
295	yɛn ho hyehye yɛn		top ten
296	yɛn nyinaa		topic topic
297	yɛn nyinaa yɛn		transcribed annotated
298	yɛn nyinaa yɛn ho		true mass
299	yɛn nyinaa yɛn ho hyehye		turn Asouk
300	yór ń-dàn		urban areas
301			velar stop
302			verb constructions
303			verb stems
304			video recordings
305			vowel change
306			vowel merger
307			vowel systems
308			wa na
309			west south
310			wh movement
311			wh situ
312			words corpus
313			younger generation

Language Science Press Blog