The constituency of hypertext corpora
A unique property of the hypertext medium is that authors naturally mark certain stretches of text as units, in the form of inline hyperlinks. What kind of linguistic objects are inline hyperlinks? I have investigated this question using a 5.7M-word corpus of English hypertext containing 375k links. Manual annotation of a sample of 5,000 links from the corpus found that 94% of inline links are constituents of their host sentences. The exceptions largely cluster into other kinds of linguistic objects, e.g. strings that would be constituents but for a large adjunct at the right edge that the link leaves out. I am pursuing the hypothesis that hyperlinking is constrained by the author’s knowledge of the text’s underlying linguistic structure, and that it could therefore ultimately serve as a test for syntactic constituency, once we better understand the possible range of exceptions.
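To make concrete what it means for a link to be a constituent of its host sentence, here is a minimal sketch. It is not the procedure used in the study, which relied on manual annotation. It assumes a Penn Treebank-style bracketed parse of the host sentence is already available; the function names and the example sentence are hypothetical and chosen only for illustration.

```python
def constituent_spans(tree: str) -> set[tuple[int, int]]:
    """Collect the (start, end) word spans covered by every node of a
    bracketed parse such as "(S (NP (DT the) (NN cat)) ...)".
    Spans are half-open: words start .. end-1."""
    tokens = tree.replace("(", " ( ").replace(")", " ) ").split()
    spans = set()
    open_starts = []       # word index at which each open node begins
    expect_label = False   # True right after "(", when a category label follows
    word_index = 0
    for tok in tokens:
        if tok == "(":
            open_starts.append(word_index)
            expect_label = True
        elif tok == ")":
            spans.add((open_starts.pop(), word_index))
        elif expect_label:
            expect_label = False   # skip the category label (S, NP, VP, ...)
        else:
            word_index += 1        # an actual word of the sentence
    return spans


def is_constituent(tree: str, start: int, end: int) -> bool:
    """True if the words start .. end-1 are exactly the yield of some node."""
    return (start, end) in constituent_spans(tree)


# Hypothetical host sentence (not from the corpus):
# the(0) senator(1) signed(2) the(3) energy(4) bill(5) on(6) Tuesday(7)
TREE = ("(S (NP (DT the) (NN senator)) "
        "(VP (VBD signed) (NP (DT the) (NN energy) (NN bill)) "
        "(PP (IN on) (NP (NNP Tuesday)))))")

print(is_constituent(TREE, 3, 6))  # link text "the energy bill": True
print(is_constituent(TREE, 2, 6))  # link text "signed the energy bill": False
print(is_constituent(TREE, 1, 4))  # link text "senator signed the": False
```

Under this particular parse, the second case illustrates the exception type mentioned above: "signed the energy bill" would be a full verb phrase if the right-edge adjunct "on Tuesday" were included.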
“The Constituency of Hyperlinks in a Hypertext Corpus.”
Presented at the International Society for the Linguistics of English (ISLE 2), Boston University.