Beyond tokens: what character counts say about a page

Peter Organisciak
Sense and Sentences
Oct 20, 2016


When talking about quantitative features in text analysis, the term token count is king, but other features can help infer the content and context of a page. I demonstrate visually how the characters at the margins of a page can show us intuitively sensible patterns in text.

At the HTRC, we distribute a dataset of features extracted from millions of digitized books. A feature refers generally to something being quantitatively measured. The term is not used often in the context of text analysis because there is little need for a generalized term when you are mainly dealing with one specific type of feature: the token count. The HTRC Extracted Features Dataset includes the ever-useful token counts, but it also shares other measures of what is on a page. Here, I demonstrate one such set of features: counts of the characters at the left-most and right-most sides of a page.

Using just the character count features and applying a simple clustering technique (nothing fancy or complex), it is possible to surface different types of pages: prose, poetry, tables, indices, tables of contents, and so on. Such information is useful for focusing other analyses. For example, if you are trying to understand what an author is writing about in a work of fiction, you want to focus on the content and skip over irrelevant, noisy text such as back-of-the-book advertisements.

The purpose of this post is to show this visually, with demonstrative pages from the clusters. Seeing how pages cluster by character patterns makes it intuitive how the feature may be useful, so I'll try not to overwhelm the examples with too much commentary.

The marginal character count feature was included in the HTRC EF dataset in support of genre classification work by Ted Underwood, following a suggestion by David Mimno. For each page, a count is provided for all the characters that occur at the left-most side of the page (referred to as beginLineChars in the dataset) and at the right-most side of the page (endLineChars).
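To make this concrete, here is a minimal sketch of pulling those counts out of an EF file. It assumes the EF file layout I describe above: a bzip2-compressed JSON file where each entry of features.pages carries a body section with beginLineChars and endLineChars as character-to-count dictionaries. The filename is a placeholder.

import bz2
import json

# Assumed EF JSON layout; the filename is hypothetical.
with bz2.open("sample_volume.json.bz2", "rt", encoding="utf-8") as f:
    volume = json.load(f)

for page in volume["features"]["pages"]:
    body = page["body"]
    begin = body.get("beginLineChars", {})  # e.g. {"T": 3, "a": 5, ...}
    end = body.get("endLineChars", {})      # e.g. {".": 4, "-": 2, ...}
    print(page["seq"], sum(begin.values()), sum(end.values()))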

In the following page from Paul Clifton, I’ve highlighted the left-most and right-most characters. These are the characters that are counted for each page in the HTRC EF dataset. Note that the actual characters that occur are different between the poetry and prose; would you be able to work backward to guess which is which if given just those characters?

[Image: a page from Paul Clifton with the left-most and right-most characters highlighted]

When we see patterns in character counts, they are being influenced by what is on the page. Some patterns are structural: there is a visual rhythm to how poetry is written and conventions to how a title page is laid out. With prose, where a line break usually means nothing more than “there was no more space on this line”, the patterns look more like the general distributions of characters in that language’s words. Within that, there is still a rhythm to paragraph breaks, and more paragraphs manifest as more capital letters on the left side of a page and more punctuation on the right.

To demonstrate how such patterns manifest, I performed clustering with the K-Means algorithm to partition pages by their character count co-occurrences. As input, counts of beginLineChars and endLineChars were taken from 4 million random pages of English books, log transformed, and with rare characters filtered out. Languages have different distributions of characters as well as different characters, so without restricting to English pages, clustering the counts would mostly tell us something we already know: what language a page is in.
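As a rough sketch of that preprocessing, assuming each page's begin- and end-line counts have been flattened into one dictionary per page (a hypothetical records list, with keys like "begin:T" and "end:."):

import numpy as np
import pandas as pd

# records: a list of {feature: count} dicts, one per page, assumed built
# from beginLineChars/endLineChars as described above.
counts = pd.DataFrame(records).fillna(0)

# Filter rare characters: keep features appearing on at least 0.1% of pages.
# (An illustrative threshold, not necessarily the one used for the clusters.)
counts = counts.loc[:, (counts > 0).mean() >= 0.001]

# Log transform to soften the influence of page length.
X = np.log1p(counts.values)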

The standard K-Means algorithm separates a set of multi-dimensional points into k groups in a conceptually simple way. First, k centroids are randomly initialized, and every observation is associated with the centroid closest to it. These are our groupings, but they are not useful yet, since the centroids were placed randomly. To make the clusters slightly better, we update the centroids, moving each one to the center of all the observations associated with it. This process is repeated (move centroids based on associated observations, then update observation associations based on centroid locations) until, hopefully, the cluster membership stops changing with updates. There are many edge cases to which K-Means is not robust, but that doesn't matter in our demonstrative case: as we'll see, even this simple algorithm finds some intuitively sensible clusters.
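For the curious, that loop fits in a few lines of numpy. This is a bare-bones sketch for illustration; a library implementation such as scikit-learn's KMeans adds better initialization and handles the edge cases.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Randomly initialize k centroids by picking k observations.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Associate each observation with its closest centroid.
        # (Fine on a sample; this broadcast is memory-hungry at millions of pages.)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the center of its associated observations.
        # (An empty cluster would produce NaNs here; one of those edge cases.)
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):  # membership has settled
            break
        centroids = new_centroids
    return labels, centroids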

In our case, the dimensions are counts of character occurrences, and each page is an observation, a single point plotted in those dimensions. If you’re having trouble imagining a multidimensional space for hundreds of characters, try to picture a case where all that is measured is capital letters and lowercase letters. If a page has 3 capital letters and 2 lowercase letters, you can plot it on a scatterplot with an x-axis and a y-axis. In higher-dimensional spaces, the same ideas hold, even though there are many more axes than just x and y.

Inspecting page clusters

What do the clusters look like? Below is a selection of clusters with representative pages from each. The cluster with the most members is the evenly distributed prose cluster, notable for being stylistically unremarkable.

[Image: representative pages from the prose cluster]

The most representative characters for pages in this cluster are [t, a, s] at the start of lines and [-, e, ., s, d, n, t] at the end. These are similar to the distributions of characters that start and end words in the English language ([t, a, o, s, w] and [e, s, d, t, n], respectively), with the exception of the hyphen at the end of lines, used to continue long words.
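If you're wondering how to read such a list off a fitted model, one simple approach (an assumption on my part, not necessarily how the lists here were produced) is to sort each centroid's dimensions by weight:

import numpy as np

def top_characters(centroid, columns, n=7):
    # columns names each dimension, e.g. "begin:t" or "end:-".
    order = np.argsort(centroid)[::-1][:n]  # heaviest dimensions first
    return [columns[i] for i in order]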

Further clusters look like poetry (capital letters starting lines, punctuation ending them), …

[Image: representative pages from the poetry cluster]

directories, …

[Image: representative pages from the directories cluster]

tables of contents (capitals and numbers… note though that the first image fits into this pattern but isn’t a TOC),

[Image: representative pages from the tables-of-contents cluster]

and numerous others. Here’s a slideshow of more clusters, including two-column pages, tables, and dialogue-heavy prose.

[Slideshow: additional cluster examples]

A Latent Feature: Page Length

In the clusters above, you may notice that the approach indirectly made use of an additional bit of information: the number of lines on a page. When there are more lines on a page, it follows that there are more characters starting and ending those lines. The log transformation performed on the raw character counts before clustering softened the influence of page length, but I expect you wouldn’t want to remove it entirely.
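A tiny illustration of the difference, using two made-up pages: log-scaling dampens page length, while normalizing each page by its total removes it entirely.

import numpy as np

raw = np.array([[2., 1.],     # a short page
                [20., 10.]])  # a long page with the same character mix

print(np.log1p(raw))  # the long page still stands out, gently
print(raw / raw.sum(axis=1, keepdims=True))  # the two pages become identical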

Bonus: Feature Selection

The following two lists show the characters that occurred in more than 0.1% of lines, sorted by how much they discriminate where in the book a page occurs (e.g. is it a page at the start, in the middle, or at the very end); a sketch of one way to compute such a ranking follows the lists. If you had to choose only a few characters for guessing where in a book a page occurs, you wouldn’t choose ones like lowercase ‘t’ or ‘s’. I’m not sure how useful this is, but it certainly is intriguing to try to guess at what makes a given character notable.

Most discriminating characters at the start of a line
['»' '“' '_' 'A' '■' '•' 'I' "'" '*' '[' '-' 'j' 'Y' 'O' '$' 'F' 'G' 'J'
'H' 'D' 'K' 'L' 'M' 'E' '8' 'C' 'B' '9' '7' '6' '5' '4' '3' '2' '1' '0'
'.' ',' '(' 'N' 'S' 'P' 'k' '—' 'y' 'w' 'v' 'u' 't' 's' 'r' 'q' 'p' 'o'
'n' 'm' 'l' 'i' 'Q' 'h' 'g' 'f' 'e' 'd' 'c' 'b' 'a' 'X' 'W' 'V' 'U' 'T'
'R' '"']
Most discriminating characters at the end of a line
['I' 'τ' '_' 'H' 'M' '»' 'O' 'c' 'A' '—' '”' 'i' ']' '*' 'T' "'" 'R' 'u'
':' 'o' '9' '8' '7' '6' '5' '4' '3' '?' '2' '1' '0' '.' '-' ',' ')' '"'
';' 'E' 'y' 'D' 'm' 'l' 'k' 'h' 'g' 'f' 'e' 'd' 'p' 'a' 'r' 'S' 's' 'N'
't' 'w' 'n' '!']
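The lists above don't say how “discriminating” was measured, but here is one plausible reconstruction, assuming the page-by-character matrix X and counts from the earlier sketches plus a hypothetical position label ("start", "middle", or "end") for each page. This uses scikit-learn's ANOVA F-test, which may differ from the measure actually used:

from sklearn.feature_selection import f_classif

# F is higher for characters whose counts differ more across positions.
F, _ = f_classif(X, position)
ranking = [col for _, col in sorted(zip(F, counts.columns), reverse=True)]
print(ranking[:10])  # the most discriminating characters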

Another way of looking at notable features is agglomeration: grouping together individual characters that tend to occur together. Depending on how many features you want, you can then group the groups, and so on. This process is useful for machine learning algorithms, to avoid having too many features, and visualizing it as a dendrogram gives you an idea of which patterns occur together.
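A sketch of that agglomeration with scipy, reusing X and counts from the earlier sketches (Ward linkage is one common choice, not necessarily the one used here; running it on a sample of pages keeps the distance computation cheap):

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Cluster characters (columns) by how similarly they behave across pages.
Z = linkage(X.T, method="ward")
dendrogram(Z, labels=list(counts.columns))
plt.show()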

Below is a dendrogram of such agglomerative clustering. Note, for example, that the digits could probably be counted together: counting ‘6’ and ‘7’ individually doesn’t tell you much more than counting them as one group. As before, the exercise of seeing how these structures emerge is more important than any commentary I might provide. What looks peculiar to you? What’s an artifact? What’s surprising?

[Image: dendrogram of agglomerative clustering of begin- and end-line characters]
