Tagged in Text Mining

Sense and Sentences

Data mining the past, by Peter Organisciak

Peter Organisciak in Sense and Sentences

Jul 17, 2017

Writing a poetry generator

Taking apart an old dictionary and remixing it in rhyme.

Peter Organisciak in Sense and Sentences

Oct 20, 2016

Beyond tokens: what character counts say about a page

When talking about quantitative features in text analysis, the term token count is king, but other features can help infer the content and context of a page. I demonstrate visually how the characters at the margins of a page can show us…

Peter Organisciak in Sense and Sentences

Jun 14, 2016

Understanding Classified Languages in the HathiTrust

Peter Organisciak in Sense and Sentences

Mar 18, 2016

A Dataset of Term Stats in Literature

Following up on Term Weighting for Humanists, I’m sharing data and code to apply term weighting to literature in the HTRC’s Extracted Features dataset.

Prepared from 235,000 Language and Literature (i.e. LCC Class P) volumes, I’ve…

Peter Organisciak in Sense and Sentences

Mar 9, 2016

Term Weighting for Humanists

An introduction to TF-IDF

Peter Organisciak in Sense and Sentences

Mar 2, 2016

HTRC Feature Reader 2.0

I’ve released an overhaul of the HTRC Feature Reader, a Python library that makes it easy to work with the Extracted Features (EF) dataset from the HathiTrust. EF provides page-level feature counts for 4.8 million volumes, including part-of-speech tagged term counts, line and sentence…

Peter Organisciak in Sense and Sentences

Dec 7, 2015

MARC Fields in the HathiTrust

At the HathiTrust Research Center, we’re often asked about metadata coverage for the nearly 15 million HathiTrust records. Though we provide additional derived features like author gender and language inference, the metadata is generally what arrives from institutional…

Peter Organisciak in Sense and Sentences

Oct 27, 2015

Your First Twitter Bot, in 20 minutes