title: TEXT ANALYTICS - A SHORTCUT TO LINGUISTIC EVIDENCE
place: KUA Søndre Campus, Denmark
time: October 22-23, 9:00-16:00 (both days).
instructors: Claus Povlsen (CST|KU) & Kristoffer L. Nielbo (datakuben|SDU)
contact: cpovlsen@hum.ku.dk, kln@cas.au.dk
The recent explosion in digitized and digital text-media is rapidly changing
the evidential basis for the humanities. While the humanities used to be the principal
scientific consumers of text-based data, the majority of text analysis is now performed
by 'machines' outside traditional humanistic domains.
Text analytics applies automated and data-intensive techniques in order to extract
useful knowledge from large collections of linguistic data. In this PhD course,
participants will acquire experience with two major machine learning paradigms (supervised and unsupervised learning)
in order to answer research questions fundamental to the humanities: can we classify texts by genre, period and status,
and how do surface structures reveal latent semantic properties?
The workshop consists of a series of hands-on tutorials with Python, combined with explanations and illustrations through use cases. Programming experience is not a requirement, but participants are expected to prepare by installing Python and completing three introductory tutorials available online.
TEXT ANALYTICS, TEXT DATA MINING, DIGITAL HUMANITIES, HUMANITIES COMPUTING, CULTURE ANALYTICS
DAY 1: Thematic Analysis and Unsupervised Learning
| Time | Content | Instructor |
|---|---|---|
| 09:00-10:00 | Text Analytics/ML | KLN |
| 10:00-11:00 | Topic Modeling | KLN |
| 11:00-12:00 | Preparation | KLN |
| 12:00-13:00 | Lunch | |
| 13:00-14:00 | Training | KLN |
| 14:00-15:00 | Application | KLN |
| 15:00-16:00 | Free Play | KLN |
DAY 2: Document Classification and Supervised Learning
| Time | Content | Instructor |
|---|---|---|
| 09:00-10:00 | Text Analytics/buffer | KLN |
| 10:00-11:00 | Document Classification | KLN |
| 11:00-12:00 | Representation | KLN |
| 12:00-13:00 | Lunch | |
| 13:00-14:00 | Validation | KLN |
| 14:00-15:00 | Optimization | KLN |
| 15:00-16:00 | Free Play | KLN |
Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77–84.
Broadwell, P., Mimno, D., & Tangherlini, T. R. (2017). The Tell-Tale Hat: Surfacing the Uncertainty in Folklore Classification. Journal of Culture Analytics, DOI: 10.22148/16.012.
Brücher, H., Knolmayer, G., & Mittermayer, M. A. (2002). Document classification methods for organizing explicit knowledge. Institut für Wirtschaftsinformatik der Universität Bern.
Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI Magazine, 17(3), 37.
Jänicke, S., Franzini, G., Cheema, M. F., & Scheuermann, G. (2015). On Close and Distant Reading in Digital Humanities: A Survey and Future Challenges. In Eurographics Conference on Visualization (EuroVis), DOI: 10.2312/eurovisstar.20151113.
Jockers, M. L., & Mimno, D. (2013). Significant themes in 19th-century literature. Poetics, 41(6), 750–769.
Radovanović, M., & Ivanović, M. (2008). Text mining: Approaches and applications. Novi Sad J. Math, 38(3), 227–234.
Tangherlini, T. R., & Leonard, P. (2013). Trawling in the Sea of the Great Unread: Sub-corpus topic modeling and Humanities research. Poetics, 41(6), 725–749.
Underwood, T. (2016). The Life Cycles of Genres. Journal of Culture Analytics, DOI: 10.22148/16.005.
- Install the Anaconda distribution of Python
- Go through episodes 1-3 in Software Carpentry's Python Lesson
1.2 ECTS for course attendance, reading and preparation