Processing your XML/TEI with the XML Family of Languages

 

DHSI 2026: Week 2, 15–19 June, 2026

Instructors: Elisa Beshero-Bondar and David J. Birnbaum

Socials: Bluesky: @epyllia | Mastodon: epyllia@indieweb.social | GitHub: @ebeshero

 

Elisa Beshero-Bondar

  • Started as Asst/Assoc Prof. of English Lit
  • Now Prof. of Digital Humanities at Penn State Behrend
  • Learning to build scholarly editions and investigate research questions w/ code / markup changed everything!
  • Got involved w/ the TEI (maybe you should, too!)
  • Now Chair of the TEI Technical Council, helping maintain/revise the TEI Guidelines
  • Projects: Frankenstein Variorum and Digital Mitford, and newtfire.org
  • David dared me (as a professor) to take his 15-week class in DH / code, and I learned to build my first projects

David J. Birnbaum

  • Prof. of Slavic Languages and Literatures | formative role w/ TEI Council
  • Immense talent with XSLT → visualizations / interfaces
  • obdurodon.org (projects and teaching materials)
  • A favorite of his projects: Daniel the Prisoner: A Virtual Florilegium

Why bother? Shouldn't we just tell AI to do all this processing now?

  • AI doesn't build or sustain an archive for cultural heritage on its own, the way a scholarly editor would.
     
  • Scholars who work w/ XML projects care about provenance, accuracy, repeatability (sharing the code-base so others can run it / duplicate results).
     
  • Handing it all to AI works only if we're comfortable with approximating everything and overusing GPUs. The code bloat in general-purpose LLMs is pretty extreme for something w/ poor editorial standards.

Image credit: Hugh Kenny via Piedmont Environmental Council, from Jon Gorey, “Data Drain: The Land and Water Impacts of the AI Boom,” Lincoln Institute of Land Policy, 17 October 2025.

Markup Tech / Editorial Markup

 

Image credit: Chicago Manual of Style: How Books and Journals are Produced

  • XML-based research projects = carefully constructed processing systems. They benefit from custom AI applications that extend the scope of an LLM. 
    • Humanities: digital scholarly editions / machine-assisted collation for critical editions
    • Health care: XML systems that meet requirements for precision / standards in curating and protecting data
  • Learning XML data structures and processing helps you
    • inspect the code of a big markup project for yourself
    • define your own small AI systems
    • design precision instruments for humanities projects

AI applications with the XML Stack

  • Knowing XML/XPath, you can use AI with quality control:
    • Coding assistant: help with manuscript analysis, validation, debugging
    • Pair with Handwritten Text Recognition, machine learning processes
  • Humanities text-scholarship: designing the data structures and defining relationships between parts
  • My current digitxml-mcp project:
    • developing a small, XPath-capable Agentic AI Language Model with an MCP server
    • Able to apply + adapt Saxon XPath/XQuery/XSLT processing with my guidance
    • all running on a laptop and shared on GitHub

Build smaller custom AI tools on your own terms

  • No technology is neutral. Who controls the systems for design and analysis of cultural heritage resources?
     
  • LLMs package unstructured text in regular "chunks" defined by character counts. Context is truncated, meaning is approximated. Statistical significance counts as "good enough."
     
  • Text scholars with markup tools design and transform data structures. Coding / processing helps give us deliberate scholarly authority to represent documents as data structures.
     
  • This, too, is academic writing!

From student-made TEI/XSLT Emily Dickinson Fascicles 6 + 16 project

You're ready for our DHSI course if. . .

. . . you've worked with some XML and want to do more with the angle brackets.  

If you want to get started with XML/TEI markup . . .

<corpus>
    <sentence>
        <orth>Мы с Марко поеха-л-и автобус-ом в Переделкино</orth>
        <translit>My s Marko poexa-l-i avtobus-om v Peredelkino.</translit>
        <ilg>we with Marko go-PST-P bus-by to Peredelkino.</ilg>
        <free>Marko and I went to Peredelkino by bus.</free>
    </sentence>
</corpus>

Code: Leipzig coding conventions for a transliterated / translated passage used in our class. You don't have to know the languages represented to explore and test the markup!
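
One way to test this markup outside an XML editor: the sketch below uses Python's standard-library ElementTree (chosen here only for illustration, not a tool named in these slides) to pair each transliterated word with its gloss. The function name align_glosses is made up for this example, and it assumes the <translit> and <ilg> lines always hold the same number of space-separated words, as they do in this sample.

```python
import xml.etree.ElementTree as ET

# The sample from the slide, copied unchanged.
SAMPLE = """<corpus>
    <sentence>
        <orth>Мы с Марко поеха-л-и автобус-ом в Переделкино</orth>
        <translit>My s Marko poexa-l-i avtobus-om v Peredelkino.</translit>
        <ilg>we with Marko go-PST-P bus-by to Peredelkino.</ilg>
        <free>Marko and I went to Peredelkino by bus.</free>
    </sentence>
</corpus>"""


def align_glosses(xml_text):
    """Return, per sentence, a list of (transliterated word, gloss) pairs."""
    root = ET.fromstring(xml_text)
    rows = []
    for sentence in root.findall("sentence"):
        words = sentence.findtext("translit").split()
        glosses = sentence.findtext("ilg").split()
        # zip() pairs items by position: word 1 with gloss 1, and so on.
        rows.append(list(zip(words, glosses)))
    return rows


pairs = align_glosses(SAMPLE)[0]
for word, gloss in pairs:
    print(f"{word:<14}{gloss}")
```

XPath's tokenize() function plays a role similar to split() here, so the same word-by-word pairing can also be written in XPath or XQuery.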

...take Connie Crompton's course in week 1 first! 

[Foundations] Text Encoding Fundamentals and Their Application 

 

1. XPath and XQuery via eXist-db and oXygen

2. XQuery + XSLT:

flow processing / transformation / re-mediation / interface design / creative art!

Source: student Mia Borgia's Blues Analysis Project, working from a mostly broken 1990s web archive of blues lyrics, writers, + performers

XSLT made this heatmap as a guide to serendipitous exploration of the Frankenstein Variorum

 

 

See our Method page for details:

https://frankensteinvariorum.org/

 

TEI + XSLT helped make this possible for CSS + JS

3. Schematron:

Project management, customized rules based on XPath relationships and evaluation of data

 

  • Written in XML; lets you write your own custom, human-readable error messages
  • Applies XPath, processed like XSLT 
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <sch:pattern>
        <!-- Page range: the end page must not come before the start page -->
        <sch:rule context="end">
            <sch:let name="start" value="number(preceding-sibling::start)"/>
            <sch:report test="number(.) lt $start">The end page 
              (<sch:value-of select="."/>) 
              cannot be less than the start page
              (<sch:value-of select="$start"/>)</sch:report>
        </sch:rule>
        <!-- Typography: flag straight quotes in any text node -->
        <sch:rule context="text()" role="warning">
            <sch:report test="matches(., '[&quot;'']')"
                        role="warning">Text contains 
              straight apostrophe or quotation mark</sch:report>
        </sch:rule>
        <!-- Completeness: every bibItem should have an issue child -->
        <sch:rule context="bibItem">
            <sch:report test="not(issue)" 
                        role="warning">Issue number is missing</sch:report>
        </sch:rule>
        <!-- Consistency: an initial should be a single character -->
        <sch:rule context="initial">
            <sch:report test="string-length(.) gt 1"
                        role="warning">Author has a middle 
              initial of length greater than 1</sch:report>
        </sch:rule>
    </sch:pattern>
</sch:schema>
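
To see what the four rules above are looking for, here is a rough Python sketch that mirrors the same checks using only the standard library. It is not Schematron and not how the class runs validation. The sample record is invented, and its element nesting (a pages wrapper, initial inside author) is an assumption inferred only from the element names used in the rules. The names check and to_number are hypothetical helpers.

```python
import math
import re
import xml.etree.ElementTree as ET

# Hypothetical record: the nesting is assumed, not taken from the class data.
SAMPLE = """<bibliography>
    <bibItem>
        <author>Smith, J. <initial>QR</initial></author>
        <title>A "quoted" title</title>
        <pages><start>120</start><end>98</end></pages>
    </bibItem>
</bibliography>"""

STRAIGHT_QUOTE = re.compile("[\"']")


def to_number(text):
    """Mimic XPath number(): give NaN instead of raising on bad input."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return math.nan


def check(xml_text):
    """Return (kind, message) pairs for each problem found."""
    root = ET.fromstring(xml_text)
    found = []
    # Rule 1: an end page smaller than a preceding sibling start page.
    for parent in root.iter():
        start_text = None
        for child in parent:
            if child.tag == "start":
                start_text = child.text
            elif child.tag == "end":
                # Comparisons with NaN are False, so missing values pass.
                if to_number(child.text) < to_number(start_text):
                    found.append(("page-range",
                                  f"end page ({child.text}) cannot be less "
                                  f"than the start page ({start_text})"))
    # Rule 2: straight quotes in any text node (element text or tail).
    for el in root.iter():
        for chunk in (el.text, el.tail):
            if chunk and STRAIGHT_QUOTE.search(chunk):
                found.append(("straight-quote", chunk.strip()))
    # Rule 3: a bibItem with no issue child.
    for bib in root.iter("bibItem"):
        if bib.find("issue") is None:
            found.append(("missing-issue", "Issue number is missing"))
    # Rule 4: an initial whose string value is longer than one character.
    for el in root.iter("initial"):
        if len("".join(el.itertext())) > 1:
            found.append(("long-initial", "".join(el.itertext())))
    return found


for kind, message in check(SAMPLE):
    print(f"{kind}: {message}")
```

The Schematron version states each check as one XPath test next to its message, which is a large part of its appeal for project management: the rules stay readable by the whole editorial team.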

Our class website (built w/ XML, XSLT + Schematron) helps you review and apply the materials during class + after it ends!

Bring your project to class if you want some guidance with it!

 

During lunch hours / afternoons / evenings, David and I are often tinkering w/ our students' code: solving a problem or figuring out a challenging task like...

  • mapping XML to RDF (linked data) via XSLT
  • figuring out how to write Schematron to validate something messy/confusing
  • helping you design a way to read your markup data
  • moving a bit beyond the horizons of what we're teaching
  • If you're willing to share your code w/ the class, we all learn from it!

 

DHSI Processing XML Feature

By Elisa Beshero-Bondar
