March 5, 2019 · Shop talk

Studio Tech Hour: Data mining historical content

What is Studio Tech Hour? It is actually two hours! Drop by the Collaboration Studio at Vassar Libraries on Tuesdays from 5:00 to 7:00 pm for tech talk, consultation on digital projects, and general support with technology both in and out of the classroom. In between questions, I'll be exploring a new idea or method each week. Feel free to join me!

This week I am starting with the amazing Programming Historian. If you aren't familiar with this blog, its Lessons page is a fantastic launching point for brainstorming the kinds of projects you might want to start and the best tools to accomplish them. They started writing lessons in Spanish in 2017 and will begin publishing French-language lessons this year. Today I am checking out Caleb McDaniel's exercise: Data Mining the Internet Archive (also available in Spanish!).

One takeaway: as a librarian, and doubly so as a techbrarian, I am always trying to avoid jargon when I work with others. I was thrilled to find this awesome breakdown of the nebulous word "item" on the Archive-It blog: How Archive.org Items are Structured

How do you know whether your files should be in one item or separate items? You get one metadata file per item. If the same metadata describes ALL of the files (like a CD), then that’s one item. If the files are too different to have the same metadata (title, creator, description, etc.), they should be in different items.
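
To make that rule of thumb concrete, here is a minimal Python sketch that sorts files into items according to whether they share the same metadata. The file names, the metadata values, and the group_into_items helper are all made up for illustration; this is not how the Internet Archive itself does it, just one way to picture the idea.

```python
from collections import defaultdict

# Hypothetical file records: names and metadata are invented for illustration.
files = [
    {"name": "track01.mp3", "title": "Live at the Chapel", "creator": "Example Band"},
    {"name": "track02.mp3", "title": "Live at the Chapel", "creator": "Example Band"},
    {"name": "letter.pdf", "title": "Letter to the Editor", "creator": "A. Writer"},
]


def group_into_items(files, fields=("title", "creator")):
    """Put files whose metadata matches on every listed field into the same item."""
    items = defaultdict(list)
    for f in files:
        # The shared metadata values act as the key for one item.
        key = tuple(f.get(field) for field in fields)
        items[key].append(f["name"])
    return dict(items)


items = group_into_items(files)
for metadata, names in items.items():
    print(metadata, "->", names)
```

With this sample data the two tracks land in one item, like the CD in the quote, and the letter gets an item of its own because its title and creator differ.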

A note if you are following along at home: the URL for the Internet Archive searching quickstart has changed. https://internetarchive.readthedocs.io/en/latest/quickstart.html#searching is now https://archive.org/services/docs/api/internetarchive/quickstart.html#searching

