New tool to export photos from Facebook pages and groups

A few years ago, Facebook famously started allowing users to download their own data, providing them with a zip file of all their photos and status updates. However, it has never offered such a feature for pages or groups. This new tool allows you to download all the photo albums for a page that you like or a group that you manage. It creates a metadata CSV file for all the photos, and provides you with a script that you can run on your local computer to download all the images. You can try it out at:

The source code is also available on GitHub.

Summer 2016 course project: born-digital archives

In this summer’s session of born-digital archives, students have been working on a course project that involves records on obsolete media (5.25-inch diskettes, 3.5-inch floppies, Zip disks, MiniDV tapes, etc.) as well as inactive records on network storage that originate in a variety of antiquated file formats (e.g., WordPerfect, email in MS Outlook Express format, etc.). Students are divided into three teams to tackle the project: a Digital Forensics team (working primarily with obsolete media), a Digital Preservation team (working primarily with format migration), and a Curation and Description team (working primarily on appraisal, arrangement and description). The collection comes from Pratt School of Information’s own files, and will eventually become available through the School’s on-site archives.

More information can be found in the course syllabus.

New open-source scripts

I wanted to share some new scripts that I have recently developed. These include:

BagIt Validation Script
For a given directory, this script validates all the “BagIt” bags in it, and sends an email to a designated address with the status of the bags. BagIt is a standard, with accompanying software originally developed by the Library of Congress, that is used to confirm the integrity of collections of files (e.g., no files deleted, no files tampered with, no files suffering from bit rot or other corruption). Written in Python and tested on Windows.
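To give a feel for what validating a bag involves, here is a minimal sketch. It is not the script described above: it uses only the Python standard library, assumes each bag carries a manifest-sha256.txt, checks only the payload manifest, and leaves out the email step. The function names are my own for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_bag(bag_dir):
    """Return a list of problems found in one bag (an empty list means OK)."""
    bag = Path(bag_dir)
    manifest = bag / "manifest-sha256.txt"  # assumption: SHA-256 manifest
    if not manifest.is_file():
        return ["missing manifest-sha256.txt"]
    problems = []
    listed = set()
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        parts = line.split(None, 1)  # "<checksum> <relative path>"
        if len(parts) != 2:
            problems.append(f"malformed manifest line: {line!r}")
            continue
        expected, rel_path = parts[0].lower(), parts[1].strip()
        listed.add(rel_path)
        target = bag / rel_path
        if not target.is_file():
            problems.append(f"missing file: {rel_path}")
        elif sha256_of(target) != expected:
            problems.append(f"checksum mismatch: {rel_path}")
    # Payload files that the manifest does not mention are also a problem.
    data_dir = bag / "data"
    if data_dir.is_dir():
        for path in sorted(data_dir.rglob("*")):
            rel = path.relative_to(bag).as_posix()
            if path.is_file() and rel not in listed:
                problems.append(f"unlisted file: {rel}")
    return problems


def check_all_bags(parent_dir):
    """Check every subdirectory that contains bagit.txt."""
    return {
        child.name: check_bag(child)
        for child in sorted(Path(parent_dir).iterdir())
        if (child / "bagit.txt").is_file()
    }
```

Calling check_all_bags on a directory returns a dictionary mapping each bag's name to its list of problems, which could then be formatted into a status report.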

File Normalization tools: WordPerfect to PDF
Doing born-digital archives work almost always seems to turn up WordPerfect (WPD) files. This script will go through a directory, including all subdirectories, and create PDF versions of all WPD files using MS Word for Windows. Requires Windows XP or later and MS Word for Windows.
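The directory-walking half of that job can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not the script itself: the function name is hypothetical, the PDF is assumed to sit beside the original, files that already have a PDF are skipped, and the actual conversion in MS Word is left out.

```python
import os


def plan_wpd_conversions(root):
    """Yield (source, target) path pairs for every .wpd file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            stem, ext = os.path.splitext(name)
            if ext.lower() != ".wpd":
                continue
            source = os.path.join(dirpath, name)
            target = os.path.join(dirpath, stem + ".pdf")
            # Assumption: a PDF already sitting beside the file means
            # the conversion was done on an earlier run.
            if not os.path.exists(target):
                yield source, target
```

Each pair could then be handed to whatever performs the conversion, so a failed run can simply be restarted without redoing finished files.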

Upcoming course projects (Fall and Summer 2016 semesters)

Below you will find the upcoming course projects that will be undertaken by my students in the Fall 2016 and Summer 2016 classes:

Fall 2016 – LIS 668-01 Projects in Moving Image and Sound Archives
The course project in this class will involve digitally reformatting the public access program Dyke TV and exhibiting it to the public, in collaboration with the Lesbian Herstory Archives. Below you will find some information about the program written by Erica Titkemeyer (2013):

In 1993, Dyke TV began as an access television show created by members of the New York City lesbian community (specifically Linda Chapman, Ana Simo, and Mary Patierno) at Prince St. and Broadway in Manhattan. The purpose was to produce news segments by, for, and about lesbian individuals and communities throughout the United States. The founders more specifically wished to document “rising lesbian activism and to provide a viable platform for lesbian voices to enter the realm of popular culture.” By the time the series came to an end thirteen years later in 2006, the production had reached a total of 78 public access channels, produced at least 322 total shows, and planted its office among the lesbian community in Park Slope, Brooklyn.

The project will involve working with a video collection on U-Matic videotape, which is endangered because of a declining number of units available for playing the format. Past student work digitized from LHA can be found at


New script: Archives Finder

Recent initiatives in accessioning born-digital archives have focused on removable media, such as using forensic tools to image media (e.g., 1, 2, 3, 4). However, there has been little discussion of the born-digital archiving needs of institutional archives. In institutional settings, terabytes of records with permanent value often reside on large, unstructured network drives, often alongside active records. For example, a National Archives of the UK blog post mentions that up to two-thirds of government information is held on unstructured shared drives, with some departments holding up to 190 terabytes of information.

Tools to identify batches of inactive records, such as the records of departed staff members or initiatives that have long ended, are often lacking, and those that exist are designed more for IT departments to manage disk space. To address this need, I created the script Archives Finder. It searches across large, unstructured network drives for the largest possible groupings of records that are at least a given number of years old, as defined by the user. It also includes a “fuzzy math” feature that allows the user to specify that only a certain percentage of the files need to be X years old. By default, 95% of the files must be at least 7 years old, but these values can be readily modified. The results are output as a CSV file that can be readily viewed in MS Excel.
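For readers curious how such a search might work, here is a rough Python sketch. It is not the Archives Finder code: the function name is hypothetical, file age is taken from the last-modified time, and "largest possible grouping" is interpreted as the topmost folder whose files (including those in subfolders) meet the threshold.

```python
import os
import time

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def find_inactive_dirs(root, min_age_years=7, threshold=0.95, now=None):
    """Return (directory, total files, old files) for the topmost
    directories in which at least `threshold` of files are old."""
    now = time.time() if now is None else now
    cutoff = now - min_age_years * SECONDS_PER_YEAR

    # Pass 1: walk bottom-up so each directory can add up its children.
    counts = {}
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        total = old = 0
        for name in filenames:
            try:
                mtime = os.stat(os.path.join(dirpath, name)).st_mtime
            except OSError:
                continue  # unreadable file: leave it out of the counts
            total += 1
            if mtime <= cutoff:
                old += 1
        for name in dirnames:
            sub_total, sub_old = counts.get(os.path.join(dirpath, name), (0, 0))
            total += sub_total
            old += sub_old
        counts[dirpath] = (total, old)

    # Pass 2: walk top-down and stop descending once a directory qualifies,
    # so only the largest qualifying grouping is reported.
    results = []
    for dirpath, dirnames, _filenames in os.walk(root):
        total, old = counts.get(dirpath, (0, 0))
        if total and old / total >= threshold:
            results.append((dirpath, total, old))
            dirnames[:] = []
    return results
```

Writing the returned rows out with Python's csv module would give a report along the lines of the CSV output described above.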

You can download the script, which runs on Windows machines, from GitHub.

Spring 2016 Courses

This semester is off to a nice start. In LIS 665 Projects in Digital Archives, students will be working to arrange, describe and digitize portions of a collection of architectural photography (with some landscape and craft photography) donated to the School by the estate of Bill Maris. You can check out the finding aid created by students last Fall here. One of Maris’ digitized photographs is shown here, depicting President Jimmy Carter making furniture.

In LIS 625 Management of Archives and Special Collections, students will continue arranging and describing a collection of records on the history of the school. The finding aid created by students last semester can be found here. You can find both course syllabi below:

Syllabus – LIS 665 Projects in Digital Archives
Syllabus – LIS 625 Management of Archives & Special Collections

New book project: Moving Image and Sound Collections for Archivists

[Update 8/6/17 – The book is now for sale at the SAA bookstore!]

I am pleased to announce that I am working on a new book project titled Moving Image and Sound Collections for Archivists to be published by the press of the Society of American Archivists.

Most archivists encounter, and most archives contain, some form of moving image and sound material. These can include recordings of events on video, oral histories captured on audiotape, and films created by independent filmmakers. The purpose of this book is to provide practical guidance to the archivist on how to preserve and make accessible the moving image and sound record. Although the moving image archivist may find value in this book, it is specifically targeted at the general archivist who deals primarily in paper-based collections and needs additional guidance, and at the student archivist interested in building out this expertise.


The decline of text on the web

I had a hunch that webpages were using less text than they used to. I put together a study that looks at the use of text on webpages since 1999, using the Internet Archive’s archived webpages in the Wayback Machine. I found that there has indeed been a decline, beginning around 2005.

You can read the paper online at Information Research:

The rise and fall of text on the Web: a quantitative study of Web archives

Update (Oct 17, 2015): I have also blogged about this study on the CILIP blog and the Web Archives for Historians blog.

Archiving Email Newsletters, or getting your Newsletters out of Constant Contact

In this blog post, I am going to offer a way to extract large batches of email newsletters from Constant Contact for the purposes of creating email archives, with each message saved as a PDF.

First, some background. I have recently finished an email archiving project for the History & Archives of Front Runners New York. The club had mailed paper newsletters since the early 1980s, but transitioned to email newsletters around 2004, and has been using Constant Contact as its newsletter software since 2007. They had managed to retain all the messages in Constant Contact, but not all the embedded images.

Constant Contact does not have an easy way to export sent messages in bulk. Thus, I created a script that leverages the Constant Contact API to export messages and the related metadata. For each message it creates a PDF that begins with a full-length image of the email message, followed by a JSON export of the message metadata and the text version of the email message (if available). This retains the look of the message while also making it text-searchable.
