Award Abstract # 1526700
III: Small: Increasing the Value of Existing Web Archives

NSF Org: IIS, Div Of Information & Intelligent Systems
Awardee:
Initial Amendment Date: August 19, 2015
Latest Amendment Date: August 19, 2015
Award Number: 1526700
Award Instrument: Standard Grant
Program Manager: James French
IIS, Div Of Information & Intelligent Systems
CSE, Direct For Computer & Info Scie & Enginr
Start Date: September 1, 2015
End Date: August 31, 2019 (Estimated)
Total Intended Award Amount:
Total Awarded Amount to Date: $481,780.00
Funds Obligated to Date:
History of Investigator:
  • Michael Nelson (Principal Investigator)
    mln@cs.odu.edu
  • Michele Weigle (Co-Principal Investigator)
Awardee Sponsored Research Office: Old Dominion University Research Foundation
4111 Monarch Way, Norfolk, VA 23508-2561, US
(757)683-4293
Sponsor Congressional District:
Primary Place of Performance:
Primary Place of Performance Congressional District:
DUNS ID:
Parent DUNS ID:
NSF Program(s): Info Integration & Informatics
Primary Program Source:
Program Reference Code(s): 7364, 7923
Program Element Code(s): 7364
Award Agency Code:
Fund Agency Code:
CFDA Number(s):

ABSTRACT

Web archiving is a thriving activity, but it remains at the fringes of the larger web community. Web archiving often meets two reactions: (1) who cares about the past? and (2) hasn't the Internet Archive solved this already? While the Internet Archive is the cornerstone of web archiving, much work remains to align archiving with the larger web community. The PIs will investigate a collection of methods and concepts to accelerate the adoption and utility of web archives. More than a dozen publicly available web archives (including the Internet Archive) are simultaneously accessible via the Memento protocol, yet these archives are mostly underutilized because they lack the APIs and services that would make them of greater immediate use to the live web. For example, rather than returning "HTTP 404" responses for pages that are not archived, the archives can introspect on their collections for replacement or similar pages. This project will research: (1) extended APIs for archives; (2) models and methods for archival quality; and (3) user tools and techniques for exploring and understanding temporality on the web. The broader impacts of this research will include increasing the ability of archives to record today's social discourse, which primarily occurs on the web, oftentimes with print or TV as secondary. The ability to publish data on the web far outstrips the ability to archive it for posterity. A number of public web archives are doing yeoman's work saving as much material as they can, but saving is only a precondition for later use. For the most part, these archived web pages are underutilized simply because the tools for extracting value from them are lacking. This project will research and build the tools, infrastructure, and methods to better utilize, understand, and interact with the archived materials that we already have.

Aside from their crawling, archives are mostly passive collections of content that offer little in the way of services other than answering "yes" or "no" to a request for an archived page. Even with the increased rate of archiving (and a greater number of active web archives), little analysis is done on the web archives to provide better services for incoming requests. The PIs will build on their prior API work to explore recommendation services for web pages, where even if an archive does not have the requested web page it can recommend a replacement page based on content and link analysis. This will prevent the web archives from being a dead end if they do not have the requested page. The PIs will also perform fundamental research on the quality of the reconstructed page, a topic that has been mostly ignored. In particular, the PIs are concerned with detecting and resolving "temporal violations": combinations of HTML pages and embedded resources that are presented to the user as a historical page but that in fact never existed in that combination on the live web. This occurs in at least 5% of the pages replayed through the Internet Archive. The other aspect of quality research deals with automatically assessing how damaged an archived page is with respect to its missing embedded resources. Straight percentages (e.g., this page is missing 3 of 57 embedded resources) do not tell the whole tale, but there are automated methods that can be used to estimate how important a missing resource was to the rendered page, even though the resource itself is unavailable. This will allow large-scale assessment not only of pages, but also of archive-wide performance for comparable time periods. Lastly, the PIs will focus on tools and methods for allowing users to better understand and interact with the archived web and temporal concepts in general. Users' understanding of temporal concepts is not well advanced, in part because the tools are not in place to allow them to better understand and build models for interaction. For further information see the web site at: http://ws-dl.cs.odu.edu/.
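
The contrast drawn above between a straight percentage of missing embedded resources and an importance-aware damage estimate can be illustrated with a minimal sketch. It is not the PIs' method: the EmbeddedResource type, the weight field, and the example weights are all hypothetical, and how such weights would be estimated for a resource that was never captured is exactly the research question the abstract leaves open.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EmbeddedResource:
    uri: str
    weight: float   # hypothetical importance estimate (assumed given, not derived here)
    archived: bool  # True if the archive holds a capture of this resource


def missing_fraction(resources: List[EmbeddedResource]) -> float:
    """Straight percentage: share of embedded resources that are missing."""
    if not resources:
        return 0.0
    missing = sum(1 for r in resources if not r.archived)
    return missing / len(resources)


def weighted_damage(resources: List[EmbeddedResource]) -> float:
    """Share of total estimated importance carried by the missing resources."""
    total = sum(r.weight for r in resources)
    if total == 0:
        return 0.0
    lost = sum(r.weight for r in resources if not r.archived)
    return lost / total


# Made-up page: 57 embedded resources, 3 of them missing.
# The 54 archived ones are small icons; the 3 missing ones are assumed to matter more.
page = [EmbeddedResource(f"icon{i}.png", 1.0, True) for i in range(54)]
page += [
    EmbeddedResource("style.css", 20.0, False),
    EmbeddedResource("hero.jpg", 15.0, False),
    EmbeddedResource("app.js", 11.0, False),
]

print(f"missing fraction: {missing_fraction(page):.3f}")
print(f"weighted damage:  {weighted_damage(page):.2f}")
```

Under these assumed weights the page is missing only about 5% of its resources by count, while the weighted estimate attributes nearly half of the page's importance to them, which is the kind of gap the abstract says a straight percentage hides.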

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

(Showing: 1 - 10 of 11)
Sawood Alam, Mat Kelly, Michael Nelson "InterPlanetary Wayback: The Permanent Web Archive" Proceedings of ACM/IEEE JCDL 2016, 2016, p.273 http://doi.acm.org/10.1145/2910896.2925467
Sawood Alam, Michael L. Nelson, Herbert Van de Sompel, Lyudmila L. Balakireva, Harihar Shankar, and David S. H. Rosenthal "Web Archive Profiling Through CDX Summarization" International Journal on Digital Libraries, v.17, 2016, p.223 http://dx.doi.org/10.1007/s00799-016-0184-4
Mat Kelly, Sawood Alam, Michael L. Nelson, Michele C. Weigle "InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives" Proceedings of TPDL 2016, 2016, p.411 https://doi.org/10.1007/978-3-319-43997-6_35
Mohamed Aturban, Mat Kelly, Sawood Alam, John A. Berlin, Michael L. Nelson, and Michele C. Weigle "ArchiveNow: Simplified, Extensible, Multi-Archive Preservation" Proceedings of ACM/IEEE JCDL 2018, 2018, p.321 http://dx.doi.org/10.1145/3197026.3203880
Sawood Alam, Mat Kelly, Michele C. Weigle, and Michael L. Nelson "Unobtrusive and Extensible Archival Replay Banners Using Custom Elements" Proceedings of ACM/IEEE JCDL 2018, 2018, p.319 http://dx.doi.org/10.1145/3197026.3203881
Sawood Alam, Michael L. Nelson, Herbert Van de Sompel, David S. H. Rosenthal "Web Archive Profiling Through Fulltext Search" Proceedings of TPDL 2016, 2016, p.121 https://doi.org/10.1007/978-3-319-43997-6_10
Mat Kelly, Michael L. Nelson, and Michele C. Weigle "A Framework for Aggregating Private and Public Web Archives" Proceedings of ACM/IEEE JCDL 2018, 2018, p.273 http://dx.doi.org/10.1145/3197026.3197045
Mat Kelly, Lulwah M. Alkwai, Sawood Alam, Michael L. Nelson, Michele C. Weigle, and Herbert Van de Sompel "Impact of URI Canonicalization on Memento Count" Proceedings of ACM/IEEE JCDL 2017, 2017, p.303 https://doi.org/10.1109/JCDL.2017.7991601
Sawood Alam, Mat Kelly, Michele C. Weigle, and Michael L. Nelson "Client-side Reconstruction of Composite Mementos Using ServiceWorker" Proceedings of ACM/IEEE JCDL 2017, 2017, p.237 https://doi.org/10.1109/JCDL.2017.7991579
Sawood Alam, Michael Nelson "MemGator - A Portable Concurrent Memento Aggregator: Cross-Platform CLI and Server Binaries in Go" Proceedings of ACM/IEEE JCDL 2016, 2016, p.243 http://doi.acm.org/10.1145/2910896.2925452
Lulwah Alkwai, Michael L. Nelson, and Michele C. Weigle "Comparing the Archival Rate of Arabic, English, Danish, and Korean Language Web Pages" ACM Transactions on Information Systems, v.36, 2017 https://dx.doi.org/10.1145/3041656
