Award Abstract # 1526700
III: Small: Increasing the Value of Existing Web Archives
| NSF Org: | IIS Div Of Information & Intelligent Systems |
| Awardee: | |
| Initial Amendment Date: | August 19, 2015 |
| Latest Amendment Date: | August 19, 2015 |
| Award Number: | 1526700 |
| Award Instrument: | Standard Grant |
| Program Manager: | James French; IIS Div Of Information & Intelligent Systems; CSE Direct For Computer & Info Scie & Enginr |
| Start Date: | September 1, 2015 |
| End Date: | August 31, 2019 (Estimated) |
| Total Intended Award Amount: | |
| Total Awarded Amount to Date: | $481,780.00 |
| Funds Obligated to Date: | |
| History of Investigator: | Michael Nelson (Principal Investigator), mln@cs.odu.edu; Michele Weigle (Co-Principal Investigator) |
| Awardee Sponsored Research Office: | Old Dominion University Research Foundation, 4111 Monarch Way, Norfolk, VA, US 23508-2561, (757)683-4293 |
| Sponsor Congressional District: | |
| Primary Place of Performance: | |
| Primary Place of Performance Congressional District: | |
| DUNS ID: | |
| Parent DUNS ID: | |
| NSF Program(s): | Info Integration & Informatics |
| Primary Program Source: | |
| Program Reference Code(s): | 7364, 7923 |
| Program Element Code(s): | 7364 |
| Award Agency Code: | |
| Fund Agency Code: | |
| CFDA Number(s): | |
ABSTRACT

Web archiving is a thriving activity, but remains at the fringes of the larger web community. Web archiving often runs into two opinions: (1) who cares about the past? and (2) hasn't the Internet Archive solved this already? While the Internet Archive is the cornerstone of web archiving, there remains much work to be done to align archiving with the larger web community. The PIs will investigate a collection of methods and concepts to accelerate the adoption and utility of web archives. While there are more than a dozen publicly available web archives (including the Internet Archive) that are simultaneously accessible via the Memento protocol, these archives are mostly underutilized because they lack the APIs and services to be of greater immediate use to the live web. For example, rather than returning "HTTP 404" responses for pages that are not archived, the archives can introspect on their collections for replacement or similar pages. This project will research: (1) extended APIs for archives; (2) models and methods for archival quality; and (3) user tools and techniques for exploring and understanding temporality on the web. The broader impacts of this research will include increasing the ability of archives to record today's social discourse, which primarily occurs on the web, often with print or TV as secondary outlets. The ability to publish data on the web far outstrips the ability to archive it for posterity. There are a number of public web archives that are doing yeoman's work saving as much material as they can, but saving is only a precondition for later use. These archived web pages are mostly underutilized because the tools for extracting value from the archives are lacking. This project will research and build the tools, infrastructure, and methods to better utilize, understand, and interact with the archived materials that we already have.
Aside from their crawling, archives are mostly passive collections of content that offer little in the way of services other than answering "yes" or "no" to a request for an archived page. Even with the increased rate of archiving (and a greater number of active web archives), there is little analysis of the archives' holdings to provide better services for incoming requests. The PIs will build on their prior API work to explore recommendation services for web pages: even if an archive does not hold the requested web page, it can recommend a replacement page based on content and link analysis. This will prevent the web archives from being a dead end when they do not have the requested page. The PIs will also perform fundamental research on the quality of the reconstructed page, a topic that has been mostly ignored. In particular, the PIs are concerned with detecting and resolving "temporal violations": combinations of HTML pages and embedded resources that are presented to the user as a historical page but that never existed in that combination on the live web. This occurs in at least 5% of the pages replayed through the Internet Archive. The other aspect of quality research deals with automatically assessing how damaged an archived page is with respect to its missing embedded resources. Straight percentages (e.g., this page is missing 3 of 57 embedded resources) do not tell the whole tale, but automated methods can estimate how important a missing resource was to the rendered page, even though the resource itself is unavailable. This will allow large-scale assessment not only of pages, but of archive-wide performance for comparable time periods. Lastly, the PIs will focus on tools and methods for allowing users to better understand and interact with the archived web and temporal concepts in general. Users' understanding of temporal concepts is not well advanced, in part because the tools are not in place to allow them to better understand and build models for interaction. For further information see the web site at: http://ws-dl.cs.odu.edu/.
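To illustrate why a straight percentage of missing embedded resources can understate the damage to an archived page, the following is a minimal sketch in Python. It is not the PIs' method: the type weights, the `area_fraction` field, and the `importance` formula are hypothetical values chosen only for illustration. The sketch contrasts the plain missing fraction (3 of 57) with a score that weights each resource by an assumed importance.

```python
from dataclasses import dataclass

# Hypothetical weights per resource type (illustrative assumptions only).
TYPE_WEIGHT = {"stylesheet": 3.0, "script": 2.0, "image": 1.0}


@dataclass
class EmbeddedResource:
    url: str
    kind: str             # e.g. "image", "stylesheet", "script"
    area_fraction: float  # assumed share of the rendered viewport, 0.0 to 1.0
    missing: bool         # True if the archive could not replay this resource


def importance(res: EmbeddedResource) -> float:
    """Assumed importance: type weight scaled by visible area (plus a small floor)."""
    return TYPE_WEIGHT.get(res.kind, 1.0) * (0.1 + res.area_fraction)


def missing_fraction(resources: list[EmbeddedResource]) -> float:
    """Straight percentage: the share of embedded resources that are missing."""
    if not resources:
        return 0.0
    return sum(r.missing for r in resources) / len(resources)


def weighted_damage(resources: list[EmbeddedResource]) -> float:
    """Share of total assumed importance carried by the missing resources."""
    total = sum(importance(r) for r in resources)
    if total == 0:
        return 0.0
    lost = sum(importance(r) for r in resources if r.missing)
    return lost / total


# A made-up page with 57 embedded resources, 3 of them missing.
page = [
    EmbeddedResource(f"http://example.org/icon{i}.png", "image", 0.0, False)
    for i in range(54)
] + [
    EmbeddedResource("http://example.org/site.css", "stylesheet", 0.0, True),
    EmbeddedResource("http://example.org/hero.jpg", "image", 0.5, True),
    EmbeddedResource("http://example.org/app.js", "script", 0.0, True),
]

print(f"missing fraction: {missing_fraction(page):.3f}")
print(f"weighted damage:  {weighted_damage(page):.3f}")
```

Under these assumed weights, the three missing resources (the page's stylesheet, its largest image, and a script) account for a much larger share of the score than 3 of 57 would suggest. The same per-page score could in principle be averaged across many pages to compare archives over comparable time periods.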
PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Note:
When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
Some links on this page may take you to non-federal websites. Their policies may differ from this site.
Sawood Alam, Michael L. Nelson, Herbert Van de Sompel, Lyudmila L. Balakireva, Harihar Shankar, and David S. H. Rosenthal
"Web Archive Profiling Through CDX Summarization"
International Journal on Digital Libraries, v.17, 2016, p.223
http://dx.doi.org/10.1007/s00799-016-0184-4

Mohamed Aturban, Mat Kelly, Sawood Alam, John A. Berlin, Michael L. Nelson, and Michele C. Weigle
"ArchiveNow: Simplified, Extensible, Multi-Archive Preservation"
Proceedings of ACM/IEEE JCDL 2018, 2018, p.321
http://dx.doi.org/10.1145/3197026.3203880

Sawood Alam, Mat Kelly, Michele C. Weigle, and Michael L. Nelson
"Unobtrusive and Extensible Archival Replay Banners Using Custom Elements"
Proceedings of ACM/IEEE JCDL 2018, 2018, p.319
http://dx.doi.org/10.1145/3197026.3203881

Mat Kelly, Michael L. Nelson, and Michele C. Weigle
"A Framework for Aggregating Private and Public Web Archives"
Proceedings of ACM/IEEE JCDL 2018, 2018, p.273
http://dx.doi.org/10.1145/3197026.3197045

Mat Kelly, Lulwah M. Alkwai, Sawood Alam, Michael L. Nelson, Michele C. Weigle, and Herbert Van de Sompel
"Impact of URI Canonicalization on Memento Count"
Proceedings of ACM/IEEE JCDL 2017, 2017, p.303
https://doi.org/10.1109/JCDL.2017.7991601

Sawood Alam, Mat Kelly, Michele C. Weigle, and Michael L. Nelson
"Client-side Reconstruction of Composite Mementos Using ServiceWorker"
Proceedings of ACM/IEEE JCDL 2017, 2017, p.237
https://doi.org/10.1109/JCDL.2017.7991579

Sawood Alam and Michael L. Nelson
"MemGator - A Portable Concurrent Memento Aggregator: Cross-Platform CLI and Server Binaries in Go"
Proceedings of ACM/IEEE JCDL 2016, 2016, p.243
http://doi.acm.org/10.1145/2910896.2925452

Lulwah Alkwai, Michael L. Nelson, and Michele C. Weigle
"Comparing the Archival Rate of Arabic, English, Danish, and Korean Language Web Pages"
ACM Transactions on Information Systems, v.36, 2017
https://dx.doi.org/10.1145/3041656