Celebrating Heritage, Promoting Our Future

Archive in WARC format

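WARC (Web ARChive) is a file format designed specifically for web archives. It is primarily used for the long-term preservation of captured web content, storing each resource together with the request and response details from when it was fetched.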

Nottingham University Digital Research has an excellent tutorial on setting up wget. Once it is installed on your PC, run one of these commands:

 

 

For a static HTML version

wget --mirror --recursive --convert-links --adjust-extension --random-wait --page-requisites --local-encoding=UTF-8 --no-parent -R "*.php, *.xml" https://[your website address]

 

For a WARC version

wget --mirror --recursive --convert-links --adjust-extension --random-wait --page-requisites --no-parent -R "*.php, *.xml" --warc-cdx --warc-file=[YOUR_FILENAME] https://[your website address]

The above will result in:

a large .warc.gz file containing all the website elements

a .cdx index file that lists all assets in this archive
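To make the difference between the two commands easier to see, here is a minimal sketch in Python that builds each one as an argument list and notes what every option does. The function name build_wget_command, the address https://example.org/ and the file name "example" are made up for illustration. The options are listed in a slightly different order from the commands above, which wget does not mind. The sketch only prints the command; it does not download anything.

```python
import shlex

# Flags shared by both commands in the text, with a note on what each does.
COMMON_FLAGS = [
    "--mirror",            # recursion plus timestamping, with unlimited depth
    "--recursive",         # already implied by --mirror; kept to match the text
    "--convert-links",     # rewrite links so the local copy can be browsed offline
    "--adjust-extension",  # add .html to HTML pages saved without that suffix
    "--random-wait",       # vary the pause between requests (scales --wait)
    "--page-requisites",   # also fetch the images and stylesheets each page needs
    "--no-parent",         # never climb above the starting directory
    "-R", "*.php, *.xml",  # skip files matching these patterns
]


def build_wget_command(url, warc_name=None):
    """Return the wget arguments for a static copy, or a WARC copy if warc_name is set."""
    cmd = ["wget"] + COMMON_FLAGS
    if warc_name is None:
        cmd.append("--local-encoding=UTF-8")  # only the static command sets this
    else:
        # Writes warc_name.warc.gz plus a warc_name.cdx index next to it.
        cmd += ["--warc-cdx", "--warc-file=" + warc_name]
    return cmd + [url]


# Print the WARC variant as a shell-ready line (hypothetical URL and file name).
print(shlex.join(build_wget_command("https://example.org/", warc_name="example")))
```

Calling build_wget_command with only a URL gives the static HTML variant instead. If you do want Python to run the command, the list can be passed as it is to subprocess.run.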

You will then need:

A programme to expand the .gz file

https://www.7-zip.org/

A programme to view the webpages

https://github.com/webrecorder/replayweb.page/releases

A way of viewing the .cdx index file

https://glogg.bonnefon.org/download.html
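If you would rather not install extra software for the first and last of these steps, both can be roughly approximated with Python's standard library. This is only a sketch: the function names expand_gz and read_cdx are made up here, and read_cdx assumes the .cdx file is space-separated with a first line that starts with "CDX" and names the columns.

```python
import gzip
import shutil


def expand_gz(gz_path):
    """Decompress NAME.warc.gz to NAME.warc in the same folder and return the new path."""
    if not gz_path.endswith(".gz"):
        raise ValueError("expected a path ending in .gz")
    out_path = gz_path[:-3]
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        # Copy in chunks so a large archive is never held in memory at once.
        shutil.copyfileobj(src, dst)
    return out_path


def read_cdx(cdx_path, limit=10):
    """Return (column_letters, rows) for the first `limit` entries of a CDX file."""
    with open(cdx_path, encoding="utf-8", errors="replace") as f:
        header = f.readline().split()
        # Assumption: the first line looks like "CDX a b ..." and names the columns.
        fields = header[1:] if header and header[0] == "CDX" else header
        rows = []
        for line in f:
            if len(rows) >= limit:
                break
            if line.strip():
                rows.append(line.split())
    return fields, rows
```

For example, expand_gz("example.warc.gz") would write example.warc alongside it, and read_cdx("example.cdx") would return the column letters plus the first ten entries, each as a list of values.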

 

 

A tutorial on how to open WARC files is here
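For a quick look inside a WARC file without a viewer, the sketch below lists the header fields of each record using only Python's standard library. It assumes the usual record layout (a "WARC/" version line, "Name: value" header lines, a blank line, then a body of Content-Length bytes) and ignores rarer features such as headers folded across several lines. The function name iter_warc_headers is made up for illustration.

```python
import gzip


def iter_warc_headers(path):
    """Yield a dict of header fields (lower-cased names) for each record in a WARC file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rb") as stream:
        while True:
            line = stream.readline()
            if not line:
                return  # end of file
            if not line.startswith(b"WARC/"):
                continue  # blank separator lines between records
            headers = {}
            while True:
                line = stream.readline()
                if line in (b"\r\n", b"\n", b""):
                    break  # a blank line ends the header section
                name, _, value = line.decode("utf-8", "replace").partition(":")
                headers[name.strip().lower()] = value.strip()
            # Read past the body so the next pass starts at the next record.
            # This loads one body at a time, which is fine for a sketch.
            stream.read(int(headers.get("content-length", 0)))
            yield headers
```

To print the address of every captured response you could loop over iter_warc_headers("example.warc.gz") and, for records whose "warc-type" is "response", print the "warc-target-uri" value.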

See also https://wiki.archiveteam.org/index.php/Wget_with_WARC_output 

 
