Archive in WARC format
WARC (Web ARChive) is a file format specifically designed for web archives. It's primarily used for the long-term preservation of digital data.
Nottingham University Digital Research has an excellent tutorial on setting this up. Once wget is installed on your PC, run one of these commands.
For a static HTML version:
wget --mirror --recursive --convert-links --adjust-extension --random-wait --page-requisites --local-encoding=UTF-8 --no-parent -R "*.php, *.xml" https://[your website address]
For a WARC version
wget --mirror --recursive --convert-links --adjust-extension --random-wait --page-requisites --no-parent -R "*.php, *.xml" --warc-cdx --warc-file=[YOUR_FILENAME] https://[your website address]
The above will result in
a large .warc.gz file containing all the website elements
a .cdx log file that lists all assets in this archive
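Both output files can be sanity-checked from the command line before you rely on them. The following is a minimal sketch, assuming gzip is installed; check_archive is a hypothetical helper (not part of wget), and its argument is whatever name you passed to --warc-file.

```shell
# check_archive NAME: sanity-check NAME.warc.gz and NAME.cdx.
# Hypothetical helper for illustration; NAME is the value given to --warc-file.
check_archive() {
    name="$1"
    # gzip -t tests the integrity of the compressed file without extracting it.
    gzip -t "$name.warc.gz" || return 1
    # The CDX index is plain text; it should exist and be non-empty.
    [ -s "$name.cdx" ] || return 1
    echo "$name: archive and index look intact"
}
```

For example, after running wget with --warc-file=mysite, calling check_archive mysite in the same folder should print a one-line confirmation, or return a non-zero status if either file is missing or damaged.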
You will then need
a programme to expand the .gz file
a programme to view the web pages
https://github.com/webrecorder/replayweb.page/releases
and
a way of viewing the .cdx log file
https://glogg.bonnefon.org/download.html
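On Linux or macOS the gzip tool that usually ships with the system is enough to expand the archive. The sketch below assumes gzip and grep are available; expand_warc and count_records are hypothetical helper names, and the record count is only a rough figure, because it counts every line that starts with a WARC version string.

```shell
# expand_warc NAME: write NAME.warc next to NAME.warc.gz, keeping the original.
expand_warc() {
    name="$1"
    # gzip -dc decompresses to standard output, so the .gz file is left untouched.
    gzip -dc "$name.warc.gz" > "$name.warc"
}

# count_records NAME: rough count of the records in the expanded NAME.warc.
# Each record begins with a version line such as "WARC/1.0".
count_records() {
    # -a treats the file as text even though it may hold binary payloads.
    grep -a -c '^WARC/1\.' "$1.warc"
}
```

Calling expand_warc mysite followed by count_records mysite gives a quick impression of how much was captured. The expanded file can be much larger than the compressed one, so check free disk space first.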
A tutorial on how to open WARC files is here
See also https://wiki.archiveteam.org/index.php/Wget_with_WARC_output