Question

Combining multiple WARCs from Archive-It

  • 12 July 2024
  • 1 reply
  • 46 views

We recently started using Preservica 7.2.1 (Professional Plus).

We want to ingest website crawls created in Archive-It. Each Archive-It crawl produces multiple WARC files, which we are able to harvest using the WASAPI tool.

My question is: has anyone recently ingested multiple WARC files from a single Archive-It website crawl and integrated them so that Replay renders them as a single viewable resource?

I would love to get notes on your workflow!

Thanks in advance,

Elizabeth Altman

 

1 reply


Hi Elizabeth

Check out this (somewhat) related thread:

As you can see, the individual WARC files for a crawl are stored in Preservica as multiple content objects under an asset.

I am not sure it would work, but I wonder whether you could use PAX ingest, as described in the Standard Workflows document here, to produce the required hierarchy.

If that works, you would probably also need to adjust the metadata so that it points to the starting URL for the replay. I believe this has been described in another post, so you should be able to find out how to do it here.
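
As a possible first step before any ingest, the WARC files returned by WASAPI could be grouped by crawl so that each crawl's files can be staged together for a single asset. The sketch below is only an illustration under stated assumptions: it assumes each file entry in the WASAPI response carries "filename", "filetype" and "crawl" keys, the function name group_warcs_by_crawl is hypothetical, and the sample entries are made up. It does not call Preservica or Archive-It at all.

```python
from collections import defaultdict


def group_warcs_by_crawl(files):
    """Group WASAPI-style file entries by crawl ID.

    Assumes each entry is a dict with "filename", "filetype" and "crawl"
    keys (an assumption about the response shape, not a confirmed schema).
    Returns a dict mapping crawl ID to a sorted list of WARC filenames.
    """
    groups = defaultdict(list)
    for entry in files:
        # Skip anything that is not a WARC (e.g. derivative files).
        if entry.get("filetype") != "warc":
            continue
        groups[entry.get("crawl")].append(entry["filename"])
    return {crawl: sorted(names) for crawl, names in groups.items()}


# Made-up sample entries, for illustration only.
sample_files = [
    {"filename": "example-crawl-b-00001.warc.gz", "filetype": "warc", "crawl": 1001},
    {"filename": "example-crawl-a-00000.warc.gz", "filetype": "warc", "crawl": 1001},
    {"filename": "example-other-00000.warc.gz", "filetype": "warc", "crawl": 1002},
    {"filename": "example-crawl-a-00000.wat.gz", "filetype": "wat", "crawl": 1001},
]

if __name__ == "__main__":
    for crawl_id, names in group_warcs_by_crawl(sample_files).items():
        print(crawl_id, names)
```

Each resulting group would then correspond to the set of WARC files intended to sit under one asset; how they are actually packaged for ingest still depends on the workflow described in the Preservica documentation.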
