
I’m saving our press releases that are emailed to staff and faculty with a link. The link takes you to a webpage where the release is posted. I can use Conifer to record the webpage and ingest it into Preservica, but for the best discoverability, it would be nice if these articles could be OCR’d. It doesn’t seem like that’s a feature in Starter, which I understand. How are other Starter users compensating for this?

My own method so far: if there’s a photo (which is most of the time), I make subject headings for the people identified. I had an article where the photo named only one person, but the article named seven or eight others. I decided to list the names in the Description field of my DC metadata. Don’t ask me why. I’m trying different approaches.

Am I missing something obvious or do you have better methods?

I don't have the answers for you, but this is a good question. Does Preservica Starter search live text (i.e., non-rasterized and non-vectored) embedded inside a PDF? If it does, you could use another tool like ABBYY reader to OCR the text before uploading to PS. Better, can you request the source file and convert it into a searchable PDF/A prior to upload?


I know that the Enterprise edition does have OCR capabilities. I wonder if there are any plans to implement this function in Starter/Starter Plus...


Text-based PDFs and Word documents should be full-text searchable. Sometimes it takes a while (up to a few days) for this to be completed, but it should work. For instance, I recently uploaded a PDF created from a Word doc, and after a couple of days it was showing up in my search results. OCR is not otherwise included until Enterprise, as Ashley mentioned here, but there are free solutions for OCR that you can use prior to ingesting into Starter.
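If you'd rather script the OCR step than run it by hand, here is a minimal sketch. It assumes the press release has already been saved as a PDF and that the free, open-source OCRmyPDF command-line tool is installed. OCRmyPDF isn't named anywhere above; it's just one free option I'm using for illustration. The function names are made up for this example.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocr_command(source: Path, target: Path) -> list[str]:
    """Build an OCRmyPDF command line that adds a searchable text layer."""
    return [
        "ocrmypdf",
        "--skip-text",            # leave pages that already contain live text alone
        "--output-type", "pdfa",  # write PDF/A, as suggested earlier in this thread
        str(source),
        str(target),
    ]


def ocr_before_ingest(source: Path, target: Path) -> Path:
    """Run OCR on one PDF and return the path of the searchable copy."""
    # Assumption: OCRmyPDF is installed separately and available on PATH.
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf was not found on PATH")
    subprocess.run(build_ocr_command(source, target), check=True)
    return target
```

Calling `ocr_before_ingest(Path("press-release.pdf"), Path("press-release-ocr.pdf"))` (hypothetical file names) should leave you with a copy that has a text layer, which you would then ingest into Starter in place of the image-only file.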

Here’s an article from TechRadar on OCR solutions (free options are listed at the end).

Hope this is helpful! 

