Survey: Bulk ingest into Preservica using a CSV file

Forum|Forum|4 years ago
November 4, 2021

David
Community Manager

Hi All

We are currently working on a new feature enabling you to bulk ingest assets and metadata into Preservica using a CSV file.

It would be great if you could take a couple of minutes to answer a short survey, so we can learn more about your requirements and develop a solution that truly meets your needs.

The survey will close on Friday, 12th November 2021.

Please complete the short survey here

Thanks in advance.

Michaela

Product Director, User Value

kate_s
Known Participant
November 4, 2021

Would this mean we wouldn’t have to figure out how to translate the content in the CSV file to a .metadata XML document for every asset we ingest? In other words, would your system take our CSV file and do all that translation for us? Right now we are bogged down in having to come up with a variety of scripts that can translate the data, and it’s a long and complex process.

Kate

--kate s. (she/her)

Anonymous
November 5, 2021

Hi Kate,

Yes, you will be able to use a CSV file instead of an XML document to ingest content and metadata into Preservica.

Kind regards,

Michaela

kate_s
Known Participant
December 1, 2021

Hi Michaela:

I watched your presentation video on this topic from November 2021. Thank you very much. It was very helpful and interesting. I have a few questions about this upcoming feature.

1. What is the projected timeline for the launch of this feature?

2. I don’t have Starter; I have another one of your products. (I believe it’s Cloud, but I’m not entirely sure because I don’t know where to look that up.) Will this feature be available to me?

3. In my current instance of Preservica, the system knows which .metadata file is paired with which asset because the names are identical, except that “.metadata” is added to the end of the metadata file name. With the metadata in a CSV file, how will the system know which line to pair with which asset?

4. Will it be possible to customize the format of the CSV file? In my current workflow, I create an Excel file with particular columns of data. For example, if I have ten PDF articles to ingest, I have an Excel file with ten rows of metadata.

Then I upload the Excel file into OpenRefine and normalize the data. (I normalize the format of the numbers and I split some columns into multiple columns. For example, if I have a column with a list of three keywords, I can split that into three columns.) Then I export the file from OpenRefine using a script that maps each column to the appropriate MODS field and generates a TXT file of code based on my entire Excel file. Once I have this TXT file, I can use a script to split the code into separate XML files, one for each item. In my earlier example, I’d now have ten XML files.

I then convert those .xml files to .metadata files by changing the ending. These files have exactly the same names as my PDFs, save for the .metadata ending. Then I ingest all these files (the PDFs and their paired .metadata files) into Preservica, and Preservica automatically matches the metadata to the correct file.

If I’m understanding you correctly, this new feature will allow me to skip OpenRefine and the scripts that generate the XML code. Instead, I could upload all ten PDFs plus my one Excel CSV file, and Preservica would automatically pair the data from the correct row and column with the correct PDF metadata field. Am I understanding that correctly? And will it matter that my Excel document has columns that our team has curated specifically for our content?
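
For readers following the workflow described above, here is a rough Python sketch of the manual step the CSV feature is meant to replace: turning each spreadsheet row into a MODS record stored under the asset's file name plus ".metadata". It is only an illustration, not Kate's actual OpenRefine script. The column names (filename, title, keywords), the semicolon-separated keyword list, and the choice of MODS elements are all assumptions.

```python
import csv
import io
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)


def row_to_mods(row):
    """Build a small MODS record from one spreadsheet row.

    The columns used here ("title", "keywords") are hypothetical.
    """
    mods = ET.Element(f"{{{MODS_NS}}}mods")
    title_info = ET.SubElement(mods, f"{{{MODS_NS}}}titleInfo")
    ET.SubElement(title_info, f"{{{MODS_NS}}}title").text = row["title"]
    # Split a semicolon-separated keyword list into repeated subject/topic fields.
    for keyword in row["keywords"].split(";"):
        keyword = keyword.strip()
        if keyword:
            subject = ET.SubElement(mods, f"{{{MODS_NS}}}subject")
            ET.SubElement(subject, f"{{{MODS_NS}}}topic").text = keyword
    return mods


def metadata_files(csv_text):
    """Return a dict mapping "<asset file name>.metadata" to its XML string."""
    files = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        xml = ET.tostring(row_to_mods(row), encoding="unicode")
        # Same name as the asset, with ".metadata" added to the end.
        files[row["filename"] + ".metadata"] = xml
    return files


SAMPLE = (
    "filename,title,keywords\n"
    "article01.pdf,First article,ecology; soil\n"
    "article02.pdf,Second article,\n"
)
files = metadata_files(SAMPLE)
```

In a real run, each entry in `files` would be written to disk next to the matching PDF before ingest; the sketch keeps everything in memory so it can be tried without touching any files.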

Thank you very much.

Best,

Kate

--kate s. (she/her)

Anonymous
December 3, 2021

Hi Kate,

Many thanks for watching the video. Here are the answers to your questions:

1. The launch of this new feature for Starter users will be in January 2022.

2. The feature will be available to CE users next year, probably in the second half of 2022.

3. The file will contain asset titles, which is how the system knows which metadata to pair with which asset.

4. You will be able to skip the OpenRefine step, but please bear in mind that the Excel documents will need to have exactly the same columns as the version of DC and MODS supported by Starter, and that this initial version of the feature will not support custom metadata fields. So the format of your metadata file needs to be exactly the same as the DC or MODS templates made available in the product. We will be adding support for custom metadata fields next year, which will provide more flexibility when uploading metadata.
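
To illustrate the pairing described above, here is a small Python sketch of matching CSV rows to assets by title. It is not Preservica's implementation, and the thread does not say exactly how titles are compared. The sketch assumes that the title in the CSV equals the asset's file name without its extension, and the column name "title" and the function name are hypothetical.

```python
import csv
import io
from pathlib import PurePath


def pair_rows_to_assets(csv_text, asset_names, title_column="title"):
    """Pair each asset with the CSV row whose title matches its file stem.

    Assumption: a title matches when it equals the file name minus extension.
    Returns (paired, unmatched), where paired maps asset name -> row dict.
    """
    rows = {row[title_column]: row for row in csv.DictReader(io.StringIO(csv_text))}
    paired, unmatched = {}, []
    for name in asset_names:
        key = PurePath(name).stem
        if key in rows:
            paired[name] = rows[key]
        else:
            unmatched.append(name)
    return paired, unmatched


paired, unmatched = pair_rows_to_assets(
    "title,creator\nreport-a,Smith\nreport-b,Jones\n",
    ["report-a.pdf", "report-c.pdf"],
)
```

Reporting the unmatched assets separately makes it easy to spot a typo in a title before anything is ingested.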

Kind regards,

Michaela

kate_s
Known Participant
December 7, 2021

Thanks for your reply. Where can I see ahead of time what the columns will be? And does it need to be exact down to the number of repetitions? For example, if your template has space for 4 associated authors, am I limited to 4 and cannot add a 5th or 6th like I can in my own spreadsheet?

Also, I am not sure where to post this, but I’m unable to access the recording for “Recording: Starter user community workshop – December 2021.” The link goes to a protected YouTube link at the moment. Thank you!

Kate

--kate s. (she/her)

David
Author
Community Manager
December 8, 2021

Hi Kate, you should now be able to view the recording from the Starter user community workshop (thanks for highlighting).

Watch the recording

David Portman

Anonymous
December 8, 2021

Hi Kate,

Thanks for your question:

Where can I see ahead of time what the columns will be? And does it need to be exact down to the number of repetitions? For example, if your template has space for 4 associated authors, am I limited to 4 and cannot add a 5th or 6th like I can in my own spreadsheet?

We will make the templates available to you, but we don’t have these as yet. The columns in your spreadsheet need to correspond exactly to the columns in the template.
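
Since the columns have to correspond exactly to the template, a quick pre-flight check of a spreadsheet's header row might look like the Python sketch below. The template columns shown are placeholders, because the real DC and MODS templates had not been published at the time of this thread. Treating column order as significant is also an assumption.

```python
import csv
import io

# Hypothetical template header; substitute the columns from the real template.
TEMPLATE_COLUMNS = ["title", "creator", "date", "description"]


def check_header(csv_text, template=TEMPLATE_COLUMNS):
    """Compare the CSV header row with the template columns.

    Returns the missing and extra columns, plus whether the header is an
    exact match (same columns, same order).
    """
    header = next(csv.reader(io.StringIO(csv_text)), [])
    missing = [column for column in template if column not in header]
    extra = [column for column in header if column not in template]
    return {"missing": missing, "extra": extra, "exact": header == list(template)}


custom = check_header("title,creator,date,description,Parent Volume\n")
matching = check_header("title,creator,date,description\n")
```

A custom column such as "Parent Volume" shows up under "extra", which is the kind of field the initial version of the feature would not accept.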

Could I ask whether you already have a spreadsheet you would like to use or whether you are just anticipating a need for multiple author fields for example?

Kind regards,

Michaela

kate_s
Known Participant
December 8, 2021

Could I ask whether you already have a spreadsheet you would like to use or whether you are just anticipating a need for multiple author fields for example?

I do have a spreadsheet that I'm already using. Using a script in OpenRefine, I generate companion XML documents for all the PDFs and images being ingested. My spreadsheet has customized headers, including one primary contributor and space for four associated contributors. Because fields are repeatable in MODS, I could technically add more if needed. (Scientific papers often have lots of authors.) My spreadsheet has headers your template won’t have, like “Parent Volume” and “Parent Number,” which get mapped to MODS like this:

<mods:detail type="volume">
  <mods:number>23</mods:number>
</mods:detail>
<mods:detail type="number">
  <mods:number>1</mods:number>
</mods:detail>
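
As an illustration of that mapping, the Python sketch below builds the same two detail elements from the “Parent Volume” and “Parent Number” columns. It is not the OpenRefine script itself. Wrapping the details in a mods:part element follows the MODS schema, but whether the original script does the same is an assumption, as is the function name.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)


def parent_details(row):
    """Map the "Parent Volume" and "Parent Number" columns to mods:detail.

    Blank columns are skipped, since many fields are empty for a given item.
    The mods:part wrapper is an assumption based on the MODS schema.
    """
    part = ET.Element(f"{{{MODS_NS}}}part")
    for column, detail_type in (("Parent Volume", "volume"), ("Parent Number", "number")):
        value = (row.get(column) or "").strip()
        if not value:
            continue
        detail = ET.SubElement(part, f"{{{MODS_NS}}}detail", {"type": detail_type})
        ET.SubElement(detail, f"{{{MODS_NS}}}number").text = value
    return part


xml = ET.tostring(
    parent_details({"Parent Volume": "23", "Parent Number": "1"}),
    encoding="unicode",
)
```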

Including all possible fields and the title, I have 61 potential pieces of metadata I fill in on my Excel spreadsheet for any given item. Of course, many of those fields end up being blank because chances are slim all 61 fields will apply to any given item.

It sounds like your CSV solution is for Starter users, not for Cloud users, so it won’t apply to us anyway, although it would be much easier than the roundabout way we’re doing it now. I’m very interested to see if this might be applicable to our situation, which is as a University that will be ingesting student and faculty scholarship, as well as some archival photos and other historically significant materials.

The asset submission functionality is also something I'm very interested in, as right now we have to build out our own submission forms. But I understand that won’t be available to Cloud users either, at least not in the foreseeable future.

Kate

--kate s. (she/her)

alarge
Known Participant
January 11, 2022

Hi-

I was just wondering when the CSV template for bulk ingest metadata will be available. I think we’re on track for January (2022)?

Thanks!

Ashley

Town Archivist

Town of Bedford, MA

Anonymous
January 12, 2022

Hi Ashley,

We should be on track for an end-of-January release. You will be able to find out more in the upcoming Community Workshop:

Kind regards,

Michaela

alarge
Known Participant
January 12, 2022

Thank you!

hardenj
Participating Frequently
January 19, 2022

Is there any additional information on when this feature will be available to EPC users?

stamar
New Participant
August 17, 2022

Is this feature available to CE users yet?
