Portable Gramps XML Import/Export?

cdhorn · September 21, 2018, 3:29am

Any chance of eventually adding support for the Portable Gramps XML format or base Gramps XML format for both importing and exporting trees? Preferably the portable format though since it packages the media items as well.

DallanQ · September 22, 2018, 11:03pm

I’m unfamiliar with the Portable Gramps XML format. Do you know how many programs outside of Gramps support it? As a possible alternative, I’ve been thinking about either exporting a zip file, which would include your gedcom file and all of your media, or giving you the ability to export your media to dropbox or google drive and having the exported gedcom reference the files from there.

cdhorn · September 22, 2018, 11:29pm

My guess would be Gramps is the only thing that supports it at this time although I could be wrong. The “portable” format is basically their XML format for the data with the media packaged with it, I think in .tar.gz format. It’s basically that idea of being able to export/import the media with everything else… what you are suggesting with a zip format archive is similar and works just as well, and using dropbox or google drive are really great ideas too.
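For anyone curious, here is a minimal Python sketch of peeking inside such a package. It assumes the guess above is right and the package is a gzip-compressed tar archive; the member names `data.gramps` and `media/photo.jpg` are made up for the demo and are not taken from a real Gramps export.

```python
import io
import tarfile


def list_package_members(package_bytes: bytes) -> list[str]:
    """Return the member names of a gzip-compressed tar archive held in memory."""
    with tarfile.open(fileobj=io.BytesIO(package_bytes), mode="r:gz") as archive:
        return archive.getnames()


def build_demo_package() -> bytes:
    """Build a tiny in-memory .tar.gz with hypothetical member names."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        demo_members = [
            ("data.gramps", b"<database/>"),   # hypothetical XML payload
            ("media/photo.jpg", b"\xff\xd8"),  # hypothetical media file
        ]
        for name, payload in demo_members:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


if __name__ == "__main__":
    print(list_package_members(build_demo_package()))
```

A zip-based export would look much the same with the standard `zipfile` module in place of `tarfile`.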

Actually, it’d be nice to be able to add media items by just adding a link to the document in Google Drive and either referencing it there or importing it directly from there, without a download/upload through the end user’s machine.

Separate note, but media related: I noticed the Gedcom export from Ancestry has the FILE tags in it for the media items one has uploaded and attached to their tree. Not sure what stack you are using behind the scenes, but you probably could develop a custom Gedcom importer for Ancestry files that uses Selenium on the back end to log in to Ancestry and extract the images during the import. Not sure if you guys have worked with it before; it’s a great tool for that stuff when a vendor doesn’t expose a REST or other API to get at stuff.

DallanQ · September 23, 2018, 12:04am

I like the idea of being able to link to an existing file in dropbox / google drive and referencing it there. I’m not sure it would work unless everyone you invited to your tree also had access to your dropbox / google drive folder. But it seems worth looking into.

I haven’t been able to get the Ancestry FILE links to work. Even when I manually log into Ancestry in my browser, then copy the FILE link into the browser window, they still don’t work for me. Do they work for you? I would love it if I could get those links to work.

cdhorn · September 23, 2018, 12:17am

Interesting… the ones in the Gedcom I downloaded a couple days ago work for me, at least at the moment they do. They do not pull up the actual image, they pull up a view media page. You could then navigate the DOM and click view original and then find the image tag and URL that downloads the actual image.
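As a rough illustration of the "find the image tag and URL" step, here is a minimal sketch using only the Python standard library. It just collects the `src` of every `img` tag in a page. The HTML in the demo is invented; the real view media page will have its own structure, and finding the right image in it is left as an assumption.

```python
from html.parser import HTMLParser


class ImageSourceCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:  # skip <img> tags that carry no src attribute
                self.sources.append(src)


def find_image_sources(html: str) -> list[str]:
    """Return image URLs in document order."""
    collector = ImageSourceCollector()
    collector.feed(html)
    collector.close()
    return collector.sources


if __name__ == "__main__":
    # Hypothetical wrapper page, not real Ancestry markup.
    page = '<html><body><a href="#">View original</a><img src="https://example.com/a.jpg"></body></html>'
    print(find_image_sources(page))
```

A page that builds its image tags with JavaScript would need a browser-driven tool such as Selenium instead, since a plain HTML parser only sees the markup as served.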

My tree on there is public, I am curious, try this one and let me know if it works:
http://trees.ancestry.com/rd?f=image&guid=d79eaa8f-e5e4-40ff-84db-5fe43c3a2734&tid=81556269&pid=10

DallanQ · September 23, 2018, 12:58am

Wow, I can see that image fine. I’ll have to look into this again. I don’t know why the FILE links didn’t work for me earlier. Thank you for bringing this to my attention!

cdhorn · September 23, 2018, 1:01am

Oh you’re welcome. Thank you for helping create such a great service here!

cdhorn · September 24, 2018, 12:38am

Hey Dallan, I was playing with this a bit this afternoon to get a better feel for navigating Ancestry.com, and I have some working Python code that will parse a Gedcom, log in to Ancestry, and download all of the unique images for the user-contributed / attached media. As each GUID represents a unique image, I key off that to only download them once. If this would be useful to you, I can spruce it up so you can call it as a command line utility and have it spit out the results in JSON format for you to grab and work with in whatever code you wrote to do the regular import.
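To give a feel for the "key off the GUID" idea, here is a minimal sketch, not the actual code. It assumes the FILE values look like the link posted earlier in this thread, with the GUID carried in a `guid` query parameter; the function name is hypothetical, and the login and download steps are left out.

```python
from urllib.parse import parse_qs, urlparse


def unique_media_by_guid(gedcom_text: str) -> dict[str, str]:
    """Map each guid found in a FILE URL to the first URL that carried it."""
    media: dict[str, str] = {}
    for raw_line in gedcom_text.splitlines():
        # A GEDCOM line is "<level> <tag> <value>"; only FILE lines with a value matter here.
        parts = raw_line.strip().split(" ", 2)
        if len(parts) < 3 or parts[1] != "FILE":
            continue
        url = parts[2]
        guids = parse_qs(urlparse(url).query).get("guid")
        if guids and guids[0] not in media:
            media[guids[0]] = url  # first occurrence wins, later repeats are skipped
    return media


if __name__ == "__main__":
    sample = "\n".join([
        "0 @I1@ INDI",
        "1 OBJE",
        "2 FILE http://trees.ancestry.com/rd?f=image&guid=d79eaa8f-e5e4-40ff-84db-5fe43c3a2734&tid=81556269&pid=10",
        "2 FORM jpg",
    ])
    for guid, url in unique_media_by_guid(sample).items():
        print(guid, url)
```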

cdhorn · September 26, 2018, 3:36am

Each GUID appears to represent a unique linking of an image instance rather than the image itself. If you attach an image someone else shared to two different people in Ancestry, and do so separately, they are tracked separately and you get two separate downloads of the same image. You would probably want to catch that during the download process and reswizzle the Gedcom file to account for it before performing the actual import.
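One way to catch those duplicates, sketched under the assumption that duplicate downloads are byte-for-byte identical: hash each download and point every GUID at the first GUID seen with the same content. The function name and the in-memory dict of downloads are illustrative only.

```python
import hashlib


def dedupe_by_content(downloads: dict[str, bytes]) -> dict[str, str]:
    """Map every GUID to a canonical GUID whose downloaded bytes are identical."""
    first_guid_for_digest: dict[str, str] = {}
    canonical: dict[str, str] = {}
    for guid, payload in downloads.items():
        digest = hashlib.sha256(payload).hexdigest()
        # setdefault keeps the first GUID seen for a given digest.
        canonical[guid] = first_guid_for_digest.setdefault(digest, guid)
    return canonical


if __name__ == "__main__":
    demo = {"guid-a": b"same bytes", "guid-b": b"same bytes", "guid-c": b"other"}
    print(dedupe_by_content(demo))
```

The resulting mapping could then drive the rewrite of the Gedcom, replacing each FILE reference with the one for its canonical GUID. Images that were re-encoded or resized would not match by hash and would need some other comparison.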

Over my DSL link at home it takes 8-10 seconds to process an image, which includes extracting all the metadata for it on the edit page in addition to the download.

DallanQ · September 28, 2018, 6:14pm

I’ve been working on a desktop app that lets you upload media attached to your GEDCOM via FILE paths from desktop genealogy applications. It seems like it would be a good fit to add this functionality to that desktop application.

cdhorn · September 30, 2018, 11:13pm

You know, before I started playing with this stuff I should have looked around more. Other people have done this stuff before and much more and there is code floating around out there to do so. This one doesn’t target the FILE entries which would be considered user content but the Ancestry content specified in the _APID entries:

https://nerok00.github.io/ancestry-image-downloader

Note the warning about their terms and conditions, which are worded differently here in the US. I interpret the US wording as allowing me to download the results relevant to my research and if I was to store them elsewhere as part of my research I’d think that would be acceptable. AFAIK their FamilySync technology that is supported by FamilyTreeMaker does just that. I’m curious did you ever approach them about adding support for their FamilySync interface?

DallanQ · October 2, 2018, 4:22am

Thanks for the link. I’ll take a look at it.

I’ve approached Ancestry several times about their FamilySync service. They are making that available only to FamilyTreeMaker and RootsMagic.

cdhorn · October 21, 2018, 5:59pm

Dallan,

Will this desktop program you mentioned be platform agnostic or Windows only?

Is there a standard format for FILE references? Are local file references also in URI format, or do some programs just put absolute or relative pathnames there without the file:/// prefix? If so, are you supporting either format?

Will this tool support FILE tags with http:// or https:// URIs in addition to local files, so it could point to images in Google Drive or elsewhere? If so, and the Gedcom is from Ancestry.com, are you going to assume it is not a direct pointer to the image and try to parse the page to get the image URL and then download that? Or will it be able to handle either scenario?

Finally, will the Gedcom import on the RootsFinder site itself also support FILE tags with a http:// or https:// URI in them if they point to a publicly accessible resource?

I know the best thing would likely be to audit my tree and enter everything manually, but I’m increasingly realizing that 20 years of research has produced some 35,000 or so unique source records I would need to enter and that is a huge amount of time. Probably 2/3 of those have images I would have to download and upload as well so I am back to thinking about how to best approach that.

Maybe it would be better to spend some time cleaning everything up in Ancestry.com over the course of several months and creating proper source records for uploaded documents there. I could then download the Gedcom, and then parse it and download all the related images both user media and Ancestry.com specific media, and then refactor the Gedcom file so this tool you mentioned could be used to upload everything. That would let me consolidate duplicates and perform some other data hygiene tasks at the same time. Or alternately I could do that and then upload everything to Google drive and refactor the Gedcom to point there and use the Gedcom import on the main RootsFinder site to do the import if it supports it.

Curious as to your thoughts, and time frame for the desktop tool you are working on.

Thanks,
Chris

DallanQ · October 29, 2018, 1:36pm

The desktop program will be platform agnostic. As far as I know, there isn’t a standard format for FILE references. Most of the references I’ve seen are for relative paths, without a file:/// prefix. I will support both relative and absolute file paths.
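A minimal sketch of what handling those shapes might look like, not the desktop app's actual code. It uses POSIX-style paths so the behaviour is the same on any machine running it; Windows drive letters and backslashes would need extra care. The function name and sample paths are hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def resolve_file_reference(reference: str, gedcom_dir: str) -> PurePosixPath:
    """Turn a FILE value into an absolute POSIX path.

    Handles file:// URIs, absolute paths, and paths relative to the
    directory that holds the GEDCOM file.
    """
    parsed = urlparse(reference)
    if parsed.scheme == "file":
        # Percent-escapes such as %20 are decoded back to plain characters.
        return PurePosixPath(unquote(parsed.path))
    path = PurePosixPath(reference)
    if path.is_absolute():
        return path
    return PurePosixPath(gedcom_dir) / path


if __name__ == "__main__":
    for ref in ("media/photo.jpg", "/abs/photo.jpg", "file:///home/chris/photo%201.jpg"):
        print(ref, "->", resolve_file_reference(ref, "/data/tree"))
```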

The thing that needs to be added is support for http(s):// references. I’m hoping that those are direct pointers to images so I don’t have to parse the page to get the image url. If they end up being html pages that need to be parsed, that’s more work and would need to be added later.
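One cheap way to tell the two cases apart is to look at the Content-Type header the server returns. The sketch below only classifies a header value; fetching the header (for example with a HEAD request) is left out, and the function name and category labels are illustrative assumptions.

```python
def classify_reference(content_type: str) -> str:
    """Classify an HTTP response by its Content-Type header value."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type.startswith("image/"):
        return "image"
    if media_type in ("text/html", "application/xhtml+xml"):
        return "html-wrapper"
    return "unknown"


if __name__ == "__main__":
    for header in ("image/jpeg", "text/html; charset=utf-8", "application/pdf"):
        print(header, "->", classify_reference(header))
```

Anything classified as "html-wrapper" would be the case that needs page parsing later.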

I don’t anticipate that the Import on RootsFinder would support FILE tags with http(s):// references, even if they point to public addresses. I anticipate that importing media would be a feature only of the desktop app in order to simplify the RootsFinder site importer.

I have some DNA features that I’ve committed to implementing in the next couple of months. I expect that the media importer will be available January or February of next year (before RootsTech).

cdhorn · October 30, 2018, 12:44am

Dallan,

Excellent, thanks much! When the time comes, if you need anyone to test, please don’t hesitate to drop me an email; I’d love to. Maybe best to not bother with http(s):// support for now for the desktop utility. For Ancestry at least, there will always be the wrapper page to obfuscate things.

I played more this past Saturday and have code to handle the _APID entries and the user media FILE references. It not only downloads all the images, it extracts all the relevant metadata from the pages and for the _APID pages generates a screenshot of each page as a number of them don’t have images but are text only or URL references. I may add support for some of the URL references to things like Find-a-Grave records to walk those out as well at some point and extract or screenshot the page on the external site.

My intent is to take all of that and refactor the Gedcom to optionally replace or supplement all the _APID references, fold in the screenshots, clean up the more obvious data issues, and stuff like that. I think that should give me a way to import it all into Gramps and, using your tool, into RootsFinder. I’ll finally have what I consider a true backup of my tree on Ancestry.com, as the Gedcom by itself, while very important, is only a part of it. The code is in Python and will be posted to GitHub at some point in case others find it of use as well.

It really is a shame they won’t allow you to interface with their sync technology, and the two vendors they do allow are not platform agnostic.

Thanks,
Chris

cdhorn · November 25, 2018, 6:49pm

Dallan,

Another question since you have far more experience with this stuff than I for sure.

I see Ancestry.com embeds the OBJE record with the FILE, FORM and TITLE attributes under the INDI or other record they are attached to. They do not create separate OBJE records and then reference them elsewhere even though that makes more sense and I see is part of the 5.5.1 standard.

My thinking in refactoring the file is to create the actual @O01234@ OBJE records and just reference them elsewhere where needed, especially since I’m adding one or more for every _APID.
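Roughly what I have in mind, as a minimal sketch rather than the finished refactor code. It assumes clean input lines, assumes the generated @O1@-style ids do not collide with ids already in the file, and does not try to rearrange how FORM and TITL nest under FILE. Identical inline structures share one record.

```python
def hoist_inline_objects(lines: list[str]) -> list[str]:
    """Move inline OBJE structures into top-level records, leaving pointers behind."""
    output: list[str] = []
    records: dict[tuple[str, ...], str] = {}  # re-levelled body -> xref id
    i = 0
    while i < len(lines):
        level, _, rest = lines[i].partition(" ")
        if rest.strip() == "OBJE" and level != "0":
            base = int(level)
            body = []
            i += 1
            # Collect subordinate lines and shift them to sit under a level-0 record.
            while i < len(lines) and int(lines[i].split(" ", 1)[0]) > base:
                sub_level, _, sub_rest = lines[i].partition(" ")
                body.append(f"{int(sub_level) - base} {sub_rest}")
                i += 1
            xref = records.setdefault(tuple(body), f"@O{len(records) + 1}@")
            output.append(f"{base} OBJE {xref}")
        else:
            output.append(lines[i])
            i += 1
    # Keep the trailer last by appending the new records just before it.
    trailer = [output.pop()] if output and output[-1] == "0 TRLR" else []
    for body, xref in records.items():
        output.append(f"0 {xref} OBJE")
        output.extend(body)
    return output + trailer


if __name__ == "__main__":
    demo = ["0 @I1@ INDI", "1 OBJE", "2 FILE photo.jpg", "2 FORM jpg", "0 TRLR"]
    print("\n".join(hoist_inline_objects(demo)))
```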

Do you know if most Gedcom importers properly support that and will the one you have planned do so as well?

Thanks,
Chris

DallanQ · November 28, 2018, 2:30pm

To quote “Pirates of the Caribbean”, the GEDCOM standard is more a set of guidelines than actual rules. I doubt anyone implements the complete specification (because it is quite complex with infinite recursion possible), and most people have extended it (because it’s been almost 20 years since the last update). What I do, and what I assume most people do, is test the GEDCOMs exported from the various desktop applications and genealogy websites that I’m familiar with in order to make sure that RootsFinder works well with those GEDCOM dialects. So you could try refactoring, but you’ll need to test to find out how well the various importers handle your output.

I’m still planning to work on Ancestry media importing by the way. I’ve been busy lately adding DNA-related features, but I expect to have it ready by RootsTech in February.

cdhorn · November 29, 2018, 2:03am

Ah, well said and great movies! Yeah, my second pass through the extract code and initial pass through the refactor code are done and the Gramps importer is handling it pretty well although I need to do more testing as I’m still very much figuring things out. I extracted the data as well as took a screenshot of every source citation record and folded all that information into the Gedcom as well and Gramps appears to see it all fine as best I can tell. Crazy amount of data, some 33928 screenshots and 10159 media items at the moment.

I did a couple test imports into RootsFinder but want to finish figuring a few more things out. I’ll likely have some more questions for you at that point.

I’m figuring the only difference between the online import and future desktop utility will be the desktop one uploading all the media files.

I know you have the DNA enhancements and a zillion other things going on I’m sure, so no rush.

cdhorn · November 29, 2018, 4:56am

I loaded a test Gedcom there, tree name is Test. If your importer keeps a log with errors and issues found during import and you would be willing to share it can you email it to me? Would be much appreciated!

I added a note with URL to the record collection to the source record. I see you pull that in but then try to use that in the citation/evidence when it is a link to the main search page for the collection and not to a specific page in the collection. The specific page link I can include in the citation, but it would be to the wrapper/record page and not actual image. Maybe I’ll try that next time to see how you handle it. I assume you pull from evidence first and then fall back to source.

On the source record, I do like that you pull the URL from the notes and put it in the source URL field. Once you do that, though, the note in the note section seems superfluous if all it contains is a URL and nothing else of value. Assuming you export the URL as a note, not sure you need it.

On the source record I thought TEXT would target the bibliographic citation field but it went into notes. How do I do that? Would that be a NOTE subordinate to a DATA entry?

DallanQ · November 30, 2018, 2:11pm

When the evidence is displayed, I’m pretty sure I fall back to the source if there isn’t a URL in the evidence.

SOUR records contain bibliographic information in TITL, AUTH, and PUBL fields. I’d say the best place to put a complete source citation is in the PUBL field.
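As a small illustration of that layout, here is a minimal Python sketch that writes a SOUR record with the full citation in PUBL. The function name and all sample values are made up, and long values that would need CONC/CONT continuation lines are not handled.

```python
def build_source_record(xref: str, title: str, author: str, publication: str) -> list[str]:
    """Return the GEDCOM lines for a SOUR record with TITL, AUTH, and PUBL fields."""
    return [
        f"0 {xref} SOUR",
        f"1 TITL {title}",
        f"1 AUTH {author}",
        f"1 PUBL {publication}",  # complete source citation goes here
    ]


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    record = build_source_record(
        "@S1@",
        "Example County Marriage Register",
        "Example County Clerk",
        "Example County Clerk, Marriage Register, vol. 1 (Example City: County Archive).",
    )
    print("\n".join(record))
```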