Wiki backup question

Now seems like a good time to ask who is keeping backups of what? I've got backups of the articles I've done, and pics I've added, but not in any wiki-native format - they're .txt and .jpg, so restoration would take ages. Wgetting the wiki a while back proved a bit impractical, and it's far bigger now.

NT

Reply to
Tabby

There are various bits of s/w out there that will spider their way through a website, grab all the files that make it up and write them to disc. I used one called WinHTTrack once, which worked OK.

Reply to
Scott M

OK for sites with static pages, but you can't rely on this approach if the site relies on server-side scripts to generate the pages.

Reply to
Mike Clarke

They're rarely much use when a site is heavily scripted, as you get the rendered output of the scripts rather than the actual website source. As NT (Tabby) hinted, the average wiki consists entirely of scripts, which interpret the wiki contents, images and markup and convert them to HTML. As with any software, to back it up you need to save the sources, not the output.

Whoever has the admin rights to the wiki server should easily be able to access the source files and images and make a backup of them.

Failing that, and if you don't need the wiki history or discussion pages or other metadata, perhaps one could adapt a spider to only follow the wiki "view source" links and scrape the unformatted text from them ...
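
Something along these lines is what I mean - just a sketch, untested, with the wiki address and page titles as placeholders (MediaWiki's action=raw serves up the unformatted wiki markup):

    # fetch the raw wikitext of a few pages, one file per page
    for page in Plumbing Central_heating; do
        wget -O "$page.wiki" "http://example.org/wiki/index.php?title=$page&action=raw"
    done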

Nick

Reply to
Nick Leverton

Yeah - just tried ...

Reply to
geoff

wget

Reply to
Huge

The MediaWiki software package has facilities for taking full backups of the database(s) that it uses to store the information that the scripts present as web pages. The admins should have access to the server to do this; IIRC this requires shell access to the server, preferably via SSH.
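
Something like this, for example - only a sketch, and the database name, user and paths are placeholders for whatever the wiki actually uses:

    # dump the wiki database to a dated file, ready to copy off the server
    mysqldump -u wikiuser -p --single-transaction wikidb > wikidb-$(date +%F).sql

    # or let MediaWiki itself write an XML dump of all pages and revisions
    php maintenance/dumpBackup.php --full > wiki-pages-$(date +%F).xml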

Cheers Dave.

(Admin of another wiki)

Reply to
Dave Liquorice

OK, I went to
formatting link
entered the entire index of pages, but instead of working it gave this:

XML Parsing Error: no element found
Location: formatting link
Line Number 1, Column 1: ^

I don't see anything elsewhere that might be used to export pages in any form. Anyone a bit more familiar with MediaWiki? Thanks

NT

PS To the folks that suggested using various inbuilt methods, I don't know what the terms mean, so don't know how to do what was suggested.

PPS As someone suggested, I did use wget a while ago to harvest the edit pages, which is where the article text is, but it's a pretty horrible way to do it, and would be a mare to reinstate.

Reply to
Tabby

MediaWiki sucks for this. DB-level access is a royal nightmare to actually use (in practice it means that you often can't use a backup you took earlier).

One of the best and most robust ways, although painful, is to use Special:Export and Special:Import to produce XML dumps of the wiki content (including categories, templates or other namespaces). This has the great advantage that it can be done through normal wiki pages, without being a server admin (wiki admin permissions are usually needed, certainly for import). The downside is that Export needs a list of pages, not a wildcard - the best way I have to handle this is by installing the DPL (Dynamic Page List) extension and using it to make a page that outputs a list of page names to export. You can also use this same approach to replicate content (or part of the content) from one wiki to another.
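
For the export step, a rough sketch (the wiki address and page titles are placeholders, and the exact form parameters may differ between MediaWiki versions):

    # one page at a time is the simplest form
    curl 'http://example.org/wiki/index.php?title=Special:Export/Plumbing' > Plumbing.xml

    # or several pages in one dump, current revisions only
    curl -d 'pages=Plumbing%0ACentral_heating&curonly=1' \
         'http://example.org/wiki/index.php?title=Special:Export&action=submit' > export.xml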

Reply to
Andy Dingley

I've got all the XMLs saved. It will only spit out a limited number of articles per XML file.

Next question is the image files. I've got the ones I've contributed only as JPGs, so I need a format that's much faster to restore for all of them. Any suggestions?

thanks, NT

Reply to
Tabby

It's hosted on one of Grunff's servers, so I expect (although have not checked!) it will be backed up as a matter of course. I will ask him and see what level of backup is in place.

I have access via FTP to the account that hosts all the MediaWiki stuff, and also to the MySQL server that holds the (non-image) content[1]. Last time I looked, however, it was getting a bit on the large side (i.e. several gig) for sucking down over ADSL connections on a routine basis.

[1] the images are dumped into a directory hierarchy, and links to them are held in the DB rather than the images themselves being held as BLOBs
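
If it helps, something along these lines would grab just the original images - the paths and hostname are placeholders, and it assumes rsync/SSH access rather than plain FTP:

    # copy the wiki's image tree, skipping the cached thumbnails
    rsync -avz --exclude 'thumb/' user@server:/path/to/mediawiki/images/ ./images-backup/
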
Reply to
John Rumm

Life happens... best we each have a copy I think.

I followed some of that. I downloaded all the articles in XML format, 3.1MB total. I don't know what the rest of those gigabytes is. The images should total somewhere vaguely in the ballpark of 100MB so far.

NT

Reply to
Tabby

Re the file size issue, are you just looking at the source images, or the source images plus the cached thumbnail versions?

Reply to
Andy Dingley

JOOI, how easy is it to get at the "raw data"? Is the database format nicely-documented and the data stored in an accessible form such that it can easily be pulled out independently of the wiki software? I'm always a bit wary of public information tucked away behind proprietary software in a proprietary format...

cheers

Jules

Reply to
Jules Richardson

MediaWiki is just a collection of PHP scripts that access a MySQL database; pretty sure that's all open source...

Reply to
Dave Liquorice

Dave Liquorice (snipped-for-privacy@howhill.com) wibbled on Monday 03 January 2011 00:25:

formatting link
It's all there.

As someone else said, the only way to do a proper backup is from the server with suitable permissions (MySQL dump and read access to the Wiki config and media files).

I'm sure whoever is running it has thought of all that, but if people are worried, perhaps someone with the connections could ask?

In the absence of that, a wget mirror would be better than nothing, but it would be painful to reconstruct (unless using a "smart" backup script that pulled each page in edit mode so that the raw wiki markup was grabbed, plus spidering into full-resolution versions of any embedded images etc) - and even then all the history is bye-bye.
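
The crude version would be something like the following (the address is a placeholder, and it only captures the rendered HTML, not the markup):

    wget --mirror --page-requisites --convert-links --no-parent http://example.org/wiki/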

Reply to
Tim Watts

What's on the MediaWiki site is very far from "all of it".

In particular, be careful when trying to restore DB backups (this is a general problem for backups at that level), because you will run into problems if the target system is different from the source system. It's OK if you're restoring after a disk crash into an identical environment, but it's very likely to break if the installation was different (DB naming for multiple wikis on shared DB servers is one example), if the context is different (home vs. production server), or even if you try to put a backup from an older MW onto a newer MW.

Reply to
Andy Dingley

Andy Dingley (snipped-for-privacy@codesmiths.com) wibbled on Monday 03 January 2011 12:02:

Those are details - which I take for granted when it comes to an actual implementation (this is what I do for a living as well).

The short of it is, though, that it comes down to backing up an RDBMS and a bunch of files - I didn't want to write a treatise on the exact procedure, which as you say is rather more complicated. Just affirming that any sort of meaningful backup does require access to the server.

DB names aren't a problem if you run individual backups per DB - at least not on PostgreSQL - I usually dispense with the "backup all DBs" program in favour of scripting my own that backs up each DB into a separate file, for very similar reasons.
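
Roughly the sort of thing I mean - PostgreSQL only, and untested as written here:

    # dump every non-template database into its own file
    for db in $(psql -At -c "SELECT datname FROM pg_database WHERE NOT datistemplate"); do
        pg_dump "$db" > "$db.sql"
    done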

I sort of expected MediaWiki to have a generic backup option at the application level to avoid problems with restoring to slightly different versions - Horde does, at least at a per-app level.

But if that link above is complete (judging by what it doesn't mention), that doesn't seem to be an option?

Which would mean recovery would have to be done to the same version - but I don't see that as too much of a problem.

Cheers

Tim

Reply to
Tim Watts

Table names may vary too.

If you install multiple wikis on the typical shared hosting with a single DB visible to that hosting account, they're disambiguated by a per-wiki name prefix on the DB object names, set when you first install. You're likely to see the same single-DB situation when you have multiple development wikis on a laptop (I have three hosted on this one). The need for this is also a reason to always use the optional prefix when installing, as it makes it easier to move them around later.

Reply to
Andy Dingley

Andy Dingley (snipped-for-privacy@codesmiths.com) wibbled on Monday 03 January 2011 13:42:

I do prefer the multiple DB way of installing stuff - much easier to manage. But I am aware of the table prefix method.

I don't think this is a problem - the stated issue is how to back up the running server. It is a perfectly reasonable assumption that the server would be reinstalled in the same environment, or, if one had to reinstall to a new environment, that you'd have to make that environment the same - even installing an older matching version of MediaWiki and then upgrading to the current one if necessary, which is also reasonable. I think for this exercise it is a moot issue. I have seen the same scenario with MythTV, where the database gets modified with upgrades, so I'm aware of the issue.

One thing though - I always work with PostgreSQL, avoiding MySQL unless I really have no choice.

PostgreSQL's pg_dump program has options to dump pure SQL as well as binary-format dumps. Obviously, with a pure SQL dump, doing a search and replace on all prefixes is possible. How do MySQL dumps work?
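
By way of illustration (the database name is just a placeholder):

    pg_dump -Fp wikidb > wikidb.sql     # plain SQL text, amenable to search-and-replace
    pg_dump -Fc wikidb > wikidb.dump    # compressed custom format, restored with pg_restore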

Reply to
Tim Watts
