Every once in a while I help out with a Koha installation. At the moment, they have asked me to fix a bunch of duplicate biblio records. I have found two main ways to do this.

A manual way where you are in control every step of the way: http://manual.koha-community.org/3.2/en/stafflists.html#mergebibrecs

And a potentially destructive way where you just guess and see what happens: https://saturn.ffzg.hr/koha/index.cgi?action=revision_view;page_name=removing_duplicate_records;revision_id=20091114221320

As I understand it, the latter basically has you choose one of the biblio records, point all your identical items to that one biblio record, and delete the other biblio records. Then one runs a script (sync_items_in_marc_bib.pl) to add any missing data to the biblio record by pulling the data from the items.

Being a MySQL guy and a scripting guy, this latter approach seems to be the "easy" way to do it. If I were a librarian and understood the biblio data, I might be howling in anguish at the thought of randomly selecting the biblio record. But I have no idea what the information means. I am a sysadmin and have no real understanding nor ownership of the data. And I am not sure the people asking me to do this job completely understand the nuances of this either.

So, I ask the Koha community for advice. Should I make a little script that runs the duplicate-biblio SQL script, selects one of the biblio records to point all the items to, and deletes all the other biblio records? One would need to sync the MARC items and reindex when done. Or is that basically a terrible thing to do?

And, if I make such a script, is there a place where I should put it so others do not need to make the same thing?

- Tim Young
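[Editor's note: the "point everything at one record, delete the rest" operation described above boils down to a couple of SQL statements per duplicate. A minimal sketch, using hypothetical biblionumbers 101 (the keeper) and 102 (the duplicate) as placeholder values; note that other tables referencing the biblio, and possibly items.biblioitemnumber, would need the same treatment:]

```sql
-- Hypothetical example: keep biblio 101, retire biblio 102.
-- Move the items across first...
UPDATE items SET biblionumber = 101 WHERE biblionumber = 102;
-- ...then remove the now-empty duplicate record.
DELETE FROM biblio WHERE biblionumber = 102;
```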
Tim,

Regarding your message about what you should do with duplicate bibliographic records in Koha: as a cataloger myself, I can tell you that not all bibliographic records are created equal. Many bibliographic records are missing subject headings (600s, 650s, etc.) and contributor headings (700s), not to mention that many records do not meet RDA standards.

FWIW,

Christopher Davis
Systems & E-Services Librarian
Uintah County Library
cgdavis@uintah.utah.gov
(435) 789-0091 ext. 261
uintahlibrary.org
basinlibraries.org
facebook.com/uintahcountylibrary
instagram.com/uintahcountylibrary

On Thu, Jul 20, 2017 at 2:27 PM, Tim Young <Tim.Young@lightsys.org> wrote:
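[Editor's note: one way to see the inequality Christopher describes in your own data is to query the MARCXML directly. A sketch, assuming Koha 17.05's biblio_metadata table and MySQL's ExtractValue(); the XPath expression is an assumption to verify against your MySQL version:]

```sql
-- Count biblios whose MARCXML contains no 650 (topical subject) field.
SELECT COUNT(*)
FROM biblio_metadata
WHERE ExtractValue(metadata, 'count(//datafield[@tag="650"])') = 0;
```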
_______________________________________________ Koha mailing list http://koha-community.org Koha@lists.katipo.co.nz https://lists.katipo.co.nz/mailman/listinfo/koha
I on occasion have needed to "dedupe" bib records. We have a script that looks for dupes using ISBN as the key. We then make the determination on the "best" record to keep by seeing which bib record is longer. (Longer implies more tags, ergo better cataloging.) We then move all items to the best record, along with any outstanding holds. Our script then deletes the "losing" bib.

It is always a balance between getting rid of dupes (makes the cataloger happy) and retaining good bibs (blindly choosing by length alone makes the cataloger sad).

I am not familiar with the script you reference (sync items). I'd have to read it to see if it would work well in deduplication efforts.

Joy

Sent from my iPhone
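[Editor's note: Joy's ISBN keying can be approximated with a query like the following. This is a sketch only; real ISBN matching would also need to normalize hyphens and 10- vs 13-digit forms, which this does not attempt:]

```sql
-- Groups of biblios sharing the same raw ISBN string.
SELECT isbn,
       GROUP_CONCAT(biblionumber) AS biblionumbers,
       COUNT(*) AS copies
FROM biblioitems
WHERE isbn IS NOT NULL AND isbn <> ''
GROUP BY isbn
HAVING COUNT(*) > 1;
```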
I did make an automated script to deduplicate biblio records. Thanks, Joy, for the thoughts about choosing the "biggest" record.

While it is dangerous to say it is done before the librarians have told me I did a great job, I will go ahead and put the script here just in case someone else finds it useful some day. It does create a log file of all the biblio records it deletes, but it does not save their contents. At least you can look at whatever remains and try to remember what the contents used to be...

My apologies to any Perl masters who might be offended by my attempt. :)

-----
#!/usr/bin/perl
# For use with Koha 17.05.x

$INSTANCE  = "";           # default instance
$KOHAMYSQL = "koha-mysql"; # put a path here if this cannot be found
@TABLELIST = qw(aqorders article_requests biblioimages items
                hold_fill_targets biblioitems old_reserves reserves
                ratings reviews tags_all virtualshelfcontents);
my $doOne   = 0;           # set to 1 to test on a single duplicate group
my $OutFile = "DeDup_log.txt";

#####################################
# No need to edit below this line ###
#####################################
open(LOG, '>', $OutFile) or die "Cannot open $OutFile: $!";

# read in the instance
if (@ARGV > 0) {
    $INSTANCE = $ARGV[0];
} else {
    die "Specify a koha instance to dedup";
}

sub RunSQL {
    my $sql     = join ' ', @_;
    my $command = "echo '$sql' | $KOHAMYSQL $INSTANCE";
    my $error;
    my $out;
    print "Command: $command\n" if ($doOne == 1);
    $out   = qx{$command 2>&1};
    $error = $?;
    return $error, $out;
}

my $counter = 0;

# Find all the duplicates
print "Calculating: \n";
print LOG "Calculating: \n";
($status, $output) = RunSQL("SELECT GROUP_CONCAT(biblionumber SEPARATOR \", \") AS biblionumbers, title, author FROM biblio GROUP BY CONCAT(title,\"\/\",author) HAVING COUNT(CONCAT(title,\"\/\",author))>1;");

# Get stats.
$dupes     = 0;
$totalrecs = 0;
my @lines = split /\n/, $output;
foreach my $line (@lines) {
    (my $nums, my $name) = split /\t/, $line;
    @bibs = split /,/, $nums;
    if (@bibs > 1) {      # It is a dup if there are 2 of them.  Skip the title line.
        $dupes++;
        $totalrecs += @bibs;   # count of matches
    }
}
$ToRemove = $totalrecs - $dupes;
print " Duplicates to clean: $dupes\n";
print LOG " Duplicates to clean: $dupes\n";
print " Total Records involved: $totalrecs\n";
print LOG " Total Records involved: $totalrecs\n";
print " Records to remove: $ToRemove\n";
print LOG " Records to remove: $ToRemove\n";

# Now, we do the real work
ONETIME: {
    foreach my $line (@lines) {
        (my $nums, my $name) = split /\t/, $line;
        @bibs = split /,/, $nums;
        if (@bibs > 1) {  # It is a dup if there are 2 of them.  Skip the title line.
            my $max     = 0;
            my $maxitem = "0000";
            # find the biblio record with the largest size
            foreach my $bib (@bibs) {
                ($status, $out) = RunSQL("SELECT * FROM biblio_metadata WHERE biblionumber = $bib");
                my $len = length($out);
                if ($len >= $max) {
                    $max     = $len;
                    $maxitem = $bib;
                }
            }
            # if we have a largest size, point everything to that and delete other records
            if ($maxitem ne "0000") {
                print LOG "---$name\n";
                print LOG "Found Best Record: $maxitem\n";
                my $sql;
                foreach my $bib (@bibs) {
                    if ($bib != $maxitem) {   # skip the one we are standardizing on
                        # update each table
                        foreach my $table (@TABLELIST) {
                            $sql = "UPDATE $table SET biblionumber = $maxitem WHERE biblionumber=$bib;";
                            #$sql = "SELECT biblionumber FROM $table WHERE biblionumber=$bib;";
                            (my $err, my $update) = RunSQL($sql);
                            if ($err != 0) { die $update; }
                        }
                        # Now we are ready to delete
                        $sql = "DELETE from biblio where biblionumber = $bib;";
                        (my $err, my $update) = RunSQL($sql);
                        if ($err != 0) { die $update; }
                        $counter++;
                        print LOG " Deleting: $bib\n";
                        print ".";
                    }
                }
            }
            if ($doOne == 1) { last ONETIME; }
        }
    }
}
close(LOG);
print "\n";
print "Completed purging $counter records.\n";
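[Editor's note: the script above leaves the "sync the MARC items and reindex" step to be run by hand. On a Debian package installation that final step might look like the following sketch; the instance name is a placeholder, and the flags should be checked against `koha-rebuild-zebra --help` on your version:]

```shell
# Pull item data back into the surviving MARC records (path may vary by install)
sudo koha-shell -c "perl misc/maintenance/sync_items_in_marc_bib.pl --run-update" instancename
# Full reindex of the biblio records for the instance
sudo koha-rebuild-zebra -f -b -v instancename
```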