#!/usr/bin/perl -w
use strict;

=head1 NAME

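bibparse.pl - heuristically parse free-text bibliography entries into EndNote format

=head1 SYNOPSIS

  bibparse.pl refs.txt > refs.endnote.txt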

=head1 DESCRIPTION

This script uses a collection of heuristics to parse free-form bibliography entries,
regardless of their format.
It can then write the result to an output file in a structured format.

Right now EndNote format is the only one supported. That should be enough, since
there are many other products to help you once you've gotten your existing text
into some structured format.

For a typical academic paper with complex entries, this tool will get about 75% of the entries completely correct.
For the others, it will pull out whatever fields it can (such as the publication year) and put
all the rest of the text in an 'other' field. You will then have to clean those entries up manually.

=head1 FORMATS

For an old but still excellent overview of structured bibliography formats, see:

   Survey of Bibliography Tools
   by Dana Jacobsen
   http://www.ecst.csuchico.edu/~jacobsd/bib/formats/

There are many "tagged" formats, basically field-value pairs on individual lines.
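
As a rough illustration of such a tagged record, the sketch below prints one field per line as
"tag value". The tag subset, the hand-built record and the helper name tagged_lines are
assumptions made for this example only; they are not part of the script.

```perl

  use strict;
  use warnings;

  # Tag subset for illustration; '%0' carries the record type in EndNote.
  my %tags = (type => '%0', author => '%A', title => '%T', date => '%D');

  # Hand-built record, loosely based on one of the sample entries in this file.
  my %rec = (
      type   => 'Book Section',
      author => ['Rush, AJ', 'Thase, ME'],
      title  => 'Psychotherapies for depressive disorders: a review',
      date   => '1999',
  );

  # Return one "tag value" line per field; repeated fields get one line each.
  sub tagged_lines {
      my ($rec, $tags) = @_;
      my @lines;
      for my $field (qw(type author title date)) {
          my $value = $rec->{$field};
          next unless defined $value;
          for my $v (ref($value) ? @$value : $value) {
              push @lines, "$tags->{$field} $v";
          }
      }
      return @lines;
  }

  print "$_\n" for tagged_lines(\%rec, \%tags);

```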

BibTeX is a more complex format, basically a program when you get down to it.
EndNote cannot directly import BibTeX; instead they suggest you download their
utility (refer.bst) to generate Refer format using \bibliographystyle{refer}.

Of course the up-and-coming formats rely on XML; see for example:

    http://www.loc.gov/standards/mods/

All of that has to do with what for us is output.
Our *input* is a bibliography text style. There are even more of those.
In fact, there is just about one for every journal. 
Some of the better-known styles are: AMA, APA, CBE, Chicago, MLA, Turabian.

There are many examples/overviews of these styles on the internet (it seems
every university has their own). See for example:

  http://www.liu.edu/cwis/cwp/library/workshop/citation.htm
  http://www.dianahacker.com/resdoc/
  http://www.aresearchguide.com/styleguides.html
  http://writing.colostate.edu/references/sources/document/pop4.cfm

=head1 OTHER TOOLS

Amazingly, I can find no free or pay tool that does this. They can import/export
oodles of structured formats, and they can output oodles of textual formats,
but parsing random text bibliographies seems to be beyond what they want to do. 

Some command-line conversion tools for structured formats are:

  bibutils http://www.scripps.edu/~cdputnam/software/bibutils/bibutils.html
  bp http://www.ecst.csuchico.edu/~jacobsd/bib/bp/

For a dauntingly exhaustive comparison of commercial products, see:

   http://www.burioni.it/forum/ors-bfs/grid/index.html

There is an ambitious bibliographic software project at http://bibliographic.openoffice.org/

=head1 TODO

There is no end to this problem really. And it can't be done, in general.
The non-structured bibliography formats are *not* reversible; for example,
they do not have a means for escaping periods.

Right now we just apply a hodgepodge of regular expressions, snipping bits of
information out. We might be better off repeatedly trying to apply certain
style templates (MLA, Chicago, etc.).
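
One possible shape for the template idea, as a sketch only: keep a list of named patterns and
return the fields of the first one that matches. The two patterns below are simplified
stand-ins (they are not faithful MLA or Chicago definitions), and match_template is a
hypothetical helper that the script does not contain.

```perl

  use strict;
  use warnings;
  use 5.010;    # named captures and %+ need Perl 5.10 or later

  # Each template is [style name, pattern with named captures].
  my @TEMPLATES = (
      # "Authors. Title. Journal. 2001;62(3):12-17"
      [ 'vancouver-like',
        qr/^(?<authors>[^.]+)\.\s+(?<title>[^.]+)\.\s+(?<journal>[^.]+)\.\s+
           (?<year>\d{4});(?<volume>\d+)(?:\((?<issue>[^)]+)\))?:(?<pages>\w+-\w+)\.?$/x ],
      # "Authors (1997). Title. Journal, 42, 740-743"
      [ 'apa-like',
        qr/^(?<authors>.+?)\s+\((?<year>\d{4})\)\.\s+(?<title>[^.]+)\.\s+
           (?<journal>[^,]+),\s*(?<volume>\d+),\s*(?<pages>\d+-\d+)\.?$/x ],
  );

  # Try each template in order; return a hash ref of fields, or nothing.
  sub match_template {
      my ($entry) = @_;
      for my $template (@TEMPLATES) {
          my ($style, $re) = @$template;
          if ($entry =~ $re) {
              return +{ style => $style, %+ };
          }
      }
      return;    # no template matched; the caller can fall back to heuristics
  }

```

An entry that matches no template would still go through the regular-expression heuristics, so
templates would only add precision for the styles they cover.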

It takes no command-line options right now. If you want to change what it does,
you have to edit the file and change some constants at the top.

=head2 Character Sets

Character encodings are just painful. Ideally we'd offer control over the input and output character sets.
But the biggest challenge comes from the almost-ASCII documents that have things like
Windows "smart quotes" in them.
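
A minimal sketch of one way to flatten such punctuation, assuming the input is single-byte
Windows-1252 text (it would corrupt UTF-8 input, where these byte values occur inside multi-byte
characters). The mapping table and the name flatten_smart_punctuation are illustrative and not
part of the script.

```perl

  use strict;
  use warnings;

  # Windows-1252 punctuation bytes and plain-ASCII stand-ins.
  my %CP1252_TO_ASCII = (
      "\x91" => "'", "\x92" => "'",    # single smart quotes
      "\x93" => '"', "\x94" => '"',    # double smart quotes
      "\x96" => '-', "\x97" => '-',    # en dash, em dash
      "\x85" => '...',                 # ellipsis
  );

  # Replace each listed byte with its ASCII stand-in; leave everything else alone.
  sub flatten_smart_punctuation {
      my ($text) = @_;
      $text =~ s/([\x85\x91-\x94\x96\x97])/$CP1252_TO_ASCII{$1}/g;
      return $text;
  }

```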

=head2 Input Text Formatting

Right now, we assume text without markup. But knowing what is italic could be useful,
for bib styles that use italics.

=head2 Nested References

References to articles published inside books are a particular challenge. Each one is really a nested record.

Some examples from the wild are:

   Brown RP, Gerbarg PL, and Muskin PR. Complementary and Alternative
   Treatments in Psychiatry. In Psychiatry 2nd Edition, ed. by Allen
   Tasman, Jeffrey Lieberman and Jerald Kay, Published by Wiley &
   Sons, Ltd., London, UK, 2003, pg 2171-2172.

   Thase ME, Jindal RD. Combining psychotherapy and psychopharmacology
   for treatment of mental disorders. In: Lambert MJ, editor. Bergin
   and Garfield's Handbook of psychotherapy and behavior change. 5th
   ed. New York: John Wiley and Sons Inc; 2004. p 743-66.

   Rush AJ, Thase ME. Psychotherapies for depressive disorders: a
   review.  In: Sartorius N, ed. WPA Series. Evidence and Experience in
   Psychiatry: Volume 1. Depressive Disorders. Chichester, UK: John Wiley
   and Sons; 1999:161-206.

   Goldenberg DL. Fibromyalgia and related syndromes. 
   In: Klippel JH, Dieppe PA, Eds. Rheumatology. London: Mosby, 
   1998:15.1-15.12.
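
To make the nesting concrete, here is a sketch that splits such an entry into the section part
and the containing-book part, and then peels the editors off the front of the book part. It
assumes the container is introduced by "In:" and that the editors end in "ed.", "eds." or
"editor(s)."; the first sample above ("In Psychiatry ..., ed. by ...") does not fit and is left
whole. split_nested is a hypothetical helper, not a function of this script.

```perl

  use strict;
  use warnings;

  # Split "section. In: editors, eds. container ..." into its nested pieces.
  sub split_nested {
      my ($entry) = @_;
      my %out = (section => $entry);
      # section = shortest prefix ending in "." or "?" that is followed by "In:"
      if ($entry =~ m/^(.*?[.?])\s+In:\s*(.+)$/s) {
          @out{qw(section container)} = ($1, $2);
          # editors = text before the first "ed."/"eds."/"editor(s)." marker
          if ($out{container} =~ m/^(.*?),?\s+(?:eds?|editors?)\.\s*(.*)$/is) {
              @out{qw(editors container)} = ($1, $2);
          }
      }
      return \%out;
  }

```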

=head2 Reports

Reports can be a challenge:

   Mental Health: A Report of the Surgeon General. 1999. 
   U.S. Department of Health and Human Services,
   Substance Abuse and Mental Health Services Administration, Center
   for Mental Health Services, National Institutes of Health, National
   Institute of Mental Health, Rockville, MD.

=cut

use Data::Dumper;

# whether to convert things like "=20" from an emailed file
my $INPUT_EMAIL_ENCODING = 1;

# regexp for an entry starter 
my $ENTRY_START_RE = '(\d+)\s*[\.\)]+\s+';

# the regexp to use to separate entries on input.
# any number of blank lines.
my $INPUT_SEP_RE = '\n(?:\s*\n)+';
my $INPUT_SEP = "\n\n";

my $COUNT = 0;
my %COUNTS = ();

# 'Manuscript', 'Edited Book', 'Report', 'Conference Proceedings', ...
my $ENDNOTE_TYPES = {
 journal => 'Journal Article',
 book => 'Book',
 booksection => 'Book Section',  
 generic => 'Generic',
};

my $ENDNOTE_TAGS = {
 type => '%0',    # EndNote only, not Refer
 author => '%A',  # "Last, First von, Jr." format in EndNote, "First M. Last" in Refer. repeated.
 editors => '%E', # may be repeated in EndNote
 title => '%T',
 seriestitle => '%S',
 journal => '%J', # journal name, if article came from a journal.
 book => '%B',    # book, if article came from a book. TODO: use secondary author and secondary title?
 date => '%D',    # full date with month spelled out
 volume => '%V',
 issue => '%N',
 edition => '%7',
 pages => '%P',
 keywords => '%K',
 publisher => '%I',
 city => '%C',
 other => '%O',   # extra printed info
 abstract => '%X', # extra non-printed info
 label => '%L',    # used for labelling
 url => '%U',
};
 
# record separator, tag separator, tag hash, type hash
my $OUTPUT_STYLES = {
  endnote => ["\n", " ", $ENDNOTE_TAGS, $ENDNOTE_TYPES],
};

my $OUTPUT_STYLE = 'endnote';

# whether to split names into repeated individual author fields on output
my $OUTPUT_SPLIT_NAMES = 1;

# no good way to deal with the last one if "Casper G. How I scare people."  
my $PROTECT_PERIOD_INITIALS = 0;

my $PAGERE = '[A-Z]*[1-9][\d\.]*';

# used when recombining multi-sentence titles.
my $TITLESEP = '. ';
# used when combining extra parts
my $OTHER_SEP = '. ';

my $NESTEDRE = 'in:\s*(.*)';

# assumes surrounding chars
my $YEARRE = '\D(\d\d\d\d)[\s,:;\.\)]';

################################################################
sub warning {
    print STDERR "WARNING [$COUNT]: ", @_, "\n";
    return '';
}

my $DEBUG = 0;
sub debug {
    print STDERR "DEBUG [$COUNT]: ", @_, "\n" if $DEBUG;
}

my @RECORDS = ();

sub record_rec {
    my ($type, $rec) = @_;
    $rec->{type} = $type;
    $COUNTS{$type}++;
    push(@RECORDS, $rec);
    return $rec;
}


sub parse_entry {
    my ($entry) = @_;

    # convert to one line
    $entry =~ s/[\n\r]/ /g;

    # leading white space
    $entry =~ s/^\s+//;
    # remove trailing white space and trailing period.
    $entry =~ s/\s*\.?\s*$//;

    debug("==== Parsing: '$entry'");

    my $rec = {};
    # eliminate any leading digit like "17." or "17)" or "40.." (typo)
    if ($ENTRY_START_RE && $entry =~ m/^$ENTRY_START_RE/) {
	debug("removed leading number from entry: $1");
	$rec->{recno} = $1;
	$entry =~ s/^$ENTRY_START_RE//;
    }

    # pull out any urls (because they contain periods)
    if ($entry =~ m/\s+(https?:\S+)/) {
	my $url = $1;
	$url =~ s/\.$//;
	$rec->{url} = $url;
	$entry =~ s/(?:available)?\s*(at)?\s*:?\s*\.?\s+http\S+//i;
    }
    if ($entry =~ m/\s+(www\.\S+)/) {
	my $url = $1;
	$url =~ s/\.$//;
	$rec->{url} = $url;
	$entry =~ s/(?:available)?\s*(at)?\s*:?\s*\.?\s+www\.\S+//i;
    }

    my $PERIOD = '|';
    my $PERRE = quotemeta($PERIOD);

    # digits on either side of a period are like decimals of some sort.
    $entry =~ s/(\d)\.(\d)/$1$PERIOD$2/g;

    # rewrite periods in names (as initials)
    # handle both: "Thase, M. E., & Rush, A. J. (1997)." and "Lindgarde F, Manthorpe R."
    $entry =~ s/(,\s*[A-Z])\.(\s*[A-Z])\./$1$PERIOD$2$PERIOD/g;
    # warning("no match: '$entry'") if $entry =~ m/mandersch/i && $entry !~ m/,\s*[A-Z]\.\s*[A-Z]\./;
    # $entry =~ s/(,\s+[A-Z])\./$1$PERIOD/g if $PROTECT_PERIOD_INITIALS;

    # ., and .) are always suspect.
    $entry =~ s/(\w+)\.([,\)]) /$1$PERIOD$2 /g;

    # rewrite some abbreviations: "No." (Number), "Suppl." (Supplement),
    # because it'll look like a part separator. 
    # TODO: what about titles ending in these?
    $entry =~ s/(no)\. /$1$PERIOD /ig;
    $entry =~ s/(suppl)\. /$1$PERIOD /ig;
    # $entry =~ s/(ltd)\., /$1$PERIOD, /ig;
    $entry =~ s/(jr)\. /$1$PERIOD /ig;
    $entry =~ s/(vs)\. /$1$PERIOD /ig;
    $entry =~ s/(ed)\. /$1$PERIOD /ig;
    # $entry =~ s/(et\.? al)\. /$1$PERIOD /ig;
    $entry =~ s/(\Wpp?)\./$1$PERIOD/ig;

    $entry =~ s/(D)\.(C)\./$1$PERIOD$2$PERIOD/g;

    $entry =~ s/(\s[A-Z])\./$1$PERIOD/g if $PROTECT_PERIOD_INITIALS; # " A."
    $entry =~ s/(${PERRE}[A-Z])\./$1$PERIOD/g; # ".A." (braces keep [A-Z] from being read as an array subscript)

    # nested book reference, need to suck in the next line as it is the title.
    $entry =~ s/(in:[^\.]+?eds?)\./$1$PERIOD/i;
    
    # split by period (removing) or by question mark (preserving)
    my $parts = [split(/(?<=\?)|(?:\s*\.\s*)/,$entry)];

    # warning("raw parts are:\n\t",join("\n\t", @$parts));

    $parts = fix_parts($parts);

    # restore periods in entry
    $entry =~ s/$PERRE/./g;

    # debug("parts from '$entry' are:\n", join("\n\t",@$parts));
    # @$parts = map {(s/$PERRE/./g,$_)} @$parts;
    foreach $_ (@$parts) {s/$PERRE/./g;}
    my $nparts = scalar(@$parts);

    # warning("parts from '$entry' are:\n\t",join("\n\t", @$parts));

    # maybe we misinterpreted an author initial like "Casper G. How I scare people."
    # just two parts or the 3rd part starts with a year, or there are 4 words in the author
    if ($nparts < 3 || $parts->[2] =~ m/^\d\d\d\d\s/ || $parts->[0] =~ m/\w+\s+\w+\s+\w+\s+\w+/) {
	if ($parts->[0] =~ m/^(.+)\s*\.\s*([^\.]+)$/ ) {
	    $nparts++;
	    shift @$parts;
	    @$parts = ($1, $2, @$parts);
	    warning("undoing aggressive initials to split so title is now '$2'");
	}
    }

#    if (($nparts < 3 || $parts->[2] =~ m/\d\d\d\d/) && $parts->[1] =~ m/(.*?\?)(.*)/) {
#	$nparts++;
#	my ($title, $rest) = ($1, $2);
#	my $author = shift @$parts;
#	shift @$parts;
#	@$parts = ($author, $title, $rest, @$parts);
#	warning("fixing title with question mark, now title just '$title'");
#    }

    if ($nparts < 3) {
	warning("entry has only $nparts parts; '$entry'");
    }
    $parts = fix_parts($parts);
    return if try_journal($rec, $parts, $nparts, $entry);
    $parts = fix_parts($parts);
    return if try_any($rec, $parts, $nparts, $entry);

    # won't ever run, since try_any always succeeds.
    debug("the ", scalar(@$parts), " parts are:\n\t", join("\n\t",@$parts));
    warning("could not match entry: '$entry'");
}

sub fix_parts {
    my ($parts) = @_;
    my @p = ();
    for (@$parts) {
	next unless $_;
	# in case rearrangements screwed things up
	# warning("fixing '$_'") if m/^\s+/;
	s/^\s+//;
	push(@p, $_);
    } 
    return [@p];
}

sub is_pubinfo {
    my ($p) = @_;
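    # the padding spaces supply YEARRE's surrounding characters, so a part that starts with a year matches
    return  " $p " =~ m/^$YEARRE/ || $p =~ m/\d\-\d/ || $p =~ m/^suppl/i;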
}
sub is_journal_name {
    my ($p) = @_;
    return $p =~ m/journal/i;
}
sub is_nested_ref {
    my ($p) = @_;
    return $p =~ m/^$NESTEDRE/i;
}

# JOURNAL FORMAT
# format of: "author. title. journal. year [month] [day];volume[(issue)]:pagestart-pageend.
# TODO: 
#    titles that end in "?"
#    sometimes month precedes year: "title. [month] year"
#    "Biological Psychiatry, 42: 740-743, 1997."
#    "1997, 54: 1001-1006."
#    "Regier, D. A., Rae, D. S., Narrow, W. E., Kaelber, C. T., & Schatzberg, A. F. (1998). Prevalence of anxiety disorders and their comorbidity with mood and addictive disorders. British Journal of Psychiatry. Supplement, 34, 24-28"
#    "Angst, J, Angst, F, & Stassen, HH. (1999). Suicide risk in patients with major depressive disorder. Journal of Clinical Psychiatry, 60 (Suppl. 2), 57-62."
#    "2003;64 Suppl 15:7-12"
#    J Clin Psychiatry. 2001;62 Suppl 18:12-7. 
# DONE:
#    "Int Clin Psychopharmacol 1999;14(Suppl 2):S1-S6"
#    "2003; 74 (2); 191-195"
# note that:
#    month can be a range: Mar-Apr
#    issue need not be a number: "Suppl 2"
#    pages are not always pure numbers ("S1-S6", "15.1-15.12")
#
# WARNING: this modifies its arguments, which might affect later attempts, either positively or negatively
sub try_journal {
    my ($rec, $parts, $nparts, $entry) = @_;

    return 0 if $nparts < 3;

    # warning("parsing: ", join("\n\t", @$parts)) if $entry =~ m/Bulow/;
    my $year;

    # maybe year is in parens at end of author part
    if ($parts->[0] =~ m/\((\d+)\)$/) {
	$year = $1;
	$parts->[0] =~ s/\(\d+\)$//;
	warning("moving year '$year' from author");
    }
    # maybe year is in parens after author, separated by period (different part)
    if ($parts->[1] =~ m/^\((\d+)\)$/) {
	$year = $1;
	my $author = shift @$parts;
	shift @$parts;
	@$parts = ($author, @$parts);
	$nparts--;
	warning("moving year '$year' from part after author");
    }

    # multiple sentence titles.
    # 2nd and 3rd might be a two-sentence title. then the 4th part is the combined journal and pubinfo, or the 4th is journal and 5th is pubinfo.
    if ($nparts >= 4 && 
	# the 4th part shouldn't look like pub info, because then 3rd part is probably journal name.
	!is_pubinfo($parts->[3]) &&
	# more confirmation that 3rd part is not a journal name
	!is_journal_name($parts->[2]) &&
	!is_nested_ref($parts->[2]))
    {
	my $author = shift @$parts;
	my $title1 = shift @$parts;
	my $title2 = shift @$parts;
	my $title = $title1 . (($title1 =~ m/\?$/) ? ' ':  $TITLESEP) . $title2;
	warning("guessing that we have a two-sentence title: '$title'");
	@$parts = ($author, $title, @$parts);
	$nparts--;
    }
    else {warning("nparts=$nparts not combining '", $parts->[3], "' with match=" . (" $parts->[3] " =~ m/^ $YEARRE /)) if $entry =~ m/depressed patients/;}  

    # if 3 parts, maybe journal and year are together in one part (the last one)
    if ($nparts == 3) {
	# insert year so it can be parsed out
	if ($year) {
	    $parts->[2] =~ s/^([^,;\d]+,)/$1 $year,/;  
	    warning("made combined now '$parts->[2]' with year from before");
	    $year = undef;
	}
	# split combined part into journal name and pub info
	if ($parts->[2] =~ m/^(.*?)\s*(\d.*)$/) {
	    $parts->[2] = $1;
	    push(@$parts, $2); # year plus rest is now $parts->[3]
	    $parts->[2] =~ s/,\s*$//; # remove any trailing comma from journal title
	    $nparts++;
	    warning("now journal is '$parts->[2]' and pubrest is '$parts->[3]'");
	}

	if ($nparts == 3) {
	    debug("only 3 parts, so can't be journal");
	    return 0;
	}
    }

    my $p3 = $parts->[3];
    my $origp3 = $p3;

    # make sure the year we pulled out is used.
    if ($year) {
	if ($p3 =~ m/(\d\d\d\d)/) {warning("doing nothing with year '$year' because have year '$1'");}
	else {
	    $parts->[3] = $p3 = "$year; $p3";
	    warning("prepended year to make '$p3'");
	}
    }

    # fix up a year at the end, as in "Biological Psychiatry, 42: 740-743, 1997." where the pubinfo
    # part "42: 740-743, 1997" is rewritten as "1997; 42: 740-743"
    if ($p3 =~ m/(\d+)\s*[:]\s*(\d+\s*-\s*\d+)\s*[,]\s*(\d\d\d\d)/) {
	$p3 = "$3; $1: $2";
	warning("moving year from end to start: '$origp3' to '$p3'");
    }

    # fix up issue not in parens: "2003;64 Suppl 15:7-12"
    if ($p3 =~ s/;\s*(\d+)\s+(\w+)\s+(\d+):/;$1 ($2 $3):/) {
	warning("reformatted funny issue number: '$origp3' to '$p3'");
    }

    my $fields;

    # "1998 Feb;13(2):77-85"
    if ($p3 =~ m/(\d+)(\s+[\w\-]+)?(\s+\d+)?\s*[;,]\s*(\d+)\s*(\([^\)]+\))?\s*[:;,]\s*($PAGERE)\s*(-\s*$PAGERE)?/) {
	$fields = [$1, $2, $3, $4, $5, $6, $7];
    }

    # supplement as in  "1998; Supplement, 34, 24-28"
    # or journal name, as in: "1993; Archives of General Psychiatry, 50, 85-94" which exists just because 
    # we created it by inserting "year;" from being earlier in parens.
    elsif ($p3 =~ m/(\d+)\s*;\s*([\w ]+),\s*(\d+),\s*($PAGERE)-($PAGERE)/) {
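	# save the captures first: any later successful match resets $1..$5
	my ($syear, $words, $number, $sstart, $send) = ($1, $2, $3, $4, $5);
	warning("matched supplement in '$p3'");
	# call the volume "supplement"
	$words =~ s/\s+$//;
	if ($words =~ m/^suppl/i) {
	    $fields = [$syear, '', '', $words, $number, $sstart, $send];
	}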
	else {
	    warning("don't know what to do with words '$words'. maybe previous parts are a multi-sentence title and this is a journal name?");
	}
    }


    if ($fields) {
	my ($y, $month, $day, $volume, $issue, $pagestart, $pageend) = @$fields;
	debug("matched journal");
	$rec->{original} = $entry;
	$rec->{authors} = $parts->[0] || die "no authors: ", Dumper($parts);
	$rec->{title} = $parts->[1];
	$rec->{journal} = $parts->[2];
	$rec->{year} = $y;
	$month =~ s/^\s+// if $month;
	$rec->{month} = $month if $month;
	$day =~ s/^\s+// if $day;
	$rec->{day} = $day if $day;
	$rec->{volume} = $volume if $volume;
	$issue =~ s/\((.*)\)/$1/ if $issue;
	$rec->{issue} = $issue if $issue;
	$rec->{pagestart} = $pagestart if $pagestart;
	$pageend =~ s/^-// if $pageend;
	$rec->{pageend} = $pageend if $pageend;
	return record_rec('journal', $rec);
    }
    debug("can't parse as journal: '$p3'");
    return 0;
}

sub sane_year {
    my ($y) = @_;
    return ($y =~ m/^19\d\d$/ || $y =~ m/^20\d\d$/);
}

# BOOK FORMAT
# format of: "author. title. city: publisher, year"
#    Murray CJL, Lopez AD, eds. Summary: The global burden of disease: a comprehensive assessment of mortality and disability from diseases, injuries, and risk factors in 1990 and projected to 2020. Cambridge, MA: Published by the Harvard School of Public Health on behalf of the World Health Organization and the World Bank, Harvard University Press, 1996.
#    Thase ME, Jindal RD. Combining psychotherapy and psychopharmacology for treatment of mental disorders. In: Lambert MJ, editor. Bergin and Garfield’s Handbook of psychotherapy and behavior change. 5th ed. New York: John Wiley and Sons Inc; 2004.  p 743-66.
#    Rush AJ, Thase ME. Psychotherapies for depressive disorders: a review. In: Sartorius N, ed. WPA Series. Evidence and Experience in Psychiatry: Volume 1. Depressive Disorders. Chichester, UK: John Wiley and Sons; 1999:161-206.

sub try_any {
    my ($rec, $parts, $nparts, $entry) = @_;

    my $i = 0;
    # just keep pulling stuff out
    for my $part (@$parts) {
	if (!$part) {
	    warning("skipping empty part $i");
	    $i++; # keep the index in step with the loop before skipping
	    next;
	}

	# page range
	# "14(Suppl 2):S1-S6"
	# "52:559-88"
	# TODO: but not "Contract 290-97-0012" or "Publication No. 93-0551"
	# for now just make sure ending page doesn't start with 0
	# first pull out volume or issue prior to page range
	if ($part =~ m/(\d+)\s*:\s*\d+-\d+/) {
	    $rec->{volume} = $1;
	    # try to guess about volume vs. year
	    if (!$rec->{year} && sane_year($rec->{volume})) {
		$rec->{year} = $rec->{volume};
		delete $rec->{volume};
	    }
	    $part =~ s/\d+\s*:\s*(\d+-\d+)/$1/;
	}
	if ($part =~ m/($PAGERE)-($PAGERE)/) {
	    $rec->{pagestart} = $1;
	    $rec->{pageend} = $2;
	    # get rid of any "pp." crap too
	    $part =~ s/\:?\s*p?[a-z]?\.? ?$PAGERE\-$PAGERE//;
	}

	# "ed" could be "editor" or "edition"
	if ($part =~ m/^(\w+) edition$/i) {
	    $rec->{edition} = $1;
	    $part = '';
	}

	# year
	# TODO: we should take the later match, certainly any in the same part as a page range.
	# but the way we've implemented things now, we have no way to go back and restore what we pulled out.
	if (!$rec->{year} && " $part " =~ m/$YEARRE/) {
	    # try not to match a year in the middle of a title
	    if (sane_year($1) && $part !~ m/\w+ \d\d\d\d \w+/) {
		my $y = $rec->{year} = $1;
		$part =~ s/[,;]?\s*\(?$y\)?\;?\s*//;
	    }
	}

	if ($part =~ m/^$NESTEDRE/i) {
	    my $book = $1;
	    $rec->{book} = $book;
	    $rec->{type} = 'booksection';
	    # TODO: parse out authors from the line as editors, if some marker like "editors" or "(Eds.)"
	    $part = '';
	}

	# city and publisher
        # "Geneva: World Health Organization"
	# "New York: John Wiley and Sons Inc; 2004"
	# "Chichester, UK: John Wiley and Sons; 1999:161-206"
	# "Cambridge: Harvard University Press"
	# "Rockville, MD: Agency for Health Care Policy and Research"
	# "Rockville, Md: Dept of Health and Human Services, 1993; AHCPR Publication No. 93-0551"
	# "Washington, DC: American Psychiatric Association"
	# "Cambridge, MA: Published by the Harvard School of Public Health on behalf of the World Health Organization and the World Bank, Harvard University Press, 1996."
	# "Hyattsville, MD: National Center for Health Statistics, 2002"
	# problems:
	#   "Washington, D.C. American Psychiatric Association"
	#   "5th ed. New York: John Wiley and Sons Inc"
	if ($part =~ m/^(\w+(?:, \w\w)?): ([\w, ]+);?$/) {
	    $rec->{city} = $1;
	    $rec->{publisher} = $2;
	    $part = '';
	}
	elsif ($part =~ m/york/i) {warning("didn't match city in '$part'");}

	# authors
	# "Sudarsan B"
	# "Brown RP, Gerbarg PL, and Muskin PR"
	# "Rohini V, Pandey RS, Janakiramaiah N, Gangadhar BN, Vedamurthachar A"
	# "Wen-Shing Tseng, Jon Streltzer, Editors"
	# "Krishnan K, Delong M, Kraemer H, Carney R, Spiegel D, Gordon C, and others"
	# "Murray CJL, Lopez AD, eds"
	# "Mulrow CD, Williams JW, Jr., Trivedi M, et al."
	unless ($rec->{authors}) {
	    my @names = split(/\s*,\s*/,$part);
	    my $mismatch = 0;
	    $mismatch = 1 if $part =~ m/^\w+$/; # whole field is a single word
	    for (@names) {
		s/^and //;
		s/^& //;
		s/et al$//;
		next unless $_;
		if (m/^[a-z][\x80-\xFF\w]*( [a-z]\w*\.?)?\s*$/i) {}
		elsif (m/^Jr\.?$/) {}
		else {
		    warning("bad author '$part' because of '$_'") if $part =~ m/meredith/i;
		    $mismatch = 1;
		}
	    }
	    unless ($mismatch) {
		$rec->{authors} = $part;
		$part = '';
	    }
	}

	$parts->[$i] = $part;
	$i++;
    }

    # put first part into title, and rest into 'other'
    my $line = '';
    my $tcount = 0;
    my $title = '';
    for my $part (@$parts) {
	$part =~ s/^\s+//;
	$part =~ s/\s+$//;
	next unless $part;

	$tcount++;
	if ($title) {
	    $line .= $OTHER_SEP if $line;
	    $line .= $part;
	}
	else {
	    $title = $part;
	}
    }
    $rec->{title} = $title;
    $rec->{other} = $line if $line;
    my $rectype = $rec->{type} || 'generic';

    # if we only combined 1 part into the title, assume this is a book
    if ($tcount == 1 && $rec->{authors} && !$rec->{issue} && !$rec->{volume}) {
	$rectype = $rec->{type} ||'book';
	warning("we think we successfully parsed a '$rectype' with title '$title'"); 
    }
    elsif ($line =~ /Summary:/) {
	warning("tcount=$tcount");
    }
    $rec->{original} = $entry;

    warning("making a guess at record: ", Dumper($rec));
    return record_rec($rectype, $rec);
}

    # deal with any character encoding.
    # =20 
    # =96 -
    # =92 '
    # =E4 a:
    # =F1 n~
    #my $foo = "a=F1b=E4=92";
    #$foo =~ s/=([0-9A-F][0-9A-F])/chr(hex($1))/ge;
    #print STDERR "FOO=$foo\n";
sub convert_email_encoding {
    my ($lines) = @_;
    $lines =~ s/=([0-9A-F][0-9A-F])/chr(hex($1))/ge;
    return $lines;
}

sub parse {
    my $lines = '';
    while(<>) {$lines .= $_;}

    $lines = convert_email_encoding($lines) if $INPUT_EMAIL_ENCODING;

    # ensure records with an entry starter have blank lines before them.
    if ($ENTRY_START_RE) {
	$lines =~ s/\n($ENTRY_START_RE)/$INPUT_SEP$1/g;
    }

    # split up entries
    my @entries = split(/$INPUT_SEP_RE/, $lines);

    for my $entry (@entries) {
	parse_entry($entry); $COUNT++;
    }
}

sub output {
    my $i = 0;

    my $style = $OUTPUT_STYLES->{$OUTPUT_STYLE} || die "no such bib style '$OUTPUT_STYLE'";
    my ($RECSEP, $TAGSEP, $TAGS, $TYPES) = @$style;
 
    for my $rec (@RECORDS) {
	print $RECSEP if $i;
	$COUNT = $i;
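	# put the record type first (the %0 line conventionally opens an EndNote record), then the
	# remaining fields in a stable order instead of hash order
	for my $k (sort { ($b eq 'type') <=> ($a eq 'type') || $a cmp $b } keys %$rec) {
	    my $v = $rec->{$k};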
	    if (!$v) {warning("skipping missing value for field '$k' in record"); next;} 
	    # universal cleanups
	    $v =~ s/^\s+//;
	    $v =~ s/\s*[,;]?\s*$//;

	    my $tag = $k;
	    if ($k eq 'authors') {
		$tag = 'author';
		# try to split author into multiple fields. note that alternatively we could just let endnote do its "smart" parsing
		if ($OUTPUT_SPLIT_NAMES) {
                    # deal with Jr
		    $v =~ s/, Jr/ Jr/g; 
                    # in case input already has commas before initials
		    $v =~ s/, ([A-Z][A-Z]?[\.\,])/ $1/g; 
		    $v =~ s/, ([A-Z][A-Z]?)$/ $1/g;
                    # get rid of "&" and "and"
		    $v =~ s/,?\s*&/,/;
		    $v =~ s/,?\s* and /,/;
		    # if has "eds" or "editors" or "ed". TODO: "(Eds.)"
		    # strip the marker itself, including a bare "ed", which used to be left in place
		    if ($v =~ s/,? (?:eds?|editors)$//i) {
			$tag = 'editors';
		    }
		    # if has "et al"
		    my $etal = 0;
		    if ($v =~ m/ et\.? al$/) {
			$v =~ s/,? et\.? al//;
			$etal = 1;
		    }
		    # the "and" rewrite above turns ", and others" into ",others", so don't require a space
		    if ($v =~ s/,?\s*\bothers$//i) {
			$etal = 1;
		    }
		    my @names = split(/\s*,\s*/, $v);

		    # we assume "Smith J" and want to produce "Smith, J". we don't get "Joe Smith" on input.
		    my @outnames = ();
		    for (@names) {
			s/^(\w+) /$1, /;
			push(@outnames, $_);
		    }
		    if ($etal) {
			# end note wants "et al," so it will treat it as the last name and won't convert to "al, e"
			push(@outnames, "et al,");
		    }
		    $v = \@outnames;
		}
	    }
	    elsif ($k eq 'type') {
		$v = $TYPES->{$v} || warning("no endnote type for type '$v'");
	    }
	    elsif ($k eq 'pagestart') {
		$v = $v . '-' . $rec->{pageend} if $rec->{pageend};
		$tag = 'pages';
	    }
	    elsif ($k eq 'pageend') {next;}
	    if ($k eq 'year') {
		$tag = 'date';
		$v = $rec->{month} . " $v" if $rec->{month};
		$v = $rec->{day} . " $v" if $rec->{day};
	    }
	    elsif ($k eq 'month' || $k eq 'day') {next;}
	    elsif ($k eq 'recno') {
		$tag = 'label';
		$v = "N$v";
	    }
	    elsif ($k eq 'original') {
		$tag = 'abstract';
		$v = "original was: $v";
	    }
	    my $outtag = $TAGS->{$tag};
	    if (!$outtag) {warning("no known tag for internal tag '$tag'"); next;}
	    if (ref($v)) {
		for (@$v) {print $outtag, $TAGSEP, $_, "\n";}
	    }
	    else {
		print $outtag, $TAGSEP, $v, "\n";
	    }
	}
	$i++;
    }
}

sub main {
    parse();
    my @counts = %COUNTS;
    print STDERR "Made ", scalar(@RECORDS), " records out of $COUNT input entries. Counts of types created: @counts\n";  
    output();
}

main();

