Keeping a Bibliography

Tags: musings, research

A crucial part of becoming a researcher is building on research that others did. As the saying goes, ‘No Man is an Island’, and science typically advances by critiquing, extending, or, in some cases, tearing down the work of others. We are all standing on the shoulders of giants to some extent. When writing a publication, we acknowledge the work of others by adding appropriate citations. The entirety of these citations is typically referred to as a bibliography, but I will be using this term in a rather haphazard manner, extending its meaning to include ‘publications you read over the course of your Ph.D. research and beyond’.

Why?

Keeping a list of all the publications that you read (or just collected) about a certain topic is the cornerstone of scholarship. Of course, writing papers is important, but showing that you know how these papers fit into the grand scheme of things is crucial in your journey to become a researcher. Plus, acknowledging the work of others is polite and humble: it demonstrates that, despite all the strides we make in research, we can only be successful if we work together and share our knowledge.

On the pragmatic level, establishing a bibliography will make it easier for you to summarise papers and, ultimately, generate ideas[1]. It will also provide you with a sense of accomplishment; after all, reading papers is a large part of research and should be acknowledged as such. At the end of your Ph.D., when writing your thesis, your main bibliography will[2] contain dozens of papers, which you should have at least skimmed at some point. Moreover, there will also be a handful of papers that you refer to so often that you know them like the back of your hand. It is quite likely that, barring major changes[3] in your research directions, your bibliography will keep on growing and benefiting you for many years to come. Hence, it pays off to do this the right way from the beginning.

How?

The tried-and-true way of bibliography management in mathematics and computer science (and, much to my delight, an increasing number of other disciplines) involves BibTeX, a companion tool to LaTeX[4]. The basic idea is that you keep all your bibliographic entries in text files that follow the BibTeX format. In contrast to other methods of storage, this format has certain advantages:

  1. It can be read and edited by humans (with some training, but that is what you are here for!).
  2. It can be put under version control[5].
  3. It can be extended to contain more information, such as direct links to papers.
  4. It can be easily converted to other formats, making it highly versatile and flexible.

The great thing about this format is that there is a difference between the way you store your items (where the idea is that you add as much information as you can), and how they look in a document. For example, certain journals have their own styles for formatting a bibliography. BibTeX can easily accommodate them and leave out information or format it in a different way.
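
To make this separation concrete, here is a minimal sketch of a LaTeX document that cites the entry with the key Edelsbrunner02, which is shown later in this post. It assumes that the entry lives in a file called references.bib (a hypothetical name) next to the document, and it uses the standard plain style purely as an illustration:

\documentclass{article}

\begin{document}

% The key refers to an entry stored in references.bib.
See~\cite{Edelsbrunner02} for details on topological persistence.

% Controls how entries look; replacing `plain' by `abbrv' abbreviates
% first names without touching the .bib file at all.
\bibliographystyle{plain}

% Name of the .bib file, given without its extension.
\bibliography{references}

\end{document}

With classic BibTeX, such a document is typically compiled by running pdflatex, then bibtex, and then pdflatex twice more, so that all references are resolved.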

To get you started with writing BibTeX, grab your favourite text editor or a special BibTeX editor[6] and create a new file. Every entry in the file follows the same format: you specify the type of the item that you want to add (such as an article in a journal or a book), followed by a description of its properties (such as authors or a title). An example entry could look like this:

@article{Edelsbrunner02,
  author    = {Edelsbrunner, Herbert and Letscher, David and Zomorodian, Afra},
  title     = {Topological Persistence and Simplification},
  journal   = {Discrete {\&} Computational Geometry},
  year      = {2002},
  volume    = {28},
  number    = {4},
  month     = nov,
  publisher = {Springer},
  pages     = {511--533},
  doi       = {10.1007/s00454-002-2885-2},
}

Let us decompose this entry:

  • Edelsbrunner02 is an internal identifier. You use it whenever referring to that specific publication. For example, if you are citing the paper, you would write \cite{Edelsbrunner02}.[7]

  • The author is the list of authors of the paper, separated by the keyword and. Notice that I specified each of them with the surname first. This makes it easier for BibTeX to detect how an author name should be formatted. It pays off when you have names that are more complex, such as Laurens van der Maaten of t-SNE fame. Formatting his name as van der Maaten, Laurens will tell BibTeX that everything before the comma is a surname. If your style abbreviates first names, his name will now be abbreviated as van der Maaten, L., instead of a monstrosity like Maaten, L.V.D.[8]

  • The title refers to the title of the paper. I kept the capitalisation of the original paper intact, but whether upper-case or lower-case letters are used is at the discretion of the bibliography style that you use in practice. Hence, to ensure that proper nouns are capitalised correctly, you need to enclose them in curly braces. For example, a paper entitled Everything you wanted to know about Gaussian elimination should be formatted as title = {Everything you wanted to know about {G}aussian elimination}. You can also include whole words in curly braces in order to prevent BibTeX from changing them. This is great for things like t-SNE, which you can provide as {t-SNE} in the title.

  • The journal contains the name of the journal. I had to escape the ampersand ‘&’ in the journal name as \& because LaTeX would complain otherwise. Again, the capitalisation of journal titles might be changed depending on the style, but it is good practice to use the original capitalisation of the journal.

  • year, volume, and number are all self-explanatory and contain, for once, no pitfalls. The volume field refers to the time in history when journals would come in different volumes to collect articles within a certain period. The number refers to a more specific issue of the respective journal. It is thus more specific and should only be used in conjunction with the volume field, but never alone. As an example of this organisational style, the aforementioned journal Discrete & Computational Geometry assigns ‘Volume 63’ to all articles published from January to April 2020. The first issue of Volume 63 appeared in January.

  • month is a dangerous field. If you specify the month by a three-letter abbreviation like this (and without the curly braces), BibTeX can automatically provide the proper names in all kinds of languages. Ostensibly, this is a great feature if you are writing in multiple languages, but while I was initially very much in favour of always adding a month to my entries, it started losing its relevance over the past few years. I can in good conscience say that I never used the month field to look up information about a paper. There is one saving grace for it, though: the field is used internally for sorting! Thus, if you have multiple articles by the same authors over the same year, the month field helps in establishing a consistent sorting order.

  • publisher is another one of these self-explanatory but ultimately dangerous fields. The publisher string is almost never formatted directly by BibTeX, so make sure you are consistent with adding content there. Moreover, most bibliography styles ignore the publisher for an article anyway. I only included it here to describe it briefly.

  • pages is probably the most misused field. The idea is to specify a range of pages for the respective article. Hence, you need to use an ‘en-dash’, i.e. ‘–’, which is written as ‘--’ in LaTeX. You are not supposed to use spaces here or any other kind of dash. While page ranges can be seen as a charming remnant of the past, there are still some articles that are only available in the real world and have not been digitised yet. Thus, I keep on using this field even though I never used it to find an article (I did use it to find chapters in books, though, which is why I added the field to this example; there are situations in which the field is useful).

  • doi is one of my favourite fields. It refers to the Digital Object Identifier System and makes it possible to—finally—locate a specific bibliographic item with a single click by providing a persistent URL under which it can be reached. I always add DOIs to my bibliographies whenever I can get away with it (many venues use bibliography styles that discard them, though). For your own dissertation, I would definitely recommend them—modern BibTeX and BibLaTeX styles support them nicely and will format them as URLs that can be clicked. Make sure to just use the plain DOI in the field; there is no need for providing a URL directly.
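
As a second illustration of the points on names and titles, an entry for the t-SNE paper mentioned above could look roughly like this. The key Maaten08 is an arbitrary choice following the pattern of the first example, and the bibliographic details are given to the best of my knowledge, so please double-check them against the journal website before reusing the entry:

@article{Maaten08,
  author  = {van der Maaten, Laurens and Hinton, Geoffrey},
  title   = {Visualizing Data using {t-SNE}},
  journal = {Journal of Machine Learning Research},
  year    = {2008},
  volume  = {9},
  pages   = {2579--2605},
}

The surname-first format keeps ‘van der’ attached to the surname, and the inner curly braces protect t-SNE from being re-capitalised by a bibliography style.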

If that seems like a lot of information for a single entry, do not despair—many other item types exist and they share the same set of elements. Let us discuss a few of them before providing more examples.

Common entry types

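  • article: you already encountered this entry type above. It is meant for research articles that have been published in journals. The required fields are author, title, journal, year. The standard BibTeX styles treat volume as optional, but you should provide it whenever the journal uses volumes.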

  • book: use this to cite a book that was published somewhere. The required fields are author/editor (it is sufficient to specify one of them), title, publisher, year.

  • incollection: use this to cite a part of a book that has its own title and authors. For example, my article Agreement Analysis of Quality Measures for Dimensionality Reduction was published in the book Topological Methods in Data Analysis and Visualization IV. If you do not want to cite the book as a whole but only my individual contribution, you should use incollection. The required fields are author, title, booktitle, publisher, year. The booktitle field should contain the title of the book, i.e. of the collection itself. As a rule, if a chapter has its own author, you probably want to use incollection.

  • inproceedings: use this to cite an article that was published in conference proceedings. For example, papers published at ICML should generally be added as this type. The required fields are author, title, booktitle, year. The field booktitle is confusing. Here, it refers to the proceedings itself. For example, if you cite something from ICML 2019, you should use ‘Proceedings of the 36th International Conference on Machine Learning’ as its content.

  • mastersthesis and phdthesis: use this to refer to a thesis. The required fields are author, title, school, year. The field school is a free-form field in the sense that you can provide the name of the institution. For example, to cite my Ph.D. thesis, you would use school = {Ruprecht-Karls-Universit{\"a}t Heidelberg}, since this is the German name of my university.

  • techreport: use this to cite a technical report, i.e. a report published by some institution that did not necessarily undergo peer-review (except for maybe an internal review). The required fields are author, title, institution, year. The field institution refers to the university or other entity that published this report. You can also use this type to refer to other forms of grey literature, i.e. publications that do not fall under the traditional categories of academic publishing. A research report, for example, could also be considered a techreport.

  • misc: use this as a last resort to add bibliographic information in almost free form. This type has no required fields, but a few optional ones, including author, title, and note. You can use this to refer to companies or software projects, for example.
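
The following skeletons sketch what the required fields look like in practice for three of these types. Apart from the ICML proceedings title and the school name, which are taken from the descriptions above, all keys and values are placeholders that you would need to replace:

@inproceedings{PlaceholderPaper,
  author    = {Lastname, Firstname and Lastname, Firstname},
  title     = {Title of the Conference Paper},
  booktitle = {Proceedings of the 36th International Conference on Machine Learning},
  year      = {2019},
}

@incollection{PlaceholderChapter,
  author    = {Lastname, Firstname},
  title     = {Title of the Chapter},
  booktitle = {Title of the Book},
  publisher = {Name of the Publisher},
  year      = {YYYY},
}

@phdthesis{PlaceholderThesis,
  author = {Lastname, Firstname},
  title  = {Title of the Thesis},
  school = {Ruprecht-Karls-Universit{\"a}t Heidelberg},
  year   = {YYYY},
}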

Optional fields

Having seen the most common entry types, you should be aware that most of them support numerous optional fields. For example, article supports the optional pages field. Wikipedia has a great breakdown of optional and required fields for different entry types.

Whether to use all of them or only the required ones is at your discretion. I tend to take a pragmatic view here: you should add all information that is necessary to identify the work that was added to your bibliography, as well as provide some context about it. For example, I prefer adding editor fields to all entries whenever appropriate; I see this as a professional courtesy towards the people who edited a certain work. Whether I can use all of these fields in a bibliography for a paper is a different matter—remember that curating a bibliography and using it in practice are two different things; for most of your academic publications, a publisher or conference will dictate how entries are formatted and which fields are being included in them.

How to cite conference papers

After these theoretical examples, here are some practical considerations when adding machine learning articles to your bibliography.

  • ICML papers: download the appropriate BibTeX file from http://proceedings.mlr.press (for some reason, you have to click on ‘abs’ to get the abstract of a paper before links to BibTeX files are shown). Some adjustments are needed, though: remove the address field[9] and the month field[10]. Make sure that all names follow the Last, First format.

  • NeurIPS papers: download the appropriate BibTeX file from http://papers.neurips.cc. Change the entry type to inproceedings. The remainder of the file is fine, but be sure to specify editor entries correctly; for some reason, the exported entries do not follow the Last, First format. Moreover, for the 2019 proceedings, one of the editors is formatted incorrectly. Her name should be specified as d'Alch{\'e}-Buc, F.; if you are using BibLaTeX, you can also directly specify the accent; it supports UTF-8.

  • ICLR papers: download the appropriate BibTeX file from https://openreview.net. The format works well out of the box, and, having only a few entries, there is nothing you have to fix. Be mindful of the capitalisation rules, though!

For each of these conferences, consider adding their respective abbreviation in the booktitle field. Other than that, there is not much you can do here. By the way: the instructions for ICML also apply to a number of other venues, including AISTATS, COLT, and MLHC! It is great that PMLR is providing this service.
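
To illustrate the adjustments for ICML papers, the following sketch shows what a cleaned-up entry might look like. Key, author, and title are placeholders; the booktitle is the one mentioned earlier for ICML 2019, with the abbreviation appended. The venue and eventdate fields are only understood by BibLaTeX, and their values are given to the best of my knowledge, so they should be verified against the downloaded file:

@inproceedings{PlaceholderICML19,
  author    = {Lastname, Firstname and Lastname, Firstname},
  title     = {Title of the Paper},
  booktitle = {Proceedings of the 36th International Conference on Machine Learning~(ICML)},
  year      = {2019},
  venue     = {Long Beach, California, USA},
  eventdate = {2019-06-09/2019-06-15},
}

Classic BibTeX styles silently ignore fields they do not know, so keeping venue and eventdate in the file does no harm if you have to fall back to BibTeX later.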

How to cite preprints

I have skirted around the problem of citing arXiv preprints because there is no formal standard. There are, however, three common scenarios:

Scenario 1: BibLaTeX and your own bibliography style

This is the nicest scenario: you get to use your own bibliography style and you are allowed to use BibLaTeX. In this case, use the misc type and the additional fields eprint, archiveprefix, and primaryclass to format the entry. As an example, suppose you want to cite the arXiv preprint PersLay: A Neural Network Layer for Persistence Diagrams and New Graph Topological Signatures. I would format it as follows:

@misc{Carriere19,
  author        = {Carri{\`e}re, Mathieu and Chazal, Fr{\'e}d{\'e}ric and Ike, Yuichi and Lacombe, Th{\'e}o and Royer, Martin and Umeda, Yuhei},
  title         = {{P}ers{L}ay: A Neural Network Layer for Persistence Diagrams and New Graph Topological Signatures},
  year          = {2019},
  eprint        = {1904.09378},
  archiveprefix = {arXiv},
  primaryclass  = {stat.ML},
}

You can see that eprint contains the internal ID assigned by arXiv, and archiveprefix specifies that it is an arXiv article. The primaryclass field is helpful in declaring the main subject assignment of the preprint but it is not necessary.
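
For completeness, here is a minimal sketch of a document that would typeset this entry with BibLaTeX. The file name references.bib and the numeric style are assumptions made for the sake of the example:

\documentclass{article}

% biber is the default backend of recent BibLaTeX versions; it is
% spelled out here for clarity.
\usepackage[backend=biber, style=numeric]{biblatex}

% In contrast to BibTeX, the file extension is required here.
\addbibresource{references.bib}

\begin{document}

PersLay is described in~\cite{Carriere19}.

\printbibliography

\end{document}

Compiling this requires running biber instead of bibtex between the LaTeX runs. The standard BibLaTeX styles should then print the eprint information as an arXiv identifier, together with the primary class.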

Scenario 2: BibTeX and a pre-defined bibliography style

In this case, to be on the safe side with most styles, I tend to use the article type (which is wrong because the journal field is required, so please consider this a workaround only). Hence, the aforementioned preprint would be formatted like this:

@article{Carriere19,
  author        = {Carri{\`e}re, Mathieu and Chazal, Fr{\'e}d{\'e}ric and Ike, Yuichi and Lacombe, Th{\'e}o and Royer, Martin and Umeda, Yuhei},
  title         = {{P}ers{L}ay: A Neural Network Layer for Persistence Diagrams and New Graph Topological Signatures},
  year          = {2019},
  eprint        = {1904.09378},
  archiveprefix = {arXiv},
  primaryclass  = {stat.ML},
}

Some bibliography styles, such as the one used by ICML, are incapable of formatting such an entry correctly. In this case, I add the field pages with incorrect information:

pages = {arXiv:1904.09378}

You should only do this if you are forced to use a bibliography style that does not support arXiv preprints otherwise! In all other cases, consider using misc and providing information about the eprint etc. As a ‘milder’ form of formatting the entry, you could also add a ‘fake’ journal by setting journal = {arXiv e-prints} or journal = {arXiv preprint}. This is sometimes suggested when you export a citation from arXiv. I cannot say that I love this practice, but it works reasonably well.
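
For reference, the ‘fake’ journal workaround could look as follows. This sketch merely reuses the fields from above and adds one of the journal strings mentioned in the text; whether the eprint fields are printed still depends on the bibliography style:

@article{Carriere19,
  author        = {Carri{\`e}re, Mathieu and Chazal, Fr{\'e}d{\'e}ric and Ike, Yuichi and Lacombe, Th{\'e}o and Royer, Martin and Umeda, Yuhei},
  title         = {{P}ers{L}ay: A Neural Network Layer for Persistence Diagrams and New Graph Topological Signatures},
  journal       = {arXiv e-prints},
  year          = {2019},
  eprint        = {1904.09378},
  archiveprefix = {arXiv},
  primaryclass  = {stat.ML},
}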

Scenario 3: BibTeX and your own bibliography style

First of all, consider using BibLaTeX as a package in your documents; it will make formatting your bibliography much easier. If you do not want to make the switch, I would first stick with the misc type as described above. If that does not work, use article or, in the worst case, the unpublished type.
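
If you do end up with the unpublished type, keep in mind that the standard BibTeX styles require a note field for it, in addition to author and title. One possible sketch places the arXiv identifier in the note; this placement is only a suggestion and not an established convention:

@unpublished{Carriere19,
  author = {Carri{\`e}re, Mathieu and Chazal, Fr{\'e}d{\'e}ric and Ike, Yuichi and Lacombe, Th{\'e}o and Royer, Martin and Umeda, Yuhei},
  title  = {{P}ers{L}ay: A Neural Network Layer for Persistence Diagrams and New Graph Topological Signatures},
  year   = {2019},
  note   = {arXiv:1904.09378},
}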

Common pitfalls

Having now discussed at length how to keep entries in a bibliography, I want to close this post with a list of common pitfalls and how to avoid them:

  1. Double-check all .bib files that you download. Publishers are notorious for incorrectly-formatted files. While they might work, you might introduce problems in your bibliography that are hard to find later on.

  2. Always check author names and reformat them, if necessary. A full name with an initial is best stored as Riker, William T. as it permits BibTeX to abbreviate it as W.T. Riker. It is a common mistake to provide abbreviated names already in the file, such as Riker, WT. This will be formatted as W. Riker. Hence, if you only have initials available, it is best to separate them by periods and spaces, as in Riker, W. T.

  3. Check for duplicated entries, in particular for files downloaded from somewhere else.

  4. Choose the right entry type as outlined above. People often use inbook when they actually mean incollection. The former is almost always unnecessary (at least in machine learning, where we only tend to cite publications that can be assigned to one or more individuals).

  5. Remove superfluous information from all items. Only keep the things that are required to uniquely identify a publication and put it into context. For many modern publications, there is no need to keep an ISSN, for example.

  6. Use DOIs whenever you can. Remember that not every field in a bibliographic item needs to be shown—but having a DOI makes it easier for you to track down an article later on.

  7. Be consistent with journal titles, abbreviations, and the like.

  8. Check the capitalisation of your entries. Do not fiddle too much with curly braces (some people suggest putting the whole title in curly braces, but this essentially removes all options for reformatting later on).

If that list has not worn you out, there is also a great discussion of more common mistakes, courtesy of TeX StackExchange.
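
One way of staying consistent with journal titles (pitfall 7) is BibTeX's @string command, which defines an abbreviation once and lets you reuse it in every entry. This is a sketch of one possible technique, not something the list above prescribes; the abbreviation name dcg is an arbitrary choice:

@string{dcg = "Discrete {\&} Computational Geometry"}

@article{Edelsbrunner02,
  author  = {Edelsbrunner, Herbert and Letscher, David and Zomorodian, Afra},
  title   = {Topological Persistence and Simplification},
  journal = dcg,
  year    = {2002},
  volume  = {28},
  number  = {4},
  pages   = {511--533},
  doi     = {10.1007/s00454-002-2885-2},
}

Just like the month abbreviations, the name is used without curly braces; otherwise, BibTeX would treat it as literal text.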

The next steps

By now you should be familiar with the basic rules in keeping a bibliography using BibTeX. When you start curating your own entries, strive for consistency and correctness. This will make your life much easier and permit you to use bibliographic entries efficiently.

If you want to learn more, I would suggest reading Tame the BeaST, which discusses many details and provides the rationale behind certain choices in BibTeX. Moreover, you should consider using BibLaTeX whenever you can; it makes formatting your bibliography so much easier. Finally, if you want to see BibLaTeX in action, you might want to take a look at latex-mimosis, my document class providing a minimal and modern LaTeX template for all your thesis needs.

Happy bibliography management, until next time!


  1. The aspects of idea generation have to wait for a subsequent post. Here, I want to focus on the practicalities. In the meantime, you might enjoy reading about the Zettelkasten method.

  2. I want to avoid the word should here, because it is hard to quantify differences between fields. A Ph.D. thesis should not be about numbers; there are exceptional theses that only cite very few works, simply because there was no related work to begin with. My statement should thus be considered nothing more than a rule of thumb.

  3. Of course, not all papers in your bibliography will remain equally important, but in my experience, taking care to curate a bibliography has paid off nicely so far.

  4. There is also a more modern package called BibLaTeX that complements and/or replaces BibTeX. However, since this post is not about typesetting a bibliography, I will not provide further details about this package and use the two words synonymously, except for when there is a real difference between the two.

  5. Meaning that your whole writing workflow can be easily extended to accommodate modern version control software such as git or Mercurial. If you are not using one of these, you should, but that is the topic of another post.

  6. Since I will be discussing how to write BibTeX files manually, it really does not matter which editor you pick. Here is a non-comprehensive list of some free ones: BibDesk, JabRef, KBibTeX, Mendeley, and Pybliographer. Personally, I prefer JabRef, but these days, I also write and curate BibTeX files manually.

  7. Since this presupposes the existence of a LaTeX environment, it is slightly beyond the scope of this article. I will add a link to an example document, though.

  8. A quick search in your favourite search engine of this combination and of other combinations will reveal that certain bibliographies commit that mistake. I am sure Laurens is used to this by now, but we should always strive to cite people correctly.

  9. This field is meant to specify the address of the publisher. ICML/PMLR uses it to specify the location of the conference. This is incorrect; you should use the venue field instead. It is fully supported by BibLaTeX.

  10. This field is incorrectly formatted. It is supposed to contain only the month, but ICML/PMLR uses it to specify the date of the conference. Again, this is incorrect; you should use the eventdate field instead, which is fully supported by BibLaTeX.