An index to quantify an individual's scientific research output
J. E. Hirsch *
Department of Physics, University of California at San Diego, La Jolla, CA 92093-0319
Communicated by Manuel Cardona, Max Planck Institute for Solid State Research, Stuttgart, Germany, September 1, 2005 (received for review August 15, 2005)
Abstract
I propose the index h, defined as the number of papers with citation number ≥h, as a useful index to characterize the scientific output of a researcher.
Keywords: citations, impact, unbiased
For the few scientists who earn a Nobel prize, the impact and relevance of their research is unquestionable. Among the rest of us, how does one quantify the cumulative impact and relevance of an individual's scientific research output? In a world of limited resources, such quantification (even if potentially distasteful) is often needed for evaluation and comparison purposes (e.g., for university faculty recruitment and advancement, award of grants, etc.).
The publication record of an individual and the citation record clearly are data that contain useful information. That information includes the number ($N_p$) of papers published over n years, the number of citations ($N_c^j$) for each paper (j), the journals where the papers were published, their impact parameter, etc. This large amount of information will be evaluated with different criteria by different people. Here, I would like to propose a single number, the “h index,” as a particularly simple and useful way to characterize the scientific output of a researcher.
A scientist has index h if h of his or her $N_p$ papers have at least h citations each and the other ($N_p$ – h) papers have ≤h citations each.
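The definition above can be turned into a few lines of code. The following is a minimal sketch, assuming the citation counts are available as a plain list of non-negative integers; the function name `h_index` and the sample data are illustrative and do not come from the paper.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    # Rank papers from most cited to least cited.
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break     # counts only decrease from here, so no larger h is possible
    return h


if __name__ == "__main__":
    sample = [25, 8, 5, 3, 3, 1, 0]  # hypothetical citation counts
    print(h_index(sample))
```

Sorting in descending order mirrors the "times cited" ordering described later in the text: h is the last rank at which the citation count is still at least as large as the rank.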
The research reported here concentrated on physicists; however, I suggest that the h index should be useful for other scientific disciplines as well. (At the end of the paper I discuss some observations for the h index in biological sciences.) The highest h among physicists appears to be E. Witten's h, which is 110. That is, Witten has written 110 papers with at least 110 citations each. That gives a lower bound on the total number of citations to Witten's papers at $h^2$ = 12,100. Of course, the total number of citations ($N_{c,\mathrm{tot}}$) will usually be much larger than $h^2$, because $h^2$ both underestimates the total number of citations of the h most-cited papers and ignores the papers with <h citations. The relation between $N_{c,\mathrm{tot}}$ and h will depend on the detailed form of the particular distribution (1), and it is useful to define the proportionality constant a as
$$N_{c,\mathrm{tot}} = a h^2. \tag{1}$$
I find empirically that a ranges between 3 and 5.
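As a rough illustration of the lower bound and of the constant a, the sketch below computes both from a list of citation counts. It assumes the relation is read as a = (total citations) / h²; the function names and the sample data are hypothetical.

```python
def h_index(citations):
    """h = number of ranks at which the citation count is at least the rank."""
    ranked = sorted(citations, reverse=True)
    return sum(1 for rank, count in enumerate(ranked, start=1) if count >= rank)


def proportionality_constant(citations):
    """Return a = N_c_tot / h**2 for a list of citation counts."""
    h = h_index(citations)
    if h == 0:
        raise ValueError("a is undefined when h is zero")
    return sum(citations) / h**2


if __name__ == "__main__":
    sample = [25, 8, 5, 3, 3, 1, 0]  # hypothetical citation counts
    h = h_index(sample)
    # The h most-cited papers alone guarantee at least h*h citations in total.
    print(h, h * h, sum(sample), proportionality_constant(sample))
```

Because the h top papers each have at least h citations, `sum(citations) >= h * h` always holds, so a can never fall below 1 for real data.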
Other prominent physicists with high hs are A. J. Heeger (h = 107), M. L. Cohen (h = 94), A. C. Gossard (h = 94), P. W. Anderson (h = 91), S. Weinberg (h = 88), M. E. Fisher (h = 88), M. Cardona (h = 86), P. G. deGennes (h = 79), J. N. Bahcall (h = 77), Z. Fisk (h = 75), D. J. Scalapino (h = 75), G. Parisi (h = 73), S. G. Louie (h = 70), R. Jackiw (h = 69), F. Wilczek (h = 68), C. Vafa (h = 66), M. B. Maple (h = 66), D. J. Gross (h = 66), M. S. Dresselhaus (h = 62), and S. W. Hawking (h = 62). I argue that h is preferable to other single-number criteria commonly used to evaluate scientific output of a researcher, as follows:
(i) Total number of papers ($N_p$). Advantage: measures productivity. Disadvantage: does not measure importance or impact of papers.
(ii) Total number of citations ($N_{c,\mathrm{tot}}$). Advantage: measures total impact. Disadvantage: hard to find and may be inflated by a small number of “big hits,” which may not be representative of the individual if he or she is a coauthor with many others on those papers. In such cases, the relation in Eq. 1 will imply a very atypical value of a, >5. Another disadvantage is that $N_{c,\mathrm{tot}}$ gives undue weight to highly cited review articles versus original research contributions.
(iii) Citations per paper (i.e., ratio of $N_{c,\mathrm{tot}}$ to $N_p$). Advantage: allows comparison of scientists of different ages. Disadvantage: hard to find, rewards low productivity, and penalizes high productivity.
(iv) Number of “significant papers,” defined as the number of papers with >y citations (for example, y = 50). Advantage: eliminates the disadvantages of criteria i, ii, and iii and gives an idea of broad and sustained impact. Disadvantage: y is arbitrary and will randomly favor or disfavor individuals, and y needs to be adjusted for different levels of seniority.
(v) Number of citations to each of the q most-cited papers (for example, q = 5). Advantage: overcomes many of the disadvantages of the criteria above. Disadvantage: it is not a single number, making it more difficult to obtain and compare. Also, q is arbitrary and will randomly favor and disfavor individuals.
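For comparison, the five criteria listed above can each be computed from the same list of citation counts. This is a minimal sketch; the function name, the dictionary keys, and the sample data are illustrative assumptions, and the defaults y = 50 and q = 5 are the example values quoted in the list.

```python
def single_number_criteria(citations, y=50, q=5):
    """Compute the alternative criteria discussed in the list for one researcher."""
    ranked = sorted(citations, reverse=True)
    n_papers = len(ranked)
    n_citations = sum(ranked)
    return {
        "N_p": n_papers,                                   # total number of papers
        "N_c_tot": n_citations,                            # total number of citations
        "citations_per_paper": n_citations / n_papers if n_papers else 0.0,
        "significant_papers": sum(1 for c in ranked if c > y),  # papers with > y citations
        "top_q": ranked[:q],                               # citations of the q most-cited papers
    }


if __name__ == "__main__":
    sample = [120, 60, 51, 50, 10, 2]  # hypothetical citation counts
    for name, value in single_number_criteria(sample).items():
        print(name, value)
```

Note that the last entry is a list rather than a single number, which is the drawback the text points out for that criterion.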
Instead, the proposed h index measures the broad impact of an individual's work, avoids all of the disadvantages of the criteria listed above, usually can be found very easily by ordering papers by “times cited” in the Thomson ISI Web of Science database (http://isiknowledge.com),† and gives a ballpark estimate of the total number of citations (Eq. 1).
Thus, I argue that two individuals with similar hs are comparable in terms of their overall scientific impact, even if their total number of papers or their total number of citations is very different. Conversely, comparing two individuals (of the same scientific age) with a similar number of total papers or of total citation count and very different h values, the one with the higher h is likely to be the more accomplished scientist.
For a given individual, one expects that h should increase approximately linearly with time. In the simplest possible model, assume that the researcher publishes p papers per year and that each published paper earns c new citations per year every subsequent year. The total number of citations after n + 1 years is then
$$N_{c,\mathrm{tot}} = \sum_{j=1}^{n} p c j = \frac{p c\, n(n+1)}{2}. \tag{2}$$
Assuming all papers up to year y contribute to the index h, we have
$$(n - y)\, c = h, \tag{3a}$$
$$p y = h. \tag{3b}$$
The left side of Eq. 3a is the number of citations to the most recent of the papers contributing to h; the left side of Eq. 3b is the total number of papers contributing to h. Hence, from Eq. 3,
$$h = \frac{c}{1 + c/p}\, n. \tag{4}$$
The total number of citations (for not-too-small n) is then approximately
$$N_{c,\mathrm{tot}} \approx \frac{(1 + c/p)^2}{2c/p}\, h^2, \tag{5}$$
of the form Eq. 1. The coefficient a depends on the number of papers and the number of citations per paper earned per year as given by Eq. 5. As stated earlier, we find empirically that a ≈ 3–5 is a typical value. The linear relation
$$h \sim m n \tag{6}$$
should hold quite generally for scientists who produce papers of similar quality at a steady rate over the course of their careers; of course, m will vary widely among different researchers. In the simple linear model, m is related to c and p as given by Eq. 4. Quite generally, the slope of h versus n, the parameter m, should provide a useful yardstick to compare scientists of different seniority.
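The linear model is simple enough to simulate directly. The sketch below builds the citation list for a researcher who publishes p papers per year, each earning c citations in every subsequent year, and compares the resulting h with the closed form h = c n / (1 + c/p) that follows from the two conditions (n – y)c = h and py = h. The function names and the parameter values are illustrative assumptions.

```python
def simulate_linear_model(p, c, n):
    """Citation counts after year n for p papers per year published in years 0..n.

    A paper published in year j has earned c citations in each of the
    (n - j) subsequent years.
    """
    citations = []
    for year in range(n + 1):
        citations.extend([c * (n - year)] * p)
    return citations


def h_index(citations):
    """h = number of ranks at which the citation count is at least the rank."""
    ranked = sorted(citations, reverse=True)
    return sum(1 for rank, count in enumerate(ranked, start=1) if count >= rank)


def predicted_h(p, c, n):
    """Closed form h = c * n / (1 + c / p) of the linear model."""
    return c * n / (1 + c / p)


if __name__ == "__main__":
    p, c, n = 5, 10, 30  # illustrative values, not taken from the paper
    cites = simulate_linear_model(p, c, n)
    print("total citations:", sum(cites), "closed form:", p * c * n * (n + 1) // 2)
    print("simulated h:", h_index(cites), "predicted h:", predicted_h(p, c, n))
```

Because papers come in discrete yearly batches, the simulated h can differ slightly from the continuous prediction for other parameter choices, but it grows linearly with n at the predicted slope.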
In the linear model, the minimum value of a in Eq. 1 is a = 2, for the case c = p, where the papers with >h citations and those with <h citations contribute equally to the total $N_{c,\mathrm{tot}}$. The value of a will be larger for both c > p and c < p. For c > p, most contributions to the total number of citations arise from the “highly cited papers” (the h papers that have $N_c > h$), whereas for c < p, it is the sparsely cited papers (the $N_p$ – h papers that have <h citations each) that give the largest contribution to $N_{c,\mathrm{tot}}$. We find that the first situation holds in the vast majority of, if not all, cases. For the linear model defined in this example, a = 4 corresponds to c/p = 5.83 (the other value that yields a = 4, c/p = 0.17, is unrealistic).
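The quoted ratios can be checked by inverting the coefficient of the linear model. The short derivation below is a sketch that assumes the coefficient has the form a = (1 + c/p)² / (2c/p), which is what the model gives for not-too-small n; the shorthand x = c/p is introduced here only for convenience.

```latex
\begin{align*}
a &= \frac{(1+x)^2}{2x}, \qquad x \equiv c/p \\
0 &= x^2 - 2(a-1)\,x + 1 \\
x &= (a-1) \pm \sqrt{a^2 - 2a} \\
a = 2 &: \quad x = 1 \\
a = 4 &: \quad x = 3 \pm \sqrt{8} \approx 5.83 \ \text{or}\ 0.17
\end{align*}
```

The two roots are reciprocals of each other (their product is 1), which is why every a > 2 corresponds to one case with c > p and one with c < p.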
The linear model defined above corresponds to the distribution
$$N_c(y) = N_0 - \left(\frac{N_0}{h} - 1\right) y, \tag{7}$$
where $N_c(y)$ is the number of citations to the yth paper (ordered from most cited to least cited) and $N_0$ is the number of citations of the most highly cited paper ($N_0 = cn$ in the example above). The total number of papers $y_m$ is given by $N_c(y_m) = 0$; hence,
$$y_m = \frac{N_0 h}{N_0 - h}. \tag{8}$$
We can write $N_0$ and $y_m$ in terms of a defined in Eq. 1 as
$$N_0 = h\left[a \pm \sqrt{a^2 - 2a}\right], \tag{9a}$$
$$y_m = h\left[a \mp \sqrt{a^2 - 2a}\right]. \tag{9b}$$
For a = 2, $N_0 = y_m = 2h$. For larger a, the upper sign in Eq. 9 corresponds to the case where the highly cited papers dominate (the more realistic case), and the lower sign corresponds to the case where the less frequently cited papers dominate the total citation count.
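A small numerical check of the linear distribution may help here. The sketch below assumes the upper-sign solution N_0 = h[a + sqrt(a² – 2a)] and y_m = h[a – sqrt(a² – 2a)], which follows from requiring that the straight line passes through the point (h, h) and encloses a triangular area of a h². The function name and the test values are illustrative.

```python
import math


def linear_distribution_parameters(h, a):
    """Return (N_0, y_m) for the linear citation distribution, upper-sign branch.

    N_0 is the citation count of the most-cited paper and y_m the total
    number of papers, for the case where highly cited papers dominate.
    """
    if a < 2:
        raise ValueError("the linear model requires a >= 2")
    root = math.sqrt(a * a - 2 * a)
    n0 = h * (a + root)
    ym = h * (a - root)
    return n0, ym


if __name__ == "__main__":
    h, a = 10, 4  # illustrative values
    n0, ym = linear_distribution_parameters(h, a)
    area = 0.5 * n0 * ym               # area of the triangle under the line
    at_h = n0 - (n0 / h - 1) * h       # value of the line at y = h
    print(n0, ym, area, at_h)
```

The triangle area reproduces a h² and the line evaluates to h at y = h, so the two expressions are consistent with the distribution they are derived from.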
In a more realistic model, $N_c(y)$ will not be a linear function of y. Note that a = 2 can safely be assumed to be a lower bound quite generally, because a smaller value of a would require the second derivative $\partial^2 N_c/\partial y^2$ to be negative over large regions of y, which is not realistic. The total number of citations is given by the area under the $N_c(y)$ curve that passes through the point $N_c(h) = h$. In the linear model, the lowest a = 2 corresponds to the line of slope –1, as shown in Fig. 1.
Fig. 1. Schematic curve of number of citations versus paper number, with papers numbered in order of decreasing citations. The intersection of the 45° line with the curve gives h. The total number of citations is the area under the curve. Assuming the second derivative is nonnegative everywhere, the minimum area is given by the distribution indicated by the dotted line, yielding a = 2 in Eq. 1.
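A more realistic model would be a stretched exponential of the form
$$N_c(y) = N_0 \exp\left[-\left(\frac{y}{y_0}\right)^{\beta}\right]. \tag{10}$$
Note that for β ≤ 1, $N_c''(y) > 0$ for all y; hence, a > 2 is true. We can write the distribution in terms of h and a as
$$N_c(y) = \frac{a}{\alpha I(\beta)}\, h \exp\left[-\left(\frac{y}{h\alpha}\right)^{\beta}\right], \tag{11}$$
with I(β) the integral
$$I(\beta) = \int_0^{\infty} dz\, e^{-z^{\beta}} \tag{12}$$
and α determined by the equation
$$\alpha\, e^{\alpha^{-\beta}} = \frac{a}{I(\beta)}. \tag{13}$$
The maximally cited paper has citations
$$N_0 = \frac{a}{\alpha I(\beta)}\, h, \tag{14}$$
and the total number of papers (with at least one citation) is determined by $N_c(y_m) = 1$ as
$$y_m = h\left[1 + \alpha^{\beta} \ln(h)\right]^{1/\beta}. \tag{15}$$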
A given researcher's distribution can be modeled by choosing the most appropriate β and a for that case. For example, for β = 1, if a = 3, α = 0.661, $N_0 = 4.54h$, and $y_m = h[1 + 0.66\ln(h)]$. With a = 4, α = 0.4644, $N_0 = 8.61h$, and $y_m = h[1 + 0.46\ln(h)]$. For β = 0.5, the lowest possible value of a is 3.70; for that case, $N_0 = 7.4h$ and $y_m = h[1 + 0.5\ln(h)]^2$. Larger a values will increase $N_0$ and reduce $y_m$. For β = 2/3, the smallest possible a is a = 3.24, for which case $N_0 = 4.5h$ and $y_m = h[1 + 0.66\ln(h)]^{3/2}$.
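The quoted values of α, N_0, and the minimum a can be reproduced numerically. The sketch below makes several assumptions that should be read as mine rather than the author's: it takes the condition on α to be α exp(α^(–β)) = a / I(β), uses the standard identity I(β) = Γ(1 + 1/β) for the integral of exp(–z^β) from 0 to infinity, solves for the smaller of the two roots (the branch where highly cited papers dominate) by bisection, and works with logarithms to avoid overflow. All function names are hypothetical, and 0 < β ≤ 1 is assumed.

```python
import math


def stretched_integral(beta):
    """I(beta) = integral of exp(-z**beta) over z >= 0, equal to Gamma(1 + 1/beta)."""
    return math.gamma(1.0 + 1.0 / beta)


def minimum_a(beta):
    """Smallest a for which alpha * exp(alpha**-beta) = a / I(beta) has a solution."""
    alpha_star = beta ** (1.0 / beta)  # location of the minimum of the left side
    return stretched_integral(beta) * alpha_star * math.exp(1.0 / beta)


def solve_alpha(a, beta, iterations=200):
    """Solve for the smaller root alpha by bisection on the logarithm of the equation."""
    target = math.log(a / stretched_integral(beta))

    def g(alpha):
        # log(alpha * exp(alpha**-beta)) - log(a / I(beta)); decreasing below alpha_star
        return math.log(alpha) + alpha ** (-beta) - target

    hi = beta ** (1.0 / beta)
    if g(hi) > 1e-12:
        raise ValueError("a is below the minimum allowed for this beta")
    lo = hi / 2.0
    while g(lo) <= 0.0:  # g grows without bound as alpha -> 0, so this terminates
        lo /= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def distribution_parameters(h, a, beta):
    """Return (alpha, N_0, y_m) for the stretched-exponential distribution."""
    alpha = solve_alpha(a, beta)
    n0 = a * h / (alpha * stretched_integral(beta))
    ym = h * (1.0 + alpha ** beta * math.log(h)) ** (1.0 / beta)
    return alpha, n0, ym


if __name__ == "__main__":
    for a in (3, 4):
        print(a, distribution_parameters(100, a, 1.0))
    print(minimum_a(0.5), minimum_a(2.0 / 3.0))
```

A useful consistency check is that the resulting curve passes through the point (h, h): N_0 exp(–α^(–β)) should equal h for any admissible a and β.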
The linear relation between h and n (Eq. 6) will of course break down when the researcher slows down in paper production or stops publishing altogether. There is a time lag between the two events. In the linear model, assuming the researcher stops publishing after $n_{\mathrm{stop}}$ years, h continues to increase at the same rate for a time
$$n_{\mathrm{lag}} = \frac{1}{c/p}\, n_{\mathrm{stop}} \tag{16}$$
and then stays constant, because now all published papers contribute to h. In a more realistic model, h will smoothly level off as n increases rather than with a discontinuous change in slope. Still, quite generally, the time lag will be larger for scientists who have published for many years, as Eq. 16 indicates.
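One way to see where the lag comes from, sketched under the linear-model assumptions used earlier (p papers per year, c citations per paper per year, and h = c n / (1 + c/p) while h is still growing): h stops growing once all of the p n_stop published papers are counted. The symbols h_max and n_lag are labels used in this sketch for the plateau value and the extra time.

```latex
\begin{align*}
h_{\max} &= p\, n_{\mathrm{stop}} \\
\frac{c}{1 + c/p}\,\bigl(n_{\mathrm{stop}} + n_{\mathrm{lag}}\bigr) &= p\, n_{\mathrm{stop}} \\
n_{\mathrm{stop}} + n_{\mathrm{lag}} &= \Bigl(1 + \frac{p}{c}\Bigr)\, n_{\mathrm{stop}} \\
n_{\mathrm{lag}} &= \frac{p}{c}\, n_{\mathrm{stop}}
\end{align*}
```

In this sketch the lag grows in proportion to the number of years of active publishing, consistent with the statement that it is larger for scientists who have published for many years.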
Furthermore, in reality, of course, not all papers will eventually contribute to h. Some papers with low citations will never contribute to a researcher's h, especially if written late in the career, when h is already appreciable. As discussed by Redner (3), most papers earn their citations over a limited period of popularity and then they are no longer cited. Hence, it will be the case that papers that contributed to a researcher's h early in his or her career will no longer contribute to h later in the individual's career. Nevertheless, it is of course always true that h cannot decrease with time. The paper or papers that at any given time have exactly h citations are at risk of being eliminated from the individual's h count as they are superseded by other papers that are being cited at a higher rate. It is also possible that papers “drop out” and then later come back into the h count, as would occur for the kind of papers termed “sleeping beauties” (4).
For the individual researchers mentioned earlier, I find n from the time elapsed since their first published paper till the present and find the following values for the slope m defined in Eq. 6: Witten, m = 3.89; Heeger, m = 2.38; Cohen, m = 2.24; Gossard, m = 2.09; Anderson, m = 1.88; Weinberg, m = 1.76; Fisher, m = 1.91; Cardona, m = 1.87; deGennes, m = 1.75; Bahcall, m = 1.75; Fisk, m = 2.14; Scalapino, m = 1.88; Parisi, m = 2.15; Louie, m = 2.33; Jackiw, m = 1.92; Wilczek, m = 2.19; Vafa, m = 3.30; Maple, m = 1.94; Gross, m = 1.69; Dresselhaus, m = 1.41; and Hawking, m = 1.59. From inspection of the citation records of many physicists, I conclude the following:
A value of m ≈ 1 (i.e., an h index of 20 after 20 years of scientific activity) characterizes a successful scientist.
A value of m ≈ 2 (i.e., an h index of 40 after 20 years of scientific activity) characterizes outstanding scientists, likely to be found only at the top universities or major research laboratories.
A value of m ≈ 3 or higher (i.e., an h index of 60 after 20 years, or 90 after 30 years) characterizes truly unique individuals.
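The slope m used in these benchmarks is simply h divided by the number of years since the first published paper. A minimal sketch, with hypothetical function names, reproduces the benchmark cases quoted above:

```python
def m_parameter(h, years_since_first_paper):
    """Return the slope m = h / n, with n the years since the first published paper."""
    if years_since_first_paper <= 0:
        raise ValueError("years_since_first_paper must be positive")
    return h / years_since_first_paper


if __name__ == "__main__":
    # (h, n) pairs taken from the benchmark examples in the text
    for h, n in [(20, 20), (40, 20), (60, 20), (90, 30)]:
        print(h, n, m_parameter(h, n))
```

Since m depends on when the clock starts, the choice of the first paper as the starting point matters; the code takes that number of years as an input rather than trying to infer it.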
The m parameter ceases to be useful if a scientist does not maintain his or her level of productivity, whereas the h parameter remains useful as a measure of cumulative achievement that may continue to increase over time even long after the scientist has stopped publishing.