Procter & Gamble Co. v. Chesebrough-Pond's Inc.

588 F. Supp. 1082 (1984)

The PROCTER & GAMBLE COMPANY, Plaintiff,
v.
CHESEBROUGH-POND'S INC., Defendant.
CHESEBROUGH-POND'S INC., Plaintiff,
v.
The PROCTER & GAMBLE COMPANY and Benton & Bowles, Inc., Defendants.

No. 84 Civ. 0093 (GLG).

United States District Court, S.D. New York.

June 11, 1984.

*1083 Kramer, Levin, Nessen, Kamin & Frankel, New York City, for The Procter & Gamble Co. and Benton & Bowles, Inc.; Harold P. Weinberger, Geoffrey M. Kalmus, Greg A. Danilow, New York City, and Thomas R. Hillhouse, Cincinnati, Ohio, of counsel.

Proskauer, Rose, Goetz & Mendelsohn, New York City, Hyman, Phelps & McNamara, P.C., Washington, D.C., for Chesebrough-Pond's Inc.; Jeffrey B. Schreier, New York City, James R. Phelps, Thomas J. Donegan, Jr., Washington, D.C., and Arnold I. Friede, Greenwich, Conn., of counsel.

OPINION

GOETTEL, District Judge:

Advertising is a pervasive part of modern life. We are confronted with it wherever we turn. Most of us do not take advertising too seriously.[1] Those who do are the advertisers themselves, particularly when their competitors make comparative claims of superiority which may influence consumer purchases.

In these actions, two of the nation's largest consumer product manufacturers, The Procter & Gamble Company ("P & G")[2] and Chesebrough-Pond's Inc. ("Chesebrough"),[3] have sued one another. Each alleges that the other's comparative advertising claims concerning certain products are false.[4] Filed just a few business hours apart, the two actions have been consolidated, and evidentiary hearings lasting more than seven days have been held to consider the parties' cross-motions for preliminary injunction.

The following opinion constitutes the Court's findings of fact and conclusions of law on these cross-motions.

FACTS

The Skin Lotions

The products involved are skin lotions, which are referred to by the parties as "hand and body lotions" to distinguish them from facial lotions. They are consumer products sold in stores throughout the country and shipped in interstate commerce. No prescription is necessary for *1084 their purchase and no restriction has been placed on their usage.

Two of these skin lotions figure prominently in this litigation. Chesebrough's product, Vaseline Intensive Care Lotion ("VICL"), has a leading sixteen percent share of the skin lotion market, and P & G's product, Wondra, commands just over five percent of the market.[5] Recently, a "new and improved" formulation of the latter product ("New Wondra" or "Wondra V") has been marketed and extensively advertised.[6]

These and the other skin lotions are designed to counter dry, rough skin, a condition which has a number of causes. Experts consider one of the causes to be genetic. Another is exposure to water, wind, and sun. A third is contact with detergents, particularly those for clothing and dishes.[7]

To counter dry, rough skin, the lotions work on the basis of one or both of two methods. The first is occlusivity, whereby an impermeable layer prevents the loss of water from the skin. Petroleum jelly, lanolin (from sheep's oil), and other relatively greasy substances are effective in creating such an impermeable layer. The second method relies upon the use of a humectant, a chemical that permeates to the stratum corneum of the skin and there attracts and holds water. The most commonly used humectant is glycerin, an ingredient of skin lotion for more than fifty years.[8] Indeed, the claimed improvement in New Wondra was the addition of more glycerin.

Glycerin and the effective occlusive agents tend to be greasy, however, and the consumers who use these products, overwhelmingly women, customarily reject products that look or feel greasy. The manufacturer's task, therefore, has been to create a product that contains the effective agents of occlusive or humectant products yet rubs into the skin easily and leaves no greasy coat.[9] In the latter respect, the products are more cosmetic than medicinal.

Although there are approximately 100 brands of skin lotion besides Chesebrough's VICL and P & G's Wondra, only VICL has a substantial position in the market. Chesebrough's other skin lotions are Intensive Care Extra Strength ("VICL Extra Strength"), Intensive Care Herbal and Aloe, and Vaseline's Dermatology Formula Lotion ("VDL"). Each has a relatively small market share, so that the four Chesebrough products together command only about 25% of the market.

VICL is Chesebrough's most heavily promoted product. In its advertising, Chesebrough claims that "no leading lotion beats" VICL. Following the reformulation of Wondra (known as Wondra V within the company, but called New Wondra for advertising purposes) and the completion of certain consumer tests discussed more fully below, P & G began intensive advertising *1085 of New Wondra in the summer of 1983. These advertisements proclaim that New Wondra is more effective because of its additional glycerin and that clinical tests have established that New Wondra relieves dry skin better than any other leading lotion.[10]

Thus, with P & G claiming that its product is superior to all other lotions and Chesebrough claiming that nobody's product is better than VICL, we have a situation in which at least one of the advertising claims must logically be wrong. Looking to the Court to determine which claim is misleading, each party contends that it is being injured by the other's advertising and that, because the injury cannot be concretely measured, the harm is irreparable.

Consumer Product Testing

Both companies have conducted extensive tests to support their advertising claims. The evidence presented at the hearings in this matter concerned primarily the propriety and accuracy of the testing methods employed by each company. The methods range from small-scale tests, often done with expert panels who give subjective responses to the use of the product, to large-scale clinical tests in which numerous consumers participate. In between are various other types of tests, including one known as the Kligman regression, which is named after a doctor at the University of Pennsylvania who has been active in attempting to make testing methods more scientific.[11]

While it is generally agreed that a fairly large-scale clinical test offers the best opportunity for demonstrating differences in efficacy, maintaining controlled laboratory conditions when conducting such tests is very difficult. Questions also arise in determining who should be included in the test population. Should it be composed primarily of adult women, who are the major users of the product, or should it be drawn from the population at large? Should the participants be lotion users and persons with an existing problem (such as "dishpan hands")? In this regard, it seems fairly obvious that participants in the test must start with some degree of rough, dry skin if there is to be any possibility of improvement. Also, because the tests used are relatively crude, it is generally agreed that those with fairly extensive problems make better subjects because the differences in improvement are more likely to be measurable. Nonetheless, it is also essential to exclude people who have rough, dry skin that is caused by skin diseases, rather than by the more common, everyday causes. Finally, the parties also agree that large-scale clinical tests are best conducted on a "double blind" basis (meaning that neither the subject nor the grader knows which product has been applied),[12] and that, where the subjects are using different products, an attempt should be made to classify the subjects by initial skin condition, age, and detergent exposure, so that every product is tested against a relatively random sample.

The real controversy in this case concerns the means used to evaluate the condition of the skin at the start of the testing and at the various grading points thereafter. No reliable mechanical or electrical tests exist for accomplishing this. The evaluation, therefore, must be made visually.[13] Toward this end, the manufacturers *1086 employ "graders." These are persons who are trained and schooled in evaluating skin condition by sight and touch. In their grading, they assign numerical values to the condition of the skin, using various scales ("parametric systems"), some of which set forth verbal descriptions of the equivalent skin condition at various points on the scale. The graders need not be dermatologists or even doctors, but a dermatologist is necessary for the initial screening in order to eliminate potential subjects who have specific skin diseases.

Both companies employed dermatologists as graders in their large-scale clinical tests. Having heard considerable testimony concerning the grading process, the Court is convinced that a skilled grader can determine with reasonable accuracy the relative condition of a subject's skin and compare it to that of other people. The Court also concludes, however, that the numerical designations do not necessarily correspond to the verbal descriptions on the scale. Moreover, although the graders are internally consistent, which is to say that one of them will give approximately the same score to all skin having the same condition, that score will not necessarily be the same as that which is given by another grader. Thus, for example, what one grader may consistently rate as 3.5 another grader may consistently rate as 4.5.

This weakness in the grading system creates a problem with any statistical analysis that is done on the scores. Those who believe that it is improper to apply a parametric system to the condition of skin find little basis for statistical analysis.[14] In addition, even allowing for the difficulty in a numerical grading system, and the subjectiveness of the grading, it is inherently difficult to detect comparative differences between two or more effective products unless a placebo is also tested. Yet, if a placebo is incorporated into the tests, it increases the number of subjects required and thus makes the testing even more expensive.

Beyond this, a major dispute between the parties is whether testing should be done on a controlled clinical basis or on an ad libitum basis. Chesebrough contends that, in conducting studies designed to compare the effectiveness of skin lotions, it is important to control as many variables as possible to ensure that the study reflects the effects of the products tested instead of the effects of unknown or uncontrolled variables. Chesebrough continues by arguing that an ad libitum test, which allows subjects to use as much of the product as often as they wish, amounts to nothing more than a consumer performance test, and that the better scores for New Wondra are due to the fact that more of the product was applied. P & G contends that ad libitum testing is superior because it mirrors the actual use of skin lotions, for which there is no prescribed dosage. Although P & G acknowledges that the instructions to the subjects to apply the product wherever and whenever they would normally use a lotion might have resulted in its product being used more frequently or in larger doses than Chesebrough's, P & G argues that this result indicates the effectiveness *1087 of the product because it is designed to encourage extensive consumer use.[15]

It is apparent from the testimony that, if the intent is to determine the abstract efficacy of one product as compared to another, controlled clinical tests, such as would be used with a prescription drug, are most appropriate. However, if the product is a non-prescription product that is in part a cosmetic, and if the intent is to induce the consumer to use large amounts of the effective ingredients, then the "effectiveness" of the product (in terms of its ultimate goal) may be validly tested in an ad libitum manner. It must be noted, however, that there is a substantial school of scientific thought that holds that, while such tests may be appropriate for product efficiency, they are not suitable for comparative advertising.

P & G's Tests

P & G's advertising claims are based primarily on two clinical studies: SC-207 and SC-215. SC-207 was conducted in Tucson, Arizona, in November of 1981. The test compared the efficacy of four products: New Wondra, VICL, Wondra GD (the Wondra formula being marketed at the time), and a placebo. SC-215 was conducted in Chicago Heights, Illinois, in January and February of 1983. Six products were tested: New Wondra, Jergen's Extra Dry and Soft Sense Extra Moisturizing (the two most popular lotions after VICL), Wondra E (the formula of Wondra being marketed at the time), and Chesebrough's VICL Extra Strength and VDL.

The test design of SC-207 and SC-215 was the same as that used in earlier comparative efficacy tests of earlier formulas of Wondra. In four of those tests, the prior Wondra formulations had been found to be either less effective than or only equally effective as the other leading products on the market, including VICL. For three of these tests and for SC-207 and SC-215, P & G employed a dermatologist, Dr. Frank Dunlap,[16] to grade the subjects' hand conditions. In all of these tests, P & G used what is known as a parallel design test, in which a large number of subjects are selected and then divided into as many treatment groups as there are products to be tested. Those in a treatment group use only one product.

The subjects in SC-207 and SC-215 were selected randomly to create a representative group of users of hand and body lotions. To insure that some dry skin was present, subjects were required to have a certain minimum combined score for the dorsal (back) surfaces of the left and right hands.[17] The subjects were then classified according to age, initial skin condition, and dishwashing frequency, and were randomly assigned to the products to be compared.

Both SC-207 and SC-215 were properly "double-blinded." The subjects and the examiner did not know which products they *1088 were using. All of the products were placed in identical containers labeled only with random subject numbers and the words "skin lotion." The type of container was similar to the type most commonly used for hand and body lotions.[18]

Analysis of the scores given by the dermatologist in SC-207 showed New Wondra to be better than either VICL, the old Wondra formulation (GD), or the placebo. The superiority was sufficiently demonstrated to be considered statistically significant.[19] Analysis of the scores in SC-215 showed New Wondra to be better than the other five products tested but not necessarily at a statistically significant level. When P & G went further, however, and analyzed subsections of the SC-215 data, it found that if only those with demonstrably rough, dry skin were compared, New Wondra's superiority over all the competing products was again demonstrated at a statistically significant level. Consequently, the advertising claims were qualified to refer only to the treatment of dry, rough skin.

Chesebrough has numerous criticisms of the P & G tests. Its primary challenge is that VICL was not included in the second, large clinical test, even though there had been a change in the formula of New Wondra during the period between the two tests. Chesebrough claims that P & G has thus failed to establish the efficacy of New Wondra as compared with VICL. In response, P & G argues that there was no change in the effective ingredients in the Wondra formula during this two-year period. Indeed, just over three percent of the old non-active ingredients were removed and just under two percent new ingredients were added. (The difference of between one and two percent was made up by adding more water.)[20] Chesebrough points out, however, that because the product is four-fifths water, fifteen percent of the non-water ingredients have been eliminated and ten percent new ingredients added. Although these changes were not made to the active ingredients, they could have affected the physical characteristics of the emulsion, which, in turn, may have affected the amount of product used and the way in which the consumer applied it. P & G ran several limited tests to determine whether the product's effectiveness had been altered[21] and concluded that there had been no change.
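The arithmetic behind Chesebrough's restatement of the formula change can be checked with a short sketch. It assumes round figures of 3 and 2 percent for the "just over three" and "just under two" percent changes and a water content of exactly four-fifths. The function and variable names are illustrative only and do not come from the record.

```python
def share_of_non_water(pct_of_total, water_fraction):
    """Convert a percentage of the whole product into a percentage
    of the non-water portion of the product."""
    non_water_fraction = 1.0 - water_fraction
    return pct_of_total / non_water_fraction


# Assumed round figures (the opinion gives only approximate values).
WATER_FRACTION = 0.8   # "four-fifths water"
REMOVED_PCT = 3.0      # "just over three percent" of the whole product
ADDED_PCT = 2.0        # "just under two percent" of the whole product

removed = share_of_non_water(REMOVED_PCT, WATER_FRACTION)
added = share_of_non_water(ADDED_PCT, WATER_FRACTION)

print(f"Removed: about {removed:.0f}% of the non-water ingredients")
print(f"Added:   about {added:.0f}% of the non-water ingredients")
```

Under these assumptions the two parties' figures describe the same change. P & G measures it against the whole product, and Chesebrough measures it against the one-fifth of the product that is not water.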

Chesebrough argues, however, that without a valid clinical comparison equivalency cannot be established. Having heard extensive testimony on the point, the Court can only say that, although the changes could have affected how the product is applied (for better or for worse), neither *1089 side has actually demonstrated the importance or lack of importance of the changes.[22]

An additional argument made by P & G is that, because SC-215 demonstrated New Wondra's superiority over Chesebrough's VICL Extra Strength and VDL, and because both of those products are touted by Chesebrough as being superior to VICL, it follows that New Wondra must also be better than VICL. The problem with this argument is that, while the evidence established that both VICL Extra Strength and VDL have more effective ingredients than VICL, it also demonstrates that these two products are thicker and perhaps greasier than VICL and may not invite the same amount of usage. Under the ad libitum conditions that prevailed in SC-215, the amount of usage of these products could differ from that of VICL and did differ from that of New Wondra. It cannot be said, therefore, that, simply because these two products have more intrinsic efficacy, they should perform more effectively than VICL in an ad libitum test.

Another serious challenge offered by Chesebrough is that P & G's first parametric statistical analysis of the total test population in SC-215 did not show to a statistically significant degree that New Wondra was better than the other tested lotions, and that subsequent reliance upon analyses made of subsets of the total population was not justified. P & G argues that such reliance was proper and that it has narrowed its advertising campaign to conform with the subset involved: those with very rough, dry skin. Chesebrough's response is that this qualified advertising claim does not conform with the verbal equivalents on the 0-5 scale, which do not mention roughness for any scores below 2.5.[23] Indeed, it does appear that the only basis for P & G's having chosen 1.5 as the minimum indication of roughness was that it was the mid-point score for all subjects tested. In other words, starting with a somewhat select population in the first place (women who use skin lotions), New Wondra proved to be a superior product to a statistically significant degree only with that half that had the worst problems. Of course, as noted earlier, considering the limitations of this type of testing, product efficacy can best be demonstrated by studying those who have significant problems and can, therefore, manifest the most change.

Chesebrough is also very critical of the manner in which the statistical analysis was done. As provided in the protocols, the data from SC-207 and SC-215 were subjected to an analysis of covariance. This is a parametric analysis that is used in studies involving grading scales of various sorts. Chesebrough claims that use of this analysis was inappropriate because it could find no documents showing that the assumptions justifying the use of this analysis had been met. There was testimony, however, that appropriate tests had been performed and that they indicated the appropriateness of this analysis.[24]

Chesebrough also criticizes the failure to use weather statistics as a co-variable in the statistical analyses. Unquestionably, weather has a pronounced effect upon the *1090 subject's skin. Cold, dry weather tends to make the skin lose its natural moisture, whereas warm, moist weather does not. The flaw in this criticism, however, is that the weather was the same for all subjects using every product. Consequently, while weather might affect skin condition of the subjects independently of the products used (and, indeed, the tests reflect this), the overall effect should be negligible because of the random selection of the subjects.

Chesebrough further argues that the results of the subset may be statistically significant but they are not "clinically significant." By that it means that the differences are not great enough to be visually or tangibly noticeable. The problem with this argument is that there are no established criteria for making such a determination, and P & G is not claiming that the tests showed a "clinically significant" difference.

Finally, of course, Chesebrough argues against the use of any ad libitum tests whatsoever. That point has already been discussed, however. See supra pp. 1085-1087.

From the foregoing, we can conclude that the P & G tests were far from perfect and are subject to various infirmities. We cannot conclude, however, that they were worthless. The question of whether advertising based on such tests is illegal is considered later.

Chesebrough's Tests

Chesebrough's claims of parity for VICL[25] were made before New Wondra became commercially available in quantity in November of 1983, because P & G had carefully guarded the secrecy of its new ad campaign. Consequently, Chesebrough had to run its tests hastily. It did not withdraw its ads while awaiting the results of these tests — two small tests and a third large-scale clinical test, which were conducted by independent testers using different methodologies to compare the effectiveness of New Wondra with that of VICL.

Very little evidence was introduced concerning the first two small-scale tests.[26] Although they purported to show no significant differences between the products, these two tests were probably not of sufficient size to detect any such differences.[27]

Chesebrough's large-scale clinical test was conducted by a consumer product testing service in Killington, Vermont, in December of 1983. The test took place over ten days and required seventy-three subjects to apply New Wondra to one hand and VICL to the other. The subjects were instructed to apply the lotion twice a day. On days one, three, five, eight, and ten, grading was done by Dr. Donald I. McIntyre, *1091 a dermatologist of some experience. He used a numerical scale with whole number intervals from one to nine, with the higher numbers indicating increased dryness, roughness, etc. The participants were ski instructors and other employees of a ski lodge, most of whom were in their twenties and had substantial skin problems. In order to be selected for the test, subjects had to have grades of seven on the nine-point scale and at least one of the following attributes: scaling, peeling or flaking, erythema, or cracking or fissures. Analysis of the results of the grading revealed no statistically significant difference between the products.

P & G criticizes several aspects of Chesebrough's tests, however. The two small-scale tests are dismissed out of hand for reasons already described. As for the Killington clinical test, P & G makes a number of points. Although the subjects were graded and double-blinded as in P & G's tests, P & G contends that it was not appropriate to use New Wondra on one hand and regular VICL on the other.[28] The instructions that were given were long and somewhat complicated. There was a substantial risk that subjects would forget and put the wrong product on the wrong hand. Also, there were substantial possibilities of contamination since New Wondra was applied on the New Wondra test hand by the VICL test hand and vice versa, a risk that was exacerbated by the fact that the self-applications were usually unsupervised. In addition, there is the fact that the subjects, because of their severe skin conditions, their ages, and their occupations, were not representative users of skin lotions. Furthermore, the grading was done by showing the right and left hands of each subject to the grader consecutively, which increased the chance of a bias for parity.[29]

To compound this grading problem, the nine-point grading scale in the Chesebrough protocol did not have specific verbal descriptions for any grade other than grades 1 and 9. There were no intermediate descriptions. In addition, Dr. McIntyre used one grading scale on the first day and another on subsequent grading days.[30] With the first he separated the nine numbers into three groups to help him determine who would qualify for the study. Then he used the second nine-point scale, the one described above, to determine the improvement of the hands as the study progressed. This use of two different scales may have resulted in the application of different criteria to the initial scores than to subsequent scores. Not surprisingly, some uncertainty in grading was shown in the tabulated results. For example, there was a dramatic but unexplained improvement in the condition of all the hands graded by Dr. McIntyre between day eight and day ten, far greater than what had previously occurred and inexplicable on any basis other than the grader's inconsistency.[31]

Another weakness in the execution of the Killington study was that the treatment period was too short. Indeed, the Killington test was far shorter than any of the other tests conducted by Chesebrough or P & G. Concluded before the Christmas holidays, *1092 the test did not permit a determination of whether the dramatic changes registered between days eight and ten were real and would continue.

Since this was a controlled test, the participants were each given a measured amount of lotion, which was considered to be a minimum dosage. This, of course, made the Killington test substantially different from the ad libitum tests conducted by P & G.

Interestingly, although P & G criticizes the Killington test in many respects, the majority of the participants found in favor of P & G's New Wondra on each day in the non-parametric portion of the test. This was the part of the test in which the users were simply asked to give their own subjective evaluation of the overall effectiveness of each product, the relative softness of the skin on each hand, and the relative degree of relief of tautness. The participants' preference for New Wondra was deemed statistically significant for all but the last couple of evaluations.

The Court concludes that Chesebrough's tests were more questionable than P & G's. They have been used, however, to support a lesser advertising claim of parity, not one of superiority. Moreover, Chesebrough's conclusion that "nothing beats Intensive Care" derives not merely from the Killington test but also from numerous other studies, including those made by various respectable, outside consultants. These studies have consistently shown that there is no clinically significant difference between any of the products in their ability to relieve dry skin.

Of course, whether differences are "significant" depends upon how fine a line is being drawn. Thus, in the final analysis, this case becomes little more than a dispute over testing methods, with neither side able to show fraud, deception, or bad faith on the part of its competitor.

LEGAL CONCLUSIONS

Section 43(a) of the Lanham Trade-Mark Act, 15 U.S.C. § 1125(a) (1982) (the "Lanham Act"), was not addressed primarily to advertising. Indeed, for several decades after its passage, it was rarely invoked with respect to advertising,[32] and even then almost never to challenge comparative claims.[33]

When the Lanham Act was used to challenge comparative advertising, the general response of the courts was to restrict its application. See, e.g., Bernard Food Industries, Inc. v. Dietene Co., 415 F.2d 1279, 1283-84 (7th Cir.1969) (defendant's false comparative advertising claim found not to constitute a section 43(a) violation), cert. denied, 397 U.S. 912, 90 S. Ct. 911, 25 L. Ed. 2d 92 (1970). Only ten years ago, this circuit held that it was not a violation of the Lanham Act to sell water-damaged goods as if they were first quality goods as long as no affirmative misrepresentations as to their quality were made. Alfred Dunhill Ltd. v. Interstate Cigar Co., 499 F.2d 232, 237-38 (2d Cir.1974). As a result, we find law review articles as late as 1976 bemoaning the failure of the judiciary to apply the Lanham Act to comparative advertising and exhorting the courts "to fashion a comprehensive set of remedies for comparative advertising abuses." Note, The Law of Comparative Advertising: How Much Worse is "Better" Than "Great", 76 Colum.L.Rev. 80, 112 (1976).

That challenge was taken up in this circuit six years ago in American Home Products Corp. v. Johnson & Johnson, 577 F.2d 160 (2d Cir.1978). In that case, Judge Oakes asserted flatly: "That section 43(a) of the Lanham Act encompasses more than literal falsehoods cannot be questioned." Id. at 165. The court held that advertising claims made by the manufacturers of Anacin for relief of pain and inflammation were, in light of the consumers' interpretation of the claims, ultimately *1093 false and enjoinable under the Lanham Act. Id. at 169-70.

The next step in the development of the law concerning comparative advertising was Vidal Sassoon, Inc. v. Bristol-Myers Co., 661 F.2d 272 (2d Cir.1981). There, Judge Kaufman began by acknowledging that "[o]ne of the most delicate tasks a court faces is the application of the legislative mandate of a prior generation to novel circumstances created by a culture grown more complex." Id. at 273. The issue before the court was whether the Lanham Act's prohibition against false advertising included misrepresentations regarding the results and methods of tests purporting to reflect consumer preferences. The court noted that

[n]othing in the history of either the 1920 Act or the 1946 amendments speaks to consumer testing. This silence is hardly surprising, given that the growth in consumer testing for comparative advertising claims occurred only with the advent of television and increasing sophistication of marketing techniques during the 1950's and 1960's.

Id. at 277. The court went on to conclude: "We are therefore reluctant to accord the language of § 43(a) a cramped construction, lest rapid advances in advertising and marketing methods outpace technical revisions in statutory language and finally defeat the clear purpose of Congress in protecting the consumer." Id. The court further noted that, while the Lanham Act literally applied only to misrepresentations concerning the "inherent quality or characteristic" of a product, if the intent and effect of an advertisement was to lead consumers into believing that a product was comparatively superior, then the statement of superiority amounted to a representation concerning the product's inherent quality. Id. at 278. Consequently, the court concluded:

In a case like this, where many of the qualities of a product (such as "body") are not susceptible to objective measurement, it is difficult to see how the manufacturer can advertise its product's "quality" more effectively than through the dissemination of the results of consumer preference studies. In such instances, the medium of the consumer test truly becomes the message of inherent superiority. We do not hold that every misrepresentation concerning consumer test results or methodology can result in liability pursuant to § 43(a). But where depictions of consumer test results or methodology are so significantly misleading that the reasonably intelligent consumer would be deceived about the product's inherent quality or characteristics, an action under § 43(a) may lie.

Id.

The Second Circuit's concern for the gullible consumer was further demonstrated in the case of Coca-Cola Co. v. Tropicana Products, Inc., 690 F.2d 312 (2d Cir.1982). There the court enjoined advertisements stating that Tropicana's product was "pasteurized juice as it comes from the orange," since the court was apparently concerned that consumers might believe that oranges contain pasteurized juice.[34] Id. at 318.

In the instant actions, the parties attempt to go a significant step further by attacking advertisements that are not obviously false but that rest upon tests whose efficacy is questioned. Essentially, the Court is being called upon to evaluate the standards for conducting tests intended to form the basis of comparative advertising claims.

In theory at least, a respectable argument can be made for the wisdom of creating such standards. The Court, however, listened for more than seven days to the testimony of more than a dozen expert witnesses[35] — statisticians, dermatologists, *1094 chemists, and physicists — and found that much of their testimony was incomprehensible.[36] Indeed, it is doubtful that there are many, if any, trial judges who could fully comprehend the testimony. Courts generally lack the expertise of the Federal Trade Commission when it comes to evaluating advertising practices. American Home Products Corp., supra, 577 F.2d at 172 n. 27. Although it is, of course, improper to explicitly misrepresent the results of tests or the manner in which they are carried out, an advertiser is not required to disclose all aspects of his test findings, provided the non-disclosure does not render the advertising misleading. See, e.g., FTC v. Sterling Drug, Inc., 317 F.2d 669, 675-76 (2d Cir.1963). Not every misrepresentation concerning consumer test methodology results in Lanham Act liability — only those so significantly misleading that consumers would be deceived about a product's inherent quality or characteristics. Vidal Sassoon, supra, 661 F.2d at 278.

Here, we are confronted with somewhat inconsistent product claims based on tests that were conducted in apparent good faith but with somewhat differing results. The difference in the results, in turn, was partially caused by the different test protocols that the parties chose. As a consequence, neither of the parties has successfully proven that the other has chosen tests and conducted them in such a manner as to mislead the public. Courts are not always able to determine whether an advertising claim is true or false, see, e.g., American Home Products Corp. v. Johnson & Johnson, 436 F. Supp. 785, 795 (S.D.N.Y.1977), aff'd, 577 F.2d 160 (1978), and where this occurs, the only possible conclusion is that the moving party has failed to prove by a preponderance of the evidence that the advertising claim is false. Such is the case here.

Moreover, the Court is most skeptical of the parties' contention that it is, or should be, the duty of a court in a case such as this to determine the winner and enjoin the loser. Only by making policy for the testing of consumer goods could a court take such a step. While there are those who believe that for every wrong there must be a remedy and that courts should intervene where the executive branch and the legislature have not, there are substantial constitutional objections to judicial policy-making under our form of government. Wilkey, Activism by the Branch of Last Resort: Of the Seizure of Abandoned Swords and Purses at 12 (National Legal Center for the Public Interest 1984). Judge Wilkey notes that courts have no facilities for holding public hearings to gather the information and facts on which public policy decisions should be based, and that judges are not adequately trained to make such policy decisions. Id. at 12-14. As his colleague on the D.C. Circuit, Judge Robert Bork, has noted, the proliferation of cases like these is changing the nature of courts from that of a judicial body to that of a bureaucratic model. R. Bork, Dealing with the Overload in Article III Courts, address delivered at the National Conference on the Causes of Popular Dissatisfaction with the Administration of Justice, 70 F.R.D. 231, 233-34 (1976).

One does not have to oppose judicial activism to recognize that the role that these parties ask the judiciary to play exceeds that which the judiciary has the power to accept under our form of government. We are dealing with rough tests that have no certifiable standards and that rest upon nothing more than subjective evaluations of skin conditions. The conditions being evaluated are not of serious import, and the products being evaluated are far from the most effective available to achieve the results desired. The parties are sparring to obtain commercial advantage over what is at most a cosmetological distinction. Thus, if any injunctive relief were called for, it would be an order requiring both *1095 parties to remove from their advertisements any implication that their products are the most effective available, for they are really nothing more than the most acceptable adaptations for female users.

Finding that neither party has shown a likelihood of success on the merits of its claim, both parties' motions for preliminary injunction are denied.

SO ORDERED.

NOTES

[1] F. Scott Fitzgerald, who worked for a while in an advertising company, wrote in The Crack Up, "Advertising is a racket ..., its constructive contribution to humanity is exactly minus zero." B. Evans, Dictionary of Quotations 8 (1978).

[2] The Procter & Gamble Company is an Ohio corporation with its principal place of business in Cincinnati, Ohio.

[3] Chesebrough-Pond's Inc. is a New York corporation with its principal place of business in Greenwich, Connecticut.

[4] The actions are brought under the Lanham Trade-Mark Act, 15 U.S.C. §§ 1051-1127 (1982) (the "Lanham Act"), but the complaint also contains a number of pendent state claims, including those brought under sections 349 and 350 of New York's General Business Law. Since there has been no demonstration that the parties' rights are any different under New York law than under the Lanham Act, no separate consideration has been given to the state claims.

[5] As is apparently required in comparative advertising, each party has limited its claims to "leading brands." In addition, the parties have stipulated that for purposes of this case only the leading brands are defined as those commanding five percent or more of the market, even though there are numerous brands having less than five but more than one percent of the market. Fortunately for the Court, whether exclusion of these brands is actually appropriate need not be considered here.

[6] Also named as a defendant is P & G's advertising agency, Benton & Bowles, Inc., a New York corporation. It produced and disseminated the advertising for the reformulated Wondra product, but made no independent evaluation of the substantiation provided by P & G in support of its claims concerning that product.

[7] Aging also produces noticeable changes in the skin. None of these products was designed to deal with that problem, however. Indeed, the evidence indicates that no known product can correct wrinkles, though some may effectively conceal them.

[8] At relative humidities greater than twenty-five percent, glycerin will absorb water from the atmosphere. At relative humidities less than twenty-five percent, it will draw water from the stratum corneum. Thus, weather can have a considerable effect upon a high-humectant product.

[9] Making the producers' balancing act even more difficult is the fact that many consumers shun a product that is completely invisible because it is perceived as having no beneficial effect.

[10] The purpose of this advertising campaign is to achieve market leadership — a sixteen percent share of the market — for New Wondra.

[11] Several of the witnesses during these hearings were students or associates of Dr. Kligman.

[12] In the P & G tests described hereafter, all of the products were placed in identical bottles containing a simple squeeze-type opener.

[13] P & G has been attempting to develop instrumental measurement of skin dryness by testing sonic velocity and electrical properties. In earlier distributions to the medical profession, P & G indicated that such tests confirmed the results obtained by visual grading. Chesebrough complains about this reference to the tests on the ground that they are too experimental to be reliable. P & G has agreed that it will make no further distributions of this type until the scientific accuracy of the new tests has been better established.

Chesebrough also complains about another reference to the tests, published in an article in the January 1983 issue of Current Therapeutics (purportedly authored by a dermatologist, Dr. Frank Dunlap, but, in fact, ghost-written by one of P & G's employees). On January 13, 1984, however, the Court addressed this particular issue and denied Chesebrough's motion for a temporary restraining order. We found that the article "is not the stuff of which advertising is made," and that the article is "relatively factual in that it acknowledges the extent to which you cannot be too reliant on the instrumental tests — there not having been any long history of using them in this way or any strong basis for assuming their correctness." The Court also indicated that dissemination of this article could not "possibly be the sort of thing which could work irreparable harm upon Chesebrough-Pond's." At this time, we can only add that, even if this particular reference to the tests did constitute a Lanham Act violation, a restraining order would not be appropriate because there would still be no foreseeable chance of the reference being repeated. See, e.g., United States v. W.T. Grant Co., 345 U.S. 629, 633, 73 S. Ct. 894, 897, 97 L. Ed. 1303 (1953).

[14] The hand grading scale is an ordinal or ranking scale. It is not an interval scale, in which the distance between points of measurement is fixed and constant.

[15] A current advertisement for a toothpaste provides what is arguably an analogous situation. The ad suggests that the toothpaste may be more effective than others because it tastes good — the argument being that because it tastes good children will brush with it more often and longer and thereby attain cleaner teeth.

[16] P & G had previously verified Dr. Dunlap's consistency in his use of the scale. Dr. Dunlap was proved to be a well-trained, consistent, and internally reproducible grader.

[17] The grading scale that appears in the SC-207 and SC-215 protocols is labeled the "Overall Hand Grading Scale, Dorsal and Digital Description." The scale ranges from a grade of 0 to a grade of 5.0, with intervals of 0.5 along the way. Each of the intervals (from 0 to 5.0) is explained by a verbal description of the condition of the skin. One grade is given for each of four surfaces: the digital portions of the right and left hands (from approximately the knuckles to the fingertips) and the dorsal surfaces of the right and left hands (from the wrists to the beginning of the knuckles). These are averaged to provide an overall grade. The grade is in part determined by the volume of area of the hand exhibiting the various specific characteristics. Subjects were screened for acceptance into the tests at the time of initial grading. To be accepted, they were required to have a total score of at least 1.0 (sum of left and right hands) on the dorsal surface. 1.0 is described on the scale as "Skin slightly smooth, patches of powdery scales/ashing definitely in the creases." Those subjects who began the study with an average skin severity grade of 1.5 or higher on the 0-5 scale were in the subset considered to have rough, dry skin.

[18] One problem with the test in this regard is that Chesebrough's VDL is never sold in that type of dispenser but always in a pump-type because of the thickness of the product. Consequently, subjects may have used less of that product than they would have if it had been supplied in its customary container, and any such diminished use may have affected its comparative test results. (An additional ground for doubting the validity of comparing VDL with New Wondra or VICL is that the first product is considerably more expensive than the latter two.)

[19] That is, it was significant at the ninety-five percent confidence level.

[20] Specifically, the changes in formulation between SC-207 and SC-215 were as follows: the content of sodium hydroxide (in a 50% solution) was increased from .31% of the product weight to .34%; the 2% of ethanol (drinking alcohol) was eliminated; the content of cetyl alcohol was reduced from 3% of the product weight to 1.8% and stearyl alcohol was added to offset the resulting 1.2% reduction; the preservative used in SC-207, a product known as tektamer 38, which constituted .025% of the overall product, was replaced in SC-215 by methyl paraben, propyl paraben, and Germall 115, which then made up .2%, .1%, and .1% of the overall product, respectively; finally, as mentioned above, a change of between 1% and 2% in product weight was made by adding water. Both formulations contained as their active ingredients glycerin and petrolatum at 10% and 2.5% of the product weight, respectively. The numerous other chemicals in the two formulations remained the same. The purposes of the changes were to enhance the preservative characteristics of the product and to make it more stable.

[21] One of those tests compared the lotions' viscosities, because, as noted earlier, the thickness of a product has been shown to affect its use by consumers.

[22] In seeking an injunction against P & G, Chesebrough, of course, bears the burden of showing a likelihood of success on the merits. Thus, it is incumbent upon Chesebrough to establish its ability to demonstrate that the changes in the Wondra formulation invalidate the results of SC-207 which suggest that New Wondra is superior to VICL.

[23] P & G has represented 1.0 on the overall grading scale to be "smooth, soft, healthy skin." P & G has characterized 0.5 to 1.5 as "good skin condition," also. The Dunlap paper referred to earlier, see supra note 13, does not mention that the reported data is extracted from a subset of the total population, and the paper is not written in a manner suitable for publication in a peer-reviewed journal. (P & G had to pay part of the cost of having the article published.) The purpose of publishing the article was to convey to practicing physicians information concerning the efficacy of New Wondra.

[24] Chesebrough also claims that the tests should have been "two-tailed" rather than "one-tailed." A one-tailed test is suitable if you are seeking to show simply superiority. If you are interested in testing for equality or superiority, a two-tailed test would be required.
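
The distinction can be illustrated with a minimal sketch. It is not drawn from the record: the test statistic of 1.8 is a hypothetical value, and the use of a standard normal reference distribution is an assumption made only for illustration. The five percent threshold corresponds to the ninety-five percent confidence level mentioned in note 19.

```python
from statistics import NormalDist

def tail_p_values(z):
    """Return (one_tailed, two_tailed) p-values for a z statistic."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # One-tailed: only "product A is better" counts as evidence.
    one_tailed = 1.0 - std_normal.cdf(z)
    # Two-tailed: a difference in either direction counts as evidence.
    two_tailed = 2.0 * (1.0 - std_normal.cdf(abs(z)))
    return one_tailed, two_tailed

# Hypothetical test statistic, chosen only to show the effect.
one_tailed_p, two_tailed_p = tail_p_values(1.8)
significant_one_tailed = one_tailed_p < 0.05
significant_two_tailed = two_tailed_p < 0.05
```

With this hypothetical statistic the same data would clear the five percent threshold under a one-tailed test but not under a two-tailed test, which is why the choice between the two can matter to a claim of statistical significance.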

[25] Although the claim that "no leading lotion beats" VICL is literally a parity claim, it carries with it an intimation of superiority. Indeed, consumer tests performed by P & G indicate that some of the public so construed it, and, of course, in determining whether an advertising claim is misleading, implications and public reaction must be considered, American Home Products Corp. v. Johnson & Johnson, 577 F.2d 160, 165-67 (2d Cir.1978). However, because the majority of the public did not perceive it as a claim of superiority, the Court shall treat it, for purposes of this motion only, as a parity claim.

[26] The first such study was conducted in Watertown, New York, with twenty-eight subjects. Seven subjects applied VICL, seven others New Wondra, and the remaining fourteen two other skin lotion products. The subjects applied the test products for two weeks and then went through a one-week regression (no treatment) period. Skin conditions were evaluated by graders on a ten-point scale at intervals throughout the three-week period. Analysis of the grades showed no significant difference in relief of dry skin between the treatment groups.

The second study was conducted in Waltham, Massachusetts, and involved eleven subjects. At stated intervals each subject had New Wondra applied to one leg and VICL applied to the other. To be admitted to the study, each subject had to exhibit severe dryness. The study, which was conducted over a twenty-two day period, consisted of two periods of product usage (days one through five and eight through twelve) and a regression period. Analysis of the data showed no significant difference between the products.

[27] The "power" of a test indicates how effective it is in detecting differences. Statisticians agree that a moderately large number of subjects have to be tested to produce sufficient power for the test to demonstrate differences at a statistically significant level.

[28] The statistical analysis of the data was more appropriate for a test in which different people use different products than for one in which each person uses both products. The analysis also improperly included scores from the first day of the test, before the subjects had ever used the products. The effect of this inclusion was to make it more difficult to find statistically significant differences between the treatments and to dilute any differences that might otherwise have been shown to exist.
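
The first point can be sketched as follows. The grades below are hypothetical and are not taken from any test in the record, and the two t statistics are textbook formulas assumed here for illustration, not the analysis actually performed. When each subject uses both products, a paired analysis looks at the within-subject differences; an analysis that treats the two sets of grades as coming from different people lets the variation between subjects swamp a consistent difference.

```python
from math import sqrt
from statistics import mean, stdev, variance

# Hypothetical dryness grades for five subjects; each subject used
# product A on one hand and product B on the other.
grades_a = [1.0, 2.0, 3.0, 4.0, 5.0]
grades_b = [1.5, 2.4, 3.6, 4.5, 5.5]

def paired_t(a, b):
    """t statistic computed on within-subject differences."""
    diffs = [y - x for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

def independent_t(a, b):
    """t statistic that treats the two groups as different people."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(b) - mean(a)) / standard_error

t_paired = paired_t(grades_a, grades_b)
t_independent = independent_t(grades_a, grades_b)
```

On these hypothetical grades the paired statistic is many times larger than the independent-groups statistic, although the underlying data are identical.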

[29] Indeed, on about sixty occasions Dr. McIntyre changed grades already given to one hand of a subject after looking at the other hand, even though it was the two hands of each person that the study was meant to compare.

[30] Dr. McIntyre was not trained in the use of the Chesebrough nine-point grading scale, had never used the scale before, and had limited experience in grading, in general.

[31] The Killington test employed an analysis that is appropriate only when the products involved cause a uniform type of response at each of the various times the skin is examined. According to the results of days eight and ten, this was not the case.

[32] Note, The Law of Comparative Advertising: How Much Worse is "Better" Than "Great", 76 Colum.L.Rev. 80, 91 (1976).

[33] Of course, the scarcity of such suits may have been largely due to the fact that comparative advertising did not become commonplace until the 1970's.

[34] An additional problem with the freshness advertising was that, although the defendants' product was not made from a frozen concentrate, during the peak growing season, when supply exceeded demand, quantities were frozen for use during the slack season.

[35] Some of these served as fact witnesses as well.

[36] This was true despite the fact that I have taken an intensive seminar in statistics and econometrics to better equip me for such cases.