Merrell Dow Pharmaceuticals, Inc. v. Havner

OWEN, Justice,

delivered the opinion of the Court

in which PHILLIPS, Chief Justice, and GONZALEZ, HECHT, CORNYN, ENOCH and ABBOTT, Justices, join.

The issue in this case is whether there is any evidence that the drug Bendectin caused Kelly Havner to be born with a birth defect. We hold that the evidence offered is legally insufficient to establish causation. Accordingly, we reverse the judgment of the court of appeals. 907 S.W.2d 535.

I

Kelly Havner was born with a limb reduction birth defect. The fingers on her right hand were not formed. Kelly’s mother had taken the prescription drug Bendectin in 1981 during her pregnancy to relieve nausea and other symptoms associated with morning sickness. Bendectin was formulated by Merrell Dow and its predecessors and marketed in the United States from 1957 to 1983. It was sold in other countries as well, but was called Debendox in the British Commonwealth, Ireland, and Australia and Lenotan in West Germany. The Bendectin Marilyn Havner ingested had two components: doxylamine succinate, which is an antihistamine, and pyridoxine hydrochloride, which is vitamin B-6. Prior to 1977, Bendectin had contained a third component, dicyclomine hydrochloride, which is an anticholinergic. Approximately thirty million women took Bendectin in either the two- or three-ingredient form.

More than twenty years ago, questions were raised about Bendectin and its possible association with birth defects. The FDA investigated the concerns, but failed to conclude that Bendectin increased the risk of birth defects. More than thirty studies on Bendectin and birth defects have been conducted and published in peer-reviewed scientific and medical journals since questions were first raised. None of these studies concludes that children of women who took Bendectin during pregnancy had an increased risk of limb reduction birth defects. Some of these studies affirmatively conclude that there is no association between Bendectin and birth defects and that Bendectin is a safe drug. Although FDA approval of Bendectin has never been revoked, Merrell Dow withdrew the drug from the market in 1983, a little over a year after Kelly Havner was born.

The Havners’ suit is based on theories of negligence, defective design, and defective marketing. It is one of thousands brought against Merrell Dow and its predecessors for the manufacture and distribution of Bendectin. In virtually all the Bendectin litigation, the central issue has been the scientific reliability of the expert testimony offered to establish causation. Merrell Dow challenged the Havners’ causation evidence at several junctures in these proceedings. It filed a motion for summary judgment, contending that there is no scientifically reliable evidence that Bendectin causes limb reduction birth defects or that it caused Kelly Havner’s birth defect. Before denying the motion, the trial court held a hearing at which the scientific reliability of the Havners’ summary judgment evidence was extensively aired.

Just before trial, the scientific reliability of the Havners’ evidence was again raised by Merrell Dow in motions in limine that sought to exclude the testimony of certain of the Havners’ experts and other causation evidence. One of these motions requested that testimony about causation be excluded until a prima facie case had been established that there was a statistically significant elevated risk that a child would be born with limb reduction birth defects if the child’s mother ingested Bendectin. Another motion sought to preclude the Havners’ witnesses from relying on in vitro and in vivo animal studies. Other motions sought to exclude entirely the testimony of three of the Havners’ causation witnesses. The issues were fully briefed, and after a lengthy hearing, the trial court denied each of the motions.

A bifurcated jury trial ensued. In the liability phase, the Havners called five experts on the causation question. Merrell Dow objected to the admission of some, but not all, of this evidence. Merrell Dow also unsuccessfully moved for a directed verdict on the issue of causation at the close of the Havners’ evidence. As can be seen from the record, the question of scientific reliability was raised repeatedly.

At the conclusion of the liability phase, the jury found in favor of the Havners and awarded $3.75 million. In the punitive damages stage, the jury awarded $30 million, but that amount was reduced by the trial court to $15 million pursuant to former Tex. Civ. Prac. & Rem.Code § 41.007. Merrell Dow appealed.

The panel of the court of appeals that originally heard the case reversed and rendered judgment that the Havners take nothing, holding that the evidence of causation was legally insufficient. 907 S.W.2d at 548. The panel concluded that “[t]he Havners have failed to bring forward anything more than suspicion on the essential element of causation.” Id. On rehearing en banc, a divided court disagreed. It affirmed the trial court’s award of actual damages, but reversed and rendered the award of punitive damages. Id. at 564. We granted Merrell Dow’s application for writ of error.

Merrell Dow challenges the legal sufficiency of the Havners’ causation evidence and the admissibility of some of that evidence and further contends that its due process rights under the United States Constitution and its due course rights under the Texas Constitution were denied. Because of our disposition of this case, we reach only the no evidence point of error.

II

All the expert witnesses on causation have appeared in other cases in which Bendectin was claimed to have caused limb reduction birth defects. The Sixth Circuit commented that the Bendectin suits are “variations on a theme, somewhat like an orchestra which travels to different music halls, substituting musicians from time to time but playing essentially the same repertoire.” Turpin v. Merrell Dow Pharms., Inc., 959 F.2d 1349, 1351 (6th Cir.1992).

The federal courts have dealt extensively with Bendectin litigation. To date, no plaintiff has ultimately prevailed in federal court. The evidence in those cases has been similar to that offered by the Havners. The federal decisions have discussed the substance of the evidence in detail, and often the testimony under scrutiny included that of Drs. Palmer, Newman, Glasser, Gross, and Swan, the Havners’ witnesses. These decisions are not binding on our Court, but they do provide extensive consideration of the scientific reliability of the causation evidence.

Some federal courts have concluded that the expert evidence of causation is legally insufficient. See Elkins v. Richardson-Merrell, Inc., 8 F.3d 1068 (6th Cir.1993); Turpin, 959 F.2d 1349; Brock v. Merrell Dow Pharms., Inc., 874 F.2d 307 (5th Cir.), modified on reh’g, 884 F.2d 166 (5th Cir.1989); Richardson v. Richardson-Merrell, Inc., 857 F.2d 823 (D.C.Cir.1988); LeBlanc v. Merrell Dow Pharms., Inc., 932 F.Supp. 782 (E.D.La.1996); Hull v. Merrell Dow Pharms., Inc., 700 F.Supp. 28 (S.D.Fla.1988); Monahan v. Merrell-National Labs., No. 83-3108-WD, 1987 WL 90269 (D.Mass. Dec.18, 1987).

Other federal courts have found the expert evidence to be inadmissible. See Raynor v. Merrell Pharms., Inc., 104 F.3d 1371 (D.C.Cir.1997); Daubert v. Merrell Dow Pharms., Inc., 43 F.3d 1311 (9th Cir.) (on remand), cert. denied, — U.S. -, 116 S.Ct. 189, 133 L.Ed.2d 126 (1995); Ealy v. Richardson-Merrell, Inc., 897 F.2d 1159 (D.C.Cir.1990); Lynch v. Merrell-National Labs., 830 F.2d 1190 (1st Cir.1987); DeLuca v. Merrell Dow Pharms., Inc., 791 F.Supp. 1042 (D.N.J.1992), aff'd, 6 F.3d 778 (3d Cir.1993); Lee v. Richardson-Merrell, Inc., 772 F.Supp. 1027 (W.D.Tenn.1991), aff'd, 961 F.2d 1577 (6th Cir.1992); Cadarian v. Merrell Dow Pharms., Inc., 745 F.Supp. 409 (E.D.Mich.1989); Ambrosini v. Richardson-Merrell, Inc., No. 86-278, 1989 WL 298429 (D.D.C. June 30, 1989), aff'd, 946 F.2d 1563 (D.C.Cir.1991); Will v. Richardson-Merrell, Inc., 647 F.Supp. 544 (S.D.Ga.1986).

One federal circuit court initially found the expert testimony admissible and reversed a summary judgment for Merrell Dow. DeLuca v. Merrell Dow Pharms., Inc., 911 F.2d 941, 952-59 (3d Cir.1990). However, on remand the trial court once again found the evidence inadmissible and, after entering extensive findings of fact and conclusions of law, granted summary judgment for Merrell Dow. The Third Circuit affirmed that judgment with an unpublished opinion. DeLuca v. Merrell Dow Pharms., Inc., 791 F.Supp. 1042 (D.N.J.1992), aff'd, 6 F.3d 778 (3d Cir.1993).

A few federal district courts have denied summary judgment for Merrell Dow on the basis that the evidence raised a fact question. Longmore v. Merrell Dow Pharms., Inc., 737 F.Supp. 1117 (D.Idaho 1990); In re Bendectin Prods. Liab. Litig., 732 F.Supp. 744 (E.D.Mich.1990); Hagen v. Richardson-Merrell, Inc., 697 F.Supp. 334 (N.D.Ill.1988); see also Lanzilotti v. Merrell Dow Pharms., Inc., No. 82-0183, 1986 WL 7832 (E.D.Pa. July 10, 1986) (denying motion for directed verdict).

Decisions in which Merrell Dow obtained a jury verdict in its favor include Wilson v. Merrell Dow Pharmaceuticals, Inc., 893 F.2d 1149 (10th Cir.1990), and In re Bendectin Litigation, 857 F.2d 290 (6th Cir.1988).

However, a state trial court recently entered judgment on a jury verdict against Merrell Dow that included a finding of fraud. In a written opinion, the court was highly critical of the evidence offered by Merrell Dow, concluding that there was ample evidence Merrell Dow had made misrepresentations to the FDA, including misrepresentations about its animal studies on Bendectin. Blum v. Merrell Dow Pharm., Inc., No. 1027 (Pa.Ct.C.P. Dec. 13, 1996) (appeal pending).

At least one state court has granted summary disposition for Merrell Dow on the basis that the expert testimony of Drs. Newman, Palmer, and Swan was inadmissible. DePyper v. Navarro, No. 83-303467-NM, 1995 WL 788828 (Mich.Cir.Ct. Nov.27, 1995) (holding plaintiffs’ experts’ testimony inadmissible under the Davis/Frye rule and rendering judgment for Merrell Dow).

The only appellate decision we have found, state or federal, that has upheld a verdict in favor of a plaintiff in a Bendectin case is from the court of appeals for the District of Columbia in Oxendine v. Merrell Dow Pharmaceuticals, Inc., 506 A.2d 1100 (D.C.1986) (reversing judgment notwithstanding the verdict and remanding for reinstatement of compensatory damages and determination of punitive damages). However, the subsequent history of that case is somewhat extraordinary. Upon remand to the trial court, instead of following the court of appeals’ directive, the trial court granted Merrell Dow’s motion for new trial and vacated the judgment. Another appeal ensued, and the case was remanded with instructions that a judgment be entered on the verdict. Oxendine v. Merrell Dow Pharmaceuticals, Inc., 563 A.2d 330, 331, 338 (D.C.1989). Judgment was entered. Yet another appeal was taken, but the appeal was dismissed for lack of finality because the question of punitive damages remained to be tried. Merrell Dow Pharms., Inc. v. Oxendine, 593 A.2d 1023 (D.C.1991). Following remand, judgment was entered, but Merrell Dow sought relief from the judgment in light of post-trial developments including epidemiological studies that were not completed at the time of trial. Merrell Dow also relied on appellate decisions decided on the heels of the first appellate decision in Oxendine that had concluded that there was no scientifically reliable evidence of causation in the Bendectin cases. The trial court declined to set aside the judgment. Merrell Dow Pharms., Inc. v. Oxendine, 649 A.2d 825, 827 (D.C.1994). The fourth appeal ensued, and the appellate court remanded the case to the trial court for a determination of whether Merrell Dow could demonstrate “that the newly discovered evidence ‘would probably produce a different verdict if a new trial were granted.’ ” Id. at 832. On remand, the trial court extensively reviewed the evidence, including the testimony or affidavits of Drs. Newman, Swan, Palmer, Gross, and Glasser, and granted relief from the verdict, rendering judgment for Merrell Dow. Oxendine v. Merrell Dow Pharms., Inc., No. 82-1245, 1996 WL 680992 (D.C.Super.Ct. Oct. 24, 1996) (appeal pending).

Thus, we are not the first court to wrestle with the issues presented by the Bendectin litigation.

III

As in most of the Bendectin cases, the central issue before us is not whether the plaintiffs’ witnesses possessed adequate credentials, skills, or experience to testify about causation. The only witness whose qualifications have been challenged is Dr. Palmer, whose experience in identifying the cause of birth defects is questioned by Merrell Dow. Cf. United Blood Servs. v. Longoria, 938 S.W.2d 29, 30-31 (Tex.1997); Broders v. Heise, 924 S.W.2d 148, 151-54 (Tex.1996). Indeed, the Havners’ causation witnesses, including Dr. Palmer, testified in a case that reached the United States Supreme Court, and that Court deemed their credentials “impressive.” Daubert v. Merrell Dow Pharms., Inc., 509 U.S. 579, 583 & n. 2, 113 S.Ct. 2786, 2792 & n. 2, 125 L.Ed.2d 469 (1993). The issue before us, as in most of the previously cited Bendectin cases, is whether the Havners’ evidence is scientifically reliable and thus some evidence to support the judgment in their favor.

In determining whether there is no evidence of probative force to support a jury’s finding, all the record evidence must be considered in the light most favorable to the party in whose favor the verdict has been rendered, and every reasonable inference deducible from the evidence is to be indulged in that party’s favor. Harbin v. Seale, 461 S.W.2d 591, 592 (Tex.1970). A no evidence point will be sustained when (a) there is a complete absence of evidence of a vital fact, (b) the court is barred by rules of law or of evidence from giving weight to the only evidence offered to prove a vital fact, (c) the evidence offered to prove a vital fact is no more than a mere scintilla, or (d) the evidence conclusively establishes the opposite of the vital fact. Robert W. Calvert, “No Evidence” and “Insufficient Evidence” Points of Error, 38 Tex. L.Rev. 361, 362-63 (1960). More than a scintilla of evidence exists when the evidence supporting the finding, as a whole, “‘rises to a level that would enable reasonable and fair-minded people to differ in their conclusions.’ ” Burroughs Wellcome Co. v. Crye, 907 S.W.2d 497, 499 (Tex.1995) (quoting Transportation Ins. Co. v. Moriel, 879 S.W.2d 10, 25 (Tex.1994)).

Several of the Havners’ experts testified that Bendectin can cause limb reduction birth defects. Dr. Palmer testified that, to a reasonable degree of medical certainty, Kelly Havner’s birth defect was caused by the Bendectin her mother ingested during pregnancy. We have held, however, that an expert’s bare opinion will not suffice. See Burroughs Wellcome, 907 S.W.2d at 499-500; Schaefer v. Texas Employers’ Ins. Ass’n, 612 S.W.2d 199, 202-04 (Tex.1980). The substance of the testimony must be considered. Burroughs Wellcome, 907 S.W.2d at 499-500; Schaefer, 612 S.W.2d at 202.

In Schaefer, a workers’ compensation case, the plaintiff suffered from atypical tuberculosis, some strains of which were carried by fowl. An expert testified that based on reasonable medical probability, the plaintiff’s disease resulted from his employment as a plumber in which he was exposed to soil contaminated with the feces of birds. Schaefer, 612 S.W.2d at 202. Nevertheless, this Court looked at the testimony in its entirety, noting that to accept the expert’s opinion as some evidence “simply because he used the magic words” would effectively remove the jurisdiction of the appellate courts to determine the legal sufficiency of the evidence in any case requiring expert testimony. Id. at 202-05. After considering the record in Schaefer, this Court held that there was no evidence of causation because despite the “magic language” used, the expert testimony was not based on reasonable medical probability but instead relied on possibility, speculation, and surmise. Id. at 204-05.

Other courts have likewise recognized that it is not so simply because “an expert says it is so.” Viterbo v. Dow Chem. Co., 826 F.2d 420, 421 (5th Cir.1987). When the expert “br[ings] to court little more than his credentials and a subjective opinion,” this is not evidence that would support a judgment. Id. at 421-22. The Fifth Circuit in Viterbo affirmed a summary judgment and the exclusion of expert testimony that was unreliable, holding that “[i]f an opinion is fundamentally unsupported, then it offers no expert assistance to the jury.” Id. at 422; see also Rosen v. Ciba-Geigy Corp., 78 F.3d 316, 319 (7th Cir.) (“[A]n expert who supplies nothing but a bottom line supplies nothing of value to the judicial process.”), cert. denied, — U.S. -, 117 S.Ct. 73, 136 L.Ed.2d 33 (1996); Turpin v. Merrell Dow Pharms., Inc., 959 F.2d 1349, 1360 (6th Cir.1992) (holding evidence legally insufficient in Bendectin case when no understandable scientific basis was stated).

It could be argued that looking beyond the testimony to determine the reliability of scientific evidence is incompatible with our no evidence standard of review. If a reviewing court is to consider the evidence in the light most favorable to the verdict, the argument runs, a court should not look beyond the expert’s testimony to determine if it is reliable. But such an argument is too simplistic. It reduces the no evidence standard of review to a meaningless exercise of looking to see only what words appear in the transcript of the testimony, not whether there is in fact some evidence. We have rejected such an approach. See Schaefer, 612 S.W.2d at 205; see also Burroughs Wellcome, 907 S.W.2d at 499-500.

Justice Gonzalez, in writing for the Court, gave rather colorful examples of unreliable scientific evidence in E.I. du Pont de Nemours & Co. v. Robinson, 923 S.W.2d 549, 558 (Tex.1995), when he said that even an expert with a degree should not be able to testify that the world is flat, that the moon is made of green cheese, or that the Earth is the center of the solar system. If for some reason such testimony were admitted in a trial without objection, would a reviewing court be obliged to accept it as some evidence? The answer is no. In concluding that this testimony is scientifically unreliable and therefore no evidence, however, a court necessarily looks beyond what the expert said. Reliability is determined by looking at numerous factors including those set forth in Robinson and Daubert. The testimony of an expert is generally opinion testimony. Whether it rises to the level of evidence is determined under our rules of evidence, including Rule 702, which requires courts to determine if the opinion testimony will assist the jury in deciding a fact issue.1 While Rule 702 deals with the admissibility of evidence, it offers substantive guidelines in determining if the expert testimony is some evidence of probative value.

Similarly, to say that the expert’s testimony is some evidence under our standard of review simply because the expert testified that the underlying technique or methodology supporting his or her opinion is generally accepted by the scientific community is putting the cart before the horse. As we said in Robinson, an expert’s bald assurance of validity is not enough. 923 S.W.2d at 559 (quoting Daubert v. Merrell Dow Pharms., Inc., 43 F.3d 1311, 1316 (9th Cir.) (on remand) (holding that expert’s assertion of validity is not enough; there must be objective, independent validation of the expert’s methodology), cert. denied, — U.S. -, 116 S.Ct. 189, 133 L.Ed.2d 126 (1995)).

The view that courts should not look beyond an averment by the expert that the data underlying his or her opinion are the type of data on which experts reasonably rely has likewise been rejected by other courts. The underlying data should be independently evaluated in determining if the opinion itself is reliable. See, e.g., In re Paoli R.R. Yard PCB Litig., 35 F.3d 717, 747-48 (3d Cir.1994); Richardson v. Richardson-Merrell, Inc., 857 F.2d 823, 829 (D.C.Cir.1988); In re Agent Orange Liab. Litig., 611 F.Supp. 1223, 1245 (E.D.N.Y.1985), aff'd, 818 F.2d 187 (2d Cir.1987). In the wake of the Supreme Court’s decision in Daubert, the Third Circuit overruled its prior holding in DeLuca v. Merrell Dow Pharmaceuticals, Inc., 911 F.2d 941, 952 (3d Cir.1990), that an expert’s averment that his or her testimony is based on the type of data on which experts reasonably rely is generally enough to survive a Federal Rule of Evidence 703 inquiry. In re Paoli, 35 F.3d at 747-48. The Third Circuit was persuaded by Judge Weinstein’s opinion in In re Agent Orange: “ ‘If the underlying data are so lacking in probative force and reliability that no reasonable expert could base an opinion on them, an opinion which rests entirely upon them must be excluded.’ ” Id. at 748 (quoting In re Agent Orange, 611 F.Supp. at 1245). If the expert’s scientific testimony is not reliable, it is not evidence. The threshold determination of reliability does not run afoul of our no evidence standard of review.

Indeed, the United States Supreme Court would agree that a determination of scientific reliability is appropriate in reviewing the legal sufficiency of evidence. While admissibility rather than sufficiency was the focus of the Supreme Court’s decision in Daubert, that Court explained that when “wholesale exclusion” is inappropriate and the evidence is admitted, a review of its sufficiency is not foreclosed:

[I]n the event the trial court concludes that the scintilla of evidence presented supporting a position is insufficient to allow a reasonable juror to conclude that the position more likely than not is true, the court remains free to direct a judgment ... and likewise to grant summary judgment.

509 U.S. at 595, 113 S.Ct. at 2798.

The Court cited two Bendectin decisions in support of this statement, Turpin, 959 F.2d 1349, and Brock v. Merrell Dow Pharmaceuticals, Inc., 874 F.2d 307 (5th Cir.), modified on reh’g, 884 F.2d 166 (5th Cir.1989). In Turpin, the Sixth Circuit held that the scientific evidence, viewed in the light most favorable to the plaintiffs, was not sufficient to allow a jury to find that it was more probable than not that the defendant caused the injury. Turpin, 959 F.2d at 1350. In Brock, the Fifth Circuit reversed a judgment entered on a jury verdict because the evidence of causation was legally insufficient. Brock, 874 F.2d at 315; see also Raynor v. Merrell Pharms. Inc., 104 F.3d 1371, 1376 (D.C.Cir.1997) (affirming judgment notwithstanding the verdict and noting that even if expert testimony were admissible under Daubert, it was “unlikely” that a jury could reasonably find it sufficient to show causation).

As already discussed, a number of other decisions in the Bendectin litigation have held that the causation evidence was legally insufficient, sometimes setting aside a jury verdict and in other cases granting summary judgment or a directed verdict. See supra at 709. The decision in Richardson-Merrell said in no uncertain terms that the trial court did not err in granting judgment notwithstanding the verdict because “[w]hether an expert’s opinion has an adequate basis” is an issue “falling within the province of the court.” 857 F.2d at 833.

There are many decisions outside the Bendectin litigation that have examined the reliability of scientific evidence in a review of the legal sufficiency of the evidence. See, e.g., Conde v. Velsicol Chem. Corp., 24 F.3d 809, 813 (6th Cir.1994) (stating that even if evidence is admissible under Daubert, it can still be legally insufficient to withstand summary judgment); Wade-Greaux v. Whitehall Labs., Inc., 874 F.Supp. 1441, 1485-86 (D.Vi.) (granting summary judgment in toxic tort case when evidence of causation was insufficient to sustain a jury verdict), aff'd, 46 F.3d 1120 (3d Cir.1994); see also Vadala v. Teledyne Indus., Inc., 44 F.3d 36, 39 (1st Cir.1995) (noting that even if expert testimony about cause of plane crash were admitted, it would not be sufficient to permit a jury to find in plaintiffs’ favor); In re Paoli, 35 F.3d at 750 n. 21 (“[I]f the scintilla of evidence presented is insufficient to allow a reasonable juror to conclude that the position more likely than not is true, the court remains free to direct a judgment ... [or] to grant summary judgment.”); cf. In re Joint Eastern & Southern Dist. Asbestos Litig., 52 F.3d 1124, 1131-37 (2d Cir.1995) (finding evidence of causation in asbestos case legally sufficient and reversing trial court’s judgment notwithstanding the verdict); Gruca v. Alpha Therapeutic Corp., 51 F.3d 638, 643 (7th Cir.1995) (holding that trial court abdicated its responsibility by refusing to rule on admissibility and by instructing a verdict for the defendant in a blood bank case; assuming admissibility of the evidence, it would be legally sufficient). But see Joiner v. General Elec. Co., 78 F.3d 524, 534 (11th Cir.1996) (Birch, J., concurring) (stating that the sufficiency and weight of evidence are beyond the scope of a Daubert analysis), cert. granted, — U.S. -, 117 S.Ct. 1243, 137 L.Ed.2d 325 (1997).

In Robinson, we set forth some of the factors that courts should consider in looking beyond the bare opinion of the expert. Those factors include:

(1) the extent to which the theory has been or can be tested;
(2) the extent to which the technique relies upon the subjective interpretation of the expert;
(3) whether the theory has been subjected to peer review and publication;
(4) the technique’s potential rate of error;
(5) whether the underlying theory or technique has been generally accepted as valid by the relevant scientific community; and
(6) the non-judicial uses that have been made of the theory or technique.

See Robinson, 923 S.W.2d at 557. The issue in Robinson was admissibility of evidence, but as we have explained the same factors may be applied in a no evidence review of scientific evidence.

If the foundational data underlying opinion testimony are unreliable, an expert will not be permitted to base an opinion on that data because any opinion drawn from that data is likewise unreliable. Further, an expert’s testimony is unreliable even when the underlying data are sound if the expert draws conclusions from that data based on flawed methodology. A flaw in the expert’s reasoning from the data may render reliance on a study unreasonable and render the inferences drawn therefrom dubious. Under that circumstance, the expert’s scientific testimony is unreliable and, legally, no evidence.

We next consider some of the difficult issues surrounding proof of causation in a toxic tort case such as this.

IV

The Havners do not contend that all limb reduction birth defects are caused by Bendectin or that Bendectin always causes limb reduction birth defects even when taken at the critical time of limb development. Experts for the Havners and Merrell Dow agreed that some limb reduction defects are genetic. These experts also agreed that the cause of a large percentage of limb reduction birth defects is unknown. Given these undisputed facts, what must a plaintiff establish to raise a fact issue on whether Bendectin caused an individual’s birth defect? The question of causation in cases like this one has engendered considerable debate. Courts that have addressed the issue have not always agreed, and commentators have expressed widely divergent views on the quantum and quality of evidence necessary to sustain a recovery.

Sometimes, causation in toxic tort cases is discussed in terms of general and specific causation. See, e.g., Raynor v. Merrell Pharms., Inc., 104 F.3d 1371, 1376 (D.C.Cir.1997); Joseph Sanders, From Science to Evidence: The Testimony on Causation in the Bendectin Cases, 46 Stan. L.Rev. 1, 14 (1993). General causation is whether a substance is capable of causing a particular injury or condition in the general population, while specific causation is whether a substance caused a particular individual’s injury. In some cases, controlled scientific experiments can be carried out to determine if a substance is capable of causing a particular injury or condition, and there will be objective criteria by which it can be determined with reasonable certainty that a particular individual’s injury was caused by exposure to a given substance. However, in many toxic tort cases, direct experimentation cannot be done, and there will be no reliable evidence of specific causation.

In the absence of direct, scientifically reliable proof of causation, claimants may attempt to demonstrate that exposure to the substance at issue increases the risk of their particular injury. The finder of fact is asked to infer that because the risk is demonstrably greater in the general population due to exposure to the substance, the claimant’s injury was more likely than not caused by that substance. Such a theory concedes that science cannot tell us what caused a particular plaintiff’s injury. It is based on a policy determination that when the incidence of a disease or injury is sufficiently elevated due to exposure to a substance, someone who was exposed to that substance and exhibits the disease or injury can raise a fact question on causation. See generally Daubert v. Merrell Dow Pharms., Inc., 43 F.3d 1311, 1320 n. 13 (9th Cir.) (on remand), cert. denied, — U.S. -, 116 S.Ct. 189, 133 L.Ed.2d 126 (1995). The Havners rely to a considerable extent on epidemiological studies for proof of general causation. Accordingly, we consider the use of epidemiological studies and the “more likely than not” burden of proof.

A

Epidemiological studies examine existing populations to attempt to determine if there is an association between a disease or condition and a factor suspected of causing that disease or condition. See, e.g., Bert Black & David E. Lilienfeld, Epidemiologic Proof in Toxic Tort Litigation, 52 Fordham L.Rev. 732, 750 (1984). However, witnesses for the Havners and commentators in this area uniformly acknowledge that epidemiological studies cannot establish that a given individual contracted a disease or condition due to exposure to a particular drug or agent. See, e.g., Michael Dore, A Commentary on the Use of Epidemiological Evidence in Demonstrating Cause-In-Fact, 7 Harv. Envtl. L. Rev. 429, 431-35 (1983); Steve Gold, Causation in Toxic Torts: Burdens of Proof, Standards of Persuasion, and Statistical Evidence, 96 Yale L.J. 376, 380 (1986). Dr. Glasser, a witness for the Havners, gave as an example a study designed to see if a given drug causes rashes. Even though a study may show that ten people who took the drug exhibited a rash, while rashes appeared on only three people who did not take the drug, Dr. Glasser explained that the study cannot tell us which of the exposed ten got the rash because of the drug. We know that things other than the drug cause rashes.

Recognizing that epidemiological studies cannot establish the actual cause of an individual’s injury or condition, a difficult question for the courts is how a plaintiff faced with this conundrum can raise a fact issue on causation and meet the “more likely than not” burden of proof. Generally, more recent decisions have been willing to recognize that epidemiological studies showing an increased risk may support a recovery. Judge Weinstein, whose decision in the Agent Orange litigation has been widely discussed and followed, has observed that courts have been divided between the “strong” and “weak” versions of the preponderance rule. In re “Agent Orange” Prod. Liab. Litig., 611 F.Supp. 1223, 1261 (E.D.N.Y.1985) (citing David Rosenberg, The Causal Connection in Mass Exposure Cases: A “Public Law” Vision of the Tort System, 97 Harv. L.Rev. 851, 857 (1984)). The “strong” version requires a plaintiff to offer both epidemiological evidence that the probability of causation exceeds fifty percent in the exposed population and “particularistic” proof that the substance harmed the individual. The “weak” version allows verdicts to be based solely on statistical evidence. Rosenberg, supra, 97 Harv. L. Rev. at 857-58. Judge Weinstein concluded that the plaintiffs in Agent Orange were required to offer evidence that causation was “more than 50 percent probable,” 611 F.Supp. at 1262, and that the plaintiffs’ experts were required to “rule out the myriad other possible causes of the veterans’ afflictions,” id. at 1263.

Other courts have likewise found that the requirement of a more than 50% probability means that epidemiological evidence must show that the risk of an injury or condition in the exposed population was more than double the risk in the unexposed or control population. See, e.g., Daubert, 43 F.3d at 1320 (requiring Bendectin plaintiffs to show that mothers’ ingestion of the drug more than doubled the likelihood of birth defects); DeLuca v. Merrell Dow Pharms., Inc., 911 F.2d 941, 958 (3d Cir.1990) (requiring that Bendectin plaintiffs establish relative risk of limb reduction defects arising from epidemiological data of at least 2.0, which equates to more than a doubling of the risk); Hall v. Baxter Healthcare Corp., 947 F.Supp. 1387, 1403 (D.Or.1996) (requiring breast-implant plaintiffs to demonstrate that exposure to breast implants more than doubled the risk of their alleged injuries, which, in epidemiological terms, requires a relative risk of more than 2.0); Manko v. United States, 636 F.Supp. 1419, 1434 (W.D.Mo.1986) (stating that a relative risk of 2.0 in an epidemiological study means that the disease more likely than not was caused by the event), aff'd in relevant part, 830 F.2d 831 (8th Cir.1987); Marder v. G.D. Searle & Co., 630 F.Supp. 1087, 1092 (D.Md.1986) (stating that in IUD litigation, a showing of causation by a preponderance of the evidence, in epidemiological terms, requires a relative risk of at least 2.0), aff'd, 814 F.2d 655 (4th Cir.1987); Cook v. United States, 545 F.Supp. 306, 308 (N.D.Cal.1982) (stating that in vaccine case, when relative risk is greater than 2.0, there is a greater than 50% chance that the injury was caused by the vaccine).

Some courts have reached a contrary conclusion, holding that epidemiological evidence showing something less than a doubling of the risk may support a jury’s finding of causation. In In re Joint Eastern & Southern District Asbestos Litigation, 52 F.3d 1124, 1134 (2d Cir.1995), the Second Circuit observed that the district court cited no authority for the “bold” assertion that standardized mortality ratios of 1.5 are statistically insignificant and cannot be relied upon by a jury. The circuit court held that it was far preferable to instruct the jury on statistical significance and to let the jury decide whether studies over the 1.0 mark have any significance. Id.; see also Allen v. United States, 588 F.Supp. 247, 418-19 (D.Utah 1984) (explicitly rejecting the greater than 50% standard of causation in connection with statistical evidence), rev’d on other grounds, 816 F.2d 1417 (10th Cir.1987); Grassis v. Johns-Manville Corp., 248 N.J.Super. 446, 591 A.2d 671, 674-76 (App.Div.1991) (holding that trial court erred in precluding opinion testimony based on epidemiological studies showing relative risks of less than 2.0).

The “doubling of the risk” issue in toxic tort cases has provided fertile ground for the scholarly plow. Those who advocate that something short of a doubling of the risk is adequate to support liability or who advocate that some type of proportionate liability should be imposed include Daniel A. Farber, Toxic Causation, 71 Minn. L.Rev. 1219, 1237-51 (1987); Gold, supra, 96 Yale L.J. at 395-401; Khristine L. Hall & Ellen K. Silbergeld, Reappraising Epidemiology: A Response to Mr. Dore, 7 Harv. Envtl. L.Rev. 441, 445-46 (1983); Rosenberg, supra, 97 Harv. L.Rev. at 859-60; see also 2 American Law Inst., Enterprise Responsibility for Personal Injury 369-75 (1991) (discussing toxic tort cases and suggesting that proportionate compensation to all with the disease or disorder should be based on the attributable fractions of causation); D.H. Kaye, Apples and Oranges: Confidence Coefficients and the Burden of Persuasion, 73 Cornell L.Rev. 54, 71-73 (1987).

On the other end of the spectrum is Michael Dore, who asserts that epidemiological studies cannot, standing alone, establish causation. See Dore, A Commentary on the Use of Epidemiological Evidence, supra, 7 Harv. Envtl. L. Rev. at 434; see also Michael D. Green, Expert Witnesses and Sufficiency of Evidence in Toxic Substances Litigation: The Legacy of Agent Orange and Bendectin Litigation, 86 Nw. U.L.Rev. 643, 691 (1992) (concluding that in the absence of other information, a doubling of the risk would be inadequate to support a plaintiff’s verdict, but advocating that a lower risk might be sufficient if other risk factors could be eliminated); Melissa Moore Thompson, Causal Inference in Epidemiology: Implications for Toxic Tort Litigation, 71 N.C. L.Rev. 247, 253, 289 (1992) (arguing that a strong association requires a risk ratio greater than or equal to 8.0, although moderate association of 3.0 to 8.0 could suffice if coupled with other factors).

Some commentators have been particularly critical of attempts by the courts to meld the more than 50% probability requirement with the relative risks found in epidemiological studies in determining if the studies were admissible or were some evidence that would support an award for the claimant. But there is disagreement on how epidemiological studies should be used. Some commentators contend that the more than 50% probability requirement is too stringent, while others argue that epidemiological studies have no relation to the legal requirement of “more likely than not.” Compare Gold, supra, 96 Yale L.J. at 395-97 (advocating a relaxed threshold of proof), with Diana B. Petitti, Reference Guide on Epidemiology, 36 Jurimetrics J. 159, 167-68 (1996) (finding no support in textbooks of epidemiology or from empirical studies for the proposition that when attributable risk exceeds 50% an agent is more likely than not to be the cause of the plaintiff’s disease), and Thompson, supra, 71 N.C. L.Rev. at 264-65 (asserting that the use of statistical association to satisfy a more likely than not standard is “misguided”). See also Carl F. Cranor et al., Judicial Boundary Drawing and the Need for Context-Sensitive Science in Toxic Torts after Daubert v. Merrell Dow Pharmaceuticals, Inc., 16 Va. Envtl. L.J. 1, 37-40 (1996) (arguing that epidemiological evidence should not be excluded simply because it reveals a relative risk less than 2.0, unless there is no other supporting evidence); Kaye, supra, 73 Cornell L.Rev. at 69 (arguing that it is fallacious to reason that “if the data are more probable under one hypothesis than another, then the former hypothesis is more likely to be true than the latter”); James Robins & Sander Greenland, The Probability of Causation Under a Stochastic Model for Individual Risk, 45 Biometrics 1125, 1131 (1989) (concluding that proportional liability schemes cannot be based on epidemiological data alone).

B

Although we recognize that there is not a precise fit between science and legal burdens of proof, we are persuaded that properly designed and executed epidemiological studies may be part of the evidence supporting causation in a toxic tort case and that there is a rational basis for relating the requirement that there be more than a “doubling of the risk” to our no evidence standard of review and to the more likely than not burden of proof. See generally DeLuca v. Merrell Dow Pharms., Inc., 911 F.2d 941, 958-59 (3d Cir.1990); Black & Lilienfeld, supra, 52 Fordham L.Rev. at 767; see also Daubert, 43 F.3d at 1321; Cook, 545 F.Supp. at 308.

Assume that a condition naturally occurs in six out of 1,000 people even when they are not exposed to a certain drug. If studies of people who did take the drug show that nine out of 1,000 contracted the disease, it is still more likely than not that causes other than the drug were responsible for any given occurrence of the disease since it occurs in six out of 1,000 individuals anyway. Six of the nine incidences would be statistically attributable to causes other than the drug, and therefore, it is not more probable that the drug caused any one incidence of disease. This would only amount to evidence that the drug could have caused the disease. However, if more than twelve out of 1,000 who take the drug contract the disease, then it may be statistically more likely than not that a given individual’s disease was caused by the drug.
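
By way of illustration, the arithmetic in the preceding paragraph can be restated in a short Python sketch. The figures below are the hypothetical ones used above (a background rate of six per 1,000 and exposed rates of nine and thirteen per 1,000); the function names are invented for this illustration, and no actual Bendectin data are involved.

```python
# Sketch of the doubling-of-the-risk arithmetic using the hypothetical
# incidence figures discussed above; no actual study data are involved.

BACKGROUND_RATE = 6 / 1000  # incidence among the unexposed

def relative_risk(exposed_rate, unexposed_rate=BACKGROUND_RATE):
    """Incidence among the exposed divided by incidence among the unexposed."""
    return exposed_rate / unexposed_rate

for cases_per_1000 in (9, 13):
    rr = relative_risk(cases_per_1000 / 1000)
    # The share of exposed cases statistically attributable to the drug is
    # (rr - 1) / rr; it exceeds one-half only when the risk more than doubles.
    attributable = (rr - 1) / rr
    print(f"{cases_per_1000} cases per 1,000 exposed: relative risk {rr:.2f}, "
          f"attributable share {attributable:.0%}, doubling of the risk: {rr > 2.0}")
```

At nine cases per 1,000 the relative risk is 1.5 and only a third of the exposed cases are statistically attributable to the drug; at thirteen per 1,000 the relative risk is roughly 2.17 and the attributable share first exceeds one-half.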

This is an oversimplification of statistical evidence relating to general causation, as we discuss below, but it illustrates the thinking behind the doubling of the risk requirement. For another viewpoint in this same vein, see Robert P. Charrow & David E. Bernstein, Washington Legal Foundation, Scientific Evidence in the Courtroom: Admissibility and Statistical Significance After Daubert 28-34 (1994), who advocate that there is a mathematically demonstrable relationship between relative risk and the more likely than not standard. They contend that a relative risk of slightly more than 2.0 will rarely, if ever, satisfy the legal causation standard. From a mathematical perspective, the probability of general causation changes as the level of statistical significance changes. Id. at 29-31. A relative risk of 2.2 may be sufficient to show more than a 50% probability at the 0.05 level (5 chances out of 100 that the result occurred by chance), but not at the 0.10 level (10 chances out of 100). With calculations that we do not attempt to set out here, these commentators offer an example in which a relative risk ratio of 2.75 results in a probability of general causation of about 52% with a statistical significance of 0.05, but only about a 43% probability of general causation with a statistical significance of 0.10. Id. at 31-32.

We recognize, as does the federal Reference Manual on Scientific Evidence, that a disease or condition either is or is not caused by exposure to a suspected agent and that frequency data, such as the incidence of adverse effects in the general population when exposed, cannot indicate the actual cause of a given individual’s disease or condition. See Linda A. Bailey et al., Reference Guide on Epidemiology, in Federal Judicial Center, Reference Manual on Scientific Evidence 169 (1994). But the law must balance the need to compensate those who have been injured by the wrongful actions of another with the concept deeply imbedded in our jurisprudence that a defendant cannot be found liable for an injury unless the preponderance of the evidence supports cause in fact. The use of scientifically reliable epidemiological studies and the requirement of more than a doubling of the risk strikes a balance between the needs of our legal system and the limits of science.

C

We do not hold, however, that a relative risk of more than 2.0 is a litmus test or that a single epidemiological test is legally sufficient evidence of causation. Other factors must be considered. As already noted, epidemiological studies only show an association. There may in fact be no causal relationship even if the relative risk is high. For example, studies have found that there is an association between silicone breast implants and reduced rates of breast cancer. This does not necessarily mean that breast implants caused the reduced rate of breast cancer. See David E. Bernstein, The Admissibility of Scientific Evidence After Daubert v. Merrell Dow Pharmaceuticals, Inc., 15 Cardozo L.Rev. 2139, 2167 (1994) (citing H. Berkel et al., Breast Augmentation: A Risk Factor for Breast Cancer?, 326 New Eng. J. Med. 1649 (1992)). Likewise, even if a particular study reports a low relative risk, there may in fact be a causal relationship. The strong consensus among epidemiologists is that conclusions about causation should not be drawn, if at all, until a number of criteria have been considered. One set of criteria widely used by epidemiologists was published by Sir Austin Bradford Hill in 1965.2 Another set of criteria used by epidemiologists in studying disease is the Henle-Koch-Evans Postulates.3 Although epidemiologists do not consider it necessary that all these criteria be met before drawing inferences about causation, they are part of sound methodology generally accepted by the current scientific community.

Sound methodology also requires that the design and execution of epidemiological studies be examined. For example, bias can dramatically affect the scientific reliability of an epidemiological study. See, e.g., Bailey et al., Reference Guide on Epidemiology, in Reference Manual on Scientific Evidence, supra, at 138-43; Thompson, supra, 71 N.C. L.Rev. at 259-61. Bias can result from confounding factors, selection bias, and information bias. Thompson, supra, 71 N.C. L.Rev. at 260. We will not undertake an extended discussion of the many ways in which bias may cause the results of a study to be misleading. We note only that epidemiological studies “are subject to many biases and therefore present formidable problems in design and execution and even greater problems in interpretation.” Marcia Angell, The Interpretation of Epidemiologic Studies, 323 New Eng. J. Med. 823, 824 (1990).

We also note that some of the literature indicates that epidemiologists consider a relative risk of less than three to indicate a weak association. See Thompson, supra, 71 N.C. L.Rev. at 252 (citing Ernest L. Wynder, Guidelines to the Epidemiology of Weak Associations, 16 Preventive Med. 139, 139 (1987)). The executive editor of the New England Journal of Medicine, Marcia Angell, has stated that “[a]s a general rule of thumb, we are looking for a relative risk of three or more [before accepting a paper for publication], particularly if it is biologically implausible or if it’s a brand-new finding.” Gary Taubes, Epidemiology Faces Its Limits, Science, July 14, 1995, at 168. Similarly, Robert Temple, the director of drug evaluation at the FDA, has said that “[m]y basic rule is if the relative risk isn’t at least three or four, forget it.” Id. We hasten to point out that these statements are contained in what is more akin to the popular press, not peer-reviewed scientific journals, and the context of those statements is not altogether clear. We draw no conclusions from any of the foregoing articles other than to point out that there are a number of reasons why reliance on a relative risk of 2.0 as a bright-line boundary would not be in accordance with sound scientific methodology in some cases. Careful exploration and explication of what is reliable scientific methodology in a given context is necessary.

D

A few courts that have embraced the more-than-double-the-risk standard have indicated in dicta that in some instances, epidemiological studies with relative risks of less than 2.0 might suffice if there were other evidence of causation. See, e.g., Daubert, 43 F.3d at 1321 n. 16; Hall, 947 F.Supp. at 1398, 1404. We need not decide in this case whether epidemiological evidence with a relative risk less than 2.0, coupled with other credible and reliable evidence, may be legally sufficient to support causation. We emphasize, however, that evidence of causation from whatever source must be scientifically reliable. Post hoc, speculative testimony will not suffice.

A physician, even a treating physician, or other expert who has seen a skewed data sample, such as one of a few infants who has a birth defect, is not in a position to infer causation. The scientific community would not accept as methodologically sound a “study” by such an expert reporting that the ingestion of a particular drug by the mother caused the birth defect. Similarly, an expert’s assertion that a physical examination confirmed causation should not be accepted at face value. In O’Conner v. Commonwealth Edison Co., 13 F.3d 1090 (7th Cir.1994), a treating physician testified that he knew what radiation-induced cataracts looked like because they are clinically describable and definable and “cannot be mistaken for anything else.” Id. at 1106. Nevertheless, his opinion that exposure to radiation caused the plaintiff’s cataracts was found to be inadmissible because it had no scientific basis. The literature on which the expert relied did not support his assertion that radiation-induced cataracts could be diagnosed by visual examination. Id. at 1106-07. For a good discussion of the evils of “evidence” of this nature, see Bernstein, supra, 15 Cardozo L.Rev. at 2148-49. Further, as we discuss in Part VI(A), an expert cannot dissect a study, picking and choosing data, or “reanalyze” the data to derive a higher relative risk if this process does not comport with sound scientific methodology.

The FDA has promulgated regulations that detail the requirements for clinical investigations of the safety and effectiveness of drugs. 21 C.F.R. § 314.126 (1996). These regulations state that “[i]solated case reports, random experience, and reports lacking the details which permit scientific evaluation will not be considered.” Id. § 314.126(e). Courts should likewise reject such evidence because it is not scientifically reliable. As Bernstein points out, physicians following scientific methodology would not examine a patient or several patients in uncontrolled settings to determine whether a particular drug has favorable effects, nor would they rely on case reports to determine whether a substance is harmful. See Bernstein, supra, 15 Cardozo L.Rev. at 2148-49; see also Rosenberg, supra, 97 Harv. L.Rev. at 870 (arguing that anecdotal or particularized evidence accomplishes no more than a false appearance of direct and actual knowledge of a causal relationship). Expert testimony that is not scientifically reliable cannot be used to shore up epidemiological studies that fail to indicate more than a doubling of the risk.

E

To raise a fact issue on causation and thus to survive legal sufficiency review, a claimant must do more than simply introduce into evidence epidemiological studies that show a substantially elevated risk. A claimant must show that he or she is similar to those in the studies. This would include proof that the injured person was exposed to the same substance, that the exposure or dose levels were comparable to or greater than those in the studies, that the exposure occurred before the onset of injury, and that the timing of the onset of injury was consistent with that experienced by those in the study. See generally Thompson, supra, 71 N.C. L.Rev. at 286-88. Further, if there are other plausible causes of the injury or condition that could be negated, the plaintiff must offer evidence excluding those causes with reasonable certainty. See generally E.I. du Pont de Nemours & Co. v. Robinson, 923 S.W.2d 549, 559 (Tex.1995) (finding that the failure of the expert to rule out other causes of the damage rendered his opinion little more than speculation); Parker v. Employers Mut. Liab. Ins. Co., 440 S.W.2d 43, 47 (Tex.1969) (holding that a cause becomes “probable” only when “in the absence of other reasonable causal explanations it becomes more likely than not that the injury was a result”).

In sum, we emphasize that courts must make a determination of reliability from all the evidence. Courts should allow a party, plaintiff or defendant, to present the best available evidence, assuming it passes muster under Robinson, and only then should a court determine from a totality of the evidence, considering all factors affecting the reliability of particular studies, whether there is legally sufficient evidence to support a judgment.

Finally, we are cognizant that science is constantly reevaluating conclusions and theories and that over time, not only scientific knowledge but scientific methodology in a particular field may evolve. We have strived to make our observations and holdings in light of current, generally accepted scientific methodology. However, courts should not foreclose the possibility that advances in science may require reevaluation of what is “good science” in future cases.

V

Certain conventions are used in conducting scientific studies, and statistics are used to evaluate the reliability of scientific endeavors and to determine what the results tell us. In this opinion, we consider some of the basic concepts currently used in scientific studies and statistical analyses and how those concepts mesh with our legal sufficiency standard of review. For an extended discussion of statistical methodology and its use in epidemiological studies, see DeLuca v. Merrell Dow Pharmaceuticals, Inc., 911 F.2d 941, 945-48 (3d Cir.1990). See also Turpin v. Merrell Dow Pharms., Inc., 959 F.2d 1349, 1353 n. 1 (6th Cir.1992); Bailey et al., Reference Guide on Epidemiology, in Reference Manual on Scientific Evidence, supra, at 138-43, 171-78. We do not attempt to discuss all the multifaceted aspects of the scientific method and statistics, but focus on the principles that shed light on the particular facts and issues in this case.

A

One way to study populations is by a retrospective case-control or case-comparison epidemiological study. For example, this type of study identifies individuals with a disease and a suitable control group of people without the disease and then looks back to examine postulated causes of the disease. See Bailey et al., Reference Guide on Epidemiology, in Reference Manual on Scientific Evidence, supra, at 136-38, 172. Another type of epidemiological study is a cohort study, or incidence study, which is a prospective study that identifies groups and observes them over time to see if one group is more likely to develop disease. Id. at 134-36, 173.

An “odds ratio” can be calculated for a case-control study. Id. at 175. For example, an odds ratio could be used to show the odds that ingestion of a drug is associated with a particular disease. The odds ratio compares the odds of having the disease when exposed to the drug versus when not exposed. If the ratio is 2.67, the odds of developing the disease under study are 2.67 times greater for a person exposed to the drug than for a person not exposed.
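
As a concrete illustration, the following Python sketch computes an odds ratio from a hypothetical two-by-two case-control table. The counts (40 and 60 exposed, 20 and 80 unexposed) are invented solely so that the arithmetic reproduces the 2.67 figure used above; they are not drawn from any study cited in this opinion.

```python
# Hypothetical case-control counts, invented for illustration only:
#                 cases   controls
#   exposed         40        60
#   unexposed       20        80

exposed_cases, exposed_controls = 40, 60
unexposed_cases, unexposed_controls = 20, 80

odds_exposed = exposed_cases / exposed_controls        # odds of disease given exposure
odds_unexposed = unexposed_cases / unexposed_controls  # odds of disease absent exposure

odds_ratio = odds_exposed / odds_unexposed
print(f"odds ratio = {odds_ratio:.2f}")  # (40/60) / (20/80) = 2.67
```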

Similarly, the “relative risk” that a person who took a drug will develop a particular disease can be determined in a cohort study. Id. at 173, 176. The relative risk is calculated by comparing the incidence of disease in the exposed population with the incidence of the disease in the control population. If the relative risk is 1.0, the risk in exposed individuals is the same as in unexposed individuals. If the relative risk is greater than 1.0, the risk in exposed individuals is greater than in those not exposed. If the relative risk is less than 1.0, the risk in exposed individuals is less than in those not exposed. For the result to indicate a doubling of the risk, the relative risk must be greater than 2.0. See id. at 147-48.

Perhaps the most useful measure is the attributable proportion of risk, which is the statistical measure of a factor’s relationship to a disease in the population. It represents the “proportion of the disease among exposed individuals that is associated with the exposure.” Id. at 149. In other words, it reflects the percentage of the disease or injury that could be prevented by eliminating exposure to the substance. For a more detailed discussion of the calculation and use of the attributable proportion of risk, see id. at 149-50; Black & Lilienfeld, supra, 52 Fordham L.Rev. at 760-61. See also Thompson, supra, 71 N.C. L.Rev. at 252-56.
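
A short sketch may help fix the idea. Assuming the standard expression for the attributable proportion of risk among exposed individuals, (RR - 1)/RR, the following Python fragment shows how that proportion varies with the relative risk; the sample values are illustrative only.

```python
# Attributable proportion of risk among the exposed, computed from the
# relative risk under the standard formula (RR - 1) / RR. Illustrative only.

def attributable_proportion(relative_risk):
    """Share of disease among the exposed that is associated with the exposure."""
    return (relative_risk - 1) / relative_risk

for rr in (1.5, 2.0, 3.0):
    print(f"relative risk {rr:.1f}: about {attributable_proportion(rr):.0%} "
          f"of the disease among the exposed is associated with the exposure")
```

A relative risk of exactly 2.0 corresponds to an attributable proportion of one-half, which is why the doubling-of-the-risk threshold tracks the more-likely-than-not standard.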

The numeric value of an odds ratio is at least equal to the relative risk, but the odds ratio often overstates the relative risk, especially if the occurrence of the event is not rare. For an example of the difference between the mathematical calculation of the odds ratio and the relative risk, see Barbara Hazard Munro & Ellis Batten Page, Statistical Methods for Health Care Research 233-35 (2d ed. 1993). In the example given by Munro and Page, the odds ratio was 3.91, while the relative risk was only 3.0 based on the same set of data. See also Bailey et al., Reference Guide on Epidemiology, in Reference Manual on Scientific Evidence, supra, at 149; Thompson, supra, 71 N.C. L.Rev. at 250 n. 22.
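
The point can be seen numerically. The following Python sketch computes both measures from a single set of hypothetical cohort counts (not the Munro and Page data): 30 of 100 exposed persons and 10 of 100 unexposed persons develop the disease.

```python
# Contrast of relative risk and odds ratio on hypothetical cohort counts.

exposed_cases, exposed_total = 30, 100
unexposed_cases, unexposed_total = 10, 100

risk_exposed = exposed_cases / exposed_total          # 0.30
risk_unexposed = unexposed_cases / unexposed_total    # 0.10
relative_risk = risk_exposed / risk_unexposed         # 3.0

odds_exposed = exposed_cases / (exposed_total - exposed_cases)           # 30/70
odds_unexposed = unexposed_cases / (unexposed_total - unexposed_cases)   # 10/90
odds_ratio = odds_exposed / odds_unexposed            # approximately 3.86

# Because the disease here is common among the exposed (30%), the odds ratio
# overstates the relative risk; the two converge only when the disease is rare.
print(f"relative risk = {relative_risk:.2f}, odds ratio = {odds_ratio:.2f}")
```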

The relative risk may be expressed algebraically as:

RR = Ie / Ic

where RR is the relative risk, Ie is the incidence of the disease in the exposed population, and Ic is the incidence of disease in the control population. A sample calculation is as follows:

• the incidence of the disease in exposed individuals (Ie) is 30 cases per 100 persons, or 0.3
• the incidence of the disease in the unexposed individuals (Ic) is 10 cases per 100 persons, or 0.1
• the relative risk is the incidence in the exposed group (0.3) divided by the incidence in the unexposed group (0.1), which equals 3.0

Using this hypothetical, can we conclude that people who are exposed are three times more likely to contract disease than those who are not? Not necessarily. The result in any given study or comparison may not be representative of the entire population. The result may have occurred by chance. The discipline of statistics provides means of telling us how significant the results of a study may be.

B

The first step in understanding significance testing is to understand how research is often conducted. A researcher formulates a hypothesis and then tests whether the data support it. The starting point is the null hypothesis, which assumes that there is no difference or no effect. If you were studying the effects of Bendectin, for example, the null hypothesis would be that it has no effect. The researcher tries to find evidence against the hypothesis. See David S. Moore & George P. McCabe, Introduction to the Practice of Statistics 449 (2d ed. 1993); Munro & Page, supra, at 54. The statement that the researcher suspects may be true is stated as the alternative hypothesis. If a significant difference is found, the null hypothesis is rejected. If a significant difference is not found, the null hypothesis is accepted. Munro & Page, supra, at 54. This concept is important because it is the basis of the statistical test. Id.

A study may contain error in deciding to reject or accept a hypothesis, and this error can be one of two types. Id.; Moore & McCabe, supra, at 482-87. A Type I error occurs when the null hypothesis is true but has been rejected, and a Type II error occurs when the null hypothesis is false but has been accepted. Munro & Page, supra, at 55. An example of the two types of error given by Munro and Page is a comparison of two groups of people who have been taught statistics by different methods. Id. Group A scored significantly higher than Group B on a test of their knowledge of statistics. The null hypothesis is that there is no difference between the teaching methods, but because the study indicated there was a difference, the null hypothesis was rejected. Suppose, however, that Group A was composed of people with higher math ability and that in actuality the teaching method did not matter at all. The rejection of the null hypothesis is a Type I error. Id.

The probability of making a Type I error can be decreased by changing the level of significance, that is, the probability that the results occurred by chance. Id. If the level of significance is five in one hundred (0.05), there is only a five in one hundred chance that the result occurred by chance alone. If the level of significance is one in one hundred (0.01), there is only a one in one hundred chance that the result occurred by chance alone. However, as the significance level is made more stringent (e.g., from 0.05 to 0.01), it will be more difficult to find a significant result. Id. Altering the significance level in this manner also increases the risk of a Type II error, which is accepting a false null hypothesis. Id. To avoid Type II errors, the significance level can be made less stringent, for example, ten in one hundred (0.1). Id.
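
The meaning of a significance level can also be illustrated by simulation. The sketch below (written in Python; the sample sizes and rates are hypothetical and chosen only for illustration) repeatedly compares two groups that share the same true rate, so the null hypothesis is true by construction. At a 0.05 significance level, roughly five percent of the simulated comparisons nonetheless reject the null hypothesis, which is the Type I error rate described above; tightening the level to 0.01 would reduce that figure.

```python
import math
import random

def two_proportion_p_value(x1, n1, x2, n2):
    """Two-sided p-value testing whether two groups share one underlying rate."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = (p1 - p2) / se
    # two-sided tail probability from the standard normal, via the error function
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

random.seed(0)
group_size, true_rate, trials = 500, 0.10, 5_000
false_rejections = 0
for _ in range(trials):
    # Both simulated groups share the same true rate, so the null hypothesis holds.
    x1 = sum(random.random() < true_rate for _ in range(group_size))
    x2 = sum(random.random() < true_rate for _ in range(group_size))
    if two_proportion_p_value(x1, group_size, x2, group_size) < 0.05:
        false_rejections += 1

# Roughly five percent of the comparisons reject a true null hypothesis:
# the Type I error rate corresponding to the 0.05 significance level.
print(false_rejections / trials)
```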

Different levels of significance may be appropriate for different types of studies depending on how much risk one is willing to accept that the conclusion reached is wrong. Again, to take examples offered by Munro and Page, assume that a test for a particular genetic defect exists and that if the defect is diagnosed at an early stage, a child with the defect can be successfully treated. If the genetic defect is not diagnosed in time, the child’s development will be severely impaired. If a child is mistakenly diagnosed as having the defect and treated, there are no harmful effects. Most would agree that it would be preferable to make a Type I error rather than a Type II error under these circumstances. Id. A Type II error would be failing to diagnose a child who had the genetic defect.

Contrast that hypothetical with one in which a federal study is conducted to determine whether a particular method of teaching underprivileged children increases their success in school. Id. The cost of implementing this teaching method in a nationwide program would be very great. A Type I error would be to conclude that the program had an effect when it did not. Id. The significance level for this project would probably be more stringent than the one used to screen for genetic defects in the other hypothetical. In the genetic defects example, it is preferable to treat children even if they may not have the disease, but in the teaching method example, it is not preferable to implement the teaching method at considerable cost if it has no effect.

A confidence level can be used in epidemiological studies to establish the boundaries of the relative risk. These boundaries are known as the confidence interval. See id. at 59-63; see also David H. Kaye & David A. Freedman, Reference Guide on Statistics, in Reference Manual on Scientific Evidence, supra, at 376-77, 396; Moore & McCabe, supra, at 432-37. The confidence interval tells us if the results of a given study are statistically significant at a particular confidence level. See Moore & McCabe, supra, at 432-33. A confidence interval shows a “range of values within which the results of a study sample would be likely to fall if the study were repeated numerous times.” Bailey et al., Reference Guide on Epidemiology, in Reference Manual On Scientific Evidence, supra, at 173. If, based on a confidence level of 95%, a study showed a relative risk of 2.3 and had a confidence interval of 1.3 to 3.8, we would say that, if the study were repeated, it would produce a relative risk between 1.3 and 3.8 in 95% of the repetitions. However, if the interval includes the number 1.0, the study is not statistically significant or, said another way, is inconclusive. This is because the confidence interval includes relative risk values that are both less than and greater than the null hypothesis (1.0), leaving the researcher with results that suggest both that the null hypothesis should be accepted and that it should be rejected. See, e.g., Turpin, 959 F.2d at 1353 n. 1; Brock v. Merrell Dow Pharms., Inc., 874 F.2d 307, 312 (5th Cir.), as modified on reh’g, 884 F.2d 166 (5th Cir.1989); Bailey et al., Reference Guide on Epidemiology, in Reference Manual on Scientific Evidence, supra, at 173. This concept was explained to the jury in this case by Dr. Glasser, one of the Havners’ witnesses. Thus, a study may produce a relative risk of 2.3, meaning the risk is 2.3 times greater based on the data, but at a confidence level of 95%, the confidence interval has boundaries of 0.8 and 3.2. The results are therefore insignificant at the 95% level. If the researcher is willing to accept a greater risk of error and lowers the confidence level to 90%, the results may be statistically significant at that lower level because the range does not include the number 1.0. See generally Bailey et al., Reference Guide on Epidemiology, in Reference Manual on Scientific Evidence, supra, at 151-55. “[T]he narrower the confidence interval, the greater the confidence in the relative risk estimate found in the study.” Id. at 173.
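
A rough numerical sketch may also help. The calculation below uses one common textbook approximation for a relative-risk confidence interval (the log method); the counts are hypothetical and are not taken from any Bendectin study. It shows how the same data can yield a 95% interval that includes 1.0, and is therefore inconclusive, while the narrower 90% interval excludes 1.0.

```python
import math

def rr_confidence_interval(cases_exposed, n_exposed, cases_unexposed, n_unexposed, z):
    """Approximate confidence interval for a relative risk via the common log method.

    z is the critical value: about 1.96 for a 95% interval, about 1.645 for 90%.
    """
    rr = (cases_exposed / n_exposed) / (cases_unexposed / n_unexposed)
    se_log_rr = math.sqrt(
        1 / cases_exposed - 1 / n_exposed + 1 / cases_unexposed - 1 / n_unexposed
    )
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper

# Hypothetical counts chosen only for illustration: 20 affected births among
# 1,000 exposed pregnancies versus 10 among 1,000 unexposed pregnancies.
for label, z in (("95%", 1.96), ("90%", 1.645)):
    rr, lo, hi = rr_confidence_interval(20, 1000, 10, 1000, z)
    print(f"{label}: RR = {rr:.2f}, CI = ({lo:.2f}, {hi:.2f})")
# 95%: RR = 2.00, CI = (0.94, 4.25)  (includes 1.0, so inconclusive at that level)
# 90%: RR = 2.00, CI = (1.06, 3.77)  (excludes 1.0 at the lower confidence level)
```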

C

The generally accepted significance level or confidence level in epidemiological studies is 95%, meaning that if the study were repeated numerous times, the confidence interval would indicate the range of relative risk values that would result 95% of the time. See DeLuca v. Merrell Dow Pharms., Inc., 791 F.Supp. 1042, 1046 (D.N.J.1992), aff'd, 6 F.3d 778 (3d Cir.1993); Bailey et al., Reference Guide on Epidemiology, in Reference Manual on Scientific Evidence, supra, at 153; Dore, A Proposed Standard, supra note 3, 28 How. L.J. at 693; Thompson, supra, 71 N.C. L.Rev. at 256. Virtually all the published, peer-reviewed studies on Bendectin have a confidence level of at least 95%. Although one of the Havners’ witnesses, Dr. Swan, advocated the use of a 90% confidence level (10 in 100 chance of error), she and others of the Havners’ witnesses conceded that 95% is the generally accepted level.

Another of the Havners’ witnesses, Dr. Glasser, explained that in any scientific application, the confidence interval is kept very high. He testified that you “don’t ever see [confidence intervals of 50% or 60%] in a scientific study because that means we’re going to miss it a lot of times and [scientists] are not willing to take that risk.” One commentator advocates that the confidence level for admissibility of epidemiological studies should be higher than the generally accepted 95% and should be 99%. See Dore, A Proposed Standard, supra note 3, 28 How. L.J. at 693-95. But cf. DeLuca, 911 F.2d at 948 (discussing statistics expert Kenneth Rothman’s view that the predominant choice of a 95% confidence level is an arbitrarily selected convention of his discipline); Longmore v. Merrell Dow Pharms., Inc., 737 F.Supp. 1117, 1119-20 (D.Idaho 1990) (concluding that the scientific standard for determining causation is much stricter than the standard employed by the court and that confidence levels of 95%, 90%, or even 80% should not be required).

We think it unwise to depart from the methodology that is at present generally accepted among epidemiologists. See generally Bert Black, The Supreme Court’s View of Science: Has Daubert Exorcised the Certainty Demon?, 15 Cardozo L.Rev. 2129, 2135 (1994) (stating that “[a]lmost all thoughtful scientists would agree ... that [a significance level of five percent] is a reasonable general standard” (quoting Amicus Curiae Brief of Professor Alvan R. Feinstein in Support of Respondent at 16, Daubert v. Merrell Dow Pharms., Inc., 509 U.S. 579, 113 S.Ct. 2786, 125 L.Ed.2d 469 (1993) (No. 92-102))). Accordingly, we should not widen the boundaries at which courts will acknowledge a statistically significant association beyond the 95% level to 90% or lower values.

It must be reiterated that even if a statistically significant association is found, that association does not equate to causation. Although there may appear to be an increased risk associated with an activity or condition, this does not mean the relationship is causal. As the original panel of the court of appeals observed in this case, there is a demonstrable association between summertime and death by drowning, but summertime does not cause drowning. 907 S.W.2d at 544 n. 8.

There are many other factors to consider in evaluating the reliability of a scientific study including, but certainly not limited to, the sample size of the study, the power of the study, confounding variables, and whether there was selection bias. These factors are not central to a resolution of this appeal, and we do no more than acknowledge that determining scientific reliability can have many facets.

VI

Armed with some of the basic principles employed by the scientific community in conducting studies, we turn to an examination of the evidence in this case measured against the Robinson factors. See E.I. du Pont de Nemours & Co. v. Robinson, 923 S.W.2d 549, 557 (Tex.1995). The evidence relied upon by the Havners’ experts falls into four categories: (1) epidemiological studies; (2) in vivo animal studies; (3) in vitro animal studies; and (4) a chemical structure analysis of doxylamine succinate, the antihistamine component of Bendectin. We consider each in turn.

A

Dr. J. Howard Glasser, an associate professor at the University of Texas School of Public Health at the Texas Medical Center in Houston, is an epidemiologist with a Ph.D. in experimental statistics and a Master of Science in Biostatistics. He gave the jury an overview of statistics. As noted earlier, he explained that statistics are used to determine if there is a significant association between two events or occurrences, but cautioned that a statistical association is not the same thing as causation.

Glasser identified a number of epidemiological studies from which he concluded that it was more likely than not that there is an association between Bendectin and birth defects, even though the authors of those studies did not find such an association. One study was done by Cordero and had a relative risk of 1.18 and a confidence interval of 0.65 to 2.13. However, the relative risk would need to exceed 2.0, and the confidence interval could not include 1.0, for the results to indicate more than a doubling of the risk and a statistically significant association between Bendectin and limb reduction birth defects. See supra Part V; see also Daubert v. Merrell Dow Pharms., Inc., 43 F.3d 1311, 1320 (9th Cir.) (on remand) (noting that more likely than not standard requires, in terms of statistical proof, a more than doubling of the risk), cert. denied, — U.S. -, 116 S.Ct. 189, 133 L.Ed.2d 126 (1995). None of the other studies identified by Glasser showed a doubling of the risk. The McCredie study had a relative risk of 1.1 and a confidence interval of 0.8 to 1.5. The data in the Eskenazi study that considered limb reduction birth defects resulted in a relative risk of 4.18, but the confidence interval was 0.48 to 36.3, a very large interval that included 1.0. Dr. Glasser agreed that results with a confidence interval that included 1.0 or a lower number would be inconclusive and statistically insignificant.
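
For illustration only, the two-part screen just described (a relative risk greater than 2.0 and a confidence interval that excludes 1.0) can be applied mechanically to the figures recited above. The short sketch below does no more than restate that comparison; it is not an analysis of the underlying studies.

```python
# Figures are those recited in this opinion for the studies Dr. Glasser identified:
# (relative risk, lower bound of confidence interval, upper bound).
studies = {
    "Cordero":  (1.18, 0.65, 2.13),
    "McCredie": (1.10, 0.80, 1.50),
    "Eskenazi": (4.18, 0.48, 36.3),
}

for name, (rr, lower, upper) in studies.items():
    # Probative under the screen only if RR exceeds 2.0 and the interval excludes 1.0.
    meets_screen = rr > 2.0 and lower > 1.0
    print(f"{name:9s} RR = {rr:<5} CI = ({lower}, {upper}) -> "
          f"{'meets both criteria' if meets_screen else 'insufficient'}")
```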

Dr. Glasser did, however, reanalyze some data, called the Jick data, that had been included in a report to the FDA. Glasser isolated information on women who had filled two or more prescriptions of Bendectin and who were not exposed to spermicide, which resulted in a relative risk of 13.0 for limb reduction birth defects. However, the confidence level he used was 90%. Further, there is no testimony or other evidence regarding the confidence interval. The confidence interval may or may not have contained 1.0.

The Havners also point to a memorandum prepared within the FDA that was identified by Dr. Glasser. The document indicates that the relative risk of limb defects when Bendectin is given within the first three lunar months of pregnancy is 2.13. The only conclusion drawn by Dr. Glasser from this memorandum is that, taken in conjunction with the other articles he had discussed, there is an “importance of time” and an “importance of exposure with the highest relative risk coming when the exposure period one to three lunar months is counted.” The memo itself was not introduced into evidence, and there is no evidence of the confidence level at which the relative risk of 2.13 was found or of the confidence interval. The confidence interval may or may not have contained 1.0.

Finally, Glasser testified about published studies on Bendectin that did show statistically significant results, but they dealt with birth defects other than limb reduction defects. These studies cannot, of course, support a finding that Bendectin causes limb reduction defects. Further, later studies of these other types of birth defects did not bear out an association with Bendectin.

The other expert witness for the Havners who testified about epidemiological studies was Dr. Shanna Swan. She has a doctorate in statistics and is the Chief of the Reproductive Epidemiological Program for the state of California. She also teaches epidemiology at the University of California at Berkeley.

Dr. Swan conceded that none of the published epidemiological studies found an association between Bendectin and limb reduction defects. She identified a number of these studies and confirmed that the confidence intervals in each of them included 1.0. However, Dr. Swan testified about these studies at some length and criticized the methodology. Then, relying on these same studies, she opined that Bendectin more probably than not is associated with limb reduction birth defects. Swan considered the findings of these studies in the aggregate and testified that the results fall along a curve in which the “weight of the curve” was in the direction of an increased risk. Yet, she also said that these studies were consistent with a relative risk that was between 0.7 and 1.8. That is not a doubling of the risk. It may support her opinion that it is more probable than not that there is an association between Bendectin and limb reduction defects, but the magnitude of the association she gleaned from these studies is not more than 2.0, based on her own testimony.

Dr. Swan also performed a reanalysis of data from at least two studies. One reanalysis was of raw unpublished data underlying the Jick study of limb reduction birth defects, the same data about which Dr. Glasser testified. Dr. Swan derived a relative risk estimate of 2.2 for women exposed to Bendectin during the first trimester. She also testified that the relative risk for women who were exposed to Bendectin but not exposed to spermicide was 8.8 and finally, that if women who were exposed to two or more Bendectin prescriptions were considered, without regard to exposure to spermicide, the relative risk was 13 with a confidence interval from 3 to 53. She did not reveal the confidence level used in obtaining these results, and there is no evidence of the confidence level in the record.

The other reanalysis by Dr. Swan was of data in the Cordero study, which was based on information collected by the Center for Disease Control in Atlanta. An abstract she prepared regarding this data was published in the Journal for the Society of Epidemiological Research in 1983 or 1984 and states that the original Cordero study found the odds ratio for limb reduction birth defects to be 1.2. Swan concluded, however, that when a different control group is selected, the relative risk estimates are affected. Swan’s abstract stated that, “under certain assumptions,” which are not identified, “the odds ratio for limb reduction defects” is “a highly significant” 2.8. There is no explanation in the abstract or in Dr. Swan’s testimony of the significance level used to obtain the 2.8 result. The result may well be statistically inconclusive at a 95% confidence level. We simply do not know from this record. Without knowing the significance level or the confidence interval, there is no scientifically reliable basis for saying that the 2.8 result is an indication of anything. Further, her choice of the control group could have skewed the results. Although her abstract does not identify what control group she used, Swan testified at trial that she chose births of Downs Syndrome babies. Swan’s reanalysis using Downs Syndrome babies as the control group was considered in Lynch and in Richardson-Merrell, and those courts likewise found it insufficient. See Lynch v. Merrell-National Labs., 830 F.2d 1190, 1195 (1st Cir.1987); Richardson v. Richardson-Merrell, Inc., 649 F.Supp. 799, 802 n. 10 (D.D.C.1986), aff'd, 857 F.2d 823 (D.C.Cir.1988).

In addition to the statistical shortcomings of the Havners’ epidemiological evidence, another strike against its reliability is that it has never been published or otherwise subjected to peer review, with the exception of Dr. Swan’s abstract, which she acknowledges is not the equivalent of a published paper. Dr. Swan has published a number of papers in scientific journals, including a study that concluded Bendectin is not associated with cardiac birth defects. Although she has been testifying in Bendectin limb reduction birth defect cases for many years, Dr. Swan has never attempted to publish her opinions or conclusions about Bendectin and limb reduction defects. Similarly, studies by Dr. Glasser have been published in refereed journals, but none of his 32 to 33 publications mentions Bendectin or limb reduction birth defects.

As already discussed, there are over thirty published, peer-reviewed epidemiological studies on the relationship between Bendectin and birth defects. None of the findings offered by the Havners’ five experts in this case have been published, studied, or replicated by the relevant scientific community. As Judge Kozinski has said, “the only review the plaintiffs’ experts’ work has received has been by judges and juries, and the only place their theories and studies have been published is in the pages of federal and state reporters.” Daubert, 43 F.3d at 1318 (commenting on the same five witnesses called by the Havners). A related factor that should be considered is whether the study was prepared only for litigation. Has the study been used or relied upon outside the courtroom? Is the methodology recognized in the scientific community? Has the litigation spawned its own “community” that is not part of the purely scientific community? The opinions to which the Havners’ witnesses testified have never been offered outside the confines of a courthouse.

Publication and other peer review are significant indicia of the reliability of scientific evidence when the expert’s testimony is in an area in which peer review or publication would not be uncommon. Publication in reputable, established scientific journals and other forms of peer review “increases the likelihood that substantive flaws in methodology will be detected.” Daubert v. Merrell Dow Pharms., Inc., 509 U.S. 579, 593, 113 S.Ct. 2786, 2797, 125 L.Ed.2d 469 (1993). One legal commentator has suggested that the ultimate test of the integrity of an expert witness in the scientific arena is “her readiness to publish and be damned.” Daubert, 43 F.3d at 1318 (quoting Peter W. Huber, Galileo’s Revenge: Junk Science in the Courtroom 209 (1991)). Further, “the examination of a scientific study by a cadre of lawyers is not the same as its examination by others trained in the field of science or medicine.” Richardson v. Richardson-Merrell, Inc., 857 F.2d 823, 831 n. 55 (D.C.Cir.1988) (quoting Perry v. United States, 755 F.2d 888, 892 (11th Cir.1985)).

We do not hold that publication is a prerequisite for scientific reliability in every case, but courts must be “especially skeptical” of scientific evidence that has not been published or subjected to peer review. Brock v. Merrell Dow Pharms., Inc., 874 F.2d 307, 313 (5th Cir.), as modified on reh’g, 884 F.2d 166 (5th Cir.1989); see also Bert Black et al., Science and the Law in the Wake of Daubert: A New Search for Scientific Knowledge, 72 Tex. L.Rev. 715, 778 (1994). Publication and peer review allow an opportunity for the relevant scientific community to comment on findings and conclusions and to attempt to replicate the reported results using different populations and different study designs.

The need for the replication of results was acknowledged by the Havners’ witnesses. Moreover, it must be borne in mind that the discipline of epidemiology studies associations, not “causation” per se. Particularly where, as here, direct experimentation has not been conducted, it is important that any conclusions about causation be reached only after an association is observed in studies among different groups and that the association continues to hold when the effects of other variables are taken into account. See, e.g., Moore & McCabe, supra, at 202.

As we have already observed, an isolated study finding a statistically significant association between Bendectin and limb reduction defects would not be legally sufficient evidence of causation. The Havners’ witnesses conceded that when a number of studies have been done, it would not be good practice to pick out one to support a conclusion. As the federal Reference Manual on Scientific Evidence points out, “[m]ost researchers are conservative when it comes to assessing causal relationships, often calling for stronger evidence and more research before a conclusion of causation is drawn.” Bailey et al., Reference Guide on Epidemiology, in Reference Manual on Scientific Evidence, supra, at 157. For example, Dr. Swan explained that initially, some studies showed a statistically significant association between Bendectin and the birth defect pyloric stenosis. However, subsequent, much larger studies did not bear out that association, and in fact, Swan herself has published studies that failed to find an association between Bendectin and this type of birth defect.

Accordingly, if scientific methodology is followed, a single study would not be viewed as indicating that it is “more probable than not” that an association exists. See, e.g., Richardson v. Richardson-Merrell, Inc., 649 F.Supp. 799, 802 n. 10 (D.D.C.1986) (noting that no single study would be sufficient to exonerate or to implicate Bendectin with certainty and that studies become “conclusive” only in the aggregate), aff'd, 857 F.2d 823 (D.C.Cir.1988). In affirming the district court in Richardson-Merrell, the District of Columbia Circuit recognized that the plaintiffs’ expert had recalculated epidemiological data and had obtained a statistically significant result. See Richardson, 857 F.2d at 831. The court nevertheless held this was not evidence that would support a verdict. Id. Courts should not embrace inferences that good science would not draw. But cf. Lynch, 830 F.2d at 1194 (asserting that a new study coming to a different conclusion and challenging the consensus would be admissible).

The argument is sometimes made that waiting until an association found in one study is confirmed by others will mean that early claimants will be denied a recovery. See, e.g., Green, supra, 86 Nw. U.L.Rev. at 680-81; Wendy E. Wagner, Trans-Science in Torts, 96 Yale L.J. 428, 428-29 (1986). A related argument is that history tells us that the scientific community has been slow at times to accept valid research and its results. While these observations are true, history also tells us that valid and reliable research and theories are generally accepted quickly within the scientific community when sufficient explanation is provided and empirical data are adequate. See Black et al., supra, 72 Tex. L. Rev. at 779-82 (discussing Galileo, Pasteur, DNA, and continental drift).

Others have argued that liability should not be allocated only on the basis of reliable proof of fault because legal rules should have the goals of “risk spreading, deterrence, allocating costs to the cheapest cost-avoider, and encouraging socially favored activities,” and because “ ‘consumers of American justice want people compensated.’ ” Rochelle Cooper Dreyfuss, Is Science a Special Case? The Admissibility of Scientific Evidence After Daubert v. Merrell Dow, 73 Tex. L.Rev. 1779, 1795-96 (1995) (quoting Kenneth R. Feinberg, Civil Litigation in the Twenty-First Century: A Panel Discussion, 59 Brook. L.Rev. 1199, 1206 (1993)). It has been contended that “[f]or some cases that very well may mean creating a compensatory mechanism even in the absence of clear scientific proof of cause and effect” and that “[d]eferring to scientific judgments about fault only obscures the core policy questions that are addressed by the laws that the court is applying.” Id. We expressly reject these views. Our legal system requires that claimants prove their cases by a preponderance of the evidence. In keeping with this sound proposition at the heart of our jurisprudence, the law should not be hasty to impose liability when scientifically reliable evidence is unavailable. As Judge Posner has said, “[l]aw lags science; it does not lead it.” Rosen v. Ciba-Geigy Corp., 78 F.3d 316, 319 (7th Cir.), cert. denied, — U.S. -, 117 S.Ct. 73, 136 L.Ed.2d 33 (1996).

B

The Havners relied on in vivo animal studies to support the conclusion that Bendectin causes limb reduction birth defects in humans. This evidence was presented by Dr. Adrian Gross, a veterinarian and a veterinary pathologist who had worked at the FDA from 1964 to 1979, served as the Chief of the Toxicology Branch at the Environmental Protection Agency from 1979 to 1980, and thereafter was a Senior Science Advisor at the EPA. Dr. Gross confirmed that the FDA and EPA consider animal studies in assessing the potential human response to drugs or pesticides. He testified that what will affect an animal is likely to affect humans in the same way and that the only reason animal studies are done is to predict if the drug at issue will have an adverse effect on humans.

Dr. Gross reviewed a number of animal studies that had been conducted on Bendectin. He described studies on rabbits exposed to Bendectin in which he saw “a lot of malformed kits.” Gross testified about another study of rabbits that he found statistically significant. He opined that the probability that the malformations in this study occurred by chance was six in 10,000. With respect to another animal study on rabbits, he stated that the probability that the drug was harmless was less than one per 1,000,000. He listed studies on monkeys, rats, and mice showing “highly significant deleterious harmful effects as far as birth defects are concerned.” Based on these animal studies, Dr. Gross was of the opinion that Bendectin was teratogenic in humans, which means that it causes birth defects. However, he conceded that the dosage level at which Bendectin became associated with birth defects in rats was 100 milligrams per kilogram per day, which would be the equivalent of a daily dosage of 1200 tablets for a woman weighing 132 pounds.

The Havners assert in their briefing before this Court that the accepted technique for determining if a substance is a teratogen in humans is to look at all information, including epidemiological data, animal data, biological plausibility, and in vitro studies. Dr. Swan confirmed that these are the relevant sources of information in determining teratogenicity. See also Brent, Comment on Comments on “Teratogen Update: Bendectin,” Teratology 31:429-30 (1985) (stating the process for determining if a substance is a teratogen: (1) consistent, reproducible findings in human epidemiological studies; (2) development of an animal model; (3) embryo toxicity that is dose related; and (4) consistency with basic, recognized concepts of embryology and fetal development). Thus, scientific methodology would not rely on animal studies, standing alone, as conclusive evidence that a substance is a teratogen in humans. See Raynor v. Merrell Pharms., Inc., 104 F.3d 1371, 1375 (D.C.Cir.1997) (noting that the only way to test whether data from nonhuman studies can be extrapolated to humans would be to conduct human experiments or to use epidemiological data); Elkins v. Richardson-Merrell, Inc., 8 F.3d 1068, 1071 (6th Cir.1993) (holding that expert opinion indicating a basis of support in animal studies is admissible but is simply inadequate to permit a jury to conclude that Bendectin more probably than not causes limb defects); Lynch, 830 F.2d at 1194 (asserting that in vivo and in vitro animal studies singly or in combination do not have the capability of proving causation in human beings in the absence of any confirming epidemiological data); see also Brock, 874 F.2d at 313 (recognizing that animal studies are of very limited usefulness when confronted with questions of toxicity); Allen v. Pennsylvania Eng’g Corp., 102 F.3d 194, 197 (5th Cir.1996) (quoting and following Brock in toxic tort case).

We further note that with respect to the in vivo studies about which Dr. Gross testified, their reliability as predictors of the effect of Bendectin in humans is questionable because of the dosage levels. Dr. Gross offered no explanation of how the very high dosages could be extrapolated to humans. Other courts have rejected animal studies that relied on high dosage levels as evidence of causation in humans. See, e.g., Turpin v. Merrell Dow Pharms., Inc., 959 F.2d 1349 (6th Cir.1992) (reasoning that to eliminate drugs toxic to embryos at high dosage levels would eliminate most drugs and many useful chemicals on which modern society depends heavily) (citing James Wilson, Current Status of Teratology, in Handbook of Teratology 60 (1977)). Gross also failed to explain why the published studies from which he extracted his data had concluded Bendectin was not harmful.

The in vivo studies identified in this case cannot support the jury’s verdict.

C

Dr. Stuart Allen Newman also relied on animal studies to support his opinion that Bendectin is a teratogen in humans. Dr. Newman holds a doctorate in chemical physics and is a professor at New York Medical College. He has published over fifty articles, although none contain the opinions or conclusions to which he testified in this case.

The studies Newman reviewed were in vitro studies, which are based on tests conducted on cells in a test tube or petri dish. Doxylamine succinate was placed directly on the limb bud cells of animals including chickens and mice. The development of cartilage was affected. Newman acknowledged that in these studies, the researchers who had conducted them concluded only that doxylamine succinate was potentially capable of inducing genetic damage and that it should be tested on other systems. But Newman testified that if you find an effect that prevails across a number of different species, “you can be awfully sure that the same thing will prevail in humans.”

Newman opined that Kelly Havner’s defect was due to loss of portions of the skeleton that could with scientific certainty have been caused by a teratogen that affected the embryo. Similarly, he testified that the findings of one study, the Hassell/Horigan Study, indicated to him that doxylamine succinate can interfere with chondrogenesis, which is the process of certain cells turning into cartilage. We note that testimony to the effect that a substance “could” or “can” cause a disease or disorder is not evidence that in reasonable probability it does. See, e.g., Parker v. Employers Mut. Liab. Ins. Co., 440 S.W.2d 43, 47 (Tex.1969); Bowles v. Bourdon, 148 Tex. 1, 219 S.W.2d 779, 785 (1949). Newman testified, however, that based on the Hassell/Horigan and other animal studies, he concluded with a reasonable degree of medical certainty that doxylamine succinate is a teratogen for cartilage development and that doxylamine succinate is a teratogen in humans. He also testified that he had reviewed the records surrounding Marilyn Havner’s pregnancy and that to a reasonable certainty, she was not exposed to any teratogen other than Bendectin.

The in vitro studies are similar to the cell biology data at issue in Allen v. Pennsylvania Engineering, 102 F.3d at 198. The fact that Bendectin may have an adverse effect on limb bud cells is “the beginning, not the end of the scientific inquiry and proves nothing about causation without other scientific evidence.” Id.; see also Richardson, 857 F.2d at 830 (“Positive results from in vitro studies may provide a clue signaling the need for further research, but alone do not provide a satisfactory basis for opining about causation in the human context.”); Bailey et al., Reference Guide on Epidemiology, in Reference Manual on Scientific Evidence, supra, at 130-31 (noting that the problem with in vitro studies is extrapolating the findings “from tissues in laboratories to whole human beings”).

Logical support for Dr. Newman’s opinions was also lacking. A number of substances, such as vitamin C, have been shown to damage animal cells when placed directly on tissue. Dr. Newman offered no explanation of how he made the logical leap from the in vitro studies on animal tissue to his conclusion that Bendectin causes birth defects in humans. Dr. Newman’s testimony is not evidence of causation.

D

Of the five witnesses who testified on the question of causation, the only witness who opined that Bendectin was the cause of Kelly Havner’s birth defect, as opposed to birth defects in general, was Dr. John Davis Palmer. Dr. Palmer is a licensed medical doctor and holds a doctorate in pharmacology. He is a professor at the University of Arizona College of Medicine and the acting head of its Pharmacology Department. His opinion was based in part on the testimony of the Havners’ other witnesses.

Dr. Palmer testified that there is a critical period during gestation when the limbs of a fetus are forming. Marilyn Havner took Bendectin somewhere between the 32nd and 42nd day of gestation, depending on how the date of conception is calculated, which was within the period for the development of Kelly Havner’s hand and arm. Palmer explained that the molecular structure of doxylamine succinate, one of the two components of Bendectin, permits it to cross the placenta from the mother’s body and reach the fetus. Based on this fact and on in vitro animal studies, intact animal studies, and epidemiological information, he concluded that doxylamine succinate is a teratogen in humans. Relying on this same information and on information concerning Kelly Havner, including the date her mother ingested Bendectin, Dr. Palmer concluded that to a reasonable degree of medical certainty, Bendectin caused the birth defect seen in Kelly Havner’s hand.

However, Dr. Palmer’s testimony is based on epidemiological studies that conclude just the opposite. To the extent that he relied on the opinions of Drs. Swan, Glasser, Newman, or Gross, there is no scientifically reliable evidence to support their opinions, as we have seen. Palmer identified no other study or body of knowledge that would support his opinion, other than the chemical structure of doxylamine succinate and a study done on antihistamines, not Bendectin. The Sixth Circuit captured the essence of Dr. Palmer’s testimony when it said, “no understandable scientific basis is stated. Personal opinion, not science, is testifying here.” Turpin, 959 F.2d at 1360. That court further observed that Dr. Palmer’s conclusions so overstated their predicate that it could not legitimately form the basis for a jury verdict. Id. We agree with that observation based on the record in this case.

* * * * * *

There is no scientifically reliable evidence to support the verdict in this case. Accordingly, we reverse the judgment of the court of appeals in part and render judgment for Merrell Dow.

BAKER, J., not sitting.

. Rule 702 provides:

If scientific, technical, or other specialized knowledge will assist the trier of fact to understand the evidence or to determine a fact in issue, a witness qualified as an expert by knowledge, skill, experience, training, or education, may testify in the form of an opinion or otherwise.

Tex.R. Civ. Evid. 702.

. The Bradford Hill criteria are summarized as follows:

1. Strength of association. "First upon my list I would put the strength of association. To take a very old example, by comparing the occupations of patients with scrotal cancer with the occupations of patients presenting with other diseases, Percival Pott could reach the correct conclusion because of the enormous increase of scrotal cancer in the chimney sweeps.”
2. Consistency. "Next on my list of features to be specifically considered I would place the consistency of association. Has it been repeatedly observed by different persons, in different places, circumstances and times?”
3. Specificity. "If ... the association is limited to specific workers and to particular sites and types of disease and there is no association between the work and other modes of dying, then clearly that is a strong argument in favor of causation.”
4. Temporality. "Which is the cart and which the horse?"
5. Biological gradient. "Fifthly, if the association is one which can reveal a biological gradient, or dose-response curve, then we should look most carefully for such evidence.... The clear dose-response curve admits of a simple explanation and obviously puts the case in a clearer light."
6. Plausibility. "It would be helpful if the causation we suspect is biologically plausible. But this is a feature I am convinced we cannot demand. What is biologically plausible depends on the biological knowledge of the day.”
7. Coherence. "The cause-and-effect interpretation of our data should not seriously conflict with the generally known facts of the natural history and biology of the disease."
8. Experiment. "Occasionally it is possible to appeal to experimental ... evidence.... Here the strongest support for the causation hypothesis may be revealed.”
9. Analogy. "In some circumstances it would be fair to judge by analogy. With the effects of thalidomide and rubella before us we would surely be ready to accept slighter but similar evidence with another drug or another viral disease in pregnancy."

Bernstein, supra, 15 Cardozo L.Rev. at 2167-68 (quoting Austin Bradford Hill, The Environment and Disease: Association or Causation?, 58 Proc. Royal Soc'y Med. 295, 299 (1965)); see also Thompson, supra, 71 N.C. L.Rev. at 268-74.

. See, e.g., Black & Lilienfeld, supra, 52 Fordham L.Rev. at 762-63; Christopher L. Callahan, Establishment of Causation in Toxic Tort Litigation, 23 Ariz. St. L.J. 605, 626 (1991); Michael Dore, A Proposed Standard For Evaluating the Use of Epidemiological Evidence in Toxic Tort and other Personal Injury Cases, 28 How. L.J. 677, 691 (1985); see also Bailey et al., Reference Guide on Epidemiology, in Reference Manual on Scientific Evidence, supra, at 160-64.