OPINION
MURGUIA, Circuit Judge:
Stephanie Garcia appeals from the district court’s order affirming the Commissioner of Social Security’s (the “Commissioner”) denial of benefits on the basis that she was not intellectually disabled. Garcia argues that the administrative law judge (ALJ) who determined that she was not disabled had a duty to develop the record because that record did not include a complete set of valid IQ scores. We agree that the ALJ had a duty to order further IQ testing, and we further conclude that the ALJ’s failure to do so was an error that cannot be considered harmless. We therefore reverse the district court and remand for further proceedings.
I
As a minor, Stephanie Garcia received social security benefits because of her intellectual disability. After she reached the age of 18 in 2007, the Social Security Administration (SSA or the “Administration”) concluded that she no longer qualified as disabled and was therefore not entitled to further benefits. Garcia sought review by an ALJ, before whom she had a hearing on April 8, 2010. At the time of her hearing, Garcia lived with her mother and two siblings, as well as her own disabled daughter. Although she had learned some skills for caring for herself through an independent *927 living program, Garcia was dependent on her mother for her own care and for the care of her child. After taking special education classes, Garcia earned a high school diploma, but she was unable to read and did not know the alphabet.
Garcia worked part-time at a pizza shop for several months in 2008. She testified to having had difficulty with making pizzas, taking orders, and cashiering; as a result, she required constant supervision. She quit because she found the work “too hard.” Garcia was then placed in a clerical job by the California Department of Rehabilitation; her duties included photocopying, alphabetizing files, and removing staples from documents. She worked four or five hours per day, five days per week. She testified at her hearing that she had difficulty understanding how to perform the tasks assigned to her and had to rely on a coworker for help. Garcia also quit this job after two months because “[i]t was too hard.” Vicky Medina, Garcia’s counselor at the Central Valley Regional Center, testified that, based on her observations, Garcia would be unable to “do any job eight hours a day, five days a week as it would be performed in the national economy without extra supervision.” Medina explained that Garcia has difficulty remembering how to perform tasks, and that she needs to be re-taught “on a constant basis.”
Apart from her intellectual disability, Garcia has suffered from depression stemming from having to care for her young daughter, who has Down Syndrome, asthma, and heart and thyroid problems. Garcia has been treated for her depression, and her psychiatric condition has improved.
In evaluating Garcia’s disability claim, the ALJ considered the reports of three experts: psychologist Mary K. McDonald, Ph.D., psychologist Allen Middleton, Ph.D., and physician Evangeline Murillo, M.D.
On February 13, 2008, Dr. McDonald evaluated Garcia at the request of the California Department of Social Services. Dr. McDonald administered the Bender Visual Motor Gestalt Test, II Edition; the Wechsler Memory Scale, III Edition; and the Wechsler Adult Intelligence Scale, III Edition (“WAIS-III”). The WAIS-III measures an individual’s “intelligence quotient,” or “IQ”; IQ is reported as three scores: verbal, performance (non-verbal), and full scale. See 20 C.F.R. § 404, subpt. P, app. 1, listing 12.00 (“Listing 12.00”) (D)(6). Garcia’s scores on the Motor Gestalt Test were average to low average, and her Memory Scale scores indicated that her “[v]erbal memory is impaired and visual memory is within the low average range.”
Dr. McDonald administered only the performance portion of the WAIS-III “[d]ue to the constraints of time and the slowness with which [Garcia] worked.”1 Consequently, Dr. McDonald did not report a verbal or full-scale score. Garcia’s performance IQ score was 77, which is in the “borderline range” for disability. McDonald *928 concluded that Garcia was “capable of employment.”
After reviewing Garcia’s medical records, including the incomplete IQ test results, Dr. Middleton completed a Mental Residual Functional Capacity Assessment,2 Psychiatric Review Technique,3 and Case Analysis.4 He determined that Garcia was “moderately limited” in her “ability to [understand, remember, and carry out] detailed instructions.” He concluded that Garcia was “able to understand and remember [work] locations [and] procedures of a simple, routine nature involving 1-2 step job tasks [and] instructions.”
Dr. Murillo also reviewed Garcia’s medical records, including the incomplete IQ results, and completed a Mental Residual Functioning Capacity Assessment and Case Analysis.5 Like Dr. Middleton, Dr. Murillo concluded that Garcia was “moderately limited” in her “ability to [understand, remember, and carry out] detailed instructions.” She determined that Garcia could “understand and remember work locations and procedures of a simple, routine nature involving 1-2 step job tasks and instructions” and “maintain concentration and attention for above in 2 hour increments” during “8 hr/40 hr work schedules.”
At the hearing, the ALJ also heard testimony from vocational expert Thomas Dachelet. Dachelet testified that the ability to read and write at a basic level is a requirement for even those jobs classified by the Dictionary of Occupational Titles (DOT) as needing the lowest “general educational development.” However, he also acknowledged that Garcia had worked at “light unskilled” jobs at which “she didn’t read or write.” Dachelet testified that in California “there were 1,020,830 persons employed at the light unskilled level.” He identified three light unskilled jobs Garcia could perform: (1) a bagger, of which 44,304 were employed in California, (2) a garment sorter, of which 21,179 were employed in California, and (3) a grader,6 of which 20,188 were employed in California.
In a May 18, 2010, decision, the ALJ concluded that Garcia was not disabled as *929 of February 1, 2008, consistent with the SSA’s original determination. The ALJ determined that Garcia had the severe impairment of borderline intellectual functioning but that the impairment was not so severe that it met the requirements for intellectual disability; see 20 C.F.R. § 404, subpt. P, app. 1, listing 12.05 (“Listing 12.05”).
Listing 12.05 lays out four ways in which an individual may qualify as intellectually disabled without requiring any further inquiry into her ability to work: (1) “[m]ental incapacity ... such that the use of standardized measures of intellectual functioning is precluded”; (2) “[a] valid verbal, performance, or full scale IQ of 59 or less”; (3) “[a] valid verbal, performance, or full scale IQ of 60 through 70 and a physical or other mental impairment imposing an additional and significant work-related limitation of function”; and (4) “[a] valid verbal, performance, or full scale IQ of 60 through 70, resulting in at least two [milder impairments].” Id. Each of these alternatives depends on a subject’s IQ test performance, unless she is unable to undergo testing.
Based on Garcia’s performance IQ score of 77, the ALJ concluded that Garcia could not meet Listing 12.05. The ALJ further concluded that Garcia had the RFC “to perform a full range of work at all exertional levels but with the following nonexertional limitations: [Garcia] can perform simple repetitive tasks where the jobs can be learned mostly by demonstration, but she cannot perform reading and/or writing as a job task.” Based primarily on Dachelet’s testimony, the ALJ concluded that Garcia was “capable of making a successful adjustment to other work that exists in significant numbers in the national economy,” including the jobs of bagger, garment sorter, and grader. For this reason, the ALJ concluded that Garcia was “not disabled.”
Garcia appealed the ALJ’s decision to the Social Security Administration Appeals Council, but her appeal was denied, making the ALJ’s decision the final decision of the Commissioner. Garcia then sought judicial review of the Commissioner’s decision in the district court, arguing in part that the ALJ erred when she failed to develop the record by ordering a new IQ test administration to obtain a complete set of test scores. The district court affirmed the final decision of the Commissioner.
II
We review de novo a district court’s judgment affirming the denial of social security benefits. Bray v. Comm’r of Soc. Sec. Admin., 554 F.3d 1219, 1222 (9th Cir.2009). “We may set aside a denial of benefits only if it is not supported by substantial evidence or is based on legal error.” Robbins, 466 F.3d at 882.
It was legal error for the ALJ not to ensure that the record included a complete set of IQ test results that both the ALJ and the reviewing experts could consider. While it is not certain from the record before us that Garcia would have been determined to be disabled if the record had been properly developed, it is also not “clear from the record that ‘the ALJ’s error was inconsequential to the ultimate nondisability determination.’ ” Tommasetti v. Astrue, 533 F.3d 1035, 1038 (9th Cir. 2008) (quoting Robbins v. Soc. Sec. Admin., 466 F.3d 880, 885 (9th Cir.2006)). Therefore we reverse the district court and remand with instructions to reverse the final decision of the Commissioner and to order the Commissioner to develop the record through further IQ testing.
*930 III
To be eligible for disability benefits, an individual must be unable “to engage in any substantial gainful activity by reason of any medically determinable physical or mental impairment which can be expected to result in death or which has lasted or can be expected to last for a continuous period of not less than 12 months.” 42 U.S.C. § 423(d)(1)(A).
The evaluation of disability in adults is governed by a five-step process, which the ALJ followed in assessing Garcia. 20 C.F.R. § 416.920. The ALJ skipped the first and fourth steps, as they were not applicable to Garcia’s situation.7 At the second step, the ALJ determines whether a claimant has an impairment or combination of impairments that is medically severe; if not, the claimant is not disabled. Id. §§ 416.920(a)(4)(ii), 416.920(c). The ALJ concluded that Garcia had the severe impairment of “borderline intellectual functioning,” and so proceeded to the third step.
At the third step, the ALJ again considers the severity of the impairment or combination of impairments by comparing it to the listings in 20 C.F.R. § 404, subpart P, appendix 1. Id. §§ 416.920(a)(4)(iii), 416.920(d). If the impairment or combination of impairments is at least as severe as the relevant listing, and has lasted at least twelve months, then the claimant is deemed disabled, and the inquiry ends; otherwise, the ALJ proceeds to the next step. Id. The ALJ concluded that Garcia did not meet Listing 12.05 and so proceeded to step five. At the fifth step, the ALJ considers the claimant’s RFC — that is, her ability to work in spite of her limitations — along with her age, education, and work experience, to determine whether she can make an adjustment to a new kind of work. Id. § 416.920(a)(4)(v). The ALJ concluded that Garcia could perform jobs requiring the ability to undertake simple, repetitive tasks, and so found that she was not disabled.
IV
Garcia argues that the ALJ erred by failing to order additional IQ testing and instead relying on the results of the partial examination performed by Dr. McDonald. We agree. “The ALJ always has a ‘special duty to fully and fairly develop the record and to assure that the claimant’s interests are considered.’ ” Celaya v. Halter, 332 F.3d 1177, 1183 (9th Cir.2003) (quoting Brown v. Heckler, 713 F.2d 441, 443 (9th Cir.1983)).
The ALJ is not a mere umpire at such a proceeding ...: it is incumbent upon the ALJ to scrupulously and conscientiously probe into, inquire of, and explore for all the relevant facts. He must be especially diligent in ensuring that favorable as well as unfavorable facts and circumstances are elicited.
Id. (quoting Higbee v. Sullivan, 975 F.2d 558, 561 (9th Cir.1992)).
In a case, such as this one, that turns on whether a claimant has an intellectual disability and in which IQ scores are relied upon for the purpose of assessing that disability, there is no question that a “fully and fairly develop[ed]” record, id., will include a complete set of IQ scores that report verbal, non-verbal, and full-scale *931 abilities. There are two principal reasons for our conclusion.
First, IQ testing plays a particularly important role in assessing the existence of intellectual disability. Listing 12.00 generally lays out the necessary procedures for evaluating mental disorders, including intellectual disability, and for documenting relevant objective findings. In that listing the SSA has recognized that “[s]tandardized intelligence test results are essential to the adjudication of all cases of intellectual disability,” except where a claimant is unable to complete such testing. Listing 12.00(D)(6)(b). At the third step of the SSA’s five-step process, when a claimant’s impairment is compared to the criteria in Listing 12.05, three of the four criteria for intellectual disability rely in whole or in part on IQ test scores. (The fourth criterion applies when the claimant’s incapacity precludes IQ testing.) Because meeting the relevant listing conclusively determines that a claimant is indeed disabled, 20 C.F.R. § 416.920(a)(4)(iii), the claimant’s IQ score can be the deciding factor in a determination of intellectual disability.
Further, as was the case with Garcia, IQ test results can play a role in the development of other evidence in the record. For example, Dr. Middleton and Dr. Murillo both reviewed Garcia’s IQ results before making their determinations about her ability to work. Thus, as a practical matter, the importance of IQ scores in this case did not end with step three. The partial test results also affected the ALJ’s conclusions about Garcia’s ability to work, even if less directly.
The second reason for our conclusion is that the regulations promulgated by the SSA demonstrate that the Administration, based on its considerable expertise, has determined that it is essential for complete — rather than partial — sets of IQ scores to be used in evaluating intellectual disability. As a general principle, all reports of test results “must conform to accepted professional standards and practices in the medical field for a complete and competent examination,” 20 C.F.R. § 416.919n(b), and an examination is not complete unless it includes “all the elements of a standard examination in the applicable medical specialty,” id. § 416.919n(c).
The regulations specifically identify the “Wechsler series” of IQ tests (of which WAIS-III is a part) as “customarily” including “verbal, performance, and full scale IQs.” Listing 12.00(D)(6)(c). This characteristic of the Wechsler exam makes it particularly well suited to the assessment of intellectual disability, because “[g]enerally, it is preferable to use IQ measures that are wide in scope and include items that test both verbal and performance abilities.” Listing 12.00(D)(6)(d).
The Commissioner argues that the regulations themselves suggest it is acceptable for an ALJ to rely on partial test results in a situation, such as this one, in which only part of an IQ test was administered. The Commissioner points specifically to a passage in Listing 12.00 providing that “[i]n cases where more than one IQ is customarily derived from the test administered, e.g., where verbal, performance, and full scale IQs are provided in the Wechsler series, we use the lowest of these in conjunction with [Listing] 12.05.” Id. at 12.00(D)(6)(c).8
However, our reading of this same passage leads us to conclude the opposite: Listing 12.00 strongly disfavors reliance on partial test results. The plain text of the regulation clearly suggests that IQ tests like those in the Wechsler series should be *932 administered and reported in full, because it assumes that the ALJ will have multiple scores — “verbal, performance, and full scale” — from which to “use the lowest.” We also note that the regulations’ insistence that the ALJ look at all three scores in order to identify the lowest among them seems intended to benefit the disability claimant, for whom each test score is an opportunity to demonstrate that she meets one of the IQ-related criteria specified in Listing 12.05 — as well as an opportunity to demonstrate the extent of her impairment to other experts reviewing her IQ as part of their own evaluations of her limitations.
Because the regulations clearly assert the importance of a complete IQ test administration, the ALJ had a duty to develop the record so that it included a complete set of IQ test results. Her failure to do so was legal error.9
V
Our conclusion that the ALJ committed legal error is not the end of our inquiry. We will not reverse an ALJ’s decision on the basis of a harmless error, “which exists when it is clear from the record that ‘the ALJ’s error was inconsequential to the ultimate nondisability determination.’ ” Tommasetti, 533 F.3d at 1038 (quoting Robbins, 466 F.3d at 885). While the record here may not definitively demonstrate that Garcia would have been adjudicated disabled if the ALJ had ordered that a complete set of IQ tests be administered, it is certainly not clear from the record that Garcia was not harmed by the ALJ’s error.10
*933 Again, we recognize that the importance of IQ test results in adjudicating intellectual disability is not limited to the claimant’s ability to meet the listing at step three of the five-step process. Both Dr. Middleton and Dr. Murillo considered Garcia’s incomplete IQ test results in assessing her ability to support herself through gainful employment, and the ALJ relied on these experts’ findings in assessing Garcia’s RFC and ultimately in determining that she was not disabled. The Commissioner points out that neither Dr. Middleton nor Dr. Murillo “expressed any concerns about the adequacy of Dr. McDonald’s psychological testing,” but that does not necessarily mean that neither would have reached a different conclusion or offered other findings beneficial to Garcia based on a complete set of scores. Such an outcome seems particularly plausible where, as here, Garcia’s testing history as a juvenile strongly suggests that her verbal and full-range IQ scores would be considerably lower than the performance score of 77 obtained by Dr. McDonald. In a December 2004 test administration, Garcia was assessed with a verbal score of 61, a performance score of 74, and a full-scale score of 66. In June 2005, she received a full-scale score of 44 and a verbal score of 53. Further, the testimony of Garcia’s counselor Vicky Medina also suggests that verbal functioning was a particular weakness for Garcia.
In this case, there is a genuine probability that, had a complete set of valid IQ test scores been included in the record, the opinions of the reviewing experts might have been different, or Garcia might have had an additional factual basis for challenging their opinions. This is especially true when, just three years earlier, Garcia’s full-scale test score was dramatically below the threshold for establishing disability even on the basis of just the score by itself. See Listing 12.05(B) (providing that intellectual disability may be established by “[a] valid verbal, performance, or full-scale IQ of 59 or less”). The fact that IQ test results may be considered by multiple reviewing experts, as well as by the ALJ, makes it particularly difficult to conclude that any error affecting the quality of those results is “inconsequential to [an] ultimate nondisability determination,” let alone to conclude that such harmlessness is “clear from the record.” Tommasetti, 533 F.3d at 1038.
Perhaps even more significantly, Garcia may have been able to meet Listing 12.05(B),11 under which she would have been adjudicated disabled if she had scored below 60 on either the verbal, performance, or full-scale portion of an IQ test. Given that Garcia had previously received a childhood Wechsler full-scale score of 44 and a verbal score of 55, and that she tended to score lower on the verbal component than on the performance component, it appears likely that Garcia could have met Listing 12.05(B) at step three of the evaluation process. Based on that evidence alone, it cannot be “clear from the record” that failure to obtain *934 those two tests was “inconsequential.” Tommasetti, 533 F.3d at 1038.
VI
The ALJ’s failure to develop the record to include a complete set of IQ scores was legal error. Because we cannot conclude that the error was harmless, we REVERSE the judgment of the district court and REMAND with instructions to remand to the Commissioner for further proceedings.
1. This is not the first time that Dr. McDonald has given this reason for failing to administer a complete IQ test when evaluating a patient for intellectual disability. See Andrade v. Comm’r of Soc. Sec., No. 1:09-cv-1926 GSA, 2011 WL 864700 (E.D.Cal. Mar. 10, 2011), aff’d, 474 Fed.Appx. 642 (9th Cir.2012) ("Dr. McDonald’s report indicates that only the Performance IQ portion of the Wechsler Adult Intelligence Scale was administered 'due to the constraints of time.' ”). This excuse is troublesome, and the district court should not have accepted it in the absence of some more compelling reason. The SSA’s regulations indicate that potentially disabled individuals may take more time than others to complete an IQ test administration, and the administrator of the test should plan accordingly. See 20 C.F.R. § 416.919n(a).
2. Residual Functional Capacity (RFC) is the work that an individual is capable of performing in spite of her limitations. 20 C.F.R. § 416.945(a)(1). The Mental RFC Assessment form used by Dr. Middleton, Form SSA-4734-SUP, requires a reviewing expert to evaluate the degree of the subject's limitations in various aspects of (1) "understanding and memory," (2) "sustained concentration and persistence,” (3) "social interaction,” and (4) "adaption,” such as in responding to workplace hazards or navigating public transportation. Based on the evaluation of the subject’s limitations in each category, the reviewing expert then makes a general assessment of the subject's "functional capacity.”
3. The Psychiatric Review Technique form used by Dr. Middleton, Form SSA-2506-BK, requires the reviewing expert to (1) summarize relevant documentation, such as IQ test results, (2) rate the subject’s "functional limitations,” and (3) provide additional notes in narrative form.
4. Dr. Middleton used Form SSA-416, on which he listed "significant objective findings,” such as Garcia’s IQ test scores, her progress in school, and her depression.
5. Dr. Murillo completed the same forms as Dr. Middleton: Mental RFC Assessment Form SSA-4734-SUP and Case Analysis Form SSA-416.
6. Dachelet refers to the DOT listing for a "fruit-grader operator.” One employed in this position "[t]ends machine that grades fruit according to size: Changes chains and other driving gear according to type of fruit. Directs workers engaged in loading of elevator belt and removal of graded fruit. Cleans and lubricates chains, bearings, and machine gears, using rags and grease gun. Repairs, replaces, and adjusts malfunctioning parts of machine.” DOT 529.665-010, 1991 WL 674628.
7. At the first step, the ALJ would have considered Garcia’s present work activity; however, this step does not apply to individuals whose disability determinations are being reevaluated because they turned 18. See 20 C.F.R. § 416.987(b). At the fourth step, the ALJ would have considered Garcia's past relevant work, id. § 920(a)(iv); however, the ALJ skipped this step because she concluded that Garcia did not have any past relevant work.
8. The district court came to the same conclusion.
9. We recognize that our holding here is contrary to Andrade v. Commissioner of Social Security, 474 Fed.Appx. 642 (9th Cir.2012). We are not bound by our earlier decision. See 9th Cir. R. 36-3(a).
10. The dissent suggests that the harmlessness standard recognized in Tommasetti does not apply to cases in which the legal error at issue is a failure of the duty to develop the record. Citing McLeod v. Astrue, 640 F.3d 881 (9th Cir.2011), the dissent argues that in such cases we should turn our stringent harmlessness standard on its head and presume any error is harmless until the claimant or record demonstrates otherwise. See Dissent at 934, 937-38. McLeod provides no basis for us to create such a peculiar carve-out from our well-established rule. We have consistently treated an ALJ’s failure to adequately develop the record as reversible legal error. See Celaya, 332 F.3d at 1183. We have never suggested that failure to develop is somehow lesser error, or should be treated differently to other types of legal error. Indeed, often the same error can be characterized as either failure-to-develop or "normal” legal error depending on how it's described. Adopting a separate — and inverted — harmlessness standard for failure-to-develop cases would not only create confusion in our case law, but also hinge a great deal on a nebulous, and often unimportant, distinction.
McLeod concerned a disability claim by a veteran who argued on appeal that the ALJ had failed adequately to develop the record. We observed that there may be situations in which "further administrative review is needed to determine whether there was prejudice from the error [of not developing the record].” 640 F.3d at 888. However, contrary to the dissent’s assertion, we explicitly recognized that "it is quite clear that no presumptions operate, and we must exercise judgment in light of the circumstances of the case.” Id. We remanded to the ALJ for a harmlessness determination, even though it was not clear from the record that the potentially omitted evidence — a VA disability rating — even existed. McLeod is limited to situations where the record is insufficient for the court to make its own prejudice determination, and remand is for the ALJ to determine the harmfulness of the omission in the first instance. It makes good sense that, in such a situation, "mere probability” that hypothetical new evidence— like the potential disability certificate — may be influential is insufficient to support a remand. Because, here, we know precisely which evidence was omitted from the record and have no doubts about its significance in reaching an intellectual disability determination, we see no reason to depart from the harmlessness standard articulated in Tommasetti.
11. The dissent argues we should ignore Listing 12.05(B) when reviewing for harmless error because Garcia "never claimed on appeal that she would have qualified under Listing 12.05 B.” Dissent at 25 (emphasis in original). Garcia’s opening brief, however, clearly raised the issue. Garcia argued that “[b]ased on the high correlation between the tests, the expected verbal IQ score supports the contention that the complete IQ test would result in IQ scores sufficient to meet or equal the Listing ... 12.00.” Listing 12.00 describes the evaluation process to determine whether an applicant’s impairment is a "mental disorder.” It expressly states: “If your impairment satisfies the diagnostic description in the introductory paragraph [of Listing 12.05] and any one of the four sets of criteria, we will find that your impairment meets the listing.” Listing 12.00 (emphasis added).