concurring in part and dissenting in part.
I agree with my colleagues in the majority only to the extent that the challenged tests did have a disparate impact. There is little doubt in my mind, however, that the majority’s question, whether “the employer[s] show[ed] that the challenged employment practice creating this disparate result is nevertheless job-related for the position in question and consistent with business necessity,” supra at 111, cannot be answered in the affirmative based on this record.17 In my view, the district court committed clear error in finding that the challenged tests were valid when viewed through the legal prism of Title VII, 42 U.S.C. § 2000e et seq. M.O.C.H.A. Soc’y, Inc. v. City of Buffalo, 689 F.3d 263, 275 (2d Cir.2012); Ass’n of Mex.-Am. Educators v. California, 231 F.3d 572, 584-85 (9th Cir.2000) (en banc); Melendez v. Ill. Bell Tel. Co., 79 F.3d 661, 669 (7th Cir.1996); Hamer v. City of Atlanta, 872 F.2d 1521, 1526 (11th Cir.1989); Bernard v. Gulf Oil Corp., 890 F.2d 735, 739 (5th Cir.1989).
A review of the record shows that Boston18 did not, contrary to the district court’s finding and the majority’s assertion, “show[ ] that the content of the selection procedure is representative of important aspects of performance on the job for which the candidates are to be evaluated.” Supra at 111 (quoting 29 C.F.R. § 1607.5(B)); see also 29 C.F.R. § 1607.5(A). Because there is ample precedent on which to draw, see, e.g., Bos. Chapter, NAACP, Inc. v. Beecher, 504 F.2d 1017 (1st Cir.1974), I need not engage the majority’s emphasis on the non-binding nature of EEOC Guidelines, supra at 111-12, nor rest my objection on what I would consider the Guidelines’ rather overwhelming persuasiveness vis-à-vis this case. Id. at 111-12 (citing Jones v. City of Bos., 752 F.3d 38, 50 n. 14 (1st Cir.2014)). It is enough to say that, based on our precedent and this record, there is a solid legal basis to find that the district court’s acceptance of Boston’s case for content validity is clearly erroneous.
The most significant flaws in Boston’s case for validity should each independently have been fatal to it: Boston failed to demonstrate (1) that the 1991 Validation Report and 2000 job analysis were applicable and reliable19 and (2) that the exams tested “representative” and critical knowledge, skills, and abilities (“KSAs”) necessary to qualify for the position of police sergeant.
This first flaw stems from “the way in which the validation study was performed” and its effect on test validity. Beecher, 504 F.2d at 1025. The Validation Report and job analysis were defective. The district court acknowledged the “rule of thumb” that a job analysis should typically have been performed within the last five to eight years to be reliable. López v. City of Lawrence, No. 07-11693-GAO, 2014 U.S. Dist. LEXIS 124139, at *51 (D.Mass. Sept. 5, 2014). Yet the 1991 job analysis and the resultant Validation Report predate the first of the contested exams by fourteen years. Neither of the two conditions noted by the district court as potentially saving an older analysis from obsolescence — lack of change in job requirements or a later review updating the analysis — rescues the Report. Id.; cf. 29 C.F.R. § 1607.5(E) (explaining that the totality of the circumstances should be considered in determining whether a validation study is outdated).
The Officers bolstered the presumption that a test more than eight years old is not reliable, and the common-sense conclusion that a position changes over time, by pointing to specific evidence that the defendants’ police departments had changed their practices since the Report and analysis were performed: testimony from Commissioner Edward F. Davis that Lowell had implemented a community policing model, and a 2002 Boston Commissioner’s memo referring to changes in policing policy and practice. While the district court was entitled to rely on Dr. Outtz’s testimony as to the unchanging nature of the position of sergeant, it clearly erred in drawing from that testimony the proposition that the position of police sergeant in the defendant departments had not changed, because Dr. Outtz based his statement on “[his] experience generally” with the position in other municipalities, including those in other states.
The subsequent job analysis, completed in 2000 and thus within the time range to be presumed reliable, is unreliable by virtue of the way it was performed. The 2000 job analysis suggests that the eleven subject matter experts (“SMEs”), sergeants and detective sergeants relied upon by the testing firm to evaluate KSAs and tasks for inclusion in the exam, were to do so individually; the analysis details procedures for reconciling disparate results to determine which items should make the cut. For example, “[f]or a KSA to be included as a[sic] important component of the Police Sergeant position, the KSA had to be rated by nine ... of the eleven ... SMEs” in a certain way across all five categories. Yet the eleven SMEs evaluating 160 KSAs each rated all 160 KSAs’ five attributes — job relatedness, time for learning, length of learning, differential value to performance, and necessity20 — in exactly the same way, although there were 72 possible ways to rate each KSA. The same was true of the task ratings, wherein each SME was supposed to rate each of the 218 tasks’ frequency, importance, necessity, relationship to performance, and dimensions,21 despite the fact that each of the 218 tasks could be rated in 1,674 ways. I will not speculate as to how and why this total agreement occurred but only observe that an analysis that generates a result so unfathomably inconsistent with its proposed methods is not reliable.22 As such, it was clear error to find that the 2000 job analysis supports the exams’ validity. Beecher, 504 F.2d at 1025.
Beyond these threshold issues, the resultant exams did not test a representative portion of the KSAs. See 29 C.F.R. § 1607.5(B). Nor did they test critically important KSAs “in proportion to their relative importance on the job.” Guardians Ass’n of N.Y.C. Police Dep’t, Inc. v. Civil Serv. Comm’n of N.Y.C., 633 F.2d 232, 243-44 (2d Cir.1980) (citation omitted); see also Beecher, 504 F.2d at 1024 (noting that the district court did not err in finding that two significant correlations between exam and job performance components did not make “‘convincing’ evidence of job relatedness” (citation omitted)); 29 C.F.R. § 1607.14(C)(2) (an exam should measure “critical work behavior(s) and/or important work behavior(s) constituting most of the job”).
The 2000 job analysis identified 163 “important tasks” and 155 “important” KSAs. The district court acknowledged that the eighty-point multiple-choice portion of the exams tested primarily the “K” of the KSAs, knowledge, and failed to measure key skills and abilities, and thus would not be independently valid. López, 2014 U.S. Dist. LEXIS 124139, at *60-61. The E & E component that purportedly compensated for the “SA” deficit, edging the exams into the realm of validity, consisted of a single sheet requiring candidates to bubble in responses as to length of work experience in departmental positions by rank, educational background, and teaching experience. As the majority concedes, this component had a minimal effect on score. Supra at 113.
The conclusion that more than half, López, 2014 U.S. Dist. LEXIS 124139, at *54, or nearly half, supra at 115 n. 11, of the applicable KSAs were or could be tested by the exams overestimates the number of KSAs tested by the E & E component. But even if that estimate were correct, relying upon this quantitative measure overlooks the fact that representativeness is partly a qualitative inquiry.
It is quite a stretch to conclude that the E & E’s bubbles incorporated measures of the majority of key skills and abilities. It is even more difficult to conclude from the record that the skills and abilities measured received representative weight. Supra at 113. How, exactly, could this worksheet test, as the testability analysis suggests, “[k]nowledge of the various communities within the Department’s jurisdiction and the factors which make them unique,” “[s]kill in perceiving and reacting to the needs of others,” or “[k]nowledge of the procedures/techniques when a major disaster occurs”? And how, if it only affected the ultimate score by five to seven percent at most, supra at 113, could it be said that the KSAs for which the E & E ostensibly tested were adequately represented relative to those KSAs tested on the multiple-choice component?
The exams’ failure to include particularly significant KSAs also precludes representativeness. See Gillespie v. Wisconsin, 771 F.2d 1035, 1044 (7th Cir.1985) (“To be representative for Title VII purposes, an employment test must neither: (1) focus exclusively on a minor aspect of the position; nor (2) fail to test a significant skill required by the position.” (emphasis added)); Guardians, 630 F.2d at 99. The exams here may have tested the knowledge a supervisor must have, but they omitted any meaningful test of supervisory skill, which is unquestionably essential to the position of police sergeant. López, 2014 U.S. Dist. LEXIS 124139, at *51. Other courts have found written tests to be altogether inadequate to evaluate supervisory skill. See Vulcan Pioneers, Inc. v. N.J. Dep’t of Civil Serv., 625 F.Supp. 527, 547 (D.N.J.1985), aff'd on other grounds, 832 F.2d 811, 815-16 (3d Cir.1987); see also Firefighters Inst. for Racial Equality v. City of St. Louis, 549 F.2d 506, 513 (8th Cir.1977).
As in Beecher, “[t]here are, in sum, too many problems with the test ... to approve it here.” 504 F.2d at 1026. It cannot be anything but clear error, supra at 114, to find valid exams that rest on an outdated validation report and a facially flawed job analysis, and that are not only unrepresentative but also omit KSAs critical to the position of police sergeant. To endorse the means by which these exams were created, and the exams themselves, establishes a perilous precedent that all but encourages corner-cutting when it comes to Title VII.
On these grounds, I respectfully dissent.
17. I would also have found the Officers established a prima facie case as to all defendants, but, as the majority does not address this question, supra at 107, I will focus on test validity.
18. Like the majority, supra at 107, I will refer primarily to Boston for the sake of simplicity.
19. As I would find that neither sufficed to support the exams’ validity, it does not matter which of the two, the 2000 job analysis or the 1991 Validation Report, Boston relied upon for each test. See supra at 109 n. 3.
20. Job relatedness could be answered “[y]es” or “[n]o”; time for learning, “[b]efore assignment” or “[a]fter assignment”; length of learning, “[l]onger than brief orientation” or “[b]rief orientation”; differential value to performance, “[h]igh,” “[m]oderate,” or “[l]ow”; and necessity, “[r]equired,” “[d]esirable,” or “[n]ot required.”
21. Frequency could be rated “[r]egular[ly],” “[p]eriodic[ally],” or “[o]ccasional[ly]”; importance, “[v]ery important,” “[i]mportant,” or “[n]ot important”; necessity, “[n]ecessary upon entry” or “[n]ot necessary”; and relationship to performance, “this task clearly separates the best workers,” “better workers seem to perform this better than poor or marginal workers,” or “[m]ost perform this task equally well.” Dimensions could be answered using any combination of “[o]ral [c]ommunication,” “[i]nterpersonal [s]kills,” “[p]roblem ID & [a]nalysis,” “[j]udgment,” and “[p]lanning and [o]rganizing,” or “all.”
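For the reader who wishes to verify the combinatorial figures cited in the text, the arithmetic below is a sketch based on the rating options enumerated in notes 20 and 21; it assumes that the “dimensions” attribute permitted any non-empty combination of its five listed values, with “all” simply denoting the full combination.

\[ 2 \times 2 \times 2 \times 3 \times 3 = 72 \ \text{possible ratings for each KSA} \]

\[ 3 \times 3 \times 2 \times 3 \times (2^{5} - 1) = 54 \times 31 = 1{,}674 \ \text{possible ratings for each task} \]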
22. A second suspect aspect of this analysis, one that further underscores how troubling the purported across-the-board agreement is, lies in how the SMEs rated certain KSAs and tasks. For example, all eleven SMEs — including two assigned to administrative roles — responded that “[s]et[ting] up command posts at scenes of [r]obberies, homicides, fires, etc.,” was a “daily” task.