COMMONWEALTH OF PENNSYLVANIA et al.
v.
Joseph F. O'NEILL, et al.
Civ. A. No. 70-3500.
United States District Court, E. D. Pennsylvania.
January 23, 1979.
*452 Henry W. Sawyer, III, Alan F. Klein, Philadelphia, Pa., for plaintiffs.
James M. Penny, Jr., Philadelphia, Pa., for defendants.
MEMORANDUM AND ORDER
FULLAM, District Judge.
This action, challenging the hiring and promotional practices of the Philadelphia Police Department on racial grounds, was instituted on December 21, 1970. As a result of various statutory changes, amendments to the pleadings, class-action rulings, and allowances of intervention, the case now includes as plaintiffs the Commonwealth of Pennsylvania, a class consisting of all applicants and would-be applicants for employment on the Police Force, a class consisting of all black police officers in the Philadelphia Police Department, and the Guardian Civic League of Philadelphia (an organization of black police officers). The defendants include the City of Philadelphia, the Mayor, the Police Commissioner, and the Fraternal Order of Police.
On July 7, 1972, after 5½ days of hearings on plaintiffs' application for preliminary relief, I concluded that plaintiffs had clearly established that the existing entrance and promotional examinations did *453 discriminate on the basis of race. Since no attempt had then been made to "validate" these tests as job-related, I entered an Order enjoining the defendants from hiring or promoting on the basis of these examinations, except in the same ratio (2-to-1) as the racial distribution of the applicant pool, until such time as the existing tests should be validated, or new tests developed.
On appeal from this ruling, there was little or no dispute about the discriminatory impact and lack of validation of the existing tests; the litigated issue was the scope of interim relief to be afforded. With respect to hirings, this Court's Order was eventually affirmed by an evenly divided Court en banc. With regard to promotions, the Order was vacated. Commonwealth of Pa. v. O'Neill, 473 F.2d 1029 (3d Cir. 1973). It should perhaps be mentioned that, throughout these appellate proceedings and parallel proceedings in this Court, it was made clear that the defendants, whose testing procedures had been under challenge for nearly two years, were confident that, by January of 1973, they would be able either to vindicate the existing examinations, or to supply new examinations. The decision of the Court of Appeals was rendered on February 8, 1973.
The case was scheduled for final hearing in this Court in April of 1973. At issue were (1) the validity of the entrance examination itself; (2) the validity of the "background investigation" screening process; and (3) the validity of each promotional examination. In addition, of course, there would have been subsidiary issues as to the scope of interim relief, in the event of a decision adverse to the continued use of any or all of these tests and procedures.
Instead of proceeding to the final hearing, the parties, on April 10, 1973, agreed to the entry of a Consent Decree. The principal features of this Decree have to do with the entrance examination (the defendants represented that they had retained Educational Testing Service, of Princeton, New Jersey, to devise a completely new set of entrance examinations, which were to be available by January of 1974); the background investigation process (the defendants agreed to revise this procedure, after obtaining recommendations from qualified consultants on the subject); the rights and remedies of persons adversely affected by the existing procedures (back pay and seniority adjustments); and interim hiring procedures (hirings only to fill existing vacancies, rather than for expansion of the Police Force; reconsideration of rejections based on background evaluations, by a new, impartial panel).
With respect to the promotional examinations, the Decree provided only as follows:
"7. The defendants have represented to the Court that they are in the process of revising all promotional examinations. The Court retains jurisdiction to dispose of any questions which may arise concerning promotions. Nothing in this Decree is directed to the subject of promotions."
No useful purpose would be served at this time in recounting the difficulties, disputes, and delays attendant upon implementation of the Consent Decree of April 10, 1973. It is sufficient for present purposes to state that (1) new entrance examinations were eventually prepared and have been administered, and litigation concerning their validity is under way, in a separate aspect of these proceedings; (2) the background evaluation process has been revised, and litigation concerning the validity of the new procedure is also under way, in a separate aspect of this case; and (3) the validity of the promotional examinations (whether "new" as contended by defendants, or essentially unchanged, as contended by plaintiffs) is now before the Court, and is the subject of this Opinion.
FINDINGS OF FACT
1. The racial breakdown for all ranks within the Philadelphia Police Department is as follows:
*454
RANK                PERCENTAGE OF BLACK OFFICERS
Chief Inspector      0%
Inspector           10.71%
Staff Inspector      5.26%
Captain              8.14%
Lieutenant           6.42%
Sergeant            12.58%
Detective           15.43%
Corporal            14.21%
Policeman           18.14%
2. Promotional examinations for the ranks of corporal, detective, sergeant and lieutenant are entirely written. Eligibility for promotion is determined by a final score derived 90% from the written examination, and 10% from seniority. For the ranks of captain through chief inspector, a combination of written and oral examination is employed; for those positions, the written examination accounts for 60% of the final score, the oral examination for 30%, and seniority for 10%. To be eligible to take the examination for corporal or detective, a candidate must have been a police officer for at least one year. To be eligible to take the examination for sergeant, a candidate must have two years service in a prior rank. For lieutenant through staff inspector, one year of service in the next lower rank is required. For inspector, two years experience as captain or staff inspector is required. For chief inspector, two years experience as inspector or staff inspector is required.
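The weighting described in this finding can be illustrated with a short sketch. It assumes that every component (written, oral, seniority) is expressed on a common 0-to-100 scale, which the record does not state; the function name `final_score` and the sample figures are hypothetical.

```python
def final_score(written, seniority, oral=None):
    """Weighted final score; all inputs are assumed to be on a 0-100 scale."""
    if oral is None:
        # Corporal, detective, sergeant, lieutenant: 90% written, 10% seniority.
        return 0.90 * written + 0.10 * seniority
    # Captain through chief inspector: 60% written, 30% oral, 10% seniority.
    return 0.60 * written + 0.30 * oral + 0.10 * seniority


# Hypothetical candidates, for illustration only.
sergeant_candidate = final_score(written=80, seniority=100)
captain_candidate = final_score(written=80, seniority=100, oral=70)
```

Under these assumed inputs, the written examination dominates the result at the lower ranks, which is why any disparate impact of the written test carries almost directly into the eligibility list.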
3. Until 1973, performance ratings were also taken into account in determining eligibility for promotion. Performance ratings are no longer considered.
4. The written promotional examinations for the positions of corporal, detective, and sergeant have a statistically significant disparate impact upon black applicants. For all examinations from 1966 to 1975, whites "passed" the written examinations for corporal at a rate of 1.71 to 1 relative to blacks; for detective at a rate of 1.78 to 1; and for sergeant at a rate of 1.65 to 1. The likelihood that such results could have occurred by chance is less than 1 in 1 million.
5. The disparate impact upon black applicants is greater with respect to the tests currently in use than was true of the earlier tests during the period referred to. For example, in the most recent tests for corporal, whites "passed" at a rate of 2.79 compared to blacks, and at a rate of 2.02 for the position of detective (the white sergeant pass rate for the current test is approximately the same as for the earlier tests, 1.61 versus 1.63).
6. It is probable that the foregoing figures substantially underestimate the true disparate racial impact of the tests, since all persons taking the test had previously "passed" an entrance examination which itself had a disparate racial impact. White applicants "passed" the entrance examination at a rate of 1.82 to 1 over black applicants. Commonwealth v. O'Neill, 348 F. Supp. 1084, at p. 1089.
7. There is no evidence that the promotional examinations for the positions higher than sergeant have a statistically significant disparate racial impact. That is, there is no evidence that the white applicants taking those tests "passed" at a higher rate than black applicants. Of course, the cumulative effect of the disparate impact of the entrance and sergeant examinations is to render a disproportionately small percentage of blacks eligible to take the examinations for promotion to the higher ranks.
8. In early 1973, trial on the merits of all of the issues in this litigation was scheduled to take place. Under challenge were the entrance examination (as to which a preliminary injunction had been entered by this Court, and affirmed by a divided vote of the Court of Appeals); promotional examinations to all ranks; and the background-screening process. On the eve of trial, an interim settlement was agreed upon, and embodied in a Consent Decree presented to, and approved by, this Court. Pursuant to the Consent Decree, the defendants obligated themselves to carry out a contract with Educational Testing Service, of Princeton, New Jersey, to develop new entrance examinations and validate them as job-related; and to revise the background screening process. The Consent Decree contained detailed provisions which *455 were to remain in effect pending completion of these tasks. The Consent Decree, by agreement of counsel, did not contain any injunctive provisions covering use of the promotional examinations. The Decree did contain the following recital:
"7. The defendants have represented to the Court that they are in the process of revising all promotional examinations. The Court retains jurisdiction to dispose of any questions which may arise concerning promotions. Nothing in this Decree is directed to the subject of promotions."
9. On July 31, 1974, in the course of a hearing in which plaintiffs sought an adjudication of contempt, claiming that the defendants had failed to comply with the Consent Decree in various respects, the City Solicitor of Philadelphia stated to the Court that Educational Testing Service had been retained for the purpose of revising the police promotion examinations. The City Solicitor further stated that a "detailed report" setting forth the status of Educational Testing Service's activities with respect to the promotional examinations would be submitted to the Court by August 5, 1974.
10. In fact, the City never retained Educational Testing Service or any other outside firm in connection with the promotional examinations. At a hearing on October 17, 1974, it was revealed that the only action taken pursuant to the representations about promotional examinations contained in the Consent Decree was that City personnel were continuing to "try to improve the tests;" and that a City employee was in the process of working on a validity study covering those tests. However, even at that date, the promotional examinations administered a few months earlier had not yet been analyzed for racially disparate impact.
11. The "revised" promotional examinations which are the subject of this Opinion do not differ in any material respect from the promotional examinations given during earlier years.
12. In 1970, a confidential report of the Commission on Human Relations had criticized the written tests employed by the Personnel Department (including the police tests), and had recommended that the City:
"[Establish] a new section within the Personnel Department to perform ongoing validation studies. This section . . . should be headed by a psychologist qualified in the area of examinations. This section will conduct validation studies of every selection standard involved in every City position.
"Research in the field of human rights indicates that black persons particularly and other groups are unable to compete in these areas although they may have the ability to do the job in question as well or better than candidates who score higher or who would not be eliminated by the above factors . . .." (At pp. 10-11.)
13. From the outset of this litigation in 1970, and on the basis of the evidence presented at earlier hearings on the issues of preliminary relief, it has been quite clear that the examinations used by the Philadelphia Police Department do have disparate racial impact.
14. The defendants have made no effort to determine whether the cause of disparate racial impact can be eliminated without affecting the usefulness of the examination.
15. It would be relatively easy and inexpensive to perform a differential item analysis of these examinations, to determine the relative scores of whites and blacks on each item of the test. Depending upon the results, such an analysis might demonstrate that the difference in final test scores by race is attributable to a few unimportant items, or to ambiguity in particular questions; or it might demonstrate a pervasive phenomenon, and thus tend to justify continued use of the tests. No such analysis has ever been attempted by the defendants.
16. The principal thrust of defendants' efforts throughout this litigation has been to avoid any significant changes in the hiring and promotional practices of the Philadelphia Police Department. The defendant Police Commissioner is of the view that the best interests of the Police Department and *456 of the public generally would be neither benefitted nor harmed by increased minority representation in the various ranks within the Department. The overall goal of his administration of the Department is to guard against "outside interference," and to preserve the previously accepted ways of doing things.
17. For the most part, the initial assignment of newly appointed sergeants is to command a squad in the Uniform Patrol Bureau. The primary purpose of the written examination for sergeant is to measure and predict the relevant capabilities of applicants to function as squad leaders in the Uniform Patrol Bureau.
18. Mr. Robert Haney, an employee of the Personnel Department of the City of Philadelphia, conducted a validity study of the various sergeants' examinations administered from 1960 through 1971, using the standardized test score of each of 178 available sergeants in the Uniform Patrol Bureau as the predictor variable. This study, entitled "The Ability of the Written Examination for Police Sergeants to Predict Performance on the Job as a Sergeant," is in the record as Exhibit D-31.
19. A total of 12,900 candidates took the five sergeant examinations between 1960 and 1971. Of these, 2,423 (approximately 19%) passed and were listed as eligible for promotion. However, during that period, only 735 of the applicants actually were promoted to sergeant. Thus, only about 5.7% of the candidates actually were promoted (the "selection ratio"); and it was possible to evaluate the actual performance in the job of sergeant of only 178 of those promoted.
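The percentages in this finding follow directly from the stated counts. The sketch below simply reproduces that arithmetic; the variable names are illustrative.

```python
candidates = 12_900   # took the five sergeant examinations, 1960-1971
passed = 2_423        # listed as eligible for promotion
promoted = 735        # actually promoted to sergeant
evaluated = 178       # sergeants whose job performance could be rated

pass_rate = passed / candidates            # share of candidates who passed
selection_ratio = promoted / candidates    # share of candidates promoted
evaluated_share = evaluated / candidates   # share of candidates later rated

print(f"pass rate:        {pass_rate:.1%}")
print(f"selection ratio:  {selection_ratio:.1%}")
print(f"evaluated share:  {evaluated_share:.1%}")
```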
20. Mr. Haney's validity study, D-31 (hereinafter, the "Haney study") was designed to establish criterion-related validity of the sergeant examinations. A criterion-related validity study identifies the tasks or "performances" which are important to the job in question, and attempts to establish whether or not the test accurately predicts how well the various candidates will perform on the job. A criterion-related validity study is preferable to a content-validity study (designed to determine whether the test results accurately reflect the extent to which the applicants possess the knowledge required in the job) or construct-validity (which, generally speaking, has to do with personality traits, attitudes, psychological make-up, etc.).
21. In studying an examination for criterion-related validity, it is important to select the criteria carefully and accurately, to assess the extent to which the criteria are independent or inter-related, and to obtain an accurate measure of how well the sample group of job holders actually perform on the job, with reference to these criteria.
22. The Haney study proceeded essentially as follows: The job performance of the 178 available sergeants was evaluated by the officers (captains) responsible for supervising them. Each sergeant was rated with respect to four specific criteria, plus an overall rating. The results of these evaluations were then compared with the final test score each sergeant had achieved in the sergeant examination, for the purpose of determining whether those who had achieved high scores on the examination performed better on the job than those who had achieved lower scores.
23. The performance criteria used in the Haney sergeant study were (1) effective supervision, (2) response to complex street situations, (3) interaction with the public, (4) periodic inspection of the work of his men, and (5) overall performance as a sergeant.
24. The Equal Employment Opportunity Commission guidelines and the Federal Executive Agency guidelines recommend the acceptability of a test which demonstrates correlation between the test and criterion performance at a level of significance of .05 or less, using a one-tailed test. The Haney study purports to show correlations at the level of significance of .05 or less with respect to two criteria, interaction with the public and periodic inspection of work of his men, and at a level of significance of approximately .07 with respect to the effective supervision criterion. The study shows *457 negative correlations between test results and performance with respect to "complex street situations" and with respect to overall performance. Other expert witnesses for the defendants, Drs. Bartlett and Schmidt, tended to substantiate Mr. Haney's conclusions. As discussed below, the soundness of these conclusions is sharply challenged by plaintiffs' experts, Drs. Barrett and Siskin.
25. The Haney study was conducted under difficult conditions. The defendants, and particularly the intervening defendant, the Fraternal Order of Police, were very reluctant to cooperate in the special evaluation of the 178 incumbent sergeants. There was widespread apprehension that the results of the ratings might be disclosed, or might have an adverse impact upon the careers of the sergeants being evaluated. It was necessary to impose stringent conditions in order to overcome the FOP's objections. FOP representatives were present as the evaluation forms were filled out by the raters; FOP representatives took physical custody of the forms and locked them in a safe; FOP representatives were present when Mr. Haney noted on each form the race and test score; and FOP representatives then physically excised the name of the individual before delivering the forms to Mr. Haney for purposes of the study.
The raters were not told the purpose of the study, but merely that it was part of a research project; and they were given adequate instructions concerning the method of performing the evaluations and of the need for impartiality and objectivity; and the raters did not know the test scores. In view of the publicity surrounding this litigation, however, it is extremely probable that the raters did know, at least in a general way, the purposes of the research project.
26. The major difficulty with the Haney study is that the performance of only 178 sergeants, approximately 1.4% of those taking the examinations, was evaluated. That is, evaluation of performance was limited to persons who had scored at the very high end of the scale of test scores. Moreover, within that restricted sample, not only was the difference between lowest and highest relatively small, but there was lack of normal distribution in the progression from low to high. In order to project the results of the performance evaluations of this sample and reach permissible conclusions concerning the relationship between test scores and performance for the entire universe of test-takers, very complex, and sharply disputed, mathematical calculations and corrections are required.
27. A further problem with the Haney study is that the evaluations of 17 sergeants (which would have brought the total evaluated to 193) were discarded, solely because their test scores fell at the median of the range of test scores of the officers being evaluated. This was done because of the nature of the mathematical process, described below, by which the correlation coefficients were computed. In essence, Mr. Haney's attempt was to divide the individuals into two groups with respect to their performance evaluations (i. e., whether they were rated above or below the median) and into two groups with respect to their test scores (i. e., whether above or below the median), and to determine whether there was a correlation between the two measures (i. e., whether those in the high group on test scores were likely also to be in the high group of performance ratings). It was, understandably, not feasible to categorize those whose test scores fell precisely at the middle of the range; on the other hand, as plaintiffs point out, excluding those persons from the sample ruled out what might have been a useful comparison between their test scores and their performance ratings (i. e., their performance ratings might have been either higher or lower than the median, and this might have affected the overall results of the study).
There was, however, uncontradicted evidence to the effect that it is probable that the range of performance scores for the excluded group would have been such as to render it unlikely that their exclusion significantly distorted the sample.
*458 28. The "alpha level" is that level of statistical significance at which the null hypothesis will be rejected. The null hypothesis is that the test does not acceptably predict job performance. To reject the null hypothesis means to conclude that the test is an appropriate predictor of job performance. The alpha level recommended by both the FEA and EEOC guidelines is .05.
29. The appropriate statistical test is that statistical test for overall significance in a particular study which is valid at the .05 level (i. e., will improperly reject the null hypothesis not more than 5% of the time), and will give the researcher the most power in resisting false acceptance of the null hypothesis.
30. In the context of validation studies, when the correlation between test score and criterion score is computed based on sample data, the correlation coefficient will be smaller the more unreliable the criterion measure is. This reduction in the magnitude of the correlation coefficient due to criterion unreliability is called attenuation due to unreliability.
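Attenuation can be illustrated with the classical (Spearman) relationship, under which the observed correlation equals the true correlation multiplied by the square root of the criterion's reliability. The sketch below assumes that relationship and assumes the test scores themselves are perfectly reliable; the record does not say which formula the experts used, and the function names and figures are hypothetical.

```python
from math import sqrt


def attenuated_r(true_r, criterion_reliability):
    """Observed correlation expected when only the criterion is unreliable."""
    return true_r * sqrt(criterion_reliability)


def disattenuated_r(observed_r, criterion_reliability):
    """Estimate of the correlation with a perfectly reliable criterion."""
    return observed_r / sqrt(criterion_reliability)


# Hypothetical values: a true correlation of .50 measured against ratings
# with reliability .64 would be observed at roughly .40.
observed = attenuated_r(0.50, 0.64)
recovered = disattenuated_r(observed, 0.64)
```

The less reliable the ratings, the further the observed coefficient falls below the underlying relationship, which is the point of this finding.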
31. In the context of validation studies, test scores and criterion scores have a bivariate normal distribution when the test scores are normally distributed, criterion scores are normally distributed, and the relative frequencies of particular test scores being achieved by individuals with particular criterion scores are also normally distributed. They are normally distributed if, when plotted on a graph, in which scores are plotted along one axis and relative frequencies are plotted along another axis vertical to the first, they produce a "bell-shaped" curve.
32. The Pearson product-moment correlation coefficient is the statistical measure of the extent of linear relationship between two sets of related numbers.
33. The phi-coefficient is the Pearson product-moment correlation coefficient where both variables are inherently dichotomous.
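The relationship between the two coefficients just defined can be shown numerically. The sketch below computes a phi-coefficient from a four-fold table and confirms that it equals the Pearson product-moment coefficient computed on the same data coded 0 and 1. The cell counts are hypothetical and are not drawn from the Haney study.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def phi_from_table(a, b, c, d):
    """Phi for a four-fold table:
    a = high test / high rating, b = high test / low rating,
    c = low test / high rating,  d = low test / low rating."""
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))


# Hypothetical counts after splitting both measures at the median.
a, b, c, d = 30, 20, 20, 30
test_high = [1] * (a + b) + [0] * (c + d)
rated_high = [1] * a + [0] * b + [1] * c + [0] * d

phi = phi_from_table(a, b, c, d)
r = pearson_r(test_high, rated_high)
```

With these assumed counts both computations give the same value, since phi is the Pearson coefficient applied to two-valued variables.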
34. The power of a statistical test of significance is its ability to detect that the null hypothesis is false when it is, in fact, false (i. e., the ability to avoid committing Type II error).
35. The term "restriction in range," in this context, refers to the difference between the highest and lowest test scores in the applicant population.
36. Standard deviation is a measure of the average amount by which scores in a group differ among themselves.
37. The tetrachoric correlation coefficient is the correlation between variables which have a bivariate normal distribution computed from numbers in a four-fold table which have resulted from artificially dichotomizing the variables.
38. Type I error, in this context, consists of falsely determining that a test is valid, when in fact it is not. Type II error, as mentioned above, is the error of concluding that a test is not valid, when in fact it is.
The utility of a test or other selection procedure is its ability to enable the employer to make better selections than would be obtained through random choice.
39. The validity of a test is the extent to which the rank order of test scores of a group of individuals is similar to the rank order of job performance or criterion scores.
40. Reliability in this context means consistency of measurement.
41. The values of two variables are linearly related within a certain range when an increase or decrease in the value of one of the variables is accompanied by a proportional increase or decrease in the value of the other variable.
42. A one-tailed test is designed to detect significant deviation from the null hypothesis in the positive direction. A two-tailed test is designed to detect significant deviation from the null hypothesis in either the positive or negative direction.
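The practical difference between the two tests can be sketched as follows. The example uses the Fisher z approximation to the sampling distribution of a correlation coefficient, which is an assumption of this sketch and not necessarily the procedure followed by any witness; the correlation of .13 and sample size of 178 are hypothetical inputs.

```python
from math import atanh, sqrt
from statistics import NormalDist


def p_values(r, n):
    """Approximate (one-tailed, two-tailed) p-values for a sample
    correlation r from n pairs, via the Fisher z approximation."""
    z = atanh(r) * sqrt(n - 3)
    std_normal = NormalDist()
    one_tailed = 1.0 - std_normal.cdf(z)               # positive direction only
    two_tailed = 2.0 * (1.0 - std_normal.cdf(abs(z)))  # either direction
    return one_tailed, two_tailed


# Hypothetical inputs, for illustration only.
one, two = p_values(0.13, 178)
negative_one, negative_two = p_values(-0.13, 178)
```

For a positive correlation the one-tailed p-value is half the two-tailed value, so a result may satisfy a .05 alpha level on a one-tailed test and fail it on a two-tailed test. A negative correlation can never be significant on a one-tailed test in the positive direction.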
43. The defendants retained Dr. Claude Bartlett, Professor of Psychology and Chairman of the Psychology Department at the University of Maryland, and Dr. Irwin L. Goldstein, Professor of Psychology at the University of Maryland, both of whom are *459 experts in employee selection and test validation, and who together are the principals in Training and Educational Research Programs, Inc. (hereinafter TERP). These gentlemen performed three separate validity studies for the position of police sergeant in the Philadelphia Police Department. TERP Project "A" was an analysis of the Haney sergeant study discussed above. TERP Project "B" is a study of the validity of the sergeant promotional examinations by TERP itself, using two measures of the criterion of "promotability," related to comparisons between test scores on the sergeant examination and test scores on the 1971 lieutenant examination. TERP Project "C" is an analysis of the relationship of the sergeant's examination to three factors of sergeant job performance devised by TERP, namely, a "high relevance item criterion," a ranking criterion, and a total score criterion.
In TERP Project "A", TERP analyzed the Haney study, reworked the data in order to eliminate certain errors in the Haney study which had been pointed out by plaintiffs' expert, Dr. Siskin, and concluded that Haney's results were sound.
The methodology for TERP Project "B" was to correlate, using the Pearson product-moment correlation, the sergeant's promotional examination with the performance on subsequent lieutenant examinations, using the 1971 lieutenant examination.
44. The 1971 lieutenant examination contained 120 questions, of which 90 were identical to items found in a contemporaneous sergeant's examination (but none of which had appeared, at least in identical form, in earlier sergeant examinations). Thirty of the questions on the 1971 lieutenant examination were unique to that examination.
45. Not surprisingly, TERP Project "B" concluded that there is a statistically significant positive correlation between scores on the sergeant examination and scores on both the total lieutenant examination, and on the 30-question subpart unique to the lieutenant examination. Given the fact that all of the persons whose scores on the lieutenant examination were being analyzed had previously scored at the extreme upper range of a sergeant examination similar to the 90-question subpart of the lieutenant examination (i. e., three-fourths of the questions on the lieutenant examination) the only remarkable aspect of TERP Project "B" is that the correlation was considerably less than was to be expected.
46. TERP Project "C" compared the examination scores of 170 sergeants (converted to standard score) with three factors of sergeant job performance devised by TERP by means of the following process: From a review of job analysis for the position of sergeant prepared earlier by Mr. Haney, in the normal course of his duties, some 91 "task statements" were prepared, and submitted to the Police Department for editorial review and comment by knowledgeable persons. In final form, these 91 task statements were then submitted to 133 police lieutenants, in the form of a behavioral description check list, for the purpose of determining the applicability of each task statement to the job of sergeant, and the relevance of each task statement in discriminating between good and bad performers. Thereupon, a factor analysis was performed, and the most significant task descriptions were obtained. The end result of this process was the identification of six criteria: supervisory support, interaction with the public, and supervisory structure; a group of six high-relevancy indices; a rating comparison between a particular sergeant and other sergeants being supervised by the rating officer; and finally, the composite total score of the first three factors.
Thereupon, the 170 sergeants were rated by means of a questionnaire, under conditions similar to those involved in the Haney study.
TERP Project "C" concluded that statistically significant, uncorrected Pearson product-moment correlations were established with respect to two of the criteria, factor 3 and total criteria. If one were to indulge the unrealistic assumption that the criteria were totally independent, the average uncorrected Pearson product-moment coefficient for all six criteria would be .115.
*460 47. In both the Haney sergeant study and the TERP projects, it was necessary to engage in a series of statistical corrections in order to arrive at the ultimate conclusions. Certain adjustments were necessary because of the extreme restriction of range in the sample; because of artificial dichotomization and the absence of normal distribution in the sample; because of the limited number of raters; and because the performances were being evaluated within each of many squads of officers (i. e., the rating of each officer was in relation to other officers in his squad, rather than in comparison with all of the other officers being evaluated). In these circumstances, the phi-coefficient constitutes an inaccurate estimate of the Pearson product-moment. The tetrachoric coefficient would constitute an accurate estimate of the Pearson product-moment if bivariate normality were present. Artificial dichotomization produces an inaccurate tetrachoric coefficient, but this inaccuracy is least when the dichotomization occurs at the median (as it did here). The squad comparability problem introduces an element of random error; the consequence of random error is to depress the correlations found, if there is true positive correlation.
In fact, each of the corrections and adjustments made in the Haney and TERP studies would tend to reduce the possibility of type I error.
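One commonly used adjustment for restriction of range is the formula often called Thorndike's Case II, which scales the observed correlation by the ratio of the unrestricted to the restricted standard deviation of test scores. The sketch below assumes that formula for illustration; the record does not identify the specific correction applied in the Haney or TERP studies, and the standard deviations shown are hypothetical.

```python
from math import sqrt


def correct_range_restriction(r_restricted, sd_unrestricted, sd_restricted):
    """Estimate the correlation in the full applicant population from the
    correlation observed in a range-restricted sample (Thorndike Case II,
    assumed here for illustration)."""
    u = sd_unrestricted / sd_restricted
    r = r_restricted
    return r * u / sqrt(1 - r ** 2 + (r ** 2) * (u ** 2))


# Hypothetical: the applicant pool has twice the score spread of the sample.
corrected = correct_range_restriction(0.20, sd_unrestricted=20.0, sd_restricted=10.0)
unchanged = correct_range_restriction(0.20, sd_unrestricted=10.0, sd_restricted=10.0)
```

When the sample's spread of scores is narrower than the applicant pool's, the corrected coefficient is larger than the observed one; when the spreads are equal, the coefficient is unchanged.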
48. The Brogden Utility Model is an approach to interpreting the utility of a test which was developed by Hubert Brogden of Purdue University in 1946. On this model, test utility is linearly related to the validity coefficient; that is, a test with a validity coefficient equal to X has twice the utility of a test with a validity coefficient of one-half X, and is X percent as good as a perfect test on that criterion.
49. On the basis of TERP Project "C," the Pearson product-moment coefficient, after correction for restriction in range, is .21 for factor 3 and .21 for total criteria. Applying the Brogden approach to practical utility, this suggests that, as to these two criteria, the test is 21% as good as a perfect test would be.
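The linear relationship described in the Brogden model reduces to simple arithmetic. The sketch below restates it; the function name is hypothetical, and the only figure taken from the record is the corrected coefficient of .21.

```python
def brogden_relative_utility(validity):
    """Fraction of a perfect test's benefit attributed to a test with the
    given validity coefficient, under the linear Brogden model."""
    if not 0.0 <= validity <= 1.0:
        raise ValueError("validity coefficient must be between 0 and 1")
    return validity


corrected_validity = 0.21   # factor 3 and total criteria, TERP Project "C"
percent_of_perfect = 100 * brogden_relative_utility(corrected_validity)

# Doubling the validity coefficient doubles the utility on this model.
ratio = brogden_relative_utility(0.42) / brogden_relative_utility(0.21)
```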
50. Taylor-Russell Tables are a set of utility tables which are an accepted method of determining the utility of a test or other selection mechanism. In order to use such tables, it is necessary to estimate the percentage of job applicants who would perform adequately if selected at random. For a given correlation coefficient (accurately reflecting the Pearson product-moment for the applicant population) and a given selection ratio, the tables disclose the percentage of applicants who would be likely to perform at or above the desired level if the test were used to select them. If the test will produce substantially better results than random selection, the test has "utility." If the improvement over chance is only slight, the utility of the test is slight. The greater the utility of the test, the greater the likelihood that the employer would find it advantageous to use the test; but the ultimate decision as to whether the utility of the test is sufficient to justify the expense and other burdens associated with administering it is up to the employer.
51. All of the pertinent guidelines, and all of the expert witnesses in this case, recognize that the smaller the selection ratio, the less of a margin of utility is required. For example, if only 5% of the applicants are to be hired, and the employer wishes to hire the persons who would rank in the top 5% of performers, his chances of achieving that result through random selection would be 5%, whereas a test with a reliability of .40 would provide a 19% chance of success. If he desired to hire the top 10%, he would have a 10% chance of achieving that goal through random selection, but the test would increase his chances to 31%. At 25%, the test would only improve his chances to 39%. Thus, as the percentage of applicants to be hired increases, the difference between random selection and a .40-utility test decreases.
52. Mr. Haney also conducted a validity study of the examinations which were given from 1966 through 1974 for the position of detective, using the standardized test score for each of the 176 available detectives (152 *461 white, 24 black) in the field divisions as the predictor variable (hereinafter, the "Haney detective study").
53. A total of 7,554 candidates took the four detective examinations from 1966 through 1974, of whom 1,973 (26%) passed and were listed as eligible. A total of 574 promotions to the rank of detective were made from these eligible lists. Thus, the selection ratio for the position of detective was 7.6%.
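The percentages in finding 53 follow directly from the three counts stated there, as this brief check shows. The variable names are illustrative; the last figure (promotions as a share of those listed as eligible) is not stated in the finding and is derived here only from the same counts.

```python
# Counts as stated in finding 53.
candidates = 7554   # took the four examinations, 1966 through 1974
eligible = 1973     # passed and were listed as eligible
promoted = 574      # promoted to detective from the eligible lists

pass_rate = eligible / candidates
selection_ratio = promoted / candidates
promoted_share_of_eligible = promoted / eligible  # derived, not in the finding

print(f"pass rate: {pass_rate:.1%}")
print(f"selection ratio: {selection_ratio:.1%}")
print(f"promoted as a share of eligibles: {promoted_share_of_eligible:.1%}")
```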
54. On the basis of job analyses and consultation with command personnel in the Philadelphia Police Detective Bureau, Mr. Haney established a list of 10 performance criteria. The basis of selection was that the behavior should be important to the function of a detective in the field division; that the behavior should be performed with reasonable frequency; and that the behavior be capable of being rated with reasonable accuracy by both the sergeant and the lieutenant who supervised the detective. The criteria selected were: (1) interviewing, (2) preservation of physical evidence, (3) prompt submission of 75-49 reports, (4) utilization of sources of information, (5) personally interviews known criminals, (6) promptness and frequency of 75-52 reports, (7) identification of a suspect, (8) physical arrest by detectives, (9) completeness, thoroughness and comprehensiveness of 75-49 and 75-52 reports, (10) obtaining multiple clearances.
55. 176 detectives were actually subjected to analysis in the Haney detective study. The statistical power of the study was approximately .50; that is, there was only a 50% chance of detecting a statistically significant relationship between test score and criterion measure when there actually was such a relationship.
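The notion of statistical power in finding 55 can be made concrete with a rough calculation. The finding does not state what true correlation or what method underlies the .50 figure; the sketch below assumes a two-sided test at the .05 level and uses the Fisher z approximation, both of which are assumptions introduced here for illustration.

```python
from math import atanh, sqrt
from statistics import NormalDist


def correlation_power(true_correlation, sample_size, alpha=0.05):
    """Approximate chance that a two-sided test at level alpha detects a true
    correlation of the given size (Fisher z approximation)."""
    std_normal = NormalDist()
    critical = std_normal.inv_cdf(1 - alpha / 2)
    shift = atanh(true_correlation) * sqrt(sample_size - 3)
    upper_tail = 1 - std_normal.cdf(critical - shift)
    lower_tail = std_normal.cdf(-critical - shift)
    return upper_tail + lower_tail


# Sample size from finding 55; the true correlations tried are hypothetical.
for rho in (0.10, 0.15, 0.20, 0.30):
    print(f"true correlation {rho:.2f}: power = {correlation_power(rho, 176):.2f}")
```

Under these assumptions, a sample of 176 gives roughly even odds of detecting a true correlation in the neighborhood of .15, and substantially better odds for larger correlations.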
56. The Haney study concluded that correlation coefficients at the required significance level of .05 or less were obtained with respect to three of the criteria: "utilizing sources of information," "promptness and frequency of 52s," and "physical arrest by detective."
57. After correction for the three sources of error, namely, restriction in range, rater reliability, and non-comparability across squads, the Haney detective study concluded that the corrected Pearson product-moment coefficients for these three criterion measures were .39, .45, and .33, respectively.
58. For the reasons discussed in connection with the Haney sergeant study, it is probable that the Haney detective study somewhat understates the correlations between test score and criteria, if there are true correlations.
59. The average corrected correlation of all of the obtained correlations in the Haney detective study is .268.
60. If one were to assume that 50% of all applicants would perform at the desired level of performance if randomly selected, and if one were to assume further that the test has a validity coefficient of .25, the test would have some utility, in that, applying the Taylor-Russell Tables, such a test would select persons, 68% of whom would perform above the 50% level.
61. The defendants presented no evidence in support of the criterion-related validity of the examination for corporal. The defendants did conduct such a study, but when an initial review of the collected data did not show statistically significant correlations, the study was abandoned; its results have not been reported.
62. According to defendants' witnesses (primarily Dr. Schmidt), a criterion-related validity study for the position of corporal was not technically feasible, as that term is used in the profession and in the applicable guidelines, because the sample was too small, and because it was impossible to obtain adequate and reliable ratings of corporals' performance on the job. This latter difficulty was attributed in large measure to the fact that much of a corporal's work is done out of the presence of his immediate superiors.
63. Dr. Schmidt's views concerning the size of the sample needed for a technically feasible validity study are not shared by most other experts in the field.
*462 64. While the problems of sample size and difficulty in obtaining reliable ratings of performance would reduce the statistical power of a criterion-related validity study for the position of corporal, the record does not justify a finding of technical infeasibility.
65. Mr. Haney and his staff performed content-validity studies of the examinations for all of the positions discussed above, corporal, sergeant, and detective. On the basis of job analysis and classification questionnaires, duty statements and audit reports from personnel technicians who conducted interviews in the field (with incumbents, their immediate supervisors, and high-level supervisors), job content domains were established. Each of the promotional examinations was then analyzed for the purpose of establishing whether or not the test items corresponded to important aspects of the job in question. These analyses establish that there is a high degree of correlation between the test items and important aspects of the job.
66. All of the written promotional examinations discussed above tend to be rather poorly constructed. A great many of the questions are excessively wordy; a high percentage of the questions (all of which are multiple choice) have non-parallel distractors or other flaws; and the tests contain an excessively high percentage of multiple-keyed items (i. e., items where more than one answer would be correct).
Overall, approximately 15% of the questions proved to have been multiple-keyed. Moreover, in each instance the fact of multiple keying was established only by reason of complaints from one or more of the test-takers; it is conceivable that more of the questions should have been multiple-keyed, but were not the subject of protests.
67. The tests were further flawed by the inclusion of too many items which were too easy (questions which every test taker answers correctly do not help to discriminate between applicants; they may be useful in a test to determine minimum qualifications, but not in a test which is used to rank applicants).
DISCUSSION
Before discussing the ultimate factual findings and conclusions which may properly be drawn from the evidence in this case, certain threshold arguments made by the defendants must be considered. Briefly, the defendants appear to contend that, by reason of such cases as Washington v. Davis, 426 U.S. 229, 96 S. Ct. 2040, 48 L. Ed. 2d 597 (1976), plaintiffs must be denied relief unless they can show that specific individuals have intentionally been discriminated against on grounds of race. That argument must be rejected in a Title VII context. A prima facie case is made out by a showing of disparate impact, and may be established by statistical evidence alone. Hazelwood School District v. U. S., 433 U.S. 299, 97 S. Ct. 2736, 53 L. Ed. 2d 768 (1977); Dothard, et al. v. Rawlinson, et al., 433 U.S. 321, 97 S. Ct. 2720, 53 L. Ed. 2d 786 (1977).
The defendants have also mounted a Tenth Amendment argument on the basis of National League of Cities v. Usery, 426 U.S. 833, 96 S. Ct. 2465, 49 L. Ed. 2d 245 (1976). I reject that argument also. See Fitzpatrick v. Bitzer, 427 U.S. 445, 96 S. Ct. 2666, 49 L. Ed. 2d 614 (1976) (Brennan, J., concurring). The Congressional interest in preventing racial discrimination is markedly different from the Congressional interest in enforcing minimum wage legislation; state and local governments' interest in determining wage levels for their own employees differs markedly from their (totally non-existent) interest in perpetuating practices, even facially neutral practices, which have discriminatory racial impact.
Finally, I believe there can be no question concerning plaintiffs' standing to challenge the continued use of the promotional examinations. I do not believe this issue requires discussion.
Decision of the merits of this case has been rendered more than usually difficult by the sharp disagreements among experts in the arcane field of statistical analysis and test measurement, an area which this Court enters with great trepidation. This is an *463 uncertain and rapidly developing area of expertise. Practitioners in the field appear to fall into two camps, those who strongly favor affirmative action and thus would require fairly stringent proof of validity and utility in order to justify continued use of a selection process which has racially discriminatory impact, and those who feel that standard selection procedures which are not intentionally discriminatory should not lightly be discarded merely because they do have discriminatory impact. The EEOC guidelines tend to reflect the views of the former group, the FEA guidelines tend to reflect the views of the latter group, and the American Psychological Association appears to be divided on the subject.
This Court's decision of the factual issues in this case has been impeded by the tendency of the expert witnesses on both sides, all of whom are honest and well-intentioned, to overstate their positions.
Everyone agrees that the Philadelphia Police Department should strive to promote only the best qualified candidates, regardless of race. There is no suggestion that highly qualified whites should be passed over in favor of less well-qualified blacks. The problem is to determine just who is truly well-qualified and who is not as well qualified.
The defendants are required by law to base promotions upon competitive examinations; and there is a natural tendency to assume that those who achieve the highest scores in such examinations are in fact "better" than those who do not score quite as high. But if the difference in scores stems from test items which are only slightly relevant to the job, a modest difference in test scores really does not mean that the higher-scoring applicant is better qualified for the job.
The racially discriminatory impact of the tests here under consideration results from the fact that, although those blacks who scored in the top 5% and were therefore promoted scored just as well as whites in that top 5%, and performed just as well on the job, a smaller proportion of blacks achieved a top score than in the case of whites. Interestingly enough, there is reason to believe that, if the selection ratio were increased to include the top 40-50% on the test scores, blacks would be represented as well as, and probably somewhat better than, whites.
It must be remembered that no test even approaches total accuracy of prediction. That is, the tests here in question do not produce assurance that all, or even a majority, of the top 5% test scorers will actually rank among the top 5% of performers. In short, use of these tests, or any other selection mechanism, excludes many persons from promotion who would do as well or better on the job than those who are promoted.
Another point to be emphasized is that the probability is very great that the overwhelming majority of all of the applicants taking the tests would perform reasonably well if promoted. Indeed, there is some evidence, in the form of the testimony of defendants' employee Ms. Evans (later slightly modified) to the effect that all but about 15 of the 3800-odd persons who took the 1975 sergeant examination scored well enough to be regarded as having "passed" that examination. That is, the cutoff point which differentiates between those who are placed on the eligibility list and those who are not is governed by the estimated number of vacancies to be filled.
I agree with plaintiffs' argument that the disproportionate impact upon blacks in these promotional examinations would be greatly reduced if the defendants were to fix a reasonable "passing," or cut-off, score to establish a pool of qualified applicants, and choose those persons to be actually promoted by random selection, or on the basis of seniority; and that, if that procedure were adopted, there would be very little likelihood of an appreciable decline in performance levels of those promoted, as compared with the performance level of those promoted under the present system.[1] *464 But if selection by rank order of test scores can be shown to increase, to a statistically significant extent, the likelihood of improving the quality of performance, this Court is not empowered to mandate the change suggested by plaintiffs, however sensible that might seem.
After carefully considering the mass of testimony and exhibits in this case, I have concluded that the evidence as a whole slightly preponderates in favor of a finding that criterion-related validity has been established with respect to one or more (see individual findings above) of the criteria, in the sergeant and detective examinations, and a similar showing of content validity with respect to the corporal, sergeant, and detective examinations. In each case, however, the correlations are quite small, with the result that the utility of the examinations is relatively low.
The EEOC guidelines (§ 1607.5(c)(2)), while requiring that "in addition to statistical significance, the relationship between the test and criterion should have practical significance" also state that "the larger the proportion of applicants who are hired for or placed on the job, the higher the relationship needs to be in order to be practically useful. Conversely, a relatively low relationship may prove useful when proportionately few job vacancies are available" and that "the smaller the economic and human risks involved in hiring an unqualified applicant relative to the risks entailed in rejecting a qualified applicant, the greater the relationship needs to be in order to be practically useful. Conversely, a relatively low relationship may prove useful when the former risks are relatively high." The FEA guidelines counsel a similar approach.
In the present case, the selection ratio is extremely small. And, while I agree with plaintiffs that the defendants' argument tends to exaggerate the risk factor, I must nevertheless recognize that there are serious risks attending a failure to achieve a high level of performance throughout the Police Department.
None of the administrative guidelines referred to in this case have the force of law. They are, however, entitled to deference, by reason of the high degree of expertise which has entered into their formulation. While the EEOC guidelines mandate criterion-related validity unless that approach is technically infeasible, the FEA guidelines accept validation on the basis of criterion-validity, content-validity or construct-validity. As set forth above, with respect to the corporal examination, I am not satisfied that the technical infeasibility of criterion-validity studies has been established, although I have no doubt that the performance variable would be difficult to establish. But I find it difficult to justify imposing upon these defendants more stringent requirements than are mandated for federal executive agencies, and I have therefore concluded that the content-validity of the corporal examination, minimally established on this record, must be deemed acceptable.
With respect to all of the tests, then, we have (1) rather low correlations, (2) very low practical significance, (3) very high risks associated with promoting unqualified persons, and (4) substantial risks associated with wrongful rejection of qualified applicants.
It is appropriate, also, to consider carefully the consequences of rejecting the tests as a selection mechanism. There are undeniable institutional advantages in preservation of a "merit selection" system; indeed, it is required by law in this instance. Of course, Civil Service regulations cannot be permitted to preserve a selection process which violates Title VII. But in interpreting and applying Title VII and the pertinent guidelines, doubtful issues should be resolved in favor of preservation of the *465 traditional ranking approach contemplated by Civil Service requirements. Moreover, it appears probable that, in the long run, the functioning of the Police Department and the morale of its personnel would be adversely affected if selection of persons to be promoted to supervisory positions were to be based, even partially, upon random selection. And it is at least arguable that if seniority in rank were to play a decisive role in selection from among a pool of qualified applicants, younger officers might be reluctant to apply for promotion, and there would be less incentive among the applicants to prepare themselves for the promotional examination. In short, there are decided institutional advantages in the competitive process itself.
All of the foregoing discussion assumes, of course, that racial bias plays no role in the design and implementation of the promotional examinations. There is no assertion that conscious racial prejudice is involved at any stage. There is, however, room for the argument that, when performance criteria are defined in terms of the perceptions of white incumbent high-level supervisors, the standards which emerge may be unwittingly slanted. For example, in connection with the third performance criterion used in the Haney sergeant study, "interaction with the public," the criterion was defined as follows:
"Please rank your sergeants on their ability to effectively interact with the public. Examples of this are handling complaints and keeping merchants happy. It is an ongoing behavior to develop and maintain good, effective relations with the public he serves."
The question of whether this truly represents a complete and accurate description of good "interaction with the public" is fairly debatable (the correctness of this Court's findings in Goode (COPPAR) v. Rizzo, 357 F. Supp. 1289 (1973), aff'd 506 F.2d 542 (3d Cir. 1974), rev'd, sub nom. Rizzo v. Goode, 423 U.S. 362, 96 S. Ct. 598, 46 L. Ed. 2d 561 (1976) was not challenged on appeal). But for present purposes, the point to be made is that "keeping merchants happy" might be understood as importing the racial attitudes of that segment of the community. A white rater might subconsciously assume that a white sergeant would be better able to meet that criterion.
It does appear, however, that there was no disparity in any of the ratings, as between white and black incumbents, and it is therefore reasonable to conclude that even subconscious racial bias did not taint the rating process.
It is beyond the province of a federal district court to impose its own perceptions of what constitutes good performance as a police officer, corporal, sergeant or detective, or any other rank. It is equally beyond the province of a federal district court to control management decisions concerning the acceptability of risks involved in alternative methods of selection for promotion within the Police Department, so long as such decisions do not serve to mask racial discrimination. If I were free to do so, I would agree with the plaintiffs that strict compliance with the EEOC guidelines ought to be required of any governmental agency before it is permitted to utilize selection procedures which have discriminatory impact. But that battle has, at least temporarily, been lost at the appellate level. See, e. g., Washington v. Davis, supra.
I am likewise persuaded that this Court's power to require the defendants to improve the quality of their testing procedures and analyses of their testing procedures is limited to matters affecting racially discriminatory impact.
Balancing all of the foregoing considerations, I have concluded that, insofar as plaintiffs seek to restrain the defendants from effectuating promotions from eligibility lists established pursuant to examinations heretofore administered, the Complaint must be dismissed. But in view of the rather obvious fact that these tests do serve to disqualify blacks at a disproportionately high rate compared with whites, and in view of the closeness of the question as to whether the minimal showing of job-relatedness has been made in this case (i. e., the sharp conflict among experts as to *466 whether the guidelines have been complied with, and the distinct possibility that this Court's assessment of the evidence may later prove to have been erroneous), I have concluded that certain minimal steps should be required of the defendants with respect to future tests and future eligibility lists.
As set forth in the Findings of Fact, no differential item analysis has ever been performed with respect to any of the written examinations, for the purpose of determining whether particular questions may be the cause of a substantial portion of the racial differences in test scores. That omission, if uncorrected, might very well tip the scale in favor of a finding of discriminatory policy. It must be emphasized that this Court makes no such finding at this time, nor have the plaintiffs urged such a finding. But if discriminatory impact can be reduced, by minor clarifications or modifications of particular questions or phraseology, without any material decrease in the utility of the test as a screening device and at little or no cost, it is reasonable to conclude that the defendants should be perfectly willing to undertake such differential analyses. Performance of differential item analyses on all of the tests here in question, and upon any similar tests which may be used in the future for the purpose of determining eligibility for promotion, would thus serve two significant purposes: It would be useful as a check upon the validity of this Court's validity and utility findings (that is, it would either reinforce the validity of the tests, or demonstrate what changes would make them have less adverse racial impact), and it would also help further to rule out any successful charge of an unlawfully discriminatory policy.
ORDER
AND NOW, this 23rd day of January, 1979, it is ORDERED that:
1. Except as set forth in this Order, insofar as plaintiffs seek to enjoin promotions within the Philadelphia Police Department on the basis of existing eligibility lists established through written examination, the Complaint is DISMISSED.
2. As a condition of the continued use of existing written examinations, or written examinations similar thereto for the establishment of future eligibility lists for such promotions, the defendants shall perform, or cause to be performed, by competent personnel, racial differential item analyses of such examinations, and shall report the results thereof to plaintiffs' counsel and to this Court. The Court reserves jurisdiction to consider and determine such other or further relief as the parties may contend would be appropriate in light of said differential item analyses.
3. Nothing in this Order is intended to preclude the defendants from taking steps to improve the quality, validity, and utility of any such examination, either through the use of their own personnel, or through consultation with outside experts.
NOTES
[1] Plaintiffs' witnesses appear to have somewhat overstated their argument in this respect.
I agree with the defendants' contention that, apropos the performance variable, to say that a particular criterion-statement describes an incumbent "moderately well" does not mean that, with respect to that incumbent, there is no significant risk of harm from his failure to perform with respect to that criterion. In short, it would be impermissible to conclude that the defendants ought to be satisfied with any selection mechanism which minimizes the likelihood of selecting candidates falling below that standard of performance.