[UPDATE: The original version of this post included a small error that may have added to the confusion on this issue. See my corrections below]
After the appearance of newly leaked exit poll data last week, one commenter on this site asked about some apparent conflicts in the number of interviews the National Election Pool (NEP) reported doing in their "national" sample on Election Day. The answer – chased down in part by another alert MP reader/commenter – is that a complex wrinkle in the weighting procedure created an artificially high "unweighted" total that slipped past the exit poll webmasters at CNN and CBS. As such, while the exit poll tabulations on those sites are correct, the total number of interviews that appears for the national survey is in error.
Let’s start with a statement I made last week:
Keep in mind that the 7:33 PM sample from election night was incomplete. It had 11,027 interviews, but the next day NEP reported 13,660. The missing 2,633 interviews, presumably coming mostly from states in the Midwest and West, amounted to 19% of the complete sample [emphasis added].
Alert commenter "Luke" pointed out that my conclusion seemed to conflict with information provided in the official methodology statement on the national exit poll on the official National Election Pool (NEP) website:
The National exit poll was conducted at a sample of 250 polling places among 11,719 Election Day voters representative of the United States. In addition, 500 absentee and/or early voters in 13 states were interviewed in a pre-election telephone poll.
As Luke points out, those numbers clash with those in the pdf files: 11,719 + 500 = 12,219. That total is quite a bit less than the 13,660 interviews in the "final" tabulations available in the pdf on the Scoop web site and displayed in the "final tabulations" now posted on the CNN and CBS websites. It is also less than the 13,047 interviews in the screen shot of the national survey taken by Jonathan Simon from CNN at 12:23 a.m. on November 3. Luke suggested that these totals indicate "much interview-stuffing on Nov3 to produce the requisite swing to Mr Bush."
What’s the story?
After having a busy morning away from the blog, I put in a call to Edison/Mitofsky Research (EMR) to see if they could help clear up the mystery. It turns out that Rick Brady, the author of the blog Stones-Cry-Out and a ubiquitous presence in MP’s comments section, had already emailed EMR with the same question, and posted the answer in the comments section of Friday’s post. I have copied the email to Rick after the jump, but let me try to explain. Unfortunately, it gets a bit technical and confusing, so bear with me, as it does explain the apparent inconsistency.
Here’s the gist: The confusion stems mostly from the way NEP applied the split sample design used on the national sample to the 500 interviews done by telephone among early and absentee voters. If you look at the questionnaire PDF for the national exit poll, you will notice that it has four different versions. The obvious questions (the vote for president, for Congress and the basic demographics) appear on all four versions, while other questions (such as those about Iraq, gay marriage or cell phone usage) appear on only one form or perhaps two. This is a common survey practice that allows the pollster to ask more questions while keeping to a reasonable average questionnaire length. The main disadvantage of "splitting the form" that way is greater sampling error for those questions asked of a random half or a quarter of the sample. In this case, however, even the quarter samples (with over 3,000 respondents) were large by conventional polling standards.
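To put the "greater sampling error" point in rough numbers, here is a back-of-the-envelope calculation under simple random sampling assumptions, ignoring any cluster design effect; the sample sizes are the ones from the methodology statement quoted above:

```python
import math

# Back-of-the-envelope only: 95% margin of error for a 50/50 proportion under
# simple random sampling, ignoring the cluster design effect discussed elsewhere.
def moe(n, p=0.5):
    return 1.96 * math.sqrt(p * (1 - p) / n)

full_sample = 11_719 + 500          # Election Day interviews plus telephone respondents
quarter_sample = full_sample // 4   # questions asked on only one of the four forms

print(f"full sample (n={full_sample}):    +/- {moe(full_sample):.1%}")    # about +/- 0.9%
print(f"quarter sample (n={quarter_sample}): +/- {moe(quarter_sample):.1%}")  # about +/- 1.8%
```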
All of this is reasonably clear from the official NEP documents. The confusion arises from the way they handled the 500 interviews done by telephone among those who voted early or by absentee ballot. For the telephone surveys, NEP did not use a split form. These 500 respondents had a very long interview that asked every question that appeared on any of the four forms.
Why would this inflate the sample size? The programmers at NEP used a shortcut to enable separate tabulations of the split form questions: They replicated the data for each of the telephone interviews four times in the data file, so that every telephone respondent had a separate record to match each of the four forms used for the in-person data.
Follow that? If not, just understand that in the unweighted data file, the 500 telephone interviews were quadruple counted. This procedure did not throw off the tabulations, because when they ran tabulations for all forms in the full national sample, they weighted the telephone interviews down by the value of 0.25. However, the PDF crosstabs (as reproduced on the CNN and CBS websites) show the unweighted number of interviews, without labeling them as such.
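To make the bookkeeping concrete, here is a minimal sketch of that scheme; the code and field names are mine, not NEP's, and the interview counts are the ones from the methodology statement quoted above:

```python
# A minimal sketch (not NEP's actual code or field names) of how replicating
# the 500 telephone interviews four times inflates the *unweighted* count
# while leaving weighted tabulations unchanged.

IN_PERSON = 11_719   # Election Day interviews at polling places
PHONE = 500          # pre-election telephone interviews with early/absentee voters

records = []

# Each polling-place respondent filled out only one of the four forms.
for i in range(IN_PERSON):
    records.append({"form": (i % 4) + 1, "weight": 1.0, "mode": "in_person"})

# Each telephone respondent answered every question, so the record is copied
# once per form, and each copy is weighted down to 0.25.
for i in range(PHONE):
    for form in (1, 2, 3, 4):
        records.append({"form": form, "weight": 0.25, "mode": "phone"})

unweighted_n = len(records)                      # 11,719 + 4*500 = 13,719
weighted_n = sum(r["weight"] for r in records)   # 11,719 + 500   = 12,219

print(unweighted_n, weighted_n)
```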
Now those of you who are following this discussion with calculator in hand may notice that the numbers still do not quite add up. They did 11,027 [corrected: 11,719] interviews at polling places and 500 interviews by telephone for the national exit poll. The unweighted total would be 11,027 [corrected: 11,719] + (500*4) = 13,027 [corrected: 13,719]. The total in the Scoop PDFs (and on CNN and CBS) was 13,660. That still leaves the total 633 interviews short of the number listed on the presidential cross tabs.
UPDATE: Let’s try that again. I originally used the wrong number in the paragraph above. NEP reported doing 11,719 interviews plus the 500 done by telephone. Thus, their "unweighted" total was 13,719 — 59 respondents more than the 13,660 that appears on the Scoop, CNN and CBS crosstabs. Now, continuing with what I originally wrote…
Turns out that the difference is the very small number of respondents who left the question on the presidential vote blank (or who would not provide an answer on the telephone survey). Unfortunately (for an analysis of "undervoting" – a topic for another day) we do not know how many of these 633 missing respondents were from polling place interviews and how many were the quadruple counted telephone respondents. The email to Rick Brady has one clue: They say, as an example, that in Alabama 4 of 740 respondents (0.5%) – all interviewed at their polling place – left the presidential vote question blank.
UPDATE – NEP emails with this additional bit of information: "The difference of 59 respondents comes from 31 respondents who answered that they did not vote for president and 28 respondents who omitted that question. This is 0.4% of the respondents which is very similar to the 0.5% omits that we had in Alabama for example."
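For anyone still following with calculator in hand, the reconciliation implied by NEP's figures works out as a simple check (the numbers are the ones quoted above and in NEP's email, not my own data):

```python
# Reconciling the unweighted totals, using the figures quoted above from the
# NEP methodology statement and NEP's follow-up email.
polling_place = 11_719
phone = 500

unweighted_total = polling_place + 4 * phone   # 13,719 records in the data file
crosstab_total = 13_660                        # shown on the Scoop/CNN/CBS crosstabs

missing = unweighted_total - crosstab_total    # 59
did_not_vote_for_president = 31
omitted_the_question = 28
assert missing == did_not_vote_for_president + omitted_the_question

print(f"{missing / unweighted_total:.1%} of respondents")   # about 0.4%
```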
So, what we have is a very confusing bit of data processing – so confusing that it also fooled the folks at CBS and CNN that put the numbers online. Much as some would hope otherwise, it is not evidence of anything sinister.
UPDATE: The confusion was compounded by my own human error. Responsibility for this last goof is mine alone. Apologies to all and thanks to Luke (see the comments section) who ultimately caught my error.
A big hat tip to Rick Brady for taking the initiative on this issue. As a result, this is not news to those of you who have been reading the comments. The full text of the email he received from NEP follows after the jump. Tomorrow, I’ll try to take up the issue of the confusing interview counts in the regional crosstabs…
UPDATE: Make that Wednesday. I’ve got to stop posting late at night without a thorough proof-reading.
Rick, The CNN web site is displaying the number of unweighted cases that are being used in the crosstab for the presidential vote.
The methodology statement includes the actual number of respondents who filled out the questionnaire.
These two numbers can differ for two technical reasons:
The first reason is that some respondents fill out the questionnaire but skip the presidential vote question. For example in Alabama the CNN site shows 736 respondents. The methodology statement shows 740 respondents. This is because 4 respondents chose not to answer the question on how they voted for president and are not included in those crosstabs but they are still included in the data file because they may have filled out how they voted in other races that day such as the Senate race.
The second reason is that respondents from the national absentee/early voter telephone survey received all four versions of the national questionnaire while election day respondents only received one version of the national questionnaire. Thus, these respondents are included 4 times in the unweighted data (once for each version of the questionnaire) but their survey weights are adjusted down so that in the weighted data each national absentee/early voter telephone survey respondent only represents one person.
Again the methodology statements state the correct number of total respondents interviewed.
I hope that this explains the differences in how these numbers were reported.
Jennifer
Mark,
Hi! I’ve been reading up on exit-polling methodology, as suggested. I really do want to learn more about all this, rather than merely continue to speculate blindly (I know there’s more than enough of that going around, and I’m sure I’ve been guilty of it myself at times — likely because of the unnecessary NEP secrecy you’ve already opined about, justifiably). Anyway, so I was reading up on the nuts-and-bolts of exit-polling when I came across this most recent post from, I gather, just a few hours ago.
Bearing in mind that I’m an attorney, not a statistician/pollster — indeed, was an English major in college, and a horrible math student throughout my schooling — isn’t there still something odd about the Mitofsky/Edison Media explanation you’ve received for the National Exit Poll sample-size anomaly?
If 0.5% of Alabamans left the presidential vote query blank, what are the chances — statistically speaking — that 663 of 13,660 exit-poll respondents *nationally* would do so as well? That seems to me to be an approximately 5% rate of “undervote” (as it were), which is literally ten times the reported “undervote” (perhaps I should say “under-response”) in Alabama.
[And frankly, if a heavily-Republican state such as Alabama showed *one-tenth* the under-response of the National Exit Poll sample — and if that trend says anything about the enthusiasm of Bush voters, which I concede it may or may not — doesn’t that potentially contradict, at least somewhat, Mitofsky’s own explanation for the “exit poll anomaly”: that Bush voters were *less* likely to agree to answer questions about their presidential preferences?
[Bush carried Alabama by 26 points].
I know I may be making a fool of myself here — probably missing something quite obvious — but I thought I’d weigh in anyway, in the hope of some self-edification, at the very least.
— The News Editor
Mark,
1. Sorry, there’s a typo above. I meant to say “633”; instead, I wrote “663.” Proof that I am not entirely at home with numbers.
2. While I recognize at least some of the “under-response” in the National Exit Poll could be from telephone interviews — which would be quadruple-counted, if I understand correctly — if there were 500 such calls out of a 13,660 sample-size (3.7% of the total), we would expect, on average, only 23 of the 633 “under-responses” to be from telephone interviews, right? If we multiply that by four, we get 92 “under-responses” from telephone calls — leaving us with, on average (all things being equal), 541 “under-responses” from a sample size (minus telephone interviews, of course) of around 13,047.
Which is still a 4.1% “under-response” rate nationally, as compared to only 0.5% in Alabama.
So I suppose I would still wonder, what are the chances that Alabama fell so far outside the 95% MoE for an “average state’s” “under-response”? 0.5% is obviously less than one-eighth the “under-response” level nationally.
Hmm, not sure if I’m making any sense, but in the event I am…
…perhaps you or someone else can help.
— The News Editor
Mark, your commentaries are so rife with inaccuracies, it’s difficult to know where to begin.
First off, the state by state data trump the national data. The margin-of-error for the state by state data is smaller.
I saw the late evening wave of exit polls for Ohio on election day. It indicated a 4-point lead for Kerry. In fact, CNN.com updated its exit poll screen for Ohio past midnight, and it was still showing a 4-point lead for Kerry.
You may not be aware of this, but Mitofsky and others were actually using partial actual tallies and mixing them with exit polls to make them “accurate.” Mitofsky himself emailed me to that effect and has also repeated it in interviews on the subject. Does that strike you as a sound method of “polling”? Hey, just wait for the results and mix them in. Hell, why even do exit polls in the first place.
Did you know that according to the last wave of exit polls, Kerry’s lead was bigger in both New Hampshire and Minnesota than it was in California? Did you know that? And yet, why did the networks waste no time in projecting California? I asked Mitofsky this and he never answered me on this point. Lo and behold, I came across an article that had been written on Nov. 3rd stating how the Bush campaign had been lobbying the networks to not project certain states.
Now, regarding your statements about the national data, you are factually incorrect. There was a late-evening update that included about 13,000 respondents that still showed Kerry ahead by three points nationally. Then suddenly, a few hundred more respondents, and the race shifts by 6%? Once again, the NEP fudged the numbers to conform to the “results.”
Please, try getting your facts straight sometime, it’s really fun.
Okay Mark…
What does this do to Dr. Freeman’s paper and his calculations of standard error by state? He simply took the standard error of each state per SRS and multiplied by 1.3.
I guess we have yet another reason why the data in the public realm are too fuzzy for precise statistical analysis!
I’ll have to update my Z-tests to see how this news affects the ranges of p-values.
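For readers following along, the adjustment described above amounts to something like this rough sketch (the 1.3 multiplier is the figure cited for Freeman's paper; the state numbers are purely hypothetical):

```python
import math

# Rough sketch of the adjustment described above: take the simple-random-sampling
# standard error of a proportion and inflate it by the 1.3 design-effect factor
# attributed to Freeman's paper. The state figures below are purely hypothetical.
def adjusted_se(p, n, design_factor=1.3):
    return design_factor * math.sqrt(p * (1 - p) / n)

p_exit, p_official, n = 0.52, 0.49, 2000   # hypothetical state: exit poll vs. official count
se = adjusted_se(p_exit, n)
z = (p_exit - p_official) / se             # the kind of Z statistic mentioned above

print(f"adjusted s.e. = {se:.4f}, 95% MoE = +/- {1.96 * se:.1%}, z = {z:.2f}")
```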
Rick – To clarify: As I understand it, the quadruple counting happened on the state and regional tabulations, but not the statewide surveys. If it did, it would only have pertained to the 500 telephone interviews that were included on the national sample. I agree with your conclusion about “fuzziness” in public data that has been extrapolated from vote by gender crosstabs.
Mark,
This is slightly off-topic, but do you expect the exit polls done in 2006 will be any more reliable?
Nashua,
Regarding the reported undervote in Alabama and your calculations of it, you’re making sense, and your second post shows you follow my point about quadruple counting.
One thing to consider: Respondents are more likely to refuse to provide a response to a stranger who calls on the telephone than they are to skip an item on a “secret ballot” at the polling place. They fill out the polling place questionnaire out of sight of the interviewer and place it, without any personal identification, into a large ballot box.
Having said that and having played with the numbers again, I can see your point: It is very unlikely that the national rate of non-response for the presidential vote question could be anywhere as low as it apparently was in Alabama. Even if we assume that 10% of the telephone respondents refused to divulge their choice, that only accounts for 200 of the “unweighted” missing interviews. The other 433 would require a 3.9% missing rate for the polling place interviews.
I’m not sure if we can get more clarification on this point, but I’ll ask…
Thanks for your post!
Brian,
Good question — at some point I’d like to post something on what they might do/should do in the future…Hopefully, we all learn more soon about what happened *this* year.
“Rick – To clarify: As I understand it, the quadruple counting happened on the state and regional tabulations, but not the statewide surveys.”
I think you meant – “on the national and regional tabulations, but not the statewide surveys” right?
My question to Jennifer didn’t mention the national or regional surveys, but used a few states with small and very large discrepancies. It was my statewide examples that provoked her e-mail.
Mark, true I’ve not spent much time thinking about this. I have a spreadsheet from a while back that includes the sample sizes of the Freeman data, the NEP methods statement, and the final 11/3 CNN weighted data. I’ll check that.
Another thing to consider about these sample sizes. Merkle said that the telephone interviews probably had a slightly smaller design effect than the exit interviews. He said that the 1.5 to 1.8 desr’s apply only to the personal interviews. However, in my post critical of the Simon/Baimon paper, I mention an e-mail from Jennifer that appears to contradict Merkle. Although…I’m not entirely sure she understood my question… More fuzziness issues…
Mark – thanks for chasing this up, and your response (and thanks to Rick).
You said “the numbers still do not quite add up. They did 11,027 interviews at polling places and 500 interviews by telephone for the national exit poll. The unweighted total would be 11,027 + (500*4) = 13,027… That still leaves the total 633 interviews short.” Actually the 11,027 number was the (incomplete) 7.33pm count, the actual number of interviews seems to be 11,712 – adding (4*500) would give us 13,712. This number is now just 52 interviews away from the 13660 – with the added wrinkle that we now have too many. The undervote issue seems to have somehow become an overvote issue. confusing.
cheers again.
So let’s look at the PDFs. Specifically, how big are the samples? “National Region” has, consistently, the biggest “(n= … ” value, since all the interviewers know where they are. What is interesting is (1) that the population classifications for 5 categories and for 3 don’t align, but are off by a consistent amount for all three data sets. Delta = NR-Pop3 = 2000 for the presidential tabulation, and 1812 for the House one. It is as though those got added into the sample in the first batch and then carried onwards. (2) NR – Pop5 is always 8, for both House and Presidential tabulations. Now I can understand that 8 people failed to report their population, but how did these same 8 end up in both the House and NR data sets?
Looks like a fabricated data set to me, or a buggy tabulation program. Call QA. Wait, I am QA. Show me the code and the raw data, and I’ll debug it for you. 🙂
National Region – Population (5) – Population (3)
11/2 3.50pm delta 1812
3970/PRES/V 8,349 – 8,341 – 6,349
3970/PRES/H
3970/HOU/V 7,736 – 7,728 – 5,924
3970/HOU/H
11/2 7.33pm delta 1782
3798/PRES/V 11,027 – 11,019 – 9,027
3798/PRES/H
3798/HOU/V 10,223 – 10,215 – 8,411
3798/HOU/H
11/3 1.24pm delta 1812 again
3737/PRES/V 13,660 – 13,652 – 11,660
3737/PRES/H
3737/HOU/V 12,649 – 12,641 – 10,837
3737/HOU/H
Ignore the “delta 1782”. The patterns in the data are so regular, I used them to double-check my keyboarding. I had keyed 8,441 instead of 8,411 and so got a different delta in my cross-check spreadsheet.
So if my point wasn’t sufficiently obvious– I think MP has it wrong. The 2000 are in all the samples, and we also can guess 1812 [divisible by 4] was added into the House samples, and they were added in early, and are in all 3 PDFs.
Mark, when you say the high totals “slipped past the exit poll webmasters at CNN and CBS”, isn’t it true that it was actually EMR who made the mistake in passing on inaccurate numbers? The Scoop PDFs have the same numbers – and they are apparently from a print subscriber.
You also say that “the total number of interviews that appears for the national survey is in error” – the same is true for the regional numbers (I haven’t checked the individual states)
Separately, it seems poor form that EMR hasn’t alerted their major clients (CBS/CNN) that there’s an error in their numbers, 10 weeks after the election – especially given the intense focus (and suspicions).
Luke was right. It was my error that made the numbers not add up in the original post. See my corrections above. Apologies to all for the confusion.
Rather than goof it up again by posting in the wee hours, I’ll take up the topic of the regional cross-tabs tomorrow.
So continuing my analysis, which tabulated the National Region, Population(3) and Population(5) data sets we can recover the exact regional distribution of the 2000 telephone persons:
3737/PRES/V
NR POP5 POP3 NR-POP3
East 2888 2888 2888 0
Midwest 3676 3676 3520 156
South 4456 4456 3748 708
West 2640 2632 1504 1136
None 13660 13652 11660 2000
the other 5 data sets are left as an exercise for the reader. 🙂
In any event, we also determine that all of the “8” respondents who gave their population within 5 categories but not 3 are in the West, as well, and we can decode how this data set was constructed.
And I think the introduction of telephone polling in a ratio East=0, Midwest=156, South=708, West=1136 blows this whole “Random Sample” and “Design Effect” discussion to effing hell. That means telephone polls have a differential red state vs. blue state distribution (if midwest is 50% blue and east is 100% blue and south and west are 100% red, that’s 78 blue vs. 1844 red on the phones), which just might be a teensy little problem. G’nite y’all.
John, if you look at the presidential race by region, only in the East does n = Pop(5) = Pop(3).
Exit-poll.net tells us that the phone interviews were only in “Arizona, California, Colorado, Florida, Iowa, Michigan, Nevada, New Mexico, North Carolina, Oregon, Tennessee, Texas and Washington State.” (none of which are in the East as far as i can tell)
n Pop(5) Pop(3)
East 2888 2888 2888
MW 3676 3676 3520
Sth 4456 4456 3748
West 2640 2632 1504
Nat. 13660 13652 11660
So that seems to partly confirm one thing or other.
Mark, speaking of the East, did you have any insight into how/why the number of interviews in the East increased by 40% *after* the 7.33pm report?
cheers
And indeed, comparing the first set of PDFs to the last, we see that the phones were added in all data sets, and the distribution never changed. So it is heavily weighted towards red-state phone interviews in all data sets available to us:
Dataset 3737/PRES
NR POP5 POP3 NR-POP3
East 2888 2888 2888 0
Midwest 3676 3676 3520 156
South 4456 4456 3748 708
West 2640 2632 1504 1136
None 13660 13652 11660 2000
Dataset 3970/PRES
NR POP5 POP3 NR-POP3
East 1746 1746 1746 0
Midwest 2069 2069 1913 156
South 2787 2787 2079 708
West 1747 1739 611 1136
None 8349 8341 6349 2000
Mark,
I think we’re getting close — but I’m still a little confused.
CBS has a published “final” draft of the Mitofsky Methods Statement on its website which says 11,903 non-telephone and 500 telephone interviews were conducted on Election Day, which gives us a sample size (I think we can now agree) of 13,903 (11,903 + [500*4]). This is actually 243 responses higher — not 59 — than the reported 13,660-voter sample size for the National Exit Poll.
Using these numbers, the “under-response” rate would be 243/13660, or 1.8%, would it not? — still roughly 4.5 times the Alabama “under-response.”
Why would CBS, an NEP member, have different data than its data-supplier, Mitofsky/Edison Media, more than two months after the general election? And why would that incorrect data be in the form of a *Mitofsky*-produced Methods Statement?
The Staff of The Nashua Advocate deal with this issue on The Advocate website, at
http://www.nashuaadvocate.blogspot.com/
I also can’t for the life of me figure out how the Washington Post was reporting 13,047 total responses on November 4th, 2004 — especially when this figure was given to the public by The Post pursuant to an “official correction”(!) [See The Advocate site for the link to The Post article].
Usually “official corrections” in The Washington Post can be trusted, right?
You’d think.
Now, if only 13,047 individuals actually answered the original Exit Poll — without any weighting or (though I know you find this highly, highly unlikely) so-called “phantom” voters — and the total sample size is the CBS-reported 13,903, we could still have an “under-response” rate of as high as 856/13903 (6.2%), which would be fully *15.5 times* the Alabama “under-response” rate(!)
[Granted, if some, or many, of these under-responses were the quadruple-counted telephone interviews, the total number of “unique” under-responses would be much less — but *still* would be, I imagine, 10-12 times the Alabama sample].
Which numbers are the right ones? If indeed there were 59 under-responses, as Mitofsky/Edison Media is now claiming, then *both* CBS *and* The Washington Post are still reporting faulty data more than two months after the election, the latter of these two (generally) reliable news sources having reached its final statement of data after an “official correction” — !
So Mitofsky is actually saying now that two of the news sources *it* provided data to are incorrectly reporting that same data to the public?
Is that likely, really?
— The News Editor
Thanks Luke, with that information and making the reasonable assumption that within regions and between states the samples are distributed according to weights given by “number sampled” that we supposedly know from the Freeman data, and given the known election results, and using the decoded regional weights of 0.568 West, 0.354 South, and 0.078 Midwest, 0 East, I get an expected contribution of Bush votes of 52.0% for the telephone polling. [Using Bush 49.8% in the West, 56% in the South, 49% in the Midwest, weighted for the states and sample sizes]
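A quick check of that arithmetic, using only the regional weights and Bush shares quoted in the comment above (none of them independently verified):

```python
# A check on the 52% figure, using only the regional weights and Bush shares
# quoted in the comment above.
regions = {
    # region: (share of the 2,000 phone records, assumed Bush share)
    "West":    (0.568, 0.498),
    "South":   (0.354, 0.560),
    "Midwest": (0.078, 0.490),
    "East":    (0.000, 0.000),
}

bush_share = sum(weight * share for weight, share in regions.values())
print(f"{bush_share:.1%}")   # roughly 52%
```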
john – (my 10.35pm post was in response to your comments prior to your 10.16pm post. )
here is the full quote from NEPs faq re phone interviews:
“Which states will have absentee voter surveys?
The states where absentee/early voters will be interviewed are: Arizona, California, Colorado, Florida, Iowa, Michigan, Nevada, New Mexico, North Carolina, Oregon, Tennessee, Texas and Washington State. Absentee voters in these states made up 13% of the total national vote in the 2000 presidential election. Another 3% of the 2000 total vote was cast absentee in other states in 2000 and where there is no absentee/early voter telephone poll.”
http://www.exit-poll.net/faq.html#a13
given that 16% of votes in 2000 were absentee, and given HAVA, whether it is appropriate to only do 4.1% (500/12212) this time, i’ll leave for others to discuss…
to complicate the picture, theres this from Gallup Oct31 “In Florida, 30% of registered voters said they already had cast their ballots, using early voting sites and absentee ballots. They supported Kerry 51%-43%.” (but as joshmarshall said at the time “shouldn’t the number add up to 100% or close to it?”) http://www.usatoday.com/news/politicselections/nation/polls/2004-10-31-poll-x_x.htm
clear as mud.
Nashua, I agree that it’s odd that there are apparently 3 (or more) different sets of numbers. I’m not sure that we all agree that CBS’s number is the accurate/best one… I have been using EMR’s 12,212 figure. That’s where we came up with the 59, rather than the 243.
Also, you said “the “under-response” rate would be 243/13660” – as far as i can tell, this is actually an “over-response” now, not an under-response, which seems to render Jennifer’s explanation invalid – i.e. we have a completely new problem.
Cheers.
So summarizing last night’s work 🙂 We have the leaked PDFs and as we all know, one source is no source. They seem to be a hoax. If they are authentic, then the exit polls are a hoax too. I am not in a position to authenticate the documents, but any investigative reporter with access to any similar documents can, so I suppose someone will.
In the meantime, what does the internal evidence show? Tabulations are aggregations of rows in a database. We now know that all the data sets in the PDFs have 2000 records in them that are different from the rest, namely they have Pop(5) responses without Pop(3) responses. Furthermore, there are 8 responses in the West that have National Region without Pop(5) responses. These 8 records appear in both the House and Presidential data sets.
A few deductions: It is possible that the same 2000 records show up because the same precincts were used in all data sets; that would prove that no re-sampling occurred between data sets, at least with regard to the special 2000 records, and suggest that the results are cumulative, with older data being rolled up into later sets. It also would imply that the same precincts were used for House and President tabulations.
Conversely, the simplest hypothesis is that the same 2000 records are in all the data sets, and Occam’s razor prevents us from assuming they are different from the 2000 alleged telephone rows, unless there are 4000 rows of phantom data–around 1/2 in the early data. Naturally, the PDF hoax could be designed to help us discredit the exit polls, so we shouldn’t swallow any of this whole.
If any of these conclusions conflict with the declared NEP methodology, we have a scandal–a lie about how exit polls are conducted. Or we have a faulty tabulation program, and the exit poll data we have are worthless (or falsely discredited). If the PDFs are not authentic, we have a hoax. In no case do we any longer have a random subsample of a larger, valid, data set, unless we have no valid data at all, which is quite possible.
Ergo, (for the logic impaired), we have a hoax. The only remaining mystery is… wait, I’d better say this carefully:
pro doctis, mysterium remanens solum, Mitofsky laborem praefert, an lucidem. [Roughly: for the learned, the only remaining mystery is whether Mitofsky prefers the labor, or the light.]
So we’ve failed to notice that since 156, 708, and 1136 are not all simultaneously divisible by 4, that we can prove that the 2000 records are not 500 records multiplied by 4? Coffee helps. zzzzz
Extra credit for the fully awake and mathematically adept (like statisticians are alleged to be): prove that it is impossible to construct a data set of 2000 records with exactly three regions, which consists of a 4x replication of the same data.
So we do not need the data to prove the explanation is false. All we need to know is none of the states interviewed in the 2000 are in the East, and all the remaining regions are populated.
And for my next observation, you can prove that there *is* a data set with multiples of 4 added into the synthetic one in the PDF, since if you look at the undercounts for age and sex, too often by region they are off by exactly a multiple of 4.
There are thirty columns (2 for House/Pres x 3 data sets x 5 regional breakdowns). Each row has a question; tabulate just the “N=” value. Now subtract rows from each other and look for patterns. E.g., subtract an Age row from a Number in Region row. If a row comes up all multiples of 4, then there were a few non-responses in the “x4” data set.
So… we have two provably distinct data sets–one multiplied by 4, and the other of size 2000 that isn’t. Hmmm.
And, a final comment for now, correcting some of my statements about divisibility by 4 (which will be obvious to anyone who checks the work):
— In the House data set, there is a composite set of 1812 rows, distributed East=0, MW=148, South=676.
— In the Presidential data set, this is supplemented by an additional MW=8, South=32, and West=1128;
all divisible by 4, bringing the total to 2000. All this was in the first set of PDFs already, and didn’t change.
Further decomposition is possible, now that we know the relevant statistics are the kind used in cryptanalysis, not data analysis.
john – hmmm, you seem to be getting ahead of me a bit…
We appear to know that:
A. The leaked Scoop exit polls appear to be authentic because i) they conform to the CBS/CNN websites ii) Warren Mitofsky didn’t debunk them
B. The Pop(5) and the Pop(3) don’t seem to have been questions per se, but rather they appear to have been ‘tags’ added by the interviewer.
C. For some reason, the phone interviews included the Pop(5) tag, but didn’t include the Pop(3) tag
D. If we look at the regional numbers, ((Pop5-Pop3)/4) appears to give us the following numbers of phone interviews: W=282, MW=39, S=177, E=0, for a total of 498 interviews. These interviews appear to have been completed before Nov2, and were included in all the data sets (a quick sketch of this arithmetic follows the list).
E. In all cases n= Pop(5), except in the West, where n-Pop(5)=8 (in all data sets) – which appears to suggest that a few early interviews in the West weren’t ‘tagged’ for one reason or other.
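A minimal sketch of the arithmetic behind item D, using the Pop(5)/Pop(3) figures tabulated earlier in this thread:

```python
# Inferring regional phone-interview counts from the Pop(5)/Pop(3) gap in the
# 13,660-respondent presidential tabulation posted earlier in the thread.
pop5 = {"East": 2888, "Midwest": 3676, "South": 4456, "West": 2632}
pop3 = {"East": 2888, "Midwest": 3520, "South": 3748, "West": 1504}

phone = {region: (pop5[region] - pop3[region]) // 4 for region in pop5}

print(phone)                 # {'East': 0, 'Midwest': 39, 'South': 177, 'West': 282}
print(sum(phone.values()))   # 498
```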
Generally agree with you Luke. Also, the last line in every data set is the same in the House and President filters, either an off-by-one programming bug or they don’t apply weights to that line for some reason. Also, that line agrees with the age breakdown lines in all sets.
In addition, we have three independent pieces of information that suggest “house” and “president” PDFs are filters of the same underlying data set and are in fact additive relative to each other (so subtracting yields information about what was “added in” to get to “president”):
1. the “8” in West n-pop(5) appears in both sets. This is not conclusive alone but supports…
2. the identity of N for the last question in every region (which could be a programming or query error)…
3. if you subtract N for “sex” from N for “age” you get a difference between persons who don’t like to report their sex and those who don’t like to report their age, which should really be pretty random. This statistic, distributed by region, correlates between “house” and “president” data sets, case by case.
I hear MP typing…
Rick’s right. I’m a few minutes away from another post.
But John? You’re not just getting ahead of Luke. It took me three read-thrus before I finally figured out your point.
I’ll give you credit for discovering that somehow, the three-category cross-tab for “population size” omitted the telephone interviews (2000=500*4), while the 5 category version of the same variable did not. I did not notice and you seem to be right about that. However, you lose me on just about everything else.
A few points to consider:
1) Of course both sets of PDFs are based on the same sampled precincts and underlying data. You really think they did two separate 12K-interview exit polls, one for the presidential question and one for the US House question?
2) You might want to re-read that NEP methodology statement, as you’ve skipped over a lot of basic and obvious information:
http://www.exit-poll.net/election-night/MethodsStatementNationalFinal.pdf
Most notably: “In addition, 500 absentee and/or early voters in 13 states were interviewed in a PRE-ELECTION telephone poll” [Emphasis added].
You didn’t think they were trying to reach early or absentee voters by telephone at home in the middle of Election Day, did you? They called before, presumably on Monday and probably Sunday nights. As Luke pointed out, the full 500 interviews (or 2000 records) were in the national/regional datafile from the very first tabulation.
So given that the 3-way population cross-tab seems to filter out those 2000 records that were in the data from the first run, *of course* every PDF was off by exactly 2000 on every single presidential table.
Why does the 3-way pop. crosstab for the US House question show a consistently smaller number missing on every table? Because fewer votes are cast for the U.S. House than for President. You could look that up (or wait for my next post).
So what does “Occam’s Razor” tell us is more likely about the apparent error in the 3-way pop. crosstab? That a programmer made a simple error constructing that table, or that you have uncovered evidence of a massive hoax involving the exit pollsters and all five NEP partner networks?
I’ll go with the former.
Having programmed hundreds of crosstab tables like this myself over the years on projects far less complex, here’s my best guess: The person that constructed the specs for these tables mistakenly put a filter on the 3-way population variable that created the error, and nobody caught it on Election Night. It happens.
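Purely to illustrate what that kind of spec error can look like in practice, here is a toy sketch with invented field names (not EMR's actual tabulation code):

```python
# Purely illustrative, with invented field names: one way a stray filter on a
# single table spec could drop the 2,000 replicated phone records from the
# 3-way population crosstab while leaving every other table intact.
records = (
    [{"mode": "in_person"}] * 11_719 +   # polling-place interviews
    [{"mode": "phone"}] * 2_000          # 500 phone interviews x 4 replicates
)

def crosstab_n(rows, exclude_phone=False):
    return sum(1 for r in rows if not (exclude_phone and r["mode"] == "phone"))

print(crosstab_n(records))                      # 13,719 -- national and 5-way tables
print(crosstab_n(records, exclude_phone=True))  # 11,719 -- the mis-specified 3-way table
```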