
Monday, December 16, 2013

Small samples from big populations shouldn't bother us

by Rajeeva Karandikar, Director, Chennai Mathematical Institute.

In the last few weeks, there has been a lot of discussion about opinion polls. Some people have questioned if these have a scientific basis. Indeed, each time we disclose our findings based on an opinion poll, someone raises this question.

In this article, I offer a simple explanation of the scientific basis of an opinion poll. The key result is this: If the methodology is sound, an opinion poll based on a sample size of 25 thousand respondents in our country, where there are over 500 million voters, can yield surprisingly good projections of the vote shares of major parties.

Consider a lottery. Suppose you are told that a box contains lottery tickets and that each ticket has a number written on it: 1 or 1000. You can pay Rs 100, and then put your hand inside the box and draw one ticket from the box. The prize would be the amount written on the ticket (in Rupees). Most people would not agree to play unless they are told how many tickets in the box have the number 1 or 1000 written on them. However, if they are told that 99 percent of the tickets have the number 1000 on them, many may be willing to play. Indeed, even if the cost of playing the game was Rs 900, many would opt to play if 99 per cent of the tickets have 1000 written on them.
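The intuition in the lottery example can be checked with a short expected-value calculation. This is only a sketch: the function name `expected_prize` is made up for illustration, and it assumes a player who cares only about the average payoff per draw.

```python
from fractions import Fraction

def expected_prize(share_of_1000):
    """Average prize (in Rs) when this share of tickets says 1000 and the rest say 1."""
    return share_of_1000 * 1000 + (1 - share_of_1000) * 1

# 99 percent of the tickets carry the number 1000, as in the example above.
avg = expected_prize(Fraction(99, 100))

print(float(avg))        # average prize per draw
print(float(avg - 100))  # average gain when a draw costs Rs 100
print(float(avg - 900))  # average gain when a draw costs Rs 900
```

The average prize works out to Rs 990.01, so the game is favourable on average even at a price of Rs 900.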

Suppose, instead, you are told that the host of the casino will put his hand in the box and draw a ticket. There are still 99 percent of the tickets with the number 1000 and only 1 percent with the number 1, and you have ascertained that all tickets are identical in all aspects other than the number written on them. Even then, you would be a bit apprehensive, as the host might have put the tickets with the number 1 at the bottom of the box, and given a chance the host can dig deep in and draw a ticket from the bottom. If you are allowed to shake the box and mix the tickets well, you would probably still play.

Now let us consider another scenario. A political party has two claimants for a Lok Sabha constituency, say Raghu and Prasad. Suppose the constituency has 5 lakh voters. Let us imagine that we have lottery tickets with the following characteristics: each ticket carries the names of 2501 voters from the constituency. The ticket is coloured Red if 1251 or more voters on that list prefer Raghu over Prasad, and Blue if 1251 or more voters on that list prefer Prasad over Raghu. Suppose all such lists are written out on otherwise identical lottery tickets.

Let us assume that there is at least a 5 percent gap in the support level of the two candidates. It can then be shown that over 99 percent of the tickets will have the colour of the candidate with more support. This is just a question of counting and is purely arithmetical: no element of probability or statistics enters here. Thus it is a matter of fact and not of belief! Indeed, when the gap is exactly 5 percent, 99.3939507 percent of the tickets will have the colour of the candidate with more support.
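To see that this really is pure counting, here is a small sketch on a toy constituency that is small enough to write out every ticket. The numbers (12 voters, 7 of whom prefer Raghu, tickets with 5 names) and the function names are illustrative assumptions, not figures from the article. The sketch uses `math.comb`, which needs Python 3.8 or later.

```python
from itertools import combinations
from math import comb  # available from Python 3.8

def count_red_by_enumeration(n_raghu, n_prasad, ticket_size):
    """Write out every possible ticket and count those where Raghu has a majority."""
    voters = ["R"] * n_raghu + ["P"] * n_prasad
    majority = ticket_size // 2 + 1
    red = 0
    for ticket in combinations(range(len(voters)), ticket_size):
        if sum(voters[i] == "R" for i in ticket) >= majority:
            red += 1
    return red

def count_red_by_formula(n_raghu, n_prasad, ticket_size):
    """Same count, obtained by multiplying binomial coefficients instead of listing tickets."""
    majority = ticket_size // 2 + 1
    return sum(comb(n_raghu, j) * comb(n_prasad, ticket_size - j)
               for j in range(majority, ticket_size + 1))

total = comb(12, 5)                          # number of possible tickets
red_enum = count_red_by_enumeration(7, 5, 5)
red_formula = count_red_by_formula(7, 5, 5)
print(total, red_enum, red_formula)
```

Both counts agree: 546 of the 792 possible tickets are Red. With only 5 names per ticket the majority colour is right about 69 percent of the time; the point of the article is that this share climbs above 99 percent once each ticket carries 2501 names.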

Return to our example of two candidates with a gap of 5 percent or more. If the party draws a ticket out of the box after mixing it well, it will almost certainly end up knowing which candidate is more popular. Here the logic is that since over 99 percent of the tickets have the colour of the more popular candidate, we can assume that the ticket drawn has the winner's colour. Once again the decision maker should ensure that the tickets have been mixed well.

Here are the percentages of tickets that will have the colour of the winner for different combinations of population sizes and sample sizes. In each case, we assume that there is a 5 percent gap in the vote shares of the two candidates.

Sample size    Population size (total number of voters)
                500000    1000000    2500000    5000000    10000000    25000000
1001             94.35      94.34      94.34      94.33       94.33       94.33
1201             95.87      95.87      95.86      95.86       95.86       95.86
1401             96.96      96.96      96.95      96.95       96.95       96.95
1601             97.75      97.75      97.74      97.74       97.74       97.74
2001             98.75      98.75      98.74      98.74       98.74       98.74
2501             99.39      99.39      99.39      99.38       99.38       99.38
3201             99.77      99.77      99.77      99.77       99.77       99.77

The remarkable thing about this is that while accuracy increases as sample size increases, the population size (total number of voters) has only a negligible influence on the accuracy. This is somewhat counterintuitive but true. A sample of size 2501 will give practically the same accuracy whether the population size is 1 million or 25 million!
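Readers who want to check this insensitivity quickly can use the sketch below. It counts the same lists as described above, but works with logarithms of binomial coefficients (via `math.lgamma`) so that it runs in a fraction of a second even for 25 million voters. This is a floating-point shortcut of my own choosing, not the exact integer arithmetic behind the table, so the results are approximate in the last few digits; the function names are illustrative.

```python
from math import lgamma, exp

def log_comb(n, k):
    """Natural logarithm of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def accuracy_percent(population, sample, gap_percent):
    """Approximate percentage of lists in which the more popular candidate has a majority."""
    winners = (100 + gap_percent) * population // 200  # supporters of the leading candidate
    losers = population - winners
    max_losers = sample - (sample // 2 + 1)             # most loser-supporters a winning list can hold
    log_total = log_comb(population, sample)
    share = sum(
        exp(log_comb(winners, sample - j) + log_comb(losers, j) - log_total)
        for j in range(max_losers + 1)
    )
    return 100 * share

for population in (500000, 25000000):
    print(population, round(accuracy_percent(population, 2501, 5), 2))
```

Changing the second argument (the sample size) moves the answer noticeably; changing the first (the population size) hardly moves it at all.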

The following table gives the percentage of lists which have the colour of the winner when the gap between the winner and loser is only 2 percent. Here again we see that sample size determines the accuracy and population size has very little effect on it.

Sample size    Population size (total number of voters)
                500000    1000000    2500000    5000000    10000000    25000000
1601             78.86      78.85      78.84      78.83       78.83       78.83
2501             84.20      84.17      84.16      84.15       84.15       84.15
3601             88.58      88.54      88.52      88.51       88.50       88.50
5001             92.24      92.19      92.16      92.15       92.15       92.14
8001             96.44      96.38      96.34      96.33       96.33       96.32
10001            97.83      97.78      97.75      97.74       97.73       97.73
15001            99.36      99.32      99.30      99.29       99.29       99.29

At the bottom of this article is a computer program written in Python which does these computations. You can either take my word for it, or have a mathematical expert confirm the accuracy of the program and then run it on a computer with Python installed (Python is available freely at http://www.python.org/). You can change the population size, the sample size and the gap between the support levels to get the accuracy level of the corresponding sampling scheme.

The same situation applies when we conduct an opinion poll. We select a group of 2501 voters, called a sample, and ascertain the opinion of this group. It is the percentage of votes for a party in this chosen sample that we report as the estimated vote share of the party. The crucial thing is that our choice should be as if we had written all possible lists on lottery tickets, put them in a box, mixed them well and then drawn one; the names on that ticket constitute the group. This is what is called random sampling. One can use random number generators to generate such a random sample from any list of voters.
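As a minimal sketch of what "using a random number generator" means here, Python's standard `random.sample` draws a group without repetition in such a way that every possible list of the given size is equally likely, which is the lottery-ticket scheme just described. The voter list below is a made-up stand-in for a real electoral roll, and `draw_sample` is an illustrative helper name.

```python
import random

def draw_sample(voter_list, sample_size, seed=None):
    """Pick sample_size distinct voters so that every possible list is equally likely."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible for auditing
    return rng.sample(voter_list, sample_size)

# Hypothetical electoral roll of 5 lakh voters.
voter_list = [f"voter-{i}" for i in range(500000)]

chosen = draw_sample(voter_list, 2501, seed=2013)
print(len(chosen), len(set(chosen)))  # 2501 names, all distinct
```

In a real poll the seed would not be chosen by hand; it is fixed here only so that the same sample can be regenerated.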

Colloquially, most people think that random means arbitrary. This is far from true in the scientific setting. Random sampling refers to the methodology of choosing a sample. In this context, it means choosing one list out of all possible lists as if we are drawing a lottery ticket (in the scenario described above). What I have described is the simplest sampling scheme. There are variations which may be more appropriate in a given situation.

Suppose we have access to a list of all telephone numbers in use in a constituency. We can use a computer program to pick 2501 phone numbers at random from this list, call these numbers and ascertain the view of each owner. This would let us estimate the opinion of the group of people who have phones. However, the richer, urban, educated class will be over-represented in this group, so our estimate of the overall vote share could be biased. This methodology is used in the US and seems to have worked well over the last 50 years; it did not work in the '30s and '40s, even in the USA, when telephones were not yet ubiquitous across the country.
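The effect of sampling from the wrong list can be illustrated with a small simulation. All the numbers below are invented purely for illustration: a constituency of 100,000 voters in which 40,000 own phones, with party A supported by 60 percent of phone owners and 45 percent of everyone else, so that its true overall support is 51 percent.

```python
import random

rng = random.Random(0)  # fixed seed so the run is reproducible

# Hypothetical constituency: 1 = supports party A, 0 = does not.
phone_owners = [1] * 24000 + [0] * 16000  # 60% support among 40,000 phone owners
others = [1] * 27000 + [0] * 33000        # 45% support among the 60,000 others
everyone = phone_owners + others

def estimated_share(frame, sample_size, rng):
    """Vote share of party A in a random sample drawn from the given list."""
    sample = rng.sample(frame, sample_size)
    return sum(sample) / sample_size

true_share = sum(everyone) / len(everyone)
from_all_voters = estimated_share(everyone, 2501, rng)
from_phone_list = estimated_share(phone_owners, 2501, rng)
print(true_share, from_all_voters, from_phone_list)
```

The sample drawn from the full voter list lands close to the true 51 percent, while the sample drawn from the phone list lands close to 60 percent: it is a good estimate, but of the wrong population.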

Thus the most important ingredient in the opinion poll is the methodology of sample selection. One must be sure of getting opinions from a representative sample. Unless the sampling is done properly, there is no statistical guarantee that the estimate would fall within 2.5 percent of the true vote share (with 99 percent probability for the sample size of 2501).
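The link between a sample of 2501, a margin of about 2.5 percent and a confidence of about 99 percent can be sketched with the textbook normal approximation. This is not the exact counting method used in this article; it is a standard shortcut, shown here without a continuity correction, and the function names are illustrative. With a 5 percent gap the leader has 52.5 percent support, i.e. 2.5 points above the halfway mark, and the standard error of the sample share for 2501 respondents is roughly 1 point.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def approx_accuracy_percent(sample, gap_percent, population=None):
    """Normal-approximation chance (in percent) that the leader also leads in the sample."""
    p = 0.5 + gap_percent / 200       # leader's true vote share
    se = sqrt(p * (1 - p) / sample)   # standard error of the sample share
    if population is not None:
        # finite population correction; close to 1 when the sample is a tiny fraction
        se *= sqrt((population - sample) / (population - 1))
    return 100 * normal_cdf((p - 0.5) / se)

print(round(approx_accuracy_percent(2501, 5), 2))
print(round(approx_accuracy_percent(2501, 2), 2))
print(round(approx_accuracy_percent(2501, 5, population=500000), 2))
```

The approximation comes out close to the exact percentages in the tables above, and the population size enters only through the correction factor, which is nearly 1 whenever the sample is a small fraction of the electorate.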

Readers can experiment with the program and obtain the accuracy of a prediction based on a random sample for a given sample size, population size and gap in support between the winning candidate and the losing candidate. The program prints the total number of lists, then the number of lists in which the winner has a majority, and on the last line the accuracy (in percent).

Python code:
g=5
#Gap (in percent) between the support for the winning candidate and the losing candidate
psize=500000
#Population size
ssize=2501
#Sample size
def binomlist(N, R):
    '''Return [binom(N,0), ... , binom(N, R-1)]'''
    a=[1]
    for k in range(1, R):
        a.append((a[k-1]*(N-k+1))//k)
        assert((a[k-1]*(N-k+1))%k==0)
    return a

n=psize
#Population size
print('Population Size :')
print(n)

m=(100+g)*n//200
print('Gap in the level of support between the two candidates (in percent):')
print(g)

#Total number of supporters of the winning candidate
print('Total number of supporters of the winning candidate :')
print(m)

k=n-m
#Number of supporters of the losing candidate
print('Total number of supporters of the losing candidate :')
print(k)

r=ssize
#Sample size
print('Sample Size :')
print(r)

s=1+(ssize)//2
#Majority mark in the sample
print('Majority mark in the sample :')
print(s)

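t=r-s
#Largest number of supporters of the losing candidate in a list where the winner still has a majority
b=binomlist(n,r+1)
#b[i] = number of ways of choosing i voters out of all n voters
c=binomlist(m,r+1)
#c[i] = number of ways of choosing i supporters of the winning candidate
d=binomlist(k,t+1)
#d[i] = number of ways of choosing i supporters of the losing candidate
print('Total number of lists :')
print(b[r])

#A list with j supporters of the losing candidate has r-j supporters of the winning candidate
z=sum([ c[r-j]*d[j] for j in range(0,t+1)])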
print('Total number of lists in which winning candidate has majority support:')
print(z)

y=(z/b[r])*100
print('Percentage of lists in which winning candidate has majority support:')
print(y)
Output:
Population Size: 500000
Gap in the level of support between the two candidates (in percent): 5
Total number of supporters of the winning candidate: 262500
Total number of supporters of the losing candidate: 237500
Sample Size: 2501
Majority mark in the sample: 1251
Total number of lists: 622316905814464800031244865646036080797226642877806798507697548117420428264404728870158307029244805751394862496575128049930960170259665272404859716770124601013025142186862666094410521008369094641692705248149062898253232678209487378887686383067216573252135009200999062341745504599166768778011226480152418623932267406113916934196903934352793844488464981646119176904859389163090224441868536787165403397209968239206327618954862034383804302545903749252962527618682876133626697493651254546313748796931601428198693048759066549213490950558384425624146689770247661799591300110216105756629109561342475645217384773134461962616048023025434101460681326703421554750070950247433238670457954001431767273840292819769336001680792975102918494450930670710836846850037300589465197102470349453760302798210297014729237401921022050257974754525310046675964137276366704652157298677542838333743853031453879480513594044534035943615253785584100336297592759324981920969822918008494705715182870632294314479591333857921380844903046669391236576151898220998741210792951319871782067670844772084231163615394229385685268596763091304660658888020812484626579395701826998156254539013863583183500227099956252888286037939161089044280086097342996992214375663368352402575340853934794911866650796551901034282378007388880069647008129404982361108221844780217804152601368666721643262313106508955212481217558591078669387795651303349133210949336010297304361840824850790295581705698191650535715427959917272173309661004145272213646869645299207261632384922938922283269480011172934681388580235169394579946645672618500063119337569473042855610862487882008035643750930037728487756818421972099821004785558638463385842819065990094755832220844879808183080400336715879849935159745586844750222779019700990535412235421341558425045160322248044451834513173801495899700322128045756280820982770814639578390779208988695975866205159959700085142481672318105553361583687605408585846406972408808590687599805463015449451733210695537213509728117024657
76038261514751366432380505653269990734628912787921196828014924849957148325947484479478464943528525829530723712207177801854498505379313242978072796608415660424672105172137755545024900415945428256536045336980540671661266557344947764836566529722714879021182976140139129856559145427658178495007317534394739394235188377026548923486253173751616379130541552155758837114472809783850427754469844587936072840642351778220558057232669828423498123063458914684776588125631122103174618980604765576707899260689467306579408356058711623351399260178055917659963408455273580719144814560484832919904878765961591225700327651593973860864438116094850680456864518725262740928051582341539254987525019787256071659676928373298021282046954594050755383326687971809253561980088484298073161856459452325694037739274837365216230582263924667803583485781780218253284218391730866178085262881994066033816393829013721311131748672183078728900933581558734405974104874756534969248232592313582502037142672654121741282578372979805465908127950187274075397490336844923270342615399463969649554126623283032616619860556580636328136753628805680852851407963270089714073626621120839871909711594994253666342272359471320655869954863460407565986906474615595552330627592635455243206575046823339900668550971352633374485688587126074795047908585598712155945668417607256395345722088439538463447076588501001219788542858153049781833657942552401464715635534332260515218505056897905685877043364499338188335401451485749380823361891367165284575198795795706325965459245970820182968416588569262205864560967474390631523041120889653260589129456152566220058515917669342363795423128034942492225269840456119934077650384674330202082758512409968193166923108572334911544560155774100118425500347543423269124631844558114435125175017555500011451376956685419999212743508413967187310121355328643814596714505035428570563667905329220023568761920372318672748066491538386362503932919067623642006192288629490332239924529275660392604135364518178709645251568947829790968485350757937039473
40389283522101910494638050145617026179254177965390561011260925798206928843488382153934845663897870627749026609678580191245018523063853982970192435294226289485527014293025903197204628258656389282775439726139413566985319236990953694925249648990538489051969364421937428531195086983884099986257644973808961649801024791703854144367171083520020925830342961834117282885721017254635203310832071408647527583301365881617256285164015301320159493592180544485851186222298692161164962078702570276492893803041687846978454217857289928379238171254954798995737571216271583497105230690842477755325039889793366068811462617534333829645910505210690781590780049363210615313037139026574111887962177382759926821051065765464756796180567987322777498849473411701570554075360917403197083579347794586790780146839117003223499844939831280549478436501575490526110349649291447552201378415406974788911816774562264881782345309629383455270942503809941075121101018468408847806222956745854538538187351421515200366709710959779056876519806507742794652504538425496705303276766207944259902286382599225569764267019508736806130354418202638327938179375653758519507203450103228207221801477615553668622606759020710505997896285626415201585732982305461887457510794355576931387932210734277304570045890545478910751187046157816458186878222310277821865349187185519227095181899716429495833497404577367816746942348562517731737091584817733673633283677799636160970790298704669421486304889011355485441305056753420409472568639036215572196367214109670658164999281587768171367606387864794970842653391889386004121133866671767850698833785676601603945235156218270585187953072321977066166039068313459803616688773468350811612539361783752166302665084544685621897478765112903675598155278553981302414217332857094875961334675726383815260294423420508632754305158825408909038659311303184167236287558241995104511767621467798389298221576809189038841045961864839470197063681198463440858856539688780859811683627641354639627842836243949605040310402667913018127259941000036101004
7778136678130879383747883747846895444550308537252212467158365879188315673992959209980186727731070450877815644643251764435971587530519135727567687253468334262068241333011975327403420996864840461485541038137569067936744784955690032673877577363436808601455485916733141230626610962300477240992891148835574452260329156066010179688692107195572679448377022660187493117482552354887255967959473828708949465982835085658015438685159219153348125861982966902706795929866032389611325518509983815234810190419913364567450264118122918625091636534133701922567413915709199174234722642774022748876761798838932368596471473383761819955993306073194192511980655731511111418734812039174839069481997922657860771600177158258301963426575629546453731005562602020307021534742971566271133060854173312518512084425581890020765636041299469386187750782069745075999996018174251440607754688000000
Total number of lists in which winning candidate has majority support: 618545358594745578559908021052377529970030792269152663982955667097865715708651269443909682106375184235420795271025700229991666889842742962358851391628903715148148451520106228051586201689124686906931882456829669547873578165111167797637331634027004769336309591648433054332018505788453700912887488540613900747474393888658459742021619302625993957317273531910680115900811527701968774079716364269517319940859072678047688529378250498038381004735211310630529262679441915229073426583175517649049603340723866315077544407461714335274671447331820122553053261994553191594977595628231526639682628657774571032717001285460285081858402326115236954059347289640351182379608680149733264479227061165443543360343476146783840221629745270379043211911813985345755698488826300557999107497971723045644917724231993367898516122160702032042755240942331124529146122891256623703860462577151562896166175745434513002145018412470967876749129796830700617136055434962432194272590750341713504393693205840867109809283793664897592032621003851244892350150290505513010747517814459927812579816544356672725276066976780670988317094439292401279291285473944611748711635096265645985563183249693481116711671201104245081556768427847757600594449812048341877398127538619699289712224207611318557886326049405322473744910231576836179943240640233256004954265624748955224818170882903065076020987615865564241831651927034069605500383397979678199735796030298243465427594361503333925969150150637509639404505113808557368792409492030567279578208066540335323826526979443351873515381935023027526774389797160693713838390233195814097155334312615193180569635673586150929165679186463448090171926892544418731946070327991765800055188477445133149446451324714075406625043643414255582883617170829391606554460290646703177322289635612054097373301149510132185908047684378090710920582354408174619364845738239226099542002586970637245728452225477803655191854761387230591499230146445558375464724303043277763791
90867288941326831870645002682813073558114144397020856651396098641428245880407763399111302626215417904725128805326719787159608089308759310539659291511216704403271866844555932756610742675279329626214802593821101127278170409131325586212124151710165142757100487590767687382044871782315436075121171542572626492638808471680968040729208050469140926077912584273576340641257014821813020214765524809778864954507078512918888714254859177939535150020680594439865146499021168025610541634139341501484522062829556587659324452928788309544618343094453013332661828069966660143355367154612846409938357974206448260728642168060123578255631198332463974385754771418670196985850132396338939942461308023817769221799318621949923468227274010642413121733050935121767411991316887764568127280843446615938837129446630717759517950376477713807259688797866576252414083146600686650880207899667891677604126827061960251420067738418355966327207257392917436336731865485642950327130932493289464011591491047704107981756427219291055264143312206202230577972263205514048071843127797269511291476129899775547198248411450106880937713920234740112362636474232781683958078134701589591040282984364866200009733100910271820163097453414872456237992771088179902024383111236689464059435281063899471133588050088649308537531041274287153585336369394790451199357846698579951685792888991089679835339060335104126017261132825267018640531309505427477537620870911733521553658152454976725660963872168787051805718895517444599140932756344873557326427855400701615942270826310406658514734551562588662900750800729828396653844953945482326221733257969297201087018127132328768798046176023879900010060373993216522794747523289640979321071773794218866605017475730371283611610761350911346861500008729919019301067072919496561285987332861930046572860663808696120651875913544097783371879746140834717719015688160225361094599128341000683910024247665018933294496488786624226050404864707017014357988489253639376601921455654391172067437596345359240709330773941058600329053762687941644065
92263314564413046914831786576041110369262430788468317481961867144206915271070713416267202734623737483409735033617092845341660024435520720081105473941952939712415412607105179213319989512872695021469075137396490272366725589092449101323260365729979268502500402371305705889342846486877931529517474948646855060131870247723916361434104464106299693268702487202781154807761800123461586454874626813025017119277861748813800496756430669824853518771742107636510634880610682153734694964384670044933035859936547538247024691861129287194480912196911101786459854582905143843785028822189902210018471204240475408680399634512091203049971000431302219468288674422170840679624251177737170722303371347461436894061560651893816240453971759957271017940428104385085142008851273447753698971870744559577239542462031144481706797360996777783470040929727741554946507411532805706444213367551980493867845721284687008010010945596489972907196518190561255594548849356018036605860674847631728011986599133197508241872669014729773670979608262256253270634189630346809369306264792640971274759121208393570325828919401494070911474556903716160833383246954483765463222844272182270272400362717099678882461910456897297593491193237803747312055676576576219594163267617151294590912924244844463855363653660959413112928215253941271400958586889242832044089566335186773388165223958094914268468348135324856279887375728791675758046862738999115585281355044891796932789291109793399041919601596037427305147317543899320632313725640318610936344828266858616631135448538493293180414433998877440300440240589694524572542270243173369012401702292465722732304766896181447721871407256818631335907998741194013039668179971554052123126844195344771852985989580398026271040138765802760665650005720512684985147825545935827916178380882436357130979409508889037924763044489057792310075658454554910800667795614663760417279874281949252150782689875879189200663660759663541970817071874396797683722133372008332713236680799139419740451560052886276238924890580858726815599432033824499872
4029406492065210164089814201328490050287743177666822378851971606713160006189128166640525745262825626356517895066557744597519977449314900187385212625277914375251622206262429685098796590400768356218087685302088179153086859311291523273637686891639986221244405181640484253991269897217169690609115907769063954029261659568602193333309944736964180426430695238571165245318661728551017255810873045429123026855411671804358355994634963743745030745365862888759330245538534977291471181991934485986474528502644452587454925992820425856177106615897073589580463926178212465336178760260176190531289730768305813149258314589213372104868616854925650428977268668149331635952396377069894057586620585416152265929820699770161781555434581354014880088969190425658318935898315061156729251402541740877913593377241246796287235439071594744201325364245711979195094423501588318728952447886937824706628041515791435885246207092104530611200000


Percentage of lists in which winning candidate has majority support: 99.39395070510206.

1 comment:

  1. Interesting indeed! Thank you.

    A couple of suggestions:

    1) The script works fine with Python 3.x, but for 2.x, one might need to add the following line in the beginning so the percentage calculation in the last few lines doesn't truncate to zero:

    from __future__ import division


    2) "Let us assume that there is at least a 5 percent gap in the support level of the two candidates. It can then be shown that over 99 percent of the tickets will have the name of the candidate with more support."

    This part needed a little more illustration for me. While the python script explains the details, the calculation below helped me understand it easier: Say the population size is 100 and sample size is 3. Then there are 100-choose-3 = 161700 total lists possible. If there is a 2 percent difference in support levels, 51 people favor Raghu and 49 favor Prasad. There are two main ways in which Raghu shows up as the winner in the opinion poll of 3 people:
    1) If all 3 are drawn from the 51 voters favoring Raghu (51-choose-3 = 20825 ways of doing this)
    2) If 2 are from the 51 voters favoring Raghu and 1 is from the 49 voters favoring Prasad (51-choose-2 * 49-choose-1 = 1275*49 = 62475 ways of doing this)
    So, in 62475+20825 = 83300 of the total 161700 lists (about 51.5%), Raghu appears as the winner in the opinion poll.

