Example 2 - Campaign Finance
In this example we’ll introduce a few more key Inferno concepts:
- Inferno rules with multiple keysets
- field_transforms: input data casting
- parts_preprocess: a pre-map hook
- parts_postprocess: a post-reduce hook
- column_mappings: rename columns post-reduce
Inferno Rule
An Inferno rule to query the 2012 presidential campaign finance data (inferno/example_rules/election.py):
import re

from inferno.lib.rule import InfernoRule
from inferno.lib.rule import Keyset
from inferno.lib.rule import chunk_csv_stream


# an example field_transform
def alphanumeric(val):
    return re.sub(r'\W+', ' ', val).strip().lower()


# an example parts_preprocess that modifies the map input
def count(parts, params):
    parts['count'] = 1
    yield parts


# an example parts_preprocess that filters the map input
def candidate_filter(parts, params):
    active = [
        'P20002721',  # Santorum, Rick
        'P60003654',  # Gingrich, Newt
        'P80000748',  # Paul, Ron
        'P80003338',  # Obama, Barack
        'P80003353',  # Romney, Mitt
    ]
    if parts['cand_id'] in active:
        yield parts


# an example parts_postprocess that filters the reduce output
def occupation_count_filter(parts, params):
    if parts['count_occupation_candidate'] > 1000:
        yield parts


RULES = [
    InfernoRule(
        name='presidential_2012',
        source_tags=['gov:chunk:presidential_campaign_finance'],
        map_input_stream=chunk_csv_stream,
        partitions=1,
        field_transforms={
            'cand_nm': alphanumeric,
            'contbr_occupation': alphanumeric,
        },
        parts_preprocess=[candidate_filter, count],
        parts_postprocess=[occupation_count_filter],
        csv_fields=(
            'cmte_id', 'cand_id', 'cand_nm', 'contbr_nm', 'contbr_city',
            'contbr_st', 'contbr_zip', 'contbr_employer', 'contbr_occupation',
            'contb_receipt_amt', 'contb_receipt_dt', 'receipt_desc',
            'memo_cd', 'memo_text', 'form_tp', 'file_num',
        ),
        csv_dialect='excel',
        keysets={
            'by_candidate': Keyset(
                key_parts=['cand_nm'],
                value_parts=['count', 'contb_receipt_amt'],
                column_mappings={
                    'cand_nm': 'candidate',
                    'contb_receipt_amt': 'amount',
                },
            ),
            'by_occupation': Keyset(
                key_parts=['contbr_occupation', 'cand_nm'],
                value_parts=['count', 'contb_receipt_amt'],
                column_mappings={
                    'count': 'count_occupation_candidate',
                    'cand_nm': 'candidate',
                    'contb_receipt_amt': 'amount',
                    'contbr_occupation': 'occupation',
                },
            ),
        },
    ),
]
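To see how the field transform and the two preprocess hooks cooperate, the following plain-Python sketch pushes two hand-made records through `alphanumeric`, `candidate_filter`, and `count`. The `run_pipeline` helper and the sample records are hypothetical, and running the transforms before the hooks is an assumption of the sketch; it only mimics the generator chaining and is not Inferno's actual map implementation.

```python
import re


def alphanumeric(val):
    # collapse each run of non-word characters into one space, then lowercase
    return re.sub(r'\W+', ' ', val).strip().lower()


def count(parts, params):
    parts['count'] = 1
    yield parts


def candidate_filter(parts, params):
    active = ['P20002721', 'P60003654', 'P80000748', 'P80003338', 'P80003353']
    if parts['cand_id'] in active:
        yield parts


def run_pipeline(record, field_transforms, hooks, params=None):
    """Hypothetical helper: apply transforms, then chain the generator hooks."""
    parts = dict(record)
    for field, transform in field_transforms.items():
        if field in parts:
            parts[field] = transform(parts[field])
    stream = [parts]
    for hook in hooks:
        # a hook may yield zero records (filter) or one record (modify)
        stream = [out for item in stream for out in hook(item, params)]
    return stream


# made-up records for illustration
records = [
    {'cand_id': 'P80003338', 'cand_nm': 'Obama, Barack', 'contbr_occupation': 'RETIRED'},
    {'cand_id': 'P20002978', 'cand_nm': 'Bachmann, Michelle', 'contbr_occupation': 'RETIRED'},
]
transforms = {'cand_nm': alphanumeric, 'contbr_occupation': alphanumeric}
results = [out
           for rec in records
           for out in run_pipeline(rec, transforms, [candidate_filter, count])]
print(results)
```

Only the first record survives, because `P20002978` is not in the `active` list. Its name becomes `obama barack`, which matches the lowercase, comma-free candidate names in the job output.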
Input
Make sure Disco is running:
diana@ubuntu:~$ disco start
Master ubuntu:8989 started
Get the 2012 presidential campaign finance data (from the FEC):
diana@ubuntu:~$ head -4 P00000001-ALL.txt
cmte_id,cand_id,cand_nm,contbr_nm,contbr_city,contbr_st,contbr_zip,contbr_employer,contbr_occupation,contb_receipt_amt...
C00410118,"P20002978","Bachmann, Michelle","HARVEY, WILLIAM","MOBILE","AL","366010290","RETIRED","RETIRED",250...
C00410118,"P20002978","Bachmann, Michelle","HARVEY, WILLIAM","MOBILE","AL","366010290","RETIRED","RETIRED",50...
C00410118,"P20002978","Bachmann, Michelle","BLEVINS, DARONDA","PIGGOTT","AR","724548253","NONE","RETIRED",250...
Place the input data in Disco’s Distributed Filesystem (DDFS):
diana@ubuntu:~$ ddfs chunk gov:chunk:presidential_campaign_finance:2012-03-19 ./P00000001-ALL.txt
created: disco://localhost/ddfs/vol0/blob/1c/P00000001-ALL_txt-0$533-86a6d-ec842
Verify that the data is in DDFS:
diana@ubuntu:~$ ddfs xcat gov:chunk:presidential_campaign_finance:2012-03-19 | head -3
C00410118,"P20002978","Bachmann, Michelle","HARVEY, WILLIAM","MOBILE","AL","366010290","RETIRED","RETIRED",250...
C00410118,"P20002978","Bachmann, Michelle","HARVEY, WILLIAM","MOBILE","AL","366010290","RETIRED","RETIRED",50...
C00410118,"P20002978","Bachmann, Michelle","BLEVINS, DARONDA","PIGGOTT","AR","724548253","NONE","RETIRED",250...
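The rule's hooks index records by name (for example `parts['cand_id']`), so each CSV row is presumably turned into a dict keyed by `csv_fields`. The sketch below reproduces that idea with Python's standard `csv` module and the `excel` dialect. It is an approximation for illustration, not the code behind `chunk_csv_stream`, and because the sample row above is truncated it parses only the ten visible columns.

```python
import csv
import io

# The first ten names from the rule's csv_fields; the sample row is truncated
# after contb_receipt_amt, so the remaining columns are left out here.
visible_fields = [
    'cmte_id', 'cand_id', 'cand_nm', 'contbr_nm', 'contbr_city',
    'contbr_st', 'contbr_zip', 'contbr_employer', 'contbr_occupation',
    'contb_receipt_amt',
]

line = ('C00410118,"P20002978","Bachmann, Michelle","HARVEY, WILLIAM",'
        '"MOBILE","AL","366010290","RETIRED","RETIRED",250')

reader = csv.DictReader(io.StringIO(line), fieldnames=visible_fields,
                        dialect='excel')
parts = next(reader)

# quoted fields keep their embedded commas; every value is still a string
print(parts['cand_id'], '|', parts['cand_nm'], '|', parts['contb_receipt_amt'])
```

The quoted name `"Bachmann, Michelle"` stays a single field despite the embedded comma, which is why the rule specifies a CSV dialect instead of splitting on commas.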
Contributions by Candidate
Run the contributions by candidate job:
diana@ubuntu:~$ inferno -i election.presidential_2012.by_candidate
2012-03-19 Processing tags: ['gov:chunk:presidential_campaign_finance']
2012-03-19 Started job presidential_2012@533:87210:81a1b processing 1 blobs
2012-03-19 Done waiting for job presidential_2012@533:87210:81a1b
2012-03-19 Finished job presidential_2012@533:87210:81a1b
The output in CSV:
candidate,count,amount
gingrich newt,27740,9271750.98
obama barack,292400,81057578.81
paul ron,87697,15435762.37
romney mitt,58420,55427338.84
santorum rick,9382,3351439.54
The output as a table:
Candidate | Count | Amount |
---|---|---|
Obama Barack | 292,400 | $ 81,057,578.81 |
Romney Mitt | 58,420 | $ 55,427,338.84 |
Paul Ron | 87,697 | $ 15,435,762.37 |
Gingrich Newt | 27,740 | $ 9,271,750.98 |
Santorum Rick | 9,382 | $ 3,351,439.54 |
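The result above is consistent with grouping records on the keyset's `key_parts`, summing its `value_parts`, and renaming columns through `column_mappings`. A minimal sketch of that behavior follows; the `aggregate` helper and the three input records are made up for illustration, and Inferno's real map/reduce code is not shown here.

```python
from collections import defaultdict


def aggregate(records, key_parts, value_parts, column_mappings):
    """Hypothetical helper: group by key_parts, sum value_parts, rename columns."""
    totals = defaultdict(lambda: [0] * len(value_parts))
    for rec in records:
        key = tuple(rec[k] for k in key_parts)
        for i, name in enumerate(value_parts):
            totals[key][i] += rec[name]
    # columns without a mapping keep their original name
    header = [column_mappings.get(col, col) for col in key_parts + value_parts]
    rows = [list(key) + sums for key, sums in sorted(totals.items())]
    return header, rows


# made-up records, as they might look after the transform and count hooks
records = [
    {'cand_nm': 'paul ron', 'count': 1, 'contb_receipt_amt': 100},
    {'cand_nm': 'obama barack', 'count': 1, 'contb_receipt_amt': 250},
    {'cand_nm': 'paul ron', 'count': 1, 'contb_receipt_amt': 50},
]
header, rows = aggregate(
    records,
    key_parts=['cand_nm'],
    value_parts=['count', 'contb_receipt_amt'],
    column_mappings={'cand_nm': 'candidate', 'contb_receipt_amt': 'amount'},
)
print(header)
print(rows)
```

Because every record carries `count = 1`, summing that column gives the number of contributions per candidate, while summing `contb_receipt_amt` gives the total amount.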
Contributions by Occupation
Run the contributions by occupation job:
diana@ubuntu:~$ inferno -i election.presidential_2012.by_occupation > occupations.csv
2012-03-19 Processing tags: ['gov:chunk:presidential_campaign_finance']
2012-03-19 Started job presidential_2012@533:87782:c7c98 processing 1 blobs
2012-03-19 Done waiting for job presidential_2012@533:87782:c7c98
2012-03-19 Finished job presidential_2012@533:87782:c7c98
The output:
diana@ubuntu:~$ grep retired occupations.csv
retired,gingrich newt,8810,2279602.27
retired,obama barack,74465,15086766.92
retired,paul ron,9373,1800563.88
retired,romney mitt,12798,6483596.24
retired,santorum rick,1752,421952.98
The output as a table:
Occupation | Candidate | Count | Amount |
---|---|---|---|
Retired | Obama Barack | 74,465 | $ 15,086,766.92 |
Retired | Romney Mitt | 12,798 | $ 6,483,596.24 |
Retired | Gingrich Newt | 8,810 | $ 2,279,602.27 |
Retired | Paul Ron | 9,373 | $ 1,800,563.88 |
Retired | Santorum Rick | 1,752 | $ 421,952.98 |
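Every row in this output has a count above 1000, which is the effect of `occupation_count_filter`. The sketch below applies the same filter to one row taken from the output above and one made-up row with a small count. Calling the hook by hand like this is only an illustration of its logic, not how Inferno invokes it.

```python
def occupation_count_filter(parts, params):
    # the filter reads the column under its mapped name
    if parts['count_occupation_candidate'] > 1000:
        yield parts


rows = [
    # from the output above
    {'occupation': 'retired', 'candidate': 'santorum rick',
     'count_occupation_candidate': 1752, 'amount': 421952.98},
    # made-up row with a small count, for contrast
    {'occupation': 'example occupation', 'candidate': 'santorum rick',
     'count_occupation_candidate': 12, 'amount': 600.0},
]

kept = [out for row in rows for out in occupation_count_filter(row, None)]
print([row['occupation'] for row in kept])
```

The hook is a generator, so dropping a row simply means yielding nothing for it.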