Distributed Computing and MapRejuice

August 31st, 2010 Tony No comments

I like to say I’m into quantitative methods, Data visualization, Machine learning, and Data mining, but the truth is that I only know enough to scratch the surface on any one of these.  However, I am learning as quickly as I can.  And a big part of learning is keeping up with the communities.  I’ve hailed @DrewConway as my hero, but there are a fair number of folks just over my head in all of the above fields.  Whenever I find such a person, I do two things: Subscribe to their blog, and follow them on Twitter.  One such fellow, @DataJunkie, just posted something remarkable.

One of the biggest problems in modern data processing is the amount of information.  We can collect a metric boatload every second, but we can’t process and analyze it nearly so quickly.  One method of handling this problem is called distributed computing.  Basically, it means you recruit other people to donate their computer’s downtime to whatever noble calculation.  It has enormous applications everywhere in Computer Science, Bioinformatics, Cyber security, Economics, Meteorology…  The list would end when I ran out of distinct University departments.  One of the more famous of such processes is called SETI@home which recruits your computer to sift through a constant stream of radio transmissions, seeking potential contact attempts.

Traditionally, you would download a program which would run your processor to full capacity while it was running, filling all unused resources with its arcane calculations.  Well, now there’s a new way to do it.  It’s called MapRejuice, and its a simply extraordinary concept.  While everyone else is moving towards putting all content and programs in Browsers instead of Explorers, Distributed processing was basically stuck.  It required intimate information about your CPU in order not to step on your toes.  With MapRejuice, they’ve worked around that problem to build a distributed processing program that works out of a browser, through some JavaScript magic.  Totally Brilliant.

Oh, and if you’ve been reading this at my site (i.e. not through an RSS reader), you’ve been running the program, donating a sliver of your computer’s untapped potential to solve some big problem.  Thanks for helping to make the world a better-analyzed place.

Newcomb’s Paradox and Rational Choice

August 17th, 2010 Tony 1 comment

William Briggs, Statistician, posted an interesting decision theory problem a few day ago.  As if you didn’t guess from the title, it’s called Newcomb’s Paradox, and as Briggs laid it out, it goes a little something like this:

“You will play a game against an evil entity that can perfectly predict your future actions. Just for the sake of giving the evil one a name, let’s call him “Bill”. Bill is evil because he likes to taunt you with the possibility of making money, only to jerk that chance away from you at the last moment.

“Here’s what Bill does, as traditionally described. Later, we add clarifications and twists. Bill puts $1,000 in a clear box, and then also presents you with an opaque box inside of which might contain $1 million. You will be allowed to take both boxes, or just the opaque one. To clarify: you may not take the $1,000 without also taking the opaque box. Obviously, you may keep the contents of whatever box or boxes you take.

“Now for the evil part. If Bill predicts—and remember, he’s never wrong—that you will take just the opaque box, he will leave it full of money. But if he predicts that you will greedily take both boxes, then he will put nada in the opaque box[...]“

This game is totally bizarre from a game theoretic standpoint, because Bill has no incentive to play such a game, but that does not mean the tools of game theory cannot shed a little light on the issue.

Basically, there are two things you can do: Take the opaque box, or take both boxes.  Because Bill can perfectly predict what you will do, if you take the opaque box you will get $1 million, but if you take both you’ll only get $1000.  Now, there is a single way to rationally take both: greed, paired with a belief in free will.  If you believe you can walk into the room planning to take only the opaque box, the opaque box should be full.  But if the opaque box is full, why not take both boxes?  After all, once you’re in the room, Bill can’t change his mind about you and magically decide to remove the million.

This perspective is the Nash equilibrium.  However, it fundamentally underestimates Bill’s powers.  When it is stated his predictive powers are 100%, that means there’s no way around them.  The person who takes both boxes by planning to trick Bill will always get the sucker’s payoff.  But what if there was a way to get both boxes out of the room, with the million in the opaque?  You’d need to plan to follow the determination of a random process.  Bill can read you perfectly, but he can’t, for instance, predict the outcome of a coin flip.

So, you decide to walk into the room with the boxes, and flip a coin: heads, you take both; tails, you take only the opaque.  Because both are equally likely, Bill doesn’t know what to do.  He could put the million in every time, but then you’d just take both every time.  He could take the million out, but then you’d just take both every time.  The best Bill can do is devise a strategy to compel you to flip a coin, so he pulls out his own coin.  Heads, he pulls the million; tails, he leaves it.

The payoffs, (as seen in the matrix to the right) are all equally likely.  That means you can assign a .25 probability to each, and calculate the expected value of using this coin-flipping method.

EU(coin) = .25(1000) + .25(0) + .25(1,001,000) + .25(1,000,000) = $505,000

So the expected value of using the coin is $505,000.  Not too shabby, but a real downer when you could have an expected value of $1,000,000 just by committing to picking only the opaque box.  So your optimal strategy is to just pick the opaque box.

Or is it?

The coin has a fixed probability of .5 for each outcome {heads, tails}.  Perhaps there is some probability which will allow you to force your expected value over $1,000,000.  That would definitely be a superior strategy.  Briggs explained this a Quantum Number Maker Upper (QNMU).  While the name is cute, I’ll stick with the more orthodox Random Number Generator (RNG).  It’s basically a machine that you tell the probability with which you want to defect (take both boxes), and it will tell you to do so, or not.  In other words, tell it .99, and it will tell you to take the boxes 99 times out of 100.  Tell it .5, and it starts doing exactly the same thing the coin did: take both half the time, and opaque half of the time.

In order to check whether there is any value which will give us a better expected value than just ignoring this whole randomness business, we need to define a probability p as the probability that you’ll defect and take both.  Since Bill’s strategy is to use the same p-value as you to determine whether he should include the money or not, we’ll not be coy and just call both probabilities p.  The probability matrix includes terms of p (probability that you will take both, probability Bill will empty the box), and 1-p (the probability the you and he won’t, respectively).

If we couple this probability with the payoffs for each, we get a fascinating expected utility function:

EU(p) = p2(1000) + p(1-p)(0) + (1-p)(p)(1,001,000) + (1-p)2(1,000,000)

Which, if we expand, gives us:

EU(p) = 1000p2 + 1,001,000p – 1,001,000p2 + 1,000,000 – 2,000,000p + 1,000,000p2

And, canceling terms out, ultimately yields:

EU(p) = 1,000,000 – 990,000p

So, if we wanted a value of p which made the expected utility rise above 1,000,000, We just have a simple inequality to tackle:

1,000,000 < 1,000,000 – 990,000p

0 < -990,000p

0 > 990,000p

0 > p

Hold up!  p is a probability!  And a probability cannot be negative!  Clearly there does not exist any value of p which satisfies the constraints of our argument, meaning that no value of p will help you expect to get better than the million.  Thus the optimal strategy is to enter 0 in the RNG, or just let the bloody thing collect dust, while you collect your cool, even million dollars.

It turns out that most people get this.  I did a little survey experiment to see how people tended to respond to each of these situations (pure choice, coin tossing, RNG).  The survey wasn’t terribly scientific, but I think it served its purpose.  Of 24 respondents, 5 selected both boxes straightaway, meaning that these (approximately) 20% believed they could trick Bill, or that because the boxes were already filled or not, they had nothing to lose by taking both.

The coin option attracted only one respondent, while 18 chose to go straight for the opaque box, and again, five took both.  Suffice to say there was only one gambler in the crowd.  Likewise, when given the choice to set the probability of taking both boxes, only one person took the chance, while 16 still chose the opaque box and five picked both.  This suggests both popular risk-aversion, and trust for the game.  On that topic, one respondent wrote back: “if schrodinger’s cat is in the box instead, I’m going to be really upset…and slightly impressed[sic]“.  Very like Schrodinger’s cat, indeed.

Ultimately, I assert that people will tend to favor the safer behavior.  The Nash equilibrium (remove the million, take both boxes) is inadequate for predicting the probable response, either because of the fallacious existence of a perfect predictor, or the absent understanding and consideration of Bill’s preferences, or perhaps because of the predictive insufficiencies of the Nash equilibrium in the real world.  Whatever the reason, most people’s inclination to take the money and head for the hills is spot on.  Incidentally, if you ever should find yourself faced with this game, kindly remember who taught you this optimal strategy.

(Big shout-out to all my dear friends who braved the survey.  Thanks a lot, guys!)

Categories: Offbeat Tags: ,

Agent-Based Modeling makes the news!

July 27th, 2010 Tony Comments off

The Economist (via Computational Legal Studies) has run an excellent, albeit brief, article on the use of Agent-Based Modeling in macro economics.  The gist of the story is that the widely prevailing model of macroeconomics (dynamic stochastic general equilibrium) is not only a poor predictor of economic growth or shrinkage, but that its commonality amongst bankers and governments makes it dangerous.  If we all enact policies and behaviors based on the same expectations and those expectations are dead wrong, its easy to see how a panic can ensue.

The answer to this problem is to adapt some more flexible models.  Here’s where the ABMs come in.

Agent-based modelling does not assume that the economy can achieve a settled equilibrium. No order or design is imposed on the economy from the top down. Unlike many models, ABMs are not populated with “representative agents”: identical traders, firms or households whose individual behaviour mirrors the economy as a whole. Rather, an ABM uses a bottom-up approach which assigns particular behavioural rules to each agent. For example, some may believe that prices reflect fundamentals whereas others may rely on empirical observations of past price trends.

Crucially, agents’ behaviour may be determined (and altered) by direct interactions between them, whereas in conventional models interaction happens only indirectly through pricing. This feature of ABMs enables, for example, the copycat behaviour that leads to “herding” among investors. The agents may learn from experience or switch their strategies according to majority opinion. They can aggregate into institutional structures such as banks and firms. These things are very hard, sometimes impossible, to build into conventional models. But in an agent-based model you simply run a computer simulation to see what emerges, free from any top-down assumptions.

[...]The markets that [ABMs] generate are more like a turbulent river or the weather system, subject to constant storms and seizures of all sizes. Big fluctuations and even crashes are an inherent feature. That is because ABMs contain feedback mechanisms that can amplify small effects, such as the herding and panic that generate bubbles and crashes. In mathematical terms the models are “non-linear”, meaning that effects need not be proportional to their causes.

It’s wonderful to see Agent-based modeling getting some street cred–while the Statistics of the Social Sciences have become wildly common, other less popular techniques which give nonlinearities their due have fallen by the wayside.  While some scientists are trying to change this (Robert Axelrod comes to mind), I suspect it will be a long time before we see an entire issue of a Political Science Journal devoid of a paper that isn’t predicated entirely on some regression model.

Although, for anyone who is interested in ABMs as a tool for Political Science, seriously, check out Axelrod’s papers.  They’re awesome.

Graph Algebra 2: Time Operators

July 10th, 2010 Tony Comments off

I have just a little stub today.  There are some other things you can do with Graph Algebra, like measure things over time.  People who are familiar with modeling in the social sciences will readily recognize the distinction between discrete and continuous time, and these terms are no different for Graph Algebra.

Discrete time, E^-1

In discrete time, events are evenly spaced.  This is not to say that events occur on an inflexible schedule, but that you may be measuring things that happen periodically, and measuring them at the time of occurrence.  The ideal example is elections.  In America, they occur every two years.  However, in the United Kingdom, elections are called periodically but not on a fixed schedule.  You can still use the discrete time operator for British elections because you’re just measuring results or popularity or whatever at the time of the election and not measuring anything about the time in between elections.  Just to illustrate, elections in which a President is not being elected tend to be unpopular.  There’s an interesting dynamic model to show that.

From the points on this graph we can see voter turnout at various election years.  The oscillating line comes from the model we generate, based on code from the earlier posts on dynamic modeling.  Now, the voter turnout between elections is zero, but we don’t need to show that because it’s obvious.  If there’s nothing to vote on, no one will show up  to vote.  That’s why this is discrete time.  We simply jump from one period to the next.

In graph algebra, we have specially designated operators called time operator to deal with the passage of time.  The discrete time operator looks like a capital E raised to the negative first.  In short, it means that we step back one unit of time in order for the model to work.

What if we wanted to measure something that doesn’t jump from one point to the next?  For instance, what if we wanted to describe Presidential Approval ratings?  People judge the president every time they hear anything about him, so approval ratings can fluctuate literally from moment to moment.  Thus we measure Presidential approval in continuous time, and infer that in the times in between our measurement, Presidential approval fell somewhere in between the two values, or nearby.  This is continuous time.

Continuous time  Δ^-1

Continuous time is a little more complicated.  There’s calculus involved.  I won’t go into it, but for the calc-inclined, think of continuous time as the limit as the distance between points closes to zero.  We represent differences (as most sciences do) with the Greek letter delta, Δ, like so:

ΔY(t) = Y(t) – Y(t-1)

However, in order to really sink our teeth into continuous time models, we need a very different operator.  If Δ is a difference operator, then we need Δ^-1 (reads “delta inverse”).  This is sometimes called a summation operator, because it undoes the difference operator, exactly the way addition can undo subtraction.

At this point, I would normally introduce some model to demonstrate the use of these tools, but there’s something I haven’t really explained yet, and in order to make this stuff meaningful, we’ll really need to know about it.  It’s called feedback.  Stay tuned.

Categories: Polisci Tags:

Probability of World Wars

July 5th, 2010 Tony Comments off

This is not a graph algebra post.  I was hoping to run straight through my graph algebra series, but this cropped up, and I just had to say something.  The Pew Research Center performed a survey polling people about the fate of the world in 2050. The results are fascinating, but there’s one in particular I’d like to pull apart. 58% of people believe there will likely be another world war between now and then.

Here’s the thing, though: We can calculate the probability that such a war will occur within this time frame.  Back in the 40’s, a Meteorologist named Lewis Fry Richardson figured out that armed conflict follows a Power Law Distribution.  The bigger the war, the less likely such a war will ever occur.  Thus we can calculate the probability that some random war meets the scale of a world war (more than 10 million casualties).  Harvard’s Lars Erik Cederman replicated this result in a paper in which he took the Correlates of War for an even more accurate fit than Richardson had the resources to even dream about.  He generated this graph to describe it.

So, let’s assume that the term “World War” means a war in which the scale of casualties is measured in the tens of millions, like the prior two world wars.  That means that for the purposes of the above equation, log s = 7ish.  Plugging this into our equation:

log P(S>s) = 1.27 – 0.41 log s
log P(S>s) = 1.27 – 0.41 (7)
log P(S>s) = -1.6
P(S>s) = 0.0251188643

Thus the probability that any random war will reach the scale of a World War is 2.5%.  OK, so what about those random wars?  The Correlates of War project lists 86 interstate conflicts that had 1000 or more fatalities, over the course of 185 years (1816-2001).  That’s a rate of approximately (86/185 =) .465 wars per year.  So we should expect (.465 * 40 =) about 18 or 19 wars.  Now, what is the probability that one of those will be World War III?

This is an interesting variation on the Birthday Problem.  We can’t calculate the probability that the big war is in our upcoming wars directly.  However, we can calculate the probability that our set of predicted wars does not contain WW3.  Here’s how it works:  The probability of the first war being the big war is approximately 2.5%.  So the probability of the second war being the big war is 2.5%.  And so on.  Now, the probability that the first war or the second war is the big war is not 2.5% + 2.5% (as is the intuition).  If that were the case, we would definitely see a world war within 40 wars, and that is hardly the case.  However, the probability that neither the first war nor the second war is the big war is 97.5% * 97.5% = 95.1%.  So, a generic equation for the probability that a World War does not exist within n wars would be

P(World War not in n wars) = .975^n

Now, from here it’s just a small logical step to the probability that a World war occurs within n wars.  It either doesn’t happen (the probability we just calculated), or it does (the probability that’s left).  So a generic equation for the probability that some war in n wars is a World War is

P(World War in n wars) = 1-(.975^n)

We can use this equation straight up to figure out if our 18 or 19 wars contain a world war.  Just set n = 18 or 19!  In this case, the actual n is about 18.6, but you try to fight .6 wars and tell me how it works out.  However, I’ll stick to that 18.6 for the equation, and run with it.  So there’s a 1-(.975^18.6) = 37.4% chance we will see an earth-shattering world war in any given 40 year span, the next 40 years included.  Personally, I’m of the camp that thinks it won’t happen.  But then, going with your gut is just punditry.  Better to say I’m 62.6% certain it won’t happen.  58% of the population is probably wrong.

References:

  1. Cederman, L. E. (2003). Modelling the size of wars: from billiard balls to sandpiles. American Political Science Review, 97, 135–150.
  2. Richardson, L.F. (1960). Statistics of deadly quarrels. Pacific Grove, CA: Boxwood Press.

Graph Algebra 1: An Introduction

July 4th, 2010 Tony Comments off

Graph Algebra is, at its core, a visual method of representing social systems.  So why muck about with it?  We could just as easily draw a conventional flowchart.  Well, flowcharts don’t give us a precise mathematical equivalence.  In other words, we can translate Graph Algebraic diagrams into equations that model social systems.  And that’s ballin’.

Graph Algebra works like this:  Start with inputs. Inputs are kind of like independent variables.  Kind of.  For instance, if I wanted to model the creation of mud, I would have two inputs: water and dirt.  If you add these things together, you get mud.  A very basic Graph Algebraic representation would look like this:

Note we have the two inputs on the left, and the output (Mud) on the right.  This is merely a convention which allows us to read a graph algebraic system in much the same way we would read a line of text, from left to right.  The middle figure should be relatively self-explanatory.  It’s addition!  OK, not terribly impressive.  As an equation, this diagram is equivalent to Rain + Dirt = Mud.  Qualitatively, this makes sense.  However, we can do better.

Before we do, however, a note on operators: the number of things which you can add with an addition operator is not limited to two.  With this simple fact, I have told you everything you need to know in order to represent Multivariate Regression in Graph Algebra.  That’s right: the tool the great Steven Levitt called the “Economists’ Favorite Trick” can be represented with the paltry amount of Graph Algebra that you, my friend, can now scrawl on the back of anything.  It looks a little like this:

Cool.  Now back to our mud.  What if I wanted to measure the amount of mud to expect?  Well, we’re going to need a recipe.  Let’s say that for every cubic meter of dirt, we need 50 litres of water to generate a cubic meter of good mud.  Well, the ratio  is a constant.  So, If I multiply the amount of water I have (in litres) by 1/50, then I can calculate exactly how much mud I’ll get, based on the amount of rainwater and mud.  We denote multiplication as another value on the same path as the original value, like this:

In this case, 1/50 is what we call an operator of proportional transformation.  These take units that are incompatible (litres/water, cubic metres/dirt) and transform them to compatible measures (cubic metres of mud’s worth of water, cubic meter of dirt).  But frequently we don’t actually know what the operators of proportional transformation are when we are drawing our graph algebraic diagrams.  So, instead of putting in the number, we would normally represent this value with a lowercase, italicized letter.  I’ll call it a.  Thus the equation this diagram generates is Dirt + a(Water) = Mud

To clean up the conventions a little, Dirt is actually the amount of dirt, Water is the amount of water, and Mud is the amount of mud.  Better to represent these as the single letter variables D, W, and M respectively.  So, our final diagram and equation, 

D + aW = M

I realize this is all hardly Earth-shattering, but what if I said you could scale these techniques for far more complex processes?  Next time: how feedback loops give Graph Algebraic models relevance, and why that dynamic modeling primer was important.

Dynamic Modeling 3: When the first-order difference model doesn’t cut it

June 13th, 2010 Tony Comments off

Data must be selected carefully.  The predictive usefulness of the model is grossly diminished if outliers taint the available data.  Figure 1, for instance, shows the Defense spending (as a fraction of the national budget) between 1948 and 1968.

Note how the trend curve (as defined by our linear difference model from the last post: see appendix for a fuller description) is a very poor predictor.  Whatever is going on here isn’t a first-order process.  So, let’s neglect the model entirely for a moment.  The huge variations in spending between 1950 and 1952 indicate there were years within the selected time span for which the defense spending dramatically increased because of some exogenous shock, and then spending trended downwards.

One way we can judge the usefulness a little more scientifically is to run a regression on the differences.  In other words, Plot Y(t+1) on top of Y(t), and insert the regression line.  Check it out:

In general, the regression is a pretty good fit for the clustered points in the middle.  However, we also have some nasty outliers.  Something about this data isn’t a first-order process.  Wikipedia to the Rescue.  It is history that inconveniently interrupts our model, causing these outliers.  President Harry Truman cut back military spending in the wake of World War II.  However, any hope Truman may have had for shifting the national focus away from foreign military affairs was ruined by the onset of the Korean War late in 1950.  Thus we see the large spike in spending evident in Figure 1.  A predictive first-order model which spans this event with just these data will be limited in its effectiveness.  The grossly underfitted line of Figure 1 is little better than useless for indicating the spending during any given year.  A much more effective model would start by following the onset of the Korean War (a completely unpredictable event if prediction were solely based on these data), and trace the evolution of the expenditure as it decreased from its start-of-conflict high in 1952, as we see in Figure 3.

It ain’t perfect, but it’s definitely much better.  So, dynamic modeling with first-order linear difference equations has an enormous array of applications.  However, it is easy to be seduced by the numbers without carefully considering the data and the substantive implications of these findings.  Any inattentiveness in this respect may easily lead to meaningless, incorrect, or downright silly conclusions.

Next Time: Saving the code, changing the method.

References:

Code adapted from http://www.courtneybrown.com/classes/ModelingSocialPhenomena/Assignments/Assignment2NationalDefenseOutlaysCourtneyBrownMathModeling.htm

Dataset taken from http://www.courtneybrown.com/classes/ModelingSocialPhenomena/Assignments/Assignment2CourtneyBrownMathModeling.htm

Appendix:

This is the R code to generate these Graphs:

df <- read.csv(file="http://nortalktoowise.com/code/datasets/Defense.csv", head=TRUE, sep=",")
attach(df)
lagvar <- function(x,y){return(c(rep(NA, y),x[-((length(x)-y+1):length(x))]))}
lagvar1 <- lagvar(defensespending,1)
model <- lm(defensespending ~ lagvar1)
y2 <- 0
t <- 0
y1 <- .3
a <- model$coefficients[[2]]
b <- model$coefficients[[1]]
timeserieslength <- nrow(df)
for (i in 1:timeserieslength) {
	y2[i] <- (a*y1[i])+b
	t[i] <- i
	if (i < timeserieslength) y1[i+1]=y2[i]}
plot(t, defensespending, xlab="Years", ylab="Defense Spending as Fraction of Budget", main="Figure 1: U.S. Defense Expenditure, 1948-68", pch=19)
lines(t, y2, lwd=2)
windows()
plot(lagvar1, defensespending, xlab="", ylab="", pch=19)
title(xlab="Y", ylab="Y(t+1)", main="Figure 2: Plot of the first differences", cex=1.5, col="black", font=2)
abline(model, lwd=2)
windows()
newdf <- df[which(year>1951, arr.in=TRUE),]
lagvar2 <- lagvar(newdf$defensespending,1)
model2 <- lm(newdf$defensespending ~ lagvar2)
y2 <- 0
t <- 0
y1 <- .68
a <- model2$coefficients[[2]]
b <- model2$coefficients[[1]]
timeserieslength <- nrow(newdf)
for (i in 1:timeserieslength) {
	y2[i] <- (a*y1[i])+b
	t[i] <- i
	if (i < timeserieslength) y1[i+1]=y2[i]}
plot(newdf$year, newdf$defensespending, xlab="Years", ylab="Defense Spending as Fraction of Budget", main="Figure 3: U.S. Defense Expenditure, 1952-68", pch=19)
lines(newdf$year, y2, lwd=2)

Please forgive my poor style (reusing variable names and whatnot).  It works, and that’s enough for me at the moment.

Dynamic Modeling 2: Our First Substantive Model

May 30th, 2010 Tony 2 comments

(This is the second of a series of ongoing posts on using Graph Algebra in the Social Sciences.)

First-order linear difference equations are powerful, yet simple modeling tools.  They can provide access to useful substantive insights to real-world phenomena.  They can have powerful predictive ability when used appropriately.  Additionally, they may be classified in any number of ways in accordance with the parameters by which they are defined.  And though they are not immune to any of a host of issues, a thoughtful approach to their application can always yield meaningful information, if not for discussion then for further refinement of the model.

Let’s look at that example from the last post:

OK, time to reveal the secret.  Here’s how to do it:

df <- read.csv(file="http://nortalktoowise.com/code/datasets/Electricity.csv", head=TRUE, sep=",")
attach(df)
lagvar <- function(x,y){return(c(rep(NA,y),x[-((length(x)-y+1):length(x))]))}
lagvar1 <- lagvar(electricalusage,1)
model <- lm(electricalusage ~ lagvar1)
y1 <- 290
y2 <- 0
t <- 0
a <- model$coefficients[[2]]
b <- model$coefficients[[1]]
timeserieslength <- nrow(df)
for (i in 1:timeserieslength) {
y2[i] <- (a*y1[i])+b
t[i] <- i
if (i < timeserieslength) y1[i+1]=y2[i]}
plot(year, electricalusage, xlab="Year", ylab="Electricity Usage (in per capita KWh)", main="Figure 1: Annual Electricity Usage, 1920-70", pch=19)
lines(year, y2, lwd=2)

Alright, this probably needs a little bit of explaining.  This first few lines are just getting the data from the URL, and attaching the set so we can call the variable names directly.  I’ll have to justify that weird function.  Instead of breaking it down character for character, I’m just going to get away with explaining what it does.  lagvar is function that takes a table, looks at the variable you tell it to (in this case, electricalusage), and copies it into a new array.  Why would you want that?  Well, it has the added flexibility of delaying the variable by a number of rows (which you can readily specify.  This one, for instance, takes our dataset:

> df
year electricalusage
1  1920             339
2  1921             347
3  1922             359
4  1923             368
5  1924             378
...
47 1966            5265
48 1967            5577
49 1968            6057
50 1969            6571
51 1970            7066

and generates lagvar1, which “lags” electricalusage by one row:

> newdata
year electricalusage lagvar1
1  1920             339      NA
2  1921             347     339
3  1922             359     347
4  1923             368     359
5  1924             378     368
...
47 1966            5265    4933
48 1967            5577    5265
49 1968            6057    5577
50 1969            6571    6057
51 1970            7066    6571

This is extremely useful for calculating differences over time.  Then we run a simple linear regression: electricalusage as explained by lagvar1.  However, the regression is not the end-all of this process (and a million freshman statistics students gasp in horror).  We just run the regression to estimate our slope and intercept (a and b respectively).  The rest of the code if effectively identical to the example from my last post, though I must make one small confession: you might have noticed that y1 is a constant, while the other parameter variables are called from the dataset or linear model.  The truth is that I don’t know an easy mathematical way to pick a perfect y1.  It should be pretty close to electricalusage at the beginning of the graph, which means 339-ish.  It’s actually a bit lower (290) because you must pick a value which represents the electricalusage in a place where the line doesn’t go, just the tiniest bit behind the graph.  In other words, just estimate it.  If your curve is off-target, just fiddle around with the y1 value until it looks right.  But if your y1 is crappy, your prediction curve is still going to beat the hell out of the linear model:

abline(model, lwd=2)

So, what’s Y* in this graph?  Mathematically it is b/(1-a) = 64, but the substantive implication is silly.  The close fit seen in figure 1 is compelling for predictive accuracy (and this may indeed be a fine graph for future predictions), but taking this model at face value still leaves us with a completely incorrect substantive conclusion: Human beings back to the beginning of time have had about 64 kWhrs at their disposal annually.  While this may not play a meaningful part in any model which incorporates these data, it is an example of the need to at least be aware of the substantive implications of a model.

Next time: When a first-order linear difference equation doesn’t cut it.

References:

Code adapted from http://www.courtneybrown.com/classes/ModelingSocialPhenomena/Assignments/Assignment2CourtneyBrownMathModeling.htm

Dataset from http://www.courtneybrown.com/classes/ModelingSocialPhenomena/Assignments/Assignment2ElectricEnergyCourtneyBrownMathModeling.htm

Dynamic Modeling 1: Linear Difference Equations

May 28th, 2010 Tony Comments off

(This is the first in a series on the use of Graph Algebraic models for Social Science.)

Linear Difference models are a hugely important first step in learning Graph Algebraic modeling.  That said, linear difference equations are a completely independent thing from Graph Algebra.  I’ll get into the Graph algebra stuff in the next post or two, but for now bear with me.  Linear Difference models are very cool.

Linear difference models are basically recursively-calculated functions.  They generally take the form:

Y(t+1) = aY(t) + b

Where t indicates a period in time, t+1 is the next period in time, and a and b are constants.  This is strikingly similar to a much more familiar equation to anyone with high-school Algebra I.  Does Y = mx + b ring any bells?  It’s the same idea here.  However, here’s the weird thing: this method draws curves.

Wait, What?

Yup, curves.  Despite the name, we can use linear difference equations to generate the above graph (and many other cool ones).  The idea is that because you’re recursively calculating the operative equation, the equation represents a straight line between two discrete points, t and t+1.  And the line between the next two points will be different.  Thus, when we literally connect the dots, we extrapolate a curve-like picture.  (Those with a background in Calculus will recognize how this is simply an approximation, and that as the distance between the points is shrunk towards zero, a true curve will emerge.  However, in order to do this, we’ll need to talk a little about the calculus-equivalent of Difference equations, which are of course Differential Equations–another topic for later).

Here’s how you do it in R.

a <- -.6 #slope
b <- .3  #intercept
timeserieslength <- 10 #Number of iterations
y1 <- .3 #initial Value
y2 <- 0
t <- 0
for (i in 1:timeserieslength) {
     y2[i] <- (a*y1[i])+b
     t[i] <- i
     if (i < timeserieslength){y1[i+1]=y2[i]}}
plot(t, y2, type="o", ylab="Y", xlab="time", pch=20, lwd=2)
title(main = list("Figure 1: Y over time", cex=1.5, col="black", font=1))

If you run the code as-is, you’ll notice something interesting:  The code doesn’t match the graph shown above.  In fact, this is a completely manufactured example, with no substantive basis.  This is a by-product of the parameters you set the model to run with.  The variables you can alter to play around with different graphs are pretty obvious: a, b, timeserieslength, and y1.

  • a is the same a as in the equation.  It’s simply the slope. When we start using this technique to derive substantive information, we can parse this value out of a linear model of the data.
  • b is also the same as is seen in the equation.  It is the “y-intercept,” as every high-school algebra student knows.  This can also be automatically extrapolated from a linear model.
  • timeserieslength is the number of values for which you want this model to run.  If you’re measuring annual GDP over 30 years, this value should be 30.  If you’re measuring weekly rainfall over a year, this should be 52.  Easy, right?
  • Y(1) is the initial condition.  Frequently equal to the b value, but not always.  Not equal to b when you want to start the calculation at a t value that is not zero.

So this is a good start.  But let’s think about a a little bit.  If a is negative, what happens?  Well, if a > -1, you’ll have an interesting little oscillation that eventually hits an equilibrium.  If a < -1, the oscillation increases as t increases, meaning you can follow t backwards to equilibrium, or forwards to increasing divergence.  Now, the special case: what about a =  -1?  This one is cool.  It bounces back and forth between two constant values.  Very unusual, very useful. Here are some examples with their parameters:

a = -.6

a = -1.6

a = -1

A note about that equilibrium I just mentioned.  We can calculate it deterministically!  Check this out:

Equilibrium is the value Y* such that Y* = Y(t) = Y(t+1).
Substituting Y* into the first equation, Y* = aY* + b
Solving for Y* gives us Y* = b/(1-a)
Thus, for any linear difference equation (such that a ≠ 1), the limit as Y(t) approaches equilibrium is the Y* shown above.

Stay tuned.  Next time I’ll show how to parse useful information out of a bivariate regression to make substantive models.

References:

Code Adapted from http://courtneybrown.com/classes/ModelingSocialPhenomena/Assignments/Assignment1CourtneyBrownMathModeling.htm

Brown, Courtney. “An Introduction to Linear Difference Equations.” Presentation for POLS 490, Emory University, Spring 2010.  Available Online, http://courtneybrown.com/classes/ModelingSocialPhenomena/Presentations/DifferenceEquationsIntroduction.ppt

Huckfeldt, R. Robert, C. W. Kohfeld, and Thomas W. Likens. 1982. Dynamic Modeling: An Introduction. Newbury Park, California: Sage Publications. Series: Quantitative Applications in the Social Sciences, Number 27.

Categories: Polisci, R Tags:

Good data, bad visualization.

May 24th, 2010 Tony Comments off

So I ran into a post on Nukes of Hazard that bugged me.  The title, specifically, gives me a headache: A Pretty New Pie Chart.  Every data junkie and his/her dog have posted this video, but I contend it can’t be seen widely enough.

Take particular note of the presenter’s claim, “Pie Charts suck.”  There are a lot of opinions to back up this belief, mine included.  The “Pretty New Pie Chart” NoH highlighted came from Arms Control Center, to describe the proportion of global defense spending.

There are plenty of problems with this portrayal of the data, but the one I want to rail on is the fact that Global defense spending isn’t a fixed asset to be divided up.  A much more significant visual effect would be a bar graph (or more akin to my own sense of style, a horizontal bar graph).

Exact same data, only much clearer.  I don’t even claim this is perfect (could adjust the x-axis a little, format text better, etc…), but it takes no more know-how than the pie chart.  Five minutes in Excel and the world is a much clearer place.

Categories: Offbeat Tags: