USA: The Perfect Milk Machine: How Big Data Transformed the Dairy Industry

By Alexis Madrigal – Alexis Madrigal is a senior editor at The Atlantic. He’s the author of Powering the Dream: The History and Promise of Green Technology.

Dairy scientists are the Gregor Mendels of the genomics age, developing new methods for understanding the link between genes and living things, all while quadrupling the average cow’s milk production since your parents were born.

[Photo: Reuters]

While there are more than 8 million Holstein dairy cows in the United States, there is exactly one bull that has been scientifically calculated to be the very best in the land. He goes by the name of Badger-Bluff Fanny Freddie.

Already, Badger-Bluff Fanny Freddie has 346 daughters who are on the books and thousands more that will be added to his progeny count when they start producing milk. This is quite a career for a young animal: He was only born in 2004.

There is a reason, of course, that the semen that Badger-Bluff Fanny Freddie produces has become such a hot commodity in what one artificial-insemination company calls “today’s fast paced cattle semen market.” In January of 2009, before he had a single daughter producing milk, the United States Department of Agriculture took a look at his lineage and more than 50,000 markers on his genome and declared him the best bull in the land. And, three years and 346 milk- and data-providing daughters later, it turns out that they were right.

“When Freddie [as he is known] had no daughter records our equations predicted from his DNA that he would be the best bull,” USDA research geneticist Paul VanRaden emailed me with a detectable hint of pride. “Now he is the best progeny tested bull (as predicted).”

Data-driven predictions are responsible for a massive transformation of America’s dairy cows. While other industries are just catching on to this whole “big data” thing, the animal sciences — and dairy breeding in particular — have been using large amounts of data since long before VanRaden was calculating the outsized genetic impact of the most sought-after bulls with a pencil and paper in the 1980s.

Dairy breeding is perfect for quantitative analysis. Pedigree records have been assiduously kept; relatively easy artificial insemination has helped centralize genetic information in a small number of key bulls since the 1960s; there is a relatively small and easily measurable number of traits (milk production, fat in the milk, protein in the milk, longevity, udder quality) that breeders want to optimize; each cow works for three or four years, which means that farmers invest thousands of dollars into each animal, so it’s worth it to get the best semen money can buy. The economics push breeders to use the genetics.

The bull market (heh) can be reduced to one key statistic, lifetime net merit, though there are many nuances that the single number cannot capture. Net merit denotes the likely additive value of a bull’s genetics. The number is actually denominated in dollars because it is an estimate of how much a bull’s genetic material will likely improve the revenue from a given cow. A very complicated equation weights all of the factors that go into dairy breeding and — voila — you come out with this single number. For example, a bull that could help a cow make an extra 1000 pounds of milk over her lifetime only gets an increase of $1 in net merit while a bull who will help that same cow produce a pound more protein will get $3.41 more in net merit. An increase of a single month of predicted productive life yields $35 more.
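As a rough illustration of how such an index adds up, here is a minimal Python sketch that uses only the three dollar weights quoted above. The real net merit formula weighs many more traits, and the example bull and its trait values are invented for illustration.

```python
# Dollar weights quoted in the article; the real index covers many more traits.
WEIGHTS = {
    "milk_lb": 1.00 / 1000,        # $1 per extra 1,000 lb of lifetime milk
    "protein_lb": 3.41,            # $3.41 per extra pound of protein
    "productive_life_months": 35,  # $35 per extra month of productive life
}


def partial_net_merit(traits):
    """Sum weight * predicted trait change over the traits we have weights for."""
    return sum(WEIGHTS[name] * value for name, value in traits.items())


# Hypothetical bull: these trait values are made up for illustration only.
example_bull = {"milk_lb": 1000, "protein_lb": 10, "productive_life_months": 2}

print(round(partial_net_merit(example_bull), 2))  # 105.1
```

In this toy case, two extra months of productive life ($70) count for far more than a thousand extra pounds of milk ($1), which is the kind of weighting the single number hides.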

When you add it all up, Badger-Bluff Fanny Freddie has a net merit of $792. No other proven sire ranks above $750 and only seven bulls in the country rank above $700. One might assume that this is largely because the bull can help the cows make more milk, but it’s not! While breeders used to select for greater milk production, that’s no longer considered the most important trait. For example, the number three bull in America is named Ensenada Taboo Planet-Et. His predicted transmitting ability for milk production is +2323, more than 1100 pounds greater than Freddie’s. His offspring’s milk will likely contain more protein and fat as well. But his daughters’ productive life would be shorter and their pregnancy rate lower. And these factors, as well as some traits related to the hypothetical daughters’ size and udder quality, trump Planet’s impressive production stats.

One reason for the change in breeding emphasis is that our cows already produce tremendous amounts of milk relative to their forebears. In 1942, when my father was born, the average dairy cow produced less than 5,000 pounds of milk in its lifetime. Now, the average cow produces over 21,000 pounds of milk. At the same time, the number of dairy cows has decreased from a high of 25 million around the end of World War II to fewer than nine million today. This is an indisputable environmental win as fewer cows create less methane, a potent greenhouse gas, and require less land.

At the same time, it turns out that cow genomes are more complex than we thought: as milk production amps up, fertility drops. There’s an art to balancing all the traits that go into optimizing a herd.

While we may worry about the use of antibiotics to stimulate animal growth or the use of hormones to increase milk production by up to 25 percent, most of the increase in the pounds of milk an animal puts out over the pastoral days of yore comes from the genetic changes that we’ve wrought within these animals. It doesn’t matter how the cow is raised, in an idyllic pasture or a feedlot: either way, the animal of 2012 is not the animal of 1940 or 1980 or even 2000. A group of USDA and University of Minnesota scientists calculated that 22 percent of the genome of Holstein cattle has been altered by human selection over the last 40 years.

In a sense that’s very real, information itself has transformed these animals. The information did not accomplish this feat on its own, of course. All of this technological and scientific change is occurring within the social context of American capitalism. Over the last few decades, the number of dairies has collapsed and the size of herds has increased. These larger operations are factory farms that are built to squeeze inefficiencies out of the system to generate profits. They benefit from economies of scale that allow them to bring in genomic specialists and use more expensive bull semen.

No matter how you apportion the praise or blame, the net effect is the same. Thousands of years of qualitative breeding on family-run farms begat cows producing a few thousand pounds of milk in their lifetimes; a mere 70 years of quantitative breeding optimized to suit corporate imperatives quadrupled what all previous civilization had accomplished. And the crazy thing is, we’re at the cusp of a new era in which genomic data starts to compress the cycle of trait improvement, accelerating our path towards the perfect milk-production machine, also known as the Holstein dairy cow.

There are no more famous experiments in genetics than the ones undertaken by the Austrian monk Gregor Mendel on five acres in what is now the Czech Republic from 1856 to 1863. Mendel bred 29,000 pea plants and discovered the most basic rules of genetics without any knowledge of the underlying biochemical mechanics.

Smack dab in the middle of Mendel’s experiments, Charles Darwin’s Origin of Species was published, but we don’t have any record of intellectual mingling between the two men. Even the idea of a gene as an irreducible unit of inheritance wasn’t presented until 30 years after Mendel began his experiments. The term and field of genetics would not be fleshed out until William Bateson and company came along in the early 1900s. And its form, DNA, would not be proposed by James Watson and Francis Crick with indispensable help from Rosalind Franklin until 90 years after his last pea plant died. All this to say: Mendel was ahead of his time.

What he had going for him was a dedication to data, to quantification. His fundamental insight was statistical.

Here’s the simple version of what he did. Mendel took pea plants that reliably produced purple or white flowers when they self-pollinated. Then he crossbred them, carefully controlling how the plants reproduced. Now, one might expect that if you breed a pea plant with a purple flower and a pea plant with a white flower, you’d get progeny that were sort of mauve, a mix of the two colors. But what Mendel found instead is that you either got purple flowers or white flowers. Even more amazingly, sometimes breeding two purple flowers would yield a white flower. Among the first generation of crossbreeds, the mix of flower colors occurred at a roughly constant ratio of about 3:1, purple to white. If the traits of two plants were being mixed to generate the next generation, how could two purple flowers yield a white flower? And why would this ratio arise?

Mendel took a conceptual leap and hypothesized that each plant had two copies of its plans (i.e. genes) to make flower color (or any of six other traits he analyzed). If the plant received two copies of the dominant plan (purple), the flowers would, of course, be purple. If it received one of each, the dominant plan would still reign. But if the plant received two recessive plans, then the flowers of that pea would be white.
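The two-copies hypothesis can be checked by brute force. The following Python sketch enumerates every way two purple-flowered hybrids, each carrying one dominant and one recessive copy, can pass one copy to an offspring. The letters "P" and "p" and the function names are labels chosen here for illustration, not Mendel's notation.

```python
from collections import Counter
from itertools import product


def cross(parent_a, parent_b):
    """Count offspring genotypes from every pairing of one copy per parent."""
    return Counter("".join(sorted(pair)) for pair in product(parent_a, parent_b))


def flower_color(genotype):
    """'P' (purple) is dominant; only two recessive 'p' copies give white."""
    return "purple" if "P" in genotype else "white"


# Two purple hybrids, each with one dominant and one recessive copy.
offspring = cross("Pp", "Pp")

colors = Counter()
for genotype, count in offspring.items():
    colors[flower_color(genotype)] += count

print(dict(offspring))  # {'PP': 1, 'Pp': 2, 'pp': 1}
print(dict(colors))     # {'purple': 3, 'white': 1}
```

Three of the four equally likely combinations contain at least one dominant copy, which is where the 3:1 ratio of purple to white comes from, and why two purple parents can produce a white flower.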

The monk turned out to be right. For traits controlled by a single gene, things really do work as he predicted. Mendel’s insights became part of the central dogma of genetics. You can use the statistical method he used to calculate how likely someone is to get sickle cell anemia from her parents. In most genetics classes, Mendel is where it all starts and for good reason.

But it turns out that Mendel’s version of things doesn’t actually give a very clear picture of the kinds of things we care about most. “Mendel studied a few traits that happened to be controlled by a single gene, making the probabilities easier to figure out,” the USDA’s VanRaden said. “Animal breeders for many decades have used models that assume most traits are influenced by thousands of genes with very small effects. Some [individual] genes do have detectable effects, but many studies of plant and animal traits conclude that most of the genetic variation is from many little effects.”
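To make the contrast with the single-gene picture concrete, here is a small Python sketch of an additive model in the spirit VanRaden describes: an animal's predicted value is the sum of many tiny marker effects. Everything numeric here is an assumption for illustration (the marker count, the effect sizes, the random animal); it is not the USDA's actual model or data.

```python
import random


def genomic_value(genotype, effects):
    """Additive model: sum over markers of (allele copies, 0-2) * (marker effect)."""
    return sum(copies * effect for copies, effect in zip(genotype, effects))


rng = random.Random(0)  # fixed seed so the illustration is repeatable
n_markers = 1000        # hypothetical; far fewer than a real marker chip reads

# Hypothetical small effects, one per marker, centered on zero.
effects = [rng.gauss(0, 0.1) for _ in range(n_markers)]
# Hypothetical animal: 0, 1, or 2 copies of the tracked allele at each marker.
genotype = [rng.choice((0, 1, 2)) for _ in range(n_markers)]

total = genomic_value(genotype, effects)
largest = max(abs(c * e) for c, e in zip(genotype, effects))
print(f"predicted value: {total:.2f}")
print(f"largest single-marker contribution: {largest:.2f}")
```

No single marker decides the outcome in a model like this, so the dominant-recessive bookkeeping that works for pea flowers gives way to estimating and summing a long list of small effects.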

For dairy cows — or humans, for that matter — it’s just not as simple as the dominant-recessive single-gene paradigm that Mendel created. In fact, Mendel picked his model organism well. Its simplicity allowed him to focus in on the simplest possible genetic model and figure it out. He could easily manipulate the plant breeding; he could observe key traits of the plant; and these traits happened to be controlled by a single gene, so the math lay within human computational range. Pea plants were perfect for studying the basics of genetics.

With that in mind, allow me to suggest, then, that the dairy farmers of America, and the geneticists who work with them, are the Mendels of the genomic age. That makes the dairy cow the pea plant of this exciting new time in biology. Last week in the Proceedings of the National Academy of Sciences, two of the most successful bulls of all time had their genomes published.

This is a landmark in dairy herd genomics, but it’s most significant as a sign that while genomics remains mostly a curiosity for humans, it’s already coming of age when it comes to cattle. It’s telling that the cutting-edge genomics company Illumina has precisely one applied market: animal science. They make a chip that measures 50,000 markers on the cow genome for attributes that control the economically important functions of those animals.

[Image: A snippet from Illumina’s animal science fact sheet.]

***

Mendel may have worked with plants, but the rules he revealed turned out to be universal for all living things. The same could be true of the statistical rules that dairy scientists are learning about how to match up genomic data with the physical attributes they generate. The statistical rules that reflect the way dozens or hundreds of genes come together to make a cow likely to develop mastitis, say, may be formally similar to the rules that govern what makes people susceptible to schizophrenia or prone to living for a long time. Researchers like the University of Queensland’s Peter Visscher are bringing the lessons of animal science to bear on our favorite animal, ourselves.

Want to live for a very long time? Well, we hope to discover the group of genes that are responsible for longevity. The problem is that you have genomic data over here and you have phenotypic data, i.e. how things actually are, over there. What you need, then, is some way of translating between these two realms. And it’s that matrix, that series of transformations, that animal scientists have been working on for the past decade.

It turned out they were in the perfect spot to look for statistical rules. They had databases of old and new bull semen. They had old and new production data. In essence, it wasn’t that difficult to generate rules for transforming genomic data into real-world predictions. Despite — or because of — the effectiveness of traditional breeding techniques, molecular biology has been applied in the field for years in different ways. Given that breeders were trying to discover bulls’ hidden genetic profiles by evaluating the traits in their offspring that could be measured, it just made sense to start generating direct data about the animals’ genomes.

“Each of the bulls on the sire list, we have 50,000 genetic markers. Most of those, we have 700,000,” the USDA’s VanRaden said. “Every month we get another 12,000 new calves, the DNA readings come in and we send the predictions out. We have a total of 200,000 animals with DNA analysis. That’s why it’s been so easy. We had such a good phenotype file and we had DNA stored on all these bulls.”

They had all that information because for decades, scientists have been taking data from cows to figure out which bulls produced the best offspring. Typically, a bull with a promising pedigree would reach sexual maturity and his semen would be used to impregnate a selection of about 50 test cows. Those daughters would grow up and start producing milk a few years later. The data from those cows would be used to calculate the value of that now “proven” bull. People called the process “progeny testing” and it did not require that breeders know the exact genetic makeup of a bull. Instead, scientists and breeders could simply say: We do not know the underlying constellations of genes that make this bull so valuable, but we do know how much milk his kids will produce. They learned to use that data to predict who the best bulls were.

That meant that some bulls became incredibly sought after. The number two bull of the last century, Pawnee Farm Arlinda Chief, had more than 16,000 daughters, 500,000 granddaughters, and 2 million great-granddaughters. He’s responsible for about 14 percent of all the genetic material in all Holsteins, USDA scientists estimate.

“[In the past], we combined performance data (milk yield, protein yield, conformation data) with pedigree information, and ran it through a fairly sophisticated computing gobbledygook,” another USDA scientist, Curt Van Tassel, told a group of dairy farmers. “It spit out at the other end predicted transmitting ability, predicted genetic values of whatever sort. Now what we’re trying to do is tweak that black box by introducing genomic data.”

There are many different ways you could model the mapping of 50,000 genetic markers onto a dozen performance traits, especially when you have to consider all kinds of environmental factors. So the dairy breeders have been developing and testing statistical models to take all this stuff into account and spit out good predictions of which bulls herd managers should ultimately select. The real promise is not that genomic data will actually be better than the ground-truth information generated from real offspring (though it might be), but rather that the estimates will be close enough to real but save 3 to 4 years per generation. If you don’t have to wait for daughters to start cranking out milk, then you can shave those years off the improvement cycle, speeding it up several times.

Nowadays breeders can choose between “genomic bulls,” which have been evaluated based purely on their genes, and “proven bulls,” for which real-world data is available. Discussions among dairy breeders show that many are beginning to mix younger bulls with good-looking genomic data into their breeding regimens. How well has it gone? The first of the bulls who were bred from their genetic profiles alone are receiving their initial production data. So far, it seems as if the genomic estimates were a little high, but more accurate than traditional methods alone.

The unique dataset and success of dairy breeders now have other scientists sniffing around their findings. Leonid Kruglyak, a genomics professor at Princeton, told me that “a lot of the statistical techniques and methodology” that connect phenotype and genotype were developed by animal breeders. In a sense, they are like codebreakers. If you know the rules of encoding, it’s not difficult to put information in one end and have it pop out the other as a code. But if you’re starting with the code, that’s a brutally difficult problem. And it’s the one that dairy geneticists have been working on.

Their work could reach outside the medical realm to help us understand human evolution as well. For example, Kruglyak said, human population geneticists want to figure out how to explain the remarkable lack of genetic variance between human beings. “The typical [genetic] variation among humans is one change in a thousand,” he said. “Chimps, though they obviously have a much smaller population now, have several fold higher genetic diversity.” How could this be? Researchers hypothesize that human beings once went through a bottleneck where there were very few humans relative both to the current human population and the chimp population. Few humans meant that the gene pool was limited at some point in the pre-historical but fairly recent past. We’ve never recovered the diversity we might have had.

***

[Photo: The number-one ranked bull in the world. Credit: Kathy DeBruin.]

It might seem that Badger-Bluff Fanny Freddie is the pinnacle of the Holstein bull. He’s been the top bull since the day his genetic markers showed up in the USDA database and his real-world performance has backed up his genome’s claims. But he’s far from the best bull that science can imagine.

John Cole, yet another USDA animal improvement scientist, generated an estimate of the perfect bull by choosing the optimal observed genetic sequences and hypothetically combining them. He found that the optimal bull would have a net merit value of $7,515, which absolutely blows any current bull out of the water. In other words, we’re nowhere near creating the perfect milk machine.
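The idea behind that estimate can be sketched in a few lines of Python: score each stretch of the genome in every observed animal, keep the best observed value for each stretch, and add them up. This is only a toy version of the approach as described here, not Cole's actual method, and the three bulls, three segments, and dollar values are all invented for illustration.

```python
# Hypothetical dollar values of three genome segments in three observed bulls.
observed = {
    "bull_a": [120, -30, 45],
    "bull_b": [80, 60, -10],
    "bull_c": [-20, 25, 90],
}


def best_real_bull(bulls):
    """Highest total value among animals that actually exist."""
    return max(sum(values) for values in bulls.values())


def ideal_composite(bulls):
    """Take the best observed value at each segment, then add them up."""
    return sum(max(segment) for segment in zip(*bulls.values()))


print(best_real_bull(observed))   # 135
print(ideal_composite(observed))  # 270
```

Even in this tiny example the hypothetical composite is worth twice the best real animal, because no single bull happens to carry the best version of every segment.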

The problem, of course, is that genomes cannot really be cut and pasted together from the best bits. “When you go extremely far for one trait, you’re going to upset some of the other traits,” VanRaden said. Breeding is a messy (i.e. biological) process, no matter how technologically sophisticated the front end. After decades of breeding cows for milk production, people realized (to their dismay) that the ability to generate milk and the ability to have babies were negatively correlated. The more milk you tried to order up, the fewer babies your herd was likely to have. While we’re nowhere near the hypothetical limit for Holstein bull value, we do now know that nature is not so easily transformed without some deleterious effects. We may have factory farms, but these machines are still flesh and blood.

Except for Badger-Bluff Fanny Freddie and his fellow bulls, that is. Freddie is a disembodied creature, an animal that is more important as data than as meat or muscle. Though he’s been mentioned in thousands of web pages and dozens of trade industry articles, no one mentions where he was born or where the animal currently lives. He is, for all intents and purposes except for his own, genetic material that comes in the handy form of semen. His thousands of daughters will never smell him and his physical location doesn’t matter to anyone. He will be replaced very soon by the next top bull, as subject to the pressures of our economic system as the last version of the iPhone.