Google swallows 11,000 novels to improve AI’s conversation

As writers learn that the tech giant has processed their work without permission, the Authors Guild condemns ‘blatantly commercial use of expressive authorship’

When the writer Rebecca Forster first heard how Google was using her work, it felt like she was trapped in a science fiction novel.

“Is this any different than someone using one of my books to start a fire? I have no idea,” she says. “I have no idea what their objective is. Certainly it is not to bring me readers.”

After a 25-year writing career, during which she has published 29 novels ranging from contemporary romance to police procedurals, the first instalment of her Josie Bates series, Hostile Witness, has found a new reader: Google’s artificial intelligence.

“My imagination just didn’t go as far as it being used for something like this,” Forster says. “Perhaps that’s my failure.”

Forster’s thriller is just one of 11,000 novels that researchers including Oriol Vinyals and Andrew M Dai at Google Brain have been using to improve the technology giant’s conversational style. After these books were fed into a neural network, the system was able to generate fluent, natural-sounding sentences. According to a Google spokesman who didn’t want to be named, products such as the Google app will be much more useful if they can capture the nuance of language better.

“For the moment, the research is just a proof of concept,” the spokesman continues via email, “but these methods could help Google understand and produce a broader, more nuanced range of text for any given task.”

“We could have used many different sets of data for this kind of training, and we have used many different ones for different research projects,” he adds. “But in this case, it was particularly useful to have language that frequently repeated the same ideas, so the model could learn many ways to say the same thing – the language, phrasing and grammar in fiction books tends to be much more varied and rich than in most nonfiction books.”

The only problem is that they didn’t ask. The Google paper says that the novels used in this research were taken from the Books Corpus, citing a 2015 paper by Ryan Kiros and others which describes how the authors collected a corpus of 11,038 books from the web, describing them as “free books written by [as] yet unpublished authors”. It’s a collection that has been used by other researchers working in artificial intelligence and which is currently available for download in its entirety from the University of Toronto.

Forster says that she always appreciates an interesting use of words, but while Hostile Witness is available to download for free, no one asked her permission to use her novel as raw material to train a computer.

“Perhaps I’m still thinking in the old way, that a reader will read my book – it didn’t even occur to me that a machine could read my book. What I found curious was that these were referred to as ‘free books written by as yet unpublished authors’, because my state is very different,” she says.

Like many of the novels in the Books Corpus collection, the edition of Hostile Witness used in the research was published on Smashwords and includes a copyright declaration that reserves all rights, specifies that the ebook is “licensed for your personal enjoyment only”, and offers the reader thanks for “respecting the hard work of this author”. While Forster says she’s no lawyer, the spirit of this declaration is clear: “You hope that your work would be respected by readers.”

“I take great pride in my craft, and perhaps it was chosen because of that. Which would be great. Or perhaps it was chosen because it was there, because it was free?”

Another writer whose work has been used in the Google Brain research is Erin McCarthy, the author of more than 28 novels. The first volume of her Fast Track series, published by Penguin Random House’s Berkley Books imprint, is also available for free online, but McCarthy says that Google didn’t get in touch with her or ask for permission to use Jacked Up in their research into AI. She’s fascinated to hear that romance novels are being used to improve the search conglomerate’s ability to speak.

“There is a reason they are the bestselling genre in the US and I believe it’s because they feel conversational themselves,” McCarthy says. “It’s real life turned up a notch. Realism overlying a fantasy.”

“The flow of the dialogue is very important,” she continues. “I am very cognizant of using modern diction and age-appropriate word choices. If my female character is 24 she’s not going to speak in a formal manner. Conversations between the hero and heroine have realistic word choices, but there is additionally an element of fantasy there. What they want a hero to say, but what might not actually occur in real life. That’s what readers want and expect from a romance novel.”

McCarthy isn’t sure how to respond to the idea that her work has been used for an entirely different purpose to the one she intended, a purpose that may result in services to make the tech giant a lot of money.

“It’s hard to gauge the use of my work and the exact purpose for its use without having seen it in action,” she says. “My assumption would be they purchased a copy of the book originally. If they haven’t, then I would imagine the source of the content, as intellectual property, should be properly attributed and compensated for the general health of the creative community.”

Far from offering proper attribution or any compensation, the Google paper avoids any suggestion that the novels used in the research were written by real people, describing the books only as “a collection of text from 12k ebooks, mostly fiction”.

Forster is equally adamant that writers whose work has been used to gain a commercial advantage should reap a portion of the rewards, but isn’t holding her breath for any payment.

“If there’s one thing that’s niggling at me it’s that I would have liked to have known,” she says. “With all the technology at their fingertips, then it wouldn’t have been too hard to let everyone know.”

According to Mary Rasenberger, executive director of the Authors Guild, this “blatantly commercial use of expressive authorship” comes as no surprise. “We’ve seen this movie before.”

The Guild has been in dispute with Google since 2005, arguing that the company’s project to digitise library books was a “plain and brazen violation of copyright law”. Google Books won in 2013, with the district court ruling that “all society benefits” from the project, a decision that the supreme court declined to review earlier this year.

“Why shouldn’t authors be asked permission, or even informed – not to mention compensated – before their work is used in this manner?” Rasenberger asks. “There’s no doubt the company has the means to do so.”

Google wouldn’t say whether getting hold of 11,000 authors was beyond their capacities, or if they have any plans to reward the writers, or if the people whose expertise was harvested to train their network were ever considered as individuals. “While attribution isn’t required,” the spokesman says via email, “the researchers clearly identify where they got the data.”

“The machine learning community has long published open research with these kinds of datasets, including many academic researchers with this set of free ebooks – it doesn’t harm the authors and is done for a very different purpose from the authors, so it’s fair use under US law.”

But Rasenberger isn’t convinced.

“The research in question uses these novels for the exact purpose intended by their authors – to be read,” she argues. “It shouldn’t matter whether it’s a machine or a human doing the copying and reading, especially when behind the machine stands a multi-billion dollar corporation which has time and again bent over backwards devising ways to monetise creative content without compensating the creators of that content.”

Rasenberger adds that nobody knows how books will be read or used in the future, which is why the Authors Guild is proposing that digital uses should be allowed under a licensing system. But for the moment, Google is extracting immense value from the creative efforts of thousands of authors and looking the other way.

For Forster, the lack of any proper attribution speaks volumes. “If they’re not mentioning the authors,” she says, “then maybe they’re not thinking of it in terms of it being someone’s work.”

She never imagined her work would wind up as being part of someone else’s dataset, as raw ingredients to satisfy a machine’s hunger for information, but she’s been around long enough to know that what you hope for isn’t always what you get.

“I would have loved to have been part of the discussion of this project, and to have known how it was going to be used,” she says. “But I’d also like to be thought of as intelligent enough to be able to make a decision about the end product.”

‘How I accidentally became a poet through Twitter’ – BBC News

Brian Bilston has been dubbed the “unofficial poet laureate of Twitter”, but he stumbled into writing poetry on social media. Here he explains the power of online verse.

It started with a tweet. I never thought it would come to this.

I’m not even sure it was a poem. More of a play on words, each one carefully selected to fit into the 140-character constraint of a tweet.

It went:

[Image: the original tweet]

I sent it out – and then went out. By coincidence, to see a poet – a proper one, the poet laureate, Carol Ann Duffy, no less – who was giving a reading nearby.


I had turned off my phone so it wasn’t until a few hours later, when I’d returned home and turned it back on again, that I saw my Twitter notifications had gone crazy – my tweet had been shared several hundred times, I had 200 new followers. It was the kind of reaction I was utterly unaccustomed to getting.

A few weeks later, I tried again. A poem entitled Frisbee:

[Image: the tweeted poem]

Again, it was well-received. More retweets. More new followers. And so my career as social media poet began.

I had never intended to be a poet. A poet to my mind was someone of intensity, a serious type, the kind of person you wouldn’t want to get trapped in a kitchen with at a party (if poets received invitations to parties at all, that is).

I vaguely remembered the term “iambic pentameter” from my schooldays. I knew roughly how many syllables there were in a haiku, and I had a rudimentary grounding in the poetry of Eliot and Larkin – although my true heroes were Roger McGough and Ogden Nash.

Find out more

This is an edited version of The Social Media Poet, part of Radio 4’s Four Thought series – catch up on BBC iPlayer Radio

I had never intended to waste my days on social media either. Of Twitter, I was, as with so many other things, a late adopter. I had only joined it in order to understand what it was that those people at work, invariably younger than I, would talk about so irritatingly in meetings.

So I joined and, like many other first-time tweeters, spent those early months writing bad jokes and puns which echoed in the ethereal emptiness.

But something else had begun to bloom in the background – I had started to talk to strangers on Twitter (some of whom were very strange) and read about the things that interested them or made them laugh or annoyed or sad or angry.

And almost nothing, it seemed, made Twitter angrier than bad grammar. For every badly spelt, poorly constructed tweet, there were a hundred unreconstructed grammar pedants leaping in to point out the mistakes.

So I started writing poems about these heinous crimes, the misapplication of a semi-colon, the rule about i before e, and of course, that most controversial of all the punctuation marks, the Oxford comma – and even gently mocked the grammar enforcers themselves:

[Image: the tweeted poem]

I began to receive feedback on my poems from Twitter users. A typical response was: “I hated poetry at school. I never thought I’d find myself retweeting a poem.”

Or: “I don’t really like poems but I like whatever it is you do”

I am still unsure as to whether these are compliments or not. But it did open my eyes to how poetry is still viewed with suspicion by a large part of the populace, in spite of the undoubted resurgence in popularity it has seen in recent times.

We see signs of that new interest in poetry all around us – the growth of performance poetry and the spoken word movement, poetry slams, workshops for aspiring poets, the soaring membership of societies at the national and regional level, the rise of self-publishing, and a new generation of inspiring younger poets, such as Kate Tempest and George the Poet, taking the form to newer audiences.

But move outside of that growing community, and the impression remains that poetry – in general terms – is difficult, almost deliberately obtuse and obscure, frequently dull, and willfully uses words such as “crepuscular” or “obsidian”.

Small wonder that the business of publishing poetry books is a fraught one – few titles sell in substantial quantities, and most general commercial publishers or literary agents have turned their backs on poetry.

Books are published by specialist presses with short print-runs or self-published. Poetry sections in bookshops appear to shrink by the month. There’s a lot of poetry about but it seems few people want to buy it.

But social media shows us that a broader, more democratic appetite for poetry exists, after all.

In June, as the terrible news came in of the murder of MP Jo Cox, the Poetry Society – a UK organisation with the aim of promoting the study, use and enjoyment of poetry – shared on Twitter The Mower, one of Philip Larkin’s later poems.

The poem concerns the death of a hedgehog, jammed up against the blades of his lawnmower. The poem is mournful and melancholic but also compassionate and powerful.

It travelled around Twitter at lightning speed, bursting out from the bubble of The Poetry Society’s followers to find a much broader audience. It was re-shared over 3,500 times and reached half a million readers in just a few days.

It’s a piece that does what brilliant poetry does best – it moves us, it gives voice to thoughts and feelings that we ourselves may struggle to articulate, it’s personal and yet universal.

But also, 37 years after it was written, it is relevant. It’s Poetry as Current Affairs. The Poem as Social Commentary. And that can be the power of social media for poetry – a place to share words at the point of need. It is lent an immediacy other outlets for poetry do not have – it is not trapped in the pages of a book on a shelf, nor waiting to be recited at an event taking place two weeks next Thursday. It’s there where we need it – and when we need it.

But the relationship between poetry and that broader social media audience is more than an immediate, articulated response to tragedy. A platform such as Twitter can also serve as an endless source of ideas for poets.

Without Twitter’s hashtag, the symbol for what is trending at any moment in time, I would never have known about Penguin Awareness Day or National Stationery Week. Last year, the stars aligned and three celebrations occurred all on the same day – World Philosophy Day, International Men’s Day and World Toilet Day, giving rise to this short piece – Poem for International Men’s Toilet Philosophy Day – which combines all three:

[Image: the tweeted poem]

Poetry is often seen as being about Universal Concepts and Big Ideas – Truth, Love, Death, Beauty, Betrayal, Regret. These are all Grand Themes upon which we have cause to reflect throughout our lives. But there’s poetry to be found in the stuff of everyday existence, too – that nonsense in our lives that sometimes gets in the way of those Grand Themes – and social media is the ideal arena in which to share it.

There are other lessons for poets to learn from social media about how to keep their writing interesting and relevant to modern audiences. The ways in which we consume content have begun to change.

Any time spent wallowing in the mire that is social media shows how visual the information is with which we now engage – yes, there are words aplenty, but there are pictures and videos, too.

The barrage of information that is thrown at us means that plain words on a page are becoming less likely to be read. People struggle to find the time, inclination or powers of concentration to wade through pages of dense text. Words find themselves in competition with pictures. And less is often more.

But poets shouldn’t feel threatened by this in a social media setting – rather, it gives us the opportunity to think about form as well as content, and how the presentation of a poem might enhance or complement the words which accompany it.

I began to experiment with the visual aspects of poetry, when I wrote a poem in the shape of a Christmas tree. The poem itself was about how I had neglected to water my words, and I illustrated this by having words spread out around the base of the poem, like pine needles scattered over a carpet.

The presentation of the poem was the punchline but also enabled it to stand out amongst the hundreds of other tweets sitting on users’ timelines.

Other forms and shapes followed. We live in an age of infographics, data, flowcharts. On Twitter, I noticed how everyone, it would seem, enjoys a good Venn diagram. So I wrote a poem using two different personal narratives separated into two halves of a Venn diagram, with words which overlapped in their stories appearing in the intersection to form a third, unifying perspective to the poem.

And from there followed poems in flowcharts, or made from Scrabble tiles, in Excel spreadsheets and PowerPoint presentations, in the form of CVs, in the shapes of wine-glasses and light-bulbs. They all became popular.

Yes, their novelty helped the poems stand out – but they also embedded words within places or situations that people recognized from their own lives.

So what have I – as an accidental poet – accidentally learnt?

Poetry on social media is more than a never-ending stream of haiku concerning the changing light of the moon on water, or the beauty of cherry blossom.

It’s far more interesting and relevant than that.

It’s an opportunity for poetry to present itself in situations where and when people most need it – as a way of finding meaning or comfort to bigger, often unfathomable, world events, perhaps, or simply to provide some light relief from the complexity and perplexity of life.

But it’s also a platform for poets themselves to interact and engage with their audience – and, indeed, find new audiences – through experimentation with content and form, and a deeper engagement with real world concerns.

I shall finish with one more poem, written one day when Twitter became unavailable for a whole afternoon, much to the angst of millions of people around the world. It’s called The day that Twitter went down.

That day I got things done.

I went for a long run.

Played ping-pong,

Wrote a song.

It got to number one.

That day I did a lot.

I tied a Windsor knot.

Helped the poor,

Stopped a war,

Read all of Walter Scott.

O what a day to seize.

I learnt some Cantonese.

Led a coup,

Climbed K2,

Cured a tropical disease.

That day I met deadlines,

Got crowned King of Liechtenstein,

Stroked a toucan,

Found Lord Lucan,

Then Twitter came back online.
