# Generating random poems with Python #


<div style="text-align:center;margin-top:40px">(I never said they would be good poems)</div>

## Phone autocomplete ##

You can generate random text that sounds like you with your smartphone keyboard:

<img align="left" style="width:50%" src="images/phone_keyboard.png">
<img align="right" style="width:50%" src="images/phone_autocomplete.gif">

## So, how does it work? ##

First, we need a **corpus**, or the text our generator will recombine into new sentences:

In [1]:
corpus = 'The quick brown fox jumps over the lazy dog'

The simplest word **tokenization** is to split on spaces:

In [2]:
words = corpus.split(' ')
words

['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
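Splitting on spaces leaves punctuation glued to words (a trailing `'dog.'`, for example). As a rough sketch that is not part of the original notebook, a regular expression can separate words from punctuation marks. The `tokenize` helper name is an assumption made for illustration:

```python
import re

def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters or digits; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize('The quick brown fox jumps over the lazy dog.')
print(tokens)
```

This pattern splits contractions such as "don't" into three tokens, so a dedicated tokenizer is usually the better choice for real text.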

To create **bigrams**, iterate through the list of words with two indices, one of which is offset by one:

In [3]:
bigrams = list(zip(words[:-1], words[1:]))
bigrams

[('The', 'quick'),
 ('quick', 'brown'),
 ('brown', 'fox'),
 ('fox', 'jumps'),
 ('jumps', 'over'),
 ('over', 'the'),
 ('the', 'lazy'),
 ('lazy', 'dog')]

How do we use the bigrams to predict the next word given the first word?

Return the second element of every bigram whose first element matches the **condition**:

In [4]:
condition = 'the'
next_words = [bigram[1] for bigram in bigrams
              if bigram[0].lower() == condition]
next_words

['quick', 'lazy']

(<span style="color:blue">The</span> <span style="color:red">quick</span>) (quick brown) ... (<span style="color:blue">the</span> <span style="color:red">lazy</span>) (lazy dog)

Either “<span style="color:red">quick</span>” or “<span style="color:red">lazy</span>” could be the next word.

## Trigrams and Ngrams ##

We can partition by threes too:

(<span style="color:blue">The</span> <span style="color:red">quick brown</span>) (quick brown fox) ... (<span style="color:blue">the</span> <span style="color:red">lazy dog</span>)


Or, the condition can be two words (`condition = 'the lazy'`):

(The quick brown) (quick brown fox) ... (<span style="color:blue">the lazy</span> <span style="color:red">dog</span>)


These are **trigrams**.

We can group any number **N** of consecutive words together as **ngrams**.
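The bigram trick generalizes to any N by zipping N offset copies of the word list. The following is a minimal sketch; the `ngrams` and `next_words` helper names are assumptions for illustration, not functions from the notebook:

```python
def ngrams(words, n):
    """Return every run of n consecutive words as a tuple."""
    # zip over n copies of the list, each offset by one more position
    return list(zip(*(words[i:] for i in range(n))))

def next_words(words, condition):
    """Return the words that follow the tuple `condition` anywhere in `words`."""
    n = len(condition) + 1
    condition = tuple(w.lower() for w in condition)
    return [gram[-1] for gram in ngrams(words, n)
            if tuple(w.lower() for w in gram[:-1]) == condition]

words = 'The quick brown fox jumps over the lazy dog'.split(' ')
print(ngrams(words, 3)[:2])
print(next_words(words, ('the', 'lazy')))
```

A one-word condition such as `('the',)` falls back to the bigram case and returns both `'quick'` and `'lazy'`.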

So earlier we got:

In [5]:
next_words

['quick', 'lazy']

How do we know which one to pick as the next word?

Why not the word that occurred the most often after the condition in the corpus?

We can use a **Conditional Frequency Distribution (CFD)** to figure that out!

A **CFD** can tell us: given a **condition**, what is **likely** to follow?

## Conditional Frequency Distributions (CFDs) ##

In [6]:
words = ('The quick brown fox jumped over the '
        'lazy dog and the quick cat').split(' ')
print(words)

['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog', 'and', 'the', 'quick', 'cat']


In [7]:
from collections import defaultdict

cfd = defaultdict(lambda: defaultdict(lambda: 0))

In [8]:
for i in range(len(words) - 1):  # loop to the next-to-last word
    cfd[words[i].lower()][words[i+1].lower()] += 1

# pretty print the defaultdict
{k: dict(v) for k, v in dict(cfd).items()}

{'and': {'the': 1},
 'brown': {'fox': 1},
 'dog': {'and': 1},
 'fox': {'jumped': 1},
 'jumped': {'over': 1},
 'lazy': {'dog': 1},
 'over': {'the': 1},
 'quick': {'brown': 1, 'cat': 1},
 'the': {'lazy': 1, 'quick': 2}}

So, what's the most likely word to follow `'the'`?

In [9]:
# compare the following words by their counts, not alphabetically
max(cfd['the'], key=cfd['the'].get)

'quick'
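Counting followers is also a natural fit for `collections.Counter`, whose `most_common` method ranks them by count. The sketch below is an alternative formulation under that assumption; `build_cfd` is a hypothetical helper name:

```python
from collections import Counter, defaultdict

def build_cfd(words):
    """Map each lowercased word to a Counter of the words that follow it."""
    cfd = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        cfd[current.lower()][following.lower()] += 1
    return cfd

words = ('The quick brown fox jumped over the '
         'lazy dog and the quick cat').split(' ')
cfd = build_cfd(words)
print(cfd['the'].most_common(1))  # [('quick', 2)]
```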

## Whole sentences can be the conditions and values too ##

This is basically how Cleverbot works ([http://www.cleverbot.com/](http://www.cleverbot.com/)):

![Cleverbot](images/cleverbot.png)

## Random text! ##

In [10]:
import nltk
import random

TEXT = nltk.corpus.gutenberg.words('austen-emma.txt')

# NLTK shortcuts :)
bigrams = nltk.bigrams(TEXT)
cfd = nltk.ConditionalFreqDist(bigrams)

# pick a random word from the corpus to start with
word = random.choice(TEXT)
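# generate up to 15 words, beginning with the starting word
generated = []
for i in range(15):
    generated.append(word)
    if word in cfd:
        # list() keeps this working where keys() returns a view (Python 3)
        word = random.choice(list(cfd[word].keys()))
    else:
        break
print(' '.join(generated))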

her reserve and concealment towards some feelings in moving slowly together . You will shew
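The loop above picks uniformly among the distinct words that followed the current word, so a follower seen once is as likely as one seen a hundred times. As a hedged sketch, the choice can be weighted by frequency with `random.choices` (Python 3.6+). To stay self-contained it uses a tiny corpus and `collections.Counter` instead of NLTK, and `generate_words` is a hypothetical name:

```python
import random
from collections import Counter, defaultdict

def generate_words(words, start, length, rng=random):
    """Walk the bigram table, choosing each next word in proportion to its count."""
    cfd = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        cfd[current][following] += 1

    output = [start]
    while len(output) < length and output[-1] in cfd:
        followers = cfd[output[-1]]
        # random.choices draws one item, weighted by how often it followed
        next_word = rng.choices(list(followers),
                                weights=list(followers.values()))[0]
        output.append(next_word)
    return output

words = 'the quick brown fox jumped over the lazy dog and the quick cat'.split(' ')
print(' '.join(generate_words(words, 'the', 8)))
```

With this corpus, `'quick'` follows `'the'` twice as often as `'lazy'` does, so it is drawn about twice as often.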


## Random poems ##

Generating random poems is accomplished by limiting the choice of the next word by some constraint:

* words that rhyme with the previous line
* words that match a certain syllable count
* words that alliterate with words on the same line
* etc.
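One way to picture a constraint is as a filter over the candidate next words. The sketch below uses alliteration as the constraint. Both helper names are assumptions for illustration, and in the generator the candidates would be the words that follow the current word in the CFD:

```python
import random

def alliterates(word, line):
    """True if `word` starts with the same letter as the first word of `line`."""
    return not line or word[0].lower() == line[0][0].lower()

def constrained_choice(candidates, line, constraint, rng=random):
    """Pick a candidate that satisfies the constraint, or any candidate if none does."""
    allowed = [w for w in candidates if constraint(w, line)]
    return rng.choice(allowed or candidates)

line = ['silly']
candidates = ['dog', 'snake', 'cat', 'sparrow']
line.append(constrained_choice(candidates, line, alliterates))
print(line)
```

Falling back to the unconstrained candidates keeps generation from stalling when nothing fits. A stricter generator could backtrack instead.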

## Rhyming ##

**Written English != Spoken English**

English has a highly **nonphonemic orthography**, meaning that spelling often does not correspond directly to pronunciation. E.g.:


"meet" vs. "meat"

The vowels are spelled differently, yet the words rhyme (in fact, they sound identical).

Fun fact: they were pronounced differently in Middle English, around the time the printing press arrived and spelling was standardized.

## International Phonetic Alphabet (IPA) ##

An alphabet that can represent all varieties of human pronunciation.

* meet: /mit/
* meat: /mit/

Note: this is the IPA transcription for only one **accent** of English.

## Syllables ##

* "poet" = 2 syllables
* "does" = 1 syllable

Can the IPA tell us the number of syllables in a word too?

* poet: /ˈpoʊət/
* does: /ˈdʌz/

Not really... We cannot easily tell from that transcription that "poet" has two syllables.

Sometimes the transcriber denotes syllable breaks (with a `.` or a `'`), but sometimes they don't.

## Arpabet ##

A phonetic alphabet developed by ARPA in the 70s that:

* Encodes phonemes specific to American English.
* Is meant to be machine readable, so it uses only ASCII.
* Marks the stress of every vowel with a digit from 0 to 2.

This is perfect! A word's syllable count equals the number of digits in its Arpabet encoding.
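As a small sketch of that rule, the function below counts the phonemes that end in a stress digit. The two transcriptions are written by hand for illustration, and `syllables_in` is a hypothetical name:

```python
def syllables_in(arpabet):
    """Count syllables in a space-separated Arpabet pronunciation."""
    # only vowel phonemes end in a stress digit (0, 1 or 2)
    return sum(1 for phoneme in arpabet.split() if phoneme[-1].isdigit())

# hand-written transcriptions, assumed for illustration
print(syllables_in('P OW1 AH0 T'))  # poet
print(syllables_in('D AH1 Z'))      # does
```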

## CMU Pronouncing Dictionary (CMUdict) ##

A large open source dictionary that maps English words to their North American pronunciations in Arpabet encoding.

Conveniently, it is also in NLTK...

## Counting Syllables ##

In [11]:
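import string
from nltk.corpus import cmudict
cmu = cmudict.dict()

def count_syllables(word):
    """Return the syllable count of `word`, or None if CMUdict lacks it."""
    lower_word = word.lower()
    if lower_word in cmu:
        # a word can have several pronunciations; vowels end in a stress
        # digit, so count those and keep the largest count
        return max([len([y for y in x if y[-1] in string.digits])
                    for x in cmu[lower_word]])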

In [12]:
print("poet: {}\ndoes: {}".format(count_syllables("poet"),
                                  count_syllables("does")))

poet: 2
does: 1
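The same Arpabet encoding can support the rhyme constraint as well. A common heuristic, assumed here and not taken from the notebook, treats two words as rhyming when their phonemes match from the last stressed vowel onward. The pronunciations below are hand-written for illustration so the sketch runs without NLTK:

```python
# hand-written Arpabet pronunciations, assumed for illustration
PRONUNCIATIONS = {
    'meet': ['M', 'IY1', 'T'],
    'meat': ['M', 'IY1', 'T'],
    'greet': ['G', 'R', 'IY1', 'T'],
    'does': ['D', 'AH1', 'Z'],
    'fuzz': ['F', 'AH1', 'Z'],
}

def rhyming_part(phonemes):
    """Return the phonemes from the last stressed vowel to the end of the word."""
    for i in range(len(phonemes) - 1, -1, -1):
        if phonemes[i][-1] in '12':  # primary or secondary stress
            return phonemes[i:]
    return phonemes  # no stressed vowel found; compare the whole word

def rhymes(a, b):
    """True if both words are known and share the same rhyming part."""
    if a not in PRONUNCIATIONS or b not in PRONUNCIATIONS:
        return False
    return rhyming_part(PRONUNCIATIONS[a]) == rhyming_part(PRONUNCIATIONS[b])

print(rhymes('meet', 'greet'))
print(rhymes('meet', 'does'))
```

With the `cmu` dictionary loaded above, each word maps to a list of such phoneme lists, so a fuller version would compare every pair of pronunciations.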


![Buzzfeed Haiku Generator](images/buzzfeed.png)

[http://mule.hallada.net/nlp/buzzfeed-haiku-generator/](http://mule.hallada.net/nlp/buzzfeed-haiku-generator/)

## Remember these? ##

![madlibs](images/madlibs.png)

## Mad Libs ##

These worked so well because they forced the random words (chosen by you) to fit into the syntactical structure and parts-of-speech of an existing sentence.

You end up with **syntactically** correct sentences that are **semantically** random.

We can do the same thing!
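Before bringing in a parser, the Mad Libs idea can be sketched with a hand-made template whose blanks are labeled by part of speech. The template, the word lists, and the `fill_template` name are all assumptions for illustration:

```python
import random

# a sentence template: plain strings are kept, (tag,) slots are filled in
TEMPLATE = ['The', ('JJ',), ('NN',), 'jumps', 'over', 'the', ('JJ',), ('NN',)]

# hypothetical word lists keyed by part-of-speech tag (JJ = adjective, NN = noun)
WORDS_BY_TAG = {
    'JJ': ['quick', 'lazy', 'purple', 'suspicious'],
    'NN': ['fox', 'dog', 'custard', 'drone'],
}

def fill_template(template, words_by_tag, rng=random):
    """Replace each (tag,) slot with a random word of that part of speech."""
    return ' '.join(
        rng.choice(words_by_tag[slot[0]]) if isinstance(slot, tuple) else slot
        for slot in template
    )

print(fill_template(TEMPLATE, WORDS_BY_TAG))
```

Every output keeps the grammar of the template while the meaning varies at random.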

## NLTK Syntax Trees! ##

In [13]:
from stat_parser import Parser
parsed = Parser().parse('The quick brown fox jumps over the lazy dog.')
print(parsed)

(S
  (NP (DT the) (NN quick))
  (VP
    (VB brown)
    (NP
      (NP (JJ fox) (NN jumps))
      (PP (IN over) (NP (DT the) (JJ lazy) (NN dog)))))
  (. .))


In [14]:
parsed.pretty_print()

                              S                            
      ________________________|__________________________   
     |               VP                                  | 
     |           ____|_____________                      |  
     |          |                  NP                    | 
     |          |         _________|________             |  
     |          |        |                  PP           | 
     |          |        |          ________|___         |  
     NP         |        NP        |            NP       | 
  ___|____      |     ___|____     |     _______|____    |  
 DT       NN    VB   JJ       NN   IN   DT      JJ   NN  . 
 |        |     |    |        |    |    |       |    |   |  
the     quick brown fox     jumps over the     lazy dog  . 



## Swapping matching syntax subtrees between two corpora ##

In [15]:
from syntax_aware_generate import generate

# inserts matching syntax subtrees from trump.txt into
# trees from austen-emma.txt
generate('trump.txt', word_limit=10)

(SBARQ
  (SQ
    (NP (PRP I))
    (VP (VBP do) (RB not) (VB advise) (NP (DT the) (NN custard))))
  (. .))
I do not advise the custard .
I do n't want the drone !
(SBARQ
  (SQ
    (NP (PRP I))
    (VP (VBP do) (RB n't) (VB want) (NP (DT the) (NN drone))))
  (. !))
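The internals of `syntax_aware_generate` are not shown here, so the following is only a rough sketch of the general idea, not the author's implementation. It represents trees as nested lists and replaces each subtree in one tree with a subtree from another tree that has the same label and the same shape. All names are assumptions for illustration:

```python
import copy

# trees as nested lists: [label, child, child, ...]; leaves are plain strings
emma = ['S',
        ['NP', ['PRP', 'I']],
        ['VP', ['VBP', 'do'], ['RB', 'not'], ['VB', 'advise'],
         ['NP', ['DT', 'the'], ['NN', 'custard']]]]
other = ['S',
         ['NP', ['PRP', 'I']],
         ['VP', ['VBP', 'do'], ['RB', "n't"], ['VB', 'want'],
          ['NP', ['DT', 'the'], ['NN', 'drone']]]]

def shape(tree):
    """The tree with its words removed: labels only, used to match subtrees."""
    if isinstance(tree, str):
        return None
    return [tree[0]] + [shape(child) for child in tree[1:]]

def subtrees(tree):
    """Yield every non-leaf subtree, parents before children."""
    if not isinstance(tree, str):
        yield tree
        for child in tree[1:]:
            for sub in subtrees(child):
                yield sub

def leaves(tree):
    """Flatten a tree back into its list of words."""
    if isinstance(tree, str):
        return [tree]
    return [word for child in tree[1:] for word in leaves(child)]

def swap_in(target, source, label):
    """Copy `target`, replacing `label` subtrees with same-shaped ones from `source`."""
    result = copy.deepcopy(target)
    donors = [s for s in subtrees(source) if s[0] == label]
    for sub in list(subtrees(result)):
        if sub[0] != label:
            continue
        for donor in donors:
            if shape(donor) == shape(sub):
                sub[1:] = copy.deepcopy(donor[1:])
                break
    return result

print(' '.join(leaves(swap_in(emma, other, 'NP'))))
```

Matching on shape as well as label is what keeps the result grammatical: a determiner-plus-noun phrase is only ever replaced by another determiner-plus-noun phrase.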


## spaCy ##

![spaCy speed comparison](images/spacy_speed.png)

[https://spacy.io/docs/api/#speed-comparison](https://spacy.io/docs/api/#speed-comparison)

## Character-based Recurrent Neural Networks ##

![RNN Paper](images/rnn_paper.png)

[http://www.cs.utoronto.ca/~ilya/pubs/2011/LANG-RNN.pdf](http://www.cs.utoronto.ca/~ilya/pubs/2011/LANG-RNN.pdf)

## Implementation: char-rnn ##

![char-rnn](images/char-rnn.png)

[https://github.com/karpathy/char-rnn](https://github.com/karpathy/char-rnn)

## Generating Shakespeare with char-rnn ##

![Shakespeare](images/shakespeare.png)

[http://karpathy.github.io/2015/05/21/rnn-effectiveness/](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)

# The end #

Questions?