1008 lines
21 KiB
Plaintext
1008 lines
21 KiB
Plaintext
{
|
||
"cells": [
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"# Generating random poems with Python #\n",
|
||
"\n",
|
||
"\n",
|
||
"<div style=\"text-align:center;margin-top:40px\">(I never said they would be good poems)</div>"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"## Phone autocomplete ##\n",
|
||
"\n",
|
||
"You can generate random text that sounds like you with your smartphone keyboard:\n",
|
||
"\n",
|
||
"<img align=\"left\" src=\"images/phone_keyboard.png\">\n",
|
||
"<img align=\"right\" src=\"images/phone_autocomplete.gif\">"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"## So, how does it work? ##\n",
|
||
"\n",
|
||
"First, we need a **corpus**, or the text our generator will recombine into new sentences:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 1,
|
||
"metadata": {
|
||
"collapsed": true,
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"corpus = 'The quick brown fox jumps over the lazy dog'"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"Simplest word **tokenization** is to split on spaces:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 2,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']"
|
||
]
|
||
},
|
||
"execution_count": 2,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"words = corpus.split(' ')\n",
|
||
"words"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"To create **bigrams**, iterate through the list of words with two indices, one of which is offset by one:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 3,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"[('The', 'quick'),\n",
|
||
" ('quick', 'brown'),\n",
|
||
" ('brown', 'fox'),\n",
|
||
" ('fox', 'jumps'),\n",
|
||
" ('jumps', 'over'),\n",
|
||
" ('over', 'the'),\n",
|
||
" ('the', 'lazy'),\n",
|
||
" ('lazy', 'dog')]"
|
||
]
|
||
},
|
||
"execution_count": 3,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"bigrams = [b for b in zip(words[:-1], words[1:])]\n",
|
||
"bigrams"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"How do we use the bigrams to predict the next word given the first word?"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"source": [
|
||
" Return every second element where the first element matches the **condition**:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 4,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"['quick', 'lazy']"
|
||
]
|
||
},
|
||
"execution_count": 4,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"condition = 'the'\n",
|
||
"next_words = [bigram[1] for bigram in bigrams\n",
|
||
" if bigram[0].lower() == condition]\n",
|
||
"next_words"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"collapsed": true,
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"source": [
|
||
"(<font color=\"blue\">The</font> <font color=\"red\">quick</font>) (quick brown) ... (<font color=\"blue\">the</font> <font color=\"red\">lazy</font>) (lazy dog)\n",
|
||
"\n",
|
||
"Either “<font color=\"red\">quick</font>” or “<font color=\"red\">lazy</font>” could be the next word."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"collapsed": true,
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"## Trigrams and Ngrams ##\n",
|
||
"\n",
|
||
"We can partition by threes too:\n",
|
||
"\n",
|
||
"(<font color=\"blue\">The</font> <font color=\"red\">quick brown</font>) (quick brown fox) ... (<font color=\"blue\">the</font> <font color=\"red\">lazy dog</font>)\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"source": [
|
||
"Or, the condition can be two words (`condition = 'the lazy'`):\n",
|
||
"\n",
|
||
"(The quick brown) (quick brown fox) ... (<font color=\"blue\">the lazy</font> <font color=\"red\">dog</font>)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"source": [
|
||
"\n",
|
||
"These are **trigrams**."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"source": [
|
||
"We can partition any **N** number of words together as **ngrams**."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"So earlier we got:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 5,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"['quick', 'lazy']"
|
||
]
|
||
},
|
||
"execution_count": 5,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"next_words"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"source": [
|
||
"How do we know which one to pick as the next word?\n",
|
||
"\n",
|
||
"Why not the word that occurred the most often after the condition in the corpus?"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"source": [
|
||
"We can use a **Conditional Frequency Distribution (CFD)** to figure that out!\n",
|
||
"\n",
|
||
"A **CFD** can tell us: given a **condition**, what is **likely** to follow?"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"## Conditional Frequency Distributions (CFDs) ##"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 6,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog', 'and', 'the', 'quick', 'cat']\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"words = ('The quick brown fox jumped over the '\n",
|
||
" 'lazy dog and the quick cat').split(' ')\n",
|
||
"print words"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 7,
|
||
"metadata": {
|
||
"collapsed": true,
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"from collections import defaultdict\n",
|
||
"\n",
|
||
"cfd = defaultdict(lambda: defaultdict(lambda: 0))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"## Conditional Frequency Distributions (CFDs) ##"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 8,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"{'and': {'the': 1},\n",
|
||
" 'brown': {'fox': 1},\n",
|
||
" 'dog': {'and': 1},\n",
|
||
" 'fox': {'jumped': 1},\n",
|
||
" 'jumped': {'over': 1},\n",
|
||
" 'lazy': {'dog': 1},\n",
|
||
" 'over': {'the': 1},\n",
|
||
" 'quick': {'brown': 1},\n",
|
||
" 'the': {'lazy': 1, 'quick': 2}}"
|
||
]
|
||
},
|
||
"execution_count": 8,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"for i in range(len(words) - 2): # loop to the next-to-last word\n",
|
||
" cfd[words[i].lower()][words[i+1].lower()] += 1\n",
|
||
"\n",
|
||
"# pretty print the defaultdict\n",
|
||
"{k: dict(v) for k, v in dict(cfd).items()}"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"source": [
|
||
"So, what's the most likely word to follow `'the'`?"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 9,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"'quick'"
|
||
]
|
||
},
|
||
"execution_count": 9,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"max(cfd['the'])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"## Whole sentences can be the conditions and values too ##\n",
|
||
"\n",
|
||
"Which is basically the way cleverbot works ([http://www.cleverbot.com/](http://www.cleverbot.com/)):\n",
|
||
"\n",
|
||
"![Cleverbot](images/cleverbot.png)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"## Random text! ##"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 10,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"her reserve and concealment towards some feelings in moving slowly together . You will shew\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"import nltk\n",
|
||
"import random\n",
|
||
"\n",
|
||
"TEXT = nltk.corpus.gutenberg.words('austen-emma.txt')\n",
|
||
"\n",
|
||
"# NLTK shortcuts :)\n",
|
||
"bigrams = nltk.bigrams(TEXT)\n",
|
||
"cfd = nltk.ConditionalFreqDist(bigrams)\n",
|
||
"\n",
|
||
"# pick a random word from the corpus to start with\n",
|
||
"word = random.choice(TEXT)\n",
|
||
"# generate 15 more words\n",
|
||
"for i in range(15):\n",
|
||
" print word,\n",
|
||
" if word in cfd:\n",
|
||
" word = random.choice(cfd[word].keys())\n",
|
||
" else:\n",
|
||
" break"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"## Random poems ##\n",
|
||
"\n",
|
||
"Generating random poems is accomplished by limiting the choice of the next word by some constraint:\n",
|
||
"\n",
|
||
"* words that rhyme with the previous line\n",
|
||
"* words that match a certain syllable count\n",
|
||
"* words that alliterate with words on the same line\n",
|
||
"* etc."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"# Rhyming\n",
|
||
"\n",
|
||
"**Written English != Spoken English**\n",
|
||
"\n",
|
||
"English has a highly **nonphonemic orthography**, meaning that the letters often have no correspondence to the pronunciation. E.g.:\n",
|
||
"\n",
|
||
"\n",
|
||
"\"meet\" vs. \"meat\"\n",
|
||
"\n",
|
||
"The vowels are spelled differently, yet they rhyme."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"source": [
|
||
"Fun fact: They used to be pronounced differently in Middle English during the invention of the printing press and standardized spelling."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"# International Phonetic Alphabet (IPA)\n",
|
||
"\n",
|
||
"An alphabet that can represent all varieties of human pronunciation.\n",
|
||
"\n",
|
||
"* meet: /mit/\n",
|
||
"* meat: /mit/"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"source": [
|
||
"Note: this is only the IPA transcription for only one **accent** of English."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"# Syllables\n",
|
||
"\n",
|
||
"* \"poet\" = 2 syllables\n",
|
||
"* \"does\" = 1 syllable\n",
|
||
"\n",
|
||
"Can the IPA tell us the number of syllables in a word too?"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"source": [
|
||
"* poet: /ˈpoʊət/\n",
|
||
"* does: /ˈdʌz/\n",
|
||
"\n",
|
||
"Not really... We cannot easily identify three syllables from that transcription.\n",
|
||
"\n",
|
||
"Sometimes the transcriber denotes syllable breaks (with a `.` or a `'`), but sometimes they don't."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"# Arpabet\n",
|
||
"\n",
|
||
"A phonetic alphabet developed by ARPA in the 70s that:\n",
|
||
"\n",
|
||
"* Encodes phonemes specific to American English.\n",
|
||
"* Meant to be a machine readable code. It is ASCII only.\n",
|
||
"* Denotes how stressed every vowel is from 0-2."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"source": [
|
||
"This is perfect! Word's syllable count equals the number of digits in the Arpabet encoding."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"# CMU Pronouncing Dictionary (CMUdict)\n",
|
||
"\n",
|
||
"A large open source dictionary of English words to North American pronunciations in Arpanet encoding."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"source": [
|
||
"Conveniently, it is also in NLTK..."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"# Counting Syllables"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 11,
|
||
"metadata": {
|
||
"collapsed": true,
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"import string\n",
|
||
"from nltk.corpus import cmudict\n",
|
||
"cmu = cmudict.dict()\n",
|
||
"\n",
|
||
"def count_syllables(word):\n",
|
||
" lower_word = word.lower()\n",
|
||
" if lower_word in cmu:\n",
|
||
" return max([len([y for y in x if y[-1] in string.digits])\n",
|
||
" for x in cmu[lower_word]])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 12,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"poet: 2\n",
|
||
"does: 1\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"print(\"poet: {}\\ndoes: {}\".format(count_syllables(\"poet\"),\n",
|
||
" count_syllables(\"does\")))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"![Buzzfeed Haiku Generator](images/buzzfeed.png)\n",
|
||
"\n",
|
||
"[http://mule.hallada.net/nlp/buzzfeed-haiku-generator/](http://mule.hallada.net/nlp/buzzfeed-haiku-generator/)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"collapsed": true,
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"## Remember these? ##\n",
|
||
"\n",
|
||
"![madlibs](images/madlibs.png)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"## Mad Libs ##\n",
|
||
"\n",
|
||
"These worked so well because they forced the random words (chosen by you) to fit into the syntactical structure and parts-of-speech of an existing sentence.\n",
|
||
"\n",
|
||
"You end up with **syntactically** correct sentences that are **semantically** random.\n",
|
||
"\n",
|
||
"We can do the same thing!"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"## NLTK Syntax Trees! ##"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 13,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"(S\n",
|
||
" (NP (DT the) (NN quick))\n",
|
||
" (VP\n",
|
||
" (VB brown)\n",
|
||
" (NP\n",
|
||
" (NP (JJ fox) (NN jumps))\n",
|
||
" (PP (IN over) (NP (DT the) (JJ lazy) (NN dog)))))\n",
|
||
" (. .))\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"from stat_parser import Parser\n",
|
||
"parsed = Parser().parse('The quick brown fox jumps over the lazy dog.')\n",
|
||
"print parsed"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"## NLTK Syntax Trees! ##"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 14,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" S \n",
|
||
" ________________________|__________________________ \n",
|
||
" | VP | \n",
|
||
" | ____|_____________ | \n",
|
||
" | | NP | \n",
|
||
" | | _________|________ | \n",
|
||
" | | | PP | \n",
|
||
" | | | ________|___ | \n",
|
||
" NP | NP | NP | \n",
|
||
" ___|____ | ___|____ | _______|____ | \n",
|
||
" DT NN VB JJ NN IN DT JJ NN . \n",
|
||
" | | | | | | | | | | \n",
|
||
"the quick brown fox jumps over the lazy dog . \n",
|
||
"\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"parsed.pretty_print()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"## Swapping matching syntax subtrees between two corpora ##"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 15,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"(SBARQ\n",
|
||
" (SQ\n",
|
||
" (NP (PRP I))\n",
|
||
" (VP (VBP do) (RB not) (VB advise) (NP (DT the) (NN custard))))\n",
|
||
" (. .))\n",
|
||
"I do not advise the custard .\n",
|
||
"==============================\n",
|
||
"I do n't want the drone !\n",
|
||
"(SBARQ\n",
|
||
" (SQ\n",
|
||
" (NP (PRP I))\n",
|
||
" (VP (VBP do) (RB n't) (VB want) (NP (DT the) (NN drone))))\n",
|
||
" (. !))\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"from syntax_aware_generate import generate\n",
|
||
"\n",
|
||
"# inserts matching syntax subtrees from trump.txt into\n",
|
||
"# trees from austen-emma.txt\n",
|
||
"generate('trump.txt', word_limit=10)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"## spaCy ##\n",
|
||
"\n",
|
||
"![spaCy speed comparison](images/spacy_speed.png)\n",
|
||
"\n",
|
||
"[https://spacy.io/docs/api/#speed-comparison](https://spacy.io/docs/api/#speed-comparison)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"## Character-based Recurrent Neural Networks ##\n",
|
||
"\n",
|
||
"![RNN Paper](images/rnn_paper.png)\n",
|
||
"\n",
|
||
"[http://www.cs.utoronto.ca/~ilya/pubs/2011/LANG-RNN.pdf](http://www.cs.utoronto.ca/~ilya/pubs/2011/LANG-RNN.pdf)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"## Implementation: char-rnn ##\n",
|
||
"\n",
|
||
"![char-rnn](images/char-rnn.png)\n",
|
||
"\n",
|
||
"[https://github.com/karpathy/char-rnn](https://github.com/karpathy/char-rnn)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"## Generating Shakespeare with char-rnn ##\n",
|
||
"\n",
|
||
"![Shakespeare](images/shakespeare.png)\n",
|
||
"\n",
|
||
"[http://karpathy.github.io/2015/05/21/rnn-effectiveness/](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"collapsed": true,
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"# The end #\n",
|
||
"\n",
|
||
"Questions?"
|
||
]
|
||
}
|
||
],
|
||
"metadata": {
|
||
"celltoolbar": "Slideshow",
|
||
"kernelspec": {
|
||
"display_name": "Python 2",
|
||
"language": "python",
|
||
"name": "python2"
|
||
},
|
||
"language_info": {
|
||
"codemirror_mode": {
|
||
"name": "ipython",
|
||
"version": 2
|
||
},
|
||
"file_extension": ".py",
|
||
"mimetype": "text/x-python",
|
||
"name": "python",
|
||
"nbconvert_exporter": "python",
|
||
"pygments_lexer": "ipython2",
|
||
"version": "2.7.12"
|
||
},
|
||
"livereveal": {
|
||
"scroll": true,
|
||
"theme": "simple",
|
||
"transition": "linear"
|
||
}
|
||
},
|
||
"nbformat": 4,
|
||
"nbformat_minor": 2
|
||
}
|