From 153291e59d8eb20ee2e18ddbb3f4809214635e6f Mon Sep 17 00:00:00 2001 From: Tyler Hallada Date: Wed, 3 May 2017 01:16:30 -0400 Subject: [PATCH] Added rhyming and syllable counting slides --- edX Lightning Talk.ipynb | 227 +++++++++++++++++++++++++++++++++++++-- 1 file changed, 217 insertions(+), 10 deletions(-) diff --git a/edX Lightning Talk.ipynb b/edX Lightning Talk.ipynb index 0a9ed90..75bd261 100644 --- a/edX Lightning Talk.ipynb +++ b/edX Lightning Talk.ipynb @@ -101,7 +101,7 @@ } }, "source": [ - "To create **bigrams**, iterate through the list of words with two indicies, one of which is offset by one:" + "To create **bigrams**, iterate through the list of words with two indices, one of which is offset by one:" ] }, { @@ -454,11 +454,9 @@ "source": [ "## Whole sentences can be the conditions and values too ##\n", "\n", - "Which is basically the way cleverbot works:\n", + "Which is basically the way cleverbot works ([http://www.cleverbot.com/](http://www.cleverbot.com/)):\n", "\n", - "![Cleverbot](images/cleverbot.png)\n", - "\n", - "[http://www.cleverbot.com/](http://www.cleverbot.com/)" + "![Cleverbot](images/cleverbot.png)" ] }, { @@ -474,7 +472,7 @@ }, { "cell_type": "code", - "execution_count": 31, + "execution_count": 10, "metadata": { "slideshow": { "slide_type": "fragment" @@ -520,7 +518,7 @@ "source": [ "## Random poems ##\n", "\n", - "Generating random poems is simply limiting the choice of the next word by some constraint:\n", + "Generating random poems is accomplished by limiting the choice of the next word by some constraint:\n", "\n", "* words that rhyme with the previous line\n", "* words that match a certain syllable count\n", @@ -528,6 +526,204 @@ "* etc." 
] }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "# Rhyming\n", + "\n", + "**Written English != Spoken English**\n", + "\n", + "English is highly **nonphonemic**, meaning that the spelling often does not correspond to the pronunciation. E.g.:\n", + "\n", + "\n", + "\"meet\" vs. \"meat\"\n", + "\n", + "The vowels are spelled differently, yet the words rhyme." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "Fun fact: They used to be pronounced differently in Middle English, around the time the printing press was introduced and spelling was standardized." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "# International Phonetic Alphabet (IPA)\n", + "\n", + "An alphabet that can represent all varieties of human pronunciation.\n", + "\n", + "* meet: /mit/\n", + "* meat: /mit/" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "Note: this is the IPA transcription for only one **accent** of English." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "# Syllables\n", + "\n", + "* \"poet\" = 2 syllables\n", + "* \"does\" = 1 syllable\n", + "\n", + "Can the IPA tell us the number of syllables in a word too?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "* poet: /ˈpoʊət/\n", + "* does: /ˈdʌz/\n", + "\n", + "Not really... We cannot easily read the syllable counts off these transcriptions.\n", + "\n", + "Sometimes the transcriber denotes syllable breaks (with a `.` or a `'`), but sometimes they don't." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "# Arpabet\n", + "\n", + "A phonetic alphabet developed by ARPA in the 70s that:\n", + "\n", + "* Encodes phonemes specific to American English.\n", + "* Meant to be a machine readable code. It is ASCII only.\n", + "* Denotes how stressed every vowel is from 0-2." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "This is perfect! Word's syllable count equals the number of digits in the Arpabet encoding." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "# CMU Pronouncing Dictionary (CMUdict)\n", + "\n", + "A large open source dictionary of English words to North American pronunciations in Arpanet encoding." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "Conveniently, it is also in NLTK..." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "# Counting Syllables" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "collapsed": true, + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "import string\n", + "from nltk.corpus import cmudict\n", + "cmu = cmudict.dict()\n", + "\n", + "def count_syllables(word):\n", + " lower_word = word.lower()\n", + " if lower_word in cmu:\n", + " return max([len([y for y in x if y[-1] in string.digits])\n", + " for x in cmu[lower_word]])" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "poet: 2\n", + "does: 1\n" + ] + } + ], + "source": [ + "print(\"poet: {}\\ndoes: {}\".format(count_syllables(\"poet\"),\n", + " count_syllables(\"does\")))" + ] + }, { "cell_type": "markdown", "metadata": { @@ -585,7 +781,7 @@ }, { "cell_type": "code", - "execution_count": 33, + "execution_count": 13, "metadata": { "slideshow": { "slide_type": "fragment" @@ -613,9 +809,20 @@ "print parsed" ] }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## NLTK Syntax Trees! ##" + ] + }, { "cell_type": "code", - "execution_count": 34, + "execution_count": 14, "metadata": { "slideshow": { "slide_type": "fragment" @@ -660,7 +867,7 @@ }, { "cell_type": "code", - "execution_count": 30, + "execution_count": 15, "metadata": { "slideshow": { "slide_type": "fragment"