
Add Jupyter notebook presentation

Tyler Hallada 7 years ago
commit c3961f398e

+ 744 - 0
edX Lightning Talk.ipynb

@@ -0,0 +1,744 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "slideshow": {
+     "slide_type": "slide"
+    }
+   },
+   "source": [
+    "# Generating random poems with Python #\n",
+    "\n",
+    "\n",
+    "<div style=\"text-align:center;margin-top:40px\">(I never said they would be good poems)</div>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "slideshow": {
+     "slide_type": "slide"
+    }
+   },
+   "source": [
+    "## Phone autocomplete ##\n",
+    "\n",
+    "You can generate random text that sounds like you with your smartphone keyboard:\n",
+    "\n",
+    "<div style=\"float:left\">![Smartphone keyboard](images/phone_keyboard.png)</div>\n",
+    "<div style=\"float:right\">![Smartphone_autocomplete](images/phone_autocomplete.gif)</div>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "slideshow": {
+     "slide_type": "slide"
+    }
+   },
+   "source": [
+    "## So, how does it work? ##\n",
+    "\n",
+    "First, we need a **corpus**, or the text our generator will recombine into new sentences:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {
+    "collapsed": true,
+    "slideshow": {
+     "slide_type": "fragment"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "corpus = 'The quick brown fox jumps over the lazy dog'"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "slideshow": {
+     "slide_type": "slide"
+    }
+   },
+   "source": [
+    "The simplest word **tokenization** is to split on spaces:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {
+    "slideshow": {
+     "slide_type": "fragment"
+    }
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']"
+      ]
+     },
+     "execution_count": 2,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "words = corpus.split(' ')\n",
+    "words"
+   ]
+  },
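+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "slideshow": {
+     "slide_type": "fragment"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# A minimal illustrative aside (assumes NLTK and its punkt data are\n",
+    "# installed): a real tokenizer also splits punctuation off of words.\n",
+    "import nltk\n",
+    "nltk.word_tokenize('The quick brown fox jumps over the lazy dog.')"
+   ]
+  },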
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "slideshow": {
+     "slide_type": "slide"
+    }
+   },
+   "source": [
+    "To create **bigrams**, iterate through the list of words with two indices, one offset by one:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {
+    "slideshow": {
+     "slide_type": "fragment"
+    }
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[('The', 'quick'),\n",
+       " ('quick', 'brown'),\n",
+       " ('brown', 'fox'),\n",
+       " ('fox', 'jumps'),\n",
+       " ('jumps', 'over'),\n",
+       " ('over', 'the'),\n",
+       " ('the', 'lazy'),\n",
+       " ('lazy', 'dog')]"
+      ]
+     },
+     "execution_count": 3,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "bigrams = [b for b in zip(words[:-1], words[1:])]\n",
+    "bigrams"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "slideshow": {
+     "slide_type": "slide"
+    }
+   },
+   "source": [
+    "How do we use the bigrams to predict the next word given the first word?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "slideshow": {
+     "slide_type": "fragment"
+    }
+   },
+   "source": [
+    "Return the second element of every bigram whose first element matches the **condition**:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {
+    "slideshow": {
+     "slide_type": "fragment"
+    }
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "['quick', 'lazy']"
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "condition = 'the'\n",
+    "next_words = [bigram[1] for bigram in bigrams\n",
+    "              if bigram[0].lower() == condition]\n",
+    "next_words"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "collapsed": true,
+    "slideshow": {
+     "slide_type": "fragment"
+    }
+   },
+   "source": [
+    "(<font color=\"blue\">The</font> <font color=\"red\">quick</font>) (quick brown) ... (<font color=\"blue\">the</font> <font color=\"red\">lazy</font>) (lazy dog)\n",
+    "\n",
+    "Either “<font color=\"red\">quick</font>” or “<font color=\"red\">lazy</font>” could be the next word."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "collapsed": true,
+    "slideshow": {
+     "slide_type": "slide"
+    }
+   },
+   "source": [
+    "## Trigrams and Ngrams ##\n",
+    "\n",
+    "We can partition by threes too:\n",
+    "\n",
+    "(<font color=\"blue\">The</font> <font color=\"red\">quick brown</font>) (quick brown fox) ... (<font color=\"blue\">the</font> <font color=\"red\">lazy dog</font>)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "slideshow": {
+     "slide_type": "fragment"
+    }
+   },
+   "source": [
+    "Or, the condition can be two words (`condition = 'the lazy'`):\n",
+    "\n",
+    "(The quick brown) (quick brown fox) ... (<font color=\"blue\">the lazy</font> <font color=\"red\">dog</font>)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "slideshow": {
+     "slide_type": "fragment"
+    }
+   },
+   "source": [
+    "\n",
+    "These are **trigrams**."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "slideshow": {
+     "slide_type": "fragment"
+    }
+   },
+   "source": [
+    "We can group any **N** consecutive words together as **ngrams**; a quick sketch of trigram construction follows."
+   ]
+  },
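+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "slideshow": {
+     "slide_type": "fragment"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# A minimal sketch: trigrams are built just like the bigrams above,\n",
+    "# zipping the word list against itself at offsets of one and two.\n",
+    "trigrams = zip(words[:-2], words[1:-1], words[2:])\n",
+    "trigrams[:3]"
+   ]
+  },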
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "slideshow": {
+     "slide_type": "slide"
+    }
+   },
+   "source": [
+    "So earlier we got:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {
+    "slideshow": {
+     "slide_type": "fragment"
+    }
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "['quick', 'lazy']"
+      ]
+     },
+     "execution_count": 5,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "next_words"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "slideshow": {
+     "slide_type": "fragment"
+    }
+   },
+   "source": [
+    "How do we know which one to pick as the next word?\n",
+    "\n",
+    "Why not pick the word that occurred most often after the condition in the corpus?\n",
+    "\n",
+    "We can use a **Conditional Frequency Distribution (CFD)** to figure that out!\n",
+    "\n",
+    "A **CFD** can tell us: given a **condition**, what is **likely** to follow?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "slideshow": {
+     "slide_type": "slide"
+    }
+   },
+   "source": [
+    "## Conditional Frequency Distributions (CFDs) ##"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {
+    "slideshow": {
+     "slide_type": "fragment"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog', 'and', 'the', 'quick', 'cat']\n"
+     ]
+    }
+   ],
+   "source": [
+    "words = 'The quick brown fox jumped over the lazy dog and the quick cat'.split(' ')\n",
+    "print words"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {
+    "collapsed": true,
+    "slideshow": {
+     "slide_type": "fragment"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "from collections import defaultdict\n",
+    "\n",
+    "cfd = defaultdict(lambda: defaultdict(lambda: 0))\n",
+    "condition = 'the'"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {
+    "slideshow": {
+     "slide_type": "fragment"
+    }
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'the': {'lazy': 1, 'quick': 2}}"
+      ]
+     },
+     "execution_count": 8,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "for i in range(len(words) - 1):\n",
+    "    if words[i].lower() == condition:\n",
379
+    "        cfd[condition][words[i+1]] += 1\n",
380
+    "\n",
381
+    "# pretty print the defaultdict \n",
382
+    "{k: dict(v) for k, v in dict(cfd).items()}"
383
+   ]
384
+  },
385
+  {
386
+   "cell_type": "markdown",
387
+   "metadata": {
388
+    "slideshow": {
389
+     "slide_type": "slide"
390
+    }
391
+   },
392
+   "source": [
393
+    "## What's the most likely? ##"
394
+   ]
395
+  },
396
+  {
397
+   "cell_type": "code",
398
+   "execution_count": 9,
399
+   "metadata": {
400
+    "slideshow": {
401
+     "slide_type": "fragment"
402
+    }
403
+   },
404
+   "outputs": [
405
+    {
406
+     "data": {
407
+      "text/plain": [
408
+       "'quick'"
409
+      ]
410
+     },
411
+     "execution_count": 9,
412
+     "metadata": {},
413
+     "output_type": "execute_result"
414
+    }
415
+   ],
416
+   "source": [
+    "max(cfd[condition], key=cfd[condition].get)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "slideshow": {
+     "slide_type": "slide"
+    }
+   },
+   "source": [
+    "## Whole sentences can be the conditions and values too ##\n",
+    "\n",
+    "This is basically how Cleverbot works:\n",
+    "\n",
432
+    "![Cleverbot](images/cleverbot.png)\n",
433
+    "\n",
434
+    "[http://www.cleverbot.com/](http://www.cleverbot.com/)"
435
+   ]
436
+  },
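+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "slideshow": {
+     "slide_type": "fragment"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# A minimal sketch of the same CFD idea at the sentence level, with\n",
+    "# toy data (an illustration, not Cleverbot's actual model).\n",
+    "exchanges = [\n",
+    "    ('how are you?', 'fine, thanks.'),\n",
+    "    ('how are you?', 'not bad.'),\n",
+    "    ('hello', 'hi there!'),\n",
+    "]\n",
+    "sent_cfd = defaultdict(lambda: defaultdict(lambda: 0))\n",
+    "for prompt, response in exchanges:\n",
+    "    sent_cfd[prompt][response] += 1\n",
+    "max(sent_cfd['how are you?'], key=sent_cfd['how are you?'].get)"
+   ]
+  },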
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "slideshow": {
+     "slide_type": "slide"
+    }
+   },
+   "source": [
+    "## Random text! ##"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {
+    "slideshow": {
+     "slide_type": "fragment"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "must therefore that half ago for hope that occasion , Perry -- abundance about ten\n"
+     ]
+    }
+   ],
+   "source": [
+    "import nltk\n",
+    "import random\n",
+    "\n",
+    "TEXT = nltk.corpus.gutenberg.words('austen-emma.txt')\n",
+    "\n",
+    "# NLTK shortcuts :)\n",
+    "bigrams = nltk.bigrams(TEXT)\n",
+    "cfd = nltk.ConditionalFreqDist(bigrams)\n",
+    "\n",
+    "# pick a random word from the corpus to start with\n",
+    "word = random.choice(TEXT)\n",
+    "# generate 15 more words\n",
+    "for i in range(15):\n",
+    "    print word,\n",
+    "    if word in cfd:\n",
+    "        word = random.choice(cfd[word].keys())\n",
+    "    else:\n",
+    "        break"
+   ]
+  },
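+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "slideshow": {
+     "slide_type": "fragment"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# A minimal sketch: instead of choosing uniformly among successors,\n",
+    "# sample them in proportion to how often each followed the word.\n",
+    "def weighted_choice(freqdist):\n",
+    "    # freqdist maps successor -> count\n",
+    "    target = random.uniform(0, sum(freqdist.values()))\n",
+    "    for successor, count in freqdist.items():\n",
+    "        target -= count\n",
+    "        if target <= 0:\n",
+    "            return successor\n",
+    "\n",
+    "word = random.choice(TEXT)\n",
+    "for i in range(15):\n",
+    "    print word,\n",
+    "    if word in cfd:\n",
+    "        word = weighted_choice(cfd[word])\n",
+    "    else:\n",
+    "        break"
+   ]
+  },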
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "slideshow": {
+     "slide_type": "slide"
+    }
+   },
+   "source": [
+    "## Random poems ##\n",
+    "\n",
+    "Generating random poems is just a matter of constraining the choice of the next word:\n",
+    "\n",
+    "* words that rhyme with the previous line\n",
+    "* words that match a certain syllable count\n",
+    "* words that alliterate with words on the same line\n",
+    "* etc.\n",
+    "\n",
+    "One such constraint is sketched in the next cell."
+   ]
+  },
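+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "slideshow": {
+     "slide_type": "fragment"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# A minimal sketch of a syllable-count constraint, using a naive\n",
+    "# vowel-group heuristic (the real generator uses count_syllables.py).\n",
+    "import re\n",
+    "\n",
+    "def naive_syllables(word):\n",
+    "    # count runs of vowels as a rough syllable estimate\n",
+    "    return max(1, len(re.findall('[aeiouy]+', word.lower())))\n",
+    "\n",
+    "budget = 2  # syllables left on this line of the poem\n",
+    "[w for w in next_words if naive_syllables(w) <= budget]"
+   ]
+  },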
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "slideshow": {
+     "slide_type": "slide"
+    }
+   },
+   "source": [
+    "![Buzzfeed Haiku Generator](images/buzzfeed.png)\n",
+    "\n",
+    "[http://mule.hallada.net/nlp/buzzfeed-haiku-generator/](http://mule.hallada.net/nlp/buzzfeed-haiku-generator/)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "collapsed": true,
+    "slideshow": {
+     "slide_type": "slide"
+    }
+   },
+   "source": [
+    "## Remember these? ##\n",
+    "\n",
+    "![madlibs](images/madlibs.png)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "slideshow": {
+     "slide_type": "slide"
+    }
+   },
+   "source": [
+    "\n",
+    "These worked so well because they forced the random words (chosen by you) to fit into the syntactic structure and parts of speech of an existing sentence.\n",
+    "\n",
542
+    "You end up with **syntactically** correct sentences that are **semantically** random.\n",
543
+    "\n",
544
+    "We can do the same thing!"
545
+   ]
546
+  },
547
+  {
548
+   "cell_type": "markdown",
549
+   "metadata": {
550
+    "slideshow": {
551
+     "slide_type": "slide"
552
+    }
553
+   },
554
+   "source": [
555
+    "## NLTK Syntax Trees! ##"
556
+   ]
557
+  },
558
+  {
559
+   "cell_type": "code",
560
+   "execution_count": 11,
561
+   "metadata": {
562
+    "slideshow": {
563
+     "slide_type": "fragment"
564
+    }
565
+   },
566
+   "outputs": [
567
+    {
568
+     "name": "stdout",
569
+     "output_type": "stream",
570
+     "text": [
571
+      "(S\n",
572
+      "  (NP (DT the) (NN quick))\n",
573
+      "  (VP\n",
574
+      "    (VB brown)\n",
575
+      "    (NP\n",
576
+      "      (NP (JJ fox) (NN jumps))\n",
577
+      "      (PP (IN over) (NP (DT the) (JJ lazy) (NN dog)))))\n",
578
+      "  (. .))\n"
579
+     ]
580
+    }
581
+   ],
582
+   "source": [
583
+    "from stat_parser import Parser\n",
584
+    "parser = Parser()\n",
585
+    "print parser.parse('The quick brown fox jumps over the lazy dog.')"
586
+   ]
587
+  },
588
+  {
589
+   "cell_type": "markdown",
590
+   "metadata": {
591
+    "slideshow": {
592
+     "slide_type": "slide"
593
+    }
594
+   },
595
+   "source": [
+    "## Swapping matching syntax subtrees between two corpora ##"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "metadata": {
+    "slideshow": {
+     "slide_type": "fragment"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "(SBARQ\n",
+      "  (SQ\n",
+      "    (NP (PRP she))\n",
+      "    (VP\n",
+      "      (VBD was)\n",
+      "      (VBN obliged)\n",
+      "      (S+VP (TO to) (VP (VB stop) (CC and) (VB think)))))\n",
+      "  (. .))\n",
+      "she was obliged to stop and think .\n",
+      "==============================\n",
+      "They was hacked to amp ; support !\n",
+      "(SBARQ\n",
+      "  (SQ\n",
+      "    (NP (PRP They))\n",
+      "    (VP\n",
+      "      (VBD was)\n",
+      "      (VBN hacked)\n",
+      "      (S+VP (TO to) (VP (VB amp) (CC ;) (VB support)))))\n",
+      "  (. !))\n"
+     ]
+    }
+   ],
+   "source": [
+    "from syntax_aware_generate import generate\n",
+    "\n",
+    "# inserts matching syntax subtrees from trump.txt into\n",
+    "# trees from austen-emma.txt\n",
+    "generate('trump.txt', word_limit=15)"
+   ]
+  },
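+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "slideshow": {
+     "slide_type": "fragment"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# A rough sketch of the idea (not the actual implementation in\n",
+    "# syntax_aware_generate.py): splice a same-label subtree from a\n",
+    "# donor parse into a matching position in another parse.\n",
+    "from nltk import Tree\n",
+    "\n",
+    "tree = Tree.fromstring('(S (NP (DT the) (NN fox)) (VP (VBZ jumps)))')\n",
+    "donor = Tree.fromstring('(S (NP (DT a) (NN cat)) (VP (VBZ sleeps)))')\n",
+    "\n",
+    "for pos in tree.treepositions():\n",
+    "    if isinstance(tree[pos], Tree) and tree[pos].label() == 'NP':\n",
+    "        # replace the first NP in tree with the donor's first NP\n",
+    "        tree[pos] = next(st for st in donor.subtrees()\n",
+    "                         if st.label() == 'NP')\n",
+    "        break\n",
+    "print tree"
+   ]
+  },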
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "slideshow": {
+     "slide_type": "slide"
+    }
+   },
+   "source": [
+    "## spaCy ##\n",
+    "\n",
+    "![spaCy speed comparison](images/spacy_speed.png)\n",
+    "\n",
+    "[https://spacy.io/docs/api/#speed-comparison](https://spacy.io/docs/api/#speed-comparison)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "slideshow": {
+     "slide_type": "slide"
+    }
+   },
+   "source": [
+    "## Character-based Recurrent Neural Networks ##\n",
+    "\n",
+    "![RNN Paper](images/rnn_paper.png)\n",
+    "\n",
+    "[http://www.cs.utoronto.ca/~ilya/pubs/2011/LANG-RNN.pdf](http://www.cs.utoronto.ca/~ilya/pubs/2011/LANG-RNN.pdf)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "slideshow": {
+     "slide_type": "slide"
+    }
+   },
+   "source": [
+    "## Implementation: char-rnn ##\n",
+    "\n",
+    "![char-rnn](images/char-rnn.png)\n",
+    "\n",
+    "[https://github.com/karpathy/char-rnn](https://github.com/karpathy/char-rnn)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "slideshow": {
+     "slide_type": "slide"
+    }
+   },
+   "source": [
+    "## Generating Shakespeare with char-rnn ##\n",
+    "\n",
+    "![Shakespeare](images/shakespeare.png)\n",
+    "\n",
+    "[http://karpathy.github.io/2015/05/21/rnn-effectiveness/](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "collapsed": true,
+    "slideshow": {
+     "slide_type": "slide"
+    }
+   },
+   "source": [
+    "# The end #\n",
+    "\n",
+    "Questions?"
+   ]
+  }
+ ],
+ "metadata": {
+  "celltoolbar": "Slideshow",
+  "kernelspec": {
+   "display_name": "Python 2",
+   "language": "python",
+   "name": "python2"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 2
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython2",
+   "version": "2.7.11+"
+  },
+  "livereveal": {
+   "scroll": true,
+   "theme": "simple",
+   "transition": "linear"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}

+ 3 - 3
generate_poem.py

@@ -13,7 +13,7 @@ from count_syllables import count_syllables
 
 
 class PoemGenerator():
-    def __init__(self, corpus):
+    def __init__(self):
         #self.corpus = 'melville-moby_dick.txt'
         #self.corpus = read_titles()
         #self.sents = corpus.sents(self.corpus)
@@ -71,7 +71,7 @@ class PoemGenerator():
         else:
             print('')
 
-    def generate_poem(self):
+    def generate_text(self):
         #sent = random.choice(self.sents)
         #parsed = self.parser.parse(' '.join(sent))
         word = random.choice(self.bigrams)[0]
@@ -139,7 +139,7 @@ class PoemGenerator():
 
 
 if __name__ == '__main__':
-    generator = PoemGenerator('poop')
+    generator = PoemGenerator()
     #generator.generate_poem()
     haiku = generator.generate_haiku()
     print haiku

BIN
images/buzzfeed.png


BIN
images/char-rnn.png


BIN
images/cleverbot.png


BIN
images/madlibs.png


BIN
images/phone_autocomplete.gif


BIN
images/phone_keyboard.png


BIN
images/rnn_paper.png


BIN
images/shakespeare.png


BIN
images/spacy_speed.png


+ 7 - 3
syntax_aware_generate.py

@@ -29,7 +29,7 @@ Tree.__hash__ = tree_hash
 # corpora. Shitty bus wifi makes it hard to download spacy data and look up the docs.
 
 
-def generate(filename):
+def generate(filename, word_limit=None):
     global syntaxes
     parser = Parser()
     if not os.path.exists(SYNTAXES_FILE):
@@ -37,7 +37,10 @@ def generate(filename):
         # NOTE: results.txt is a big file of raw text not included in source control, provide your own corpus.
         with codecs.open(filename, encoding='utf-8') as corpus:
             sents = nltk.sent_tokenize(corpus.read())
-            sents = [sent for sent in sents if len(sent) < 150][0:1500]
+            if word_limit:
+                sents = [sent for sent in sents if len(sent) < word_limit]
+            sent_limit = min(1500, len(sents))
+            sents = sents[0:sent_limit]
             for sent in tqdm(sents):
                 try:
                     parsed = parser.parse(sent)
@@ -60,7 +63,8 @@ def generate(filename):
             cfds = pickle.load(pickle_file)
 
     sents = nltk.corpus.gutenberg.sents('austen-emma.txt')
-    sents = [sent for sent in sents if len(sent) < 50]
+    if word_limit:
+        sents = [sent for sent in sents if len(sent) < word_limit]
     sent = random.choice(sents)
     parsed = parser.parse(' '.join(sent))
     print(parsed)