Add more notes to markov_matrix.py
This commit is contained in:
parent
35792b6261
commit
9acda1716d
@@ -2,14 +2,22 @@
My idea here is to encode the entire corpus as one giant two-dimensional numpy array of floats where each row is a
condition word and each column gives, for every other word in the corpus, the probability that that word follows
the condition word.

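The encoding described above can be sketched roughly like this (a minimal illustration over a toy corpus; the variable names and the corpus are mine, not this repo's actual code):

```python
import numpy as np

# Toy corpus; the real script presumably tokenizes a full text.
tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat']
vocab = sorted(set(tokens))
index = {w: i for i, w in enumerate(vocab)}

# Count how often each (condition, follower) pair occurs.
counts = np.zeros((len(vocab), len(vocab)), dtype=float)
for prev, nxt in zip(tokens, tokens[1:]):
    counts[index[prev], index[nxt]] += 1

# Normalize each row into follow probabilities; rows with no
# observed followers stay all-zero instead of dividing by zero.
row_sums = counts.sum(axis=1, keepdims=True)
probs = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
```

Each row of `probs` that had any observations sums to 1, which is exactly what makes the array a (row-stochastic) Markov matrix.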
This was an interesting idea, but ultimately not that useful since the resulting numpy array is significantly larger
than just storing the CFD in a Python dictionary. There might be some crazy linear algebra I could run to compress this
array to make it less sparse. But I would need to use the same N words for all corpora, and I think that the resulting
compressed arrays would only be really useful for comparing with each other to find things like "closeness" between two
corpora as defined by the probabilities that some words follow other words in the text. Also, using the same N words
across all corpora is less awesome because you will miss out on the unique words (names, proper nouns, etc.) present in
only some corpora.
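One hypothetical way to realize that "closeness" comparison, assuming both corpora were encoded over the same shared N-word vocabulary (the function is my sketch, not anything from this repo):

```python
import numpy as np

def corpus_closeness(probs_a, probs_b):
    # Cosine similarity between the flattened transition matrices.
    # Only meaningful if both were built over the same vocabulary,
    # in the same row/column order.
    a, b = probs_a.ravel(), probs_b.ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0
```

Identical transition structure scores 1.0; corpora whose observed bigrams never overlap score 0.0.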
"""
|
"""
|
||||||
|
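The CFD-in-a-dictionary alternative the note prefers might look something like this sparse sketch (a guess at the shape, not the script's actual code): only observed (condition, follower) pairs take space, unlike the dense vocab-by-vocab array.

```python
from collections import defaultdict

tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat']

# condition word -> {follower word -> count}
cfd = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(tokens, tokens[1:]):
    cfd[prev][nxt] += 1

def follow_prob(prev, nxt):
    # Conditional probability computed on demand from the raw counts.
    total = sum(cfd[prev].values())
    return cfd[prev][nxt] / total if total else 0.0
```

Memory here scales with the number of distinct bigrams actually seen, rather than with the square of the vocabulary size.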
import codecs
import sys
from collections import OrderedDict
from itertools import islice

import nltk  # TODO: write/import a tokenizer so I don't need to import this
import numpy as np
BEGIN_TOKEN = '__BEGIN__'
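Presumably `BEGIN_TOKEN` pads the front of a token sequence so the first real word still has a condition word to be counted against; a hedged sketch of that use (my assumption about its role, since this diff does not show how the constant is consumed):

```python
BEGIN_TOKEN = '__BEGIN__'

def bigrams_with_begin(tokens):
    # Prepend BEGIN_TOKEN so the first word appears as a follower of
    # a sentence-start marker rather than being dropped from the counts.
    padded = [BEGIN_TOKEN] + list(tokens)
    return list(zip(padded, padded[1:]))
```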