"""
Program parses through input text to identify and output the cue, claim, and source
of sentences within the input text.
---------------------------------------------
Global Variables
----------------
textInput is a string that represents the input
jsonFile represents the output file
inputFile represents the input file
sentenceList will contain a list of sentences from the input after
    tokenizeToSents() is called in the program
numSentences represents the number of sentences in sentenceList
hasNSUBJ and hasCCOMP are boolean variables used to confirm that the cue has
    outgoing edges of type NSUBJ and CCOMP
isNamedEntity is a boolean variable used to determine if the source of the
    current sentence is a named entity
hasCue is a boolean variable used when checking if the current sentence
    contains a cue
source is a spaCy token of the doc that represents the source in a sentence
claim is a spaCy span of the doc that represents the claim in a sentence
cue is a spaCy token of the doc that represents the cue in a sentence
claimMark is a spaCy token of the doc that is attached to the cue by the
    CCOMP relationship
sentenceArr is a dictionary that holds the different components of the
    sentence that is being stored (direct_quote, sentence, source, cue,
    and claim)
sentencesLoopCount is an integer used to keep track of the while loop
currSentence is a string that represents the current sentence that is being
    parsed. It is obtained through the sentenceList
currSentenceDoc is the spaCy doc that is created from currSentence
allSentencesDict is a dictionary that will contain the list of all of the
    sentences, built by appending each sentenceArr
accordingToPattern is a regex pattern used to obtain the cue, claim, and
    source from a sentence that has 'according to' in it
quotePattern is a regex pattern that is used to obtain the direct quotes in a
    sentence
---------------------------------------------
Functions
---------
def setInputString(inputToProcess):
    Function that sets the textInput to be an input string.
    inputToProcess is a string that will become the textInput
def setInputFile(fileName):
    Function that sets the textInput to be the contents of an input file.
    fileName is the name of the file that will be set as the textInput
    Example fileName parameter is 'input.txt'
def preprocess():
    Function that preprocesses the text by replacing new line characters with
    spaces and removing periods that come directly before the '@' character
def createOutputFile(fileName):
    Function that creates and opens the output file with write permissions.
    fileName is the name of the file that will be created for the output
    Example fileName parameter is 'output.json'
    Note that the output can be in either json or txt format
def tokenizeToSents():
    Function that tokenizes the input into sentences using nltk's
    sent_tokenize. The tokenized sentences are stored in the sentenceList and
    numSentences is set to be the length of sentenceList
def accordingToCheck():
    Function that applies the regular expression to see if 'according to' is
    in the current sentence. If it is, then the cue, claim, and source will
    be added to sentenceArr, which will then be appended to allSentencesDict.
    Returns True if 'according to' was in the sentence, else returns False
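
As a rough illustration of the 'according to' check described above, the
sketch below applies a pattern of the same shape to a made-up sentence. The
names demo_pattern and split_according_to are hypothetical, and the ASCII
class [^A-Za-z0-9_] is an approximation of the non-word class used by
accordingToPattern, chosen so that the example contains no backslashes.

```python
import re

# Hypothetical stand-in for accordingToPattern (ASCII-only approximation).
demo_pattern = re.compile(
    '^(?P<claim>.*?)[^A-Za-z0-9_]*according to[^A-Za-z0-9_]*'
    '(?P<source>.*?)[^A-Za-z0-9_]*$',
    flags=re.I,
)

def split_according_to(sentence):
    # Return (claim, source) when 'according to' is present, else None.
    found = demo_pattern.match(sentence)
    if found is None:
        return None
    return found.group('claim'), found.group('source')

result = split_according_to(
    'Rain is likely tomorrow, according to the forecast office.')
print(result)  # ('Rain is likely tomorrow', 'the forecast office')
```

The lazy groups keep the punctuation around 'according to' out of both the
claim and the source.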
def obtainDirectQuote():
    Function that performs a regex check to see if a direct quote is present
    in a sentence. If it is, then it adds the quote to the sentenceArr.
    Otherwise, the 'direct_quote' in the sentenceArr is set to None.
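
A minimal sketch of the direct quote check, assuming quotes are delimited by
curly double quotes as in quotePattern. The names demo_quote_pattern and
find_direct_quote are hypothetical. It also shows why search() is the safer
call here: match() only succeeds when the quote starts the sentence.

```python
import re

# Same shape as quotePattern: text between curly double quotes.
demo_quote_pattern = re.compile('“(.+?)”')

def find_direct_quote(sentence):
    # search() scans the whole sentence for the first quoted passage.
    found = demo_quote_pattern.search(sentence)
    return found.group(1) if found else None

sentence = 'The mayor said, “We will rebuild the bridge.”'
print(find_direct_quote(sentence))         # We will rebuild the bridge.
print(demo_quote_pattern.match(sentence))  # None, the quote is not at the start
```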
def setBoolVarsFalse():
    Function that sets all of the boolean variables to false. This is done
    within the while loop of main on each iteration of the loop so that the
    proper checks can be made.
def cueCheck(token):
    Function that checks if the token parameter is a verb whose lemma is one
    of the cues. The function is called in a for loop over all of the tokens
    in the sentence, so each word in the sentence will be checked.
    token is part of a doc from spaCy. The token is an individual component
    of a sentence.
    Returns True if a cue was found, else returns False.
def cueDependencyCheck(token):
    Function iterates through the children of the token to see if it has
    outgoing edges of type NSUBJ and CCOMP. For each edge type that is found,
    it sets the corresponding hasNSUBJ or hasCCOMP variable to true.
    token will be the cue that is inputted into the function as a token.
    Returns True if both hasNSUBJ and hasCCOMP are true, else returns False.
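
The dependency check can be tried without loading a spaCy model by using a
small stand-in class. FakeToken and has_subject_and_clause are hypothetical
names, and the parses below are built by hand rather than produced by spaCy.
Unlike the program, this sketch keeps the result local to each cue instead of
storing it in module-level flags.

```python
from dataclasses import dataclass, field

@dataclass
class FakeToken:
    # Hypothetical stand-in for a spaCy token: text, dependency label, children.
    text: str
    dep_: str = 'ROOT'
    children: list = field(default_factory=list)

def has_subject_and_clause(cue_token):
    # True when the cue has both an nsubj child and a ccomp child.
    labels = {child.dep_ for child in cue_token.children}
    return 'nsubj' in labels and 'ccomp' in labels

# Hand-built parse for "Officials said the road is closed."
said = FakeToken('said', children=[
    FakeToken('Officials', dep_='nsubj'),
    FakeToken('closed', dep_='ccomp'),
])
# Hand-built parse for "Officials said nothing.": no clausal complement.
said_nothing = FakeToken('said', children=[
    FakeToken('Officials', dep_='nsubj'),
    FakeToken('nothing', dep_='dobj'),
])
print(has_subject_and_clause(said), has_subject_and_clause(said_nothing))
# True False
```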
def obtainSourceAndMark(cueParam):
    Function sets the source to be the child attached to the cue by the NSUBJ
    edge and sets the claimMark to be the child attached to the cue by the
    CCOMP edge by iterating through the children of the cue.
    cueParam is the cue inputted into the function as a token
def obtainClaim():
    Function obtains and sets the claim by getting the start and end index of
    the claimMark subtree.
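
The claim is the run of tokens between the smallest and largest index in the
claimMark subtree. A sketch with plain lists, where the token list and the
subtree indices are assumed by hand and span_from_subtree is a hypothetical
name:

```python
# Hypothetical token list for "Officials said the road is closed ."
words = ['Officials', 'said', 'the', 'road', 'is', 'closed', '.']

# Assumed indices of the tokens in the subtree of the ccomp child ("closed").
subtree_indices = [5, 2, 3, 4]

def span_from_subtree(tokens, indices):
    # Slice from the smallest to the largest index; the end bound is exclusive.
    start = min(indices)
    end = max(indices) + 1
    return tokens[start:end]

claim_words = span_from_subtree(words, subtree_indices)
print(' '.join(claim_words))  # the road is closed
```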
def twitterUsernameCheck():
    Function checks with regex if the sentence has a Twitter username so that
    the username may be considered as a named entity. If the source matches a
    username that is found, isNamedEntity will be set to true.
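
A small sketch of the username check, assuming a handle is '@' followed by
letters, digits, or underscores. The names username_pattern and
source_is_username are hypothetical, and the character class is an ASCII
approximation written without backslashes.

```python
import re

# ASCII approximation of a word-character class.
username_pattern = re.compile('@[A-Za-z0-9_]+')

def source_is_username(source_text, sentence):
    # True when the source text is exactly one of the handles in the sentence.
    return source_text in username_pattern.findall(sentence)

sentence = '@city_news says the parade starts at noon.'
print(username_pattern.findall(sentence))           # ['@city_news']
print(source_is_username('@city_news', sentence))   # True
```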
def obtainMultiWordEntity():
    Function obtains the complete named entity if there is a multiple word
    named entity in the sentence. The function loops through all of the named
    entities found in the sentence using the .ents property of a spaCy doc.
    If the index of the source is within the range of an entity in the named
    entities list, then the source is set to equal the complete named entity.
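
The range test is easy to get wrong by one, because the end offset of a spaCy
span is exclusive. The sketch below uses plain (start, end) tuples in place of
spaCy entities; expand_to_entity and the sample data are hypothetical.

```python
def expand_to_entity(source_index, entity_ranges):
    # entity_ranges holds (start, end) token offsets with an exclusive end,
    # the same convention spaCy uses for Span.start and Span.end.
    for start, end in entity_ranges:
        if start <= source_index < end:
            return (start, end)
    return None

words = ['Mayor', 'Jane', 'Doe', 'said', 'taxes', 'will', 'fall', '.']
entity_ranges = [(1, 3)]  # assumed entity covering "Jane Doe"

found = expand_to_entity(2, entity_ranges)  # the subject token is "Doe"
print(' '.join(words[found[0]:found[1]]))   # Jane Doe
print(expand_to_entity(3, entity_ranges))   # None, "said" is outside the entity
```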
def createSentenceArr(passedNamedEntity):
    Function creates the sentence, source, cue, and claim parts of the
    sentenceArr.
    passedNamedEntity is a boolean variable passed into the function to
    determine if there is a named entity in the current sentence.
def writeOutput():
    Function that dumps the allSentencesDict of all the sentenceArr's into
    the output file and then closes the output file.
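
For orientation, the sketch below prints the general shape of the dumped
structure. The record is made up for illustration and is not actual program
output; only the key names and the 'Sentences' wrapper come from the program.

```python
import json

# Made-up record with the five keys stored in each sentenceArr.
record = {
    'direct_quote': None,
    'sentence': 'Officials said the road is closed.',
    'source': 'Officials',
    'cue': 'said',
    'claim': 'the road is closed',
}
all_sentences = {'Sentences': [record]}

# indent=4 matches the call used in writeOutput.
text = json.dumps(all_sentences, indent=4)
print(text)
```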
def main():
    Runs the whole program to properly obtain the cue, claim, and source of
    a sentence.
"""
import json
import re

import spacy
from nltk.tokenize import sent_tokenize

# Load the large English spaCy model once at import time.
nlp = spacy.load('en_core_web_lg')

global textInput
global jsonFile
global inputFile
global sentenceList
global numSentences
global hasNSUBJ
global hasCCOMP
global isNamedEntity
global hasCue
global source
global claim
global cue
global claimMark
global sentenceArr
global sentencesLoopCount
global currSentence
global allSentencesDict
allSentencesDict = {'Sentences':[]}
# Raw strings keep the regex escapes from being read as string escapes.
accordingToPattern = re.compile(r'^(?P<claim>.*?)[\W]*according to[\W]*(?P<source>.*?)[\W]*$', flags=re.I|re.U)
quotePattern = re.compile(r'“(.+?)”')


def setInputString(inputToProcess):
    global textInput
    textInput = inputToProcess
    return


def setInputFile(fileName):
    global inputFile
    global textInput
    inputFile = open(fileName, 'r')
    textInput = inputFile.read()
    inputFile.close()
    return


def preprocess():
    global textInput
    # Join lines into one string and drop the period in '.@' mentions.
    textInput = textInput.replace('\n', ' ')
    textInput = textInput.replace('.@', '@')
    return


def createOutputFile(fileName):
    global jsonFile
    jsonFile = open(fileName, 'w')
    return


def tokenizeToSents():
    global sentenceList
    global numSentences
    sentenceList = sent_tokenize(textInput)
    numSentences = len(sentenceList)
    return


def accordingToCheck():
    global sentenceArr
    global allSentencesDict
    global sentencesLoopCount
    global currSentence
    accordingToMatch = re.match(accordingToPattern, currSentence)
    if accordingToMatch is not None:
        sentenceArr['sentence'] = currSentence
        sentenceArr['source'] = accordingToMatch.group('source')
        sentenceArr['cue'] = 'according to'
        sentenceArr['claim'] = accordingToMatch.group('claim')
        allSentencesDict['Sentences'].append(sentenceArr)
        # The sentence is fully handled here, so advance the loop in main.
        sentencesLoopCount += 1
        return True
    else:
        return False
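

def obtainDirectQuote():
    global currSentence
    # search() finds a quote anywhere in the sentence; match() would only
    # succeed when the sentence starts with the opening quotation mark.
    quoteMatch = re.search(quotePattern, currSentence)
    if quoteMatch is not None:
        # group(1) is the quoted text; .string would be the whole sentence.
        sentenceArr['direct_quote'] = quoteMatch.group(1)
    else:
        sentenceArr['direct_quote'] = None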


def setBoolVarsFalse():
    global hasNSUBJ
    global hasCCOMP
    global isNamedEntity
    global hasCue
    hasNSUBJ = False
    hasCCOMP = False
    isNamedEntity = False
    hasCue = False
    return


# Lemmas treated as cue verbs. Inflected forms such as 'told' are kept from
# the original list even though a lemmatizer would normally not produce them.
cueLemmas = {
    'say', 'report', 'tell', 'told', 'observe', 'state', 'accord', 'insist',
    'assert', 'claim', 'maintain', 'explain', 'deny', 'learn', 'admit',
    'discover', 'forget', 'forgot', 'think', 'thought', 'predict', 'suggest',
    'guess', 'believe', 'doubt', 'wonder', 'ask', 'hope', 'sense', 'hear',
    'feel',
}


def cueCheck(token):
    # A cue is a verb whose lemma appears in cueLemmas.
    return token.pos_ == 'VERB' and token.lemma_ in cueLemmas


def cueDependencyCheck(token):
    global hasNSUBJ
    global hasCCOMP
    for child in token.children:
        if child.dep_ == 'nsubj':
            hasNSUBJ = True
        if child.dep_ == 'ccomp':
            hasCCOMP = True
    return hasNSUBJ and hasCCOMP


def obtainSourceAndMark(cueParam):
    global source
    global claimMark
    global isNamedEntity
    # Iterate over the cue that was passed in rather than the global cue.
    for child in cueParam.children:
        if child.dep_ == 'nsubj':
            source = child
            # ent_type is 0 when the token is not part of a named entity.
            if source.ent_type != 0:
                isNamedEntity = True
        if child.dep_ == 'ccomp':
            claimMark = child
    return


def obtainClaim():
    global claim
    # The claim spans the first through the last token of the ccomp subtree.
    children = list(claimMark.subtree)
    childrenIndices = [child.i for child in children]
    claimStart = min(childrenIndices)
    claimEnd = max(childrenIndices) + 1
    claim = currSentenceDoc[claimStart:claimEnd]
    return


def twitterUsernameCheck():
    global currSentence
    global isNamedEntity
    global source
    if not isNamedEntity:
        # '@' followed by one or more word characters.
        usernamePattern = re.compile(r'@\w+')
        usernameMatch = re.findall(usernamePattern, currSentence)
        for entity in usernameMatch:
            if source.text == entity:
                isNamedEntity = True
    return


def obtainMultiWordEntity():
    global source
    entityLoopCount = 0
    numEntities = len(currSentenceDoc.ents)
    foundEntityGroup = False
    while entityLoopCount < numEntities and not foundEntityGroup:
        currEntity = currSentenceDoc.ents[entityLoopCount]
        # Span.end is exclusive, so the upper bound uses a strict comparison.
        if currEntity.start <= source.i < currEntity.end:
            source = currEntity
            foundEntityGroup = True
        entityLoopCount += 1
    return


def createSentenceArr(passedNamedEntity):
    global sentenceArr
    global currSentence
    if passedNamedEntity:
        sentenceArr['sentence'] = currSentenceDoc.text
        sentenceArr['source'] = source.text
        sentenceArr['cue'] = cue.text
        sentenceArr['claim'] = claim.text
    else:
        # Without a named-entity source, only the sentence itself is stored.
        sentenceArr['sentence'] = currSentence
        sentenceArr['source'] = None
        sentenceArr['cue'] = None
        sentenceArr['claim'] = None
    return


def writeOutput():
    global jsonFile
    print(json.dumps(allSentencesDict, indent=4))
    jsonFile.write(json.dumps(allSentencesDict, indent=4))
    jsonFile.close()


def main():
    global sentenceArr
    global hasCue
    global currSentence
    global sentencesLoopCount
    global cue
    global currSentenceDoc
    setInputFile('input.txt')
    preprocess()
    createOutputFile('json_output.json')
    tokenizeToSents()
    sentencesLoopCount = 0
    while sentencesLoopCount < numSentences:
        sentenceArr = {}
        currSentence = sentenceList[sentencesLoopCount]
        obtainDirectQuote()
        # accordingToCheck() stores the entry and advances the counter itself
        # when it matches, so the spaCy path only runs when it returns False.
        if not accordingToCheck():
            currSentenceDoc = nlp(currSentence)
            setBoolVarsFalse()
            for token in currSentenceDoc:
                hasCue = cueCheck(token)
                if hasCue:
                    if cueDependencyCheck(token):
                        cue = token
                        obtainSourceAndMark(cue)
                        obtainClaim()
                        twitterUsernameCheck()
                        if isNamedEntity:
                            obtainMultiWordEntity()
            createSentenceArr(isNamedEntity)
            allSentencesDict['Sentences'].append(sentenceArr)
            sentencesLoopCount += 1
    writeOutput()
    return


if __name__ == "__main__":
    main()