Monday, April 26, 2010

Archive from 7_9_07 - Word Macro to extract a Unique list of words from a document

This MS Word macro strips all words from a document and creates a list of unique words in a new document. It dumps any word starting with a non-ascii character or that is less than 2 characters long.
Its based on some code from:
http://www.microsoft.com/technet/scriptcenter/resources/qanda/jun06/hey0628.mspx
Public Sub MakeUniqueList()
'This will make a list of all unique words in the document
' and stash them in a new document

Set objDictionary = CreateObject("Scripting.Dictionary")
Set objDoc = Application.ActiveDocument
Set colWords = objDoc.Words

Dim cleanWord As String

For Each strWord In colWords
'clean up the word
strWord = Trim(LCase(strWord))
cleanWord = strWord
'see if we want to add it to the dic
If KeepWord(cleanWord) Then
If objDictionary.Exists(cleanWord) Then
Else
objDictionary.Add strWord, strWord
End If
End If
Next
'create a new document to hold the list
Set objDoc2 = Application.Documents.Add()
Set objSelection = Application.Selection
For Each strItem In objDictionary.Items
objSelection.TypeText strItem & vbCrLf
Next
Set objRange = objDoc2.Range
objRange.Sort
End Sub

Private Function KeepWord(ByRef word As String) As Boolean
'function to try to remove some of the rubbish words from the list
'get rid of short words
If (Len(word) < 2) Then
'retVal = False
KeepWord = False
Exit Function
End If

'check for punctuation characters
'ASC function (65-90 is ucase ) 97 - 122 is lcase
If (Asc(word) < 65) Or (Asc(word) > 122) Or ((Asc(word) > 90) And (Asc(word) < 97)) Then
KeepWord = False
Exit Function

End If
KeepWord = True
End Function


Used in Microsoft Word 2003.

No comments:

Post a Comment