The scanned text of the 17 volumes of Massachusetts Soldiers and Sailors of the Revolutionary War (a compilation available at https://archive.org) was difficult to search because it contained so many errors, including misread names, broken words, and entire pages of garbled text.
After several years of gradual progress, I have finished correcting hundreds of thousands of systematic errors from the OCR (optical character recognition) process. Each record is now on one line in the file with no breaks, making it a better basis for determining the composition of companies or regiments at various times; taking inventory of which towns the soldiers were from; studying traits of soldiers from rolls that described the men (hair, eyes, height, ethnicity, occupation, etc.); and identifying which men were in specific battles, either through a positive claim containing words like killed, wounded, captured, or lost items, or through a tacit claim of being in the unit at the time of the battle.
The original “text view” was copied into a file for each volume. I processed those files using Python to remove the arbitrary line endings, thus concatenating the text for each soldier. My goal was to be able to do wildcard searches per soldier so that I could find matches between company commanders and regiment commanders. For instance, searching for “ilder.*Doo” would find men in Capt. Abel Wilder’s company, Col. Ephraim Doolittle’s regt. That syntax is for Notepad++. I got tired of doing that manually in each of the 17 volumes. Having everything in one file also means not having to search across volumes for sound-alike names such as Ayres and Eyres.
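The line-joining step can be sketched roughly as follows. This is a minimal illustration, not the script I actually ran: it assumes (hypothetically) that a new record always begins with "Surname, Given" at the start of a line and that every other line is a continuation. The function name join_records and the second sample record ("Doe, John") are invented for the example.

```python
import re

# ASSUMPTION: a record starts with "Surname, Given" at the start of a line.
# The real volumes would need a more careful test than this.
RECORD_START = re.compile(r"^[A-Z][A-Za-z'’]+, [A-Z]")

def join_records(lines):
    """Merge wrapped lines so that each soldier's record sits on one line."""
    records = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines left over from the page layout
        if RECORD_START.match(line) or not records:
            records.append(line)          # start of a new record
        else:
            records[-1] += " " + line     # continuation of the previous record
    return records

# Sample input: the first record is the Allen example from this post,
# the second ("Doe, John") is invented for illustration.
sample_lines = [
    "Allen, Abijah, Sergeant, Capt. Joseph Richards's co.;",
    "enlisted Aug. 11, 1779; service 1 mo. 3 days with",
    "detachment under Capt. Samuel Fisher at Rhode Island.",
    "Doe, John, Private, Capt. Abel Wilder's co.,",
    "Col. Ephraim Doolittle's regt.",
]

records = join_records(sample_lines)

# The same wildcard search as in Notepad++, now applied per record.
hits = [r for r in records if re.search(r"ilder.*Doo", r)]
for h in hits:
    print(h)
```

Because each record is a single line, the ".*" in the pattern cannot run from one soldier's entry into the next. Words hyphenated across a line break are not handled here and would need their own rule.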
This is a large file. I STRONGLY RECOMMEND using Notepad++ or doing a bit of programming in Python (DO NOT ATTEMPT IN: Notepad, WordPad, or even MS Word). Both Notepad++ and Python support wildcard and extended character searching, e.g., “\n\r”. Notepad++ is limited to one wildcard, so for complex sequences, Python is the only option (using the regex library and conditional logic to mimic an SQL query whose WHERE clause has multiple conditions).
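As a rough idea of what the Python approach looks like, here is a minimal sketch that keeps only the records matching every one of several patterns, much like AND-ed conditions in SQL. The helper name find_records and all three sample records are invented for illustration; in practice the records would be the lines read from the corrected file.

```python
import re

def find_records(records, *patterns):
    """Return the records that match ALL of the given regular expressions."""
    compiled = [re.compile(p) for p in patterns]
    return [r for r in records if all(c.search(r) for c in compiled)]

# Invented sample records, one soldier per line.
sample_records = [
    "Doe, John, Private, Capt. Abel Wilder's co., Col. Ephraim Doolittle's regt.; "
    "enlisted April 26, 1775; reported wounded.",
    "Roe, Richard, Private, Capt. Abel Wilder's co., Col. Ephraim Doolittle's regt.; "
    "enlisted May 1, 1775.",
    "Poe, Samuel, Corporal, Capt. Samuel Fisher's co.; "
    "enlisted Aug. 11, 1779; reported killed.",
]

# Roughly: WHERE company/regiment matches AND year is 1775 AND a casualty word appears
matches = find_records(
    sample_records,
    r"ilder.*Doo",       # Wilder's company in Doolittle's regiment
    r"\b1775\b",         # the year 1775 somewhere in the record
    r"wounded|killed",   # a positive casualty claim
)
for m in matches:
    print(m)
```

Each pattern is checked independently, so the order in which the pieces appear within a record does not matter, which is the part a single wildcard search cannot do.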
Removing end-of-line characters as described above generally worked, but issues with the OCR process resulted in many misread characters and unintended line breaks. So, I also started replacing some of the safer sequences:
Daxiei to Daniel x4
Daxiel to Daniel x68
Aakon to Aaron x3
Reubex to Reuben x35
AxLEX to Allen x7
Johx to John x452
JoHx to John x63
Jamks to James x7
AVigglesworth to Wigglesworth x36
Woodbridjje to Woodbridge
Bexoxi to Benoni x5
Bexjamix to Benjamin x98
Williaji to William x70
Joxathax to Jonathan x51
Lejiuel to Lemuel x5
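A list like the one above can be applied in a few lines of Python. This is only a sketch: the pairs are a handful taken from the list, the function name apply_replacements is made up, and the sample text is invented to show the counting.

```python
# A few of the pairs from the list above (wrong spelling, correction).
REPLACEMENTS = [
    ("Daxiel", "Daniel"),
    ("Johx", "John"),
    ("JoHx", "John"),
    ("Bexjamix", "Benjamin"),
    ("Williaji", "William"),
]

def apply_replacements(text, pairs):
    """Apply each case-sensitive replacement and report how often it fired."""
    counts = {}
    for wrong, right in pairs:
        counts[wrong] = text.count(wrong)   # occurrences before replacing
        text = text.replace(wrong, right)
    return text, counts

# Invented sample text for illustration.
sample = "Johx Allen, Daxiel Allen, JoHx Smith, Johx Smith"
fixed, counts = apply_replacements(sample, REPLACEMENTS)

print(fixed)
for wrong, n in counts.items():
    print(f"{wrong} x{n}")
```

The matching is case-sensitive on purpose, which is why Johx and JoHx are separate entries with separate counts.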
If you are tempted to do something similar, please be aware that many seemingly safe find/replace pairs will produce unintended changes. Use Find first to review a large number of prospective changes so you can be certain of not changing valid information.
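In Python, the equivalent of "Find first" is to list every match with some surrounding text before changing anything. The sketch below uses a made-up helper named preview and an invented sample line. It shows why a tempting shortcut such as replacing every "ex" with "en" (which would fix Reubex and Bexjamix) is not safe: it would also alter valid words like Lexington and Alexander.

```python
import re

def preview(text, target, width=20):
    """Return each occurrence of target with surrounding context; changes nothing."""
    hits = []
    for m in re.finditer(re.escape(target), text):
        start = max(0, m.start() - width)
        end = min(len(text), m.end() + width)
        hits.append(text[start:end])
    return hits

# Invented sample line mixing OCR errors with valid words.
sample = "Reubex Allen of Lexington; Bexjamix Alexander"

hits = preview(sample, "ex")
print(len(hits), "prospective changes")
for h in hits:
    print("...", h, "...")
```

Reading through the listed contexts makes it obvious which matches are real errors and which are valid text, so the search string can be narrowed (for example to the whole word "Reubex") before any replacing is done.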
Sometimes the OCR process in combination with a poorly printed page created a mess. For instance:
An^TT.w SPr-eant Capt. Joseph Uichards’s co.; enlisted Aug. 11, 1779; ser'”‘^”^^ZZ 3 a”;n^.’detach,nent under Capt. Samuel Fisher at Rhode Island.
Should be:
Allen, Abijah, Sergeant, Capt. Joseph Richards’s co.; enlisted Aug. 11, 1779; service 1 mo. 3 days with detachment under Capt. Samuel Fisher at Rhode Island.
Thus, if you’re looking for a specific person, the scanned version on archive.org is your best option.
-CEF