
Get paragraphs from a Word document via PowerShell: Learn how to extract and manipulate text data



# Word exposes the text of a document through the Paragraphs collection.
# Each paragraph in the document can be accessed by its index in that collection.
# The collection is 1-based, so there is no Paragraphs[0].
# A carriage return in the document starts a new paragraph.


Wouldn't this only work if Word is installed on the machine? If you didn't have Word, you could also unzip the .docx file and parse document.xml to manipulate it. I've actually done this in a PowerShell script in the past to update data sources in the document.
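The unzip-and-parse route can be sketched in a few lines. This is a minimal Python illustration using only the standard library, not the commenter's PowerShell script; the function name `read_paragraphs` is made up for the example, and it assumes the text lives in `word/document.xml` under the standard WordprocessingML namespace.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace prefix for WordprocessingML elements such as w:p and w:t.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_paragraphs(docx_path):
    """Return the text of every paragraph found in word/document.xml.

    docx_path may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for p in root.iter(W + "p"):
        # A paragraph's text is split across runs; the w:t elements hold it.
        paragraphs.append("".join(t.text or "" for t in p.iter(W + "t")))
    return paragraphs
```

Note that the returned list is zero-based, unlike Word's own Paragraphs collection, and that headers, footers, and footnotes live in other XML parts that this sketch does not read.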








To set the border on the first paragraph, we use the item method from the paragraph collection that is held in the $paragraphs variable. We choose the first paragraph by supplying 1 to the item method. Now we get the borders collection by using the borders property, and again use the item method to retrieve a specific border. The border we want to draw is the bottom border. To do this we use the wdBorderBottom value from the wdBorderType enumeration. We have stored this enumeration type in the variable $borderType. The secret here is the use of the double colon (::) to get our specific enumeration value.


I have a lot of Word files that contain paragraphs of text. Each Word file is just a translation of this text into a different language. I would like to grab this content from Word and output it in my own HTML format (not the one Word generates, but a specific template). So I need to be able to read all the paragraphs in Word, append some HTML tags, and save the result under the Word file's name with an .html extension.
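One way to approach the output half of this request, once the paragraph text has been read by whatever means, is sketched below in Python. The template is a deliberately bare placeholder and the names `paragraphs_to_html` and `html_path_for` are hypothetical; a real template would replace the string in the function.

```python
import html
from pathlib import Path


def paragraphs_to_html(paragraphs, title):
    """Wrap each non-empty paragraph in <p> tags inside a minimal page."""
    body = "\n".join(
        f"<p>{html.escape(text)}</p>" for text in paragraphs if text.strip()
    )
    return (
        "<!DOCTYPE html>\n"
        f"<html>\n<head><title>{html.escape(title)}</title></head>\n"
        f"<body>\n{body}\n</body>\n</html>\n"
    )


def html_path_for(docx_path):
    """Same location and file name as the Word file, with an .html extension."""
    return Path(docx_path).with_suffix(".html")
```

Escaping the text matters here because translated content may contain characters such as & or < that would otherwise break the markup.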


Notice how it created 3 paragraphs in 1 second. Also notice how Word marked all text red because the language of the document was not set, so it took defaults from my Word setup. Keep in mind it doesn't need Word installed. It doesn't open Microsoft Word in the background. It works on the XML level.


Firstly, the file path is set, together with an array of paragraphs containing particular text to format. If the path exists, the current working directory is changed and the names of all the Word files within it, excluding those in sub-directories, are retrieved. This is followed by a check to make sure that there are files to process at the desired location. Each file is then processed one by one. For each paragraph of text within a document, a check is made to see if its text matches any item in the paragraphs to format array. If a match is found the text is formatted.
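The control flow described above can be sketched as follows. This is an assumption-laden Python illustration rather than the original script: it reads the .docx files at the XML level instead of driving Word, it does not change the working directory, and instead of applying formatting it only reports the 1-based positions of the matching paragraphs. The function names are hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def paragraph_texts(docx_path):
    """Return the text of each paragraph in word/document.xml."""
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    return ["".join(t.text or "" for t in p.iter(W + "t")) for p in root.iter(W + "p")]


def find_paragraphs_to_format(folder, targets):
    """Map each .docx file name in folder to the positions of matching paragraphs."""
    folder = Path(folder)
    if not folder.is_dir():
        raise FileNotFoundError(f"Folder not found: {folder}")

    # glob("*.docx") is not recursive, so sub-directories are excluded.
    files = sorted(folder.glob("*.docx"))
    if not files:
        return {}  # nothing to process at this location

    matches = {}
    for path in files:
        matches[path.name] = [
            index
            for index, text in enumerate(paragraph_texts(path), start=1)
            if text.strip() in targets
        ]
    return matches
```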


Today, I will be stepping away from Excel as the Office automated tool using PowerShell, and instead look at how you can use Word with PowerShell to create a document -- or, at the very least, get you started on the path to creating an automated approach to building a document.


Lastly, we should probably save our work so that we can send it to whomever might need it. I'll use the SaveAs method on my Document object and supply the full path to where the document will be saved to, as well as the file name. I'll also tell the method what kind of format to save the document in. Once I do that, I am free to quit Word by calling Quit() under my word object.


Ok, here's a much better one yet. I have elected to apply multiple find-and-replace operations as I loop through the StoryRanges of the document, instead of calling my former function several times (and looping through the StoryRanges over and over). I'm also now looking for the Shapes inside Headers and Footers directly from the Shapes collection rather than from the StoryRanges; this works much better. We can access this collection from any Section's Header (or Footer), so we simply look into the first Header of the first Section, hence the Sections.Item(1).Headers.Item(1). Finally, rather than muting the output of findAndReplace, I'm counting how many times we do an actual replacement. Hopefully someone finds this helpful; it was a great way to start using PowerShell for me anyway.


This function will paste any selected text from the ISE into a Word document. The first time you run the function, PowerShell will create a Word document and format it for fixed width text. It will then insert your text and a new paragraph marker. The next time you run the function, it should detect that you have a document open and re-use the existing variables. The Word document will be visible so you can edit it further and save it. If you move the cursor around in the document, any new content you insert will go there.


The user can only go so far with this. A docx file is built from folders full of XML files. None of these XML files are self-contained. But search and replace is enough to make document templates (documents with placeholders for data), and that's pretty useful in itself.
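A minimal sketch of the template idea, in Python with only the standard library. The `{{name}}` placeholder syntax and the function name `fill_template` are assumptions made for this example, not something the text prescribes, and only `word/document.xml` is touched.

```python
import zipfile
from xml.sax.saxutils import escape


def fill_template(template_path, output_path, values):
    """Copy a .docx, replacing {{name}} placeholders in word/document.xml."""
    with zipfile.ZipFile(template_path) as src:
        with zipfile.ZipFile(output_path, "w", zipfile.ZIP_DEFLATED) as dst:
            for item in src.infolist():
                data = src.read(item.filename)
                if item.filename == "word/document.xml":
                    text = data.decode("utf-8")
                    for name, value in values.items():
                        # Escape the value so characters like & stay valid XML.
                        text = text.replace("{{" + name + "}}", escape(value))
                    data = text.encode("utf-8")
                # Every other part of the package is copied unchanged.
                dst.writestr(item.filename, data)
```

A plain string replace only works when the placeholder sits in a single run. Word sometimes splits typed text across several runs, in which case the placeholder will not appear as one contiguous string in the XML.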


Navigating through XML is straightforward with lxml. It is a separate step to take whatever you find and bring it out of the XML. For instance, you may want to iterate over a document, looking for paragraphs with a particular format, then pull the text out of those paragraphs. Docx2python v1 did not separate or expose "iter the document" and "pull the content". Docx2python v2 separates and exposes these steps. This will allow easier extension.
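The split between "iterate" and "pull the content" can be illustrated with a small sketch. This is not docx2python's API; the function names are hypothetical, and it uses the standard library's ElementTree in place of lxml, whose element interface is similar for these calls.

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def paragraph_style(p):
    """Return the style id of a w:p element, or None if no style is set."""
    node = p.find(f"{W}pPr/{W}pStyle")
    return node.get(W + "val") if node is not None else None


def iter_paragraphs(root, style=None):
    """Step 1, iteration: yield paragraph elements, optionally filtered by style."""
    for p in root.iter(W + "p"):
        if style is None or paragraph_style(p) == style:
            yield p


def paragraph_text(p):
    """Step 2, extraction: pull the text out of one paragraph element."""
    return "".join(t.text or "" for t in p.iter(W + "t"))
```

Keeping the two steps apart means a caller can change the filter (a different style, a different element) without touching the extraction, for example `[paragraph_text(p) for p in iter_paragraphs(root, style="Heading1")]`.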


You should get a count of the lines, words, and characters in the text. Of course, you could do this easily enough with your word processor. The power of working on the command-line comes from being able to manipulate lots of things at once and being able to specify what we want done with extra precision. In this example, this means we can count words in multiple of our files at once, and that we can add additional parameters to specify exactly how we want to count them.
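For comparison, the same kind of count across several files can be sketched in Python. The function names are made up for the example, and the counting rules are assumptions: lines are newline characters, words are whitespace-separated, and characters are counted rather than bytes.

```python
from pathlib import Path


def count_text(text):
    """Return (lines, words, characters) for a string."""
    return text.count("\n"), len(text.split()), len(text)


def count_files(paths):
    """Count several UTF-8 text files at once, keyed by path."""
    return {
        str(path): count_text(Path(path).read_text(encoding="utf-8"))
        for path in paths
    }
```

Calling `count_files` with a list of paths returns one (lines, words, characters) tuple per file.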


All default formatting in Word begins with styles. You can't get away from them; they are always there, even if you try to ignore them. If you change what is within the definition of a style, then you've changed the formatting applied across all paragraphs or characters that use that style. If you create new styles, you are creating new "default" formatting that can be applied to various elements of your document. If you try to ignore styles, then most, if not all, of your paragraphs use the Normal style.


That's it; that's how you stop Word from applying the explicit changes to the underlying style. Of course, if you've inadvertently changed styles earlier (because the Automatically Update check box was selected), then you'll need to go back and change the style definition so that text appears as you want it to. You'll also need to go through and perform these same steps on any other styles in the template or document.


Track Changes is a feature built into Microsoft Word that keeps track of all the edits made to your document and lets you make comments. When Track Changes is turned on, the edits you make are highlighted, appearing in different colors or styles to separate them from the original text.


There are two ways to remove a comment. Click on the comment you want to remove. If you want to keep the comment in the document for the time being, but want to indicate that it's already been addressed, click "Resolve" in the comment bubble. The comment will still be visible in the document's margin, but will now appear grayed out, distinguishing it from other comments.


The \w metacharacter matches a word character: a letter, a digit, or the underscore (_). In a Unicode-aware engine such as .NET, which PowerShell uses, letters include accented characters as well as a-z and A-Z. Here we use \W, which matches everything that is not a word character, to remove those characters. This works pretty well, but the underscore is left behind because it counts as a word character. The diacritic on the c is preserved.
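The behaviour can be reproduced in Python, whose `re` module also treats accented letters and the underscore as word characters for text strings. The sample string and function names below are made up for illustration.

```python
import re


def strip_non_word(text):
    """Remove every character that \\W matches (anything that is not a word character)."""
    return re.sub(r"\W", "", text)


def strip_non_word_and_underscore(text):
    """Same, but also remove the underscore by adding it to the character class."""
    return re.sub(r"[\W_]", "", text)


sample = "Fran_çois: 42!"
cleaned = strip_non_word(sample)                  # underscore and ç survive
stricter = strip_non_word_and_underscore(sample)  # only the ç survives
```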


In the following example, getElementsByTagName() starts from a particular parent element and searches top-down recursively through the DOM from that parent element, building a collection of all descendant elements which match the tag name parameter. This demonstrates both document.getElementsByTagName() and the functionally identical Element.getElementsByTagName(), which starts the search at a specific element within the DOM tree.
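The passage describes the browser DOM, but the same pair of methods exists in Python's `xml.dom.minidom`, which makes for a compact sketch of the document-level and element-level searches. The markup and the helper `text_of` are invented for this example.

```python
from xml.dom.minidom import parseString

doc = parseString(
    "<body>"
    "<div id='a'><p>one</p><p>two</p></div>"
    "<div id='b'><p>three</p></div>"
    "</body>"
)


def text_of(nodes):
    """Return the text content of each element's first child node."""
    return [node.firstChild.data for node in nodes]


# Document-level search: every <p> descendant in the whole tree.
all_paragraphs = doc.getElementsByTagName("p")

# Element-level search: only the <p> descendants of the first <div>.
first_div = doc.getElementsByTagName("div")[0]
div_paragraphs = first_div.getElementsByTagName("p")
```

Here `text_of(all_paragraphs)` covers all three paragraphs, while `text_of(div_paragraphs)` is limited to the two inside the first div.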




Gail has a document that is a book and it uses justified margins. Some of the hyphenation is really bad, so she's wondering if there is a way to look through the document and turn off hyphenation on individual words.


Let's take a look at how hyphenation works in Word. You can either automatically or manually hyphenate your document. Basically, you control hyphenation by displaying the Page Layout tab of the ribbon and then clicking the Hyphenation tool. You can then choose from three primary options: Automatic, Manual, or None. Most people (such as Gail) choose the Automatic option. This then adds hyphens automatically throughout the document.


If you don't want to hyphenate a particular word, you can use the Manual option to step through the entire document, but this can be rather tedious. (Plus, it isn't entirely clear that you will reliably be able to "skip" the words you want to skip.)


Note, of course, that this exclusion affects entire paragraphs, not individual words. The bottom line is that there is no way, that we've discovered, to limit the exclusion to individual words. Even creating a linked or character-based style is no help. Character styles don't let you modify the hyphenation setting at all, and linked styles only adjust hyphenation when applied to an entire paragraph.


Perhaps the best solution is to look outside of Word for a solution. You could import your Word document into a page layout program (such as InDesign) and make fairly easy work out of excluding individual words from hyphenation.

