[FEA]: docx improvements #316

SimJeg · 2025-01-08T08:19:26Z

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

Would be nice

Please provide a clear description of problem this feature solves

Hi,

A few improvements can be made to docx parsing in docxreader.py

improve management of bullet lists
fix bug with table_images
handle headers and footers

Describe the feature, and optionally a solution or implementation and any alternatives

improve management of bullet lists in apply_paragraph_style (some bullet lists are not detected by style.startswith("List")):

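        # Look for numbering properties directly, since some lists do not use a "List*" style
        try:
            numPr = paragraph._element.xpath("./w:pPr/w:numPr")[0]
            level = int(numPr.xpath("./w:ilvl/@w:val")[0])
        except (IndexError, ValueError):
            # No <w:numPr> (or no usable <w:ilvl>): not a numbered or bulleted paragraph
            numPr = None
            level = 0
        style = paragraph.style.name

        # Apply style
        if re.match(r"^Heading [1-9]$", style):
            n = int(style.split(" ")[-1])
            text = f"{'#' * n} {text}"
        elif style.startswith("List") or (numPr is not None):
            ...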
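To illustrate why checking `w:numPr` catches lists that the style name misses, here is a minimal standalone sketch. It uses only the standard library on hand-written WordprocessingML fragments rather than python-docx, and `list_level`, `is_list_paragraph`, `BULLET` and `PLAIN` are hypothetical names, not part of docxreader.py.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the "w:" prefix in the XPath above)
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def list_level(paragraph_xml):
    """Return the list level of a <w:p> element, or None if it has no <w:numPr>."""
    p = ET.fromstring(paragraph_xml)
    num_pr = p.find("./w:pPr/w:numPr", NS)
    if num_pr is None:
        return None
    ilvl = num_pr.find("./w:ilvl", NS)
    # Assumption: a missing <w:ilvl> is treated as level 0
    return int(ilvl.get(f"{{{W_NS}}}val", "0")) if ilvl is not None else 0


def is_list_paragraph(style_name, paragraph_xml):
    """Same test as the proposed elif: style name OR numbering properties."""
    return style_name.startswith("List") or list_level(paragraph_xml) is not None


# A bulleted paragraph that uses the "Normal" style, so the style check alone misses it
BULLET = (
    f'<w:p xmlns:w="{W_NS}"><w:pPr><w:pStyle w:val="Normal"/>'
    '<w:numPr><w:ilvl w:val="1"/><w:numId w:val="3"/></w:numPr></w:pPr>'
    "<w:r><w:t>item</w:t></w:r></w:p>"
)
PLAIN = f'<w:p xmlns:w="{W_NS}"><w:r><w:t>text</w:t></w:r></w:p>'

print(is_list_paragraph("Normal", BULLET), list_level(BULLET))
print(is_list_paragraph("Normal", PLAIN), list_level(PLAIN))
```

The level could then drive the indentation of the generated bullet, which is presumably what the elided branch does.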

table images are not well formatted: the per-cell image lists stay nested instead of being flattened. Use

from itertools import chain
...
        table_images = [cell_images for row in rows for _, cell_images in row]
        table_images = list(chain(*chain(*table_images)))

instead of

table_images = [image for row in rows for _, images in row for image in images]
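A small sketch of the difference, assuming each cell carries a list of image lists (for example one list per paragraph). That shape is a guess based on the double `chain` above, and the sample `rows` data is made up for illustration.

```python
from itertools import chain

# Hypothetical data: each row is a list of (cell_text, cell_images) pairs,
# and cell_images is assumed to be a list of lists of images.
rows = [
    [("a", [["img1"], ["img2", "img3"]]), ("b", [[]])],
    [("c", [["img4"]])],
]

# Current version: removes only one level of nesting
current = [image for row in rows for _, images in row for image in images]

# Proposed version: collect per-cell lists, then flatten two levels
table_images = [cell_images for row in rows for _, cell_images in row]
proposed = list(chain(*chain(*table_images)))

# Equivalent spelling that avoids argument unpacking
proposed_alt = list(chain.from_iterable(chain.from_iterable(table_images)))

print(current)
print(proposed)
```

Under this assumption the current comprehension returns a list of lists, while the proposed version returns a flat list of images.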

the current version does not handle headers and footers

To do so, you can replace line 219

    for c in paragraph.iter_inner_content():
        ...

by

        for section in self.document.sections:
            self.text += "".join([self.format_paragraph(p)[0] + "\n" for p in section.header.paragraphs])

            for c in section.iter_inner_content():
                ...

            self.text += "".join([self.format_paragraph(p)[0] + "\n" for p in section.footer.paragraphs])
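The intended ordering (header, then body, then footer, for each section) can be sketched without python-docx by using stand-in objects. `collect_text`, `fake_section` and `format_block` are hypothetical names, and the `(text, images)` return shape of `format_paragraph` is an assumption based on the `[0]` index above.

```python
from types import SimpleNamespace


def collect_text(sections, format_paragraph, format_block):
    """Concatenate header, body and footer text of each section, in reading order."""
    parts = []
    for section in sections:
        parts += [format_paragraph(p)[0] + "\n" for p in section.header.paragraphs]
        parts += [format_block(c) for c in section.iter_inner_content()]
        parts += [format_paragraph(p)[0] + "\n" for p in section.footer.paragraphs]
    return "".join(parts)


def fake_section(header, body, footer):
    """Stand-in exposing only the attributes the loop above relies on."""
    return SimpleNamespace(
        header=SimpleNamespace(paragraphs=header),
        footer=SimpleNamespace(paragraphs=footer),
        iter_inner_content=lambda: iter(body),
    )


sections = [fake_section(["Title"], ["Body 1", "Body 2"], ["Page 1"])]
text = collect_text(
    sections,
    format_paragraph=lambda p: (p, []),  # assumed (text, images) tuple
    format_block=lambda c: c + "\n",     # placeholder for the elided body handling
)
print(text)
```

With a multi-section document this repeats the header and footer text once per section, which may or may not be desirable for the reader's output.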

Additional context

No response


SimJeg added the "feature request" label on Jan 8, 2025
