[FEA]: docx improvements #316

Open
SimJeg opened this issue Jan 8, 2025 · 0 comments
Labels
feature request New feature or request

Comments

SimJeg commented Jan 8, 2025

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

Would be nice

Please provide a clear description of problem this feature solves

Hi,

A few improvements can be made to docx parsing in docxreader.py:

  1. improve management of bullet lists
  2. fix bug with table_images
  3. handle headers and footers

Describe the feature, and optionally a solution or implementation and any alternatives

  1. Improve management of bullet lists in apply_paragraph_style (some bullet lists are not detected by style.startswith("List")):

        # List paragraphs carry a w:numPr element; w:ilvl holds the nesting level
        try:
            numPr = paragraph._element.xpath("./w:pPr/w:numPr")[0]
            level = int(numPr.xpath("./w:ilvl/@w:val")[0])
        except Exception:
            numPr = None
            level = 0
        style = paragraph.style.name

        # Apply style
        if re.match(r"^Heading [1-9]$", style):
            n = int(style.split(" ")[-1])
            text = f"{'#' * n} {text}"
        elif style.startswith("List") or (numPr is not None):
            ...
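
To illustrate the numPr check outside of docxreader.py, here is a minimal sketch that uses only the standard library on a raw paragraph XML string. The helper name list_level and the two sample paragraphs are made up for this example, and treating a missing w:ilvl as level 0 is an assumption on my side, not something taken from the current code.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def list_level(paragraph_xml):
    """Return the list nesting level of a <w:p> element, or None if it is not a list item.

    Hypothetical helper for illustration only; it is not part of docxreader.py.
    """
    p = ET.fromstring(paragraph_xml)
    num_pr = p.find("./w:pPr/w:numPr", NS)
    if num_pr is None:
        return None  # no numbering properties: not a bullet or numbered item
    ilvl = num_pr.find("./w:ilvl", NS)
    if ilvl is None:
        return 0  # assumption: a missing level means top level
    return int(ilvl.get(f"{{{W_NS}}}val", "0"))


# A paragraph styled "Normal" that is still a list item because of w:numPr
LIST_PARAGRAPH = (
    f'<w:p xmlns:w="{W_NS}"><w:pPr><w:pStyle w:val="Normal"/>'
    '<w:numPr><w:ilvl w:val="1"/><w:numId w:val="3"/></w:numPr></w:pPr>'
    "<w:r><w:t>item</w:t></w:r></w:p>"
)
PLAIN_PARAGRAPH = f'<w:p xmlns:w="{W_NS}"><w:r><w:t>plain</w:t></w:r></w:p>'

print(list_level(LIST_PARAGRAPH))
print(list_level(PLAIN_PARAGRAPH))
```

The first sample is the case that style.startswith("List") would miss, since its style name is "Normal".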
  2. Table images are not well formatted. Use

        from itertools import chain
        ...
        table_images = [cell_images for row in rows for _, cell_images in row]
        table_images = list(chain(*chain(*table_images)))

     instead of

        table_images = [image for row in rows for _, images in row for image in images]
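
A small sketch of the difference between the two versions, on made-up data. It assumes each cell's images value is a list of lists (which is what the double chain suggests); the rows content and the function names are hypothetical, and chain.from_iterable is used here as an equivalent of chain(*...).

```python
from itertools import chain

# Hypothetical stand-in for `rows`: each row is a list of (cell_text, cell_images),
# and cell_images is assumed to be a list of lists of images.
rows = [
    [("a", [["img1"], ["img2", "img3"]]), ("b", [])],
    [("c", [["img4"]])],
]


def flatten_one_level(rows):
    # Current one-liner: removes only one level of nesting
    return [image for row in rows for _, images in row for image in images]


def flatten_two_levels(rows):
    # Proposed version: collect per-cell lists, then flatten twice
    table_images = [cell_images for row in rows for _, cell_images in row]
    return list(chain.from_iterable(chain.from_iterable(table_images)))


print(flatten_one_level(rows))   # still contains inner lists
print(flatten_two_levels(rows))  # flat list of images
```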
  3. The current version does not handle headers and footers.

To do so, you can replace line 219

    for c in paragraph.iter_inner_content():
        ...

by

        for section in self.document.sections:
            self.text += "".join([self.format_paragraph(p)[0] + "\n" for p in section.header.paragraphs])

            for c in section.iter_inner_content():
                ...

            self.text += "".join([self.format_paragraph(p)[0] + "\n" for p in section.footer.paragraphs])
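
One thing to watch with the loop above: as far as I understand python-docx, a header or footer that is linked to the previous section returns the inherited content, so the same header text could be emitted once per section. Below is a rough sketch of the same traversal that skips linked headers and footers. It runs on stand-in objects built with SimpleNamespace rather than a real document; para, part, section, format_paragraph and extract_text are all hypothetical names, and only the attribute names (header, footer, paragraphs, is_linked_to_previous, iter_inner_content) mirror python-docx.

```python
from types import SimpleNamespace


def para(text):
    # Stand-in for a python-docx Paragraph
    return SimpleNamespace(text=text)


def part(texts, linked=False):
    # Stand-in for a section header or footer
    return SimpleNamespace(
        paragraphs=[para(t) for t in texts],
        is_linked_to_previous=linked,
    )


def section(header, body, footer):
    # Stand-in for a Section; the body is reduced to paragraphs only (no tables)
    return SimpleNamespace(
        header=header,
        footer=footer,
        iter_inner_content=lambda: iter([para(t) for t in body]),
    )


def format_paragraph(p):
    # Hypothetical stand-in for self.format_paragraph: returns (text, images)
    return (p.text, [])


def extract_text(sections):
    chunks = []
    for s in sections:
        # Skip inherited headers so the same text is not repeated per section
        if not s.header.is_linked_to_previous:
            chunks += [format_paragraph(p)[0] + "\n" for p in s.header.paragraphs]
        for c in s.iter_inner_content():
            chunks.append(format_paragraph(c)[0] + "\n")
        if not s.footer.is_linked_to_previous:
            chunks += [format_paragraph(p)[0] + "\n" for p in s.footer.paragraphs]
    return "".join(chunks)


sections = [
    section(part(["H"]), ["a", "b"], part(["F"])),
    section(part(["H"], linked=True), ["c"], part(["F"], linked=True)),
]
print(repr(extract_text(sections)))
```

Whether repeated headers should be kept or dropped is a design choice; the sketch only shows where the check would go.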

Additional context

No response

@SimJeg SimJeg added the feature request New feature or request label Jan 8, 2025