Hacker Public Radio

HPR3596: Extracting text, tables and images from docx files using Python



Tools to extract data from docx files:
docx2txt
python-docx2txt
python-docx
Code Snippets
import docx2txt
# src is the path to the .docx file; images are saved into the img_dest directory
text = docx2txt.process(src, img_dest)
with open("data.txt", "wt") as f:
    f.write(text)
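For a look at where the images come from, here is a minimal sketch that uses only the standard library instead of docx2txt. It relies on the fact that a .docx file is a ZIP archive in which embedded images are usually stored under word/media/. The function name extract_docx_images is hypothetical and chosen for illustration; it is not part of docx2txt or python-docx.

```python
import zipfile
from pathlib import Path


def extract_docx_images(src, img_dest):
    """Copy files stored under word/media/ in the .docx archive into img_dest.

    Hypothetical helper for illustration; returns the list of written paths.
    """
    img_dest = Path(img_dest)
    img_dest.mkdir(parents=True, exist_ok=True)
    saved = []
    with zipfile.ZipFile(src) as archive:
        for name in archive.namelist():
            # Skip directory entries; keep only files in the media folder
            if name.startswith("word/media/") and not name.endswith("/"):
                target = img_dest / Path(name).name
                target.write_bytes(archive.read(name))
                saved.append(target)
    return saved
```

Called as extract_docx_images(src, img_dest), it should leave files such as image1.png in img_dest, though the exact names depend on how the document was produced.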
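import csv
import docx

document = docx.Document(src)
tables = document.tables

# Collect each table as a list of rows, each row a list of cell strings
data = []
for table in tables:
    table_data = []
    for row in table.rows:
        row_data = []
        for cell in row.cells:
            row_data.append(cell.text)
        table_data.append(row_data)
    data.append(table_data)

# Write each extracted table to its own CSV file (0.csv, 1.csv, ...)
for i, table_data in enumerate(data):
    with open(f"{i}.csv", "wt", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(table_data)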
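The CSV step can be tried on its own, without a Word document. The following is a minimal sketch, assuming the tables have already been extracted into nested lists of strings (one list of rows per table). The names write_tables_to_csv and out_dir are hypothetical and not part of python-docx.

```python
import csv
from pathlib import Path


def write_tables_to_csv(data, out_dir):
    """Write each table (a list of rows of strings) to <out_dir>/<index>.csv.

    Hypothetical helper for illustration; returns the list of written paths.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, table_data in enumerate(data):
        path = out_dir / f"{i}.csv"
        # newline="" lets the csv module control line endings itself
        with open(path, "w", newline="", encoding="utf-8") as f:
            csv.writer(f).writerows(table_data)
        paths.append(path)
    return paths
```

Passing the nested list built from document.tables together with an output directory should produce one numbered CSV file per table.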