Libraries to be used:
- For docx to text conversion, use docx2txt
- For JSON conversion, use the json library
- For storing values in a dictionary, use defaultdict() from collections
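As a quick sanity check on the dictionary choice, here is a minimal sketch showing that a defaultdict() created without a default factory serializes exactly like a regular dict, so a plain dict would work equally well here. The sample keys are borrowed from the example document; the variable names are only illustrative.

```python
import json
from collections import defaultdict

# defaultdict() with no factory: missing keys still raise KeyError,
# so it behaves like a regular dict for this use case
d = defaultdict()
d["Name"] = "ABC"
d["Place"] = "Maryland"

plain = {"Name": "ABC", "Place": "Maryland"}

# Both objects serialize to the same JSON text
as_json = json.dumps(d)
same = as_json == json.dumps(plain)
print(as_json)
```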
Steps
- Convert the document to a string using docx2txt
- Convert the string to a list of strings, split by the newline character, and remove the empty strings
- For each element in the list, split by "," and then by ":" to do the manipulations. To remove the leading number, slice with [2:]
- Store each key, value pair in a dictionary for each item in the list li
- Add the dict object to json_li
- Call json.dumps(json_li) to create the JSON string
Code
import docx2txt, json, collections

# step 1: get the docx text
text = docx2txt.process(r"F:\workspace\StackOverFlow\guac.docx")

# convert to a list of lines
li = text.split("\n")

# remove empty strings
li = list(filter(None, li))
print(li)

# json list
json_li = []

# convert and store all values
for x in li:
    x = x[2:]  # remove the leading "1.", "2.", "3.", ...
    y = x.split(",")
    print(y)
    d = collections.defaultdict()
    for m in y:
        z = m.split(":")
        z1 = [s.strip() for s in z]
        d[z1[0]] = z1[1]
    json_li.append(d)

# JSON conversion
print(json.dumps(json_li, indent=4))
Output
['1.Name: ABC, Place: Maryland, Country: US, PHONE NO.:1234567890', '2.Name: ABC, Place: Maryland, Country: US, PHONE NO.:1234567890', '3.Name: ABC, Place: Maryland, Country: US, PHONE NO.:1234567890']
['Name: ABC', ' Place: Maryland', ' Country: US', ' PHONE NO.:1234567890']
['Name: ABC', ' Place: Maryland', ' Country: US', ' PHONE NO.:1234567890']
['Name: ABC', ' Place: Maryland', ' Country: US', ' PHONE NO.:1234567890']
[
    {
        "Name": "ABC",
        "Place": "Maryland",
        "Country": "US",
        "PHONE NO.": "1234567890"
    },
    {
        "Name": "ABC",
        "Place": "Maryland",
        "Country": "US",
        "PHONE NO.": "1234567890"
    },
    {
        "Name": "ABC",
        "Place": "Maryland",
        "Country": "US",
        "PHONE NO.": "1234567890"
    }
]
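One caveat: slicing off the first two characters only works while the item numbers have a single digit. A possible variation, sketched below, strips a numeric prefix of any length with a regular expression and splits each field on the first colon only. The helper name parse_line and the sample line numbered 12 are assumptions made for illustration, not part of the original code.

```python
import json
import re


def parse_line(line):
    """Parse '12.Name: ABC, Place: X' into a dict (hypothetical helper)."""
    # Strip a leading item number of any length, e.g. "1." or "12."
    body = re.sub(r"^\s*\d+\.\s*", "", line)
    record = {}
    for field in body.split(","):
        # Split on the first colon only, so a value may itself contain ":"
        key, sep, value = field.partition(":")
        if sep:  # skip fragments that have no colon at all
            record[key.strip()] = value.strip()
    return record


sample = "12.Name: ABC, Place: Maryland, Country: US, PHONE NO.:1234567890"
parsed = parse_line(sample)
print(json.dumps(parsed, indent=4))
```

Because the pattern is anchored at the start of the line, the dot in "PHONE NO." is left alone.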
Update for .doc files
If you have a .doc file, use textract:
import textract
text = textract.process("path_to_file")
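As far as I know, textract.process returns bytes under Python 3, while the splitting steps expect a str, so the result likely needs decoding first. The sketch below uses a hypothetical helper, to_text, and a hard-coded bytes value standing in for the textract output so that it runs without textract installed. UTF-8 is assumed as the encoding.

```python
def to_text(raw, encoding="utf-8"):
    """Return a str whether the extractor gave bytes or str (hypothetical helper)."""
    if isinstance(raw, bytes):
        return raw.decode(encoding)
    return raw


# Stand-in for the value returned by textract.process("path_to_file")
raw = b"1.Name: ABC, Place: Maryland\n\n2.Name: ABC, Place: Maryland"

# Same list-building step as before: split on newlines, drop empty strings
li = [line for line in to_text(raw).split("\n") if line]
print(li)
```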