Question

I have a few Word files (.doc and .docx) containing data in the following form, and I need to convert them to JSON:

1.Name: ABC, Place: Maryland, Country: US, PHONE NO.:1234567890

2.Name: ABC, Place: Maryland, Country: US, PHONE NO.:1234567890

3.Name: ABC, Place: Maryland, Country: US, PHONE NO.:1234567890

What is the easiest way to do this in Python?

Answer 1

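There is a python-docx package (note that it explicitly does not support old-style .doc documents), and you can use it to read the paragraphs and then take the row index from the first part of each line.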

from docx import Document
import json

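document = Document("existing-document-file.docx")
# skip empty paragraphs so int() below never sees an empty string
lines = [para.text for para in document.paragraphs if para.text.strip()]
# split "1.Name: ABC, ..." into the row number and the rest of the line
lines = [line.partition(".") for line in lines]
lines = [(int(row_num), row_text) for row_num, _, row_text in lines]
# split the rest into (key, ":", value) triples
lines = [(n, [txt.partition(":") for txt in row_text.split(",")]) for n, row_text in lines]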
lines = {n: {key.strip(): val.strip() for key, _, val in row} for n, row in lines}
json_result = json.dumps(lines)

With your sample input, I get the following output using this code:

 {"1": {"Name": "ABC", "Place": "Maryland", "Country": "US", "PHONE NO.": "1234567890"},
"2": {"Name": "ABC", "Place": "Maryland", "Country": "US", "PHONE NO.": "1234567890"},
"3": {"Name": "ABC", "Place": "Maryland", "Country": "US", "PHONE NO.": "1234567890"}}

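If the numbering goes past 9 or the document contains unnumbered paragraphs, a slightly more defensive parser may help. This is only a sketch that works on a plain list of strings (for example the para.text values); ROW_RE and parse_rows are hypothetical names, not part of python-docx.

```python
import json
import re

# Hypothetical pattern: a leading number, a period, then the rest of the line.
ROW_RE = re.compile(r"^\s*(\d+)\.\s*(.*)$")

def parse_rows(lines):
    """Turn numbered 'key: value, key: value' lines into {number: {key: value}}."""
    result = {}
    for line in lines:
        match = ROW_RE.match(line)
        if not match:  # skip blank or unnumbered paragraphs
            continue
        row_num, row_text = match.groups()
        fields = (item.partition(":") for item in row_text.split(","))
        result[int(row_num)] = {key.strip(): val.strip() for key, _, val in fields}
    return result

sample = [
    "1.Name: ABC, Place: Maryland, Country: US, PHONE NO.:1234567890",
    "",
    "10.Name: XYZ, Place: Maryland, Country: US, PHONE NO.:1234567890",
]
rows = parse_rows(sample)
print(json.dumps(rows))
```

Note that this still assumes no value contains a comma or a colon of its own.
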
Answer 2

Libraries to be used:

For docx to text conversion, use docx2txt
For JSON conversion, use the json library
For storing values in a dictionary, use defaultdict() from collections

Steps

Convert the document to a string using docx2txt
Convert the string to a list of strings, splitting on the newline character and removing empty entries
For each element in the list, split on "," and then on ":" to separate keys from values. To remove the leading number, slice with [2:]
Store each key-value pair of an item in a dictionary
Append the dict object to json_li
Call json.dumps(json_li) to create the JSON string

Code

import docx2txt, json, collections
# step 1 get docx text
text = docx2txt.process("F:workspaceStackOverFlowguac.docx")
# convert to list
li = text.split("\n")
# remove '' entries, i.e. empty strings
li = list(filter(None, li))
print(li)
# json list
json_li = []
# convert and store all values
for x in li:
    x = x[2:] # remove 1. 2. 3. ...
    y = x.split(",")  # one "key: value" chunk per field
    print(y)
    d = collections.defaultdict()
    for m in y:
        z = m.split(":")  # [key, value]
        z1 = [x.strip() for x in z]
        d[z1[0]] = z1[1]
    json_li.append(d)
# JSON conversion
print(json.dumps(json_li, indent=4))

Output

['1.Name: ABC, Place: Maryland, Country: US, PHONE NO.:1234567890', '2.Name: ABC, Place: Maryland, Country: US, PHONE NO.:1234567890', '3.Name: ABC, Place: Maryland, Country: US, PHONE NO.:1234567890']
['Name: ABC', ' Place: Maryland', ' Country: US', ' PHONE NO.:1234567890']
['Name: ABC', ' Place: Maryland', ' Country: US', ' PHONE NO.:1234567890']
['Name: ABC', ' Place: Maryland', ' Country: US', ' PHONE NO.:1234567890']
[
    {
        "Name": "ABC",
        "Place": "Maryland",
        "Country": "US",
        "PHONE NO.": "1234567890"
    },
    {
        "Name": "ABC",
        "Place": "Maryland",
        "Country": "US",
        "PHONE NO.": "1234567890"
    },
    {
        "Name": "ABC",
        "Place": "Maryland",
        "Country": "US",
        "PHONE NO.": "1234567890"
    }
]
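
One caveat: slicing with [2:] only removes the numbering while it is a single digit followed by a period. Something like the following could handle longer numbers; strip_number is a hypothetical helper name and this is just a sketch.

```python
def strip_number(line):
    """Drop a leading '<digits>.' prefix, whatever its length."""
    prefix, dot, rest = line.partition(".")
    # Only treat the prefix as numbering if it consists purely of digits.
    return rest if dot and prefix.strip().isdigit() else line

print(strip_number("3.Name: ABC"))
print(strip_number("12.Name: ABC"))
```

In the loop it would stand in for the x[2:] slice, i.e. x = strip_number(x).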

Update on doc file

If you have a .doc file, use textract:

import textract
text = textract.process("path_to_file")
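
As far as I know, textract.process returns bytes rather than str on Python 3, so the result probably needs decoding before the split/filter steps above. A rough sketch, assuming UTF-8 output; text_to_lines is a hypothetical helper and the raw value below is a made-up stand-in, not real textract output.

```python
def text_to_lines(raw, encoding="utf-8"):
    """Decode extracted bytes (if needed) and return the non-empty, stripped lines."""
    text = raw.decode(encoding) if isinstance(raw, bytes) else raw
    return [line.strip() for line in text.splitlines() if line.strip()]

# Stand-in for the bytes an extraction call might return (assumed for illustration).
raw = b"1.Name: ABC, Place: Maryland\n\n2.Name: ABC, Place: Maryland\n"
lines = text_to_lines(raw)
print(lines)
```

The resulting list can then go through the same loop as li in the code above.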

Answer 3

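There is no library that does this directly. The easiest way is to convert the file to CSV (either by removing all the commas and then replacing the whitespace with commas, or by using a program where possible).

Then you can use the DictReader class from the csv package to convert the file into a dictionary, and then use the json module to dump it as JSON.

Snippet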

import json

from csv import DictReader

COLUMN_NAMES = ["your", "column", "names", "..."]
# Or the first row will be used as the column names
# (and as the resulting keys in the dictionary)
jsonCollection = {}
with open("your_csv_file.csv") as csvFile:
    #fieldnames is optional here
    reader = DictReader(csvFile, fieldnames=COLUMN_NAMES)
    for row in reader:
        for colName, rowVal in row.items():
            jsonCollection.setdefault(colName, []).append(rowVal)

json.dumps(jsonCollection) #should get you what you want
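
For the sample data in the question, the whole CSV route might look roughly like this. It is a sketch working on in-memory strings instead of a real file; to_csv_text and raw_lines are hypothetical names, and the column names are assumed from the keys in the sample line.

```python
import csv
import io
import json

raw_lines = [
    "1.Name: ABC, Place: Maryland, Country: US, PHONE NO.:1234567890",
    "2.Name: ABC, Place: Maryland, Country: US, PHONE NO.:1234567890",
]

# Assumed column names, taken from the keys in the sample line.
COLUMN_NAMES = ["Name", "Place", "Country", "PHONE NO."]

def to_csv_text(lines):
    """Keep only the value of each 'key: value' pair and write the values as CSV rows."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for line in lines:
        body = line.partition(".")[2]  # drop the leading "1." numbering
        writer.writerow([item.partition(":")[2].strip() for item in body.split(",")])
    return buffer.getvalue()

csv_text = to_csv_text(raw_lines)
reader = csv.DictReader(io.StringIO(csv_text), fieldnames=COLUMN_NAMES)
records = [dict(row) for row in reader]
print(json.dumps(records))
```

Unlike the snippet above, which collects one list of values per column, this builds one dictionary per row; which shape is preferable depends on how the JSON will be consumed.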
