Word file to json in python

I have a few Word files (.doc and .docx) containing data in the following form, and I need to convert them to JSON:

1.Name: ABC, Place: Maryland, Country: US, PHONE NO.:1234567890

2.Name: ABC, Place: Maryland, Country: US, PHONE NO.:1234567890

3.Name: ABC, Place: Maryland, Country: US, PHONE NO.:1234567890

What is the easiest way to do this in Python?

Answers

Libraries to be used:

  • For docx to text conversion, use docx2txt
  • For JSON conversion, use the json library
  • For storing values in a dictionary, use defaultdict() from collections

Steps

  1. Convert the document to a string using docx2txt
  2. Split the string on the newline character into a list of strings and drop the empty entries
  3. For each element in the list, split on "," and then on ":" to get the key and value. To remove the leading number, slice with [2:]
  4. Store each key/value pair of the item in a dictionary
  5. Append the dict object to json_li
  6. Call json.dumps(json_li) to create the JSON string

Code

import collections
import json

import docx2txt

# Step 1: extract the text of the .docx file
text = docx2txt.process(r"F:\workspace\StackOverFlow\guac.docx")
# Step 2: split the text into a list of lines
li = text.split("\n")
# drop the empty strings left by blank lines
li = list(filter(None, li))
print(li)
# list of records to serialize as JSON
json_li = []
# Steps 3-5: convert each line and store its values
for x in li:
    x = x[2:]  # remove the leading "1.", "2.", "3.", ... (assumes single-digit numbering)
    y = x.split(",")
    print(y)
    d = collections.defaultdict()  # no default factory, so it behaves like a plain dict
    for m in y:
        z = m.split(":")
        z1 = [s.strip() for s in z]
        d[z1[0]] = z1[1]
    json_li.append(d)
# Step 6: JSON conversion
print(json.dumps(json_li, indent=4))

Output

['1.Name: ABC, Place: Maryland, Country: US, PHONE NO.:1234567890', '2.Name: ABC, Place: Maryland, Country: US, PHONE NO.:1234567890', '3.Name: ABC, Place: Maryland, Country: US, PHONE NO.:1234567890']
['Name: ABC', ' Place: Maryland', ' Country: US', ' PHONE NO.:1234567890']
['Name: ABC', ' Place: Maryland', ' Country: US', ' PHONE NO.:1234567890']
['Name: ABC', ' Place: Maryland', ' Country: US', ' PHONE NO.:1234567890']
[
    {
        "Name": "ABC",
        "Place": "Maryland",
        "Country": "US",
        "PHONE NO.": "1234567890"
    },
    {
        "Name": "ABC",
        "Place": "Maryland",
        "Country": "US",
        "PHONE NO.": "1234567890"
    },
    {
        "Name": "ABC",
        "Place": "Maryland",
        "Country": "US",
        "PHONE NO.": "1234567890"
    }
]
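
The slice x[2:] in the code above only strips single-digit numbering, and split(":") would break on a value that itself contains a colon. A minimal sketch of a more tolerant parser follows, assuming the same "N.Key: value, Key: value" line layout; the names LEADING_NUMBER, parse_record and lines_to_json are made up for illustration and are not part of the original answer.

```python
import json
import re

# Matches a leading item number such as "1." or "12." (hypothetical helper pattern).
LEADING_NUMBER = re.compile(r"^\s*\d+\.\s*")


def parse_record(line):
    """Turn '1.Name: ABC, Place: Maryland' into {'Name': 'ABC', 'Place': 'Maryland'}."""
    body = LEADING_NUMBER.sub("", line)
    record = {}
    for field in body.split(","):
        # partition splits on the first colon only, so values may contain colons
        key, sep, value = field.partition(":")
        if sep:  # skip fragments that have no colon at all
            record[key.strip()] = value.strip()
    return record


def lines_to_json(text):
    """Parse every non-blank line of text and return a JSON array string."""
    records = [parse_record(line) for line in text.splitlines() if line.strip()]
    return json.dumps(records, indent=4)


sample = (
    "1.Name: ABC, Place: Maryland, Country: US, PHONE NO.:1234567890\n"
    "\n"
    "12.Name: ABC, Place: Maryland, Country: US, PHONE NO.:1234567890\n"
)
print(lines_to_json(sample))
```

The text passed to lines_to_json would be whatever docx2txt.process returns; values that contain commas would still need a different delimiter strategy.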

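Update for .doc files

For .doc files, textract can be used to extract the text:

import textract
# textract.process returns bytes in Python 3, so decode it to get a str
text = textract.process("path_to_file").decode("utf-8")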
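
Since the question mentions both .doc and .docx files, the two extractors can be put behind one function. This is only a sketch: extract_text is a hypothetical helper name, the third-party imports are done lazily so that only the library actually needed has to be installed, and it assumes textract.process returns UTF-8 encoded bytes.

```python
import os


def extract_text(path):
    """Return the text of a .docx or .doc file as a str (hypothetical helper)."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".docx":
        import docx2txt  # third-party, imported only when needed

        return docx2txt.process(path)
    if ext == ".doc":
        import textract  # third-party, imported only when needed

        # Assumption: textract.process returns UTF-8 bytes, so decode to str.
        return textract.process(path).decode("utf-8")
    raise ValueError(f"unsupported file type: {ext!r}")
```

The returned string can then go through the same split-and-parse steps as in the first code block.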

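There is no library that does this directly. The easiest way is to convert the file to CSV (either by removing all the commas and then replacing the whitespace with commas, or programmatically if possible).

Then you can use the DictReader class from the csv package to read the file into dictionaries, and use the json module to dump them as JSON.

Snippet: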

import json
from csv import DictReader

# Or omit this and the first row will supply the column names
# (and therefore the keys of the resulting dictionary)
COLUMN_NAMES = ["your", "column", "names"]  # ... and so on

jsonCollection = {}
with open("your_csv_file.csv", newline="") as csvFile:
    # fieldnames is optional here
    reader = DictReader(csvFile, fieldnames=COLUMN_NAMES)
    for row in reader:
        # group the values by column name
        for colName, rowVal in row.items():
            jsonCollection.setdefault(colName, []).append(rowVal)

print(json.dumps(jsonCollection))  # should get you what you want
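
The snippet above groups values by column, giving one list per key. If one JSON object per row is wanted instead, as in the first answer, a sketch along these lines may help. The CSV content here is made-up sample data standing in for the converted Word file, and csv_to_json_records is a hypothetical name.

```python
import csv
import io
import json

# Made-up CSV content standing in for the converted Word data.
csv_text = (
    "Name,Place,Country,PHONE NO.\n"
    "ABC,Maryland,US,1234567890\n"
    "ABC,Maryland,US,1234567890\n"
)


def csv_to_json_records(csv_file):
    """Read an open CSV file object and return a JSON array with one object per row."""
    reader = csv.DictReader(csv_file)  # the first row supplies the keys
    return json.dumps([dict(row) for row in reader], indent=4)


print(csv_to_json_records(io.StringIO(csv_text)))
```

With a real file, the io.StringIO object would be replaced by open("your_csv_file.csv", newline="").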



