English 中文(简体)
Excel Power Query not extracting some tables from PDF properly
原标题:
  • 时间:2023-08-12 02:49:31
  •  标签:
  • powerquery

The power query code shown in the link below helped me to pick some particular tables from PDF and format them properly.

Import Various Tables with Different Formats from PDF into Excel

The problem I have now is that some new PDF files (sample shown in the attached PDF, tables 4 and Table 5 on the 3rd page) that came into the directory looks exactly like the previous tables with the same format but the results that Power Query extracted are not correct. The actual results are shown in the excel sheet in green and the wrong numbers are in red. I have attached the sample files in the link below:

https://www.excelforum.com/excel-programming-vba-macros/1410321-excel-power-query-not-extracting-some-tables-from-pdf-properly.html#post5858939

I could not understand the reason why the code is not picking the correct values since the pdfs are of similar formatting.

I modified the code to work for me perfectly, until these two new tables (shown in the attached PDF--Tables 4 and Table 5 on the 3rd page) came into the folder.

How can I modify this code so that it can read the attached PDFs in Tables 4 and Tables 5 on the 3rd page properly so that it can give me the results in Green as shown in the excel file?

Below is the code:

let 

group= (filename)=>
//open each file and pull out all the tables
let Source = Pdf.Tables(File.Contents(filename), [Implementation="1.3"]),
List = List.Union(List.Transform(Source[Data], each Table.ColumnNames(_))), 
#"Expanded Data" = Table.ExpandTableColumn(Source, "Data", List,List),
#"Filtered Rows" = Table.SelectRows(#"Expanded Data", each Text.StartsWith([Name], "Table")),
#"Grouped Rows" = Table.Group(#"Filtered Rows", {"Name"}, {{"data", each _, type table}}),
#"Added Custom" = Table.AddColumn(#"Grouped Rows", "Custom", each split([data]))
in #"Added Custom",

//process each table
split= (variable)=>  let
#"Filtered Rows1" = Table.SelectRows(variable,  each ([Column1] <> "" and [Column1]<>null and not Text.Contains([Column1], "Final") and [Column2] <> "" and [Column2]<>null)),
#"Promoted Headers" = Table.PromoteHeaders(#"Filtered Rows1", [PromoteAllScalars=true]),
#"Removed Columns" = Table.RemoveColumns(#"Promoted Headers",List.FirstN(Table.ColumnNames(#"Promoted Headers"),3)), 
Names=List.RemoveNulls(List.Transform(Table.ColumnNames(#"Removed Columns"), each  if Text.Contains(_,"Volume") or Text.Contains(_,"Unit") then _ else null)),
#"Removed Other Columns" = Table.SelectColumns(#"Removed Columns",Names),
FirstName=List.First(Table.ColumnNames(#"Removed Other Columns")),
#"Unpivoted Other Columns" = Table.UnpivotOtherColumns(#"Removed Other Columns",{FirstName} , "Attribute", "Value"),
#"Renamed Columns" = Table.RenameColumns(#"Unpivoted Other Columns",{{FirstName, "Category"}}),
Transform = Table.TransformColumns(#"Renamed Columns",{{"Attribute",each Text.BeforeDelimiter(_,"_"), type text}}),
//Dates
RepeatCount=Table.RowCount(#"Removed Columns"),
Dates = Record.ToList(variable{1}),
b = List.RemoveNulls(List.Transform(Dates, each try if Text.Contains (_,"(") and not Text.Contains (_,"Table")  then Text.BeforeDelimiter(_, "(")  else null otherwise null)),
a={List.Repeat(b,RepeatCount + 2)},
//Distance
#"Promoted Headers5" = Table.PromoteHeaders(variable, [PromoteAllScalars=true]),
Distance = List.Repeat(List.RemoveNulls(List.Transform(Table.ColumnNames(#"Promoted Headers5"), each try if Text.Contains(_,"Distance") then Number.From(Text.Select(_,{"0".."9"})) else null otherwise null)),List.Count(Names)-1),DistanceTable= Table.FromList(Distance, Splitter.SplitByNothing(), {"Value"}, null, ExtraValues.Error),
#"Added Custom3" = Table.AddColumn(DistanceTable, "Category", each "Depth_MD"),
//Total
#"Filtered Rows5" = Table.SelectRows(variable, each (Text.Contains([Column1],"Final"))){0},
r=Record.ToList(#"Filtered Rows5"),
s=if List.PositionOf(r,"Final") = -1 then List.PositionOf(r,"Pre-flow Final") else  List.PositionOf(r,"Final"),
rr=List.Transform(r, each if _="" then null else _),
t=List.RemoveNulls(List.RemoveFirstN(rr,s+1)),
TotalTable= Table.FromList(t, Splitter.SplitByNothing(), {"Value"}, null, ExtraValues.Error),
#"Added Custom4" = Table.AddColumn(TotalTable, "Category", each "Final"),
//combine it all
h = Table.FromColumns(Table.ToColumns(Transform&#"Added Custom3"& #"Added Custom4")&a),
#"Removed Columns1" = Table.RemoveColumns(h,{"Column2"}),
#"Changed Type" = Table.TransformColumnTypes(#"Removed Columns1",{{"Column1", type text}, {"Column3", type number}, {"Column4", type text}}),
#"Pivoted Column" = Table.Pivot(#"Changed Type", List.Distinct(#"Changed Type"[Column1]), "Column1", "Column3", List.Sum)
in #"Pivoted Column",
Source2 = Folder.Files("c:	emp5"),
#"Filtered Rows" = Table.SelectRows(Source2, each ([Extension] = ".pdf")),
#"Added Custom" = Table.AddColumn(#"Filtered Rows", "Data", each group([Folder Path]&[Name])),
#"Renamed Columns" = Table.RenameColumns(#"Added Custom",{{"Name", "zName"}}),
#"Expanded Data" = Table.ExpandTableColumn(#"Renamed Columns", "Data",{"Name","Custom"},{"Name","Custom"}),
#"Removed Errors" = Table.RemoveRowsWithErrors(#"Expanded Data", {"Custom"}),
List = List.Union(List.Transform(#"Removed Errors"[Custom], each Table.ColumnNames(_))), 
#"Expanded Data2" = Table.ExpandTableColumn(#"Removed Errors", "Custom", List,List),
    #"Changed Type" = Table.TransformColumnTypes(#"Expanded Data2",{{"Column4", type text}}),
    #"Filtered Rows1" = Table.SelectRows(#"Changed Type", each ([Column4] <> "Combined ")),
    #"Changed Type1" = Table.TransformColumnTypes(#"Filtered Rows1",{{"Column4", type date}}),
    #"Renamed Columns1" = Table.RenameColumns(#"Changed Type1",{{"Column4", "Dates in Report"}}),
    #"Filtered Rows2" = Table.SelectRows(#"Renamed Columns1", each ([Dates in Report] <> null))
in
    #"Filtered Rows2"
问题回答

暂无回答




相关问题
PowerQuery Custom Function: column value in set of numbers

I am brand new to PowerQuery. I want to make a custom column with a value of 1 if the column value is in a set of numbers (e.g., 3,6,9,14) and 0 otherwise. Besides a big if statement for all the ...

Issue Looping Through an entire list in power query

I am pulling data from the internet right now and formatting it into a readable list using power query, but I am running into an issue. When I split the columns into rows (as I have to change the data ...

Power BI Subtotal Measure

I need your help with some Power Bi code As you can see, in the Excel Worksheet i have 2 cells (yellow) that are mainly of subtotals and 2 columns (last 2) that depends of these subtotals, however ...

热门标签