How to Use Beautiful Soup to Scrape SEC's Edgar Database and Receive Desired Data

Apologies in advance for the long question; I am new to Python and am trying to be as explicit as I can about a fairly specific situation.

I am trying to routinely pull specific data points from SEC filings, but I want to automate the process rather than manually searching by a company's CIK ID and form filing. So far, I have gotten to the point where I can download the metadata for all filings the SEC received in a given period. It looks like this:

   index  cik      conm                    type  date        path
0  0      1000045  NICHOLAS FINANCIAL INC  10-Q  2019-02-14  edgar/data/1000045/0001193125-19-039489.txt
1  1      1000045  NICHOLAS FINANCIAL INC  4     2019-01-15  edgar/data/1000045/0001357521-19-000001.txt
2  2      1000045  NICHOLAS FINANCIAL INC  4     2019-02-19  edgar/data/1000045/0001357521-19-000002.txt
3  3      1000045  NICHOLAS FINANCIAL INC  4     2019-03-15  edgar/data/1000045/0001357521-19-000003.txt
4  4      1000045  NICHOLAS FINANCIAL INC  8-K   2019-02-01  edgar/data/1000045/0001193125-19-024617.txt
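
For reference, a metadata table like the one above can be built from EDGAR's quarterly master index; a minimal sketch, assuming the pipe-delimited master.idx layout (the number of preamble lines to skip and the User-Agent value may need adjusting):

import io
import pandas as pd
import requests

# EDGAR publishes a pipe-delimited master index of all filings per quarter
idx_url = 'https://www.sec.gov/Archives/edgar/full-index/2019/QTR1/master.idx'
resp = requests.get(idx_url, headers={'User-Agent': 'your-name your-email@example.com'})

# Skip the preamble above the CIK|Company Name|Form Type|Date Filed|Filename
# header; the exact number of lines may need adjusting
sec = pd.read_csv(io.StringIO(resp.text), sep='|', skiprows=11,
                  names=['cik', 'conm', 'type', 'date', 'path'])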

Despite having all of this information, and despite being able to download these text files and see the underlying data, I am unable to parse the data because it is in XBRL format and a bit too far out of my wheelhouse. Instead I found this script (from this site: https://www.codeproject.com/artss/122765/Parsing-XBRL-with-Python):

from bs4 import BeautifulSoup
import requests
import sys

# Access page
cik = '0000051143'
type = '10-K'
dateb = '20160101'

# Obtain HTML for search page
base_url = "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={}&type={}&dateb={}"
edgar_resp = requests.get(base_url.format(cik, type, dateb))
edgar_str = edgar_resp.text

# Find the document link
doc_link = ''
soup = BeautifulSoup(edgar_str, 'html.parser')
table_tag = soup.find('table', class_='tableFile2')
rows = table_tag.find_all('tr')
for row in rows:
    cells = row.find_all('td')
    if len(cells) > 3:
        if '2015' in cells[3].text:
            doc_link = 'https://www.sec.gov' + cells[1].a['href']

# Exit if document link couldn't be found
if doc_link == '':
    print("Couldn't find the document link")
    sys.exit()

# Obtain HTML for document page
doc_resp = requests.get(doc_link)
doc_str = doc_resp.text

# Find the XBRL link
xbrl_link = ''
soup = BeautifulSoup(doc_str, 'html.parser')
table_tag = soup.find('table', class_='tableFile', summary='Data Files')
rows = table_tag.find_all('tr')
for row in rows:
    cells = row.find_all('td')
    if len(cells) > 3:
        if 'INS' in cells[3].text:
            xbrl_link = 'https://www.sec.gov' + cells[2].a['href']

# Obtain XBRL text from document
xbrl_resp = requests.get(xbrl_link)
xbrl_str = xbrl_resp.text

# Find and print stockholder's equity
soup = BeautifulSoup(xbrl_str, 'lxml')
tag_list = soup.find_all()
for tag in tag_list:
    if tag.name == 'us-gaap:stockholdersequity':
        print("Stockholder's equity: " + tag.text)

When I run this script on its own, I love it. It returns the stockholders' equity for a given company (in this case IBM), and I can then write that value to an Excel file.
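
For the Excel step mentioned above, a minimal sketch using pandas (the file name and column labels are assumptions, and to_excel needs openpyxl installed):

import pandas as pd

# Dummy row standing in for the values printed by the script above
rows = [('0000051143', '10-K', '20160101', 'stockholders-equity-value')]
df = pd.DataFrame(rows, columns=['cik', 'type', 'dateb', 'stockholders_equity'])
df.to_excel('equity_values.xlsx', index=False)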

My question has two parts:

  1. I took the three relevant columns (CIK, type, and date) from my original metadata table above and wrote them to a list of tuples - I think that's what it's called - which looks like this: [('1009759', 'D', '20190215'), ('1009891', 'D', '20190206'), ...]. How do I take this data, substitute it into the initial part of the script I found, and loop through it efficiently so I end up with a list of the desired values for each company, filing, and date?
  2. Is there generally a better way to do this? I would think there would be some sort of API or Python package for querying the data I am interested in. I know there is some high-level information out there for Form 10-Ks and Form 10-Qs, but I am working with Form Ds, which is somewhat obscure. I just want to make sure I am spending my time effectively on the best possible solution.

Thanks for your help!

Answers

You need to define a function that contains essentially the bulk of the code you posted and takes three keyword arguments (your three values). Then, instead of hard-coding the three values in the script, you just pass them in and return the result.

Then take the list you created and write a simple for-loop around it that calls the function you defined with those three values, and then do something with the result.

def get_data(value1, value2, value3):
    # your main code here, but with the hard-coded cik/type/dateb
    # replaced by the three arguments above
    return content

for value1, value2, value3 in companies:
    content = get_data(value1, value2, value3)
    # do something with content
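
To make that concrete, here is one possible shape for that function, reusing the scraping steps from the script in the question; the function name, the year filter argument, and returning only the first matching tag are assumptions, not part of the original answer:

import requests
from bs4 import BeautifulSoup

def get_stockholders_equity(cik, form_type, dateb, year='2015'):
    # Search EDGAR for the company's filings of the given type
    base_url = ("https://www.sec.gov/cgi-bin/browse-edgar?"
                "action=getcompany&CIK={}&type={}&dateb={}")
    edgar_str = requests.get(base_url.format(cik, form_type, dateb)).text

    # Locate the filing's document page in the search results table
    soup = BeautifulSoup(edgar_str, 'html.parser')
    doc_link = ''
    for row in soup.find('table', class_='tableFile2').find_all('tr'):
        cells = row.find_all('td')
        if len(cells) > 3 and year in cells[3].text:
            doc_link = 'https://www.sec.gov' + cells[1].a['href']
    if not doc_link:
        return None

    # Locate the XBRL instance document on the filing page
    soup = BeautifulSoup(requests.get(doc_link).text, 'html.parser')
    xbrl_link = ''
    for row in soup.find('table', class_='tableFile', summary='Data Files').find_all('tr'):
        cells = row.find_all('td')
        if len(cells) > 3 and 'INS' in cells[3].text:
            xbrl_link = 'https://www.sec.gov' + cells[2].a['href']
    if not xbrl_link:
        return None

    # Pull the first tagged stockholders' equity value out of the XBRL
    soup = BeautifulSoup(requests.get(xbrl_link).text, 'lxml')
    tag = soup.find('us-gaap:stockholdersequity')
    return tag.text if tag else None

# Loop over the (cik, type, dateb) tuples from the question
filings = [('1009759', 'D', '20190215'), ('1009891', 'D', '20190206')]
results = [(c, t, d, get_stockholders_equity(c, t, d)) for c, t, d in filings]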

Assuming you have a DataFrame sec containing the filing list above with correctly named columns, you first need to extract the relevant information from the DataFrame into three lists:

cik = list(sec['cik'].values)
dat = list(sec['date'].values)
typ = list(sec['type'].values)

Then create your base_url with those items inserted and fetch your data:

for c, t, d in zip(cik, typ, dat):
  base_url = f"https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={c}&type={t}&dateb={d}"
  edgar_resp = requests.get(base_url)

And go from there.
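
For example, one way to continue from that loop is to collect the parsed values back into a DataFrame; parse_filing below is a hypothetical placeholder for whatever parsing you do with each response, and the column names are assumptions:

import pandas as pd
import requests

# Continues from the cik/typ/dat lists extracted above
records = []
for c, t, d in zip(cik, typ, dat):
    base_url = f"https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={c}&type={t}&dateb={d}"
    edgar_resp = requests.get(base_url)
    value = parse_filing(edgar_resp.text)   # placeholder for your own parsing logic
    records.append({'cik': c, 'type': t, 'date': d, 'value': value})

results = pd.DataFrame(records)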




