Apologies in advance for the long question; I'm new to Python, and I'll try to lay out my fairly specific situation as clearly as I can.
I routinely pull specific data points from SEC filings, and I want to automate that instead of manually searching for a company's CIK ID and form filings. So far, I have gotten to the point where I can download the metadata for every filing the SEC received over a given period. It looks like this:
   index      cik                    conm  type        date                                         path
0      0  1000045  NICHOLAS FINANCIAL INC  10-Q  2019-02-14  edgar/data/1000045/0001193125-19-039489.txt
1      1  1000045  NICHOLAS FINANCIAL INC     4  2019-01-15  edgar/data/1000045/0001357521-19-000001.txt
2      2  1000045  NICHOLAS FINANCIAL INC     4  2019-02-19  edgar/data/1000045/0001357521-19-000002.txt
3      3  1000045  NICHOLAS FINANCIAL INC     4  2019-03-15  edgar/data/1000045/0001357521-19-000003.txt
4      4  1000045  NICHOLAS FINANCIAL INC   8-K  2019-02-01  edgar/data/1000045/0001193125-19-024617.txt
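For reference, here is a minimal sketch of how a table like this can be built, assuming EDGAR's quarterly master index files (the User-Agent string is a placeholder, and the exact source I used may differ):

import requests
import pandas as pd
from io import StringIO

# EDGAR publishes one pipe-delimited "master" index per quarter.
url = "https://www.sec.gov/Archives/edgar/full-index/2019/QTR1/master.idx"
resp = requests.get(url, headers={"User-Agent": "your-name your-email@example.com"})

# Skip the preamble; the records start after a line of dashes.
lines = resp.text.splitlines()
start = next(i for i, line in enumerate(lines) if line.startswith("-----")) + 1

df = pd.read_csv(StringIO("\n".join(lines[start:])), sep="|",
                 names=["cik", "conm", "type", "date", "path"])
print(df.head())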
Despite having all this information, and being able to download the text files and see the underlying data, I am unable to parse it because it is in XBRL format, which is a bit outside my wheelhouse. Instead, I found this script (see this site):
from bs4 import BeautifulSoup
import requests
import sys

# Search parameters (this example is IBM's 10-K filings before 2016)
cik = '0000051143'
type = '10-K'
dateb = '20160101'

# Obtain HTML for the EDGAR company search page
base_url = "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={}&type={}&dateb={}"
edgar_resp = requests.get(base_url.format(cik, type, dateb))
edgar_str = edgar_resp.text

# Find the link to the filing's document index page
doc_link = ''
soup = BeautifulSoup(edgar_str, 'html.parser')
table_tag = soup.find('table', class_='tableFile2')
rows = table_tag.find_all('tr')
for row in rows:
    cells = row.find_all('td')
    if len(cells) > 3:
        if '2015' in cells[3].text:
            doc_link = 'https://www.sec.gov' + cells[1].a['href']

# Exit if the document link couldn't be found
if doc_link == '':
    print("Couldn't find the document link")
    sys.exit()

# Obtain HTML for the document page
doc_resp = requests.get(doc_link)
doc_str = doc_resp.text

# Find the link to the XBRL instance document ('INS')
xbrl_link = ''
soup = BeautifulSoup(doc_str, 'html.parser')
table_tag = soup.find('table', class_='tableFile', summary='Data Files')
rows = table_tag.find_all('tr')
for row in rows:
    cells = row.find_all('td')
    if len(cells) > 3:
        if 'INS' in cells[3].text:
            xbrl_link = 'https://www.sec.gov' + cells[2].a['href']

# Obtain the XBRL text from the instance document
xbrl_resp = requests.get(xbrl_link)
xbrl_str = xbrl_resp.text

# Find and print stockholders' equity
soup = BeautifulSoup(xbrl_str, 'lxml')
tag_list = soup.find_all()
for tag in tag_list:
    if tag.name == 'us-gaap:stockholdersequity':
        print("Stockholders' equity: " + tag.text)
Run on its own, this script does exactly what I want: it returns the stockholders' equity for a given company (in this case, IBM), and I can then write that value out to an Excel file.
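(The Excel step itself is simple; a sketch using pandas, where the row values and filename are purely illustrative:)

import pandas as pd

# One (cik, value) row per successful lookup; the values here are illustrative.
rows = [('0000051143', '14262000000')]
pd.DataFrame(rows, columns=['cik', 'stockholders_equity']).to_excel(
    'stockholders_equity.xlsx', index=False)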
My two-part question is:
- I took the three relevant columns (CIK, type, and date) from my original metadata table above and wrote them to a list of tuples (I think that's what it's called); it looks like this: [('1009759', 'D', '20190215'), ('1009891', 'D', '20190206'), ...]. How do I take this data, replace the hard-coded values at the start of the script I found, and loop through it efficiently, so that I end up with a list of the desired values for each company, filing, and date? (Roughly as in the sketch after this list.)
- Is there generally a better way to do this? I would think there would be some sort of API or Python package for querying the data I'm interested in. I know there is some high-level information out there for Form 10-Ks and Form 10-Qs, but I am working with Form Ds, which are somewhat obscure. I just want to make sure I'm spending my time on the best possible solution.
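To make the first question concrete, this is roughly the structure I have in mind (a sketch only; get_stockholders_equity is a hypothetical wrapper around the script above, with the three hard-coded values turned into parameters):

def get_stockholders_equity(cik, form_type, dateb):
    # Hypothetical wrapper: the body of the found script, using these
    # parameters instead of the hard-coded values, and returning the
    # parsed value (or None if a link could not be found).
    ...

filings = [('1009759', 'D', '20190215'),
           ('1009891', 'D', '20190206')]  # built from my metadata table

results = []
for cik, form_type, dateb in filings:
    results.append((cik, form_type, dateb,
                    get_stockholders_equity(cik, form_type, dateb)))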
Thanks for the help!