How to Use Beautiful Soup to Scrape SEC's Edgar Database and Receive Desired Data

Apologies in advance for the long question; I am new to Python and am trying to be as explicit as I can about a fairly specific situation.

I am trying to routinely pull specific data points from SEC filings, but I want to automate the process rather than manually searching by a company's CIK ID and form filing. So far, I have gotten to the point where I can download the metadata for all filings the SEC received in a given period. It looks like this:

   index  cik      conm                    type  date        path
0  0      1000045  NICHOLAS FINANCIAL INC  10-Q  2019-02-14  edgar/data/1000045/0001193125-19-039489.txt
1  1      1000045  NICHOLAS FINANCIAL INC  4     2019-01-15  edgar/data/1000045/0001357521-19-000001.txt
2  2      1000045  NICHOLAS FINANCIAL INC  4     2019-02-19  edgar/data/1000045/0001357521-19-000002.txt
3  3      1000045  NICHOLAS FINANCIAL INC  4     2019-03-15  edgar/data/1000045/0001357521-19-000003.txt
4  4      1000045  NICHOLAS FINANCIAL INC  8-K   2019-02-01  edgar/data/1000045/0001193125-19-024617.txt
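
For reference, a metadata table like the one above can be built from EDGAR's quarterly master index; a minimal sketch, assuming the pipe-delimited master.idx layout (the number of preamble lines to skip and the User-Agent value may need adjusting):

import io
import pandas as pd
import requests

# EDGAR publishes a pipe-delimited master index of all filings per quarter
idx_url = 'https://www.sec.gov/Archives/edgar/full-index/2019/QTR1/master.idx'
resp = requests.get(idx_url, headers={'User-Agent': 'your-name your-email@example.com'})

# Skip the preamble above the CIK|Company Name|Form Type|Date Filed|Filename
# header; the exact number of lines may need adjusting
sec = pd.read_csv(io.StringIO(resp.text), sep='|', skiprows=11,
                  names=['cik', 'conm', 'type', 'date', 'path'])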

Despite having all of this information, and despite being able to download these text files and see the underlying data, I am unable to parse the data because it is in XBRL format and a bit too far out of my wheelhouse. Instead I found this script (from this site: https://www.codeproject.com/artss/122765/Parsing-XBRL-with-Python):

from bs4 import BeautifulSoup
import requests
import sys

# Access page
cik = '0000051143'
type = '10-K'
dateb = '20160101'

# Obtain HTML for search page
base_url = "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={}&type={}&dateb={}"
edgar_resp = requests.get(base_url.format(cik, type, dateb))
edgar_str = edgar_resp.text

# Find the document link
doc_link = ''
soup = BeautifulSoup(edgar_str, 'html.parser')
table_tag = soup.find('table', class_='tableFile2')
rows = table_tag.find_all('tr')
for row in rows:
    cells = row.find_all('td')
    if len(cells) > 3:
        if '2015' in cells[3].text:
            doc_link = 'https://www.sec.gov' + cells[1].a['href']

# Exit if document link couldn't be found
if doc_link == '':
    print("Couldn't find the document link")
    sys.exit()

# Obtain HTML for document page
doc_resp = requests.get(doc_link)
doc_str = doc_resp.text

# Find the XBRL link
xbrl_link = ''
soup = BeautifulSoup(doc_str, 'html.parser')
table_tag = soup.find('table', class_='tableFile', summary='Data Files')
rows = table_tag.find_all('tr')
for row in rows:
    cells = row.find_all('td')
    if len(cells) > 3:
        if 'INS' in cells[3].text:
            xbrl_link = 'https://www.sec.gov' + cells[2].a['href']

# Obtain XBRL text from document
xbrl_resp = requests.get(xbrl_link)
xbrl_str = xbrl_resp.text

# Find and print stockholder's equity
soup = BeautifulSoup(xbrl_str, 'lxml')
tag_list = soup.find_all()
for tag in tag_list:
    if tag.name == 'us-gaap:stockholdersequity':
        print("Stockholder's equity: " + tag.text)

When I run this script on its own, I love it. It returns the stockholders' equity for a given company (in this case IBM), and I can then write that value to an Excel file.
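
For the Excel step mentioned above, a minimal sketch using pandas (the file name and column labels are assumptions, and to_excel needs openpyxl installed):

import pandas as pd

# Dummy row standing in for the values printed by the script above
rows = [('0000051143', '10-K', '20160101', 'stockholders-equity-value')]
df = pd.DataFrame(rows, columns=['cik', 'type', 'dateb', 'stockholders_equity'])
df.to_excel('equity_values.xlsx', index=False)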

My question has two parts:

  1. I took the three relevant columns (CIK, type, and date) from my original metadata table above and wrote them to a list of tuples - I think that's what it's called - which looks like this: [('1009759', 'D', '20190215'), ('1009891', 'D', '20190206'), ...]. How do I take this data, substitute it into the initial part of the script I found, and loop through it efficiently so I end up with a list of the desired values for each company, filing, and date?
  2. Is there generally a better way to do this? I would think there would be some sort of API or Python package for querying the data I am interested in. I know there is some high-level information out there for Form 10-Ks and Form 10-Qs, but I am working with Form Ds, which is somewhat obscure. I just want to make sure I am spending my time effectively on the best possible solution.

Thanks for your help!

Answers

You need to define a function that contains essentially the bulk of the code you posted and takes three keyword arguments (your three values). Then, instead of hard-coding the three values in the script, you just pass them in and return the result.

Then take the list you created and write a simple for-loop around it that calls the function you defined with those three values, and then do something with the result.

def get_data(value1, value2, value3):
    # your main code here, but with the hard-coded cik/type/dateb
    # replaced by the three arguments above
    return content

for value1, value2, value3 in companies:
    content = get_data(value1, value2, value3)
    # do something with content
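
To make that concrete, here is one possible shape for that function, reusing the scraping steps from the script in the question; the function name, the year filter argument, and returning only the first matching tag are assumptions, not part of the original answer:

import requests
from bs4 import BeautifulSoup

def get_stockholders_equity(cik, form_type, dateb, year='2015'):
    # Search EDGAR for the company's filings of the given type
    base_url = ("https://www.sec.gov/cgi-bin/browse-edgar?"
                "action=getcompany&CIK={}&type={}&dateb={}")
    edgar_str = requests.get(base_url.format(cik, form_type, dateb)).text

    # Locate the filing's document page in the search results table
    soup = BeautifulSoup(edgar_str, 'html.parser')
    doc_link = ''
    for row in soup.find('table', class_='tableFile2').find_all('tr'):
        cells = row.find_all('td')
        if len(cells) > 3 and year in cells[3].text:
            doc_link = 'https://www.sec.gov' + cells[1].a['href']
    if not doc_link:
        return None

    # Locate the XBRL instance document on the filing page
    soup = BeautifulSoup(requests.get(doc_link).text, 'html.parser')
    xbrl_link = ''
    for row in soup.find('table', class_='tableFile', summary='Data Files').find_all('tr'):
        cells = row.find_all('td')
        if len(cells) > 3 and 'INS' in cells[3].text:
            xbrl_link = 'https://www.sec.gov' + cells[2].a['href']
    if not xbrl_link:
        return None

    # Pull the first tagged stockholders' equity value out of the XBRL
    soup = BeautifulSoup(requests.get(xbrl_link).text, 'lxml')
    tag = soup.find('us-gaap:stockholdersequity')
    return tag.text if tag else None

# Loop over the (cik, type, dateb) tuples from the question
filings = [('1009759', 'D', '20190215'), ('1009891', 'D', '20190206')]
results = [(c, t, d, get_stockholders_equity(c, t, d)) for c, t, d in filings]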

Assuming you have a DataFrame sec containing the filing list above with correctly named columns, you first need to extract the relevant information from the DataFrame into three lists:

cik = list(sec['cik'].values)
dat = list(sec['date'].values)
typ = list(sec['type'].values)

Then create your base_url with those items inserted and fetch your data:

for c, t, d in zip(cik, typ, dat):
  base_url = f"https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={c}&type={t}&dateb={d}"
  edgar_resp = requests.get(base_url)

And go from there.
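
For example, one way to continue from that loop is to collect the parsed values back into a DataFrame; parse_filing below is a hypothetical placeholder for whatever parsing you do with each response, and the column names are assumptions:

import pandas as pd
import requests

# Continues from the cik/typ/dat lists extracted above
records = []
for c, t, d in zip(cik, typ, dat):
    base_url = f"https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={c}&type={t}&dateb={d}"
    edgar_resp = requests.get(base_url)
    value = parse_filing(edgar_resp.text)   # placeholder for your own parsing logic
    records.append({'cik': c, 'type': t, 'date': d, 'value': value})

results = pd.DataFrame(records)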




