
Trying to export parsed data to a CSV file with Python, and I don't know how to export multiple rows

  •  0
  • quizno  · Tech community  · 6 days ago

    I'm confused about how to get this code to export all of the scraped data into multiple separate rows:

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin
    
    r = requests.get("https://www.infoplease.com/primary-sources/government/presidential-speeches/state-union-addresses")
    data = r.content  # Content of response
    soup = BeautifulSoup(data, "html.parser")
    
    
    for span in soup.find_all("span", {"class": "article"}):
        for link in span.select("a"):
               
            name_and_date = link.text.split('(')
            name = name_and_date[0].strip()
            date = name_and_date[1].replace(')','').strip()
            
            base_url = "https://www.infoplease.com"
            links = link['href']
            links = urljoin(base_url, links)
            
            
        
        pres_data = {'Name': [name],
                    'Date': [date],
                    'Link': [links]
                    }
            
        df = pd.DataFrame(pres_data, columns= ['Name', 'Date', 'Link'])
    
        df.to_csv (r'C:\Users\ThinkPad\Documents\data_file.csv', index = False, header=True)
    
        print (df)
    

    Thanks for any insight.

    1 reply  |  6 days ago
        1
  •  1
  •   j-berg    6 days ago

    The way it's currently set up, it looks like you are not adding each link as a new entry, only the last one. If you initialize a list and append a dictionary like the one you have set up on each iteration of the links loop, you will add every row instead of only the last one.

    import pandas as pd 
    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin
    
    r = requests.get("https://www.infoplease.com/primary-sources/government/presidential-speeches/state-union-addresses")
    data = r.content  # Content of response
    soup = BeautifulSoup(data, "html.parser")
    
    pres_data = []  # one dictionary per link will be appended here
    for span in soup.find_all("span", {"class": "article"}):
        for link in span.select("a"):
            
            name_and_date = link.text.split('(')
            name = name_and_date[0].strip()
            date = name_and_date[1].replace(')','').strip()
            
            base_url = "https://www.infoplease.com"
            links = link['href']
            links = urljoin(base_url, links)
            
            this_data = {'Name': name,
                        'Date': date,
                        'Link': links
                        }
            pres_data.append(this_data)
            
    # Build the DataFrame from the list of row dictionaries (one per link)
    df = pd.DataFrame(pres_data, columns=['Name', 'Date', 'Link'])
    
    df.to_csv(r'C:\Users\ThinkPad\Documents\data_file.csv', index=False, header=True)
    
    print(df)
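
    As a quick illustration of why this works: pd.DataFrame turns a list of dictionaries into one row per dictionary, so appending inside the loop gives one CSV row per link. A minimal sketch with placeholder values (the names, dates, and links below are made up, not real scraped data):

    import pandas as pd

    # Stand-ins for what the scraping loop appends on each iteration
    rows = [
        {'Name': 'Speaker One', 'Date': 'January 1, 1800', 'Link': 'https://www.infoplease.com/example-1'},
        {'Name': 'Speaker Two', 'Date': 'January 2, 1801', 'Link': 'https://www.infoplease.com/example-2'},
    ]

    df = pd.DataFrame(rows, columns=['Name', 'Date', 'Link'])
    print(df)                              # two rows, one per dictionary
    df.to_csv('example.csv', index=False)  # the CSV gets a header line plus one line per row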
    
        2
  •  0
  •   αԋɱҽԃ αмєяιcαη    6 days ago

    Why pandas, since you aren't applying any operations to the data?

    For a short task like this, I usually try to limit myself to the built-in libraries.

    import requests
    from bs4 import BeautifulSoup
    import csv
    
    
    def main(url):
        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'lxml')
        # Each row is [href, name, date]: text[:-1] drops the trailing ')'
        # and ' (' splits the name from the date inside the parentheses
        target = [([x.a['href']] + x.a.text[:-1].split(' ('))
                  for x in soup.select('span.article')]
        with open('data.csv', 'w', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(['Url', 'Name', 'Date'])
            writer.writerows(target)
    
    
    main('https://www.infoplease.com/primary-sources/government/presidential-speeches/state-union-addresses')
    

    Sample output: (screenshot of the generated data.csv not reproduced)
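
    To see what the comprehension produces for a single entry, here is a small standalone sketch built around a made-up anchor tag (the name, date, and href are hypothetical, and html.parser is used here so nothing beyond bs4 is required):

    from bs4 import BeautifulSoup

    # Hypothetical markup in the same shape as the spans on the real page
    html = '<span class="article"><a href="/example">Example Speaker (January 1, 1800)</a></span>'
    x = BeautifulSoup(html, 'html.parser').select_one('span.article')

    # Same expression as in the comprehension above
    row = [x.a['href']] + x.a.text[:-1].split(' (')
    print(row)  # ['/example', 'Example Speaker', 'January 1, 1800']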