[Python] Loading a JSON File :: pd.read_json( )


Loading a JSON File :: pd.read_json( )

Reference

Pandas In Action

JSON (JavaScript Object Notation)
- A format for storing and transmitting text data
- Composed of key-value pairs
- Similar to Python's dictionary object
Linter
- Formats a JSON response for readability by placing each key-value pair on its own line
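The similarity between JSON and a Python dictionary, and the effect of linter-style formatting, can be sketched with the standard `json` module. The `raw` string below is a made-up response (its values are illustrative, not taken from the Nobel data set), and `json.dumps(..., indent=4)` only stands in for what a linter does.

```python
import json

# Hypothetical one-line JSON response (made-up values for illustration)
raw = '{"year": "2019", "category": "chemistry", "laureates": [{"id": "1", "firstname": "Ada"}]}'

# json.loads parses JSON text into Python objects; a JSON object becomes a dict
parsed = json.loads(raw)
print(type(parsed).__name__)
print(parsed["category"])

# Re-serializing with indent puts each key-value pair on its own line,
# similar to what a linter produces
pretty = json.dumps(parsed, indent=4)
print(pretty)
```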

< Loading a JSON File >

pd.read_json(
    path_or_buf = None,
)

Option
- path_or_buf : path to the JSON file, including the file name
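For readers who do not have `nobel.json` at hand, here is a minimal sketch that writes a tiny file with the same overall shape and reads it back. The file name `sample_nobel.json` and every value in `sample` are assumptions made up for this example.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Made-up data shaped like the Nobel file: one top-level key holding a list of dicts
sample = {
    "prizes": [
        {"year": "2019", "category": "chemistry",
         "laureates": [{"id": "1", "firstname": "Ada"}]},
        {"year": "2019", "category": "peace",
         "laureates": [{"id": "2", "firstname": "Grace"}]},
    ]
}

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name, used only for this sketch
    path = Path(tmp) / "sample_nobel.json"
    path.write_text(json.dumps(sample), encoding="utf-8")

    # path_or_buf accepts a path to the file
    sample_df = pd.read_json(path)

print(sample_df.shape)
print(type(sample_df.loc[0, "prizes"]).__name__)
```

Each cell of the single `prizes` column holds one dictionary, which is the same situation the Nobel data produces.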

import pandas as pd

nobel = pd.read_json("./Data/nobel.json")
nobel

	prizes
0	{'year': '2019', 'category': 'chemistry', 'lau...
1	{'year': '2019', 'category': 'economics', 'lau...
2	{'year': '2019', 'category': 'literature', 'la...
3	{'year': '2019', 'category': 'peace', 'laureat...
4	{'year': '2019', 'category': 'physics', 'overa...
...	...
641	{'year': '1901', 'category': 'chemistry', 'lau...
642	{'year': '1901', 'category': 'literature', 'la...
643	{'year': '1901', 'category': 'peace', 'laureat...
644	{'year': '1901', 'category': 'physics', 'laure...
645	{'year': '1901', 'category': 'medicine', 'laur...

646 rows × 1 columns

< Result >
→ Each value in the prizes column is a nested dictionary

< Flattening or Normalizing >

pd.json_normalize(
    data,
    record_path = None,
    meta = None,
)

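The process of converting nested data records into a single one-dimensional list
Option
- data : a deserialized JSON object (a dictionary or a list of dictionaries)
- record_path : the path within each object to its list of records
- meta : fields to attach as metadata to each record in the resulting table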
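As a small illustration of what flattening means, the sketch below normalizes one made-up record that contains a nested dictionary. The record and its keys (`winner`, `firstname`, `surname`) are assumptions for this example only. With no `record_path` given, `pd.json_normalize` expands nested dictionaries into columns whose names are joined with a dot.

```python
import pandas as pd

# Made-up record with one nested dictionary under 'winner'
record = {
    "year": "2019",
    "winner": {"firstname": "Ada", "surname": "Example"},
}

# Nested dictionary keys become flat, dot-separated column names
flat = pd.json_normalize(record)
print(flat.columns.tolist())
print(flat.shape)
```

Lists of dictionaries are not expanded this way, which is why `record_path` is needed for them.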

# Extract the top-level dictionary keys (`year`, `category`, `laureates`) of the first entry in `prizes`
pd.json_normalize(data = nobel['prizes'][0])

	year	category	laureates
0	2019	chemistry	[{'id': '976', 'firstname': 'John', 'surname':...

< Result >
→ The laureates column still contains nested dictionaries

# Normalize the nested `laureates` records
pd.json_normalize(
    data = nobel['prizes'][0],
    record_path = 'laureates'
)

	id	firstname	surname	motivation	share
0	976	John	Goodenough	“for the development of lithium-ion batteries”	3
1	977	M. Stanley	Whittingham	“for the development of lithium-ion batteries”	3
2	978	Akira	Yoshino	“for the development of lithium-ion batteries”	3

< Result >
→ The records are expanded into new columns, but the original year and category columns are lost

# Keep the top-level key-value pairs (`year`, `category`)
pd.json_normalize(
    data = nobel['prizes'][0],
    record_path = 'laureates',
    meta = ['year', 'category']
)

	id	firstname	surname	motivation	share	year	category
0	976	John	Goodenough	“for the development of lithium-ion batteries”	3	2019	chemistry
1	977	M. Stanley	Whittingham	“for the development of lithium-ion batteries”	3	2019	chemistry
2	978	Akira	Yoshino	“for the development of lithium-ion batteries”	3	2019	chemistry

# Error
pd.json_normalize(
    data = nobel['prizes'],
    record_path = 'laureates',
    meta = ['year', 'category']
)

Result

```
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/anaconda3/envs/torch/lib/python3.11/site-packages/pandas/io/json/_normalize.py:399, in _json_normalize.<locals>._pull_field(js, spec, extract_record)
    398 else:
--> 399     result = result[spec]
    400 except KeyError as e:

KeyError: 'laureates'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[6], line 2
      1 # Error
----> 2 pd.json_normalize(
      3     data = nobel['prizes'],
      4     record_path = 'laureates',
      5     meta = ['year', 'category']
      6 )

File ~/anaconda3/envs/torch/lib/python3.11/site-packages/pandas/io/json/_normalize.py:518, in _json_normalize(data, record_path, meta, meta_prefix, record_prefix, errors, sep, max_level)
    515             meta_vals[key].append(meta_val)
    516         records.extend(recs)
--> 518 _recursive_extract(data, record_path, {}, level=0)
    520 result = DataFrame(records)
    522 if record_prefix is not None:

File ~/anaconda3/envs/torch/lib/python3.11/site-packages/pandas/io/json/_normalize.py:500, in _json_normalize.<locals>._recursive_extract(data, path, seen_meta, level)
    498 else:
    499     for obj in data:
--> 500         recs = _pull_records(obj, path[0])
    501         recs = [
    502             nested_to_record(r, sep=sep, max_level=max_level)
    503             if isinstance(r, dict)
    504             else r
    505             for r in recs
    506         ]
    508 # For repeating the metadata later

File ~/anaconda3/envs/torch/lib/python3.11/site-packages/pandas/io/json/_normalize.py:422, in _json_normalize.<locals>._pull_records(js, spec)
    416 def _pull_records(js: dict[str, Any], spec: list | str) -> list:
    417     """
    418     Internal function to pull field for records, and similar to
    419     _pull_field, but require to return list. And will raise error
    420     if has non iterable value.
    421     """
--> 422     result = _pull_field(js, spec, extract_record=True)
    424     # GH 31507 GH 30145, GH 26284 if result is not list, raise TypeError if not
    425     # null, otherwise return an empty list
    426     if not isinstance(result, list):

File ~/anaconda3/envs/torch/lib/python3.11/site-packages/pandas/io/json/_normalize.py:402, in _json_normalize.<locals>._pull_field(js, spec, extract_record)
    400 except KeyError as e:
    401     if extract_record:
--> 402         raise KeyError(
    403             f"Key {e} not found. If specifying a record_path, all elements of "
    404             f"data should have the path."
    405         ) from e
    406 elif errors == "ignore":
    407     return np.nan

KeyError: "Key 'laureates' not found. If specifying a record_path, all elements of data should have the path."
```

< Result >
→ Error : some of the dictionaries in the prizes Series have no `laureates` key.

```python
dictionary.setdefault(
    key,
    value
)
```

- Assigns a default value for a dictionary key
- If the key is not in the dictionary, inserts the key-value pair and returns the new value
- If the key is already in the dictionary, returns the existing value and leaves it unchanged

```python
def add_laureates_key(entry):
    # Give entries without a 'laureates' key an empty list
    entry.setdefault('laureates', [])

# This modifies the dictionaries inside prizes in place, so the existing Series does not need to be overwritten
# (apply shows None for every row because the function returns nothing)
nobel['prizes'].apply(add_laureates_key)
```

```
0      None
1      None
2      None
3      None
4      None
       ...
641    None
642    None
643    None
644    None
645    None
Name: prizes, Length: 646, dtype: object
```

```python
# Normalize the completed JSON data
winners = pd.json_normalize(
    data = nobel['prizes'],
    record_path = 'laureates',
    meta = ['year', 'category']
)
winners
```

|         |   id |      firstname |     surname |                                        motivation | share | year |   category |
| ------: | ---: | -------------: | ----------: | ------------------------------------------------: | ----: | ---: | ---------: |
| **0**   |  976 |           John |  Goodenough |     "for the development of lithium-ion batteries" |     3 | 2019 |  chemistry |
| **1**   |  977 |     M. Stanley | Whittingham |     "for the development of lithium-ion batteries" |     3 | 2019 |  chemistry |
| **2**   |  978 |          Akira |     Yoshino |     "for the development of lithium-ion batteries" |     3 | 2019 |  chemistry |
| **3**   |  982 |        Abhijit |    Banerjee | "for their experimental approach to alleviatin... |     3 | 2019 |  economics |
| **4**   |  983 |         Esther |       Duflo | "for their experimental approach to alleviatin... |     3 | 2019 |  economics |
| **...** |  ... |            ... |         ... |                                               ... |   ... |  ... |        ... |
| **945** |  569 |          Sully |   Prudhomme | "in special recognition of his poetic composit... |     1 | 1901 | literature |
| **946** |  462 |          Henry |      Dunant | "for his humanitarian efforts to help wounded ... |     2 | 1901 |      peace |
| **947** |  463 |       Frédéric |       Passy | "for his lifelong work for international peace... |     2 | 1901 |      peace |
| **948** |    1 | Wilhelm Conrad |     Röntgen | "in recognition of the extraordinary services ... |     1 | 1901 |    physics |
| **949** |  293 |           Emil | von Behring | "for his work on serum therapy, especially its... |     1 | 1901 |   medicine |

950 rows × 7 columns
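The whole fix can be reproduced on a small scale without the Nobel file. The sketch below uses a made-up list `entries` (all names and values are assumptions for illustration) in which one entry lacks the `laureates` key. It passes a plain list rather than a Series to keep the example simple.

```python
import pandas as pd

# Made-up entries; the second one has no 'laureates' key
entries = [
    {"year": "2019", "category": "chemistry",
     "laureates": [{"id": "1", "firstname": "Ada"},
                   {"id": "2", "firstname": "Grace"}]},
    {"year": "2019", "category": "peace"},
]

# Find out which entries are missing the key before normalizing
has_key = ["laureates" in entry for entry in entries]
print(has_key)

# Normalizing now fails for the same reason as in the traceback above
try:
    pd.json_normalize(entries, record_path="laureates", meta=["year", "category"])
    failed = False
except KeyError as err:
    failed = True
    print("KeyError:", err)

# setdefault returns the existing list when the key is present ...
existing = entries[0].setdefault("laureates", [])
# ... and inserts, then returns, the default when the key is missing
inserted = entries[1].setdefault("laureates", [])

# Every entry now has the path, so normalization succeeds
winners_sample = pd.json_normalize(
    entries,
    record_path="laureates",
    meta=["year", "category"],
)
print(winners_sample)
```

An entry whose `laureates` list is empty contributes no rows, so only the two laureates from the first entry appear in the result.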


Nada
