JSON 파일 불러오기 :: pd.read_json( )
Reference
Pandas In Action - JSON(Jave Script Object Notation)
- 텍스트 데이터를 저장하고 전송하기 위한 형식
- 키 - 값 쌍으로 구성
- Python의 딕셔너리 객체와 유사
- 린터(Linter)
- 각 키 - 값 쌍을 별도의 줄에 배치하여 JSON 응답을 가독성 있는 형식으로 나타냄
< JSON 파일 불러오기 >
1
2
3
| pd.read_csv(
path_or_buf = None,
)
|
- Option
- path_or_buf : 파일 경로 및 파일 이름
1
2
3
4
| import pandas as pd
nobel = pd.read_json("./Data/nobel.json")
nobel
|
| prizes |
---|
0 | {‘year’: ‘2019’, ‘category’: ‘chemistry’, ‘lau… |
1 | {‘year’: ‘2019’, ‘category’: ‘economics’, ‘lau… |
2 | {‘year’: ‘2019’, ‘category’: ‘literature’, ‘la… |
3 | {‘year’: ‘2019’, ‘category’: ‘peace’, ‘laureat… |
4 | {‘year’: ‘2019’, ‘category’: ‘physics’, ‘overa… |
… | … |
641 | {‘year’: ‘1901’, ‘category’: ‘chemistry’, ‘lau… |
642 | {‘year’: ‘1901’, ‘category’: ‘literature’, ‘la… |
643 | {‘year’: ‘1901’, ‘category’: ‘peace’, ‘laureat… |
644 | {‘year’: ‘1901’, ‘category’: ‘physics’, ‘laure… |
645 | {‘year’: ‘1901’, ‘category’: ‘medicine’, ‘laur… |
646 rows × 1 columns
< Result >
→ prizes
에 중첩된 딕셔너리가 존재
< 평탄화(Flattening)** or **정규화(Normalizing) >
1
2
3
4
5
| pd.json_normalize(
data,
record_path = None,
meta = None,
)
|
- 중첩된 데이터 레코드를 단일 1차원 리스트로 변형하는 과정
- Option
- data : 직렬화되지 않은 JSON 객체
- record_path : 레코드 목록에 대한 각 개체의 경로
- meta : 결과 테이블의 각 레코드에 대한 메타데이터
1
2
| # `prizes`데이터 중 첫번째 최상위 딕셔너리 키(`year`, `category`, `laureates`)를 추출
pd.json_normalize(data = nobel['prizes'][0])
|
| year | category | laureates |
---|
0 | 2019 | chemistry | [{‘id’: ‘976’, ‘firstname’: ‘John’, ‘surname’:… |
< Result >
→ laureates
에 여전히 중첩된 딕셔너리가 존재
1
2
3
4
5
| # 중첩된 `laureates`레코드를 정규화
pd.json_normalize(
data = nobel['prizes'][0],
record_path = 'laureates'
)
|
| id | firstname | surname | motivation | share |
---|
0 | 976 | John | Goodenough | “for the development of lithium-ion batteries” | 3 |
1 | 977 | M. Stanley | Whittingham | “for the development of lithium-ion batteries” | 3 |
2 | 978 | Akira | Yoshino | “for the development of lithium-ion batteries” | 3 |
< Result >
→ 새로운 열로 확장했지만 기존의 year
와 category
열이 사라짐
1
2
3
4
5
6
| # 최상위 키 - 값 쌍을 유지 (`year`, `category`)
pd.json_normalize(
data = nobel['prizes'][0],
record_path = 'laureates',
meta = ['year', 'category']
)
|
| id | firstname | surname | motivation | share | year | category |
---|
0 | 976 | John | Goodenough | “for the development of lithium-ion batteries” | 3 | 2019 | chemistry |
1 | 977 | M. Stanley | Whittingham | “for the development of lithium-ion batteries” | 3 | 2019 | chemistry |
2 | 978 | Akira | Yoshino | “for the development of lithium-ion batteries” | 3 | 2019 | chemistry |
1
2
3
4
5
6
| # Error
pd.json_normalize(
data = nobel['prizes'],
record_path = 'laureates',
meta = ['year', 'category']
)
|
Result
``` --------------------------------------------------------------------------- KeyError Traceback (most recent call last) File ~/anaconda3/envs/torch/lib/python3.11/site-packages/pandas/io/json/_normalize.py:399, in _json_normalize.._pull_field(js, spec, extract_record) 398 else: --> 399 result = result[spec] 400 except KeyError as e: KeyError: 'laureates' The above exception was the direct cause of the following exception: KeyError Traceback (most recent call last) Cell In[6], line 2 1 # Error ----> 2 pd.json_normalize( 3 data = nobel['prizes'], 4 record_path = 'laureates', 5 meta = ['year', 'category'] 6 ) File ~/anaconda3/envs/torch/lib/python3.11/site-packages/pandas/io/json/_normalize.py:518, in _json_normalize(data, record_path, meta, meta_prefix, record_prefix, errors, sep, max_level) 515 meta_vals[key].append(meta_val) 516 records.extend(recs) --> 518 _recursive_extract(data, record_path, {}, level=0) 520 result = DataFrame(records) 522 if record_prefix is not None: File ~/anaconda3/envs/torch/lib/python3.11/site-packages/pandas/io/json/_normalize.py:500, in _json_normalize.._recursive_extract(data, path, seen_meta, level) 498 else: 499 for obj in data: --> 500 recs = _pull_records(obj, path[0]) 501 recs = [ 502 nested_to_record(r, sep=sep, max_level=max_level) 503 if isinstance(r, dict) 504 else r 505 for r in recs 506 ] 508 # For repeating the metadata later File ~/anaconda3/envs/torch/lib/python3.11/site-packages/pandas/io/json/_normalize.py:422, in _json_normalize.._pull_records(js, spec) 416 def _pull_records(js: dict[str, Any], spec: list | str) -> list: 417 """ 418 Internal function to pull field for records, and similar to 419 _pull_field, but require to return list. And will raise error 420 if has non iterable value. 421 """ --> 422 result = _pull_field(js, spec, extract_record=True) 424 # GH 31507 GH 30145, GH 26284 if result is not list, raise TypeError if not 425 # null, otherwise return an empty list 426 if not isinstance(result, list): File ~/anaconda3/envs/torch/lib/python3.11/site-packages/pandas/io/json/_normalize.py:402, in _json_normalize.._pull_field(js, spec, extract_record) 400 except KeyError as e: 401 if extract_record: --> 402 raise KeyError( 403 f"Key {e} not found. If specifying a record_path, all elements of " 404 f"data should have the path." 405 ) from e 406 elif errors == "ignore": 407 return np.nan KeyError: "Key 'laureates' not found. If specifying a record_path, all elements of data should have the path." ``` </details> **< Result >** → Error : prizes Series에 있는 딕셔너리 중 일부는 `laureates`라는 키가 없기 때문.. ```python dictionary.setdefault( key, value ) ``` - 딕셔너리 키에 대한 기본 값을 할당 - 딕셔너리에 키가 없는 경우에는 키 - 값 쌍을 할당 - 딕셔너리에 키가 있는 경우 기존 값을 반환 ```python def add_laureates_key(entry): entry.setdefault('laureates', []) # prizes에 있는 딕셔너리 자체를 변경하므로 기존의 Series를 덮어쓸 필요가 없음 nobel['prizes'].apply(add_laureates_key) ``` ``` 0 None 1 None 2 None 3 None 4 None ... 641 None 642 None 643 None 644 None 645 None Name: prizes, Length: 646, dtype: object ``` ```python # 완성된 JSON 파일 불러오기 winners = pd.json_normalize( data = nobel['prizes'], record_path = 'laureates', meta = ['year', 'category'] ) winners ``` | | id | firstname | surname | motivation | share | year | category | | ------: | ---: | -------------: | ----------: | ------------------------------------------------: | ----: | ---: | ---------: | | **0** | 976 | John | Goodenough | "for the development of lithium-ion batteries" | 3 | 2019 | chemistry | | **1** | 977 | M. Stanley | Whittingham | "for the development of lithium-ion batteries" | 3 | 2019 | chemistry | | **2** | 978 | Akira | Yoshino | "for the development of lithium-ion batteries" | 3 | 2019 | chemistry | | **3** | 982 | Abhijit | Banerjee | "for their experimental approach to alleviatin... | 3 | 2019 | economics | | **4** | 983 | Esther | Duflo | "for their experimental approach to alleviatin... | 3 | 2019 | economics | | **...** | ... | ... | ... | ... | ... | ... | ... | | **945** | 569 | Sully | Prudhomme | "in special recognition of his poetic composit... | 1 | 1901 | literature | | **946** | 462 | Henry | Dunant | "for his humanitarian efforts to help wounded ... | 2 | 1901 | peace | | **947** | 463 | Frédéric | Passy | "for his lifelong work for international peace... | 2 | 1901 | peace | | **948** | 1 | Wilhelm Conrad | Röntgen | "in recognition of the extraordinary services ... | 1 | 1901 | physics | | **949** | 293 | Emil | von Behring | "for his work on serum therapy, especially its... | 1 | 1901 | medicine | 950 rows × 7 columns